Security, Compliance, and Common Mistakes
1. Security & Access Control
Scope DR Plans to the Correct Project
DRaaS plans are scoped to a project in your account. Use projects to isolate DR resources:
- Create separate projects for production vs. staging environments
- Restrict who can manage DR plans for production projects
- Avoid creating DR plans for staging environments under the same project as production
Protect DR Plan Operations with Access Control
The ability to trigger a recovery is one of the most consequential operations in your infrastructure. Ensure that only authorized personnel can:
- Start a recovery
- Delete a DR plan
- Start or stop a DR drill
Audit logs capture who performed each action and when. Review them periodically to confirm only expected users are making DR changes.
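A periodic review like this can be partly automated. The sketch below flags DR-plan actions taken by anyone outside an approved operator list; the event field names (`actor`, `action`, `timestamp`) and the action names are assumptions, so adapt them to your provider's actual audit-log export format.

```python
# Sketch: flag DR-plan audit events performed by users outside an approved list.
# Field names and action names are assumptions -- match them to the real
# audit-log export schema of your DRaaS provider.
APPROVED_DR_OPERATORS = {"alice@example.com", "bob@example.com"}

DR_ACTIONS = {"recovery_start", "plan_delete", "drill_start", "drill_stop"}

def unexpected_dr_events(events):
    """Return audit events where a DR action was taken by a non-approved user."""
    return [
        e for e in events
        if e["action"] in DR_ACTIONS and e["actor"] not in APPROVED_DR_OPERATORS
    ]

sample_events = [
    {"actor": "alice@example.com", "action": "drill_start",
     "timestamp": "2024-05-01T10:00:00Z"},
    {"actor": "mallory@example.com", "action": "plan_delete",
     "timestamp": "2024-05-02T03:12:00Z"},
]

for e in unexpected_dr_events(sample_events):
    print(f"REVIEW: {e['actor']} performed {e['action']} at {e['timestamp']}")
```

Running this against each monthly export turns "review them periodically" into a concrete, repeatable check.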
Secure Target VM Access Credentials
The target VM uses the same credentials as your source VM (same OS image, same SSH key). Ensure:
- SSH keys are stored securely
- The target VM's expected IP is documented for use during recovery
- Access credentials are stored somewhere independent of the source region (so a single regional outage cannot take down both your source VM and your credential store at once)
Treat the Recovered VM as a New Instance
After a recovery, the recovered VM becomes your primary system. Harden it as you would any production system:
- Rotate SSH keys if they may have been compromised
- Update any secrets or API keys that reference the old VM's IP
- Review security group rules — ensure only expected ports are open
2. Compliance & Governance
Maintain a DR Policy Document
Many compliance frameworks (ISO 27001, SOC 2, RBI guidelines) require a formal Business Continuity and Disaster Recovery (BCDR) policy. Your DRaaS configuration should be documented in this policy, including:
- Which systems are protected and at what RPO/RTO
- The retention period for recovery points and the justification
- The drill schedule and the process for documenting results
- The escalation path and decision criteria for declaring a recovery
Use Audit Logs for Compliance Evidence
The DRaaS audit log captures a complete record of all plan creation, modification, and operational events. Export these logs regularly (monthly at minimum) and retain them according to your compliance requirements.
Audit evidence to collect:
- Proof that DR plans exist for critical systems
- Proof that drills were conducted (drill start/stop events)
- Proof that RPO/retention settings were deliberately configured and maintained
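To make the "export regularly" step routine, the sketch below archives an audit-log export under a dated filename suitable for retention. How you obtain the events (the export endpoint, authentication) is provider-specific and not shown; only the archival step is illustrated.

```python
# Sketch: archive a monthly audit-log export under a retention-friendly name.
# Fetching the events from the provider's export endpoint is out of scope here;
# this only shows the evidence-archival step.
import datetime
import json
import pathlib

def archive_audit_log(events, archive_dir="audit-archive"):
    """Write this month's events to a dated JSON file for compliance evidence."""
    path = pathlib.Path(archive_dir)
    path.mkdir(exist_ok=True)
    stamp = datetime.date.today().strftime("%Y-%m")
    out = path / f"draas-audit-{stamp}.json"
    out.write_text(json.dumps(events, indent=2))
    return out

# events would come from your provider's audit-log export
archived = archive_audit_log([{"action": "drill_start",
                               "actor": "alice@example.com"}])
print(f"archived to {archived}")
```

Dated filenames make it easy to demonstrate continuous monthly collection to an auditor.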
Enforce Minimum Drill Frequency
For compliance frameworks that require periodic testing:
| Framework | Typical Requirement |
|---|---|
| ISO 27001 | At least annual testing of BCP/DRP |
| SOC 2 | Testing aligned with recovery objectives |
| RBI IT Framework | Annual DR drill minimum |
| DPDP Act (India) | Appropriate technical safeguards including recovery capability |
Schedule drills on a calendar ahead of time so they are not overlooked. Assign responsibility to a specific team member.
Review RPO/RTO Against Regulatory Requirements
Some regulations specify maximum acceptable data loss (RPO) or downtime (RTO) for certain data categories. Verify that your configured RPO and RTO align with the regulatory requirements that apply to your workloads.
3. Common Mistakes to Avoid
Mistake 1: Creating a DR Plan Without Testing It
Many teams create a DR plan and assume it works. It may not. A misconfigured recovery point, an OS image that differs slightly at the target, or a missing volume can all prevent a successful recovery.
Best practice: Create the plan, wait for the first few recovery points to complete, then immediately run a DR drill to confirm end-to-end recovery works.
Mistake 2: Not Including All Required Volumes
Creating a DR plan without including a critical data volume means that volume will not be replicated. The plan will appear healthy, but a recovery will leave you without that data.
Best practice: List all volumes your application depends on when creating the DR plan. Review the plan details to confirm all volumes appear in the target mapping.
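The review can be reduced to a set comparison. In this sketch both volume lists are placeholders; in practice you would take the attached volumes from the VM's details and the plan volumes from the DR plan's target mapping.

```python
# Sketch: cross-check volumes attached to the source VM against volumes listed
# in the DR plan's target mapping. Both sets are placeholder values -- fetch
# the real lists from your console or plan details.
attached_volumes = {"root-vol", "data-vol-1", "data-vol-2"}
plan_volumes = {"root-vol", "data-vol-1"}

missing = attached_volumes - plan_volumes
if missing:
    print(f"WARNING: not replicated by the DR plan: {sorted(missing)}")
```

Any volume in the `missing` set would be silently absent after a recovery, which is exactly the failure mode described above.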
Mistake 3: Choosing an RPO Without Knowing Your Data Change Rate
Setting a 1-hour RPO on a VM that barely changes wastes money. Setting a 24-hour RPO on a high-transaction database creates unacceptable risk.
Best practice: Review the size of your first few recovery points. If they are very small, your data change rate is low and you can likely increase the RPO interval. Adjust based on actual data, not assumptions.
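As a rough illustration of "adjust based on actual data," the sketch below turns recent recovery-point sizes into a change rate per hour and flags RPO intervals that look mismatched. The thresholds are illustrative assumptions, not provider recommendations.

```python
# Sketch: estimate data change rate from recent recovery-point sizes and flag
# whether the configured RPO interval looks mismatched. The 0.1 and 10 GB/hour
# thresholds are illustrative assumptions only.
def suggest_rpo(recovery_point_sizes_gb, rpo_hours):
    """Rough heuristic: average GB changed per hour across recent recovery points."""
    avg_gb = sum(recovery_point_sizes_gb) / len(recovery_point_sizes_gb)
    gb_per_hour = avg_gb / rpo_hours
    if gb_per_hour < 0.1:
        return "low change rate -- consider a longer RPO interval"
    if gb_per_hour > 10:
        return "high change rate -- consider a shorter RPO interval"
    return "change rate looks proportionate to the configured RPO"

# A VM whose hourly recovery points are tiny is a candidate for a longer RPO.
print(suggest_rpo([0.05, 0.08, 0.06], rpo_hours=1))
```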
Mistake 4: Not Having the Target VM's IP Documented Before a Disaster
During a real regional failure, the E2E console for the source region may be inaccessible. If you do not know your target VM's IP address in advance, you cannot connect to it after recovery.
Best practice: Retrieve the target VM's IP from the plan details endpoint while the source region is healthy, and store it in your runbook alongside other recovery information.
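A small helper can make this part of your runbook maintenance. In this sketch the response shape (`target_vm`, `ip_address`) is a hypothetical assumption; match it to the actual schema returned by your provider's plan-details endpoint, and replace the sample dict with the real API call.

```python
# Sketch: pull the target VM's IP out of a plan-details response and write it
# to a local runbook file. The "target_vm"/"ip_address" field names are
# hypothetical -- adapt them to the real plan-details response schema.
def record_target_ip(plan_details, runbook_path="runbook-target-ip.txt"):
    """Extract the target IP and persist it for use if the source region is down."""
    target_ip = plan_details["target_vm"]["ip_address"]  # hypothetical fields
    with open(runbook_path, "w") as fh:
        fh.write(f"Target VM IP (recorded while source region was healthy): "
                 f"{target_ip}\n")
    return target_ip

# Example response shape -- replace with the JSON body of the real API call.
sample_details = {"target_vm": {"ip_address": "203.0.113.42"}}
print(record_target_ip(sample_details))
```

Store the resulting file with the rest of your out-of-band recovery documentation, not only on infrastructure in the source region.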
Mistake 5: Failing Over Without a Checklist
Under the stress of a real outage, it is easy to forget steps like updating DNS, restarting application services in the right order, or notifying customers.
Best practice: Maintain a written post-recovery checklist (for example, following the guidance in the Disaster Recovery Drills and Recovery Runbook) and store it somewhere accessible outside the E2E platform.
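One way to keep the checklist actionable under stress is to encode it as an ordered list that an operator walks through step by step. The step names below are illustrative examples, not a complete checklist for any particular application.

```python
# Sketch: a minimal post-recovery checklist runner. Steps are walked in order
# and unconfirmed steps are reported. The step names are illustrative only.
POST_RECOVERY_STEPS = [
    "Verify target VM is reachable over SSH",
    "Check application data integrity",
    "Start application services in dependency order",
    "Update DNS to point at the target VM IP",
    "Notify stakeholders that failover is complete",
]

def run_checklist(steps, confirm):
    """Call confirm(step) for each step; return the steps that were not confirmed."""
    return [s for s in steps if not confirm(s)]

# In a real outage, confirm would prompt a human operator; here we auto-confirm.
remaining = run_checklist(POST_RECOVERY_STEPS, confirm=lambda step: True)
print("all steps complete" if not remaining else f"incomplete: {remaining}")
```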
Mistake 6: Not Protecting New Volumes Added After Plan Creation
Existing plans do not automatically include newly attached volumes. A volume added to the source VM after plan creation will not be replicated, even though the plan still appears healthy.
Best practice: Whenever you attach a new volume to a protected VM, confirm the target VM in the attachment pop-up so the volume is brought into the plan's replication scope.
Mistake 7: Treating Recovery as an Automated Process That Needs No Human Oversight
DRaaS automates the infrastructure recovery — the target VM powers on with your data restored. However, application recovery (starting services, checking data integrity, updating DNS, notifying teams) is not automated and requires human action.
Best practice: Do not assume that a successful recovery means your service is restored. Work through your post-recovery checklist for every recovery, drill or real.