Disaster Recovery Drills and Recovery Runbook
1. DR Drills — Test Regularly
A DR plan that has never been tested is a plan you cannot trust. Many organizations discover DR issues only when an actual disaster occurs — at exactly the wrong moment.
Recommended Drill Frequency
| Workload Tier | Minimum Drill Frequency |
|---|---|
| Critical | Every 2 weeks |
| Important | Monthly |
| Standard | Every 6 months |
| Low | Optional; only if the environment is used in real production scenarios |
What to Validate During a Drill
Do not just verify that the target VM powers on. Run through a comprehensive checklist (a scripted sketch of the key checks follows the list):
Infrastructure checks:
- Target VM is accessible via SSH or console
- All expected volumes are attached and mounted correctly
- Networking is functional (can reach the internet / internal services)
- Security groups are correctly applied
Application checks:
- Application services start correctly
- Application can read from and write to the database
- External dependencies (APIs, queues, DNS) resolve correctly
- Application logs show no critical errors on startup
Data integrity checks:
- Spot-check a recent known record or transaction that should be in the recovery point
- Verify file system integrity (no corruption warnings on mount)
- Check application-specific data consistency (e.g., pending order counts, session data)
RTO verification:
- Record the time from "start drill" to "application functional" — is it within your business's acceptable recovery window?
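Much of this checklist can be scripted so every drill runs the same checks and the measured RTO stays comparable between drills. Below is a minimal Python sketch, assuming a Linux target VM reachable over SSH and a hypothetical `/healthz` application endpoint; the IP address, mount points, and URL are placeholders to adapt to your own DR plan.

```python
#!/usr/bin/env python3
"""Minimal drill-validation sketch.

All values below (target IP, SSH user, expected mount points, health URL)
are illustrative assumptions; adapt them to your own DR plan."""

import socket
import subprocess
import time
import urllib.request

TARGET_IP = "203.0.113.10"                        # assumed target VM IP from the DR plan details
EXPECTED_MOUNTS = ["/", "/data"]                  # assumed mount points for attached volumes
HEALTH_URL = f"http://{TARGET_IP}:8080/healthz"   # hypothetical application health endpoint


def ssh_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Infrastructure check: can we open a TCP connection to the SSH port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def mounts_present(host: str, mounts: list[str]) -> bool:
    """Infrastructure check: are the expected volumes mounted on the target VM?
    Uses `findmnt` over SSH and assumes key-based SSH access is already set up."""
    for mount in mounts:
        result = subprocess.run(
            ["ssh", f"root@{host}", "findmnt", "--target", mount],
            capture_output=True, timeout=30,
        )
        if result.returncode != 0:
            print(f"FAIL: {mount} is not mounted")
            return False
    return True


def application_healthy(url: str) -> bool:
    """Application check: does the health endpoint return HTTP 200?"""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except OSError:
        return False


if __name__ == "__main__":
    drill_start = time.monotonic()

    print("ssh_reachable:", ssh_reachable(TARGET_IP))
    print("volumes_mounted:", mounts_present(TARGET_IP, EXPECTED_MOUNTS))

    # RTO verification: poll until the application answers its health check,
    # then record the elapsed time from drill start to "application functional".
    while not application_healthy(HEALTH_URL):
        if time.monotonic() - drill_start > 3600:
            raise SystemExit("Application did not become healthy within 60 minutes")
        time.sleep(30)

    rto_minutes = (time.monotonic() - drill_start) / 60
    print(f"Application functional after {rto_minutes:.1f} minutes")
```

Data integrity spot checks (for example, querying the newest known record and comparing its timestamp against the recovery point time) are application-specific, and are best added as an extra step in the same script.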
Document Drill Results
After every drill, record the following (a structured example follows the list):
- Date and drill start/end times
- Recovery point used (ID, timestamp, how old it was)
- Time to application functionality (actual RTO)
- Issues discovered
- Corrective actions taken
- Confirmation that the DR plan resumed normal operation after the drill was stopped
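A lightweight way to keep these records consistent is to append one structured entry per drill to a log kept alongside your runbooks. The schema and values below are only an illustration; the point is that every drill captures the same fields.

```python
import json
from datetime import datetime, timezone

# Illustrative drill record; all field names and values are example data.
drill_record = {
    "drill_date": "2024-06-12",
    "drill_start": "2024-06-12T09:00:00Z",
    "drill_end": "2024-06-12T09:42:00Z",
    "recovery_point_id": "rp-0042",                    # placeholder recovery point ID
    "recovery_point_timestamp": "2024-06-12T08:30:00Z",
    "recovery_point_age_minutes": 30,
    "actual_rto_minutes": 27,
    "issues_found": ["example: application user missing on target VM"],
    "corrective_actions": ["example: updated provisioning template"],
    "dr_plan_resumed_after_drill": True,
    "recorded_at": datetime.now(timezone.utc).isoformat(),
}

# Append to a running log so drill history stays auditable.
with open("dr_drill_log.jsonl", "a") as fh:
    fh.write(json.dumps(drill_record) + "\n")
```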
This documentation is essential for compliance audits and for improving your DR process over time.
Perform Drills After Major Infrastructure Changes
In addition to the scheduled drill cadence, always run a drill after:
- Adding or removing volumes from the source VM
- OS upgrades on the source VM
- Application upgrades that change the data format
- Changes to security groups
- Changes to dependent services (database engine upgrades, etc.)
2. Preparing for a Real Recovery
When an actual disaster occurs, you may be under significant time pressure. Prepare in advance so you can act quickly and confidently.
Pre-Recovery Preparation Checklist (Do These Ahead of Time)
Document the following and store it somewhere that remains accessible even when the E2E console is unavailable (a structured example follows the list):
- DR plan ID for each protected VM
- Target VM IP addresses (check from plan details while the source region is healthy)
- SSH key or credentials to access the target VM
- List of dependent services that need to know about the IP change
- Contact list of team members to notify
- Escalation path to E2E Networks support
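One option is to keep this information as a small machine-readable file stored outside the affected region, for example in a git repository with a printed copy in the on-call binder. Every value in the sketch below is a placeholder.

```python
import json

# Illustrative pre-recovery reference document; every value below is a placeholder.
recovery_reference = {
    "protected_vms": [
        {
            "source_vm": "app-server-01",
            "dr_plan_id": "drp-example-123",        # placeholder DR plan ID
            "target_vm_ip": "203.0.113.10",         # noted from plan details while the source region is healthy
            "ssh_key_location": "ops-vault/dr/app-server-01",
            "dependent_services": ["payments-api", "reporting-cron"],
        }
    ],
    "notify": ["oncall-lead@example.com", "platform-team@example.com"],
    "escalation": "E2E Networks support ticket, then account manager phone",
}

with open("dr_recovery_reference.json", "w") as fh:
    json.dump(recovery_reference, fh, indent=2)
```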
Identify Your Recovery Decision Criteria in Advance
Agree on what constitutes a declaration of disaster that justifies recovery. Common criteria:
- Source region is completely inaccessible for more than X minutes
- E2E Networks confirms a region-level failure with no ETA for recovery
- Your monitoring shows total loss of connectivity to the source VM from multiple geographic vantage points
Waiting too long to declare a recovery increases downtime; failing over too early can create data inconsistencies if the source region recovers shortly afterwards. Define the threshold ahead of time.
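To make the "more than X minutes" criterion concrete, the sketch below encodes its simplest form: escalate for a disaster-declaration decision only after the source VM has failed health checks continuously for an agreed number of minutes. The endpoint, interval, and threshold are assumptions, and in practice you would run this (or an external monitoring service) from more than one vantage point.

```python
import time
import urllib.request

SOURCE_HEALTH_URL = "http://203.0.113.5/healthz"   # hypothetical source VM health endpoint
CHECK_INTERVAL_S = 60
DECLARE_AFTER_MINUTES = 15                         # illustrative value for your agreed "X minutes"

def source_is_up(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except OSError:
        return False

down_since = None
while True:
    if source_is_up(SOURCE_HEALTH_URL):
        down_since = None                          # reset the clock on any successful check
    elif down_since is None:
        down_since = time.monotonic()              # first failed check: start the clock
    elif (time.monotonic() - down_since) / 60 >= DECLARE_AFTER_MINUTES:
        print("Threshold exceeded: escalate for a disaster declaration decision")
        break
    time.sleep(CHECK_INTERVAL_S)
```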
Decide on Recovery Point Selection Ahead of Time
During a crisis, deciding which recovery point to restore from is high-stakes and time-sensitive. Establish a decision rule in advance (a selection sketch follows the list):
- Default: Use the most recent SUCCESSFUL recovery point (minimizes data loss)
- Exception: If there is reason to believe the most recent recovery point contains bad data (e.g., a failed deployment was the last event before the outage), roll back one or two recovery points
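The default rule is straightforward to encode so no one has to reason about it under pressure. The sketch below assumes you can export the DR plan's recovery point list as records with an ID, status, and timestamp; the field names and status values are assumptions.

```python
from datetime import datetime

def choose_recovery_point(recovery_points: list[dict], skip_latest: int = 0) -> dict:
    """Pick the newest SUCCESSFUL recovery point, optionally skipping the most
    recent `skip_latest` points (e.g. skip_latest=1 after a bad deployment)."""
    successful = [rp for rp in recovery_points if rp["status"] == "SUCCESS"]
    successful.sort(key=lambda rp: rp["timestamp"], reverse=True)
    if len(successful) <= skip_latest:
        raise ValueError("Not enough successful recovery points to choose from")
    return successful[skip_latest]

# Illustrative data; in practice this comes from your DR plan's recovery point list.
points = [
    {"id": "rp-0101", "status": "SUCCESS", "timestamp": datetime(2024, 6, 12, 8, 30)},
    {"id": "rp-0102", "status": "FAILED",  "timestamp": datetime(2024, 6, 12, 9, 0)},
    {"id": "rp-0103", "status": "SUCCESS", "timestamp": datetime(2024, 6, 12, 9, 30)},
]
print(choose_recovery_point(points)["id"])                 # default: rp-0103
print(choose_recovery_point(points, skip_latest=1)["id"])  # roll back one: rp-0101
```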
3. Post-Recovery Checklist
After triggering a recovery, work through this checklist systematically.
Immediately After Recovery Completes
- Confirm the target VM status is Running in the target region
- SSH/console access to the target VM is working
- All attached volumes are mounted and accessible
Application Recovery
- Run a basic smoke test of your application (key user flows, API health endpoint); see the sketch after this list
- Check application logs for startup errors
- Verify data consistency — run your application's built-in health checks if available
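A post-recovery smoke test can be a short script that exercises the health endpoint and at least one real read path. The base URL and endpoints below are placeholders for your own key flows.

```python
import json
import urllib.request

BASE_URL = "http://203.0.113.10:8080"      # assumed address of the recovered application

def get(path: str) -> tuple[int, bytes]:
    with urllib.request.urlopen(BASE_URL + path, timeout=10) as resp:
        return resp.status, resp.read()

# 1. Health endpoint answers.
status, _ = get("/healthz")                 # hypothetical health endpoint
assert status == 200, "health check failed"

# 2. The application can read from its database via any cheap read-only endpoint.
status, body = get("/api/orders/recent")    # hypothetical read endpoint
assert status == 200 and json.loads(body), "read path failed or returned no data"

# 3. Also exercise one representative write flow (e.g. create and delete a test
#    record through the API) before declaring the application functional.
print("Smoke test passed")
```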
Network & Routing Updates
- Update any hardcoded IP addresses in configuration files (see the sketch after this list)
- If using an API gateway, update the backend target
- Notify your CDN provider if they cache origin server IPs
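If any configuration files reference the old source IP directly, a short script can find and rewrite those references before services are restarted. The directory, file pattern, and addresses below are illustrative.

```python
import pathlib

OLD_IP = "198.51.100.20"                   # illustrative old source VM IP
NEW_IP = "203.0.113.10"                    # illustrative recovered target VM IP
CONFIG_DIR = pathlib.Path("/etc/myapp")    # hypothetical configuration directory

for path in CONFIG_DIR.rglob("*.conf"):
    text = path.read_text()
    if OLD_IP in text:
        print(f"Updating {path}")
        path.write_text(text.replace(OLD_IP, NEW_IP))
```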
Notifications
- Notify internal stakeholders (engineering, operations, management)
- Update your status page if you have one
- Notify customers if the outage was customer-facing
- Declare the incident resolved once all checks pass
Post-Recovery Review (Within 48 Hours)
- Conduct a post-mortem: What failed? Why? What could prevent it?
- Assess data loss: What was the actual gap between the last recovery point and the outage?
- Review your DR plan: Should RPO be reduced? Should retention be longer?
- Decide what to do with the original source VM (now in the failed/restored region)
- Create a new DR plan for the recovered VM if ongoing protection is required