---
title: Disaster Recovery Drills and Recovery Runbook
---

## 1. DR Drills: Test Regularly

A DR plan that has never been tested is a plan you cannot trust. Many organizations discover DR issues only when an actual disaster occurs, at exactly the wrong moment.

### Recommended Drill Frequency

| Workload Tier | Minimum Drill Frequency |
| ------------- | ----------------------- |
| Critical | Every 2 weeks |
| Important | Monthly |
| Standard | Every 6 months |
| Low | Optional; only if the environment is used in real production scenarios |

### What to Validate During a Drill

Do not just verify that the target VM powers on. Run through a comprehensive checklist:

**Infrastructure checks:**

- [ ] Target VM is accessible via SSH or console
- [ ] All expected volumes are attached and mounted correctly
- [ ] Networking is functional (can reach the internet / internal services)
- [ ] Security groups are correctly applied

**Application checks:**

- [ ] Application services start correctly
- [ ] Application can read from and write to the database
- [ ] External dependencies (APIs, queues, DNS) resolve correctly
- [ ] Application logs show no critical errors on startup

**Data integrity checks:**

- [ ] Spot-check a recent known record or transaction that should be in the recovery point
- [ ] Verify file system integrity (no corruption warnings on mount)
- [ ] Check application-specific data consistency (e.g., pending order counts, session data)

**RTO verification:**

- [ ] Record the time from "start drill" to "application functional". Is it within your business's acceptable recovery window?
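The RTO verification step comes down to simple timestamp arithmetic. The following is a minimal sketch, assuming you note the two timestamps by hand during the drill; the function name `measure_rto`, the example times, and the one-hour target are illustrative assumptions, not part of any E2E Networks tooling.

```python
from datetime import datetime, timedelta


def measure_rto(drill_start: datetime,
                app_functional: datetime,
                target: timedelta) -> tuple[timedelta, bool]:
    """Return the measured RTO and whether it falls within the target window."""
    if app_functional < drill_start:
        raise ValueError("app_functional must not precede drill_start")
    actual = app_functional - drill_start
    return actual, actual <= target


# Hypothetical drill: started at 10:00, application functional at 10:42,
# measured against an assumed 1-hour acceptable recovery window.
actual, within_target = measure_rto(
    drill_start=datetime(2024, 1, 15, 10, 0),
    app_functional=datetime(2024, 1, 15, 10, 42),
    target=timedelta(hours=1),
)
print(f"Actual RTO: {actual}, within target: {within_target}")
# prints "Actual RTO: 0:42:00, within target: True"
```

Keeping the measured value alongside the target makes it easy to see how RTO changes from one drill to the next.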
### Document Drill Results

After every drill, record:

- Date and drill start/end times
- Recovery point used (ID, timestamp, how old it was)
- Time to application functionality (actual RTO)
- Issues discovered
- Corrective actions taken
- Confirmation that the DR plan resumed normal operation after the drill was stopped

This documentation is essential for compliance audits and for improving your DR process over time.

### Perform Drills After Major Infrastructure Changes

In addition to the scheduled drill cadence, always run a drill after:

- Adding or removing volumes from the source VM
- OS upgrades on the source VM
- Application upgrades that change the data format
- Changes to security groups
- Changes to dependent services (database engine upgrades, etc.)

---

## 2. Preparing for a Real Recovery

When an actual disaster occurs, you may be under significant time pressure. Prepare in advance so you can act quickly and confidently.

### Pre-Recovery Preparation Checklist (Do These Ahead of Time)

**Document the following and store it somewhere that remains accessible when the E2E console is unavailable:**

- [ ] DR plan ID for each protected VM
- [ ] Target VM IP addresses (check from plan details while the source region is healthy)
- [ ] SSH key or credentials to access the target VM
- [ ] List of dependent services that need to know about the IP change
- [ ] Contact list of team members to notify
- [ ] Escalation path to E2E Networks support

### Identify Your Recovery Decision Criteria in Advance

Agree on which conditions justify declaring a disaster and triggering a recovery. Common criteria:

- Source region is completely inaccessible for more than X minutes
- E2E Networks confirms a region-level failure with no ETA for recovery
- Your monitoring shows total loss of connectivity to the source VM from multiple geographic vantage points

Waiting too long to declare a recovery increases downtime. Failing over too early, when the source region recovers shortly afterward, can create data inconsistencies.
Define the threshold ahead of time.

### Decide on Recovery Point Selection Ahead of Time

During a crisis, deciding which recovery point to restore from is high-stakes and time-sensitive. Establish a decision rule in advance:

- **Default:** Use the most recent **SUCCESSFUL** recovery point (minimizes data loss)
- **Exception:** If there is reason to believe the most recent recovery point contains bad data (e.g., a failed deployment was the last event before the outage), roll back one or two recovery points

---

## 3. Post-Recovery Checklist

After triggering a recovery, work through this checklist systematically.

### Immediately After Recovery Completes

- [ ] Confirm the target VM status is **Running** in the target region
- [ ] SSH/console access to the target VM is working
- [ ] All attached volumes are mounted and accessible

### Application Recovery

- [ ] Run a basic smoke test of your application (key user flows, API health endpoint)
- [ ] Check application logs for startup errors
- [ ] Verify data consistency: run your application's built-in health checks if available

### Network & Routing Updates

- [ ] Update any hardcoded IP addresses in configuration files
- [ ] If using an API gateway, update the backend target
- [ ] Notify your CDN provider if they cache origin server IPs

### Notifications

- [ ] Notify internal stakeholders (engineering, operations, management)
- [ ] Update your status page if you have one
- [ ] Notify customers if the outage was customer-facing
- [ ] Declare the incident resolved once all checks pass

### Post-Recovery Review (Within 48 Hours)

- [ ] Conduct a post-mortem: What failed? Why? What could prevent it?
- [ ] Assess data loss: What was the actual gap between the last recovery point and the outage?
- [ ] Review your DR plan: Should RPO be reduced? Should retention be longer?
- [ ] Decide what to do with the original source VM (now in the failed/restored region)
- [ ] Create a new DR plan for the recovered VM if ongoing protection is required

---
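The recovery point decision rule from section 2 and the data-loss question in the post-recovery review can be worked through together. The sketch below assumes you have copied recovery point details into a simple list by hand; the `RecoveryPoint` fields, the `FAILED` status value, the IDs, and the timestamps are all hypothetical and do not reflect an E2E Networks API. Only the **SUCCESSFUL** status comes from the rule itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryPoint:
    point_id: str         # hypothetical identifier
    created_at: datetime  # when the recovery point was taken
    status: str           # "SUCCESSFUL" per the decision rule; other values assumed


def select_recovery_point(points: list[RecoveryPoint], skip: int = 0) -> RecoveryPoint:
    """Pick the most recent SUCCESSFUL point; skip=1 or 2 rolls back past suspect ones."""
    if skip < 0:
        raise ValueError("skip must be non-negative")
    candidates = sorted(
        (p for p in points if p.status == "SUCCESSFUL"),
        key=lambda p: p.created_at,
        reverse=True,
    )
    if skip >= len(candidates):
        raise ValueError("not enough successful recovery points to roll back that far")
    return candidates[skip]


def data_loss_window(point: RecoveryPoint, outage_start: datetime) -> timedelta:
    """Gap between the restored recovery point and the start of the outage."""
    if point.created_at > outage_start:
        raise ValueError("recovery point was taken after the outage started")
    return outage_start - point.created_at


# Hypothetical recovery points and outage time.
points = [
    RecoveryPoint("rp-1", datetime(2024, 1, 15, 8, 0), "SUCCESSFUL"),
    RecoveryPoint("rp-2", datetime(2024, 1, 15, 9, 0), "SUCCESSFUL"),
    RecoveryPoint("rp-3", datetime(2024, 1, 15, 10, 0), "FAILED"),
]
outage_start = datetime(2024, 1, 15, 10, 30)

chosen = select_recovery_point(points)
print(chosen.point_id, data_loss_window(chosen, outage_start))
# prints "rp-2 1:30:00"
```

Calling `select_recovery_point(points, skip=1)` applies the exception case and rolls back one additional recovery point, at the cost of a larger data-loss window.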