Disaster Recovery Drills and Recovery Runbook
1. DR Drills — Test Regularly
A DR plan that has never been tested is a plan you cannot trust. Many organizations discover DR issues only when an actual disaster occurs — at exactly the wrong moment.
Recommended Drill Frequency
| Workload Tier | Minimum Drill Frequency |
|---|---|
| Critical | Every 2 weeks |
| Important | Monthly |
| Standard | Every 6 months |
| Low | Optional; only if the environment is used in real production scenarios |
What to Validate During a Drill
Do not just verify that the target VM powers on. Run through a comprehensive checklist (a scripted sketch of the key checks follows the list):
Infrastructure checks:
- Target VM is accessible via SSH or console
- All expected volumes are attached and mounted correctly
- Networking is functional (can reach the internet / internal services)
- Security groups are correctly applied
Application checks:
- Application services start correctly
- Application can read from and write to the database
- External dependencies (APIs, queues, DNS) resolve correctly
- Application logs show no critical errors on startup
Data integrity checks:
- Spot-check a recent known record or transaction that should be in the recovery point
- Verify file system integrity (no corruption warnings on mount)
- Check application-specific data consistency (e.g., pending order counts, session data)
RTO verification:
- Record the time from "start drill" to "application functional" — is it within your business's acceptable recovery window?
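Much of this checklist can be scripted so every drill runs the same checks and the measured RTO stays comparable between drills. Below is a minimal Python sketch, assuming a Linux target VM reachable over SSH and a hypothetical `/healthz` application endpoint; the IP address, mount points, and URL are placeholders to adapt to your own DR plan.

```python
#!/usr/bin/env python3
"""Minimal drill-validation sketch.

All values below (target IP, SSH user, expected mount points, health URL)
are illustrative assumptions; adapt them to your own DR plan."""

import socket
import subprocess
import time
import urllib.request

TARGET_IP = "203.0.113.10"                        # assumed target VM IP from the DR plan details
EXPECTED_MOUNTS = ["/", "/data"]                  # assumed mount points for attached volumes
HEALTH_URL = f"http://{TARGET_IP}:8080/healthz"   # hypothetical application health endpoint


def ssh_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Infrastructure check: can we open a TCP connection to the SSH port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def mounts_present(host: str, mounts: list[str]) -> bool:
    """Infrastructure check: are the expected volumes mounted on the target VM?
    Uses `findmnt` over SSH and assumes key-based SSH access is already set up."""
    for mount in mounts:
        result = subprocess.run(
            ["ssh", f"root@{host}", "findmnt", "--target", mount],
            capture_output=True, timeout=30,
        )
        if result.returncode != 0:
            print(f"FAIL: {mount} is not mounted")
            return False
    return True


def application_healthy(url: str) -> bool:
    """Application check: does the health endpoint return HTTP 200?"""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except OSError:
        return False


if __name__ == "__main__":
    drill_start = time.monotonic()

    print("ssh_reachable:", ssh_reachable(TARGET_IP))
    print("volumes_mounted:", mounts_present(TARGET_IP, EXPECTED_MOUNTS))

    # RTO verification: poll until the application answers its health check,
    # then record the elapsed time from drill start to "application functional".
    while not application_healthy(HEALTH_URL):
        if time.monotonic() - drill_start > 3600:
            raise SystemExit("Application did not become healthy within 60 minutes")
        time.sleep(30)

    rto_minutes = (time.monotonic() - drill_start) / 60
    print(f"Application functional after {rto_minutes:.1f} minutes")
```

Data integrity spot checks (for example, querying the newest known record and comparing its timestamp against the recovery point time) are application-specific, and are best added as an extra step in the same script.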
Document Drill Results
After every drill, record the following (a structured example follows the list):
- Date and drill start/end times
- Recovery point used (ID, timestamp, how old it was)
- Time to application functionality (actual RTO)
- Issues discovered
- Corrective actions taken
- Confirmation that the DR plan resumed normal operation after the drill was stopped
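A lightweight way to keep these records consistent is to append one structured entry per drill to a log kept alongside your runbooks. The schema and values below are only an illustration; the point is that every drill captures the same fields.

```python
import json
from datetime import datetime, timezone

# Illustrative drill record; all field names and values are example data.
drill_record = {
    "drill_date": "2024-06-12",
    "drill_start": "2024-06-12T09:00:00Z",
    "drill_end": "2024-06-12T09:42:00Z",
    "recovery_point_id": "rp-0042",                    # placeholder recovery point ID
    "recovery_point_timestamp": "2024-06-12T08:30:00Z",
    "recovery_point_age_minutes": 30,
    "actual_rto_minutes": 27,
    "issues_found": ["example: application user missing on target VM"],
    "corrective_actions": ["example: updated provisioning template"],
    "dr_plan_resumed_after_drill": True,
    "recorded_at": datetime.now(timezone.utc).isoformat(),
}

# Append to a running log so drill history stays auditable.
with open("dr_drill_log.jsonl", "a") as fh:
    fh.write(json.dumps(drill_record) + "\n")
```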
This documentation is essential for compliance audits and for improving your DR process over time.
Perform Drills After Major Infrastructure Changes
In addition to the scheduled drill cadence, always run a drill after:
- Adding or removing volumes from the source VM
- OS upgrades on the source VM
- Application upgrades that change the data format
- Changes to security groups
- Changes to dependent services (database engine upgrades, etc.)
2. Preparing for a Real Recovery
When an actual disaster occurs, you may be under significant time pressure. Prepare in advance so you can act quickly and confidently.
Pre-Recovery Preparation Checklist (Do These Ahead of Time)
Document the following and store it somewhere that remains accessible even when the E2E console is unavailable (a structured example follows the list):
- DR plan ID for each protected VM
- Target VM IP addresses (check from plan details while the source region is healthy)
- SSH key or credentials to access the target VM
- List of dependent services that need to know about the IP change
- Contact list of team members to notify
- Escalation path to E2E Networks support
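One option is to keep this information as a small machine-readable file stored outside the affected region, for example in a git repository with a printed copy in the on-call binder. Every value in the sketch below is a placeholder.

```python
import json

# Illustrative pre-recovery reference document; every value below is a placeholder.
recovery_reference = {
    "protected_vms": [
        {
            "source_vm": "app-server-01",
            "dr_plan_id": "drp-example-123",        # placeholder DR plan ID
            "target_vm_ip": "203.0.113.10",         # noted from plan details while the source region is healthy
            "ssh_key_location": "ops-vault/dr/app-server-01",
            "dependent_services": ["payments-api", "reporting-cron"],
        }
    ],
    "notify": ["oncall-lead@example.com", "platform-team@example.com"],
    "escalation": "E2E Networks support ticket, then account manager phone",
}

with open("dr_recovery_reference.json", "w") as fh:
    json.dump(recovery_reference, fh, indent=2)
```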
Identify Your Recovery Decision Criteria in Advance
Agree on what constitutes a declaration of disaster that justifies recovery. Common criteria:
- Source region is completely inaccessible for more than X minutes
- E2E Networks confirms a region-level failure with no ETA for recovery
- Your monitoring shows total loss of connectivity to the source VM from multiple geographic vantage points
Waiting too long to declare a recovery increases downtime; failing over too early can create data inconsistencies if the source region recovers shortly afterwards. Define the threshold ahead of time.
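To make the "more than X minutes" criterion concrete, the sketch below encodes its simplest form: escalate for a disaster-declaration decision only after the source VM has failed health checks continuously for an agreed number of minutes. The endpoint, interval, and threshold are assumptions, and in practice you would run this (or an external monitoring service) from more than one vantage point.

```python
import time
import urllib.request

SOURCE_HEALTH_URL = "http://203.0.113.5/healthz"   # hypothetical source VM health endpoint
CHECK_INTERVAL_S = 60
DECLARE_AFTER_MINUTES = 15                         # illustrative value for your agreed "X minutes"

def source_is_up(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except OSError:
        return False

down_since = None
while True:
    if source_is_up(SOURCE_HEALTH_URL):
        down_since = None                          # reset the clock on any successful check
    elif down_since is None:
        down_since = time.monotonic()              # first failed check: start the clock
    elif (time.monotonic() - down_since) / 60 >= DECLARE_AFTER_MINUTES:
        print("Threshold exceeded: escalate for a disaster declaration decision")
        break
    time.sleep(CHECK_INTERVAL_S)
```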
Decide on Recovery Point Selection Ahead of Time
During a crisis, deciding which recovery point to restore from is high-stakes and time-sensitive. Establish a decision rule in advance (a selection sketch follows the list):
- Default: Use the most recent SUCCESSFUL recovery point (minimizes data loss)
- Exception: If there is reason to believe the most recent recovery point contains bad data (e.g., a failed deployment was the last event before the outage), roll back one or two recovery points
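The default rule is straightforward to encode so no one has to reason about it under pressure. The sketch below assumes you can export the DR plan's recovery point list as records with an ID, status, and timestamp; the field names and status values are assumptions.

```python
from datetime import datetime

def choose_recovery_point(recovery_points: list[dict], skip_latest: int = 0) -> dict:
    """Pick the newest SUCCESSFUL recovery point, optionally skipping the most
    recent `skip_latest` points (e.g. skip_latest=1 after a bad deployment)."""
    successful = [rp for rp in recovery_points if rp["status"] == "SUCCESS"]
    successful.sort(key=lambda rp: rp["timestamp"], reverse=True)
    if len(successful) <= skip_latest:
        raise ValueError("Not enough successful recovery points to choose from")
    return successful[skip_latest]

# Illustrative data; in practice this comes from your DR plan's recovery point list.
points = [
    {"id": "rp-0101", "status": "SUCCESS", "timestamp": datetime(2024, 6, 12, 8, 30)},
    {"id": "rp-0102", "status": "FAILED",  "timestamp": datetime(2024, 6, 12, 9, 0)},
    {"id": "rp-0103", "status": "SUCCESS", "timestamp": datetime(2024, 6, 12, 9, 30)},
]
print(choose_recovery_point(points)["id"])                 # default: rp-0103
print(choose_recovery_point(points, skip_latest=1)["id"])  # roll back one: rp-0101
```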
3. Post-Recovery Checklist
After triggering a recovery, work through this checklist systematically.
Immediately After Recovery Completes
- Confirm the target VM status is Running in the target region
- SSH/console access to the target VM is working
- All attached volumes are mounted and accessible
Application Recovery
- Run a basic smoke test of your application (key user flows, API health endpoint); see the sketch after this list
- Check application logs for startup errors
- Verify data consistency — run your application's built-in health checks if available
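A post-recovery smoke test can be a short script that exercises the health endpoint and at least one real read path. The base URL and endpoints below are placeholders for your own key flows.

```python
import json
import urllib.request

BASE_URL = "http://203.0.113.10:8080"      # assumed address of the recovered application

def get(path: str) -> tuple[int, bytes]:
    with urllib.request.urlopen(BASE_URL + path, timeout=10) as resp:
        return resp.status, resp.read()

# 1. Health endpoint answers.
status, _ = get("/healthz")                 # hypothetical health endpoint
assert status == 200, "health check failed"

# 2. The application can read from its database via any cheap read-only endpoint.
status, body = get("/api/orders/recent")    # hypothetical read endpoint
assert status == 200 and json.loads(body), "read path failed or returned no data"

# 3. Also exercise one representative write flow (e.g. create and delete a test
#    record through the API) before declaring the application functional.
print("Smoke test passed")
```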
Network & Routing Updates
- Update any hardcoded IP addresses in configuration files (see the sketch after this list)
- If using an API gateway, update the backend target
- Notify your CDN provider if they cache origin server IPs
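If any configuration files reference the old source IP directly, a short script can find and rewrite those references before services are restarted. The directory, file pattern, and addresses below are illustrative.

```python
import pathlib

OLD_IP = "198.51.100.20"                   # illustrative old source VM IP
NEW_IP = "203.0.113.10"                    # illustrative recovered target VM IP
CONFIG_DIR = pathlib.Path("/etc/myapp")    # hypothetical configuration directory

for path in CONFIG_DIR.rglob("*.conf"):
    text = path.read_text()
    if OLD_IP in text:
        print(f"Updating {path}")
        path.write_text(text.replace(OLD_IP, NEW_IP))
```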
Notifications
- Notify internal stakeholders (engineering, operations, management)
- Update your status page if you have one
- Notify customers if the outage was customer-facing
- Declare the incident resolved once all checks pass
Post-Recovery Review (Within 48 Hours)
- Conduct a post-mortem: What failed? Why? What could prevent it?
- Assess data loss: What was the actual gap between the last recovery point and the outage?
- Review your DR plan: Should RPO be reduced? Should retention be longer?
- Decide what to do with the original source VM (now in the failed/restored region)
- Create a new DR plan for the recovered VM if ongoing protection is required