---
title: Disaster Recovery Drills and Recovery Runbook
---

## 1. DR Drills — Test Regularly

A DR plan that has never been tested is a plan you cannot trust. Many organizations discover gaps in their DR plan only during an actual disaster, at exactly the wrong moment.

### Recommended Drill Frequency

| Workload Tier | Minimum Drill Frequency                                                   |
| ------------- | ------------------------------------------------------------------------- |
| Critical      | Every 2 weeks                                                             |
| Important     | Every month                                                               |
| Standard      | Every 6 months                                                            |
| Low           | Optional; only if the environment is used in real production scenarios    |
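
As an illustration, the cadence in the table can be turned into a small due-date check. This is a minimal sketch: the tier names come from the table, while `DRILL_INTERVAL_DAYS`, `next_drill_due`, and `is_overdue` are hypothetical names, and approximating a month as 30 days is an assumption of the sketch rather than a rule of the DR service.

```python
from datetime import date, timedelta
from typing import Optional

# Intervals mirror the table above. Month lengths are approximated in days,
# which is an assumption of this sketch.
DRILL_INTERVAL_DAYS = {
    "critical": 14,
    "important": 30,
    "standard": 182,
    "low": None,  # drills are optional for this tier
}


def next_drill_due(tier: str, last_drill: date) -> Optional[date]:
    """Return the latest date for the next drill, or None if drills are optional."""
    interval = DRILL_INTERVAL_DAYS[tier.lower()]  # raises KeyError on an unknown tier
    if interval is None:
        return None
    return last_drill + timedelta(days=interval)


def is_overdue(tier: str, last_drill: date, today: date) -> bool:
    """True when a mandatory drill has not been run within its interval."""
    due = next_drill_due(tier, last_drill)
    return due is not None and today > due
```

A check like this could run from a scheduled job and alert the team that owns each workload when a drill is overdue.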

### What to Validate During a Drill

Do not just verify that the target VM powers on. Run through a comprehensive checklist:

**Infrastructure checks:**

- [ ] Target VM is accessible via SSH or console
- [ ] All expected volumes are attached and mounted correctly
- [ ] Networking is functional (can reach the internet / internal services)
- [ ] Security groups are correctly applied

**Application checks:**

- [ ] Application services start correctly
- [ ] Application can read from and write to the database
- [ ] External dependencies (APIs, queues, DNS) resolve correctly
- [ ] Application logs show no critical errors on startup

**Data integrity checks:**

- [ ] Spot-check a recent known record or transaction that should be in the recovery point
- [ ] Verify file system integrity (no corruption warnings on mount)
- [ ] Check application-specific data consistency (e.g., pending order counts, session data)

**RTO verification:**

- [ ] Record the time from "start drill" to "application functional" and confirm it is within your business's acceptable recovery window
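
The RTO check is simple arithmetic on two timestamps. The sketch below assumes you note the two times by hand or from your own tooling. `check_rto` and the one-hour target are illustrative, not values defined by the DR service.

```python
from datetime import datetime, timedelta
from typing import Tuple


def check_rto(drill_start: datetime, app_functional: datetime,
              rto_target: timedelta) -> Tuple[timedelta, bool]:
    """Return (actual recovery time, whether it met the target)."""
    if app_functional < drill_start:
        raise ValueError("app_functional must not precede drill_start")
    actual = app_functional - drill_start
    return actual, actual <= rto_target


# Example values only: a drill started at 10:00 with the app usable at 10:42.
actual, met = check_rto(
    drill_start=datetime(2024, 5, 1, 10, 0),
    app_functional=datetime(2024, 5, 1, 10, 42),
    rto_target=timedelta(hours=1),
)
print(f"Actual RTO: {actual}, within target: {met}")
```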

### Document Drill Results

After every drill, record:

- Date and drill start/end times
- Recovery point used (ID, timestamp, how old it was)
- Time to application functionality (actual RTO)
- Issues discovered
- Corrective actions taken
- Confirmation that the DR plan resumes normally after stopping the drill

This documentation is essential for compliance audits and for improving your DR process over time.
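
One way to keep these records consistent is to capture them in a fixed structure and store them as JSON. The following is a sketch under assumptions: the field names are hypothetical and simply follow the list above, and the two derived values (actual RTO and recovery point age) are computed from the recorded timestamps.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import List


@dataclass
class DrillRecord:
    drill_start: datetime
    drill_end: datetime
    recovery_point_id: str
    recovery_point_time: datetime
    app_functional_at: datetime
    issues: List[str] = field(default_factory=list)
    corrective_actions: List[str] = field(default_factory=list)
    plan_resumed_normally: bool = False

    def to_json(self) -> str:
        """Serialize the record, adding the derived RTO and recovery point age."""
        data = asdict(self)
        data["actual_rto_seconds"] = (
            self.app_functional_at - self.drill_start
        ).total_seconds()
        data["recovery_point_age_seconds"] = (
            self.drill_start - self.recovery_point_time
        ).total_seconds()
        # datetime values are not JSON-native, so write them as ISO 8601 strings.
        return json.dumps(data, default=datetime.isoformat, indent=2)
```

Storing one such file per drill makes it easy to compare actual RTO across drills and to hand the history to an auditor.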

### Perform Drills After Major Infrastructure Changes

In addition to the scheduled drill cadence, always run a drill after:

- Adding or removing volumes from the source VM
- OS upgrades on the source VM
- Application upgrades that change the data format
- Changes to security groups
- Changes to dependent services (database engine upgrades, etc.)

---

## 2. Preparing for a Real Recovery

When an actual disaster occurs, you may be under significant time pressure. Prepare in advance so you can act quickly and confidently.

### Pre-Recovery Preparation Checklist (Do These Ahead of Time)

**Document the following and store them somewhere accessible when the E2E console is unavailable:**

- [ ] DR plan ID for each protected VM
- [ ] Target VM IP addresses (check from plan details while the source region is healthy)
- [ ] SSH key or credentials to access the target VM
- [ ] List of dependent services that need to know about the IP change
- [ ] Contact list of team members to notify
- [ ] Escalation path to E2E Networks support

### Identify Your Recovery Decision Criteria in Advance

Agree on what constitutes a declaration of disaster that justifies recovery. Common criteria:

- Source region is completely inaccessible for more than X minutes
- E2E Networks confirms a region-level failure with no ETA for recovery
- Your monitoring shows total loss of connectivity to the source VM from multiple geographic vantage points

Waiting too long to declare a disaster increases downtime. Failing over too early, only for the source region to recover shortly afterward, can create data inconsistencies. Define the threshold ahead of time.
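
Writing the criteria down as code forces the team to pick concrete thresholds. The sketch below encodes the three example criteria above. `OutageSignals`, `criteria_met`, and the field names are hypothetical, and how the signals are gathered (monitoring, provider status updates) is left to your own tooling. Whether one criterion is enough, or several must hold together, is a policy decision this sketch does not make for you.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OutageSignals:
    minutes_unreachable: float        # how long the source region has been inaccessible
    provider_confirmed_failure: bool  # provider confirmed a region-level failure
    provider_eta_known: bool          # provider gave an ETA for recovery
    vantage_points_failing: int       # monitoring locations that cannot reach the source VM
    vantage_points_total: int         # monitoring locations in use


def criteria_met(s: OutageSignals, max_unreachable_minutes: float) -> List[str]:
    """Return the names of the disaster criteria that currently hold."""
    met = []
    if s.minutes_unreachable > max_unreachable_minutes:
        met.append("region_unreachable_too_long")
    if s.provider_confirmed_failure and not s.provider_eta_known:
        met.append("provider_confirmed_no_eta")
    # "Multiple vantage points" is read here as at least two, all failing.
    if s.vantage_points_total >= 2 and s.vantage_points_failing == s.vantage_points_total:
        met.append("total_loss_from_multiple_vantage_points")
    return met
```

Returning the list of matched criteria, rather than a single boolean, gives the incident lead something concrete to record when the disaster is declared.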

### Decide on Recovery Point Selection Ahead of Time

During a crisis, deciding which recovery point to restore from is high-stakes and time-sensitive. Establish a decision rule in advance:

- **Default:** Use the most recent **SUCCESSFUL** recovery point (minimizes data loss)
- **Exception:** If there is reason to believe the most recent recovery point contains bad data (e.g., a failed deployment was the last event before the outage), roll back one or two recovery points
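
The decision rule above can be sketched as a small selection function. This assumes you have already listed the recovery points for the plan and loaded them into memory. `RecoveryPoint`, its fields, and `select_recovery_point` are hypothetical names, not part of any E2E Networks API. The only value taken from the text is the `SUCCESSFUL` status.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class RecoveryPoint:
    id: str
    created_at: datetime
    status: str  # only "SUCCESSFUL" points are considered usable


def select_recovery_point(points: List[RecoveryPoint],
                          skip: int = 0) -> Optional[RecoveryPoint]:
    """Pick the newest SUCCESSFUL point, optionally rolling back `skip` points.

    skip=0 is the default rule; skip=1 or 2 is the exception for suspected bad data.
    Returns None when not enough successful points exist.
    """
    if skip < 0:
        raise ValueError("skip must be zero or positive")
    usable = sorted(
        (p for p in points if p.status == "SUCCESSFUL"),
        key=lambda p: p.created_at,
        reverse=True,  # newest first
    )
    return usable[skip] if skip < len(usable) else None
```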

---

## 3. Post-Recovery Checklist

After triggering a recovery, work through this checklist systematically.

### Immediately After Recovery Completes

- [ ] Confirm the target VM status is **Running** in the target region
- [ ] SSH/console access to the target VM is working
- [ ] All attached volumes are mounted and accessible

### Application Recovery

- [ ] Run a basic smoke test of your application (key user flows, API health endpoint)
- [ ] Check application logs for startup errors
- [ ] Verify data consistency — run your application's built-in health checks if available

### Network & Routing Updates

- [ ] Update any hardcoded IP addresses in configuration files
- [ ] If using an API gateway, update the backend target
- [ ] Notify your CDN provider if they cache origin server IPs
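
Finding hardcoded IP addresses by hand is error-prone under pressure. The sketch below searches configuration files under a directory for the old source IP. `find_hardcoded_ip` and the default file patterns are assumptions, so adjust them to the file types your application actually uses. It only reports matches and changes nothing.

```python
import re
from pathlib import Path
from typing import List, Sequence, Tuple

# Illustrative defaults; extend with the config formats your stack uses.
DEFAULT_PATTERNS = ("*.conf", "*.ini", "*.env", "*.yaml", "*.yml", "*.json")


def find_hardcoded_ip(root: str, old_ip: str,
                      patterns: Sequence[str] = DEFAULT_PATTERNS
                      ) -> List[Tuple[Path, int, str]]:
    """Return (file, line number, line) for every line that mentions old_ip."""
    # Match the address as a whole token so 10.0.0.5 does not match 10.0.0.50.
    token = re.compile(r"(?<![\d.])" + re.escape(old_ip) + r"(?!\.?\d)")
    hits = set()
    for pattern in patterns:
        for path in Path(root).rglob(pattern):
            if not path.is_file():
                continue
            text = path.read_text(encoding="utf-8", errors="ignore")
            for lineno, line in enumerate(text.splitlines(), start=1):
                if token.search(line):
                    hits.add((path, lineno, line.strip()))
    return sorted(hits)
```

Running this against your configuration directory during a drill, rather than for the first time during a real recovery, gives you a known list of files to update.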

### Notifications

- [ ] Notify internal stakeholders (engineering, operations, management)
- [ ] Update your status page if you have one
- [ ] Notify customers if the outage was customer-facing
- [ ] Declare the incident resolved once all checks pass

### Post-Recovery Review (Within 48 Hours)

- [ ] Conduct a post-mortem: What failed? Why? What would prevent a recurrence?
- [ ] Assess data loss: What was the actual gap between the last recovery point and the outage?
- [ ] Review your DR plan: Should RPO be reduced? Should retention be longer?
- [ ] Decide what to do with the original source VM (now in the failed/restored region)
- [ ] Create a new DR plan for the recovered VM if ongoing protection is required
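
For the data loss assessment, the gap is the time between the recovery point you restored from and the start of the outage. Comparing it with your RPO target shows whether the plan needs tightening. This is a minimal sketch: `assess_data_loss` and the returned keys are hypothetical, and the timestamps come from your own incident timeline.

```python
from datetime import datetime, timedelta


def assess_data_loss(recovery_point_time: datetime, outage_start: datetime,
                     rpo_target: timedelta) -> dict:
    """Compare the actual data loss window with the RPO target."""
    if recovery_point_time > outage_start:
        raise ValueError("recovery point must not be newer than the outage start")
    window = outage_start - recovery_point_time
    return {
        "data_loss_window": window,
        "within_rpo": window <= rpo_target,
        # How far the window exceeded the target (zero if it did not).
        "excess": max(window - rpo_target, timedelta(0)),
    }
```

A non-zero `excess` is a signal to revisit the RPO setting or the recovery point schedule during the review.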


---