Strategy and Design
1. Planning Your Disaster Recovery Strategy
A DRaaS plan is only as good as the strategy behind it. Before creating your first plan, invest time in defining your recovery objectives.
Define Business Continuity Requirements First
Answer these questions before configuring DRaaS:
| Question | Why It Matters |
|---|---|
| How long can this service be down before it causes significant business impact? | Sets your RTO target (DRaaS delivers an RTO of about 5 minutes) |
| How much data can you afford to lose in a worst-case failure? | Sets your RPO target and therefore your replication frequency |
| Do you have compliance or regulatory requirements for DR? | May dictate minimum retention periods and mandatory drill frequency |
| What is the business cost of an hour of downtime? | Helps justify the right DR tier and RPO investment |
Tier Your Workloads
Not every VM needs the same level of protection. Classify your VMs to avoid over-spending on non-critical systems.
| Tier | Description | Recommended RPO | Recommended Retention |
|---|---|---|---|
| Critical | Revenue-generating, customer-facing, or compliance-mandated | 1–4 hours | 30–90 days |
| Important | Internal tools, dev/staging with production data | 8–12 hours | 14–30 days |
| Standard | Non-critical internal systems, batch jobs | 24–48 hours | 7–14 days |
| Low | Dev/test environments, expendable data | 72–240 hours | 1–7 days |
Tip: Protect Critical-tier VMs first. Add lower tiers incrementally as you become comfortable with the DR workflow.
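As a rough illustration, the tier table can be kept as data and used to flag plans whose settings drift outside the recommended ranges. This is a minimal sketch, not part of DRaaS: the `TierPolicy` class, the `check_plan` function, and the inventory entries are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPolicy:
    rpo_hours: tuple[int, int]        # (min, max) recommended RPO in hours
    retention_days: tuple[int, int]   # (min, max) recommended retention in days


# Ranges copied from the tier table above.
TIER_POLICIES = {
    "Critical": TierPolicy((1, 4), (30, 90)),
    "Important": TierPolicy((8, 12), (14, 30)),
    "Standard": TierPolicy((24, 48), (7, 14)),
    "Low": TierPolicy((72, 240), (1, 7)),
}


def check_plan(tier: str, rpo_hours: int, retention_days: int) -> list[str]:
    """Return the ways a plan deviates from its tier's recommended ranges."""
    policy = TIER_POLICIES[tier]
    issues = []
    rpo_min, rpo_max = policy.rpo_hours
    if not rpo_min <= rpo_hours <= rpo_max:
        issues.append(f"RPO {rpo_hours}h is outside {rpo_min}-{rpo_max}h")
    ret_min, ret_max = policy.retention_days
    if not ret_min <= retention_days <= ret_max:
        issues.append(f"retention {retention_days}d is outside {ret_min}-{ret_max}d")
    return issues


# Hypothetical inventory: (VM name, tier, configured RPO hours, retention days).
inventory = [
    ("orders-db", "Critical", 2, 30),
    ("staging-api", "Low", 1, 90),
]
for name, tier, rpo, retention in inventory:
    for issue in check_plan(tier, rpo, retention):
        print(f"{name} ({tier}): {issue}")
```

In this example the Critical VM passes, while the Low-tier VM is flagged for both settings, which is the over-spending pattern that tiering is meant to catch.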
Document Your DR Plan Outside the Platform
Keep a written runbook that does not depend on the E2E console being accessible. If your source region is down, you need instructions that work from any device.
Your runbook should include:
- DR plan IDs for each protected VM
- Target VM IPs and SSH access details
- Contact list for team notification during an incident
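One way to keep such a runbook current is to generate it as plain text from a small record of your protected VMs and store the result somewhere that does not depend on the console. The sketch below is only an illustration: the `ProtectedVM` and `Runbook` classes are hypothetical, and the plan ID, IP address, and contact details are placeholders rather than real values.

```python
from dataclasses import dataclass, field


@dataclass
class ProtectedVM:
    name: str
    dr_plan_id: str    # placeholder; copy the real ID from your DR plan
    target_ip: str
    ssh_access: str    # how to reach the target VM, not the secret itself


@dataclass
class Runbook:
    vms: list[ProtectedVM] = field(default_factory=list)
    contacts: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the runbook as plain text readable on any device."""
        lines = ["DR RUNBOOK", "", "Protected VMs:"]
        for vm in self.vms:
            lines.append(
                f"- {vm.name}: plan {vm.dr_plan_id}, "
                f"target {vm.target_ip}, ssh: {vm.ssh_access}"
            )
        lines += ["", "Incident contacts:"]
        lines += [f"- {contact}" for contact in self.contacts]
        return "\n".join(lines) + "\n"


# All values below are placeholders for illustration only.
runbook = Runbook(
    vms=[
        ProtectedVM(
            name="orders-db",
            dr_plan_id="plan-EXAMPLE-001",
            target_ip="203.0.113.10",
            ssh_access="ssh ops@203.0.113.10 (key in team vault)",
        )
    ],
    contacts=["On-call lead: +00 0000 000000"],
)
text = runbook.render()
print(text)
```

The rendered text can then be saved to a shared drive, printed, or pinned in a team chat, so that it stays reachable when the source region is not.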
2. Choosing the Right RPO
The RPO you configure determines how often DRaaS ships a new recovery point. A lower RPO means less potential data loss but higher storage costs.
RPO Decision Matrix
| If your workload... | Recommended RPO |
|---|---|
| Processes financial transactions, orders, or user data continuously | 1–2 hours |
| Has a database that is written to frequently throughout the day | 2–4 hours |
| Receives batch updates a few times per day | 4–8 hours |
| Has data that changes mainly during business hours | 8–12 hours |
| Is a static or near-static service | 24–72 hours |
| Is a dev/test environment where data loss is acceptable | 72–240 hours |
RPO Configuration Tips
Set RPO to align with your data change rate, not just your RTO. A 1-hour RPO on a VM that barely changes wastes storage and budget. A 24-hour RPO on a database that processes thousands of transactions per hour puts up to a full day of transactions at risk in a worst-case failure.
Start conservatively, then tune. If you are unsure, start with a 4-hour RPO. After 2–4 weeks, review your recovery points: if they are all very small, little data changes between points and you can safely lengthen the RPO interval. If they are large, your data changes frequently and you may want a shorter RPO.
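The review step can be turned into a simple rule of thumb. The sketch below assumes you have noted the size of each recovery point and the size of the protected disk; the function name `suggest_rpo_change` and both thresholds (1% and 10% of disk size) are illustrative assumptions, not values defined by DRaaS, so adjust them to your own workload.

```python
from statistics import median


def suggest_rpo_change(
    point_sizes_gb: list[float],
    disk_size_gb: float,
    small_ratio: float = 0.01,   # assumed: "very small" means <= 1% of the disk
    large_ratio: float = 0.10,   # assumed: "large" means >= 10% of the disk
) -> str:
    """Suggest 'lengthen', 'shorten', or 'keep' for the RPO interval."""
    if not point_sizes_gb:
        raise ValueError("no recovery points to review")
    ratios = [size / disk_size_gb for size in point_sizes_gb]
    if max(ratios) <= small_ratio:
        # Every recovery point is tiny: little data changes between points.
        return "lengthen"
    if median(ratios) >= large_ratio:
        # A typical recovery point is large: data changes frequently.
        return "shorten"
    return "keep"


# Hypothetical sizes (GB) collected over a few weeks for a 100 GB disk.
print(suggest_rpo_change([0.2, 0.3, 0.1], disk_size_gb=100))
print(suggest_rpo_change([15, 20, 12], disk_size_gb=100))
```

Requiring every point to be small before lengthening the interval, while using the median to decide on shortening it, keeps a single unusually large recovery point from driving the recommendation in either direction.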
Avoid changing RPO frequently. Each RPO update triggers a scheduler change. Pick a value that works for your workload and only adjust it when your workload genuinely changes.
3. Choosing the Right Retention Period
Retention determines how many historical recovery points you can restore from. A longer retention window is your safety net for scenarios like:
- A database corruption that was not noticed for several days
- Ransomware that encrypts data over time before being detected
- A bad deployment that went unnoticed for days
Retention Decision Guide
| Scenario | Recommended Retention |
|---|---|
| Regulatory or compliance requirement (e.g., RBI, SEBI, ISO 27001) | Per regulation — often 30–90 days minimum |
| Risk of delayed data corruption (ransomware, silent data issues) | 30–90 days |
| Standard production workload with good monitoring | 14–30 days |
| Dev/staging environments | 7 days |
| Environments with very large disks (cost-sensitive) | 7 days with manual recovery points for key milestones |
Balance Retention Against Cost
Each stored recovery point consumes storage (billed per GB per hour). A 90-day retention with a 1-hour RPO keeps roughly 2,160 recovery points (90 days × 24 per day) once the window is full. Right-size your retention:
- Increase RPO and retention together to cover a longer time window without storing more recovery points (e.g., a 12-hour RPO with 30-day retention keeps about 60 recovery points, versus about 168 for a 1-hour RPO with 7-day retention)
- Reduce retention for test/dev environments — there is rarely a compliance reason to keep 30 days of recovery points for a staging server
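The trade-off between RPO and retention comes down to simple arithmetic: the number of recovery points kept at steady state is the retention window divided by the RPO interval. The sketch below shows that calculation along with a rough storage estimate. The average recovery point size and the per-GB-hour price are made-up inputs for illustration, and the estimate assumes each stored point is billed at its own size; check your actual pricing and recovery point sizes before relying on it.

```python
def recovery_point_count(rpo_hours: int, retention_days: int) -> int:
    """Recovery points kept once the retention window is full."""
    return (retention_days * 24) // rpo_hours


def hourly_storage_cost(
    points: int, avg_point_gb: float, price_per_gb_hour: float
) -> float:
    """Rough hourly cost, assuming each point is billed at its stored size."""
    return points * avg_point_gb * price_per_gb_hour


# Compare the combinations mentioned in this section.
for rpo_hours, retention_days in [(1, 90), (1, 7), (12, 30)]:
    count = recovery_point_count(rpo_hours, retention_days)
    print(f"RPO {rpo_hours}h, retention {retention_days}d: {count} recovery points")

# Hypothetical inputs: 2 GB average recovery point, 0.005 currency units per GB-hour.
estimate = hourly_storage_cost(recovery_point_count(12, 30), 2.0, 0.005)
print(f"Estimated hourly storage cost: {estimate:.2f}")
```

Keep in mind that recovery points taken at a longer interval may each be larger, so fewer points does not always translate into a proportionally lower bill.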
Use Manual Recovery Points for Long-Lived Milestones
Automatic retention purges old snapshots after your configured window. If you want to preserve a specific state (before a major release, at the end of a quarter), create a manual recovery point and note its ID. Manual recovery points follow the same retention rules as automatic ones, so re-create them before they age out if the state must be kept longer, and rename them so they are easy to track. Consider documenting important manual recovery point IDs in your external runbook.