---
title: Strategy and Design
---

## 1. Planning Your Disaster Recovery Strategy

A DRaaS plan is only as good as the strategy behind it. Before creating your first plan, invest time in defining your recovery objectives.

### Define Business Continuity Requirements First

Answer these questions before configuring DRaaS:

| Question | Why It Matters |
| --- | --- |
| How long can this service be down before it causes significant business impact? | Sets your **RTO target** (DRaaS delivers ~5 minutes) |
| How much data can you afford to lose in a worst-case failure? | Sets your **RPO target** and therefore your replication frequency |
| Do you have compliance or regulatory requirements for DR? | May dictate minimum retention periods and mandatory drill frequency |
| What is the business cost of an hour of downtime? | Helps justify the right DR tier and RPO investment |

### Tier Your Workloads

Not every VM needs the same level of protection. Classify your VMs to avoid over-spending on non-critical systems.

| Tier | Description | Recommended RPO | Recommended Retention |
| --- | --- | --- | --- |
| **Critical** | Revenue-generating, customer-facing, or compliance-mandated | 1–4 hours | 30–90 days |
| **Important** | Internal tools, dev/staging with production data | 8–12 hours | 14–30 days |
| **Standard** | Non-critical internal systems, batch jobs | 24–48 hours | 7–14 days |
| **Low** | Dev/test environments, expendable data | 72–240 hours | 1–7 days |

> **Tip:** Start by protecting Critical-tier VMs first. Add lower tiers incrementally as you become comfortable with the DR workflow.

### Document Your DR Plan Outside the Platform

Keep a written runbook that does not depend on the E2E console being accessible.
If your source region is down, you need instructions that work from any device. Your runbook should include:

- DR plan IDs for each protected VM
- Target VM IPs and SSH access details
- Contact list for team notification during an incident

---

## 2. Choosing the Right RPO

The RPO you configure determines how often DRaaS ships a new recovery point. A lower RPO means less potential data loss but higher storage costs.

### RPO Decision Matrix

| If your workload... | Recommended RPO |
| --- | --- |
| Processes financial transactions, orders, or user data continuously | 1–2 hours |
| Has a database that is written to frequently throughout the day | 2–4 hours |
| Receives batch updates a few times per day | 4–8 hours |
| Has data that changes mainly during business hours | 8–12 hours |
| Is a static or near-static service | 24–72 hours |
| Is a dev/test environment where data loss is acceptable | 72–240 hours |

### RPO Configuration Tips

**Set RPO to align with your data change rate, not just your RTO.** A 1-hour RPO on a VM that barely changes wastes storage and budget. A 24-hour RPO on a database that processes thousands of transactions per hour leaves you dangerously exposed.

**Start conservatively, then tune.** If you are unsure, start with a 4-hour RPO. After 2–4 weeks, review your recovery points: if they are all very small in size, you can safely increase the RPO interval. If they are large, your data changes frequently and you may want a shorter RPO.

**Avoid changing RPO frequently.** Each RPO update triggers a scheduler change. Pick a value that works for your workload and only adjust it when your workload genuinely changes.

---

## 3. Choosing the Right Retention Period

Retention determines how many historical recovery points you can restore from.
A longer retention window is your safety net for scenarios like:

- A database corruption that was not noticed for several days
- Ransomware that encrypts data over time before being detected
- A bad deployment that went unnoticed for days

### Retention Decision Guide

| Scenario | Recommended Retention |
| --- | --- |
| Regulatory or compliance requirement (e.g., RBI, SEBI, ISO 27001) | Per regulation, often 30–90 days minimum |
| Risk of delayed data corruption (ransomware, silent data issues) | 30–90 days |
| Standard production workload with good monitoring | 14–30 days |
| Dev/staging environments | 7 days |
| Environments with very large disks (cost-sensitive) | 7 days with manual recovery points for key milestones |

### Balance Retention Against Cost

Each stored recovery point consumes space (billed per GB per hour). A 90-day retention with a 1-hour RPO creates an enormous number of snapshots. Right-size your retention:

- **Increase RPO + increase retention** for cost-neutral coverage of longer time windows (e.g., 12-hour RPO, 30-day retention instead of 1-hour RPO, 7-day retention)
- **Reduce retention for test/dev environments**: there is rarely a compliance reason to keep 30 days of recovery points for a staging server

### Use Manual Recovery Points for Long-Lived Milestones

Automatic retention purges old snapshots after your configured window. If you want to preserve a specific state indefinitely (before a major release, at the end of a quarter), create a **manual recovery point** and note its ID. Manual recovery points follow the same retention rules; you must re-create or rename them to track them. Consider documenting important manual recovery point IDs in your external runbook.

---
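To make the RPO and retention trade-off from "Balance Retention Against Cost" concrete, the sketch below estimates how many automatic recovery points are held at steady state. It assumes one recovery point per RPO interval and that every point is kept for the full retention window; both are simplifying assumptions rather than documented DRaaS behavior, and `retained_points` is a hypothetical helper, not part of any DRaaS API. It counts points only and does not model their size or billing.

```python
def retained_points(rpo_hours: float, retention_days: float) -> int:
    """Approximate count of automatic recovery points held at steady state.

    Assumes one recovery point per RPO interval and that each point is kept
    for the full retention window (an assumption, not documented behavior).
    """
    if rpo_hours <= 0 or retention_days < 0:
        raise ValueError("rpo_hours must be positive and retention_days non-negative")
    return int(retention_days * 24 // rpo_hours)


# Configurations mentioned in the guide above.
scenarios = {
    "1-hour RPO, 90-day retention": (1, 90),
    "1-hour RPO, 7-day retention": (1, 7),
    "12-hour RPO, 30-day retention": (12, 30),
}

for label, (rpo, days) in scenarios.items():
    print(f"{label}: ~{retained_points(rpo, days)} recovery points")
```

Under these assumptions, the 12-hour RPO with 30-day retention holds fewer points than the 1-hour RPO with 7-day retention while covering a window more than four times longer. Point count is only a rough proxy for cost, since storage is billed per GB and the size of each recovery point depends on how much data changed.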