Plan for resiliency and availability

Planning for resiliency and availability means deciding how robust your system architecture needs to be: how it handles failures, how it behaves when performance degrades, and how quickly it recovers. These decisions directly affect your ability to maintain service during outages, handle increased load, and recover quickly from failures.

Effective resiliency planning helps you minimize downtime, protect against data loss, and maintain user experience even when components fail. This planning process involves understanding your failure domains, setting realistic availability goals, and aligning your technical architecture with your business continuity requirements.

Identify critical applications and infrastructure

Start by identifying all applications and infrastructure components where availability is critical to your business operations. This includes systems that directly impact revenue, customer experience, or regulatory compliance.

Assess each application based on its business impact, user dependencies, and integration requirements. Applications that serve external customers, handle financial transactions, or support core business processes typically require higher availability levels than internal tools or development environments.

Document the dependencies between your applications and infrastructure components to understand how failures in one area could cascade to other systems. This dependency mapping helps you prioritize which systems need the most robust resiliency measures.
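As an illustration of dependency mapping, the following minimal Python sketch records which components each application depends on and then walks the map to list everything that a single failure could reach. The component names and the `DEPENDS_ON` map are hypothetical placeholders, not part of any real system; substitute your own inventory.

```python
from collections import deque

# Hypothetical dependency map: each key depends on the components listed for it.
DEPENDS_ON = {
    "checkout": ["payments", "inventory"],
    "payments": ["database"],
    "inventory": ["database"],
    "reporting": ["database"],
    "wiki": [],
    "database": [],
}


def impacted_by(failed, depends_on):
    """Return every component that directly or transitively depends on `failed`."""
    # Invert the map so we can look up "who depends on X".
    dependents = {name: set() for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(name)

    # Breadth-first walk outward from the failed component.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted


print(sorted(impacted_by("database", DEPENDS_ON)))
# ['checkout', 'inventory', 'payments', 'reporting']
```

In this made-up example, a database failure reaches four other components while a wiki failure reaches none, which suggests the database deserves the more robust resiliency measures.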

Calculate failure domain costs

Calculate the cost of your failure domain strategy to understand the financial impact of different resiliency approaches. This includes both the cost of implementing resiliency measures and the cost of potential failures if those measures are not in place.

Consider the direct costs of implementing redundancy, such as additional infrastructure, monitoring tools, and operational overhead. Also factor in the indirect costs of increased complexity, such as additional testing requirements and more complex deployment processes.

Compare these implementation costs against the potential costs of downtime, including lost revenue, customer churn, regulatory penalties, and damage to your brand reputation. This cost-benefit analysis helps you justify resiliency investments and prioritize which measures provide the best return on investment.
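The comparison can be reduced to simple arithmetic. The sketch below is one possible way to frame it, assuming you can estimate yearly downtime hours with and without a resiliency measure and an average cost per hour of downtime. The function name and all figures are illustrative assumptions, not benchmarks.

```python
def net_annual_benefit(hours_down_without, hours_down_with,
                       downtime_cost_per_hour, annual_measure_cost):
    """Estimate the yearly net benefit of a resiliency measure.

    A positive result means the avoided downtime cost exceeds what the
    measure costs to run; a negative result means it does not.
    """
    avoided_cost = (hours_down_without - hours_down_with) * downtime_cost_per_hour
    return avoided_cost - annual_measure_cost


# Illustrative figures only: 10 hours of downtime per year without the
# measure, 2 hours with it, 20,000 lost per hour, 100,000 per year to operate.
print(net_annual_benefit(10, 2, 20_000, 100_000))  # 60000
```

Costs that are hard to quantify, such as brand damage or customer churn, can be folded into the hourly downtime figure as a rough estimate or tracked separately alongside the result.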

Define uptime goals

Decide on your uptime goals based on your business requirements, user expectations, and competitive landscape. These goals should be expressed as availability percentages and recovery time objectives that align with your business continuity plan.

Set different availability targets for different types of applications based on their criticality. Mission-critical applications might require 99.9% availability or higher, while 99% or lower might be acceptable for less critical systems. Account for seasonal variations and peak usage periods when setting these goals.
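To make an availability percentage concrete, it helps to convert it into a downtime budget. The following sketch assumes a 365-day year and treats all downtime as counting against the target; the function name is a hypothetical helper for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Return the downtime allowed in a period for a given availability target."""
    return period_hours * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_hours(target):.2f} hours per year")
# 99.0% -> 87.60 hours per year
# 99.9% -> 8.76 hours per year
# 99.99% -> 0.88 hours per year
```

Each additional "nine" cuts the downtime budget by a factor of ten, which is why higher targets tend to cost disproportionately more to achieve.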

Establish recovery time objectives (RTO) and recovery point objectives (RPO) for each application. RTO defines how quickly you need to restore service after a failure, while RPO defines how much data loss is acceptable, usually expressed as the maximum span of time before the failure for which data can be lost. These objectives should be realistic given your technical capabilities and budget constraints.
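One way to check whether objectives are realistic is to compare them with what your current setup can deliver. The sketch below assumes that worst-case data loss roughly equals the interval between backups and that you have a measured restore time from a recovery test. The class, function, and values are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # maximum acceptable time to restore service
    rpo: timedelta  # maximum acceptable window of lost data


def check_objectives(objectives, backup_interval, measured_restore_time):
    """Return a list of gaps; an empty list means both objectives are met."""
    gaps = []
    # Assumption: data written since the last backup is lost on failure.
    if backup_interval > objectives.rpo:
        gaps.append(f"RPO gap: backups every {backup_interval}, "
                    f"objective is {objectives.rpo}")
    if measured_restore_time > objectives.rto:
        gaps.append(f"RTO gap: restore takes {measured_restore_time}, "
                    f"objective is {objectives.rto}")
    return gaps


objectives = RecoveryObjectives(rto=timedelta(hours=1), rpo=timedelta(minutes=15))
for gap in check_objectives(objectives,
                            backup_interval=timedelta(hours=1),
                            measured_restore_time=timedelta(minutes=45)):
    print(gap)
```

With these example values, the restore time fits within the RTO but hourly backups cannot satisfy a 15-minute RPO, so either the backup frequency or the objective would need to change.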

Align with business continuity requirements

Compare your architecture and failure recovery plans to your business continuity plan (BCP) to ensure technical capabilities support business requirements. This alignment ensures that your resiliency measures actually protect the business functions that matter most.

Review your BCP to understand the maximum acceptable downtime for different business processes and the resources required to maintain critical operations during a disaster. Ensure your technical architecture can support these requirements within the specified timeframes.
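The review can be sketched as a simple cross-check: for each business process in the BCP, compare its maximum acceptable downtime against the RTO of every application that supports it. The process names, application names, and durations below are hypothetical examples, and the structure is only one possible way to organize the data.

```python
from datetime import timedelta

# Hypothetical BCP limits: maximum acceptable downtime per business process.
BCP_MAX_DOWNTIME = {
    "order processing": timedelta(hours=1),
    "payroll": timedelta(hours=24),
}

# Hypothetical mapping of business processes to supporting applications.
SUPPORTING_APPS = {
    "order processing": ["checkout", "payments"],
    "payroll": ["hr-portal"],
}

# Hypothetical RTO per application.
APP_RTO = {
    "checkout": timedelta(minutes=30),
    "payments": timedelta(hours=2),
    "hr-portal": timedelta(hours=8),
}


def find_bcp_gaps(bcp_max_downtime, supporting_apps, app_rto):
    """Return (process, app, rto) for each app that cannot meet its BCP limit.

    An app with no recorded RTO is reported with rto=None, because an
    undefined objective cannot be shown to satisfy the plan.
    """
    gaps = []
    for process, limit in bcp_max_downtime.items():
        for app in supporting_apps.get(process, []):
            rto = app_rto.get(app)
            if rto is None or rto > limit:
                gaps.append((process, app, rto))
    return gaps


for process, app, rto in find_bcp_gaps(BCP_MAX_DOWNTIME, SUPPORTING_APPS, APP_RTO):
    print(f"{app} (RTO {rto}) cannot meet the limit for {process}")
```

Here the payments application, with a two-hour RTO, would be flagged because order processing tolerates only one hour of downtime.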

Validate that your failure recovery procedures are documented, tested, and understood by the teams responsible for executing them. Regular testing of your recovery procedures helps identify gaps and ensures your teams can execute them effectively during actual incidents.

Next steps

In this section of Design resilient systems, you learned how to plan for resiliency and availability by identifying critical applications, calculating failure domain costs, defining uptime goals, and aligning with business continuity requirements. Plan for resiliency and availability is part of the Design resilient systems pillar.

Refer to the following documents to learn more about designing resilient systems:

Distributed systems to understand fundamental resiliency concepts
Plan for failover to configure automatic failover mechanisms

To learn more about resiliency and availability, see the following resources:

Fault tolerance and fault isolation - AWS guidance on fault tolerance
Designing resilient systems - Google Cloud guidance on resilient system design
Getting Started with Reliability on Azure - Azure guidance on application reliability
Thinking like an architect: Understanding failure domains - IBM guidance on failure domain concepts
Uptime versus Availability - Guide to measuring and improving reliability
Business continuity versus disaster recovery - IBM guide to business continuity planning
Business Continuity Plan (BCP) - AWS guidance on business continuity planning