Design fault-tolerant systems
A fault-tolerant system prevents disruptions from a single point of failure, and ensures high availability and business continuity for your mission-critical applications or infrastructure. When you design and implement fault-tolerant systems, you reduce costs, intelligently scale, and minimize downtime.
Fault-tolerant systems also enhance your business outcomes and increase customer satisfaction in the following ways:
- Ensure critical systems remain available even during failures to support business continuity and minimize the effects of disruptions.
- Reduce the risk of data loss, corruption, and damage from security breaches through failover and redundancy.
- Improve the quality of service and deliver a better user experience through consistent performance and responsiveness.
HashiCorp tools include fault-tolerance features that enhance the overall resiliency of the applications they integrate with. Use the guidance and resources here to help design your fault-tolerant systems with HashiCorp tools.
Plan for resiliency and availability
As you plan for resiliency and availability, you must decide how robust your system architecture needs to be in terms of failure, degradation, and performance. The following are key considerations that you should include when planning for availability and resilience in your deployment:
- Identify all applications and infrastructure where availability is critical.
- Calculate the cost of your failure domain strategy.
- Decide your uptime goals.
- Compare your architecture and failure recovery plans against the requirements of your business continuity plan (BCP).
HashiCorp resources:
External resources:
- Fault tolerance and fault isolation
- Designing resilient systems
- Getting Started with Reliability on Azure: Ensuring Cloud Applications Stay Up and Running
- Thinking like an architect: Understanding failure domains
- What Could Possibly Go Wrong?
- Uptime versus Availability: How to measure and improve reliability
- Business continuity versus disaster recovery: Which plan is right for you?
- Business Continuity Plan (BCP)
Consul
Consul offers a range of features, operating both locally and remotely, that can help you deliver resilient services across datacenters. Each Consul datacenter depends on a set of Consul voting server agents. The voting servers ensure Consul has a consistent, fault-tolerant state by requiring a majority of voting servers, known as a quorum, to agree upon state changes.
Consider the following factors when you use Consul to design your resilient architectures (a configuration sketch follows the list):
- Cluster quorum: Consul uses the Raft protocol to achieve consensus with a quorum (or majority) of operational servers, and can tolerate failure in one or more servers depending on quorum size. A Consul cluster will enter a read-only state to prevent data inconsistency if it loses quorum. You can also use Redundancy Zones in your Consul deployments to run one voter and any number of non-voters in each defined zone.
- Data distribution and replication: Consul replicates all data across cluster servers. Any server in the cluster can respond to read requests, but writes require consensus from a quorum of servers. Consul also automatically synchronizes data across the cluster when a failed server recovers.
- Cluster leader election: When the cluster leader fails, Consul automatically conducts an election to elect a new leader. The leader election requires a quorum of voter participation by cluster servers, and quorum constraints prevent split-brain scenarios.
- Service discovery and resilience: Consul clients keep local caches of service information, and can continue basic service discovery even when temporarily disconnected. Health checks also continue to function during partial cluster outages, and the anti-entropy system automatically repairs inconsistencies between the client's local state and cluster state.
- Network partition handling: A Consul cluster uses a gossip protocol to detect network partitions, and uses quorum requirements to enforce data consistency during network partitions. Servers in the minority partition enter read-only mode to prevent split-brain scenarios, and when the partition heals, the cluster automatically resynchronizes data.
- Multi-datacenter support: Each Consul datacenter independently operates with its own consensus group, and cross-datacenter replication continues even if some datacenters are unreachable. Consul also uses a WAN gossip pool to keep datacenter connectivity details.
- Backup and recovery: You can recover from catastrophic failures with data snapshots.
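The following is a minimal sketch of a Consul server agent configuration that relates to the cluster quorum and redundancy zone considerations above. The addresses, paths, and zone values are illustrative assumptions, and redundancy zones require Consul Enterprise.

```hcl
# Example Consul server agent configuration (illustrative values).
server           = true
bootstrap_expect = 3              # wait for three servers before bootstrapping Raft quorum
datacenter       = "dc1"
data_dir         = "/opt/consul/data"
retry_join       = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]  # hypothetical server addresses

# Tag this server's availability zone for Consul Enterprise redundancy zones.
node_meta {
  zone = "zone-a"
}

autopilot {
  cleanup_dead_servers = true
  redundancy_zone_tag  = "zone"   # promote a non-voter from the same zone when a voter fails
}
```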
HashiCorp resources:
Nomad
Nomad is a simple and flexible scheduler and orchestrator that deploys and manages containers and non-containerized applications across on-premises and cloud environments at scale. Your Nomad deployments can achieve fault tolerance, and offer resilience and availability to your use cases, through the following key properties:
- Cluster quorum: Nomad uses the Raft protocol to achieve consensus with a quorum (or majority) of operational servers, and can tolerate failure in one or more servers depending on quorum size. If Nomad loses quorum, it enters a read-only state to prevent data inconsistency.
- Node failure handling: Nomad relies on a heartbeat mechanism to automatically detect node failures. Nomad marks nodes that stop heartbeating as down, and maintains the desired state by automatically rescheduling allocations from failed nodes onto operational nodes.
- Job scheduling resilience: Job scheduling supports automatic restarts for failed tasks with configurable restart attempts. Jobs can specify affinities and constraints to spread effectively across failure domains. Nomad's system jobs ensure instances run on all eligible nodes to support coverage, and rolling updates enable zero-downtime deployments with automatic rollbacks, as shown in the sketch after this list.
- Data persistence: Nomad's implementation of the Raft protocol ensures consistent state across cluster nodes by replicating server state across the cluster. Snapshot functionality enables backup and restoration of cluster state, while client nodes keep local state for running allocations.
- Automatic scaling: The Nomad Autoscaler is a separate tool that enables horizontal application and cluster scaling for Nomad clusters. Horizontal application autoscaling automatically controls the number of instances of an application to provide enough work throughput to meet service-level agreements (SLAs). Horizontal cluster autoscaling adds or removes Nomad clients from a cluster to ensure there is an appropriate amount of cluster resource for the scheduled applications.
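The following is a minimal sketch of a Nomad job that uses spread, restart, and update blocks to improve scheduling resilience. The image, counts, and thresholds are illustrative assumptions.

```hcl
job "web" {
  datacenters = ["dc1", "dc2"]

  group "web" {
    count = 3

    # Spread allocations across datacenters to limit the blast radius of a failure.
    spread {
      attribute = "${node.datacenter}"
    }

    # Restart failed tasks a few times before marking the allocation as failed.
    restart {
      attempts = 3
      interval = "10m"
      delay    = "15s"
      mode     = "fail"
    }

    # Roll out one allocation at a time and revert automatically on failure.
    update {
      max_parallel = 1
      auto_revert  = true
    }

    task "app" {
      driver = "docker"

      config {
        image = "example/web:1.0"   # hypothetical image
      }
    }
  }
}
```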
HashiCorp resources:
- Nomad reference architecture
- Consensus protocol
- Client Heartbeats
- Scheduling
- Affinities and constraints
- Nomad Autoscaler overview
- Increase failure tolerance with spread
Vault
Vault is an identity-based secret and encryption management system that secures, stores, and controls access to tokens, passwords, certificates, and encryption keys for protecting secrets and other sensitive data.
Vault clusters have some important resiliency and availability properties you should consider when you design fault-tolerant systems (a configuration sketch follows the list):
- High availability architecture: A cluster of Vault servers can run in high availability (HA) mode with an active server and standby servers. When operating in HA mode, Vault can automatically fail over when the active server becomes non-operational and use leader election to choose a new active server, as shown in the sketch after this list.
- Automated Integrated Storage management: Vault can automate the cluster management for Integrated Storage with the Autopilot feature to check server health, stabilize quorum, and periodically clean up failed servers.
- Performance standbys: By default, Vault standby servers forward all requests they receive to the active server. Vault Enterprise offers additional functionality that allows standby servers to service read requests that do not mutate storage. These performance standby servers distribute load, enabling your cluster to scale horizontally and reduce latency for read-heavy use cases.
- Replication: Scaling your Vault cluster to meet performance demands is critical to ensure workloads operate efficiently. Vault Enterprise and HCP Vault Dedicated support multi-region deployments so you can replicate data to regional Vault clusters to support local workloads.
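The following is a minimal sketch of a Vault server configuration for one node of an HA cluster that uses Integrated Storage (Raft), with retry_join so recovering nodes can rejoin their peers. The hostnames, paths, and certificate locations are illustrative assumptions.

```hcl
# Example Vault server configuration for one node of a three-node HA cluster.
storage "raft" {
  path    = "/opt/vault/data"
  node_id = "vault-1"

  # Attempt to rejoin the other cluster members automatically.
  retry_join {
    leader_api_addr = "https://vault-2.example.internal:8200"
  }
  retry_join {
    leader_api_addr = "https://vault-3.example.internal:8200"
  }
}

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/etc/vault.d/tls/vault.crt"
  tls_key_file  = "/etc/vault.d/tls/vault.key"
}

api_addr     = "https://vault-1.example.internal:8200"
cluster_addr = "https://vault-1.example.internal:8201"
```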
HashiCorp resources:
- Vault with integrated storage reference architecture
- Configure Vault cluster with Integrated Storage
- Automate Integrated Storage management
- Autopilot
- Automate upgrades with Vault Enterprise
- Performance standby nodes
- Scale horizontally with performance standby nodes
Consider distributed systems principles
There is significant overlap between the design of distributed systems and fault-tolerant systems, because many fault-tolerance mechanisms are inherently distributed. Fault-tolerant designs use a range of concepts and techniques associated with distributed systems, such as consensus algorithms, redundancy, and state replication.
You should consider the following key principles of distributed systems when you are designing fault-tolerant applications and infrastructure.
Tip: These distributed systems concepts and considerations apply to Consul, Nomad, and Vault.
Identify potential faults
As you design a fault-tolerant system, your starting point should be to identify the types of failures which are possible, their effects, and mitigation strategies. Consider the following common faults in your designs:
- Hardware failures in compute or storage systems
- Software bugs that cause outages or otherwise block production
- Network partitions between datacenters or individual cluster nodes
- Upgrade issues that introduce regressions or unexpected configuration issues
HashiCorp Consul resources:
HashiCorp Nomad resources:
HashiCorp Vault resources:
External resources:
Implement redundancy and replication
Your fault-tolerant system design can benefit from redundancy in key hardware and software components to ensure maximum availability and ideal user experience. State replication enables availability and performance by ensuring that a system can continue to operate when components fail.
Some ways you can build redundancy into your designs include the following (an example follows the list):
- Duplicate critical service instances
- Redundant data storage solutions from different vendors
- Several network paths for communications
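For example, registering health-checked duplicates of the same service on several Consul nodes lets Consul route requests only to healthy instances. The following is a sketch of a hypothetical service registration; the service name, port, and health endpoint are assumptions.

```hcl
# Register one of several redundant instances of the same service.
service {
  name = "payments-api"          # hypothetical service name
  port = 8080

  check {
    name     = "payments-api HTTP health check"
    http     = "http://localhost:8080/health"
    interval = "10s"
    timeout  = "2s"
  }
}
```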
HashiCorp Consul resources:
HashiCorp Nomad resources:
HashiCorp Vault resources:
- Vault multi-cluster architecture guide
- Performance replication
- Enable Performance Replication
- Disaster recovery (DR) replication
- Recover from catastrophic failure with disaster recovery replication
Use robust networking and communication protocols
Your fault-tolerant applications and infrastructure benefit from robust, best-in-class networking and communication protocols. Choose solutions that help your networks and protocols survive failure, such as the following (an example follows the list):
- Load balancing and distribution solutions
- Network isolation and segmentation techniques
- Caching and content delivery solutions
- Reliable multicast networking
- Connection-oriented networking and service mesh solutions
- Message queue and buffering solutions
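For example, Consul service mesh lets you define failover behavior so traffic shifts to healthy instances in another datacenter when local instances fail. The following is a sketch of a hypothetical service-resolver configuration entry; the service name and datacenters are assumptions.

```hcl
# Route traffic for "api" to dc2 when no healthy instances remain in the local datacenter.
Kind = "service-resolver"
Name = "api"

Failover = {
  "*" = {
    Datacenters = ["dc2"]
  }
}
```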
HashiCorp resources:
Scale and tune performance
Scale and performance are critical to fault-tolerant systems, and the interplay between them involves complex, complementary, and conflicting relationships. For example, the same redundant components that enable fault tolerance can also support load balancing to improve performance. When you geographically distribute replicas for disaster recovery, they can also support latency-reduction goals. You can also benefit from the excess production capacity needed for fault tolerance, which offers extra performance headroom.
There are also conflicting relationships and trade-offs between fault tolerance and performance. Stronger consistency guarantees for fault tolerance often require extra network round trips, synchronous replication for durability can affect latency, and complex recovery mechanisms can slow down regular operations. Resource overhead for redundancy adds costs that could otherwise go toward improving performance, health checking and monitoring consume processing and bandwidth, and state synchronization adds network overhead.
Examples of design strategies you can use for scale and performance include the following (a scaling policy sketch follows the list):
- Bulkheads and circuit breakers: Isolate components so that failures do not cascade, implement back-pressure mechanisms to prevent overload, and design for graceful degradation.
- Caching and state management: Improve resilience and performance with local state that reduces network dependency and cache hierarchies that offer fallback options.
- Monitoring and adaptation: Catch potential failures early, and inform automatic scaling and recovery systems that respond to overload and failure. Your capacity planning should account for both regular operation and operation in different failure modes.
- Testing and validation: Load test your systems and include failure scenarios in your tests. Use chaos engineering principles to help understand behavior at scale. Include recovery scenarios in your performance benchmarks.
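The following is a sketch of horizontal application autoscaling with the Nomad Autoscaler: a scaling policy on a task group that targets average CPU utilization. The bounds, query, and target are illustrative assumptions, and the policy assumes the Nomad Autoscaler is running with the nomad-apm plugin enabled.

```hcl
job "web" {
  datacenters = ["dc1"]

  group "web" {
    count = 3

    # Scale this group between 2 and 10 allocations based on average CPU usage.
    scaling {
      enabled = true
      min     = 2
      max     = 10

      policy {
        cooldown            = "2m"
        evaluation_interval = "30s"

        check "avg_cpu" {
          source = "nomad-apm"
          query  = "avg_cpu"

          strategy "target-value" {
            target = 70
          }
        }
      }
    }

    task "app" {
      driver = "docker"

      config {
        image = "example/web:1.0"   # hypothetical image
      }
    }
  }
}
```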
HashiCorp Consul resources:
- Operating Consul at Scale
- Enhanced Read Scalability with Read Replicas
- Scale Consul DNS
- Monitor Consul server health and performance with metrics and logs
HashiCorp Nomad resources:
- Autopilot
- Horizontal cluster autoscaling
- On-demand batch job cluster autoscaling
- Scale a service
- Monitoring Nomad
- Nomad Autoscaler Telemetry
HashiCorp Vault resources:
HashiCorp resources:
External resources:
Secure distributed systems
Distributed systems involve several components communicating across networks, which presents an expanded attack surface with many potential points of vulnerability. A breach in one component can cascade through the system and potentially compromise data integrity, confidentiality, and availability. You must implement a robust design that includes security measures to prevent data breaches and service disruptions without impairing performance and reliability.
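For example, enabling mutual TLS on Consul agents helps ensure that only authenticated components can communicate. The following is a sketch of the agent TLS settings in Consul 1.12 and later; the certificate paths are illustrative assumptions.

```hcl
# Require TLS for agent communication and verify both sides of every connection.
tls {
  defaults {
    ca_file         = "/etc/consul.d/tls/consul-agent-ca.pem"
    cert_file       = "/etc/consul.d/tls/server.pem"
    key_file        = "/etc/consul.d/tls/server-key.pem"
    verify_incoming = true
    verify_outgoing = true
  }

  internal_rpc {
    verify_server_hostname = true   # prevent a compromised client from impersonating a server
  }
}

encrypt = "..."   # gossip encryption key generated with `consul keygen` (elided)
```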
HashiCorp resources:
- Prevent lateral movement
- Best practices to protect sensitive data
- What is zero trust security and zero trust networking?
- Securely identify digital entities with X.509 certificates
- Secure HashiCorp tools with TLS
Understand quorum
When you deploy a distributed system like Consul, Nomad, or Vault, one of your primary design considerations should be cluster quorum. Quorum refers to the minimum number of nodes in a distributed system required to reach consensus regarding shared state or configuration. Without quorum, the system becomes unreliable and can experience leader election issues, data inconsistencies, or complete loss of consensus.
In a Consul or Vault cluster using high availability integrated storage, quorum is a majority of members from a peer set. For a peer set of size N, quorum requires at least (N/2)+1 members, where the division rounds down. For example, if there are 5 members in the peer set, you need 3 nodes to form a quorum. If a quorum of Consul or Vault nodes is unavailable for any reason, the cluster becomes unavailable and cannot commit new logs.
We recommend using three to five servers for both Vault and Consul deployments, including on cloud platforms like HashiCorp Cloud Platform (HCP). For large deployments that need to scale reads without affecting write latency, we recommend the non-voting server or read replica features available in the Enterprise and HCP editions. We strongly discourage single-server deployments because a failure inevitably results in data loss.
The following are quorum sizes and node failure tolerances for a range of cluster node counts.
Servers | Quorum | Node failure tolerance |
---|---|---|
1 | 1 | 0 |
2 | 2 | 0 |
3 | 2 | 1 |
4 | 3 | 1 |
5 | 3 | 2 |
6 | 4 | 2 |
7 | 4 | 3 |
Understand state management and disaster recovery
Disaster recovery considerations are an important part of your organization's overall business continuity planning. You should consider both Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO).
At a minimum, your recovery plan should include the following (an example backup job follows the list):
- State and change management.
- Immutable infrastructure for quick and replicable deployment.
- Automated backups stored on mounted or external storage, instead of local or ephemeral storage.
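The following is a sketch of one way to automate backups: a periodic Nomad batch job that saves a Consul snapshot to mounted external storage. The schedule, paths, and use of the raw_exec driver are illustrative assumptions; the same pattern applies to Vault Raft snapshots with the vault operator raft snapshot save command.

```hcl
job "consul-snapshot" {
  datacenters = ["dc1"]
  type        = "batch"

  # Run the backup every day at 03:00 and never overlap runs.
  periodic {
    cron             = "0 3 * * *"
    prohibit_overlap = true
  }

  group "backup" {
    task "snapshot" {
      driver = "raw_exec"   # requires the raw_exec driver to be enabled and the consul CLI installed

      config {
        command = "consul"
        # Write the snapshot to externally mounted storage, not local ephemeral disk.
        args = ["snapshot", "save", "/mnt/backups/consul-${NOMAD_ALLOC_ID}.snap"]
      }
    }
  }
}
```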
HashiCorp Consul resources:
- Backup Consul data and state according to the RTO and RPO set forth in your organization's disaster recovery policies.
- Understand disaster recovery for Consul on Kubernetes.
- Learn more about Consul disaster recovery considerations.
- How to perform disaster recovery for Consul clusters, including loss of quorum.
- How to recover from disaster in a federated primary datacenter.
HashiCorp Nomad resources:
HashiCorp Terraform resources:
- Automate backups of your Terraform Enterprise deployment to ensure business continuity.
- Understand the Terraform Enterprise backup - recommended pattern.
HashiCorp Vault resources:
- Learn how to enable automated backups of cluster data in Vault Enterprise.
- Protect Vault cluster from data loss with backups
- Vault Enterprise supports multi-datacenter deployments where clusters replicate data across datacenters for disaster recovery. Learn how to enable disaster recovery replication, and recover from catastrophic failure with disaster recovery replication.
- Recover from lost quorum in Vault clusters.
External resources: