Well-Architected Framework
Design your databases
Database capacity planning prevents service outages and optimizes infrastructure costs. Your databases are one of the most critical components of your infrastructure but can also be the hardest to scale. Unlike stateless infrastructure like your compute instances, performance issues in a database can be difficult to diagnose and fix. Proactive capacity planning helps you identify how your services use your databases, how that changes as the load increases, and can help you develop a plan on how to scale your database infrastructure before it causes issues for your applications.
Why design your databases
Proper database design and capacity planning addresses the following strategic challenges:
- Prevent data loss and service outages: Inadequate database capacity causes queries to fail, timeouts to occur, and can lead to data corruption or loss during high-traffic periods.
- Reduce database infrastructure costs: Over-provisioning database resources to avoid performance issues leads to paying for expensive compute and storage capacity that remains unused.
- Enable application performance: Poorly designed database schemas and missing optimization strategies create bottlenecks that slow down every application feature that depends on data access.
- Minimize recovery complexity: Database failures without proper backup and replication strategies require complex manual recovery procedures that extend downtime and risk data loss.
The database design workflow follows these steps:
- Plan for known capacity: Run load and stress tests to understand storage requirements, query patterns, and performance bottlenecks.
- Design to meet demand: Configure indexing, read replicas, caching, and partitioning strategies for your workload profile.
- Respond to scale issues: Monitor database metrics and optimize queries to address performance issues before they affect applications.
Select your database type
Choose a database type based on your data structure and access patterns:
Use relational databases (PostgreSQL, MySQL) when you need structured data with defined relationships, complex queries with JOINs, and strong consistency guarantees. Use document databases like MongoDB and DynamoDB when you have flexible schemas, hierarchical data structures, or need horizontal scaling for high-throughput workloads.
Use managed database services like AWS RDS, Azure Database, and Google Cloud SQL when you want automated backups, patching, and scaling with reduced operational overhead. Use self-managed databases when you need full control over configuration, require specific extensions, or have strict compliance requirements.
Plan for your known capacity
Planning your database's capacity needs is a complex question of both storage and performance. Understanding how your applications and services use your databases is critical to properly build and scale your databases.
As you design the infrastructure for your database, you should ask yourself questions like the following:
- How much data do you expect the database to hold?
- How do your storage needs grow over time?
- How do your services interact with the database? Do they mostly send reads, writes, or a mix of both?
- What do queries to your database look like? Are they complex and require lots of resources to process, or are they simple and quick to process?
- How quickly must the database respond to queries?
- How strict are your consistency requirements? Does your data need strong consistency or can it be eventually consistent?
Use load and stress testing to answer these questions and inform your database design decisions.
Before users start sending traffic to your services, test your database while monitoring its response to identify bottlenecks and plan scaling strategies. Use load testing tools to simulate realistic workloads:
- Load tests: Test your database under normal workloads using tools like pgbench for PostgreSQL, mysqlslap for MySQL, or Apache JMeter for general database testing. Monitor metrics like response time, transactions per second, and cache hit ratios to identify slow queries, indexing issues, and connection pool limitations.
- Stress tests: Test your database under extremely high, unnatural workloads to see how it handles sudden traffic spikes, large operations, or complex queries. These tests help identify how your infrastructure may break under extreme conditions.
You can use Terraform to create temporary database test environments that mirror your production setup, run your tests, then destroy the environment to save costs. When you automate Terraform into your CI/CD pipeline, you can automatically create test environments, run database load tests, and destroy the infrastructure before deploying to production.
After identifying your database capacity limits and workload patterns through testing, design your database architecture to handle your specific access patterns efficiently.
Design to meet demand
Your early design decisions to architect your databases have a big effect on how you manage, run, and scale them as the load on them increases. By understanding what the workload on your databases looks like, you can plan how to best optimize your infrastructure and database configuration. Some examples of these design decisions include the following:
- Indexing: Database indexing improves query performance by trading disk storage and write speeds for faster read speeds. Create indexes for frequently read data, but make sure to test how this affects your write speed.
- Read replicas: Read replicas are read-only copies of your database, and are useful for reducing the load on your database if your services make many read requests.
- Caching: Caching reduces database load by storing frequently accessed data in memory. Caching is particularly effective for data that changes infrequently.
- Connection pooling: Connection pools maintain a set of reusable connections, reducing overhead and improving performance.
- Partitioning and sharding: Partitioning and sharding divide large datasets into smaller, easier to manage datasets. For very large datasets, partitioning divides data within a single database, while sharding distributes data across multiple databases.
Choose optimization strategies based on your workload profile:
Use indexing when you have specific columns frequently used in WHERE clauses or JOIN operations and can tolerate slower write speeds. Use read replicas when your application has a high read-to-write ratio (80% reads or more) and you need to distribute read traffic. Use caching when you have frequently accessed data that changes infrequently, such as user profiles or configuration data.
Use partitioning when you have a single large table that you can divide by time ranges or categories within one database. Use sharding when your dataset exceeds the capacity of a single database instance and you need to distribute data across multiple databases.
Some cloud providers let you create database read replicas using the same resource that you used to create the original database.
Before configuring read replicas, you need existing network infrastructure including a VPC and subnets where the database will run. The following example assumes these resources exist and references a password resource for secure credential management. This Terraform configuration creates a read-write database named rw-database, then creates a separate read replica named ro-database:
resource "aws_db_instance" "primary" {
identifier = "rw-database"
engine = "postgres"
engine_version = "18.1"
instance_class = "db.t3.micro"
allocated_storage = 20
storage_type = "gp3"
username = "postgres"
password = random_password.db_password.result
backup_retention_period = 7
skip_final_snapshot = true
}
resource "aws_db_instance" "replica" {
identifier = "ro-database"
replicate_source_db = aws_db_instance.primary.identifier
instance_class = "db.t3.micro"
skip_final_snapshot = true
tags = {
Name = "Read Replica"
}
}
This configuration creates a primary PostgreSQL database with automatic backups and a read replica that stays synchronized with the primary database. The read replica can handle read traffic to reduce load on the primary database, improving overall application performance.
After configuring your database architecture with optimization strategies, implement monitoring to track performance metrics and identify when to scale.
Respond to scale issues
Monitor your database metrics to understand performance trends and capacity limits. With proper monitoring, alerting, and key performance metrics, you can address approaching performance issues before they have a negative impact on your applications and services.
- Monitoring and alerting: Monitoring key metrics of your database, such as storage space, query latency, and connection count, lets you understand when you are approaching the bottlenecks you identified during load and stress testing. With early alerts, you can respond to potential issues before they affect your applications and services.
- Query optimization: While adding more resources to your database, such as disk or additional nodes, can help address performance issues, they can also sometimes hide more fundamental issues. Investigate slow queries to identify potential improvements you can make so that your database can make better use of your existing infrastructure instead of increasing cost.
- Capacity scaling: As the number of services, users, and queries to your database grows, your database's capacity will eventually need to grow with it. Use your monitoring and alerts to know when you are approaching different limits in your database before you exceed capacity. Vertical scaling increases the instance size (more CPU, memory, or storage). For vertical scaling, increase instance size when CPU or memory usage consistently exceeds 70%. Horizontal scaling adds more database instances to distribute workload. For horizontal scaling, add read replicas when read load exceeds 60% of capacity, or implement sharding when a single database instance cannot handle your data volume.
HashiCorp resources
- Read the Define your database infrastructure Well-Architected Framework documentation.
- Learn how to Manage AWS RDS instances and how to Upgrade AWS RDS major versions with Terraform.
Next steps
In this section of Select and design infrastructure, you learned how to profile the workload on your database with load and stress tests, react to growing database needs, and design to maximize your database performance as your needs change. Select and design infrastructure is part of the Optimize systems pillar.
To learn more about how to design and scale your infrastructure, refer to the following resources: