System Design Fundamentals: Requirements -> Capacity Estimation

This document outlines the process of moving from gathering system requirements to estimating the capacity needed to support those requirements. This is a crucial step in system design, as it informs infrastructure choices, scalability planning, and cost projections.


1. Understanding Requirements

Before diving into capacity estimation, we need a clear understanding of the system's requirements. These typically fall into two categories:

  • Functional Requirements: What the system does. (e.g., Users can upload images, search for products, make purchases)
  • Non-Functional Requirements (NFRs): How well the system performs. These are critical for capacity planning. Key NFRs include:
    • Performance: Latency, throughput, response times. (e.g., 99th percentile response time for search < 200ms)
    • Scalability: Ability to handle increasing load. (e.g., System should handle 10x peak load within 6 months)
    • Availability: Uptime percentage. (e.g., 99.99% availability)
    • Reliability: Data consistency and fault tolerance.
    • Security: Protection of data and system integrity.
    • Maintainability: Ease of updates and bug fixes.
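
An availability target is easier to reason about once it is converted into a downtime budget. The following is a minimal sketch of that conversion; the function name is illustrative, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(availability_pct: float) -> float:
    """Maximum downtime per year, in minutes, allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for target in (99.9, 99.99):
    print(f"{target}% availability -> {downtime_budget_minutes(target):.1f} min/year")
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which usually means more redundancy and therefore more capacity.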

Example:

Let's say we're designing a photo-sharing application. Some key requirements might be:

  • Functional: Users can upload photos, view photos, like photos, comment on photos.
  • NFRs and load assumptions:
    • Peak Users: 1 million concurrent users during peak hours.
    • Average Uploads/User/Day: 5 photos
    • Average Views/User/Day: 50 photos
    • Storage per Photo: 2 MB
    • 99th Percentile Upload Latency: < 5 seconds
    • 99th Percentile View Latency: < 1 second
    • Availability: 99.9%

2. Capacity Estimation: The Process

Capacity estimation is an iterative process. We start with rough estimates and refine them as we learn more. Here's a breakdown:

A. Define Key Metrics:

Identify the metrics that will drive your capacity needs. These are often tied to your NFRs. Examples:

  • Requests per Second (RPS): The number of requests the system needs to handle per second.
  • Data Storage (GB/TB/PB): The amount of data the system needs to store.
  • Network Bandwidth (Gbps): The amount of data that needs to be transferred over the network.
  • CPU Utilization: The percentage of CPU resources being used.
  • Memory Utilization: The percentage of memory resources being used.
  • Disk I/O: The rate of read/write operations to disk.
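
Several of these metrics can be derived from one another. As a rough sketch, network bandwidth follows from request rate and payload size; the inputs below are hypothetical, and decimal units are assumed with protocol overhead ignored.

```python
def bandwidth_gbps(requests_per_second: float, payload_bytes: int) -> float:
    """Payload bandwidth in gigabits per second (decimal units, no protocol overhead)."""
    bits_per_second = requests_per_second * payload_bytes * 8
    return bits_per_second / 1e9


# Hypothetical inputs: 1,000 photo views per second at 2 MB per photo.
print(bandwidth_gbps(1_000, 2_000_000))  # 16.0
```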

B. Estimate Peak Load:

This is the most challenging part. Consider:

  • Daily Active Users (DAU): The number of unique users who interact with the system each day.
  • Monthly Active Users (MAU): The number of unique users who interact with the system each month.
  • Concurrent Users: The number of users actively using the system at the same time. This is often a fraction of DAU. (e.g., 10% of DAU are concurrent)
  • Peak vs. Average Load: Peak load can be several times the average load, so size the system for the peak rather than the mean. Consider daily, weekly, and monthly patterns.
  • Growth: Factor in expected growth in users and usage.
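
The factors above can be combined into a first-pass peak estimate. This is a sketch under stated assumptions: the function names, the 10 million DAU figure, the 3x peak-to-average ratio, and the 10% monthly growth rate are all hypothetical values chosen for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def concurrent_users(dau: int, concurrent_fraction: float) -> int:
    """Users online at the same time, as a fraction of daily active users."""
    return round(dau * concurrent_fraction)


def average_rps(dau: int, requests_per_user_per_day: float) -> float:
    """Average request rate if traffic were spread evenly over the day."""
    return dau * requests_per_user_per_day / SECONDS_PER_DAY


def peak_rps(avg_rps: float, peak_to_average: float) -> float:
    """Scale the average rate by an assumed peak-to-average ratio."""
    return avg_rps * peak_to_average


def projected_users(current: int, monthly_growth_rate: float, months: int) -> float:
    """Compound growth projection."""
    return current * (1 + monthly_growth_rate) ** months


dau = 10_000_000                        # hypothetical
avg = average_rps(dau, 55)              # 55 requests per user per day, hypothetical
print(concurrent_users(dau, 0.10))      # 10% of DAU concurrent
print(round(avg), round(peak_rps(avg, 3.0)))
print(round(projected_users(dau, 0.10, 6)))
```

The peak-to-average ratio is the least certain input; it is best replaced with a measured value once production traffic data exists.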

C. Calculate Resource Requirements:

Based on the key metrics and peak load, calculate the resources needed for each component of the system. This often involves breaking down the system into its constituent parts.

Example (Continuing Photo-Sharing App):

  • Storage:

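    • Total Photos/Day: 1 million users * 5 photos/user = 5 million photos (treating the 1 million peak users as the daily uploading population)
    • Storage/Day: 5 million photos * 2 MB/photo = 10 million MB = 10 TB
    • Storage/Year: 10 TB/day * 365 days = 3,650 TB ≈ 3.65 PB
    • Add buffer for growth and redundancy: Plan for roughly 5 PB - 10 PB for the first year, depending on the replication factor.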
  • Image Uploads (RPS):

    • Assuming, conservatively, that the whole day's uploads arrive within a 2-hour peak window: 1 million concurrent users * (5 photos/user/day) / (2 hours * 3600 seconds/hour) = 5,000,000 / 7,200 ≈ 694 RPS
    • Add buffer for spikes: Plan for 1,000-2,000 RPS.
  • Image Views (RPS):

    • 1 million concurrent users * (50 photos/user/day) / (2 hours * 3600 seconds/hour) = 50,000,000 / 7,200 ≈ 6,944 RPS
    • Add buffer for spikes: Plan for 10,000-20,000 RPS.
  • Database: The database needs to handle both upload and view requests, plus other operations (likes, comments, user data). Estimate the database read/write load based on these operations.

  • Caching: Caching frequently accessed data (e.g., popular photos) can significantly reduce the load on the database. Estimate cache hit ratio and cache size.
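
The arithmetic for this example can be reproduced with a short script, which makes it easy to rerun when an assumption changes. The inputs come from the photo-sharing requirements; the 80% cache hit ratio is a hypothetical value added only to illustrate the caching step, and decimal units (1 TB = 1,000,000 MB) are assumed.

```python
USERS = 1_000_000
UPLOADS_PER_USER_PER_DAY = 5
VIEWS_PER_USER_PER_DAY = 50
PHOTO_SIZE_MB = 2
PEAK_WINDOW_SECONDS = 2 * 3600  # all daily activity compressed into 2 hours
CACHE_HIT_RATIO = 0.8           # assumption, not part of the requirements


def estimate() -> dict:
    """Back-of-the-envelope storage and request-rate figures."""
    photos_per_day = USERS * UPLOADS_PER_USER_PER_DAY
    storage_per_day_tb = photos_per_day * PHOTO_SIZE_MB / 1_000_000  # MB -> TB
    storage_per_year_pb = storage_per_day_tb * 365 / 1_000           # TB -> PB
    upload_rps = photos_per_day / PEAK_WINDOW_SECONDS
    view_rps = USERS * VIEWS_PER_USER_PER_DAY / PEAK_WINDOW_SECONDS
    # Only cache misses reach the backend.
    backend_view_rps = view_rps * (1 - CACHE_HIT_RATIO)
    return {
        "photos_per_day": photos_per_day,
        "storage_per_day_tb": storage_per_day_tb,
        "storage_per_year_pb": storage_per_year_pb,
        "upload_rps": upload_rps,
        "view_rps": view_rps,
        "backend_view_rps": backend_view_rps,
    }


for name, value in estimate().items():
    print(f"{name}: {value:,.2f}")
```

With these inputs the script reports about 10 TB of new photos per day, 3.65 PB per year, roughly 694 upload RPS, and roughly 6,944 view RPS before buffers are added.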

D. Consider System Architecture:

The architecture of your system will heavily influence capacity planning.

  • Microservices: Each microservice can be scaled independently.
  • Load Balancing: Distributes traffic across multiple servers.
  • Caching: Reduces load on backend systems.
  • Database Sharding: Distributes data across multiple database servers.
  • Content Delivery Network (CDN): Caches static content closer to users.
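
Once traffic is spread across servers by a load balancer, a peak request rate can be turned into a server count. The sketch below assumes a measured per-server throughput, a target utilization that leaves headroom, and a fixed number of spare servers; all names and numbers are hypothetical.

```python
import math


def servers_needed(
    peak_rps: float,
    rps_per_server: float,
    target_utilization: float = 0.7,
    spares: int = 1,
) -> int:
    """Servers required to serve peak_rps without exceeding target utilization."""
    usable_rps_per_server = rps_per_server * target_utilization
    return math.ceil(peak_rps / usable_rps_per_server) + spares


# Hypothetical: 7,000 RPS at peak, 500 RPS per server from benchmarking.
print(servers_needed(7_000, 500))
```

The per-server figure should come from benchmarking rather than guesswork, since it dominates the result.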

3. Tools and Techniques

  • Queuing Theory: Helps model and analyze waiting times and resource utilization.
  • Little's Law: Relates average number of items in a system (L), arrival rate (λ), and average time an item spends in the system (W): L = λW
  • Benchmarking: Testing the performance of your system under realistic load.
  • Monitoring: Tracking key metrics in production to identify bottlenecks and optimize performance.
  • Cloud Provider Calculators: AWS, Azure, and GCP provide calculators to estimate costs based on resource requirements.
  • Spreadsheets/Modeling Tools: For performing calculations and creating capacity plans.
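
As a small illustration of Little's Law, the number of requests in flight can be estimated from the arrival rate and the average time each request spends in the system. The function name and the input values below are hypothetical.

```python
def items_in_system(arrival_rate_per_s: float, avg_time_in_system_s: float) -> float:
    """Little's Law: L = lambda * W, valid for a system in steady state."""
    return arrival_rate_per_s * avg_time_in_system_s


# Hypothetical: 400 requests/second, each taking 0.25 s on average.
print(items_in_system(400, 0.25))  # 100.0
```

Read this way, L is a lower bound on the concurrency (threads, connections, or workers) the system must sustain at that arrival rate and latency.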

4. Iteration and Refinement

Capacity estimation is not a one-time activity. It's an iterative process:

  1. Initial Estimate: Start with rough estimates based on available data.
  2. Prototype/Proof of Concept: Build a small-scale prototype to validate your assumptions.
  3. Benchmarking: Test the prototype under load to identify bottlenecks.
  4. Refine Estimates: Adjust your estimates based on the results of benchmarking.
  5. Monitor Production: Continuously monitor your system in production to identify areas for improvement.
  6. Re-evaluate: Regularly re-evaluate your capacity plan as your system grows and evolves.

Key Takeaways:

  • Start with Requirements: Clear requirements are the foundation of accurate capacity estimation.
  • Plan for Peak Load: Don't underestimate the importance of handling peak traffic.
  • Iterate and Refine: Capacity estimation is an ongoing process.
  • Consider Architecture: Your system architecture will significantly impact capacity needs.
  • Use Tools and Techniques: Leverage available tools and techniques to improve accuracy.

This document provides a foundational overview of requirements to capacity estimation. The specific details will vary depending on the complexity of the system and the available data. Remember to document your assumptions and calculations clearly.