LVL 04 — SENIOR-IN-TRAINING | SESSION 100 | DAY 100

SYSTEM DESIGN

🎫 PIXELCRAFT-086
📋 Architecture | 🔴 Expert | Priority: 🟡 Medium

The CTO asks: "PixelCraft has 100 users. What if it had 1 million? What breaks? What do we change?" Design the scaled architecture. Whiteboard the system that handles 1M users.
CONCEPTS.UNLOCKED
↔️
Horizontal vs Vertical Scaling
Add more machines vs. bigger machines. Vertical: bigger CPU, more RAM (hits a hard ceiling). Horizontal: more servers behind a load balancer (scales far beyond any single machine). Design for horizontal scaling from the start.
⚖️
Load Balancers
Distribute requests across servers. User hits the load balancer → routed to the least-busy server. If one server dies, traffic routes to the others. Run the balancer itself redundantly so there's no single point of failure.
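The routing idea can be sketched in a few lines — a toy round-robin balancer that skips unhealthy servers. Real deployments use Nginx, HAProxy, or a cloud LB; the host names here are illustrative:

```javascript
// Toy round-robin load balancer: rotate requests across healthy servers.
function makeBalancer(servers) {
  let i = 0;
  return {
    pick() {
      const healthy = servers.filter((s) => s.healthy);
      if (healthy.length === 0) throw new Error('no healthy servers');
      return healthy[i++ % healthy.length].host; // rotate across healthy only
    },
  };
}

const lb = makeBalancer([
  { host: 'web-1', healthy: true },
  { host: 'web-2', healthy: true },
  { host: 'web-3', healthy: false }, // "dead" server is skipped automatically
]);

console.log(lb.pick()); // web-1
console.log(lb.pick()); // web-2
console.log(lb.pick()); // web-1  (web-3 never receives traffic)
```

The key property: clients never know or care which server answered, which is exactly what makes horizontal scaling possible.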
🌍
CDN
Serve static files from edge locations. Your JS/CSS/images cached on 200+ servers worldwide. A user in Tokyo gets assets from Tokyo, not Virginia. Latency drops from 200ms to 20ms.
🗄️
Database Scaling
Read replicas + sharding. Read replicas: distribute read load across copies. Sharding: split data across multiple machines (users A-M on shard 1, N-Z on shard 2). Writes to primary, reads from replicas.
🧩
Microservices
Split the monolith into independent services. Auth service, image service, analytics service — each deployed, scaled, and maintained independently. Trade complexity for flexibility.
📐
CAP Theorem
Consistency + Availability + Partition tolerance — you can't guarantee all three. Partitions are a fact of real networks, so during one you must choose: all nodes see the same data (C) or every request gets a response (A). Every distributed system makes this choice, explicitly or not.
HANDS-ON.TASKS
01
Identify Bottlenecks at 1M Users
// Current: 100 users
1 Express server
1 MongoDB instance
1 Redis instance
Files on local disk

// At 1,000,000 users:
1 server → ~1000 concurrent connections → need 10+ servers behind LB
1 MongoDB → 1M image records, 100GB+ → need replica set + sharding
File storage → 10TB+ of images → need object storage (S3)
Real-time → 10,000 concurrent WS → need multiple WS servers with Redis adapter
Background jobs → 100K thumbnails/day → need 5+ worker instances
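The server count can be sanity-checked with back-of-envelope math. The 1%-online-at-peak figure below is an assumption for the exercise, not a measurement:

```javascript
// Back-of-envelope capacity math behind the numbers above.
const users = 1_000_000;
const peakOnlineFraction = 0.01;   // ASSUMPTION: ~1% of users online at peak
const connectionsPerServer = 1000; // per-server limit cited above

const concurrent = users * peakOnlineFraction;
const serversNeeded = Math.ceil(concurrent / connectionsPerServer);

console.log(concurrent);    // 10000 — matches the WS estimate above
console.log(serversNeeded); // 10   — hence "10+ servers behind LB"
```

Interviewers care less about the exact numbers than about seeing you do this arithmetic out loud and state your assumptions.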
02
Design the Scaled Architecture
Users → CDN (Cloudflare) → Load Balancer (Nginx)
├── Web Servers ×10
│     (Express, stateless)
├── WebSocket Servers ×3
│     (Socket.io + Redis adapter)
├── Background Workers ×5
│     (thumbnails, email)
│
├── MongoDB Cluster
│     (primary + 2 replicas)
├── PostgreSQL
│     (analytics, read replicas)
├── Redis Cluster
│     (cache + sessions + rate limit)
├── S3
│     (image file storage)
└── Message Queue (BullMQ/SQS)
      (async job processing)
03
Key Design Decisions
// 1. Stateless servers
//    Any request → any server
//    Session in Redis, not in memory
//    Scale by adding more servers

// 2. CDN for static assets
//    CSS/JS served from 200+ locations
//    Origin server barely touched

// 3. S3 for images
//    Infinite storage
//    Built-in redundancy (99.999999999% durability)
//    Served via CDN (CloudFront)

// 4. Read replicas
//    Reads: 90% of all queries
//    → spread across 2+ replicas
//    Writes: only to primary

// 5. Message queues
//    "Accept upload" decoupled from
//    "generate thumbnail"
//    User gets 202 immediately
//    Worker processes asynchronously
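Decision 5 can be sketched with a toy in-memory queue. Production uses BullMQ/SQS as listed above; this only shows the accept-now, process-later shape, and all names are illustrative:

```javascript
// "Accept upload" decoupled from "generate thumbnail" via a queue.
const queue = [];

function handleUpload(image) {
  queue.push({ job: 'thumbnail', image });          // enqueue, don't process
  return { status: 202, body: { accepted: true } }; // respond immediately
}

function workerTick() {
  const job = queue.shift(); // a worker pulls jobs on its own schedule
  return job ? `processed ${job.image}` : null;
}

console.log(handleUpload('cat.png').status); // 202
console.log(workerTick());                   // processed cat.png
```

The user's request latency is now independent of thumbnail generation time — the whole point of the queue.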
04
CAP Theorem Exercise
// Network partition happens between
// primary and replica databases.

// OPTION A: Consistency (CP)
//   Refuse writes until healed.
//   Users see errors.
//   But data is never stale.
//   → MongoDB default (CP)

// OPTION B: Availability (AP)
//   Accept writes on both sides.
//   Users always get responses.
//   But data may be temporarily stale.
//   → DynamoDB, Cassandra (AP)

// For PixelCraft:
//   Image metadata → CP
//     (don't lose or duplicate images)
//   Analytics events → AP
//     (slightly stale counts are fine)
//   Real-time cursors → AP
//     (brief desync is acceptable)
The "right" answer depends on the data. Financial transactions need CP. Social media feeds can tolerate AP. System design is about making these tradeoffs explicitly.
05
Cost Estimation
// At 1M users, monthly estimate:
Web servers (10 × $20)       $200
WS servers (3 × $40)         $120
Workers (5 × $20)            $100
MongoDB Atlas (M30)          $600
PostgreSQL (RDS)             $200
Redis (ElastiCache)          $150
S3 storage (10TB)            $230
S3 transfer (5TB)            $450
CDN (Cloudflare Pro)          $25
Domain + DNS                  $15
Sentry (Team)                 $26
──────────────────────────
Total:                 ~$2,100/mo

// Revenue needed to sustain:
//   2,100 paying users at $1/mo
//   or 210 users at $10/mo
// Rule of thumb: infrastructure
// should be <20% of revenue
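The arithmetic is easy to check in code — same line items as above, plus the under-20%-of-revenue rule of thumb:

```javascript
// Recompute the monthly estimate from the line items above.
const monthly = {
  webServers: 10 * 20,  // $200
  wsServers: 3 * 40,    // $120
  workers: 5 * 20,      // $100
  mongoAtlas: 600,
  postgresRds: 200,
  redis: 150,
  s3Storage: 230,
  s3Transfer: 450,
  cdn: 25,
  domainDns: 15,
  sentry: 26,
};

const total = Object.values(monthly).reduce((a, b) => a + b, 0);
console.log(total); // 2116 — the "~$2,100/mo" above

// To keep infrastructure under 20% of revenue, you'd need ~5x that:
const revenueTarget = Math.ceil(total / 0.2);
console.log(revenueTarget); // 10580
```

So break-even is ~2,100 users at $1/mo, but the 20% rule implies you'd actually want roughly $10,600/mo in revenue before this architecture is comfortable.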
CS.DEEP-DIVE

CAP theorem: the fundamental constraint of distributed systems.

During a network partition, you must choose: Consistency (all nodes see the same data) or Availability (every request gets a response). You can't have both.

// CAP theorem (Brewer, 2000):

Consistency
  All nodes see the same data
  at the same time

Availability
  Every request gets a response
  (success or failure)

Partition tolerance
  System works despite network
  failures between nodes

// In practice:
CP: MongoDB, HBase, Redis
  Data is always correct
  May refuse requests

AP: Cassandra, DynamoDB, CouchDB
  Always responds
  Data may be briefly stale

// Every distributed system makes
// this choice, explicitly or not.
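AP systems often soften this tradeoff with tunable quorums: with N replicas, a write acknowledged by W nodes and a read that consults R nodes are guaranteed to overlap (so reads see the latest write) exactly when R + W > N. A one-line sketch:

```javascript
// Quorum rule used by Dynamo-style stores (Cassandra, DynamoDB):
// the read set and write set must intersect for reads to be consistent.
function quorumOverlaps(n, r, w) {
  return r + w > n;
}

console.log(quorumOverlaps(3, 2, 2)); // true  — strict quorum, consistent reads
console.log(quorumOverlaps(3, 1, 1)); // false — fast, but reads may be stale
```

Tuning R and W per request is how "AP" databases let you buy back consistency where a workload needs it.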
"Scale Lab"
[A] Design for 100M users: what changes from the 1M architecture? Multi-region deployment, database sharding strategy, eventual consistency, global load balancing. Draw the architecture diagram.
[B] Practice a system design interview: "Design Instagram" or "Design Google Drive." Use the same framework: requirements → bottlenecks → architecture → tradeoffs. Time yourself: 45 minutes.
[C] Research how real companies scale. How does Netflix serve 250M users? How does Shopify handle Black Friday traffic spikes? Write a case study analyzing one company's architecture.
REF.MATERIAL
ARTICLE
Donne Martin
The most comprehensive system design resource: scalability, load balancing, caching, databases, queues. 280K+ stars. The system design bible.
SYSTEM DESIGN | ESSENTIAL
VIDEO
NeetCode
System design fundamentals: horizontal scaling, load balancers, CDNs, databases, caching, and message queues. Clear and visual.
SYSTEM DESIGN | BEGINNER
ARTICLE
Wikipedia
The theoretical foundation: consistency, availability, partition tolerance. Brewer's conjecture, formal proof, and practical implications for distributed systems.
CAP | THEORY | CS
ARTICLE
Alex Hyett
Monolith, microservices, event-driven, serverless: when to use each architecture pattern. Tradeoffs and real-world decision frameworks.
ARCHITECTURE | PATTERNS
VIDEO
Hussein Nasser
Step-by-step scaling: from single server to load balancers, replicas, CDNs, and queues. The progression from 100 to 1M users.
SCALING | PRACTICAL
// LEAVE EXCITED BECAUSE
You can whiteboard a system that handles 1 million users. Load balancers, CDNs, replicas, queues, CAP tradeoffs — you understand them all. This is the #1 skill tested in senior engineering interviews. Session 100. You made it.