// Current: 100 users
1 Express server
1 MongoDB instance
1 Redis instance
Files on local disk

// At 1,000,000 users:
1 server → ~1,000 concurrent connections
  → need 10+ servers behind a load balancer
1 MongoDB → 1M image records, 100GB+
  → need replica set + sharding
File storage → 10TB+ of images
  → need object storage (S3)
Real-time → 10,000 concurrent WebSocket connections
  → need multiple WebSocket servers
    with Redis adapter
Background jobs → 100K thumbnails/day
  → need 5+ worker instances
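
The instance counts above come from dividing peak load by per-instance capacity and rounding up. A minimal sketch of that arithmetic follows. Only the ~1,000 connections per server, the 10,000 WebSocket connections and the 100K thumbnails/day come from the estimate above. The peak of 10,000 HTTP connections, 4,000 sockets per WebSocket server and 20,000 thumbnails per worker per day are illustrative assumptions, not measurements, and `instancesNeeded` is a hypothetical helper name.

```javascript
// Instances needed = peak load / capacity per instance, rounded up.
// Capacity figures marked "assumed" are illustrative; measure your own.
function instancesNeeded(peakLoad, capacityPerInstance) {
  if (capacityPerInstance <= 0) {
    throw new RangeError("capacityPerInstance must be positive");
  }
  return Math.ceil(peakLoad / capacityPerInstance);
}

const plan = {
  // ~1,000 concurrent connections per Express server (from the estimate);
  // 10,000 peak concurrent HTTP connections is assumed.
  webServers: instancesNeeded(10000, 1000),
  // 10,000 concurrent WebSocket connections; 4,000 per server is assumed.
  wsServers: instancesNeeded(10000, 4000),
  // 100K thumbnails/day; 20,000 per worker per day is assumed.
  workers: instancesNeeded(100000, 20000),
};

console.log(plan);
```

Rounding up matters: 2.5 WebSocket servers becomes 3. In practice you would also add headroom for failover and traffic spikes.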
Users
  → CDN (Cloudflare)
  → Load Balancer (Nginx)
      ├── Web Servers ×10
      │     (Express, stateless)
      ├── WebSocket Servers ×3
      │     (Socket.io + Redis adapter)
      ├── Background Workers ×5
      │     (thumbnails, email)
      │
      ├── MongoDB Cluster
      │     (primary + 2 replicas)
      ├── PostgreSQL
      │     (analytics, read replicas)
      ├── Redis Cluster
      │     (cache + sessions + rate limit)
      ├── S3
      │     (image file storage)
      └── Message Queue (BullMQ/SQS)
            (async job processing)
// 1. Stateless servers
// Any request → any server
// Session in Redis, not in memory
// Scale by adding more servers
// 2. CDN for static assets
// CSS/JS served from 200+ locations
// Origin server barely touched
// 3. S3 for images
// Effectively unlimited storage
// Built-in redundancy (99.999999999% durability)
// Served via CDN (CloudFront)
// 4. Read replicas
// Reads: ~90% of all queries (typical estimate)
// → spread across 2+ replicas
// Writes: only to primary
// 5. Message queues
// "Accept upload" decoupled from
// "generate thumbnail"
// User gets 202 immediately
// Worker processes asynchronously
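
The message-queue pattern in point 5 can be sketched as below. This is a simplified illustration under stated assumptions, not the PixelCraft implementation: the in-memory array stands in for a real queue such as BullMQ or SQS, and `acceptUpload`, `runWorker` and `fakeGenerate` are hypothetical names. The worker is synchronous here only to keep the sketch short.

```javascript
// In-memory stand-in for a real queue (BullMQ/SQS); for illustration only.
const jobQueue = [];
const thumbnails = new Map(); // imageId -> thumbnail key, filled in by the worker

// Request handler: enqueue the job and answer immediately with 202 Accepted.
function acceptUpload(imageId) {
  jobQueue.push({ type: "thumbnail", imageId });
  return { status: 202, body: { imageId, thumbnail: "pending" } };
}

// Worker: a separate process in production; here it just drains the queue.
function runWorker(generateThumbnail) {
  while (jobQueue.length > 0) {
    const job = jobQueue.shift();
    thumbnails.set(job.imageId, generateThumbnail(job.imageId));
  }
}

// Hypothetical generator; a real one would read the image and write to S3.
const fakeGenerate = (imageId) => `thumbs/${imageId}.jpg`;

const response = acceptUpload("img-1");
const readyBeforeWorker = thumbnails.has("img-1");
console.log(response.status, readyBeforeWorker); // 202 false

runWorker(fakeGenerate);
console.log(thumbnails.get("img-1")); // thumbs/img-1.jpg
```

The handler never waits for the thumbnail, so upload latency stays flat even when the workers fall behind. The backlog grows in the queue instead of in open HTTP requests.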
// Network partition happens between
// primary and replica databases.
// OPTION A: Consistency (CP)
//   The side without a majority refuses
//   writes until the partition heals.
//   Some users see errors.
//   But data is never stale.
//   → MongoDB default (CP)
// OPTION B: Availability (AP)
//   Accept writes on both sides.
//   Users always get responses.
//   But data may be temporarily stale
//   until the sides reconcile.
//   → DynamoDB, Cassandra (AP)
// For PixelCraft:
// Image metadata → CP
// (don't lose or duplicate images)
// Analytics events → AP
// (slightly stale counts are fine)
// Real-time cursors → AP
// (brief desync is acceptable)
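
The two options can be made concrete with a toy model of a single write arriving at one side of a partition. It is only a sketch of the trade-off and does not show how MongoDB, DynamoDB or Cassandra work internally. `handleWrite` and the `policy` map are hypothetical names, and the majority rule is a simplifying assumption.

```javascript
// Toy model: one write arrives at a node during a network partition.
// reachableNodes = cluster members this node can still reach, itself included.
function handleWrite(mode, reachableNodes, totalNodes) {
  const hasMajority = reachableNodes > totalNodes / 2;
  const partitioned = reachableNodes < totalNodes;
  if (mode === "CP") {
    // Consistency first: only a side holding a majority may accept writes.
    return hasMajority
      ? { accepted: true, mayDiverge: false }
      : { accepted: false, error: "no majority reachable, write refused" };
  }
  if (mode === "AP") {
    // Availability first: always accept, reconcile after the partition heals.
    return { accepted: true, mayDiverge: partitioned };
  }
  throw new Error(`unknown mode: ${mode}`);
}

// Per-data-type choice for PixelCraft, as listed above.
const policy = {
  imageMetadata: "CP",
  analyticsEvents: "AP",
  realtimeCursors: "AP",
};

// A node cut off from the other two members of a 3-node cluster:
console.log(handleWrite(policy.imageMetadata, 1, 3).accepted);   // false
console.log(handleWrite(policy.analyticsEvents, 1, 3).accepted); // true
```

The same cluster can serve both policies. The choice is made per data type, based on whether an error or a stale value is the cheaper failure.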
// At 1M users, monthly estimate:
Web servers (10 × $20)        $200
WS servers (3 × $40)          $120
Workers (5 × $20)             $100
MongoDB Atlas (M30)           $600
PostgreSQL (RDS)              $200
Redis (ElastiCache)           $150
S3 storage (10TB)             $230
S3 transfer (5TB)             $450
CDN (Cloudflare Pro)           $25
Domain + DNS                   $15
Sentry (Team)                  $26
──────────────────────────────────
Total:                  ~$2,100/mo
// Revenue needed to break even:
//   2,100 paying users at $1/mo
//   or 210 users at $10/mo
// Rule of thumb: infrastructure
//   should be <20% of revenue,
//   so aim for ~$10,500/mo
//   (10,500 users at $1/mo or 1,050 at $10/mo)
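
The totals and revenue targets can be checked with a few lines of arithmetic. The sketch below uses the line items from the estimate; the variable and function names are illustrative. It works from the exact itemised sum ($2,116) rather than the rounded ~$2,100, so its results differ slightly from the rounded figures.

```javascript
// Monthly line items in USD, copied from the estimate.
const monthlyCosts = {
  webServers: 10 * 20,
  wsServers: 3 * 40,
  workers: 5 * 20,
  mongoAtlas: 600,
  postgres: 200,
  redis: 150,
  s3Storage: 230,
  s3Transfer: 450,
  cdn: 25,
  domainDns: 15,
  sentry: 26,
};

const total = Object.values(monthlyCosts).reduce((sum, cost) => sum + cost, 0);

// Paying users needed just to cover infrastructure at a given monthly price.
const breakEvenUsers = (pricePerMonth) => Math.ceil(total / pricePerMonth);

// Revenue at which infrastructure is exactly 20% of revenue (rule of thumb).
const maxInfraSharePercent = 20;
const minRevenue = (total * 100) / maxInfraSharePercent;

console.log(total);              // 2116
console.log(breakEvenUsers(1));  // 2116
console.log(breakEvenUsers(10)); // 212
console.log(minRevenue);         // 10580
```

Breaking even on infrastructure and meeting the 20% rule are different targets. The second requires five times the revenue of the first.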
CAP theorem: a fundamental constraint of distributed systems.
During a network partition, you must choose between consistency (all nodes see the same data) and availability (every request gets a response). You can't have both until the partition heals.