LVL 04 — SENIOR-IN-TRAINING · SESSION 114 · DAY 114

MONITORING DASHBOARD

🎫 PIXELCRAFT-100
🔧 DevOps | 🟡 Medium | Priority: 🟠 High

We deployed PixelCraft but have zero visibility into how it's performing. Is the server healthy? How fast are responses? How many users are active? Build an internal monitoring dashboard: uptime, error rate, response times, active users, storage usage. Alert when metrics exceed thresholds.
CONCEPTS.UNLOCKED
📊
Application Monitoring
Continuous measurement of your system's health. Collect metrics → store them → visualize on dashboards → alert on anomalies. You can't improve what you can't measure. You can't fix what you can't see.
💓
Health Check Endpoints
GET /api/health → { status: "ok", db: "connected" }. External services ping this every minute. If it fails, you get alerted. Checks database connection, Redis, disk space — not just "server responds."
⏱️
Percentile Latencies
P50, P95, P99 — not averages. Average hides outliers. P95 = 95% of requests are faster than this. P99 matters most: 1% of users = 10,000 at 1M scale. One slow request per hundred is unacceptable.
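A quick sketch shows why the average misleads. The latency values below are made up; the percentile helper uses the nearest-rank method (the same approach as the percentile() function in the hands-on tasks):

```typescript
// Nine fast requests plus one slow outlier (hypothetical values, in ms)
const latencies = [80, 90, 95, 100, 100, 105, 110, 115, 120, 5000];

const sorted = [...latencies].sort((a, b) => a - b);
const avg = sorted.reduce((s, v) => s + v, 0) / sorted.length;

// Nearest-rank percentile: the smallest value that at least p% of samples fall at or below
function percentile(sortedArr: number[], p: number): number {
  const i = Math.ceil((sortedArr.length * p) / 100) - 1;
  return sortedArr[Math.max(0, i)];
}

console.log(avg);                     // 591.5 — one outlier inflates the mean 6x
console.log(percentile(sorted, 50));  // 100  — the typical request
console.log(percentile(sorted, 95));  // 5000 — the tail tells the real story
```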
📈
Grafana Dashboards
Visualize metrics over time. Line charts for response times, gauges for error rates, counters for active users. Grafana connects to Prometheus, InfluxDB, or simple JSON endpoints.
🔔
Alerting Thresholds
Error rate > 5% → Slack alert. P95 > 2s → PagerDuty. Define thresholds for each metric. Alert when exceeded. Avoid alert fatigue: only alert on actionable conditions that require human response.
📋
SLA / SLO / SLI
SLI: what you measure. SLO: your target. SLA: your promise. SLI = P95 latency. SLO = P95 < 500ms, 99.9% uptime. SLA = contractual guarantee to customers. Measure → target → commit.
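The uptime side of an SLO translates directly into an allowed-downtime "error budget." A sketch of the arithmetic, assuming a 30-day month:

```typescript
// Allowed downtime (the "error budget") for an uptime SLO, in minutes
function downtimeBudgetMinutes(sloPercent: number, days = 30): number {
  const totalMinutes = days * 24 * 60; // 43,200 min in a 30-day month
  const budget = totalMinutes * (1 - sloPercent / 100);
  return Math.round(budget * 10) / 10; // one decimal place
}

console.log(downtimeBudgetMinutes(99.9)); // 43.2 — the "three nines" budget
console.log(downtimeBudgetMinutes(99.5)); // 216  — what a 99.5% SLA tolerates
```

The gap between the SLO (99.9%) and the SLA (99.5%) is deliberate: the internal target is stricter than the contractual promise, so breaching the SLO is a warning, not a refund.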
HANDS-ON.TASKS
01
Enhanced Health Check
app.get('/api/health', async (req, res) => {
  const checks = {
    server: 'ok',
    uptime: process.uptime(),
    timestamp: new Date().toISOString(),
    memory: {
      used: Math.round(process.memoryUsage().heapUsed / 1024 / 1024),
      total: Math.round(process.memoryUsage().heapTotal / 1024 / 1024),
      unit: 'MB',
    },
    database: 'checking...',
    redis: 'checking...',
  };

  try {
    await db.command({ ping: 1 });
    checks.database = 'ok';
  } catch {
    checks.database = 'error';
  }

  try {
    await redis.ping();
    checks.redis = 'ok';
  } catch {
    checks.redis = 'error';
  }

  const allOk = checks.database === 'ok' && checks.redis === 'ok';
  res.status(allOk ? 200 : 503).json(checks);
});
02
Metrics Collection Middleware
// middleware/metrics.ts
interface RequestMetric {
  method: string;
  path: string;
  statusCode: number;
  duration: number;
  timestamp: number;
}

const metrics: RequestMetric[] = [];

app.use((req, res, next) => {
  const start = performance.now();
  res.on('finish', () => {
    metrics.push({
      method: req.method,
      path: req.route?.path ?? req.path,
      statusCode: res.statusCode,
      duration: performance.now() - start,
      timestamp: Date.now(),
    });
    // Keep the last 10,000 metrics; drop the oldest half when full
    if (metrics.length > 10000) metrics.splice(0, 5000);
  });
  next();
});

// Metrics API
app.get('/api/admin/metrics', requireAdmin, (req, res) => {
  const window = 5 * 60 * 1000; // 5 min
  const recent = metrics.filter(m => m.timestamp > Date.now() - window);
  const durations = recent.map(m => m.duration).sort((a, b) => a - b);
  res.json({
    requestCount: recent.length,
    // Guard against dividing by zero when the window is empty
    errorRate: recent.length === 0
      ? 0
      : recent.filter(m => m.statusCode >= 500).length / recent.length,
    latency: {
      p50: percentile(durations, 50),
      p95: percentile(durations, 95),
      p99: percentile(durations, 99),
    },
    topEndpoints: groupAndCount(recent, 'path'),
  });
});

function percentile(arr: number[], p: number) {
  const i = Math.ceil((arr.length * p) / 100) - 1;
  return arr[Math.max(0, i)] ?? 0;
}
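The metrics endpoint above calls a groupAndCount helper that isn't shown. One possible sketch; the name, signature, and output shape are assumptions:

```typescript
// Hypothetical helper: count occurrences of a string field, return the top entries
function groupAndCount<T>(
  items: T[],
  key: keyof T,
  top = 5
): { value: string; count: number }[] {
  const counts = new Map<string, number>();
  for (const item of items) {
    const v = String(item[key]);
    counts.set(v, (counts.get(v) ?? 0) + 1);
  }
  return [...counts.entries()]
    .map(([value, count]) => ({ value, count }))
    .sort((a, b) => b.count - a.count)
    .slice(0, top);
}

// Example: which endpoints get hit most in the recent window
const sample = [
  { path: '/api/images' }, { path: '/api/images' }, { path: '/api/health' },
];
console.log(groupAndCount(sample, 'path'));
// → [{ value: '/api/images', count: 2 }, { value: '/api/health', count: 1 }]
```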
03
Active Users & Storage
app.get('/api/admin/stats', requireAdmin, async (req, res) => {
  const [totalUsers, activeToday, totalImages, storageBytes] =
    await Promise.all([
      db.collection('users').countDocuments(),
      db.collection('users').countDocuments({
        lastActive: { $gte: new Date(Date.now() - 86400000) }, // last 24h
      }),
      db.collection('images').countDocuments(),
      db.collection('images')
        .aggregate([
          { $group: { _id: null, total: { $sum: '$fileSize' } } },
        ])
        .toArray()
        .then(r => r[0]?.total ?? 0),
    ]);

  res.json({
    users: { total: totalUsers, activeToday },
    images: { total: totalImages },
    storage: {
      bytes: storageBytes,
      formatted: formatBytes(storageBytes),
    },
  });
});
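The formatBytes helper used above isn't defined in the snippet. A minimal sketch of one possible implementation (1024-based units and one decimal place are assumptions):

```typescript
// Hypothetical helper: human-readable byte sizes (1024-based units)
function formatBytes(bytes: number): string {
  if (bytes === 0) return '0 B';
  const units = ['B', 'KB', 'MB', 'GB', 'TB'];
  const i = Math.min(
    Math.floor(Math.log(bytes) / Math.log(1024)),
    units.length - 1
  );
  return `${(bytes / 1024 ** i).toFixed(1)} ${units[i]}`;
}

console.log(formatBytes(1536));       // "1.5 KB"
console.log(formatBytes(3221225472)); // "3.0 GB"
```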
04
Alerting Logic
// lib/alerting.ts
interface AlertRule {
  name: string;
  condition: () => Promise<boolean>;
  message: string;
  cooldown: number; // ms
}

const rules: AlertRule[] = [
  {
    name: 'high-error-rate',
    condition: async () => {
      const m = await getMetrics();
      return m.errorRate > 0.05;
    },
    message: '🔴 Error rate > 5%',
    cooldown: 15 * 60 * 1000,
  },
  {
    name: 'slow-p95',
    condition: async () => {
      const m = await getMetrics();
      return m.latency.p95 > 2000;
    },
    message: '🟡 P95 latency > 2s',
    cooldown: 30 * 60 * 1000,
  },
];

// Track when each rule last fired so the cooldown actually suppresses repeats
const lastFired = new Map<string, number>();

// Check every minute
setInterval(async () => {
  for (const rule of rules) {
    const last = lastFired.get(rule.name) ?? 0;
    if (Date.now() - last < rule.cooldown) continue; // still cooling down
    if (await rule.condition()) {
      lastFired.set(rule.name, Date.now());
      await sendSlackAlert(rule.message);
    }
  }
}, 60_000);
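sendSlackAlert is referenced but not implemented above. Slack's incoming webhooks accept a POST with a JSON body containing a text field; here is a sketch, where the SLACK_WEBHOOK_URL environment variable is an assumption about your setup:

```typescript
// Hypothetical helper: post an alert to a Slack incoming webhook.
// Assumes SLACK_WEBHOOK_URL is set in the environment.
function buildSlackPayload(message: string): { text: string } {
  return { text: message };
}

async function sendSlackAlert(message: string): Promise<void> {
  const url = process.env.SLACK_WEBHOOK_URL;
  if (!url) return; // no webhook configured: skip silently
  await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildSlackPayload(message)),
  });
}
```

Using fetch keeps this dependency-free on Node 18+; swap in your HTTP client of choice if you target older runtimes.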
05
Close the Ticket
git switch -c devops/PIXELCRAFT-100-monitoring
git add src/middleware/metrics.ts src/lib/alerting.ts
git commit -m "Add monitoring dashboard + alerting (PIXELCRAFT-100)"
git push origin devops/PIXELCRAFT-100-monitoring
# PR → Review → Merge → Close ticket ✅
CS.DEEP-DIVE

SLI → SLO → SLA: the reliability hierarchy.

Google's SRE book formalized this: measure what matters, set targets, then make promises. In that order. Never the reverse.

// The hierarchy:

SLI (Service Level Indicator)
  What you measure
  "P95 latency is 120ms"
  "Uptime this month: 99.97%"

SLO (Service Level Objective)
  Your internal target
  "P95 latency < 500ms"
  "99.9% uptime (43min/month)"

SLA (Service Level Agreement)
  Your promise to customers
  "99.5% uptime guaranteed"
  "Refund if breached"

// Why averages lie:
Avg latency: 100ms (looks great!)
P99 latency: 5000ms
→ 1% of users wait 5 seconds
→ At 1M users = 10,000 unhappy

// Always measure percentiles.
// P50 = typical. P95 = bad day.
// P99 = worst experience.
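The numbers above can be reproduced with synthetic data: 989 fast requests plus 11 slow ones give a healthy-looking average and a terrible P99. The latency values are made up for illustration:

```typescript
// 989 "normal" requests at 50ms plus 11 outliers at 5000ms (synthetic data)
const latencies = [
  ...Array(989).fill(50),
  ...Array(11).fill(5000),
];

const avg = latencies.reduce((s, v) => s + v, 0) / latencies.length;

const sorted = [...latencies].sort((a, b) => a - b);
const p99 = sorted[Math.ceil(sorted.length * 0.99) - 1]; // nearest-rank P99

console.log(avg); // 104.45 — the dashboard looks healthy
console.log(p99); // 5000  — but ~1% of requests take 5 seconds
// At 1,000,000 users, that 1% is roughly 10,000 people
```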
"Monitoring Lab"
[A] Build a React dashboard page: fetch /api/admin/metrics and /api/admin/stats every 30 seconds. Display line charts (Recharts) for latency over time, gauges for error rate, and counters for active users.
[B] Add UptimeRobot (free tier): ping /api/health every 5 minutes from external locations worldwide. Get email/Slack alerts when the server goes down. See uptime percentage over 30/60/90 days.
[C] Research: read chapters 1-4 of Google's SRE book (free online). What is an "error budget"? How does Google decide when to stop shipping features and focus on reliability? Write a brief SRE proposal for PixelCraft.
REF.MATERIAL
ARTICLE
Google
The free book that defined Site Reliability Engineering: SLIs, SLOs, error budgets, monitoring, incident response. The industry standard.
SRE · ESSENTIAL
VIDEO
Google Cloud
Clear explanation of the reliability hierarchy: what to measure, what targets to set, and how to make promises to customers.
SRE · OVERVIEW
ARTICLE
Grafana Labs
Official Grafana guide: creating dashboards, data sources, alerting, and visualization. The standard for metrics visualization.
GRAFANA · OFFICIAL
ARTICLE
UptimeRobot
Free uptime monitoring: HTTP checks every 5 minutes, email/Slack alerts, public status pages. The easiest monitoring to set up.
UPTIME · FREE
VIDEO
StatQuest
Why averages lie and percentiles tell the truth: P50, P95, P99 explained with clear examples. Essential for understanding latency metrics.
STATS · PERCENTILES
// LEAVE EXCITED BECAUSE
Dashboard shows P50/P95/P99 latency, error rate, active users, and storage. Alerts fire to Slack when thresholds are breached. You see your system's health in real-time. No more flying blind.