LVL 04 — SENIOR-IN-TRAINING · SESSION 114 · DAY 114

MONITORING DASHBOARD

🎫 PIXELCRAFT-100
🔧 DevOps | 🟡 Medium | Priority: 🟠 High

We deployed PixelCraft but have zero visibility into how it's performing. Is the server healthy? How fast are responses? How many users are active? Build an internal monitoring dashboard: uptime, error rate, response times, active users, storage usage. Alert when metrics exceed thresholds.
CONCEPTS.UNLOCKED
📊
Application Monitoring
Continuous measurement of your system's health. Collect metrics → store them → visualize on dashboards → alert on anomalies. You can't improve what you can't measure. You can't fix what you can't see.
💓
Health Check Endpoints
GET /api/health → { status: "ok", db: "connected" }. External services ping this every minute. If it fails, you get alerted. Checks database connection, Redis, disk space — not just "server responds."
⏱️
Percentile Latencies
P50, P95, P99 — not averages. Average hides outliers. P95 = 95% of requests are faster than this. P99 matters most: 1% of users = 10,000 at 1M scale. One slow request per hundred is unacceptable.
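A quick sketch shows why the average misleads. The latency values below are made up; the percentile helper uses the nearest-rank method (the same approach as the percentile() function in the hands-on tasks):

```typescript
// Nine fast requests plus one slow outlier (hypothetical values, in ms)
const latencies = [80, 90, 95, 100, 100, 105, 110, 115, 120, 5000];

const sorted = [...latencies].sort((a, b) => a - b);
const avg = sorted.reduce((s, v) => s + v, 0) / sorted.length;

// Nearest-rank percentile: the smallest value that at least p% of samples fall at or below
function percentile(sortedArr: number[], p: number): number {
  const i = Math.ceil((sortedArr.length * p) / 100) - 1;
  return sortedArr[Math.max(0, i)];
}

console.log(avg);                     // 591.5 — one outlier inflates the mean 6x
console.log(percentile(sorted, 50));  // 100  — the typical request
console.log(percentile(sorted, 95));  // 5000 — the tail tells the real story
```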
📈
Grafana Dashboards
Visualize metrics over time. Line charts for response times, gauges for error rates, counters for active users. Grafana connects to Prometheus, InfluxDB, or simple JSON endpoints.
🔔
Alerting Thresholds
Error rate > 5% → Slack alert. P95 > 2s → PagerDuty. Define thresholds for each metric. Alert when exceeded. Avoid alert fatigue: only alert on actionable conditions that require human response.
📋
SLA / SLO / SLI
SLI: what you measure. SLO: your target. SLA: your promise. SLI = P95 latency. SLO = P95 < 500ms, 99.9% uptime. SLA = contractual guarantee to customers. Measure → target → commit.
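The uptime side of an SLO translates directly into an allowed-downtime "error budget." A sketch of the arithmetic, assuming a 30-day month:

```typescript
// Allowed downtime (the "error budget") for an uptime SLO, in minutes
function downtimeBudgetMinutes(sloPercent: number, days = 30): number {
  const totalMinutes = days * 24 * 60; // 43,200 min in a 30-day month
  const budget = totalMinutes * (1 - sloPercent / 100);
  return Math.round(budget * 10) / 10; // one decimal place
}

console.log(downtimeBudgetMinutes(99.9)); // 43.2 — the "three nines" budget
console.log(downtimeBudgetMinutes(99.5)); // 216  — what a 99.5% SLA tolerates
```

The gap between the SLO (99.9%) and the SLA (99.5%) is deliberate: the internal target is stricter than the contractual promise, so breaching the SLO is a warning, not a refund.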
HANDS-ON.TASKS
01
Enhanced Health Check
app.get('/api/health', async (req, res) => {
  const checks = {
    server: 'ok',
    uptime: process.uptime(),
    timestamp: new Date().toISOString(),
    memory: {
      used: Math.round(process.memoryUsage().heapUsed / 1024 / 1024),
      total: Math.round(process.memoryUsage().heapTotal / 1024 / 1024),
      unit: 'MB',
    },
    database: 'checking...',
    redis: 'checking...',
  };

  try {
    await db.command({ ping: 1 });
    checks.database = 'ok';
  } catch {
    checks.database = 'error';
  }

  try {
    await redis.ping();
    checks.redis = 'ok';
  } catch {
    checks.redis = 'error';
  }

  const allOk = checks.database === 'ok' && checks.redis === 'ok';
  res.status(allOk ? 200 : 503).json(checks);
});
02
Metrics Collection Middleware
// middleware/metrics.ts
interface RequestMetric {
  method: string;
  path: string;
  statusCode: number;
  duration: number;
  timestamp: number;
}

const metrics: RequestMetric[] = [];

app.use((req, res, next) => {
  const start = performance.now();
  res.on('finish', () => {
    metrics.push({
      method: req.method,
      path: req.route?.path ?? req.path,
      statusCode: res.statusCode,
      duration: performance.now() - start,
      timestamp: Date.now(),
    });
    // Keep the last 10,000 metrics; drop the oldest half when full
    if (metrics.length > 10000) metrics.splice(0, 5000);
  });
  next();
});

// Metrics API
app.get('/api/admin/metrics', requireAdmin, (req, res) => {
  const window = 5 * 60 * 1000; // 5 min
  const recent = metrics.filter(m => m.timestamp > Date.now() - window);
  const durations = recent.map(m => m.duration).sort((a, b) => a - b);
  res.json({
    requestCount: recent.length,
    // Guard against dividing by zero when the window is empty
    errorRate: recent.length === 0
      ? 0
      : recent.filter(m => m.statusCode >= 500).length / recent.length,
    latency: {
      p50: percentile(durations, 50),
      p95: percentile(durations, 95),
      p99: percentile(durations, 99),
    },
    topEndpoints: groupAndCount(recent, 'path'),
  });
});

function percentile(arr: number[], p: number) {
  const i = Math.ceil((arr.length * p) / 100) - 1;
  return arr[Math.max(0, i)] ?? 0;
}
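The metrics endpoint above calls a groupAndCount helper that isn't shown. One possible sketch; the name, signature, and output shape are assumptions:

```typescript
// Hypothetical helper: count occurrences of a string field, return the top entries
function groupAndCount<T>(
  items: T[],
  key: keyof T,
  top = 5
): { value: string; count: number }[] {
  const counts = new Map<string, number>();
  for (const item of items) {
    const v = String(item[key]);
    counts.set(v, (counts.get(v) ?? 0) + 1);
  }
  return [...counts.entries()]
    .map(([value, count]) => ({ value, count }))
    .sort((a, b) => b.count - a.count)
    .slice(0, top);
}

// Example: which endpoints get hit most in the recent window
const sample = [
  { path: '/api/images' }, { path: '/api/images' }, { path: '/api/health' },
];
console.log(groupAndCount(sample, 'path'));
// → [{ value: '/api/images', count: 2 }, { value: '/api/health', count: 1 }]
```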
03
Active Users & Storage
app.get('/api/admin/stats', requireAdmin, async (req, res) => {
  const [totalUsers, activeToday, totalImages, storageBytes] =
    await Promise.all([
      db.collection('users').countDocuments(),
      db.collection('users').countDocuments({
        lastActive: { $gte: new Date(Date.now() - 86400000) }, // last 24h
      }),
      db.collection('images').countDocuments(),
      db.collection('images')
        .aggregate([
          { $group: { _id: null, total: { $sum: '$fileSize' } } },
        ])
        .toArray()
        .then(r => r[0]?.total ?? 0),
    ]);

  res.json({
    users: { total: totalUsers, activeToday },
    images: { total: totalImages },
    storage: {
      bytes: storageBytes,
      formatted: formatBytes(storageBytes),
    },
  });
});
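The formatBytes helper used above isn't defined in the snippet. A minimal sketch of one possible implementation (1024-based units and one decimal place are assumptions):

```typescript
// Hypothetical helper: human-readable byte sizes (1024-based units)
function formatBytes(bytes: number): string {
  if (bytes === 0) return '0 B';
  const units = ['B', 'KB', 'MB', 'GB', 'TB'];
  const i = Math.min(
    Math.floor(Math.log(bytes) / Math.log(1024)),
    units.length - 1
  );
  return `${(bytes / 1024 ** i).toFixed(1)} ${units[i]}`;
}

console.log(formatBytes(1536));       // "1.5 KB"
console.log(formatBytes(3221225472)); // "3.0 GB"
```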
04
Alerting Logic
// lib/alerting.ts
interface AlertRule {
  name: string;
  condition: () => Promise<boolean>;
  message: string;
  cooldown: number; // ms
}

const rules: AlertRule[] = [
  {
    name: 'high-error-rate',
    condition: async () => {
      const m = await getMetrics();
      return m.errorRate > 0.05;
    },
    message: '🔴 Error rate > 5%',
    cooldown: 15 * 60 * 1000,
  },
  {
    name: 'slow-p95',
    condition: async () => {
      const m = await getMetrics();
      return m.latency.p95 > 2000;
    },
    message: '🟡 P95 latency > 2s',
    cooldown: 30 * 60 * 1000,
  },
];

// Track when each rule last fired so the cooldown actually suppresses repeats
const lastFired = new Map<string, number>();

// Check every minute
setInterval(async () => {
  for (const rule of rules) {
    const last = lastFired.get(rule.name) ?? 0;
    if (Date.now() - last < rule.cooldown) continue; // still cooling down
    if (await rule.condition()) {
      lastFired.set(rule.name, Date.now());
      await sendSlackAlert(rule.message);
    }
  }
}, 60_000);
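sendSlackAlert is referenced but not implemented above. Slack's incoming webhooks accept a POST with a JSON body containing a text field; here is a sketch, where the SLACK_WEBHOOK_URL environment variable is an assumption about your setup:

```typescript
// Hypothetical helper: post an alert to a Slack incoming webhook.
// Assumes SLACK_WEBHOOK_URL is set in the environment.
function buildSlackPayload(message: string): { text: string } {
  return { text: message };
}

async function sendSlackAlert(message: string): Promise<void> {
  const url = process.env.SLACK_WEBHOOK_URL;
  if (!url) return; // no webhook configured: skip silently
  await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildSlackPayload(message)),
  });
}
```

Using fetch keeps this dependency-free on Node 18+; swap in your HTTP client of choice if you target older runtimes.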
05
Close the Ticket
git switch -c devops/PIXELCRAFT-100-monitoring
git add src/middleware/metrics.ts src/lib/alerting.ts
git commit -m "Add monitoring dashboard + alerting (PIXELCRAFT-100)"
git push origin devops/PIXELCRAFT-100-monitoring
# PR → Review → Merge → Close ticket ✅
CS.DEEP-DIVE

SLI → SLO → SLA: the reliability hierarchy.

Google's SRE book formalized this: measure what matters, set targets, then make promises. In that order. Never the reverse.

// The hierarchy:

SLI (Service Level Indicator)
  What you measure
  "P95 latency is 120ms"
  "Uptime this month: 99.97%"

SLO (Service Level Objective)
  Your internal target
  "P95 latency < 500ms"
  "99.9% uptime (43min/month)"

SLA (Service Level Agreement)
  Your promise to customers
  "99.5% uptime guaranteed"
  "Refund if breached"

// Why averages lie:
Avg latency: 100ms (looks great!)
P99 latency: 5000ms
→ 1% of users wait 5 seconds
→ At 1M users = 10,000 unhappy

// Always measure percentiles.
// P50 = typical. P95 = bad day.
// P99 = worst experience.
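The numbers above can be reproduced with synthetic data: 989 fast requests plus 11 slow ones give a healthy-looking average and a terrible P99. The latency values are made up for illustration:

```typescript
// 989 "normal" requests at 50ms plus 11 outliers at 5000ms (synthetic data)
const latencies = [
  ...Array(989).fill(50),
  ...Array(11).fill(5000),
];

const avg = latencies.reduce((s, v) => s + v, 0) / latencies.length;

const sorted = [...latencies].sort((a, b) => a - b);
const p99 = sorted[Math.ceil(sorted.length * 0.99) - 1]; // nearest-rank P99

console.log(avg); // 104.45 — the dashboard looks healthy
console.log(p99); // 5000  — but ~1% of requests take 5 seconds
// At 1,000,000 users, that 1% is roughly 10,000 people
```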
"Monitoring Lab"
[A] Build a React dashboard page: fetch /api/admin/metrics and /api/admin/stats every 30 seconds. Display line charts (Recharts) for latency over time, gauges for error rate, and counters for active users.
[B] Add UptimeRobot (free tier): ping /api/health every 5 minutes from external locations worldwide. Get email/Slack alerts when the server goes down. See uptime percentage over 30/60/90 days.
[C] Research: read chapters 1-4 of Google's SRE book (free online). What is an "error budget"? How does Google decide when to stop shipping features and focus on reliability? Write a brief SRE proposal for PixelCraft.
REF.MATERIAL
ARTICLE
Google
The free book that defined Site Reliability Engineering: SLIs, SLOs, error budgets, monitoring, incident response. The industry standard.
SRE · ESSENTIAL
VIDEO
Google Cloud
Clear explanation of the reliability hierarchy: what to measure, what targets to set, and how to make promises to customers.
SRE · OVERVIEW
ARTICLE
Grafana Labs
Official Grafana guide: creating dashboards, data sources, alerting, and visualization. The standard for metrics visualization.
GRAFANA · OFFICIAL
ARTICLE
UptimeRobot
Free uptime monitoring: HTTP checks every 5 minutes, email/Slack alerts, public status pages. The easiest monitoring to set up.
UPTIME · FREE
VIDEO
StatQuest
Why averages lie and percentiles tell the truth: P50, P95, P99 explained with clear examples. Essential for understanding latency metrics.
STATS · PERCENTILES
// LEAVE EXCITED BECAUSE
Dashboard shows P50/P95/P99 latency, error rate, active users, and storage. Alerts fire to Slack when thresholds are breached. You see your system's health in real-time. No more flying blind.