What is observability, and how do logging, metrics, and tracing work together in a production system?

Question

At a microservices company:

> "Users report intermittent slowness. Our logs show no errors. How do we find the bottleneck across 15 microservices?"

Interview OS Community · Accepted Answer

## The Three Pillars of Observability

### 1. Logging (What happened?)
```typescript
// Structured logging
logger.info({
  event: "order_created",
  orderId: "123",
  userId: "456",
  duration: 230,
  service: "order-service",
});

// Search: "Show me all order_created events for user 456"
```
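
For that search to work across 15 services, every line has to be machine-parseable and carry a field that ties it to a request. The following is a minimal sketch with no logging library; `formatLog` and the `traceId` field are hypothetical and not part of the original answer.

```typescript
// Hypothetical helper: renders one log event as a single JSON line.
type LogLevel = "debug" | "info" | "warn" | "error";

interface LogFields {
  event: string;
  service: string;
  traceId?: string; // assumed correlation field, not in the original example
  [key: string]: unknown;
}

function formatLog(level: LogLevel, fields: LogFields, now: Date = new Date()): string {
  // timestamp and level come first so every line starts with the same keys.
  return JSON.stringify({ timestamp: now.toISOString(), level, ...fields });
}

const line = formatLog(
  "info",
  {
    event: "order_created",
    service: "order-service",
    orderId: "123",
    userId: "456",
    duration: 230,
    traceId: "abc-123",
  },
  new Date("2024-01-01T00:00:00.000Z"),
);
console.log(line);
```

One JSON object per line is what lets a log aggregator index `userId` or `traceId` as fields instead of grepping free text.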

### 2. Metrics (How much/how many?)
```text
# Prometheus metrics
http_requests_total{method="GET", path="/api/orders", status="200"} 15423
http_request_duration_seconds{quantile="0.99"} 0.234

# Alert: "p99 latency > 500ms for 5 minutes"
```
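
The `quantile="0.99"` series and the alert rule can be read as follows. This is a sketch that assumes nearest-rank percentiles over raw samples and one p99 value per minute; `percentile` and `shouldAlert` are hypothetical helpers, and Prometheus itself estimates quantiles differently (from summaries or histogram buckets).

```typescript
// Nearest-rank percentile: the smallest sample with at least a fraction q of samples <= it.
function percentile(samples: number[], q: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil(q * sorted.length));
  return sorted[rank - 1];
}

// Fires only if p99 stayed above the threshold for `minutes` consecutive one-minute windows.
function shouldAlert(p99PerMinuteMs: number[], thresholdMs = 500, minutes = 5): boolean {
  if (p99PerMinuteMs.length < minutes) return false;
  return p99PerMinuteMs.slice(-minutes).every((v) => v > thresholdMs);
}

// 98 fast requests and 2 slow ones: the mean hides what p99 exposes.
const latenciesMs: number[] = [...new Array<number>(98).fill(40), 900, 1200];
const p99 = percentile(latenciesMs, 0.99);
const mean = latenciesMs.reduce((a, b) => a + b, 0) / latenciesMs.length;
console.log({ p99, mean });
```

This is why "logs show no errors" and "users report slowness" can both be true: a small fraction of slow requests barely moves the average but shows up clearly in the tail percentile.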

### 3. Tracing (Where did time go?)
```
Request → [API Gateway: 2ms]
         → [Auth Service: 15ms] ← SLOW!
         → [Order Service: 5ms]
         → [Database: 180ms] ← BOTTLENECK!
Total: 202ms

Trace ID: abc-123 links all services together
```
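
Reading the trace above amounts to finding the span that accounts for most of the request time. The sketch below assumes a flat list of spans with hypothetical `name` and `durationMs` fields; real tracing backends store a tree of spans with parent IDs and timestamps.

```typescript
interface SpanSummary {
  name: string;
  durationMs: number;
}

// Durations copied from the trace diagram above.
const spans: SpanSummary[] = [
  { name: "API Gateway", durationMs: 2 },
  { name: "Auth Service", durationMs: 15 },
  { name: "Order Service", durationMs: 5 },
  { name: "Database", durationMs: 180 },
];

// Returns the slowest span and its share of the summed span time.
function findBottleneck(all: SpanSummary[]): { span: SpanSummary; share: number } {
  if (all.length === 0) throw new Error("trace has no spans");
  const total = all.reduce((sum, s) => sum + s.durationMs, 0);
  const slowest = all.reduce((worst, s) => (s.durationMs > worst.durationMs ? s : worst));
  return { span: slowest, share: total === 0 ? 0 : slowest.durationMs / total };
}

const bottleneck = findBottleneck(spans);
console.log(`${bottleneck.span.name}: ${(bottleneck.share * 100).toFixed(0)}% of the request`);
```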

## Distributed Tracing Setup

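```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

// `prisma` (a PrismaClient instance) and `verifyAuth` are assumed to be defined elsewhere.
const tracer = trace.getTracer("order-service");

async function createOrder(data: { userId: string; token: string }) {
  // startActiveSpan makes this span the parent of every span started inside the callback.
  return tracer.startActiveSpan("createOrder", async (span) => {
    span.setAttribute("userId", data.userId);
    try {
      await tracer.startActiveSpan("verifyAuth", async (authSpan) => {
        try {
          return await verifyAuth(data.token);
        } finally {
          authSpan.end(); // a span that is never ended is never exported
        }
      });

      return await tracer.startActiveSpan("db.insert", async (dbSpan) => {
        try {
          return await prisma.order.create({ data });
        } finally {
          dbSpan.end();
        }
      });
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```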
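
Spans from different services only join the same trace if the trace context travels with each request. OpenTelemetry SDKs do this by default with the W3C `traceparent` header; the sketch below hand-rolls that header format purely to show what goes over the wire. `formatTraceparent` and `parseTraceparent` are hypothetical helpers, not OpenTelemetry APIs.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean;
}

// W3C Trace Context, version 00: "00-<trace-id>-<parent-id>-<flags>"
function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

function parseTraceparent(header: string): TraceContext | null {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Caller side: attach the header to the outgoing request.
const outgoing = formatTraceparent({
  traceId: "4bf92f3577b34da6a3ce929d0e0e4736",
  spanId: "00f067aa0ba902b7",
  sampled: true,
});

// Callee side: continue the same trace instead of starting a new one.
const incoming = parseTraceparent(outgoing);
console.log(outgoing, incoming?.traceId);
```

In a real service the SDK's HTTP instrumentation injects and extracts this header, so application code rarely touches it directly.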

## When Each Helps

| Symptom | Tool | Example |
|---------|------|---------|
| "Feature X is broken" | **Logs** | Error stack traces |
| "System is slow" | **Metrics** | p99 latency spike |
| "Which service is slow?" | **Traces** | 180ms in DB call |
| "How many 500s today?" | **Metrics** | Error rate graph |
| "Why did this request fail?" | **Traces + Logs** | Trace ID in logs |
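
The last row is the one that answers the original question: once every service writes the trace ID into its log lines, a slow trace can be joined to the logs it produced. The sketch below works over in-memory JSON lines and assumes hypothetical `timestamp`, `service`, `traceId`, and `message` fields; in practice this would be a query in the log backend.

```typescript
interface LogEntry {
  timestamp: string; // ISO 8601, so string order matches time order
  service: string;
  traceId: string;
  message: string;
}

// Returns every entry for one trace, oldest first, skipping lines that are not valid JSON.
function logsForTrace(jsonLines: string[], traceId: string): LogEntry[] {
  const entries: LogEntry[] = [];
  for (const raw of jsonLines) {
    try {
      const entry = JSON.parse(raw) as LogEntry;
      if (entry.traceId === traceId) entries.push(entry);
    } catch {
      // Ignore unstructured lines such as stack-trace fragments.
    }
  }
  return entries.sort((a, b) => (a.timestamp < b.timestamp ? -1 : a.timestamp > b.timestamp ? 1 : 0));
}

const rawLines = [
  '{"timestamp":"2024-01-01T00:00:00.190Z","service":"order-service","traceId":"abc-123","message":"insert finished"}',
  '{"timestamp":"2024-01-01T00:00:00.002Z","service":"api-gateway","traceId":"abc-123","message":"request received"}',
  '{"timestamp":"2024-01-01T00:00:00.050Z","service":"auth-service","traceId":"zzz-999","message":"token verified"}',
  "plain text line",
];

const timeline = logsForTrace(rawLines, "abc-123");
console.log(timeline.map((e) => `${e.service}: ${e.message}`));
```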

## Stack Recommendation

| Tool | For | Role |
|------|-----|---|
| **Grafana** | Dashboards | Visualization |
| **Prometheus** | Metrics | Time-series data |
| **Loki/ELK** | Logs | Log aggregation |
| **Jaeger/Tempo** | Traces | Distributed tracing |
| **OpenTelemetry** | Instrumentation | Vendor-neutral collection |
