19. Observability for trading systems: logging, metrics, and tracing basics

Trading systems are noisy and time-sensitive. When something goes wrong—missed ticks, delayed jobs, failed API calls—you need answers fast. Observability is the safety net: logs for context, metrics for trends, and tracing for end-to-end visibility.

This post documents the baseline observability setup in this repo and how it maps to real production concerns.

1. Logging: structured, centralized, and consistent

The backend uses Loguru as the primary logger, with a custom configurator to capture both Loguru logs and standard Python logging.

  • backend/src/core/logging_config.py

Key points:

  • Logs go to stdout for container logs.
  • Logs are also written to a daily-rotated file under backend/logs/.
  • Standard logging is intercepted and routed through Loguru.

This keeps logging consistent across routers, services, and middleware.

2. Metrics: Prometheus + FastAPI Instrumentator

Metrics are exposed using prometheus_fastapi_instrumentator:

# backend/src/app_factory.py

Instrumentator().instrument(app).expose(app)

This automatically adds:

  • request counts
  • latency histograms
  • status code breakdowns
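Conceptually, those three signal families are just counters and per-route latency samples. A stdlib-only stand-in (purely illustrative, not the instrumentator's actual internals) makes the shape of the data concrete:

```python
import time
from collections import Counter, defaultdict


class ToyInstrumentator:
    """Illustrative stand-in for the three signals the real
    instrumentator exports: request counts, latencies, status codes."""

    def __init__(self):
        self.requests = Counter()           # (method, path) -> hit count
        self.statuses = Counter()           # status code -> hit count
        self.latencies = defaultdict(list)  # (method, path) -> [seconds]

    def track(self, method, path, handler):
        """Time a handler call and record count, latency, and status."""
        start = time.perf_counter()
        status, body = handler()
        elapsed = time.perf_counter() - start
        self.requests[(method, path)] += 1
        self.statuses[status] += 1
        self.latencies[(method, path)].append(elapsed)
        return status, body
```

The real middleware does the same bookkeeping per request and renders it in the Prometheus exposition format at /metrics, with latencies bucketed into histograms rather than kept as raw samples.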

The Docker Compose stack includes Prometheus and Grafana:

  • docker-compose.yml
    • prometheus
    • grafana
    • provisioning config mounted from monitoring

This lets me visualize API throughput, error rates, and latency patterns during development.
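On the Prometheus side, all that is needed is a scrape job pointing at the backend's /metrics endpoint. A sketch of the relevant fragment (job name, service name, and port are assumptions; the actual file lives in the mounted monitoring config):

```yaml
# prometheus.yml (sketch) — scrape the FastAPI backend's metrics endpoint
scrape_configs:
  - job_name: backend              # assumed job name
    metrics_path: /metrics
    static_configs:
      - targets: ["backend:8000"]  # assumed Compose service name and port
```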

3. Tracing: baseline today, hooks for later

At the moment, there is no distributed tracing. That’s a conscious choice while the system is still evolving.

The code is already structured in a way that makes tracing easy to add later:

  • HTTP client lives in app.state (single client per process)
  • services are thin, with clear boundaries
  • middleware already touches request lifecycle

When I introduce tracing, the first steps will be:

  • OpenTelemetry middleware for FastAPI
  • tracing for outbound HTTP calls to the Go worker
  • trace IDs propagated through logs

Even before full tracing, I still treat correlation IDs as a must-have for debugging retries and timeouts.
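A correlation ID is cheap to carry without any tracing library: a request-scoped contextvars value that every log line can reference. A minimal sketch (helper names are hypothetical, not repo code):

```python
import uuid
from contextvars import ContextVar
from typing import Optional

# Request-scoped correlation ID; ContextVar keeps it isolated per
# task/request even under asyncio concurrency.
_correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")


def bind_correlation_id(incoming: Optional[str] = None) -> str:
    """Reuse an inbound X-Request-ID-style header value, or mint a new ID.

    Called once per request, e.g. from middleware.
    """
    cid = incoming or uuid.uuid4().hex[:12]
    _correlation_id.set(cid)
    return cid


def with_cid(message: str) -> str:
    """Prefix a log message with the current correlation ID."""
    return f"[{_correlation_id.get()}] {message}"
```

Grepping one ID then reconstructs a request's whole story across retries and timeouts; the same value can later become the trace ID once OpenTelemetry lands.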

4. What matters most in trading systems

For trading workflows, the most valuable signals are:

  • Latency: time from market-data fetch → DB insert (end-to-end ingestion latency)
  • Error rate: failed fetches, bad payloads, and database errors
  • Event volume: number of ticks ingested per minute
  • Cache hit ratio: Redis API key lookups (worker endpoints)

This project already logs many of those events; the next step is to turn the key ones into metrics.
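Turning those logged events into metrics amounts to a small aggregator fed from the ingestion path. A hypothetical sketch of the four signals above (names and structure are assumptions, not repo code):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IngestionStats:
    """Aggregates logged ingestion events into the four key signals."""

    latencies_s: List[float] = field(default_factory=list)  # fetch -> DB insert
    ticks: int = 0
    errors: int = 0
    cache_hits: int = 0
    cache_lookups: int = 0

    def record_tick(self, latency_s: float) -> None:
        self.ticks += 1
        self.latencies_s.append(latency_s)

    def record_error(self) -> None:
        self.errors += 1

    def record_cache_lookup(self, hit: bool) -> None:
        self.cache_lookups += 1
        if hit:
            self.cache_hits += 1

    @property
    def error_rate(self) -> float:
        total = self.ticks + self.errors
        return self.errors / total if total else 0.0

    @property
    def avg_latency_s(self) -> float:
        return sum(self.latencies_s) / len(self.latencies_s) if self.latencies_s else 0.0

    @property
    def cache_hit_ratio(self) -> float:
        return self.cache_hits / self.cache_lookups if self.cache_lookups else 0.0
```

In practice each of these would be a Prometheus counter or histogram rather than in-process state, but the mapping from log events to signals is the same.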