19. Observability for trading systems: logging, metrics, and tracing basics
Trading systems are noisy and time-sensitive. When something goes wrong—missed ticks, delayed jobs, failed API calls—you need answers fast. Observability is the safety net: logs for context, metrics for trends, and tracing for end-to-end visibility.
This post documents the baseline observability setup in this repo and how it maps to real production concerns.
1. Logging: structured, centralized, and consistent
The backend uses Loguru as the primary logger, with a custom configurator to capture both Loguru logs and standard Python logging.
backend/src/core/logging_config.py
Key points:
- Logs go to stdout for container logs.
- Logs are also written to a daily-rotated file under backend/logs/.
- Standard logging is intercepted and routed through Loguru.
This keeps logging consistent across routers, services, and middleware.
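As a rough sketch (the levels, rotation window, and retention below are illustrative defaults rather than the repo's exact settings), the configurator boils down to an intercept handler plus two sinks:

```python
# Hedged sketch of the configurator; the real backend/src/core/logging_config.py
# may use different levels, rotation, and retention.
import logging
import sys

from loguru import logger


class InterceptHandler(logging.Handler):
    """Route records from the standard logging module through Loguru."""

    def emit(self, record: logging.LogRecord) -> None:
        try:
            level = logger.level(record.levelname).name
        except ValueError:
            level = record.levelno
        logger.opt(depth=6, exception=record.exc_info).log(level, record.getMessage())


def configure_logging() -> None:
    logger.remove()                                # drop the default sink
    logger.add(sys.stdout, level="INFO")           # container-friendly stdout
    logger.add(
        "backend/logs/app_{time:YYYY-MM-DD}.log",  # daily-rotated file
        rotation="00:00",
        retention="14 days",
    )
    # Send everything from the stdlib logging tree through the handler above.
    logging.basicConfig(handlers=[InterceptHandler()], level=0, force=True)
```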
2. Metrics: Prometheus + FastAPI Instrumentator
Metrics are exposed using prometheus_fastapi_instrumentator:
```python
# backend/src/app_factory.py
Instrumentator().instrument(app).expose(app)
```

This automatically adds:
- request counts
- latency histograms
- status code breakdowns
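In context, the wiring looks roughly like this; create_app() is an assumed factory name, and only the Instrumentator line is taken from the repo:

```python
# Illustrative wiring only: create_app() is an assumed factory name;
# the Instrumentator call is the piece actually used in the repo.
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator


def create_app() -> FastAPI:
    app = FastAPI()
    # Default HTTP metrics, served on GET /metrics
    Instrumentator().instrument(app).expose(app)
    return app
```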
The Docker Compose stack (docker-compose.yml) includes Prometheus and Grafana:
- a prometheus service
- a grafana service
- provisioning config mounted from the monitoring/ directory
This lets me visualize API throughput, error rates, and latency patterns during development.
3. Tracing: baseline today, hooks for later
At the moment, there is no distributed tracing. That’s a conscious choice while the system is still evolving.
The code is already structured in a way that makes tracing easy to add later:
- HTTP client lives in app.state (single client per process)
- services are thin, with clear boundaries
- middleware already touches request lifecycle
When I introduce tracing, the first steps will be:
- OpenTelemetry middleware for FastAPI
- tracing for outbound HTTP calls to the Go worker
- trace IDs propagated through logs
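None of this exists in the repo yet, but a first pass could look something like the sketch below, assuming the opentelemetry-instrumentation-fastapi and opentelemetry-instrumentation-httpx packages and an httpx-based outbound client:

```python
# Not in the repo yet: a sketch of the planned OpenTelemetry wiring, assuming
# opentelemetry-instrumentation-fastapi / -httpx and an httpx outbound client.
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter


def setup_tracing(app) -> None:
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    FastAPIInstrumentor.instrument_app(app)   # spans for inbound requests
    HTTPXClientInstrumentor().instrument()    # spans for outbound calls to the worker
```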
Even before full tracing, I still treat correlation IDs as a must-have for debugging retries and timeouts.
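A minimal version of that idea, assuming a FastAPI HTTP middleware plus Loguru's contextualize; the X-Request-ID header name is a convention I'd pick, not something already in the code:

```python
# Sketch of correlation-ID handling; the X-Request-ID header and UUID strategy
# are assumptions, not existing code.
import uuid

from fastapi import FastAPI, Request
from loguru import logger

app = FastAPI()


@app.middleware("http")
async def correlation_id_middleware(request: Request, call_next):
    correlation_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
    # Bind the ID into the extra fields of every log record emitted for this request.
    with logger.contextualize(correlation_id=correlation_id):
        response = await call_next(request)
    response.headers["X-Request-ID"] = correlation_id
    return response
```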
4. What matters most in trading systems
For trading workflows, the most valuable signals are:
- Latency: time from market-data fetch → DB insert (end-to-end ingestion latency)
- Error rate: failed fetches, bad payloads, and database errors
- Event volume: number of ticks ingested per minute
- Cache hit ratio: Redis API key lookups (worker endpoints)
This project already logs many of those events; the next step is to turn the key ones into metrics.
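As a sketch of where that could go with prometheus_client (metric names, labels, and buckets below are placeholders, not settled choices):

```python
# Placeholder metric names, labels, and buckets; nothing here is final.
from prometheus_client import Counter, Histogram

INGESTION_LATENCY = Histogram(
    "ingestion_latency_seconds",
    "Time from market-data fetch to DB insert",
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)
FETCH_ERRORS = Counter(
    "fetch_errors_total", "Failed fetches, bad payloads, and DB errors", ["kind"]
)
TICKS_INGESTED = Counter("ticks_ingested_total", "Ticks written to the database")
API_KEY_CACHE = Counter(
    "api_key_cache_total", "Redis API key lookups on worker endpoints", ["result"]
)

# Example usage inside the ingestion path:
# with INGESTION_LATENCY.time():
#     ticks = fetch_and_store()
# TICKS_INGESTED.inc(len(ticks))
# API_KEY_CACHE.labels(result="hit").inc()
```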