
API LGTM OTEL Integration

This runbook documents how the API app exports logs, traces, and application metrics to the LGTM stack.

Signal Flow

  1. API emits telemetry via OpenTelemetry SDK (NodeSDK) to OTLP HTTP.
  2. OTEL Collector receives telemetry at :4318.
  3. Collector routes traces to Tempo, logs to Loki, and metrics to Prometheus (:8889).
  4. Grafana queries Loki/Tempo/Prometheus for correlated request, runtime, cache, and database diagnostics.
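The signal flow above hinges on a single OTLP HTTP base endpoint fanning out to fixed per-signal paths. A minimal sketch of that mapping (the helper name `otlpSignalUrls` is illustrative, not from the codebase; the `/v1/*` paths are fixed by the OTLP/HTTP specification):

```typescript
// Derive per-signal OTLP HTTP URLs from the base collector endpoint.
// OTLP/HTTP fixes these paths: /v1/traces, /v1/metrics, /v1/logs.
function otlpSignalUrls(baseEndpoint: string) {
  const base = baseEndpoint.replace(/\/+$/, ""); // tolerate trailing slashes
  return {
    traces: `${base}/v1/traces`,
    metrics: `${base}/v1/metrics`,
    logs: `${base}/v1/logs`,
  };
}

// Example: the collector listens on :4318 for all three signals.
const urls = otlpSignalUrls("http://otel-collector:4318/");
console.log(urls.traces); // http://otel-collector:4318/v1/traces
```

This is why a single OBSERVABILITY_OTLP_HTTP_ENDPOINT value is enough to configure traces, logs, and metrics exporters at once.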

Runtime Components

  • Bootstrap: main.ts initializes and shuts down the OTEL SDK (initializeOtelSdk / shutdownOtelSdk).
  • Traces:
    • HTTP requests are wrapped by RequestObservabilityInterceptor (SpanKind.SERVER).
    • Identity event publishing creates producer spans (SpanKind.PRODUCER) via shared BasePublisher.
    • Notifications consumers create consumer spans (SpanKind.CONSUMER) via shared BaseConsumer.
  • Logs:
    • AppLoggerService emits structured OTEL logs.
    • Active trace.id / span.id are attached automatically when span context exists.
    • Every HTTP request emits an http.request.completed log with requestId, route/controller metadata, status code, latency, and optional userId.
    • Notifications command handlers emit namespaced error events for email and SMS failure paths, so async delivery issues are visible in Loki without querying the notifications database first.
  • Metrics:
    • AppMetricsService is a facade that exposes namespaced services (metrics.http, metrics.runtime, metrics.postgres, metrics.redis).
    • HttpRequestMetricsService emits HTTP RED metrics (http_requests_total, http_request_duration_seconds, http_requests_in_flight).
    • RuntimeMetricsService emits runtime metrics from inside the API process (nodejs_eventloop_lag_seconds, nodejs_active_handles_total, nodejs_active_requests_total).
    • PostgresMetricsService emits app-perspective database metrics (app_repository_operation_duration_seconds, app_repository_operation_errors_total, app_db_query_duration_seconds, app_db_query_errors_total). Repository methods record code-path latency/errors, while instrumentPostgresDataSource(...) wraps TypeORM query runners to capture low-cardinality SQL latency/errors for every connection.
    • RedisMetricsService emits app-perspective cache metrics (app_redis_operation_duration_seconds, app_redis_operation_errors_total, app_cache_hit_total, app_cache_miss_total). It also owns the Redis operation telemetry lifecycle through recordOperation(...), including duration, error handling, and key-prefix resolution.
    • You can inject either the facade (AppMetricsServiceKey) or a specific metrics service key (for example PostgresMetricsServiceKey or RedisMetricsServiceKey).
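As a rough sketch of what HttpRequestMetricsService records around each request, assuming plain in-memory stand-ins for the three OTEL instruments (the real service uses OTEL Counter/Histogram/UpDownCounter APIs; names here are illustrative):

```typescript
// In-memory stand-ins for http_requests_total, http_request_duration_seconds,
// and http_requests_in_flight.
type Labels = { method: string; route: string; status_code?: string };

const store = {
  requestsTotal: new Map<string, number>(),
  durations: [] as { labels: Labels; seconds: number }[],
  inFlight: new Map<string, number>(),
};

// Wrap a handler: bump in-flight on entry, record count + duration with
// the final status_code, and always decrement in-flight on exit.
function recordHttpRequest(labels: Labels, handler: () => number): number {
  const inFlightKey = labels.method;
  store.inFlight.set(inFlightKey, (store.inFlight.get(inFlightKey) ?? 0) + 1);
  const start = Date.now();
  try {
    const statusCode = handler();
    const full = { ...labels, status_code: String(statusCode) };
    const key = `${full.method} ${full.route} ${full.status_code}`;
    store.requestsTotal.set(key, (store.requestsTotal.get(key) ?? 0) + 1);
    store.durations.push({ labels: full, seconds: (Date.now() - start) / 1000 });
    return statusCode;
  } finally {
    store.inFlight.set(inFlightKey, (store.inFlight.get(inFlightKey) ?? 0) - 1);
  }
}

// Usage: one request through the wrapper.
recordHttpRequest({ method: "GET", route: "/users/:id" }, () => 200);
```

The try/finally shape is the important part: in-flight must return to its prior value even when the handler throws, otherwise the concurrency gauge drifts upward.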

Notification Error Event Catalog

| Event name | Level | Emitted by | Purpose |
| --- | --- | --- | --- |
| notifications.email.template_missing | warn | SendEmailNotificationHandler | Template key is not registered in the notifications template registry. |
| notifications.email.render_failed | error | SendEmailNotificationHandler | Email subject generation or HTML rendering failed before SMTP delivery. |
| notifications.email.delivery_failed | error | SendEmailNotificationHandler | SMTP delivery failed after template rendering completed. |
| notifications.email.log_persist_failed | error | SendEmailNotificationHandler | Notification audit row could not be stored in notifications.notification_logs. |
| notifications.sms.provider_misconfigured | error | SendSmsNotificationHandler | Twilio credentials or sender configuration are missing. |
| notifications.sms.delivery_failed | error | SendSmsNotificationHandler | Twilio rejected or failed an SMS delivery attempt. |
| notifications.sms.log_persist_failed | error | SendSmsNotificationHandler | Notification audit row could not be stored in notifications.notification_logs. |

All notification events include the command context (templateKey, recipient, channel, and any available userId, clientId, or agencyId). SMS failures also include Twilio diagnostics when available, such as providerErrorCode, providerMoreInfo, providerDetails, and missingVariables.
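A hypothetical sketch of how such a namespaced event could be assembled before it reaches the structured logger; the field names follow the catalog above, but the builder itself is illustrative, not the handlers' actual code:

```typescript
type NotificationContext = {
  templateKey: string;
  recipient: string;
  channel: "email" | "sms";
  userId?: string;
  clientId?: string;
  agencyId?: string;
};

type NotificationErrorEvent = {
  event: string;
  level: "warn" | "error";
  attributes: Record<string, string>;
};

// Build a structured event such as notifications.sms.delivery_failed,
// attaching only the optional ids that are actually present, plus any
// provider diagnostics (e.g. Twilio's providerErrorCode).
function buildNotificationEvent(
  name: string,
  level: "warn" | "error",
  ctx: NotificationContext,
  providerDiagnostics: Record<string, string> = {},
): NotificationErrorEvent {
  const attributes: Record<string, string> = {
    templateKey: ctx.templateKey,
    recipient: ctx.recipient,
    channel: ctx.channel,
    ...providerDiagnostics,
  };
  for (const key of ["userId", "clientId", "agencyId"] as const) {
    if (ctx[key]) attributes[key] = ctx[key]!;
  }
  return { event: name, level, attributes };
}
```

Omitting absent ids (rather than emitting empty strings) keeps Loki label filters like `userId != ""` meaningful.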

Request-Level Operator Signals

  • Each HTTP response includes x-request-id. If the caller already sends that header, the API reuses it; otherwise it generates a new UUIDv7-compatible request id.
  • http.request.completed is emitted at info for successful responses, warn for 4xx, and error for 5xx, which makes request failures visible in Loki even when the controller code does not log explicitly.
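Both behaviors reduce to small pure functions. A sketch, assuming an injected id generator so any UUIDv7-compatible source can be plugged in (function names are illustrative):

```typescript
// Map an HTTP status code to the log level used for http.request.completed:
// info for success, warn for 4xx, error for 5xx.
function requestLogLevel(statusCode: number): "info" | "warn" | "error" {
  if (statusCode >= 500) return "error";
  if (statusCode >= 400) return "warn";
  return "info";
}

// Reuse an incoming x-request-id header when present; otherwise
// generate a fresh request id with the injected generator.
function resolveRequestId(
  incoming: string | undefined,
  generate: () => string,
): string {
  return incoming && incoming.trim() !== "" ? incoming : generate();
}
```

Reusing the caller's id is what lets a gateway or frontend correlate its own logs with the API's `http.request.completed` entries.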

Metric Catalog

HTTP + Runtime

| Metric name | Type | Attributes | Purpose |
| --- | --- | --- | --- |
| http_requests_total | Counter | method, route, status_code | Request throughput + errors |
| http_request_duration_seconds | Histogram | method, route, status_code | Request latency percentiles |
| http_requests_in_flight | UpDownCounter | method | Current concurrency |
| nodejs_eventloop_lag_seconds | Histogram | none | Node.js event loop saturation |
| nodejs_active_handles_total | ObservableGauge | none | Runtime handle pressure |
| nodejs_active_requests_total | ObservableGauge | none | Runtime request backlog signal |

Redis (Application Perspective)

These app metrics complement infrastructure Redis metrics from redis-exporter (memory, clients, server throughput).

Redis cache callers emit these metrics through RedisMetricsService.recordOperation(...). The Redis callback stays focused on Redis I/O while the metrics service records duration/error telemetry around it and exposes cache hit/miss helpers for the callback body.
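A minimal sketch of that recordOperation(...) pattern, with plain in-memory recorders standing in for the OTEL instruments (the helper shape and key-prefix rule are illustrative, not the actual service API):

```typescript
const redisTelemetry = {
  durations: [] as { operation: string; key_prefix: string; seconds: number }[],
  errors: [] as { operation: string; key_prefix: string; error_type: string }[],
};

// Derive a low-cardinality key_prefix from a concrete Redis key,
// e.g. "session:42" -> "session".
function keyPrefix(key: string): string {
  const idx = key.indexOf(":");
  return idx === -1 ? key : key.slice(0, idx);
}

// Time a Redis callback, recording duration on success and an error
// sample (with error_type) on failure. The callback stays pure Redis
// I/O; all telemetry concerns live in the wrapper.
async function recordOperation<T>(
  operation: string,
  key: string,
  callback: () => Promise<T>,
): Promise<T> {
  const prefix = keyPrefix(key);
  const start = Date.now();
  try {
    const result = await callback();
    redisTelemetry.durations.push({
      operation,
      key_prefix: prefix,
      seconds: (Date.now() - start) / 1000,
    });
    return result;
  } catch (err) {
    redisTelemetry.errors.push({
      operation,
      key_prefix: prefix,
      error_type: err instanceof Error ? err.constructor.name : "UnknownError",
    });
    throw err; // telemetry must never swallow the failure
  }
}
```

A caller would wrap each cache access, e.g. `recordOperation("get", "session:42", () => redis.get("session:42"))`, and use the hit/miss helpers inside the callback body.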

| Metric name | Type | Attributes | Purpose |
| --- | --- | --- | --- |
| app_redis_operation_duration_seconds | Histogram | operation, key_prefix | App-side Redis latency by operation |
| app_redis_operation_errors_total | Counter | operation, key_prefix, error_type | App-side Redis failures |
| app_cache_hit_total | Counter | cache_name, key_prefix | Cache hits from API code path |
| app_cache_miss_total | Counter | cache_name, key_prefix | Cache misses from API code path |

PostgreSQL (Application Perspective)

These metrics complement infrastructure PostgreSQL metrics from postgres-exporter (connections, locks, deadlocks, buffer hit ratio, row throughput). The API dashboard stays focused on app-perspective database telemetry, while the community PostgreSQL dashboard reads exporter metrics directly.

TypeORM connections are instrumented centrally in database.config.ts via instrumentPostgresDataSource(...), which wraps every created query runner. Repository implementations also wrap their public methods through PostgresMetricsService.recordRepositoryOperation(...) so you can compare code-path latency to raw query latency.
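The query_kind and entity attributes on the app_db_query_* metrics must stay low-cardinality, which means raw SQL has to be reduced to a handful of labels. A simplified sketch of such a classifier (the actual logic inside instrumentPostgresDataSource(...) may differ):

```typescript
// Reduce raw SQL to low-cardinality labels: the statement verb becomes
// query_kind and the first referenced table becomes entity. Anything
// unrecognized collapses to "other"/"unknown" to cap label cardinality.
function classifyQuery(sql: string): { query_kind: string; entity: string } {
  const text = sql.trim().toLowerCase();
  const kinds = ["select", "insert", "update", "delete"] as const;
  const query_kind = kinds.find((k) => text.startsWith(k)) ?? "other";
  const match =
    text.match(/\bfrom\s+"?([a-z0-9_.]+)"?/) ??
    text.match(/\binto\s+"?([a-z0-9_.]+)"?/) ??
    text.match(/\bupdate\s+"?([a-z0-9_.]+)"?/);
  return { query_kind, entity: match ? match[1] : "unknown" };
}
```

Emitting the raw SQL (or even parameterized SQL) as a label would explode Prometheus series counts; collapsing to verb + table keeps the `app_db_query_*` series bounded by the schema size.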

| Metric name | Type | Attributes | Purpose |
| --- | --- | --- | --- |
| app_repository_operation_duration_seconds | Histogram | module, repository, method | App-side repository latency by code path |
| app_repository_operation_errors_total | Counter | module, repository, method, error_type | App-side repository failures by code path |
| app_db_query_duration_seconds | Histogram | connection, query_kind, entity | App-side SQL latency by connection and table |
| app_db_query_errors_total | Counter | connection, query_kind, entity, error_type | App-side SQL failures by connection and table |

Dashboard

  • Provisioned dashboard: infra/observability/grafana-provisioning/dashboards/infra/infra-api-observability-dashboard.json
  • Folder in Grafana: Infra
  • Focus: API request rate/errors/latency, in-flight requests, event loop lag, Node runtime counters, Redis app-perspective latency/errors/hit ratio, Postgres repository/query latency, and Postgres repository/query errors.
  • Community PostgreSQL dashboard: infra/observability/grafana-provisioning/dashboards/community/postgresql-dashboard.json

Identity -> Notifications Traceability

Identity and Notifications are independent modules that communicate through integration events.

  • Identity publisher attaches metadata to each integration event:
    • eventId
    • correlationId
    • occurredAt
    • producer
    • trace (traceId, spanId, traceFlags)
  • Notifications consumers use span links from event.metadata.trace.
  • This keeps module boundaries explicit while preserving end-to-end diagnostics in Tempo.
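A sketch of how a consumer could turn event.metadata.trace into a span link, using plain types that stand in for the OTEL SpanContext / Link interfaces (the validation rule follows W3C Trace Context: 32 hex chars for traceId, 16 for spanId):

```typescript
type TraceMetadata = { traceId: string; spanId: string; traceFlags: number };

type SpanLink = {
  context: { traceId: string; spanId: string; traceFlags: number; isRemote: true };
};

// Build the links array for the consumer span. If the producer did not
// attach well-formed trace metadata, return no links rather than a
// malformed one, so the consumer span is still created.
function linksFromEventMetadata(trace?: TraceMetadata): SpanLink[] {
  if (!trace || trace.traceId.length !== 32 || trace.spanId.length !== 16) {
    return [];
  }
  return [
    {
      context: {
        traceId: trace.traceId,
        spanId: trace.spanId,
        traceFlags: trace.traceFlags,
        isRemote: true,
      },
    },
  ];
}
```

A link (rather than a parent-child relationship) is the right choice here: the consumer span belongs to its own trace, and Tempo renders the link so operators can jump from the consumer back to the producing request.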

Environment Variables (apps/api/.env)

| Variable | Default | Purpose |
| --- | --- | --- |
| OBSERVABILITY_ENABLED | true | Enables/disables OTEL SDK initialization. |
| OBSERVABILITY_OTLP_HTTP_ENDPOINT | http://otel-collector:4318 | Base OTLP HTTP endpoint used for /v1/traces, /v1/logs, and /v1/metrics. |
| OBSERVABILITY_SERVICE_NAME | daramex-api | Service name sent in OTEL resource attributes. |
| OBSERVABILITY_SERVICE_VERSION | 0.0.1 | Service version sent in OTEL resource attributes. |
| OBSERVABILITY_LOG_LEVEL | info | Application logger threshold (debug \| info \| warn \| error). |

Verification Steps

  1. Start the API with OBSERVABILITY_ENABLED=true and collector reachable.
  2. Perform an identity action that emits an integration event (for example user registration).
  3. In Grafana Tempo, verify:
    • HTTP server span exists.
    • Identity producer span exists.
    • Notifications consumer span exists with link to producer span.
  4. In Grafana Loki, verify structured log events with matching trace.id.
  5. Force a notifications failure path (for example, request an unknown email template in a consumer test flow or remove a Twilio variable locally) and verify one of the notifications.email.* or notifications.sms.* events appears with the expected attributes.
  6. In Grafana Explore (Prometheus), verify metrics are present with:
    • sum(rate(http_requests_total{service_name="daramex-api"}[5m]))
    • histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service_name="daramex-api"}[5m])) by (le))
    • sum(rate(app_repository_operation_duration_seconds_count{service_name="daramex-api"}[5m])) by (module, repository, method)
    • sum(rate(app_db_query_duration_seconds_count{service_name="daramex-api"}[5m])) by (connection, query_kind, entity)
    • sum(rate(app_redis_operation_duration_seconds_count{service_name="daramex-api"}[5m])) by (operation)

Troubleshooting

  • No traces in Tempo:
    • Confirm OBSERVABILITY_ENABLED=true.
    • Confirm API can reach OBSERVABILITY_OTLP_HTTP_ENDPOINT.
    • Confirm collector traces pipeline exports to Tempo (infra/observability/collector-config.yaml).
  • No logs in Loki:
    • Confirm app logs are emitted through AppLoggerService.
    • Confirm collector logs pipeline exports to Loki.
  • Notification failures only appear in the database:
    • Confirm the Notifications commands execute through SendEmailNotificationHandler or SendSmsNotificationHandler.
    • Confirm error paths are using the structured notification events listed in this runbook.
    • Confirm the failing workflow still has an active trace context if you expect trace.id / span.id correlation in Loki.
  • No metrics in Prometheus:
    • Confirm NodeSDK metrics reader is configured with OTLP HTTP metrics exporter (/v1/metrics).
    • Confirm collector metrics pipeline exports to Prometheus.
    • Confirm collector Prometheus exporter has resource_to_telemetry_conversion.enabled: true so service_name labels are available.
  • No postgres-exporter metrics:
    • Confirm postgres-exporter is running in the infra compose stack.
    • Confirm Prometheus scrapes postgres-exporter:9187 with job postgres.
    • Confirm the exporter can reach the database container with the configured DATA_SOURCE_URI, DATA_SOURCE_USER, and DATA_SOURCE_PASS.
  • No app-side Postgres metrics:
    • Confirm repository methods execute through instrumented implementations in modules/**/infrastructure/repositories/*.repository.impl.ts.
    • Confirm createSchemaDataSourceOptions(...) initializes the datasource through instrumentPostgresDataSource(...).
  • Missing module linkage (identity -> notifications):
    • Confirm published event has metadata.trace.
    • Confirm consumer spans are created with links: [event.metadata.trace].