# API LGTM OTEL Integration
This runbook documents how the API app exports logs, traces, and application metrics to the LGTM stack.
## Signal Flow

- The API emits telemetry via the OpenTelemetry SDK (`NodeSDK`) over OTLP HTTP.
- The OTEL Collector receives telemetry at `:4318`.
- The Collector routes traces to Tempo, logs to Loki, and metrics to Prometheus (`:8889`).
- Grafana queries Loki/Tempo/Prometheus for correlated request, runtime, cache, and database diagnostics.
## Runtime Components

- Bootstrap: `main.ts` initializes and shuts down the OTEL SDK (`initializeOtelSdk` / `shutdownOtelSdk`).
- Traces:
  - HTTP requests are wrapped by `RequestObservabilityInterceptor` (`SpanKind.SERVER`).
  - Identity event publishing creates producer spans (`SpanKind.PRODUCER`) via the shared `BasePublisher`.
  - Notifications consumers create consumer spans (`SpanKind.CONSUMER`) via the shared `BaseConsumer`.
- Logs:
  - `AppLoggerService` emits structured OTEL logs.
  - Active `trace.id`/`span.id` are attached automatically when span context exists.
  - Every HTTP request emits an `http.request.completed` log with `requestId`, route/controller metadata, status code, latency, and an optional `userId`.
  - Notifications command handlers emit namespaced error events for email and SMS failure paths, so async delivery issues are visible in Loki without querying the notifications database first.
- Metrics:
  - `AppMetricsService` is a facade that exposes namespaced services (`metrics.http`, `metrics.runtime`, `metrics.postgres`, `metrics.redis`).
  - `HttpRequestMetricsService` emits HTTP RED metrics (`http_requests_total`, `http_request_duration_seconds`, `http_requests_in_flight`).
  - `RuntimeMetricsService` emits runtime metrics from inside the API process (`nodejs_eventloop_lag_seconds`, `nodejs_active_handles_total`, `nodejs_active_requests_total`).
  - `PostgresMetricsService` emits app-perspective database metrics (`app_repository_operation_duration_seconds`, `app_repository_operation_errors_total`, `app_db_query_duration_seconds`, `app_db_query_errors_total`). Repository methods record code-path latency/errors, while `instrumentPostgresDataSource(...)` wraps TypeORM query runners to capture low-cardinality SQL latency/errors for every connection.
  - `RedisMetricsService` emits app-perspective cache metrics (`app_redis_operation_duration_seconds`, `app_redis_operation_errors_total`, `app_cache_hit_total`, `app_cache_miss_total`). It also owns the Redis operation telemetry lifecycle through `recordOperation(...)`, including duration, error handling, and key-prefix resolution.
  - You can inject either the facade (`AppMetricsServiceKey`) or a specific metrics service key (for example `PostgresMetricsServiceKey` or `RedisMetricsServiceKey`).
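For orientation, here is a minimal bootstrap sketch, assuming the standard `@opentelemetry/sdk-node` and OTLP HTTP exporter packages; the real `initializeOtelSdk`/`shutdownOtelSdk` in `main.ts` may differ in detail, and log export wiring is omitted here:

```ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';

const endpoint =
  process.env.OBSERVABILITY_OTLP_HTTP_ENDPOINT ?? 'http://otel-collector:4318';

const sdk = new NodeSDK({
  serviceName: process.env.OBSERVABILITY_SERVICE_NAME ?? 'daramex-api',
  traceExporter: new OTLPTraceExporter({ url: `${endpoint}/v1/traces` }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: `${endpoint}/v1/metrics` }),
  }),
});

export function initializeOtelSdk(): void {
  sdk.start(); // must run before Nest bootstraps so instrumentation hooks load first
}

export async function shutdownOtelSdk(): Promise<void> {
  await sdk.shutdown(); // flushes pending spans/metrics on graceful shutdown
}
```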
## Notification Error Event Catalog

| Event name | Level | Emitted by | Purpose |
|---|---|---|---|
| `notifications.email.template_missing` | warn | `SendEmailNotificationHandler` | Template key is not registered in the notifications template registry. |
| `notifications.email.render_failed` | error | `SendEmailNotificationHandler` | Email subject generation or HTML rendering failed before SMTP delivery. |
| `notifications.email.delivery_failed` | error | `SendEmailNotificationHandler` | SMTP delivery failed after template rendering completed. |
| `notifications.email.log_persist_failed` | error | `SendEmailNotificationHandler` | Notification audit row could not be stored in `notifications.notification_logs`. |
| `notifications.sms.provider_misconfigured` | error | `SendSmsNotificationHandler` | Twilio credentials or sender configuration are missing. |
| `notifications.sms.delivery_failed` | error | `SendSmsNotificationHandler` | Twilio rejected or failed an SMS delivery attempt. |
| `notifications.sms.log_persist_failed` | error | `SendSmsNotificationHandler` | Notification audit row could not be stored in `notifications.notification_logs`. |
All notification events include the command context (`templateKey`, `recipient`, `channel`, and any available `userId`, `clientId`, or `agencyId`). SMS failures also include Twilio diagnostics when available, such as `providerErrorCode`, `providerMoreInfo`, `providerDetails`, and `missingVariables`.
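For illustration, a hedged sketch of a delivery-failure path; the `AppLoggerService` call shape and the Twilio client surface below are assumptions, not the actual handler code:

```ts
// Hedged sketch: the logger and twilio types here are illustrative stand-ins.
async function sendSms(
  logger: { error(event: string, attrs: Record<string, unknown>): void },
  twilio: { send(to: string, body: string): Promise<void> },
  ctx: { templateKey: string; recipient: string; userId?: string },
  body: string,
): Promise<void> {
  try {
    await twilio.send(ctx.recipient, body);
  } catch (error) {
    // Namespaced event name matches the catalog above; Twilio diagnostics
    // (providerErrorCode, providerMoreInfo) are attached when available.
    logger.error('notifications.sms.delivery_failed', {
      ...ctx,
      channel: 'sms',
      providerErrorCode: (error as { code?: number }).code,
      providerMoreInfo: (error as { moreInfo?: string }).moreInfo,
    });
    throw error;
  }
}
```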
## Request-Level Operator Signals

- Each HTTP response includes `x-request-id`. If the caller already sends that header, the API reuses it; otherwise it generates a new UUIDv7-compatible request id.
- `http.request.completed` is emitted at `info` for successful responses, `warn` for 4xx, and `error` for 5xx (sketched below), which makes request failures visible in Loki even when the controller code does not log explicitly.
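The level mapping reduces to a small helper; this sketches the rule described above, though the actual `RequestObservabilityInterceptor` code may differ:

```ts
// Map an HTTP status code to the log level used for http.request.completed.
function levelFor(statusCode: number): 'info' | 'warn' | 'error' {
  if (statusCode >= 500) return 'error'; // 5xx: server-side failure
  if (statusCode >= 400) return 'warn';  // 4xx: client error, still operator-visible
  return 'info';                         // 2xx/3xx: normal completion
}
```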
## Metric Catalog

### HTTP + Runtime
| Metric name | Type | Attributes | Purpose |
|---|---|---|---|
| `http_requests_total` | Counter | `method`, `route`, `status_code` | Request throughput + errors |
| `http_request_duration_seconds` | Histogram | `method`, `route`, `status_code` | Request latency percentiles |
| `http_requests_in_flight` | UpDownCounter | `method` | Current concurrency |
| `nodejs_eventloop_lag_seconds` | Histogram | none | Node.js event loop saturation |
| `nodejs_active_handles_total` | ObservableGauge | none | Runtime handle pressure |
| `nodejs_active_requests_total` | ObservableGauge | none | Runtime request backlog signal |
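As a reference for how the HTTP RED metrics are produced, here is a sketch against the raw OTEL metrics API; the actual `HttpRequestMetricsService` wraps this behind its own interface, so treat the function names as illustrative:

```ts
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('http');
const requestsTotal = meter.createCounter('http_requests_total');
const duration = meter.createHistogram('http_request_duration_seconds');
const inFlight = meter.createUpDownCounter('http_requests_in_flight');

// Called when a request enters the pipeline.
function onRequestStart(method: string): void {
  inFlight.add(1, { method }); // http_requests_in_flight
}

// Called once the response is written.
function onRequestEnd(method: string, route: string, statusCode: number, seconds: number): void {
  inFlight.add(-1, { method });
  const attrs = { method, route, status_code: statusCode };
  requestsTotal.add(1, attrs);     // http_requests_total
  duration.record(seconds, attrs); // http_request_duration_seconds
}
```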
### Redis (Application Perspective)

These app metrics complement infrastructure Redis metrics from `redis-exporter` (memory, clients, server throughput).

Redis cache callers emit these metrics through `RedisMetricsService.recordOperation(...)`. The Redis callback stays focused on Redis I/O while the metrics service records duration/error telemetry around it and exposes cache hit/miss helpers for the callback body. A hedged caller sketch follows the table.
| Metric name | Type | Attributes | Purpose |
|---|---|---|---|
| `app_redis_operation_duration_seconds` | Histogram | `operation`, `key_prefix` | App-side Redis latency by operation |
| `app_redis_operation_errors_total` | Counter | `operation`, `key_prefix`, `error_type` | App-side Redis failures |
| `app_cache_hit_total` | Counter | `cache_name`, `key_prefix` | Cache hits from API code path |
| `app_cache_miss_total` | Counter | `cache_name`, `key_prefix` | Cache misses from API code path |
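The interfaces below are assumptions reconstructed from the description above, not the actual service types; the sketch only shows the caller shape:

```ts
// Hedged sketch: recordOperation(...)'s exact signature and the hit/miss
// helpers are assumptions based on this runbook's description.
interface RedisTelemetry {
  recordHit(cacheName: string): void;  // -> app_cache_hit_total
  recordMiss(cacheName: string): void; // -> app_cache_miss_total
}
interface RedisMetrics {
  recordOperation<T>(
    operation: string,
    key: string, // key_prefix is derived from this by the service
    fn: (t: RedisTelemetry) => Promise<T>,
  ): Promise<T>;
}

async function getCachedClient(
  metrics: RedisMetrics,
  redis: { get(key: string): Promise<string | null> },
  id: string,
): Promise<unknown> {
  // The wrapper records app_redis_operation_duration_seconds and, on throw,
  // app_redis_operation_errors_total; the callback stays pure Redis I/O.
  return metrics.recordOperation('get', `client:${id}`, async (t) => {
    const raw = await redis.get(`client:${id}`);
    if (raw === null) {
      t.recordMiss('client-cache');
      return null;
    }
    t.recordHit('client-cache');
    return JSON.parse(raw);
  });
}
```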
### PostgreSQL (Application Perspective)

These metrics complement infrastructure PostgreSQL metrics from `postgres-exporter` (connections, locks, deadlocks, buffer hit ratio, row throughput). The API dashboard stays focused on app-perspective database telemetry, while the community PostgreSQL dashboard reads exporter metrics directly.

TypeORM connections are instrumented centrally in `database.config.ts` via `instrumentPostgresDataSource(...)`, which wraps every created query runner. Repository implementations also wrap their public methods through `PostgresMetricsService.recordRepositoryOperation(...)` so you can compare code-path latency to raw query latency. A hedged repository sketch follows the table.
| Metric name | Type | Attributes | Purpose |
|---|---|---|---|
| `app_repository_operation_duration_seconds` | Histogram | `module`, `repository`, `method` | App-side repository latency by code path |
| `app_repository_operation_errors_total` | Counter | `module`, `repository`, `method`, `error_type` | App-side repository failures by code path |
| `app_db_query_duration_seconds` | Histogram | `connection`, `query_kind`, `entity` | App-side SQL latency by connection and table |
| `app_db_query_errors_total` | Counter | `connection`, `query_kind`, `entity`, `error_type` | App-side SQL failures by connection and table |
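A hedged sketch of the repository-side wrapper; `recordRepositoryOperation(...)`'s signature below is an assumption for illustration:

```ts
// Hedged sketch: labels match the metric attributes in the table above.
interface PostgresMetrics {
  recordRepositoryOperation<T>(
    labels: { module: string; repository: string; method: string },
    fn: () => Promise<T>,
  ): Promise<T>;
}

// Code-path timing (app_repository_operation_*) wraps the whole method;
// raw SQL timing (app_db_query_*) comes separately from the query runners
// wrapped by instrumentPostgresDataSource(...), so the two can be compared.
async function findByEmail(
  metrics: PostgresMetrics,
  repo: { findOne(where: { email: string }): Promise<unknown | null> },
  email: string,
): Promise<unknown | null> {
  return metrics.recordRepositoryOperation(
    { module: 'identity', repository: 'UserRepository', method: 'findByEmail' },
    () => repo.findOne({ email }),
  );
}
```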
## Dashboard

- Provisioned dashboard: `infra/observability/grafana-provisioning/dashboards/infra/infra-api-observability-dashboard.json`
- Folder in Grafana: `Infra`
- Focus: API request rate/errors/latency, in-flight requests, event loop lag, Node runtime counters, Redis app-perspective latency/errors/hit ratio, and Postgres repository/query latency and errors.
- Community PostgreSQL dashboard: `infra/observability/grafana-provisioning/dashboards/community/postgresql-dashboard.json`
## Identity -> Notifications Traceability

Identity and Notifications are independent modules that communicate through integration events.

- The Identity publisher attaches `metadata` to each integration event: `eventId`, `correlationId`, `occurredAt`, `producer`, and `trace` (`traceId`, `spanId`, `traceFlags`).
- Notifications consumers use span links from `event.metadata.trace`.
- This keeps module boundaries explicit while preserving end-to-end diagnostics in Tempo. A consumer-side sketch follows.
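A consumer-side sketch, assuming the metadata shape listed above; the shared `BaseConsumer` presumably does the equivalent of this internally:

```ts
import { trace, SpanKind } from '@opentelemetry/api';

const tracer = trace.getTracer('notifications');

function startConsumerSpan(event: {
  metadata: { trace: { traceId: string; spanId: string; traceFlags: number } };
}) {
  return tracer.startSpan('notifications.consume', {
    kind: SpanKind.CONSUMER,
    // Link back to the Identity producer span instead of parenting under it,
    // so each module keeps its own trace while Tempo can still join them.
    links: [{ context: event.metadata.trace }],
  });
}
```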
## Environment Variables (apps/api/.env)

| Variable | Default | Purpose |
|---|---|---|
| `OBSERVABILITY_ENABLED` | `true` | Enables/disables OTEL SDK initialization. |
| `OBSERVABILITY_OTLP_HTTP_ENDPOINT` | `http://otel-collector:4318` | Base OTLP HTTP endpoint used for `/v1/traces`, `/v1/logs`, and `/v1/metrics`. |
| `OBSERVABILITY_SERVICE_NAME` | `daramex-api` | Service name sent in OTEL resource attributes. |
| `OBSERVABILITY_SERVICE_VERSION` | `0.0.1` | Service version sent in OTEL resource attributes. |
| `OBSERVABILITY_LOG_LEVEL` | `info` | Application logger threshold (`debug` \| `info` \| `warn` \| `error`). |
## Verification Steps

- Start the API with `OBSERVABILITY_ENABLED=true` and the collector reachable.
- Perform an identity action that emits an integration event (for example, user registration).
- In Grafana Tempo, verify:
  - The HTTP server span exists.
  - The Identity producer span exists.
  - The Notifications consumer span exists with a link to the producer span.
- In Grafana Loki, verify structured log events with a matching `trace.id`.
- Force a notifications failure path (for example, request an unknown email template in a consumer test flow or remove a Twilio variable locally) and verify one of the `notifications.email.*` or `notifications.sms.*` events appears with the expected attributes.
- In Grafana Explore (Prometheus), verify metrics are present with:
  - `sum(rate(http_requests_total{service_name="daramex-api"}[5m]))`
  - `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service_name="daramex-api"}[5m])) by (le))`
  - `sum(rate(app_repository_operation_duration_seconds_count{service_name="daramex-api"}[5m])) by (module, repository, method)`
  - `sum(rate(app_db_query_duration_seconds_count{service_name="daramex-api"}[5m])) by (connection, query_kind, entity)`
  - `sum(rate(app_redis_operation_duration_seconds_count{service_name="daramex-api"}[5m])) by (operation)`
## Troubleshooting

- No traces in Tempo:
  - Confirm `OBSERVABILITY_ENABLED=true`.
  - Confirm the API can reach `OBSERVABILITY_OTLP_HTTP_ENDPOINT`.
  - Confirm the collector traces pipeline exports to Tempo (`infra/observability/collector-config.yaml`).
- No logs in Loki:
  - Confirm app logs are emitted through `AppLoggerService`.
  - Confirm the collector logs pipeline exports to Loki.
- Notification failures only appear in the database:
  - Confirm the Notifications commands execute through `SendEmailNotificationHandler` or `SendSmsNotificationHandler`.
  - Confirm error paths are using the structured notification events listed in this runbook.
  - Confirm the failing workflow still has an active trace context if you expect `trace.id`/`span.id` correlation in Loki.
- No metrics in Prometheus:
  - Confirm the `NodeSDK` metrics reader is configured with the OTLP HTTP metrics exporter (`/v1/metrics`).
  - Confirm the collector metrics pipeline exports to Prometheus.
  - Confirm the collector Prometheus exporter has `resource_to_telemetry_conversion.enabled: true` so `service_name` labels are available.
- No postgres-exporter metrics:
  - Confirm `postgres-exporter` is running in the infra compose stack.
  - Confirm Prometheus scrapes `postgres-exporter:9187` with job `postgres`.
  - Confirm the exporter can reach the `database` container with the configured `DATA_SOURCE_URI`, `DATA_SOURCE_USER`, and `DATA_SOURCE_PASS`.
- No app-side Postgres metrics:
  - Confirm repository methods execute through instrumented implementations in `modules/**/infrastructure/repositories/*.repository.impl.ts`.
  - Confirm `createSchemaDataSourceOptions(...)` initializes the datasource through `instrumentPostgresDataSource(...)`.
- Missing module linkage (identity -> notifications):
  - Confirm the published event has `metadata.trace`.
  - Confirm consumer spans are created with `links: [event.metadata.trace]`.