
API LGTM OTEL Integration

This runbook documents how the API app exports logs, traces, and application metrics to the LGTM stack.

Signal Flow

  1. API emits telemetry via OpenTelemetry SDK (NodeSDK) to OTLP HTTP.
  2. OTEL Collector receives telemetry at :4318.
  3. Collector routes traces to Tempo, logs to Loki, and metrics to Prometheus (:8889).
  4. Grafana queries Loki/Tempo/Prometheus for correlated request, runtime, cache, and database diagnostics.
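The signal flow above hinges on a single OTLP HTTP base endpoint fanning out to fixed per-signal paths. A minimal sketch of that mapping (the helper name `otlpSignalUrls` is illustrative, not from the codebase; the `/v1/*` paths are fixed by the OTLP/HTTP specification):

```typescript
// Derive per-signal OTLP HTTP URLs from the base collector endpoint.
// OTLP/HTTP fixes these paths: /v1/traces, /v1/metrics, /v1/logs.
function otlpSignalUrls(baseEndpoint: string) {
  const base = baseEndpoint.replace(/\/+$/, ""); // tolerate trailing slashes
  return {
    traces: `${base}/v1/traces`,
    metrics: `${base}/v1/metrics`,
    logs: `${base}/v1/logs`,
  };
}

// Example: the collector listens on :4318 for all three signals.
const urls = otlpSignalUrls("http://otel-collector:4318/");
console.log(urls.traces); // http://otel-collector:4318/v1/traces
```

This is why a single OBSERVABILITY_OTLP_HTTP_ENDPOINT value is enough to configure traces, logs, and metrics exporters at once.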

Runtime Components

  • Bootstrap: main.ts initializes and shuts down the OTEL SDK (initializeOtelSdk / shutdownOtelSdk).
  • Traces:
    • HTTP requests are wrapped by RequestObservabilityInterceptor (SpanKind.SERVER).
    • Identity event publishing creates producer spans (SpanKind.PRODUCER) via shared BasePublisher.
    • Notifications consumers create consumer spans (SpanKind.CONSUMER) via shared BaseConsumer.
  • Logs:
    • AppLoggerService emits structured OTEL logs.
    • Active trace.id / span.id are attached automatically when span context exists.
    • Every HTTP request emits an http.request.completed log with requestId, route/controller metadata, status code, latency, and optional userId.
    • Notifications command handlers emit namespaced error events for email and SMS failure paths, so async delivery issues are visible in Loki without querying the notifications database first.
  • Metrics:
    • AppMetricsService is a facade that exposes namespaced services (metrics.http, metrics.runtime, metrics.postgres, metrics.redis).
    • HttpRequestMetricsService emits HTTP RED metrics (http_requests_total, http_request_duration_seconds, http_requests_in_flight).
    • RuntimeMetricsService emits runtime metrics from inside the API process (nodejs_eventloop_lag_seconds, nodejs_active_handles_total, nodejs_active_requests_total).
    • PostgresMetricsService emits app-perspective database metrics (app_repository_operation_duration_seconds, app_repository_operation_errors_total, app_db_query_duration_seconds, app_db_query_errors_total). Repository methods record code-path latency/errors, while instrumentPostgresDataSource(...) wraps TypeORM query runners to capture low-cardinality SQL latency/errors for every connection.
    • RedisMetricsService emits app-perspective cache metrics (app_redis_operation_duration_seconds, app_redis_operation_errors_total, app_cache_hit_total, app_cache_miss_total). It also owns the Redis operation telemetry lifecycle through recordOperation(...), including duration, error handling, and key-prefix resolution.
    • You can inject either the facade (AppMetricsServiceKey) or a specific metrics service key (for example PostgresMetricsServiceKey or RedisMetricsServiceKey).
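As a rough sketch of what HttpRequestMetricsService records around each request, assuming plain in-memory stand-ins for the three OTEL instruments (the real service uses OTEL Counter/Histogram/UpDownCounter APIs; names here are illustrative):

```typescript
// In-memory stand-ins for http_requests_total, http_request_duration_seconds,
// and http_requests_in_flight.
type Labels = { method: string; route: string; status_code?: string };

const store = {
  requestsTotal: new Map<string, number>(),
  durations: [] as { labels: Labels; seconds: number }[],
  inFlight: new Map<string, number>(),
};

// Wrap a handler: bump in-flight on entry, record count + duration with
// the final status_code, and always decrement in-flight on exit.
function recordHttpRequest(labels: Labels, handler: () => number): number {
  const inFlightKey = labels.method;
  store.inFlight.set(inFlightKey, (store.inFlight.get(inFlightKey) ?? 0) + 1);
  const start = Date.now();
  try {
    const statusCode = handler();
    const full = { ...labels, status_code: String(statusCode) };
    const key = `${full.method} ${full.route} ${full.status_code}`;
    store.requestsTotal.set(key, (store.requestsTotal.get(key) ?? 0) + 1);
    store.durations.push({ labels: full, seconds: (Date.now() - start) / 1000 });
    return statusCode;
  } finally {
    store.inFlight.set(inFlightKey, (store.inFlight.get(inFlightKey) ?? 0) - 1);
  }
}

// Usage: one request through the wrapper.
recordHttpRequest({ method: "GET", route: "/users/:id" }, () => 200);
```

The try/finally shape is the important part: in-flight must return to its prior value even when the handler throws, otherwise the concurrency gauge drifts upward.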

Notification Error Event Catalog

| Event name | Level | Emitted by | Purpose |
| --- | --- | --- | --- |
| notifications.email.template_missing | warn | SendEmailNotificationHandler | Template key is not registered in the notifications template registry. |
| notifications.email.render_failed | error | SendEmailNotificationHandler | Email subject generation or HTML rendering failed before SMTP delivery. |
| notifications.email.delivery_failed | error | SendEmailNotificationHandler | SMTP delivery failed after template rendering completed. |
| notifications.email.log_persist_failed | error | SendEmailNotificationHandler | Notification audit row could not be stored in notifications.notification_logs. |
| notifications.sms.provider_misconfigured | error | SendSmsNotificationHandler | Twilio credentials or sender configuration are missing. |
| notifications.sms.delivery_failed | error | SendSmsNotificationHandler | Twilio rejected or failed an SMS delivery attempt. |
| notifications.sms.log_persist_failed | error | SendSmsNotificationHandler | Notification audit row could not be stored in notifications.notification_logs. |

All notification events include the command context (templateKey, recipient, channel, and any available userId, clientId, or agencyId). SMS failures also include Twilio diagnostics when available, such as providerErrorCode, providerMoreInfo, providerDetails, and missingVariables.
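A hypothetical sketch of how such a namespaced event could be assembled before it reaches the structured logger; the field names follow the catalog above, but the builder itself is illustrative, not the handlers' actual code:

```typescript
type NotificationContext = {
  templateKey: string;
  recipient: string;
  channel: "email" | "sms";
  userId?: string;
  clientId?: string;
  agencyId?: string;
};

type NotificationErrorEvent = {
  event: string;
  level: "warn" | "error";
  attributes: Record<string, string>;
};

// Build a structured event such as notifications.sms.delivery_failed,
// attaching only the optional ids that are actually present, plus any
// provider diagnostics (e.g. Twilio's providerErrorCode).
function buildNotificationEvent(
  name: string,
  level: "warn" | "error",
  ctx: NotificationContext,
  providerDiagnostics: Record<string, string> = {},
): NotificationErrorEvent {
  const attributes: Record<string, string> = {
    templateKey: ctx.templateKey,
    recipient: ctx.recipient,
    channel: ctx.channel,
    ...providerDiagnostics,
  };
  for (const key of ["userId", "clientId", "agencyId"] as const) {
    if (ctx[key]) attributes[key] = ctx[key]!;
  }
  return { event: name, level, attributes };
}
```

Omitting absent ids (rather than emitting empty strings) keeps Loki label filters like `userId != ""` meaningful.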

Request-Level Operator Signals

  • Each HTTP response includes x-request-id. If the caller already sends that header, the API reuses it; otherwise it generates a new UUIDv7-compatible request id.
  • http.request.completed is emitted at info for successful responses, warn for 4xx, and error for 5xx, which makes request failures visible in Loki even when the controller code does not log explicitly.
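Both behaviors reduce to small pure functions. A sketch, assuming an injected id generator so any UUIDv7-compatible source can be plugged in (function names are illustrative):

```typescript
// Map an HTTP status code to the log level used for http.request.completed:
// info for success, warn for 4xx, error for 5xx.
function requestLogLevel(statusCode: number): "info" | "warn" | "error" {
  if (statusCode >= 500) return "error";
  if (statusCode >= 400) return "warn";
  return "info";
}

// Reuse an incoming x-request-id header when present; otherwise
// generate a fresh request id with the injected generator.
function resolveRequestId(
  incoming: string | undefined,
  generate: () => string,
): string {
  return incoming && incoming.trim() !== "" ? incoming : generate();
}
```

Reusing the caller's id is what lets a gateway or frontend correlate its own logs with the API's `http.request.completed` entries.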

Metric Catalog

HTTP + Runtime

| Metric name | Type | Attributes | Purpose |
| --- | --- | --- | --- |
| http_requests_total | Counter | method, route, status_code | Request throughput + errors |
| http_request_duration_seconds | Histogram | method, route, status_code | Request latency percentiles |
| http_requests_in_flight | UpDownCounter | method | Current concurrency |
| nodejs_eventloop_lag_seconds | Histogram | none | Node.js event loop saturation |
| nodejs_active_handles_total | ObservableGauge | none | Runtime handle pressure |
| nodejs_active_requests_total | ObservableGauge | none | Runtime request backlog signal |

Redis (Application Perspective)

These app metrics complement infrastructure Redis metrics from redis-exporter (memory, clients, server throughput).

Redis cache callers emit these metrics through RedisMetricsService.recordOperation(...). The Redis callback stays focused on Redis I/O while the metrics service records duration/error telemetry around it and exposes cache hit/miss helpers for the callback body.
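A minimal sketch of that recordOperation(...) pattern, with plain in-memory recorders standing in for the OTEL instruments (the helper shape and key-prefix rule are illustrative, not the actual service API):

```typescript
const redisTelemetry = {
  durations: [] as { operation: string; key_prefix: string; seconds: number }[],
  errors: [] as { operation: string; key_prefix: string; error_type: string }[],
};

// Derive a low-cardinality key_prefix from a concrete Redis key,
// e.g. "session:42" -> "session".
function keyPrefix(key: string): string {
  const idx = key.indexOf(":");
  return idx === -1 ? key : key.slice(0, idx);
}

// Time a Redis callback, recording duration on success and an error
// sample (with error_type) on failure. The callback stays pure Redis
// I/O; all telemetry concerns live in the wrapper.
async function recordOperation<T>(
  operation: string,
  key: string,
  callback: () => Promise<T>,
): Promise<T> {
  const prefix = keyPrefix(key);
  const start = Date.now();
  try {
    const result = await callback();
    redisTelemetry.durations.push({
      operation,
      key_prefix: prefix,
      seconds: (Date.now() - start) / 1000,
    });
    return result;
  } catch (err) {
    redisTelemetry.errors.push({
      operation,
      key_prefix: prefix,
      error_type: err instanceof Error ? err.constructor.name : "UnknownError",
    });
    throw err; // telemetry must never swallow the failure
  }
}
```

A caller would wrap each cache access, e.g. `recordOperation("get", "session:42", () => redis.get("session:42"))`, and use the hit/miss helpers inside the callback body.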

| Metric name | Type | Attributes | Purpose |
| --- | --- | --- | --- |
| app_redis_operation_duration_seconds | Histogram | operation, key_prefix | App-side Redis latency by operation |
| app_redis_operation_errors_total | Counter | operation, key_prefix, error_type | App-side Redis failures |
| app_cache_hit_total | Counter | cache_name, key_prefix | Cache hits from API code path |
| app_cache_miss_total | Counter | cache_name, key_prefix | Cache misses from API code path |

PostgreSQL (Application Perspective)

These metrics complement infrastructure PostgreSQL metrics from postgres-exporter (connections, locks, deadlocks, buffer hit ratio, row throughput). The API dashboard stays focused on app-perspective database telemetry, while the community PostgreSQL dashboard reads exporter metrics directly.

TypeORM connections are instrumented centrally in database.config.ts via instrumentPostgresDataSource(...), which wraps every created query runner. Repository implementations also wrap their public methods through PostgresMetricsService.recordRepositoryOperation(...) so you can compare code-path latency to raw query latency.
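The query_kind and entity attributes on the app_db_query_* metrics must stay low-cardinality, which means raw SQL has to be reduced to a handful of labels. A simplified sketch of such a classifier (the actual logic inside instrumentPostgresDataSource(...) may differ):

```typescript
// Reduce raw SQL to low-cardinality labels: the statement verb becomes
// query_kind and the first referenced table becomes entity. Anything
// unrecognized collapses to "other"/"unknown" to cap label cardinality.
function classifyQuery(sql: string): { query_kind: string; entity: string } {
  const text = sql.trim().toLowerCase();
  const kinds = ["select", "insert", "update", "delete"] as const;
  const query_kind = kinds.find((k) => text.startsWith(k)) ?? "other";
  const match =
    text.match(/\bfrom\s+"?([a-z0-9_.]+)"?/) ??
    text.match(/\binto\s+"?([a-z0-9_.]+)"?/) ??
    text.match(/\bupdate\s+"?([a-z0-9_.]+)"?/);
  return { query_kind, entity: match ? match[1] : "unknown" };
}
```

Emitting the raw SQL (or even parameterized SQL) as a label would explode Prometheus series counts; collapsing to verb + table keeps the `app_db_query_*` series bounded by the schema size.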

| Metric name | Type | Attributes | Purpose |
| --- | --- | --- | --- |
| app_repository_operation_duration_seconds | Histogram | module, repository, method | App-side repository latency by code path |
| app_repository_operation_errors_total | Counter | module, repository, method, error_type | App-side repository failures by code path |
| app_db_query_duration_seconds | Histogram | connection, query_kind, entity | App-side SQL latency by connection and table |
| app_db_query_errors_total | Counter | connection, query_kind, entity, error_type | App-side SQL failures by connection and table |

Dashboard

  • Provisioned dashboard: infra/observability/grafana-provisioning/dashboards/infra/infra-api-observability-dashboard.json
  • Folder in Grafana: Infra
  • Focus: API request rate/errors/latency, in-flight requests, event loop lag, Node runtime counters, Redis app-perspective latency/errors/hit ratio, Postgres repository/query latency, and Postgres repository/query errors.
  • Community PostgreSQL dashboard: infra/observability/grafana-provisioning/dashboards/community/postgresql-dashboard.json

Identity -> Notifications Traceability

Identity and Notifications are independent modules that communicate through integration events.

  • Identity publisher attaches metadata to each integration event:
    • eventId
    • correlationId
    • occurredAt
    • producer
    • trace (traceId, spanId, traceFlags)
  • Notifications consumers use span links from event.metadata.trace.
  • This keeps module boundaries explicit while preserving end-to-end diagnostics in Tempo.
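A sketch of how a consumer could turn event.metadata.trace into a span link, using plain types that stand in for the OTEL SpanContext / Link interfaces (the validation rule follows W3C Trace Context: 32 hex chars for traceId, 16 for spanId):

```typescript
type TraceMetadata = { traceId: string; spanId: string; traceFlags: number };

type SpanLink = {
  context: { traceId: string; spanId: string; traceFlags: number; isRemote: true };
};

// Build the links array for the consumer span. If the producer did not
// attach well-formed trace metadata, return no links rather than a
// malformed one, so the consumer span is still created.
function linksFromEventMetadata(trace?: TraceMetadata): SpanLink[] {
  if (!trace || trace.traceId.length !== 32 || trace.spanId.length !== 16) {
    return [];
  }
  return [
    {
      context: {
        traceId: trace.traceId,
        spanId: trace.spanId,
        traceFlags: trace.traceFlags,
        isRemote: true,
      },
    },
  ];
}
```

A link (rather than a parent-child relationship) is the right choice here: the consumer span belongs to its own trace, and Tempo renders the link so operators can jump from the consumer back to the producing request.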

Environment Variables (apps/api/.env)

| Variable | Default | Purpose |
| --- | --- | --- |
| OBSERVABILITY_ENABLED | true | Enables/disables OTEL SDK initialization. |
| OBSERVABILITY_OTLP_HTTP_ENDPOINT | http://otel-collector:4318 | Base OTLP HTTP endpoint used for /v1/traces, /v1/logs, and /v1/metrics. |
| OBSERVABILITY_SERVICE_NAME | daramex-api | Service name sent in OTEL resource attributes. |
| OBSERVABILITY_SERVICE_VERSION | 0.0.1 | Service version sent in OTEL resource attributes. |
| OBSERVABILITY_LOG_LEVEL | info | Application logger threshold (debug \| info \| warn \| error). |

Verification Steps

  1. Start the API with OBSERVABILITY_ENABLED=true and collector reachable.
  2. Perform an identity action that emits an integration event (for example user registration).
  3. In Grafana Tempo, verify:
    • HTTP server span exists.
    • Identity producer span exists.
    • Notifications consumer span exists with link to producer span.
  4. In Grafana Loki, verify structured log events with matching trace.id.
  5. Force a notifications failure path (for example, request an unknown email template in a consumer test flow or remove a Twilio variable locally) and verify one of the notifications.email.* or notifications.sms.* events appears with the expected attributes.
  6. In Grafana Explore (Prometheus), verify metrics are present with:
    • sum(rate(http_requests_total{service_name="daramex-api"}[5m]))
    • histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service_name="daramex-api"}[5m])) by (le))
    • sum(rate(app_repository_operation_duration_seconds_count{service_name="daramex-api"}[5m])) by (module, repository, method)
    • sum(rate(app_db_query_duration_seconds_count{service_name="daramex-api"}[5m])) by (connection, query_kind, entity)
    • sum(rate(app_redis_operation_duration_seconds_count{service_name="daramex-api"}[5m])) by (operation)

Troubleshooting

  • No traces in Tempo:
    • Confirm OBSERVABILITY_ENABLED=true.
    • Confirm API can reach OBSERVABILITY_OTLP_HTTP_ENDPOINT.
    • Confirm collector traces pipeline exports to Tempo (infra/observability/collector-config.yaml).
  • No logs in Loki:
    • Confirm app logs are emitted through AppLoggerService.
    • Confirm collector logs pipeline exports to Loki.
  • Notification failures only appear in the database:
    • Confirm the Notifications commands execute through SendEmailNotificationHandler or SendSmsNotificationHandler.
    • Confirm error paths are using the structured notification events listed in this runbook.
    • Confirm the failing workflow still has an active trace context if you expect trace.id / span.id correlation in Loki.
  • No metrics in Prometheus:
    • Confirm NodeSDK metrics reader is configured with OTLP HTTP metrics exporter (/v1/metrics).
    • Confirm collector metrics pipeline exports to Prometheus.
    • Confirm collector Prometheus exporter has resource_to_telemetry_conversion.enabled: true so service_name labels are available.
  • No postgres-exporter metrics:
    • Confirm postgres-exporter is running in the infra compose stack.
    • Confirm Prometheus scrapes postgres-exporter:9187 with job postgres.
    • Confirm the exporter can reach the database container with the configured DATA_SOURCE_URI, DATA_SOURCE_USER, and DATA_SOURCE_PASS.
  • No app-side Postgres metrics:
    • Confirm repository methods execute through instrumented implementations in modules/**/infrastructure/repositories/*.repository.impl.ts.
    • Confirm createSchemaDataSourceOptions(...) initializes the datasource through instrumentPostgresDataSource(...).
  • Missing module linkage (identity -> notifications):
    • Confirm published event has metadata.trace.
    • Confirm consumer spans are created with links: [event.metadata.trace].