Observability

Shared primitives and conventions for structured logs, metrics, and traces in apps/api.

Key building blocks already in place:

  • AppLoggerService — the ONLY way to emit structured logs. NestJS Logger and console.log are forbidden inside apps/api/src (see .oxlintrc.json).
  • LogContext: AsyncLocalStorage-backed per-request context. Services append attributes via LogContext.set(...) / LogContext.append(...); the final log record is emitted at the request boundary.
  • otel-sdk.ts — boots the OpenTelemetry Node SDK (traces + logs + metrics) via OTLP/HTTP.
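
To make the per-request context concrete, here is a minimal sketch of how an AsyncLocalStorage-backed store with set / append semantics could work. LogContextSketch, run and snapshot are hypothetical names chosen for illustration; the real LogContext in apps/api may be structured differently.

```ts
import { AsyncLocalStorage } from 'node:async_hooks';

type LogAttributes = Record<string, unknown>;

// Hypothetical stand-in for LogContext; not the actual apps/api implementation.
const storage = new AsyncLocalStorage<LogAttributes>();

export const LogContextSketch = {
  // Run `fn` with a fresh attribute bag scoped to this request.
  run<T>(fn: () => T): T {
    return storage.run({}, fn);
  },
  // Overwrite a single attribute; ignored when called outside a request scope.
  set(key: string, value: unknown): void {
    const store = storage.getStore();
    if (store) store[key] = value;
  },
  // Push onto an array-valued attribute, creating the array on first use.
  append(key: string, value: unknown): void {
    const store = storage.getStore();
    if (!store) return;
    const existing = store[key];
    if (Array.isArray(existing)) existing.push(value);
    else store[key] = [value];
  },
  // Shallow copy that the request boundary could serialise into one log record.
  snapshot(): LogAttributes {
    return { ...(storage.getStore() ?? {}) };
  },
};
```

Because the store travels with the async call chain, a service several awaits deep can call append without receiving the context as a parameter.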

AI Observability — Phase 0 primitives

Part of the end-to-end AI observability work tracked in GitHub issue #174. Phase 0 ships two shared primitives used by every subsequent phase.

timedSpan helper

Location: apps/api/src/shared/infrastructure/observability/timed-span.ts.

Wraps an async function with a stopwatch-style span. On completion it appends a record to LogContext under the spans key and (optionally) records the duration on an OpenTelemetry histogram.

Signature:

```ts
export interface ITimedSpanOptions {
  histogram?: import('@opentelemetry/api').Histogram;
  attributes?: Record<string, string | number | boolean>;
}

export async function timedSpan<T>(
  name: string,
  fn: () => Promise<T>,
  opts?: ITimedSpanOptions,
): Promise<T>;
```

Behavior:

  • On success: appends { name, duration_ms, ok: true } to LogContext.spans, returns the value.
  • On failure: appends { name, duration_ms, ok: false, error_code } and rethrows the original error.
  • error_code is extracted heuristically: err.code if present, otherwise err.name, otherwise undefined. The full error object is intentionally NOT logged (privacy / size).
  • If opts.histogram is given, the duration is recorded with opts.attributes (or {} when omitted). Recording happens on both success and failure so the distribution is not biased.
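
The behavior listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of timed-span.ts: timedSpanSketch, recordedSpans, HistogramLike and extractErrorCode are hypothetical names, and the recordedSpans array stands in for LogContext.append('spans', ...).

```ts
import { performance } from 'node:perf_hooks';

type SpanAttributes = Record<string, string | number | boolean>;

// Structural stand-in for the OpenTelemetry Histogram; only record() is needed.
interface HistogramLike {
  record(value: number, attributes?: SpanAttributes): void;
}

interface SpanRecord {
  name: string;
  duration_ms: number;
  ok: boolean;
  error_code?: string;
}

// Hypothetical sink standing in for LogContext.append('spans', ...).
export const recordedSpans: SpanRecord[] = [];

// Prefer err.code, then err.name, else undefined; the error itself is never stored.
function extractErrorCode(err: unknown): string | undefined {
  if (typeof err !== 'object' || err === null) return undefined;
  const { code, name } = err as { code?: unknown; name?: unknown };
  if (typeof code === 'string' || typeof code === 'number') return String(code);
  return typeof name === 'string' ? name : undefined;
}

export async function timedSpanSketch<T>(
  name: string,
  fn: () => Promise<T>,
  opts?: { histogram?: HistogramLike; attributes?: SpanAttributes },
): Promise<T> {
  const start = performance.now();
  // Shared by both paths so the histogram sees successes and failures alike.
  const finish = (outcome: { ok: boolean; error_code?: string }): void => {
    const duration_ms = performance.now() - start;
    recordedSpans.push({ name, duration_ms, ...outcome });
    opts?.histogram?.record(duration_ms, opts?.attributes ?? {});
  };

  let value: T;
  try {
    value = await fn();
  } catch (err) {
    finish({ ok: false, error_code: extractErrorCode(err) });
    throw err; // rethrow the original error untouched
  }
  finish({ ok: true });
  return value;
}
```

Note that in this sketch the attributes are fixed before fn runs, so they cannot reflect the outcome of the call.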

Example:

```ts
import { timedSpan } from '@api/shared/infrastructure/observability/timed-span';
import { aiToolDurationMs } from '@api/shared/infrastructure/observability/ai-metrics';

await timedSpan(
  'ai.tool.search_docs',
  () => runTool(input),
  {
    histogram: aiToolDurationMs,
    attributes: { tool: 'search_docs', result: 'ok' },
  },
);
```

When to use it:

  • Any async operation you want both (a) recorded as part of the canonical request log (via LogContext.spans) AND (b) aggregated as a latency distribution.
  • Inside feature modules where a dedicated distributed-tracing span would be overkill but you still want per-operation visibility.

When NOT to use it:

  • For full distributed tracing spans — use @opentelemetry/api trace.getTracer(...).startActiveSpan(...) directly.
  • For plain error handling without timing — don't dress a try/catch up as a span.

ai-metrics.ts — central AI instruments

Location: apps/api/src/shared/infrastructure/observability/ai-metrics.ts.

Declares every OpenTelemetry instrument used across the AI module in a single file so names stay typo-proof and greppable. Instruments are created eagerly from metrics.getMeter('ai'); if no MeterProvider is registered yet, the OpenTelemetry API returns a no-op meter, so importing this module is always safe.

Exported groups (see the file for descriptions and units):

  • Histograms: aiStreamTtftMs, aiStreamTotalMs, aiStreamPreMs, aiToolDurationMs, aiToolApprovalLatencyMs, aiRagQueryMs, aiRagDocsMatched, aiGuardrailMs, aiSubagentDurationMs.
  • Counters: aiMessagesSentTotal, aiStreamAbortsTotal, aiToolInvocationsTotal, aiToolApprovalsTotal, aiGuardrailBlocksTotal, aiRagEmptyResultsTotal, aiSubagentInvocationsTotal, aiSubagentTokensTotal.
  • UpDownCounters (gauges): aiActiveStreams, aiMcpPoolConnections.

The canonical name table is also exported as AI_METRIC_NAMES for snapshot-style tests and reverse lookup.
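
As a rough illustration of the "one file, one name table" pattern, the sketch below pairs a name table with instrument creation and a reverse lookup. Everything here is an assumption made for the example: the wire names (such as 'ai.tool.duration_ms'), METRIC_NAMES_SKETCH, createInstrumentsSketch, buildReverseLookup and the MeterLike interface are hypothetical, and the sketch takes the meter as a parameter to stay dependency-free, whereas the real file creates its instruments eagerly from metrics.getMeter('ai').

```ts
type Labels = Record<string, string | number | boolean>;

// Structural stand-ins for OpenTelemetry types so the sketch runs without dependencies.
interface InstrumentOptions {
  description?: string;
  unit?: string;
}
interface MeterLike {
  createHistogram(name: string, options?: InstrumentOptions): { record(value: number, labels?: Labels): void };
  createCounter(name: string, options?: InstrumentOptions): { add(value: number, labels?: Labels): void };
  createUpDownCounter(name: string, options?: InstrumentOptions): { add(value: number, labels?: Labels): void };
}

// Hypothetical wire names, for illustration only.
export const METRIC_NAMES_SKETCH = {
  aiToolDurationMs: 'ai.tool.duration_ms',
  aiToolInvocationsTotal: 'ai.tool.invocations_total',
  aiActiveStreams: 'ai.active_streams',
} as const;

// Create every instrument in one place, keyed by the same identifiers as the name table.
export function createInstrumentsSketch(meter: MeterLike) {
  return {
    aiToolDurationMs: meter.createHistogram(METRIC_NAMES_SKETCH.aiToolDurationMs, { unit: 'ms' }),
    aiToolInvocationsTotal: meter.createCounter(METRIC_NAMES_SKETCH.aiToolInvocationsTotal),
    aiActiveStreams: meter.createUpDownCounter(METRIC_NAMES_SKETCH.aiActiveStreams),
  };
}

// Reverse lookup from wire name to exported identifier; throws on duplicate wire names.
export function buildReverseLookup(names: Record<string, string>): Map<string, string> {
  const reverse = new Map<string, string>();
  for (const [identifier, wireName] of Object.entries(names)) {
    if (reverse.has(wireName)) throw new Error(`duplicate metric name: ${wireName}`);
    reverse.set(wireName, identifier);
  }
  return reverse;
}
```

A snapshot-style test can then assert on the name table alone, which catches accidental renames without needing a metrics backend.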

Label discipline (STRICT)

Metric attributes must stay LOW cardinality. This is enforced by convention today; a lint rule will follow.

Allowed label keys (bounded enums / stable values):

  • model, provider, tool, type, reason, result, subagent, phase, decision

FORBIDDEN as labels (high cardinality: every distinct value creates a new time series, which can overwhelm Prometheus / OTel backends):

  • agent_id, chat_id, user_id, org_id, message_id

High-cardinality identifiers belong in logs only — attach them via LogContext.set('agent_id', ...) / AppLoggerService.info(...) so they are available for log-based search without polluting metric cardinality.
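
Until a lint rule exists, the allow-list above could also be checked at runtime, for example in unit tests. The helper below is a hypothetical sketch: findForbiddenLabelKeys and assertLowCardinality are not part of the codebase described here, and only the list of allowed keys is taken from this document.

```ts
type MetricLabels = Record<string, string | number | boolean>;

// Allowed label keys, copied from the list in this section.
const ALLOWED_LABEL_KEYS: ReadonlySet<string> = new Set([
  'model', 'provider', 'tool', 'type', 'reason',
  'result', 'subagent', 'phase', 'decision',
]);

// Return every key that is not on the allow-list (empty array when the labels are clean).
export function findForbiddenLabelKeys(labels: MetricLabels): string[] {
  return Object.keys(labels).filter((key) => !ALLOWED_LABEL_KEYS.has(key));
}

// Hypothetical guard: throw early instead of silently emitting a high-cardinality series.
export function assertLowCardinality(labels: MetricLabels): MetricLabels {
  const forbidden = findForbiddenLabelKeys(labels);
  if (forbidden.length > 0) {
    throw new Error(`forbidden metric label keys: ${forbidden.join(', ')}`);
  }
  return labels;
}
```

An allow-list is used rather than a deny-list so that a newly invented identifier such as session_id is rejected by default.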

Phase 5 note

Phase 5 of issue #174 wires the MeterProvider (currently bootstrapped in otel-sdk.ts) into every instrument declared here. Until that phase lands, all .record() / .add() calls are safe no-ops, because the OpenTelemetry API hands out no-op instruments when no provider is registered.