Observability
Shared primitives and conventions for structured logs, metrics, and traces in apps/api.
Key building blocks already in place:
- AppLoggerService: the ONLY way to emit structured logs. The NestJS Logger and console.log are forbidden inside apps/api/src (see .oxlintrc.json).
- LogContext: AsyncLocalStorage-backed per-request context. Services append attributes via LogContext.set(...) / LogContext.append(...); the final log record is emitted at the request boundary.
- otel-sdk.ts: boots the OpenTelemetry Node SDK (traces + logs + metrics) via OTLP/HTTP.
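To make the request-scoped accumulation concrete, here is a minimal sketch of an AsyncLocalStorage-backed context. It is not the real LogContext: the name SketchLogContext and the run and snapshot methods are assumptions made for illustration, and only set and append are named in this document.

```ts
import { AsyncLocalStorage } from 'node:async_hooks';

type Attrs = Record<string, unknown>;

// Hypothetical stand-in for LogContext; the real class may differ.
const storage = new AsyncLocalStorage<Attrs>();

const SketchLogContext = {
  // Opens a fresh attribute bag for one request and runs fn inside it.
  run<T>(fn: () => T): T {
    return storage.run({}, fn);
  },
  // Overwrites a single attribute; ignored when called outside a request.
  set(key: string, value: unknown): void {
    const store = storage.getStore();
    if (store) store[key] = value;
  },
  // Pushes onto an array-valued attribute, creating the array on first use.
  append(key: string, value: unknown): void {
    const store = storage.getStore();
    if (!store) return;
    const current = store[key];
    if (Array.isArray(current)) current.push(value);
    else store[key] = [value];
  },
  // Shallow copy of what would be emitted at the request boundary.
  snapshot(): Attrs {
    return { ...(storage.getStore() ?? {}) };
  },
};
```

Because the store is bound to the async call chain, services deep in the stack can call set or append without having the request object passed down to them.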
AI Observability — Phase 0 primitives
Part of the end-to-end AI observability work tracked in GitHub issue #174. Phase 0 ships two shared primitives used by every subsequent phase.
timedSpan helper
Location: apps/api/src/shared/infrastructure/observability/timed-span.ts.
Wraps an async function with a stopwatch-style span. On completion it appends a record to LogContext under the spans key and (optionally) records the duration on an OpenTelemetry histogram.
Signature:
```ts
export interface ITimedSpanOptions {
  histogram?: import('@opentelemetry/api').Histogram;
  attributes?: Record<string, string | number | boolean>;
}

export async function timedSpan<T>(
  name: string,
  fn: () => Promise<T>,
  opts?: ITimedSpanOptions,
): Promise<T>;
```
Behavior:
- On success: appends { name, duration_ms, ok: true } to LogContext.spans, returns the value.
- On failure: appends { name, duration_ms, ok: false, error_code } and rethrows the original error.
- error_code is extracted heuristically: err.code → err.name → undefined. The full error object is intentionally NOT logged (privacy / size).
- If opts.histogram is given, the duration is recorded with opts.attributes (or {} when omitted). Recording happens on both success and failure so the distribution is not biased.
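The behavior listed above can be sketched as follows. This is an illustration, not the code in timed-span.ts: HistogramLike is a local stand-in for the OpenTelemetry Histogram type, the module-level spans array replaces LogContext, and the names timedSpanSketch and extractErrorCode are hypothetical.

```ts
type Attributes = Record<string, string | number | boolean>;

// Local stand-in for the OpenTelemetry Histogram interface (assumption).
interface HistogramLike {
  record(value: number, attributes?: Attributes): void;
}

interface SpanRecord {
  name: string;
  duration_ms: number;
  ok: boolean;
  error_code?: string;
}

// Stand-in for LogContext.spans; the real helper appends to the request context.
const spans: SpanRecord[] = [];

// err.code, then err.name, then undefined. The full error is never stored.
// Assumption of this sketch: non-string codes are skipped.
function extractErrorCode(err: unknown): string | undefined {
  if (typeof err !== 'object' || err === null) return undefined;
  const e = err as { code?: unknown; name?: unknown };
  if (typeof e.code === 'string') return e.code;
  if (typeof e.name === 'string') return e.name;
  return undefined;
}

async function timedSpanSketch<T>(
  name: string,
  fn: () => Promise<T>,
  opts?: { histogram?: HistogramLike; attributes?: Attributes },
): Promise<T> {
  // Date.now() keeps the sketch dependency-free; a monotonic clock is preferable.
  const start = Date.now();
  let ok = true;
  let errorCode: string | undefined;
  try {
    return await fn();
  } catch (err) {
    ok = false;
    errorCode = extractErrorCode(err);
    throw err; // rethrow the original error untouched
  } finally {
    // Runs on both paths, so failures are part of the latency distribution too.
    const duration_ms = Date.now() - start;
    spans.push(ok ? { name, duration_ms, ok } : { name, duration_ms, ok, error_code: errorCode });
    opts?.histogram?.record(duration_ms, opts?.attributes ?? {});
  }
}
```

Putting the bookkeeping in finally is what keeps the success and failure paths symmetric: neither branch can forget to record the duration.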
Example:
```ts
import { timedSpan } from '@api/shared/infrastructure/observability/timed-span';
import { aiToolDurationMs } from '@api/shared/infrastructure/observability/ai-metrics';

await timedSpan(
  'ai.tool.search_docs',
  () => runTool(input),
  {
    histogram: aiToolDurationMs,
    attributes: { tool: 'search_docs', result: 'ok' },
  },
);
```
When to use it:
- Any async operation you want both (a) recorded as part of the canonical request log (via LogContext.spans) AND (b) aggregated as a latency distribution.
- Inside feature modules where a dedicated distributed-tracing span would be overkill but you still want per-operation visibility.
When NOT to use it:
- For full distributed tracing spans: use trace.getTracer(...).startActiveSpan(...) from @opentelemetry/api directly.
- For plain error handling without timing: don't dress a try/catch up as a span.
ai-metrics.ts — central AI instruments
Location: apps/api/src/shared/infrastructure/observability/ai-metrics.ts.
Declares every OpenTelemetry instrument used across the AI module in a single file so names stay typo-proof and greppable. Instruments are created eagerly from metrics.getMeter('ai'); if no MeterProvider is registered yet, the OpenTelemetry API falls back to a no-op meter, so importing this module is always safe.
Exported groups (see the file for descriptions and units):
- Histograms: aiStreamTtftMs, aiStreamTotalMs, aiStreamPreMs, aiToolDurationMs, aiToolApprovalLatencyMs, aiRagQueryMs, aiRagDocsMatched, aiGuardrailMs, aiSubagentDurationMs.
- Counters: aiMessagesSentTotal, aiStreamAbortsTotal, aiToolInvocationsTotal, aiToolApprovalsTotal, aiGuardrailBlocksTotal, aiRagEmptyResultsTotal, aiSubagentInvocationsTotal, aiSubagentTokensTotal.
- UpDownCounters (gauges): aiActiveStreams, aiMcpPoolConnections.
The canonical name table is also exported as AI_METRIC_NAMES for snapshot-style tests and reverse lookup.
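One way such a name table can support snapshot-style tests and reverse lookup is sketched below. The metric name strings are invented for illustration (this document lists only the exported instrument identifiers, not their metric names), and reverseLookup and hasDuplicateNames are hypothetical helpers.

```ts
// Hypothetical subset; the real AI_METRIC_NAMES values are not shown in this document.
const AI_METRIC_NAMES_SKETCH = {
  aiStreamTtftMs: 'ai.stream.ttft_ms',
  aiToolDurationMs: 'ai.tool.duration_ms',
  aiMessagesSentTotal: 'ai.messages.sent_total',
  aiActiveStreams: 'ai.streams.active',
} as const;

type InstrumentKey = keyof typeof AI_METRIC_NAMES_SKETCH;

// Reverse lookup: metric name string -> exported instrument identifier.
function reverseLookup(metricName: string): InstrumentKey | undefined {
  const keys = Object.keys(AI_METRIC_NAMES_SKETCH) as InstrumentKey[];
  return keys.find((key) => AI_METRIC_NAMES_SKETCH[key] === metricName);
}

// Snapshot-style guard: two instruments must never share a metric name.
function hasDuplicateNames(): boolean {
  const names: string[] = Object.values(AI_METRIC_NAMES_SKETCH);
  return new Set(names).size !== names.length;
}
```

A test that snapshots the whole table turns any rename into a visible diff, which matters because dashboards and alerts typically reference metric names as plain strings.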
Label discipline (STRICT)
Metric attributes must stay LOW cardinality. This is enforced by convention today; a lint rule will follow.
Allowed label keys (bounded enums / stable values):
model, provider, tool, type, reason, result, subagent, phase, decision
FORBIDDEN as labels (high cardinality — cardinality explosion will blow up Prometheus / OTel backends):
agent_id, chat_id, user_id, org_id, message_id
High-cardinality identifiers belong in logs only — attach them via LogContext.set('agent_id', ...) / AppLoggerService.info(...) so they are available for log-based search without polluting metric cardinality.
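Until a lint rule exists, the convention could also be checked at runtime or in unit tests. The sketch below shows one possible shape and is not existing code: findDisallowedLabelKeys and assertLowCardinality are hypothetical helpers, and the two key sets are copied from the allowed and forbidden lists above.

```ts
const ALLOWED_LABEL_KEYS = new Set([
  'model', 'provider', 'tool', 'type', 'reason', 'result', 'subagent', 'phase', 'decision',
]);
const FORBIDDEN_LABEL_KEYS = new Set(['agent_id', 'chat_id', 'user_id', 'org_id', 'message_id']);

type MetricAttributes = Record<string, string | number | boolean>;

// Returns the offending keys; an empty array means the attributes are safe to record.
function findDisallowedLabelKeys(attributes: MetricAttributes): string[] {
  return Object.keys(attributes).filter((key) => !ALLOWED_LABEL_KEYS.has(key));
}

// Throws with a targeted message so the mistake surfaces in tests, not in the metrics backend.
function assertLowCardinality(attributes: MetricAttributes): void {
  const bad = findDisallowedLabelKeys(attributes);
  if (bad.length === 0) return;
  const highCardinality = bad.filter((key) => FORBIDDEN_LABEL_KEYS.has(key));
  const hint = highCardinality.length > 0
    ? ` (${highCardinality.join(', ')} belong in logs, not metric labels)`
    : '';
  throw new Error(`Disallowed metric label keys: ${bad.join(', ')}${hint}`);
}
```

Using an allow-list rather than only a deny-list means a new, unreviewed key fails the check by default, which matches the STRICT intent of this section.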
Phase 5 note
Phase 5 of issue #174 wires the MeterProvider (currently bootstrapped in otel-sdk.ts) into every instrument declared here. Until that phase lands, all .record() / .add() calls are safe no-ops: the OpenTelemetry API hands out no-op instruments when no provider is registered.