008 - TLS Strategy for Custom Domains
Status
Accepted (v1)
Date
2026-04-13
Context
The Agency Custom Domains feature (see feature doc) allows agency admins to connect a branded subdomain (e.g. booking.kfc.com) so that end-users never see a daramex.org URL.
Serving a custom domain over HTTPS requires TLS termination at the infrastructure layer. Three requirements drive the decision:
- Each custom domain must have its own valid TLS certificate. SNI routing requires a cert for the exact hostname; a wildcard cert for *.daramex.org does not cover booking.kfc.com.
- The application layer must remain infra-agnostic. The data model, DNS verification flow, and API contract must not change when the TLS strategy changes. This separation was an explicit design goal so that v2 can swap the TLS mechanism without touching domain logic.
- v1 scope is tight. The priority is shipping the feature with the lowest operational risk and without introducing new infrastructure components.
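To illustrate the first requirement, here is a minimal sketch of certificate name matching, assuming the common rule that a wildcard is only honoured as the whole left-most label and covers exactly one label. The function name certCoversHost is hypothetical and is not part of the DaraMex codebase; real TLS libraries apply additional checks.

```typescript
// Simplified hostname matching: a "*" is accepted only as the entire
// left-most label, and it stands in for exactly one label.
function certCoversHost(certName: string, host: string): boolean {
  const pattern = certName.toLowerCase().split(".");
  const target = host.toLowerCase().split(".");
  // A wildcard never spans multiple labels, so label counts must match.
  if (pattern.length !== target.length) return false;
  return pattern.every((label, i) => {
    const actual = target[i] ?? "";
    if (i === 0 && label === "*") return actual.length > 0;
    return label === actual;
  });
}

console.log(certCoversHost("*.daramex.org", "app.daramex.org")); // true
console.log(certCoversHost("*.daramex.org", "booking.kfc.com")); // false
console.log(certCoversHost("booking.kfc.com", "booking.kfc.com")); // true
```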
The DaraMex stack at the time of this decision:
- Dokploy as the deployment platform, which uses Traefik as the reverse proxy.
- Traefik has built-in ACME support via Let's Encrypt.
- All services are fronted by Traefik; custom domains added to a Dokploy service automatically trigger cert issuance.
- Redis and NestJS are already in production; no new infra is needed for the application-level custom domain feature.
Three TLS strategies were evaluated:
Options Evaluated
Option A — Dokploy per-domain Let's Encrypt (manual operator registration)
When an agency's custom domain reaches verified status, the platform operator manually adds the hostname to the Dokploy service's domain list via the Dokploy UI. Traefik then issues a Let's Encrypt certificate automatically for that hostname.
Pros:
- Zero new infrastructure — reuses the existing Dokploy + Traefik + Let's Encrypt stack.
- Cert renewal is automatic (Traefik handles it).
- No new dependencies.
- The application layer does not need to know anything about TLS provisioning.
Cons:
- Manual operational step per domain; does not scale beyond ~20 simultaneously active domains.
- If operators delay registration, agencies see a TLS error window between verified status and cert issuance.
- Subject to Let's Encrypt rate limits (50 certificates per registered domain per week, and 5 per week for the exact same set of hostnames).
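The rate-limit concern can be reasoned about with a simple sliding-window count. The sketch below is only an approximation for operator planning: the helper names are hypothetical, the limit is passed in as a parameter because the applicable figure should be checked against Let's Encrypt's published limits, and the CA's own accounting may differ from a plain trailing window.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;
const WEEK_MS = 7 * DAY_MS;

// Count issuances that fall inside the trailing window ending at nowMs.
function issuancesInWindow(
  issuedAtMs: readonly number[],
  nowMs: number,
  windowMs: number = WEEK_MS,
): number {
  return issuedAtMs.filter((t) => t <= nowMs && nowMs - t < windowMs).length;
}

// True if one more certificate request would stay within the given limit.
function canRequestCert(
  issuedAtMs: readonly number[],
  nowMs: number,
  limit: number,
): boolean {
  return issuancesInWindow(issuedAtMs, nowMs) < limit;
}

// Example: a hostname that was removed and re-added several times.
const sampleNowMs = 100 * DAY_MS;
const pastIssuances = [
  sampleNowMs - 8 * DAY_MS,
  sampleNowMs - 6 * DAY_MS,
  sampleNowMs - 5 * DAY_MS,
  sampleNowMs - 1 * DAY_MS,
];
console.log(issuancesInWindow(pastIssuances, sampleNowMs)); // 3 (the 8-day-old issuance has aged out)
console.log(canRequestCert(pastIssuances, sampleNowMs, 5)); // true
console.log(canRequestCert(pastIssuances, sampleNowMs, 3)); // false
```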
Option B — Cloudflare proxy
Place Cloudflare in front of proxy.daramex.org. Agency admins CNAME their hostname to proxy.daramex.org, which resolves to Cloudflare. Cloudflare terminates TLS for any hostname pointed at it.
Pros:
- No operator TLS registration step — Cloudflare handles certs automatically for all proxied hostnames.
- Cloudflare "CNAME flattening" also simplifies apex domain support (see v2 items).
- DDoS protection and CDN as a side effect.
Cons:
- Adds a hard dependency on Cloudflare (vendor lock-in).
- Requires agencies to use Cloudflare as their DNS provider, or to configure a proxy chain (their DNS provider → Cloudflare CNAME → DaraMex). This complicates the DNS instructions shown in the dashboard.
- Costs money at scale (Cloudflare for SaaS / Workers for Platforms pricing).
- Not immediately available — requires Cloudflare account provisioning and configuration before the feature can ship.
- Over-engineered for a v1 where the number of active custom domains is expected to be small.
Option C — Caddy with on-demand TLS
Replace or extend the current Traefik reverse proxy with Caddy, which supports on-demand TLS: Caddy provisions a cert on the first HTTPS request for any unknown hostname, without pre-registration.
Pros:
- Fully automated — no operator step per domain.
- On-demand TLS is a well-understood pattern for custom-domain SaaS products.
Cons:
- Requires replacing or running alongside the existing Traefik setup — a significant operational change outside the scope of this feature.
- On-demand TLS has a latency hit on the first request while cert issuance happens (ACME challenge round-trip).
- Caddy configuration, logging, and operational tooling differ from Traefik; the team would need to learn and maintain two reverse proxies (or migrate fully).
- Risk is disproportionate to the v1 user volume.
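For context on what Option C would involve on the application side: Caddy's on-demand TLS can be configured with an "ask" endpoint, which Caddy queries with the requested hostname in a domain query parameter, treating a 2xx response as permission to issue. The sketch below shows only the decision such an endpoint might make. The function name, the status map, and the "pending" status are hypothetical; "verified" is the status named in this ADR. HTTP wiring is omitted.

```typescript
type DomainStatus = "pending" | "verified";

// Decide the HTTP status an "ask" endpoint would return for a hostname.
// 200 permits issuance; any non-2xx status denies it.
function askStatusCode(
  domain: string | null,
  statuses: ReadonlyMap<string, DomainStatus>,
): number {
  if (!domain) return 400; // query parameter missing or empty
  return statuses.get(domain.toLowerCase()) === "verified" ? 200 : 403;
}

// Illustrative data only.
const domainStatuses = new Map<string, DomainStatus>([
  ["booking.kfc.com", "verified"],
  ["shop.example.com", "pending"],
]);

console.log(askStatusCode("booking.kfc.com", domainStatuses)); // 200
console.log(askStatusCode("shop.example.com", domainStatuses)); // 403
console.log(askStatusCode("unknown.example.net", domainStatuses)); // 403
```

Gating issuance on verified status in this way is what would keep on-demand TLS from issuing certs for arbitrary hostnames pointed at the server.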
Decision
v1 uses Option A: Dokploy per-domain Let's Encrypt with manual operator registration.
Rationale:
- Lowest time-to-ship: no infrastructure changes required.
- The existing Dokploy + Traefik + Let's Encrypt stack already handles cert issuance and renewal.
- The application layer is deliberately kept TLS-agnostic — the only operator-visible artifact is the Dokploy domain entry. The data model, DNS verification, caching, and API contract do not change between strategies.
- v1 projections estimate fewer than 20 simultaneously active custom domains, which is within the operational capacity of manual provisioning.
The operator runbook for this step is at: Provision TLS for a Custom Domain.
Consequences
Positive
- No new infrastructure components introduced.
- Cert renewal is handled automatically by Traefik's ACME client.
- Zero application-layer coupling to TLS: the feature ships without any infra prerequisite beyond confirming that Dokploy accepts arbitrary custom hostnames (not just *.daramex.org).
- The decision to use a stable proxy.daramex.org CNAME target (rather than a raw IP) means that if the server IP changes, only the proxy.daramex.org A record needs updating; all agency CNAMEs continue to work.
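Because agencies point at a hostname rather than an IP, a check of their DNS setup only has to compare names. A minimal sketch, assuming the CNAME records have already been looked up (for example with Node's dns.promises.resolveCname, which returns an array of target names); the function name is hypothetical and the actual DNS verification flow may differ.

```typescript
// Strip case differences and an optional trailing dot before comparing.
function normalizeDnsName(name: string): string {
  return name.trim().toLowerCase().replace(/\.$/, "");
}

// True if any CNAME record for the agency hostname targets the proxy name.
// No IP address is involved, so a server IP change does not affect the result.
function cnamePointsAtProxy(
  cnameRecords: readonly string[],
  target: string = "proxy.daramex.org",
): boolean {
  const wanted = normalizeDnsName(target);
  return cnameRecords.some((record) => normalizeDnsName(record) === wanted);
}

console.log(cnamePointsAtProxy(["proxy.daramex.org"])); // true
console.log(cnamePointsAtProxy(["Proxy.DaraMex.org."])); // true
console.log(cnamePointsAtProxy(["other.example.com"])); // false
```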
Negative
- One manual operator step per verified domain — does not scale.
- If the number of active custom domains grows beyond ~20, operator burden becomes noticeable, and missed provisioning exposes agency end-users to TLS warnings.
- Let's Encrypt rate limits (5 certificates per week for the exact same set of hostnames, 50 per week per registered domain, and account-level limits) could become a constraint under high churn (agencies repeatedly removing and re-adding the same domain).
Revisit Conditions
Reopen this decision when any of the following is true:
- The platform has more than 10 simultaneously active custom domains and operator provisioning is causing delays or errors.
- An agency requests apex domain support (see pending v2 item in feature doc) — apex support and Cloudflare TLS are strongly correlated and should be tackled together.
- A missed provisioning incident causes an agency SLA breach.
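The conditions above can be read as a single predicate. Note that the first condition combines volume and observed provisioning problems, while the other two each trigger on their own. The field and function names below are hypothetical and exist only to make the logic explicit.

```typescript
interface TlsReviewSignals {
  activeCustomDomains: number;
  provisioningDelaysOrErrors: boolean;
  apexSupportRequested: boolean;
  slaBreachFromMissedProvisioning: boolean;
}

// True when any revisit condition from this ADR holds.
function shouldRevisitTlsDecision(s: TlsReviewSignals): boolean {
  // Volume alone is not a trigger; it must coincide with delays or errors.
  const volumeTrigger = s.activeCustomDomains > 10 && s.provisioningDelaysOrErrors;
  return volumeTrigger || s.apexSupportRequested || s.slaBreachFromMissedProvisioning;
}

console.log(
  shouldRevisitTlsDecision({
    activeCustomDomains: 14,
    provisioningDelaysOrErrors: false,
    apexSupportRequested: false,
    slaBreachFromMissedProvisioning: false,
  }),
); // false
```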
Engram topic for the TLS v2 decision: sdd/agency-custom-domains/pending/tls-cloudflare-alternative.