Book 1 of 3 · Coming September 2026 · 12 chapters · 3 parts

LLM Systems in Production.

Cloud-Native Patterns for AI Engineers

Book 1 of the Full-Stack AI Engineering Series. The infrastructure layer. Production is not deployment. It is the architecture of trust under load.

By Dr. N. Khan · The Full-Stack AI Engineering Series
Set inside Nebula Financial, a fictional regulated fintech · System: NexusCore



§ 01 Overview

AI reached production in regulated finance faster than the infrastructure built to govern it. A model call that moves money, extends credit, or clears a name is not a feature. It is a regulated event that happens to produce text.

There are two rooms in every AI program. In the first, the model answers and everyone agrees the future has arrived. In the second, a regulator asks what the model received, which version served it, where the data travelled, and what it cost. The first room is crowded. The second is nearly empty. This book lives in the second room.

Data residency, the audit trail, latency, and cost are not four governance problems. They are one decision, made once per request, that a regulated institution cannot afford to make in four hundred different places. NexusCore is the gateway that makes it once: the routing and observability layer between every application and the pool of models behind it, deciding which model may answer under which budget, and recording the decision as evidence.

§ 02 The system · NexusCore

One gateway. Four planes.

Plane 1

Data plane

The request's path from application to model and back, across cloud, on-premise, and edge, inside its residency boundary.

Plane 2

Control plane

The Routing Brain and the Latency Controller, deciding which model answers and which decode strategy spends the GPU, under the request's budget.

Plane 3

Governance plane

Signed artifacts, provenance, and promotion control, so no routing policy reaches production unreviewed.

Plane 4

Evidence plane

The complete, immutable, queryable record of every decision. Evidence, not a log line.

The gateway owns five responsibilities and refuses three. It owns model selection, residency, cost attribution, the audit record, and the escalation boundary. It refuses prompt authorship, the user experience, and the purpose of the feature. The application owns the question. The gateway owns the institution's accountability in the answering.

§ 03 The patterns

The patterns that make a gateway defensible.

The Routing Brain

Reads the request, selects the smallest model the request can trust, and records why. Routing is not optimization. It is the architecture of restraint.

Research anchor

RoutingPolicy

A versioned, signed, promotable policy that names the model pool and the boundaries the optimizer may never cross.

Policy as config
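One way to picture a signed, versioned policy is the minimal sketch below. It is an illustration under stated assumptions, not NexusCore's actual format: the field names (`version`, `model_pool`, `boundaries`), the helper functions, and the demo key are all hypothetical, and HMAC-SHA256 stands in for whatever signature scheme the book describes (a real promotion pipeline would more likely use asymmetric signatures).

```python
import hashlib
import hmac
import json


def canonical_bytes(policy: dict) -> bytes:
    # Sorted keys and fixed separators give identical bytes for identical policies.
    return json.dumps(policy, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_policy(policy: dict, key: bytes) -> str:
    # HMAC-SHA256 over the canonical form (a stand-in for a real signing scheme).
    return hmac.new(key, canonical_bytes(policy), hashlib.sha256).hexdigest()


def verify_policy(policy: dict, signature: str, key: bytes) -> bool:
    # Constant-time comparison; any change to the policy invalidates the signature.
    return hmac.compare_digest(sign_policy(policy, key), signature)


# Hypothetical policy: names the model pool and the boundaries the optimizer may not cross.
policy = {
    "version": "example-1",
    "model_pool": ["small", "mid", "frontier"],
    "boundaries": {"max_cost_usd": 0.05, "allowed_regions": ["ca-central"]},
}
key = b"demo-key-not-for-production"
signature = sign_policy(policy, key)

# A copy with a loosened cost boundary no longer matches the reviewed signature.
tampered = {**policy, "boundaries": {"max_cost_usd": 5.0, "allowed_regions": ["ca-central"]}}
```

Under this sketch, a promotion step would refuse any policy for which `verify_policy` returns `False`, which is one reading of "no routing policy reaches production unreviewed".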

The three-tier model pool

Small, mid-tier, frontier. The frontier model is reserved for the requests that need it, never the default simply because nothing in the architecture argues against it.
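The idea of selecting the smallest model a request can trust, and recording why, can be sketched roughly as follows. Everything here is an assumption made for illustration: the `Model`, `Request`, and `Decision` types, the model names, the flat per-call costs, and the rule that "trust" reduces to a minimum tier. The book's Routing Brain is described as a learned policy; this is only a static approximation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    tier: int             # 0 = small, 1 = mid-tier, 2 = frontier
    cost_per_call: float  # hypothetical flat cost
    region: str


@dataclass(frozen=True)
class Request:
    min_tier: int   # lowest tier the request can trust (assumed to be known)
    budget: float   # maximum spend allowed for this call
    region: str     # residency boundary the call must stay inside


@dataclass(frozen=True)
class Decision:
    model: Optional[str]  # None means escalate instead of answering
    reason: str


POOL = [
    Model("small-v1", 0, 0.001, "ca"),
    Model("mid-v1", 1, 0.010, "ca"),
    Model("frontier-v1", 2, 0.100, "ca"),
]


def route(req: Request, pool=POOL) -> Decision:
    # Walk the pool from smallest tier upward and stop at the first eligible model.
    for m in sorted(pool, key=lambda m: m.tier):
        if m.region != req.region:
            continue  # outside the residency boundary
        if m.tier < req.min_tier:
            continue  # too small for this request to trust
        if m.cost_per_call > req.budget:
            continue  # over the request's budget
        reason = f"smallest tier >= {req.min_tier} in {req.region} within budget {req.budget}"
        return Decision(m.name, reason)
    return Decision(None, "escalate: no model satisfies residency, trust, and budget")
```

The frontier model is reachable only when a request's constraints rule out everything smaller, so it is never selected by default, and every return path carries a reason that could be written to the audit record.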

Speculative decoding

Latency as a control surface, not a model trick. The SLO sets the budget; the budget selects the decode strategy.

Research anchor
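"The SLO sets the budget; the budget selects the decode strategy" can be read as a small selection rule. The sketch below assumes each strategy has an estimated p95 latency and a relative GPU cost; the strategy list, the numbers, and the `select_strategy` helper are hypothetical placeholders, not measurements or the book's Latency Controller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DecodeStrategy:
    name: str
    est_p95_ms: float  # hypothetical estimated p95 latency
    gpu_cost: float    # hypothetical relative GPU cost (1.0 = plain decoding)


# Illustrative values only; real estimates would come from observed traffic.
STRATEGIES = [
    DecodeStrategy("plain-autoregressive", 900.0, 1.0),
    DecodeStrategy("self-speculation", 650.0, 1.2),
    DecodeStrategy("draft-model", 450.0, 1.6),
]


def select_strategy(slo_p95_ms: float, strategies=STRATEGIES) -> Optional[DecodeStrategy]:
    # Keep only strategies expected to meet the latency budget.
    fitting = [s for s in strategies if s.est_p95_ms <= slo_p95_ms]
    if not fitting:
        return None  # no strategy meets the SLO; the caller must handle this
    # Among those, spend the least GPU.
    return min(fitting, key=lambda s: s.gpu_cost)
```

A loose SLO falls through to plain decoding, and only a tight one pays for a more expensive strategy, which is the sense in which latency acts as a control surface here.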

The Router Governance Plane

Signed artifacts, provenance records, promotion control. One plane defending three lifecycles: the one that learns, the one that serves, the one that merely persists.

Research anchor

The evidence store

Complete, immutable, queryable. Compliance is a property the system already had before the auditor arrived.
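A common way to approximate "immutable and queryable" in software is an append-only log in which each record carries the hash of the one before it. The sketch below uses that technique as an assumption; the source does not say how the evidence store is built, and the `EvidenceStore` class and its field names are hypothetical. Strictly, hash chaining makes tampering detectable rather than impossible.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first record


class EvidenceStore:
    def __init__(self):
        self._records = []

    def append(self, event: dict) -> str:
        # Each record's hash covers the previous hash plus the serialized event.
        prev = self._records[-1]["hash"] if self._records else GENESIS
        body = json.dumps(event, sort_keys=True)
        digest = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
        self._records.append({"prev": prev, "event": body, "hash": digest})
        return digest

    def verify(self) -> bool:
        # Recompute the chain from the start; any edited record breaks it.
        prev = GENESIS
        for r in self._records:
            expected = hashlib.sha256((prev + r["event"]).encode("utf-8")).hexdigest()
            if r["prev"] != prev or r["hash"] != expected:
                return False
            prev = r["hash"]
        return True

    def query(self, **filters):
        # Return events whose fields equal every given filter value.
        events = (json.loads(r["event"]) for r in self._records)
        return [e for e in events if all(e.get(k) == v for k, v in filters.items())]


store = EvidenceStore()
store.append({"request_id": "r1", "model": "small-v1", "policy_version": "p7"})
store.append({"request_id": "r2", "model": "mid-v1", "policy_version": "p7"})
```

With a structure like this, a who-routed-what question becomes a call such as `store.query(policy_version="p7")`, and `store.verify()` shows whether the history has been altered since it was written.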

§ 04 The cost of no gateway

An outage built from correct components is the signature of a missing check.

Without a gateway: Each team builds its own path to a model. Locally sensible, globally catastrophic.
With NexusCore: One governed gateway every request passes through.

Without a gateway: Residency, audit, latency, and cost decided in four hundred places.
With NexusCore: One decision, made once per request.

Without a gateway: The frontier model becomes the default. A structural overspend with no owner.
With NexusCore: The smallest model the request can trust, chosen by policy.

Without a gateway: Nine days to answer one question about an eight-month-old decision.
With NexusCore: The who-routed-what report, produced on demand.

Without a gateway: A fluent falsehood delivered with total confidence that no exception handler catches.
With NexusCore: Uncertainty escalated to a human at the boundary, with its reason attached.

§ 05 Inside the book

A reference you work from.

12 chapters · 3 parts · research anchors · appendices

Part I — The Gateway and Its Foundations

Why Fintech Needs an LLM Gateway. The regulated-environment problem and the cost of the shadow integrations a gateway replaces.
Foundations of LLM Inference and Serving. Prefill and decode, continuous batching, and the cost-latency-quality triangle.
Multi-Cloud and Hybrid Topology for LLMs. Control plane over cloud, on-premise, and edge, with residency boundaries that hold.

Part II — The Routing Brain and Decode Economics

Designing the Routing Brain. From static rules to a learned, versioned routing policy trained on the logs.
Edge and On-Device Routing. Confident, or seek stronger: uncertainty-based escalation with monetary and PII hooks.
Speculative Decoding as an SRE Primitive. Draft-model, self-speculation, and suffix strategies under a latency budget.
Distributed and Pipelined Inference in Practice. Multi-GPU decoding and partitioning, throughput against tail latency.

Part III — Observability, Security, and Governance

Observability for LLM Routers and Models. The four fields every call must carry, and the routing-event log schema.
Router Lifecycle Security and Governance. The router as an attack surface, defended by the Router Governance Plane.
Compliance, Audit Trails, and AI Conformance. The evidence store, governed whitelists, and controls mapped to TOGAF and DMBOK.
Case Study: A Trading-Desk Outage and a Router Rollback. A bad version, eleven green minutes, and the conformance test that would have caught it.
The NexusCore Blueprint. The full reference architecture, and the handoff into Book 2.

§ 06 Who it is for

For the engineer who already runs production.

Site reliability engineers · Cloud architects · Platform engineers · ML platform leads · Heads of AI infrastructure · CTOs & VPs of Engineering · SREs carrying AI traffic · Audit & compliance partners

Written for readers who think in service level objectives, error budgets, and percentiles. The bar on AI is deliberately low: if you know that a model takes a prompt and produces text, non-deterministically, you have enough to begin. It is written for the engineer you used to manage, and the one you are now.

§ 07 From the manuscript

A gateway is not plumbing. It is the place where an institution decides what it is allowed to think, and how much that thought may cost.

LLM Systems in Production
Chapter 1 · Draft manuscript

§ 08 The series

One discipline, observed from three altitudes.

Three books, one fictional regulated fintech, Nebula Financial, and three systems that are not three products but three faces of one platform, each owning a layer of the stack.

Book 1 · NexusCore · LLM Systems in Production · Infrastructure
The routing and observability gateway that decides which model may answer, under which latency, cost, and risk budget, and records the decision as evidence.

Book 2 · AgentMesh · Prompt Systems & Agent Orchestration · Application
The catalog of record and orchestration engine that turns a pile of prompts into governed agents that plan, act, verify, and submit to human review.

Book 3 · ThinkFlow · DevOps for AI-Native Platforms · Operations
The AI-augmented internal developer platform that builds, ships, and governs the models, agents, and code the other two layers depend on.

Reading the series? The free Cross-Book Navigation Guide and Series Cheat Sheet map the thread across all three books. Read on the page or download the print-ready PDFs.

§ 09 About the author

Dr. N. Khan

Dr. N. Khan is an enterprise AI architect and governance advisor with twenty-five years building AI and machine-learning systems at scale. As Principal Architect at iSystematic, he designs the full stack of governed production AI: the LLM infrastructure that routes it, the agents that act on it, the platform that ships it, and the governance that keeps all three defensible.

His practice sits at an unusual intersection of supervisory regulation, quantitative model risk, and enterprise architecture (TOGAF, DMBOK, ISO 27001, SOC 2). He holds a PhD spanning neuro-marketing and computer science, and he is the author of the AI governance Enterprise Playbook.

Dr. Khan writes as a practitioner. His frameworks are built to be used, contested, and adapted, not merely read. He works across North America and the GCC, from stations in Toronto, Calgary, and Winnipeg, Canada, with active advisory engagements in the UAE, Qatar, Kuwait, and Saudi Arabia. More at nabeelkhan.com.

§ 10 Be first to read it

When it ships, you will know first.

The book is in draft. Leave a name and a working email, and you get one note when Book 1 publishes. No list, no noise, no second message.


Speed without governance is debt. Governance is the architecture that lets speed compound.


Fin · Book I of III
