Book 1 of 3 · Forthcoming · 2026 · 12 chapters · 3 parts

LLM Systems in Production.

Cloud-Native Patterns for AI Engineers

Book 1 of the Full-Stack AI Engineering Series. The infrastructure layer. Production is not deployment. It is the architecture of trust under load.

Forthcoming · 2026 · the infrastructure layer
§ 01 Overview

AI reached production in regulated finance faster than the infrastructure built to govern it. A model call that moves money, extends credit, or clears a name is not a feature. It is a regulated event that happens to produce text.

There are two rooms in every AI program. In the first, the model answers and everyone agrees the future has arrived. In the second, a regulator asks what the model received, which version served it, where the data travelled, and what it cost. The first room is crowded. The second is nearly empty. This book lives in the second room.

Data residency, the audit trail, latency, and cost are not four governance problems. They are one decision, made once per request, that a regulated institution cannot afford to make in four hundred different places. NexusCore is the gateway that makes it once: the routing and observability layer between every application and the pool of models behind it, deciding which model may answer under which budget, and recording the decision as evidence.

§ 02 The system · NexusCore

One gateway. Four planes.

Plane 1 · Data plane
The request's path from application to model and back, across cloud, on-premise, and edge, inside its residency boundary.

Plane 2 · Control plane
The Routing Brain and the Latency Controller, deciding which model answers and which decode strategy spends the GPU, under the request's budget.

Plane 3 · Governance plane
Signed artifacts, provenance, and promotion control, so no routing policy reaches production unreviewed.

Plane 4 · Evidence plane
The complete, immutable, queryable record of every decision. Evidence, not a log line.

The gateway owns five responsibilities and refuses three. It owns model selection, residency, cost attribution, the audit record, and the escalation boundary. It refuses prompt authorship, the user experience, and the purpose of the feature. The application owns the question. The gateway owns the institution's accountability in the answering.

§ 03 The patterns

The patterns that make a gateway defensible.

The Routing Brain

01

Reads the request, selects the smallest model the request can trust, and records why. Routing is not optimization. It is the architecture of restraint.

Research anchor
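
A minimal sketch of that selection rule, under stated assumptions: the tier names follow the three-tier pool described on this page, while `ModelTier`, `trust_ceiling`, `risk_level`, and `route` are hypothetical names invented for illustration and are not NexusCore's actual interface. It walks the pool from smallest to largest, returns the first tier trusted for the request, and attaches the reason to the decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    trust_ceiling: int  # hypothetical: highest risk level this tier may answer


@dataclass(frozen=True)
class RoutingDecision:
    model: str
    reason: str  # recorded alongside the choice, so the "why" is never lost


# Illustrative pool, ordered smallest first. The ceilings are invented values.
POOL = (
    ModelTier("small", trust_ceiling=1),
    ModelTier("mid", trust_ceiling=2),
    ModelTier("frontier", trust_ceiling=3),
)


def route(risk_level: int, pool=POOL) -> RoutingDecision:
    """Return the smallest tier whose trust ceiling covers the request."""
    for tier in pool:
        if risk_level <= tier.trust_ceiling:
            reason = f"risk {risk_level} <= ceiling {tier.trust_ceiling} of '{tier.name}'"
            return RoutingDecision(tier.name, reason)
    # No tier is trusted for this request: hand it to a human, with the reason.
    return RoutingDecision("human-escalation", f"risk {risk_level} exceeds every tier in the pool")


decision = route(2)
print(decision.model, "|", decision.reason)
```

Because the pool is ordered smallest first, the frontier tier is reached only when the smaller tiers have been ruled out, which is one way to encode restraint rather than optimization.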

RoutingPolicy

02

A versioned, signed, promotable policy that names the model pool and the boundaries the optimizer may never cross.

Policy as config
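
What a versioned, signed policy could look like is sketched below. Everything here is an assumption for illustration: the field names, the region string, the budget figure, and the use of an HMAC as a stand-in for whatever signing scheme the book actually describes. The point is only that a policy serialized canonically can be signed once and rejected if any boundary is altered afterwards.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RoutingPolicy:
    version: str
    model_pool: tuple       # tiers the router may choose from
    allowed_regions: tuple  # residency boundary the optimizer may never cross
    max_cost_usd: float     # per-request budget ceiling


def canonical_bytes(policy: RoutingPolicy) -> bytes:
    """Serialize deterministically so the same policy always signs the same way."""
    return json.dumps(asdict(policy), sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(policy: RoutingPolicy, key: bytes) -> str:
    # HMAC-SHA256 is a stand-in; a real deployment might use asymmetric signatures.
    return hmac.new(key, canonical_bytes(policy), hashlib.sha256).hexdigest()


def verify(policy: RoutingPolicy, signature: str, key: bytes) -> bool:
    return hmac.compare_digest(sign(policy, key), signature)


# Hypothetical values throughout.
KEY = b"demo-key-not-for-production"
POLICY = RoutingPolicy("2026.1", ("small", "mid", "frontier"), ("ca-central",), 0.05)
SIGNATURE = sign(POLICY, KEY)

# Raising the budget without re-signing makes verification fail.
TAMPERED = replace(POLICY, max_cost_usd=5.0)
print(verify(POLICY, SIGNATURE, KEY), verify(TAMPERED, SIGNATURE, KEY))
```

Promotion control would sit on top of this: a policy reaches production only if its signature verifies against a key held by the reviewing party.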

The three-tier model pool

03

Small, mid-tier, frontier. The frontier model is reserved for the requests that need it. It is never the default simply because nothing in the architecture argues against it.

Speculative decoding

04

Latency as a control surface, not a model trick. The SLO sets the budget; the budget selects the decode strategy.

Research anchor
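
The chain from SLO to budget to decode strategy can be sketched as follows. The strategy names, latency estimates, and relative GPU costs are invented for illustration, and the sketch assumes that speculative decoding buys lower latency at the price of extra GPU work, which in practice depends on the workload. `latency_budget_ms` and `select_strategy` are hypothetical helpers, not the Latency Controller's real interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeStrategy:
    name: str
    est_p95_ms: int  # assumed p95 latency for this strategy
    gpu_cost: float  # assumed relative GPU cost per request


# Illustrative numbers only.
STRATEGIES = (
    DecodeStrategy("standard", est_p95_ms=1200, gpu_cost=1.0),
    DecodeStrategy("speculative", est_p95_ms=600, gpu_cost=1.4),
)


def latency_budget_ms(slo_p95_ms: int, overhead_ms: int) -> int:
    """The SLO sets the budget: what remains after routing and network overhead."""
    return slo_p95_ms - overhead_ms


def select_strategy(budget_ms: int, strategies=STRATEGIES):
    """The budget selects the strategy: the cheapest one that fits, or None."""
    fitting = [s for s in strategies if s.est_p95_ms <= budget_ms]
    return min(fitting, key=lambda s: s.gpu_cost) if fitting else None


relaxed = select_strategy(latency_budget_ms(slo_p95_ms=1500, overhead_ms=200))
tight = select_strategy(latency_budget_ms(slo_p95_ms=1000, overhead_ms=200))
print(relaxed.name, tight.name)
```

A `None` result means no strategy fits the budget. In this sketch that case is left to the caller, for example to reject the request or escalate it.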

The Router Governance Plane

05

Signed artifacts, provenance records, promotion control. Three lifecycles defended, one plane: the one that learns, the one that serves, the one that merely persists.

Research anchor

The evidence store

06

Complete, immutable, queryable. Compliance is a property the system already had before the auditor arrived.
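
One common way to approximate an immutable, queryable record is a hash-chained append-only log, sketched here in memory. This is an assumption about technique, not a description of NexusCore's evidence store: `EvidenceStore` and its methods are hypothetical, and the chain makes tampering detectable rather than impossible.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first record


class EvidenceStore:
    """In-memory, append-only log where each record commits to the one before it."""

    def __init__(self):
        self._records = []

    def append(self, decision: dict) -> str:
        prev_hash = self._records[-1]["hash"] if self._records else GENESIS
        body = json.dumps(decision, sort_keys=True)
        digest = hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()
        self._records.append({"prev": prev_hash, "body": body, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered record breaks it."""
        prev_hash = GENESIS
        for rec in self._records:
            expected = hashlib.sha256((prev_hash + rec["body"]).encode("utf-8")).hexdigest()
            if rec["prev"] != prev_hash or rec["hash"] != expected:
                return False
            prev_hash = rec["hash"]
        return True

    def query(self, **filters) -> list:
        """Return every recorded decision whose fields match all filters."""
        decisions = (json.loads(rec["body"]) for rec in self._records)
        return [d for d in decisions if all(d.get(k) == v for k, v in filters.items())]


store = EvidenceStore()
store.append({"request_id": "r1", "model": "small", "policy": "2026.1"})
store.append({"request_id": "r2", "model": "frontier", "policy": "2026.1"})
print(store.verify(), store.query(model="frontier"))
```

The `query` method is the who-routed-what report in miniature: the answer is read from the record, not reconstructed after the fact.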

§ 04 The cost of no gateway

An outage built from correct components is the signature of a missing check.

Without a gateway: Each team builds its own path to a model. Locally sensible, globally catastrophic.
With NexusCore: One governed gateway every request passes through.

Without a gateway: Residency, audit, latency, and cost decided in four hundred places.
With NexusCore: One decision, made once per request.

Without a gateway: The frontier model becomes the default. A structural overspend with no owner.
With NexusCore: The smallest model the request can trust, chosen by policy.

Without a gateway: Nine days to answer one question about an eight-month-old decision.
With NexusCore: The who-routed-what report, produced on demand.

Without a gateway: A fluent falsehood delivered with total confidence that no exception handler catches.
With NexusCore: Uncertainty escalated to a human at the boundary, with its reason attached.

§ 05 Inside the book

A reference you work from.

12 chapters · 3 parts · 3 research anchors · 6 appendices

Part I — The Gateway and Its Foundations

Part II — The Routing Brain and Decode Economics

Part III — Observability, Security, and Governance

§ 06 Who it is for

For the engineer who already runs production.

Site reliability engineers · Cloud architects · Platform engineers · ML platform leads · Heads of AI infrastructure · CTOs & VPs of Engineering · SREs carrying AI traffic · Audit & compliance partners

Written for readers who think in service level objectives, error budgets, and percentiles. The bar on AI is deliberately low: if you know that a model takes a prompt and produces text, non-deterministically, you have enough to begin. It is written for the engineer you used to manage, and the one you are now.

§ 07 From the manuscript

A gateway is not plumbing. It is the place where an institution decides what it is allowed to think, and how much that thought may cost.
LLM Systems in Production
Chapter 1 · Draft manuscript

§ 08 The series

One discipline, observed from three altitudes.

Three books, one fictional regulated fintech, Nebula Financial, and three systems that are not three products but three faces of one platform, each owning a layer of the stack.

§ 09 About the author

Dr. N. Khan

Dr. N. Khan is an enterprise AI architect and governance advisor with twenty-five years building AI and machine-learning systems at scale. As Principal Architect at iSystematic, he designs the full stack of governed production AI: the LLM infrastructure that routes it, the agents that act on it, the platform that ships it, and the governance that keeps all three defensible.

His practice sits at an unusual intersection of supervisory regulation, quantitative model risk, and enterprise architecture (TOGAF, DMBOK, ISO 27001, SOC 2). He holds a PhD spanning neuro-marketing and computer science, and he is the author of the AI governance Enterprise Playbook.

Dr. Khan writes as a practitioner. His frameworks are built to be used, contested, and adapted, not merely read. He is based in Winnipeg, Canada, with active advisory engagements across MENA. More at nabeelkhan.com.

§ 10 Be first to read it

When it ships, you will know first.

The book is in draft. Leave a name and a working email, and you get one note when Book 1 publishes. No list, no noise, no second message.

Speed without governance is debt. Governance is the architecture that lets speed compound.

Fin · Book I of III