Building with LLMs — Stanford AI Engineering Study Guide

01

LLM Limitations — What You're Working Around

Understanding the constraints shapes every design decision. All techniques in this guide exist to solve one or more of these four problems.

The Four Root Limitations

Domain Knowledge Gaps

Base models lack proprietary data, recent events, and internal docs.

Context Window Limits

Can't hold arbitrarily long history — forces architectural choices.

Hallucinations

Generates plausible-sounding but incorrect outputs with confidence.

Difficulty of Control

Hard to get consistent, structured outputs reliably across inputs.

Which Limitation Drives Which Technique

graph TD
    A[Domain Knowledge Gap] -->|solve with| B[RAG]
    C[Context Window Limit] -->|solve with| D[Chunking and Memory]
    E[Hallucinations] -->|reduce with| F[Grounding + Evals]
    G[Control Difficulty] -->|solve with| H[Prompt Engineering]
    classDef lim fill:#1c2333,stroke:var(--accent),color:#e6edf3
    classDef sol fill:#0d2520,stroke:#0d9488,color:#e6edf3
    class A,C,E,G lim
    class B,D,F,H sol

These four problems are the root motivation for prompt engineering, RAG, fine-tuning, and agentic workflows.


Free Tool

Claude Workspace Optimizer

Automatically configure Claude Code for your stack. Rules, hooks, and memory set up in minutes.


02

Prompt Engineering

Zero-Shot vs Few-Shot

Zero-shot: give the LLM a task description only, no examples. Always start here.
Few-shot: provide 2–5 input/output examples before the actual task.
Few-shot helps when the model needs to understand output format, tone, or domain-specific patterns.
Rule of thumb: try zero-shot first; add examples only when output quality is inconsistent.
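
The difference between the two styles can be sketched as a single prompt builder. This is a minimal illustration, assuming a plain-text "Input:/Output:" layout; the function name `build_prompt` and the sentiment task are hypothetical and not from the lecture.

```python
from typing import Sequence


def build_prompt(task: str, query: str,
                 examples: Sequence[tuple[str, str]] = ()) -> str:
    """Return a zero-shot prompt, or a few-shot prompt when examples are given."""
    parts = [task]
    for text, label in examples:
        # Each worked example shows the model the expected output format.
        parts.append(f"Input: {text}\nOutput: {label}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


task = "Classify the sentiment of the input as positive or negative."
zero_shot = build_prompt(task, "The battery died in a day.")
few_shot = build_prompt(
    task,
    "The battery died in a day.",
    examples=[("Love it, works great.", "positive"),
              ("Arrived broken.", "negative")],
)
```

Starting with `zero_shot` and switching to `few_shot` only when outputs are inconsistent keeps the prompt as short as the task allows.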

Chain-of-Thought (CoT)

Instruct the model to "think step by step" before giving a final answer.
Often substantially improves performance on reasoning, math, and multi-step logic tasks.
Works because intermediate tokens give the model space to work through the problem.
Can be explicit ("let's think step by step") or via few-shot examples that show reasoning.

Prompt Chaining

Break a complex task into sequential prompts, each with a focused sub-task.
Output of step N becomes input to step N+1.
Benefits: easier to debug, lets you inject validation between steps, and shorter, focused prompts tend to perform better.
Use cases: research → draft → edit pipelines; extract → classify → respond flows.
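
An extract → classify → respond chain might look like the sketch below. It assumes the model is available as a plain callable `llm(prompt) -> str`; `run_chain`, `fake_llm`, and the prompt wording are hypothetical stand-ins so the example runs without an API.

```python
from typing import Callable


def run_chain(llm: Callable[[str], str], email: str) -> str:
    """Run extract -> classify -> respond, validating between steps."""
    order_id = llm(
        f"Extract the order ID from this email. Reply with the ID only.\n\n{email}"
    ).strip()
    if not order_id.isdigit():  # validation injected between steps
        raise ValueError(f"unexpected order ID: {order_id!r}")
    intent = llm(
        f"Classify the intent as TRACKING, REFUND, or ADDRESS_CHANGE.\n\n{email}"
    ).strip()
    # The final step consumes the outputs of the earlier steps.
    return llm(f"Write a short reply about order {order_id} for a {intent} request.")


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption for this sketch)."""
    if prompt.startswith("Extract"):
        return "4821"
    if prompt.startswith("Classify"):
        return "REFUND"
    return "REPLY: " + prompt


reply = run_chain(fake_llm, "Hi, I want a refund for order 4821.")
```

Because each step is a separate call, a bad extraction fails loudly at the validation check instead of silently producing a wrong reply.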

Prompt Templates

Parameterize prompts so the structure is stable — only the inputs change.
Makes prompts testable, versionable, and reusable across similar tasks.
Store templates separately from business logic.
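
A minimal sketch of a parameterized prompt using Python's standard `string.Template`. The template text, the `SUMMARY_V1` name, and `render_summary_prompt` are illustrative assumptions; in practice the template would more likely live in a versioned file than in a constant.

```python
from string import Template

# The template is kept apart from the business logic that fills it in.
SUMMARY_V1 = Template(
    "Summarize the following $doc_type in at most $max_words words:\n\n$text"
)


def render_summary_prompt(doc_type: str, text: str, max_words: int = 50) -> str:
    # substitute() raises KeyError if a placeholder is left unfilled,
    # which makes missing inputs easy to catch in tests.
    return SUMMARY_V1.substitute(doc_type=doc_type, max_words=max_words, text=text)


prompt = render_summary_prompt(
    "support ticket", "Customer cannot log in after password reset."
)
```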

LLM Judges / Evals with Promptfoo

Use an LLM to evaluate whether another LLM's output meets quality criteria. Define rubrics: what does "good" look like for this task? Useful for subjective quality (tone, helpfulness, completeness) where rule-based checks fail.

Promptfoo: open-source tool for running LLM evals at scale.
Define test cases with expected behaviors; run across model versions to detect regressions.
Generates quality reports automatically.

Which Prompt Technique to Use — Decision Tree

graph TD
    START([New Task]) --> Q1{Output quality OK?}
    Q1 -->|Yes| SHIP([Ship Zero-Shot])
    Q1 -->|No| Q2{Format or style issue?}
    Q2 -->|Yes| FS[Add Few-Shot Examples]
    Q2 -->|No| Q3{Multi-step reasoning?}
    Q3 -->|Yes| COT[Chain-of-Thought]
    Q3 -->|No| Q4{Complex pipeline?}
    Q4 -->|Yes| CHAIN[Prompt Chaining]
    Q4 -->|No| FS
    classDef dec fill:#1c2333,stroke:#d4a843,color:#e6edf3
    classDef sol fill:#0d2520,stroke:#0d9488,color:#e6edf3
    classDef term fill:#1a110a,stroke:var(--accent),color:#e6edf3
    class Q1,Q2,Q3,Q4 dec
    class FS,COT,CHAIN sol
    class START,SHIP term

Chain-of-Thought — Live Example

Prompt

"A store sells items at $12, $8, $15. Customer pays $50. Think step by step, then give the change."

Model thinking

Step 1: Total cost = $12 + $8 + $15 = $35

Step 2: Customer paid $50

Step 3: Change = $50 − $35 = $15

Answer

The change is $15.

Without CoT, a model is more likely to make an arithmetic slip on problems like this, especially as the number of steps grows. The intermediate tokens act as working space: each step's result is written out, and the next step can condition on it.
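
In code, using CoT usually means two small pieces: appending the instruction to the prompt, and pulling the final answer back out of the reasoning. The sketch below assumes the model is asked to end with a line of the form `Answer: <value>`; that convention, and the names `cot_prompt` and `parse_answer`, are assumptions for illustration.

```python
import re

COT_SUFFIX = ("Think step by step, then give the final answer "
              "on its own line as 'Answer: <value>'.")


def cot_prompt(question: str) -> str:
    return f"{question}\n{COT_SUFFIX}"


def parse_answer(completion: str):
    """Return the text after the last 'Answer:' line, or None if absent."""
    matches = re.findall(r"^Answer:\s*(.+)$", completion, flags=re.MULTILINE)
    return matches[-1].strip() if matches else None


# A completion shaped like the worked example above (hand-written, not model output).
completion = (
    "Step 1: Total cost = 12 + 8 + 15 = 35\n"
    "Step 2: Change = 50 - 35 = 15\n"
    "Answer: $15"
)
answer = parse_answer(completion)
```

Separating the reasoning from a parseable final line lets downstream code check the answer without having to interpret the intermediate steps.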


03

Fine-Tuning — When (Not) to Use It

Core recommendation: avoid fine-tuning. The lecturer strongly recommends against it for most teams. Prompt engineering and RAG cover the vast majority of cases.

Why to Avoid It

Overfitting risk: fine-tuning on small datasets leads to narrow behavior that fails on edge cases.
Goes stale quickly: a fine-tuned model becomes outdated as base models improve, so you're constantly re-tuning.
Data collection is hard: requires high-quality labeled examples that are expensive to produce.
Prompt engineering often wins: a well-crafted prompt frequently achieves what fine-tuning would.

The Slack Fine-Tuning Failure (Case Study)

Slack attempted to fine-tune a model for internal use. The fine-tuned model regressed on general capabilities, became brittle on edge cases, and required ongoing maintenance as base models updated. The effort wasn't worth the outcome.

When Fine-Tuning Is Justified

Very specific output format that prompts can't reliably produce.
Latency or cost requirements that call for a smaller model, tuned to reach the quality bar.
You have thousands of high-quality labeled examples.
Task is stable — won't change as base models improve.

Should I Fine-Tune? — Decision Tree

graph TD
    A([Considering Fine-Tuning]) --> B[Try Prompt Engineering First]
    B --> C{Works?}
    C -->|Yes| D([Ship It])
    C -->|No| E[Try RAG]
    E --> F{Works?}
    F -->|Yes| D
    F -->|No| G{1000+ quality examples?}
    G -->|No| H([Collect data first])
    G -->|Yes| I{Task stable?}
    I -->|No| J([Wait for better base model])
    I -->|Yes| K([Fine-tune])
    classDef dec fill:#1c2333,stroke:#d4a843,color:#e6edf3
    classDef ship fill:#0d2520,stroke:#0d9488,color:#e6edf3
    classDef warn fill:#1a1208,stroke:#d4a843,color:#e6edf3
    classDef start fill:#1a110a,stroke:var(--accent),color:#e6edf3
    class C,F,G,I dec
    class D,K ship
    class H,J warn
    class A start
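
The same decision tree can be written as a small function, which makes the ordering of the checks explicit. The function and parameter names are hypothetical; the 1000-example threshold is the one shown in the tree.

```python
def fine_tune_decision(prompting_works: bool, rag_works: bool,
                       n_quality_examples: int, task_stable: bool) -> str:
    """Walk the fine-tuning decision tree from top to bottom."""
    if prompting_works:
        return "ship it (prompt engineering)"
    if rag_works:
        return "ship it (RAG)"
    if n_quality_examples < 1000:
        return "collect data first"
    if not task_stable:
        return "wait for better base model"
    return "fine-tune"


decision = fine_tune_decision(
    prompting_works=False, rag_works=False,
    n_quality_examples=5000, task_stable=True,
)
```

Fine-tuning is only reached when every cheaper option has been ruled out and both the data and stability conditions hold.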

RAG vs Fine-Tuning — Head to Head

Concern	RAG	Fine-Tuning
Fresh Data	Update index	Retrain
Debuggability	Inspect docs	Opaque
Cost	Inference + retrieve	Train + infer
Coverage	Scales freely	Fixed to training


04

Retrieval-Augmented Generation (RAG)

RAG solves the domain knowledge problem: instead of baking proprietary knowledge into the model, you retrieve it at query time.

Vector Databases

Store documents as embeddings — dense numerical representations of meaning.
Semantic search: find documents similar in meaning, not just keyword matches.
Common options: Pinecone, Weaviate, pgvector (Postgres), Chroma.
Embedding model choice matters — different models have different tradeoffs for your domain.

Chunking Strategy

Split source documents into chunks before embedding.
Chunk size is critical: too small loses context; too large dilutes relevance signal.
Common approaches: fixed-size with overlap, sentence-boundary aware, paragraph-level.
Experiment for your specific domain — no universal right answer.
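
A minimal sketch of the fixed-size-with-overlap approach. It counts characters rather than tokens to stay dependency-free, which is a simplification; the function name and default sizes are assumptions, not recommendations.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap  # how far each window advances
    # Stop once the previous window has already reached the end of the text.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


chunks = chunk_text("abcdefghij", size=4, overlap=2)
# chunks == ["abcd", "cdef", "efgh", "ghij"]
```

The overlap means a sentence cut at one chunk boundary still appears whole in the neighboring chunk, at the cost of storing some text twice.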

HyDE — Hypothetical Document Embeddings

Problem: questions and the passages that answer them are phrased differently, so their embeddings can land far apart even though they share one vector space. A user asking "why is the sky blue?" doesn't match well against a physics textbook paragraph.

Ask the LLM to generate a hypothetical answer to the query.
Embed that hypothetical answer.
Use that embedding for retrieval instead of the original query.

The hypothetical answer matches the document's "answer language" better than the question does, even if some of its details are wrong. This can improve retrieval quality on knowledge-intensive tasks, at the cost of one extra LLM call per query.
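
The three steps can be sketched as one function that takes the generator and the embedder as parameters. Everything named here (`hyde_query_vector`, `toy_generate`, `toy_embed`) is hypothetical; the toy embedder just counts three keyword prefixes so the effect is visible without a real model.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def hyde_query_vector(query: str,
                      generate: Callable[[str], str],
                      embed: Callable[[str], Vector]) -> Vector:
    """Embed a hypothetical answer to the query instead of the query itself."""
    hypothetical = generate(f"Write a short passage that answers the question: {query}")
    return embed(hypothetical)


def toy_generate(prompt: str) -> str:
    # Stand-in for an LLM call (assumption for this sketch).
    return "Rayleigh scattering of sunlight by air molecules makes the sky look blue."


def toy_embed(text: str) -> list[float]:
    # Stand-in for an embedding model: counts words starting with each keyword.
    words = text.lower().split()
    return [float(sum(w.startswith(k) for w in words))
            for k in ("sky", "scatter", "molecule")]


query = "why is the sky blue?"
plain_vec = toy_embed(query)                                   # [1.0, 0.0, 0.0]
hyde_vec = hyde_query_vector(query, toy_generate, toy_embed)   # [1.0, 1.0, 1.0]
```

The HyDE vector picks up "scatter" and "molecule", terms a physics passage would contain but the original question does not.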

RAG Pipeline — Indexing + Query Phases

graph TD
    subgraph INDEX["Indexing Phase (offline)"]
        D[Source Documents] --> CH[Chunk Text]
        CH --> EMB[Embed Chunks]
        EMB --> VDB[(Vector Database)]
    end
    subgraph QUERY["Query Phase (real-time)"]
        Q[User Query] --> QE[Embed Query]
        QE --> SR[Semantic Search]
        SR --> VDB
        VDB --> TOP[Top-K Docs Retrieved]
        TOP --> CTX[Inject into Context]
        CTX --> LLM[LLM]
        LLM --> ANS[Grounded Answer]
    end
    classDef idx fill:#0d2520,stroke:#0d9488,color:#e6edf3
    classDef qry fill:#1a110a,stroke:var(--accent),color:#e6edf3
    classDef db fill:#1c2333,stroke:#d4a843,color:#e6edf3
    class D,CH,EMB idx
    class Q,QE,SR,TOP,CTX,LLM,ANS qry
    class VDB db
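
Both phases of the pipeline fit in a few lines if the embedding model and vector database are replaced with toys. In this sketch a bag-of-words `Counter` stands in for the embedding and a Python list stands in for the vector database; both are simplifying assumptions, as are the sample chunks.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Indexing phase (offline): embed each chunk and store it.
chunks = [
    "refunds are issued within 5 business days",
    "orders ship from our warehouse within 24 hours",
    "you can change your address before the order ships",
]
index = [(chunk, embed(chunk)) for chunk in chunks]


# Query phase (real-time): embed the query, rank by similarity, keep the top k.
def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


question = "how long do refunds take"
top = retrieve(question, k=1)
prompt = ("Answer using only this context:\n" + "\n".join(top)
          + f"\n\nQuestion: {question}")
```

The final `prompt` is what gets sent to the LLM: retrieved text first, question last, so the answer is grounded in the retrieved chunk.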

HyDE — Bridging the Semantic Space Gap

[Figure: a raw query embedding sits in "query space", away from the documents in "answer space"; the HyDE embedding bridges the gap and yields a better match.]

Standard query → mismatch · HyDE answer embedding → retrieves better documents

RAG vs Fine-Tuning Comparison

Concern	RAG	Fine-Tuning
Fresh data	Easy — update the index	Hard — retrain required
Debugging	Inspect retrieved docs	Opaque
Cost	Inference + retrieval	Training + inference
Coverage	Scales to large knowledge bases	Fixed to training distribution


05

Agentic Workflows

Andrew Ng's definition: An agentic workflow is one where an LLM iterates on its output — it can plan, act, observe results, and revise. The LLM isn't just doing one forward pass; it's deciding what to do next based on what happened.

Memory Types

Working memory (short-term): the current context window. What the agent knows "right now." Limited by context window size.
Archival memory (long-term): external storage (databases, vector stores, files) the agent can read/write. Enables persistence across sessions. This is what lets an agent "remember" past interactions.

Tools

Tools transform the LLM from a question-answerer into an actor.

Web search: access current information.
Code execution: run Python, test logic.
Database read/write: query and update persistent state.
API calls: interact with external services (calendar, email, CRM).
File system: read documents, save outputs.

MCP — Model Context Protocol

Standardized protocol for agents to call tools and other services.
Treats tools as typed interfaces: here's what this tool accepts, here's what it returns.
Enables tool reuse across different agents and systems.
In multi-agent systems, a sub-agent can be exposed through MCP like any other tool, so to the caller an agent is just another tool.
Open standard introduced by Anthropic, gaining adoption across the ecosystem.
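
The "typed interface" idea can be illustrated without any SDK. The sketch below is not the MCP API: `Tool`, `call`, and `get_order_status` are hypothetical names, and the validation only checks required arguments. It mirrors the general shape of a tool description (a name, a description, and a JSON-Schema-style input definition).

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Tool:
    name: str
    description: str
    input_schema: dict            # JSON-Schema-style description of accepted arguments
    handler: Callable[..., Any]   # the code that actually does the work

    def call(self, **kwargs: Any) -> Any:
        # Reject calls that omit an argument the schema marks as required.
        missing = [k for k in self.input_schema.get("required", []) if k not in kwargs]
        if missing:
            raise TypeError(f"{self.name}: missing arguments {missing}")
        return self.handler(**kwargs)


get_order_status = Tool(
    name="get_order_status",
    description="Look up the shipping status of an order.",
    input_schema={
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
    handler=lambda order_id: {"order_id": order_id, "status": "shipped"},
)

result = get_order_status.call(order_id="4820")
```

Because the agent only sees the name, description, and schema, the handler behind it could equally be a database query, an HTTP API, or another agent.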

Autonomy Spectrum

Level	Description
Low autonomy	Human approves every action
Medium	Agent acts, human reviews outputs
High autonomy	Agent runs end-to-end, humans get alerts on exceptions

Start low autonomy, earn trust, expand. Don't give full autonomy to a system you haven't validated.

The Agent Loop

graph TD
    GOAL([User Goal]) --> PLAN[Plan Next Action]
    PLAN --> TOOL[Select Tool]
    TOOL --> ACT[Execute Action]
    ACT --> OBS[Observe Result]
    OBS --> EVAL{Goal Achieved?}
    EVAL -->|No - revise| PLAN
    EVAL -->|Yes| RESP([Return to User])
    MEM1[(Working Memory)] -.->|read/write| PLAN
    MEM2[(Archival Memory)] -.->|read/write| ACT
    classDef loop fill:#1c2333,stroke:var(--accent),color:#e6edf3
    classDef mem fill:#0d2520,stroke:#0d9488,color:#e6edf3
    classDef term fill:#1a110a,stroke:var(--accent),color:#e6edf3
    classDef dec fill:#1a1208,stroke:#d4a843,color:#e6edf3
    class PLAN,TOOL,ACT,OBS loop
    class MEM1,MEM2 mem
    class GOAL,RESP term
    class EVAL dec
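
A stripped-down version of this loop, assuming the planner is a function that returns an `(action, argument)` pair and signals completion with the action `"finish"`. The names `run_agent`, `toy_plan`, and `lookup_order` are hypothetical; in a real agent the planner would be an LLM call.

```python
def run_agent(goal, plan, tools, max_steps=5):
    """Plan -> act -> observe until the planner says 'finish' or steps run out."""
    observations = []  # working memory for this run
    for _ in range(max_steps):
        action, arg = plan(goal, observations)
        if action == "finish":
            return arg
        # Execute the chosen tool and record what happened.
        observations.append((action, tools[action](arg)))
    raise RuntimeError("step budget exhausted before the goal was reached")


def toy_plan(goal, observations):
    # Stand-in for an LLM planner (assumption for this sketch).
    if not observations:
        return ("lookup_order", "4821")
    return ("finish", f"Order 4821 is {observations[-1][1]}.")


tools = {"lookup_order": lambda order_id: "shipped"}
answer = run_agent("Where is order 4821?", toy_plan, tools)
```

The `max_steps` budget is a common safeguard: without it, a planner that never decides it is done would loop indefinitely.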

Working Memory vs Archival Memory

Working Memory (short-term)

System: You are a helpful assistant...

User: Process order #4821

Agent: Looking up order #4821...

Context window usage

72% full — lost on session end

Archival Memory (long-term)

user_prefs: {theme: dark, lang: en}

last_order: {id: 4820, status: shipped}

session_2026-04-10: resolved issue

session_2026-04-09: asked about returns

→ persists across all sessions
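
The two memory types can be sketched as one small class. This is an illustration under stated assumptions: a bounded `deque` of messages stands in for the context window (a real limit is measured in tokens, not messages), and a dict stands in for a database or vector store. The class and method names are hypothetical.

```python
from collections import deque


class AgentMemory:
    def __init__(self, max_messages: int = 3):
        self.working = deque(maxlen=max_messages)  # oldest messages fall off
        self.archival = {}                         # stands in for external storage

    def add_message(self, role: str, text: str) -> None:
        self.working.append((role, text))

    def remember(self, key: str, value) -> None:
        self.archival[key] = value

    def end_session(self) -> None:
        self.working.clear()  # context is lost; archival memory persists


memory = AgentMemory(max_messages=2)
memory.add_message("user", "Process order #4821")
memory.add_message("agent", "Looking up order #4821...")
memory.add_message("agent", "Order #4821 has shipped.")  # evicts the first message
kept = list(memory.working)
memory.remember("last_order", {"id": 4821, "status": "shipped"})
memory.end_session()
```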


06

Evals — Customer Support Agent Case Study

The lecture walked through building evals for a customer support AI agent handling email inquiries: order tracking, address changes, refund requests.

Step 1: Task Decomposition

Break the agent's workflow into discrete, testable components. Each step is a potential failure point — evals should exist at each one.

Read and parse the customer email.
Identify the intent (what is the customer asking?).
Extract entities (order ID, address, email address).
Look up relevant information (order status, account data).
Draft a response.
Update database if needed.
Send the email.

LLM Judges with Rubrics

Write a rubric: "A good response is polite, addresses all questions, provides the order status, and includes a next step."
Ask an LLM to grade the output 1–5 on each rubric criterion.
Use multiple LLM judges and average for reliability.
Works better than rule-based checks for open-ended text.
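
A rubric-based judge with averaging across several judges might be sketched as follows. Each judge is assumed to be a callable that takes a prompt and returns a single digit as text; the prompt wording, `judge_prompt`, and `score` are illustrative assumptions, and the two lambda judges stand in for real model calls.

```python
from statistics import mean
from typing import Callable

RUBRIC = [
    "polite",
    "addresses all questions",
    "provides the order status",
    "includes a next step",
]


def judge_prompt(response: str, criterion: str) -> str:
    return (f"Rate the response from 1 to 5 on this criterion: {criterion}.\n"
            f"Reply with a single digit.\n\nResponse:\n{response}")


def score(response: str, judges: list[Callable[[str], str]]) -> dict[str, float]:
    """Average each criterion's 1-5 score across several judges."""
    return {
        criterion: mean(int(judge(judge_prompt(response, criterion)).strip())
                        for judge in judges)
        for criterion in RUBRIC
    }


# Two stand-in judges that always answer 4 and 5.
judges = [lambda prompt: "4", lambda prompt: "5"]
scores = score("Hi! Your order has shipped. Tracking details will follow.", judges)
```

Scoring one criterion per call keeps each judgment narrow, which tends to make disagreements between judges easier to interpret.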

Objective vs Subjective Evals

Objective (code/rules): did the agent extract the correct order ID? Did it look up the right account? Was the database updated correctly? These are pure Python assertions.
Subjective (LLM judges): is the response tone appropriate? Did the email address all the customer's questions? Is the response concise without being curt?
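
An objective eval for the entity-extraction step can be as plain as a table of cases and an assertion. In the sketch below a regular expression stands in for the agent's extraction step so the example runs on its own; `extract_order_id` and the test emails are hypothetical.

```python
import re


def extract_order_id(email: str):
    """Stand-in for the agent's extraction step (assumption for this sketch)."""
    match = re.search(r"order\s*#?(\d+)", email, flags=re.IGNORECASE)
    return match.group(1) if match else None


CASES = [
    ("Where is my order #4821?", "4821"),
    ("Please refund ORDER 77.", "77"),
    ("I moved, please update my address.", None),  # no order ID present
]

# Plain assertions: each case either passes or names the failing email.
for email, expected in CASES:
    assert extract_order_id(email) == expected, email

accuracy = sum(extract_order_id(e) == x for e, x in CASES) / len(CASES)
```

The same case table works unchanged if the regex is swapped for an LLM-backed extractor, which is what makes this kind of eval useful for catching regressions.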

Component-Based vs End-to-End

Component-based: test each step in isolation. Faster, easier to pinpoint failures. Use during development. "The intent classifier is wrong 30% of the time on refund requests."
End-to-end: test the full workflow with real inputs. Catches emergent failures from step interactions. Slower, more expensive. Use before deploys.
Use both. Component evals for development; end-to-end for regression testing.

LLM Traces Are Non-Negotiable

If you're interviewing with an AI startup, ask: "Do you have LLM traces?" Without traces, debugging a multi-step agent is nearly impossible.

Trace tooling options: LangSmith, Braintrust, Helicone, Arize. A useful trace captures:

Full chain of every prompt sent.
Every LLM response at each step.
Tool call inputs/outputs.
Timing data per component.
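
To make the idea concrete, here is a minimal home-grown tracer. It is a sketch only, not how any of the tools above work: the `traced` decorator and the in-memory `TRACE` list are hypothetical, and a real setup would send these records to a tracing backend.

```python
import time
from functools import wraps

TRACE = []  # one record per step, in call order


def traced(step: str):
    """Record inputs, output, and duration of each call to the wrapped step."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            output = fn(*args, **kwargs)
            TRACE.append({
                "step": step,
                "input": args,
                "output": output,
                "seconds": time.perf_counter() - start,
            })
            return output
        return wrapper
    return decorator


@traced("identify_intent")
def identify_intent(email: str) -> str:
    # Stand-in for an LLM call (assumption for this sketch).
    return "REFUND" if "refund" in email.lower() else "OTHER"


identify_intent("I would like a refund.")
```

After a run, `TRACE` shows what each step received and returned, which is the information needed to locate the step where a multi-step agent went wrong.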

Customer Support Agent — Steps + Eval Checkpoints

graph TD
    A[Customer Email] --> B[Parse Email]
    B --> C[Identify Intent]
    C --> D[Extract Entities]
    D --> E[DB Lookup]
    E --> F[Draft Response]
    F --> G[Update Database]
    G --> H[Send Email]
    B -.->|Objective eval| E1[Parsing accuracy]
    C -.->|Objective eval| E2[Intent accuracy]
    D -.->|Objective eval| E3[Entity extraction]
    E -.->|Objective eval| E4[Lookup correctness]
    F -.->|LLM judge| E5[Tone + completeness]
    G -.->|Objective eval| E6[DB state correct?]
    classDef step fill:#1a110a,stroke:var(--accent),color:#e6edf3
    classDef ev fill:#0d2520,stroke:#0d9488,color:#aaa
    class A,B,C,D,E,F,G,H step
    class E1,E2,E3,E4,E5,E6 ev

Eval Taxonomy — 2×2 Matrix

Axes: Component-based vs End-to-end × Objective vs Subjective

Objective + Component

Unit tests — "did intent classifier return REFUND?" Python assertions. Fastest to run.

Subjective + Component

LLM judge on draft tone. Rubric score 1-5 per criterion.

Objective + End-to-End

Full workflow test — was DB updated correctly? Did customer get right info?

Subjective + End-to-End

Human raters score full conversation. Most expensive, catches the most edge cases.


07

Multi-Agent Systems

Why Multi-Agent?

Parallelism: run independent tasks simultaneously — climate, security, and energy agents all run at once.
Specialization: each agent is optimized for a narrow domain — easier to tune, test, and debug.
Reusability: a design agent built for the marketing team can also serve the product team.

Interaction Patterns

Pattern	How it works	When to use
Hierarchical	Orchestrator delegates to sub-agents; user only talks to orchestrator	Recommended default — cleaner UX, single point of control
Flat (peer-to-peer)	Agents communicate directly with each other	When two agents are tightly coupled and share state constantly
Hybrid	Hierarchical backbone with peer connections for coupled agents	Complex systems with both independent and dependent agents

Smart Home Case Study (In-Class Exercise)

Biometric/location agent: tracks where residents are in the home.
Climate agent: controls temperature, access to room sensors and thermostats.
Energy agent: monitors usage, can cut power to non-essential systems.
Security agent: manages entry, identifies who's entering, sets permissions per person (parent vs child access rules).
Fridge/grocery agent: monitors contents via camera, has access to grocery delivery API.
Weather agent: pulls external APIs, adjusts blinds/temperature based on conditions.
Orchestrator: single user-facing interface, delegates to all other agents.

Key insight: the orchestrator communicates with sub-agents via MCP — each sub-agent is just another typed tool interface. An agent is a tool.
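
The hierarchical pattern, with independent sub-agents running in parallel, can be sketched with `asyncio`. The agent functions, their return strings, and the `orchestrate` helper are hypothetical; the short sleeps stand in for LLM or device calls.

```python
import asyncio


async def climate_agent(request: str) -> str:
    await asyncio.sleep(0.01)  # stands in for an LLM or device call
    return "thermostat set to eco mode"


async def security_agent(request: str) -> str:
    await asyncio.sleep(0.01)
    return "doors locked"


# The orchestrator sees each sub-agent as a named callable, like any other tool.
SUB_AGENTS = {"climate": climate_agent, "security": security_agent}


async def orchestrate(request: str, targets: list[str]) -> dict[str, str]:
    # gather() runs the selected sub-agents concurrently and preserves order.
    results = await asyncio.gather(*(SUB_AGENTS[name](request) for name in targets))
    return dict(zip(targets, results))


report = asyncio.run(orchestrate("I'm leaving for work", ["climate", "security"]))
```

The user-facing code only ever calls `orchestrate`; adding a new sub-agent means adding one entry to `SUB_AGENTS`.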

Smart Home Multi-Agent Hierarchy

graph TD
    USER([User]) --> ORC[Orchestrator Agent]
    ORC --> CL[Climate Agent]
    ORC --> SEC[Security Agent]
    ORC --> EN[Energy Agent]
    ORC --> GR[Grocery Agent]
    ORC --> WX[Weather Agent]
    EN <-->|share state| CL
    CL -.->|MCP| THERM[(Thermostats)]
    SEC -.->|MCP| CAM[(Cameras)]
    GR -.->|MCP| SHOP[(Grocery API)]
    WX -.->|MCP| API[(Weather API)]
    classDef orch fill:#1a110a,stroke:var(--accent),color:#e6edf3
    classDef agent fill:#0d2520,stroke:#0d9488,color:#e6edf3
    classDef tool fill:#1c2333,stroke:#d4a843,color:#aaa
    classDef user fill:#0d1520,stroke:#5b9bd5,color:#e6edf3
    class ORC orch
    class CL,SEC,EN,GR,WX agent
    class THERM,CAM,SHOP,API tool
    class USER user

Flat vs Hierarchical — Communication Patterns

Flat — All Peers Communicate

Agents: Climate, Security, Energy, Grocery, Weather, Lights

All ↔ all. Flexible, but the number of connections grows as N² (N(N−1)/2 links for N agents). Hard to debug.

Hierarchical — Recommended

Orchestrator → Climate, Security, Energy

User talks to orchestrator only. Easier to audit and extend.
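
The difference in wiring can be checked with a short calculation. The function names are hypothetical; the count of six agents matches the flat diagram above.

```python
def flat_links(n_agents: int) -> int:
    """Distinct peer-to-peer links when every agent talks to every other."""
    return n_agents * (n_agents - 1) // 2


def hierarchical_links(n_sub_agents: int) -> int:
    """One link from the orchestrator to each sub-agent."""
    return n_sub_agents


flat = flat_links(6)          # six peers, all connected: 15 links
hier = hierarchical_links(6)  # the same six agents behind one orchestrator: 6 links
```

Doubling the number of agents roughly quadruples the flat link count but only doubles the hierarchical one, which is why the hierarchical layout stays easier to audit as the system grows.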


08

What's Next in AI — Trends to Watch

Are We Plateauing?

Ilya Sutskever (OpenAI co-founder) raised this question publicly. Recent GPT releases showed less capability jump than previous generations; marginal returns on raw compute may be diminishing. Counter-argument: scaling laws still hold if compute and energy continue to scale. The question is unresolved.

Architecture Search as the Unlock

Most current LLMs are transformer-based. The transformer was a breakthrough — but not the end state. The human brain doesn't use transformers or backpropagation, yet learns faster, from less data, with far lower energy. This gap represents massive opportunity in architecture search. Labs are hiring thousands of engineers to run this search. The next "transformer discovery" — a new architecture that reduces compute by 10x — could restart exponential improvement.

Multi-Modality as a Source of Gain

Adding image understanding made text models better at text.
Adding audio will make image+text models better at everything.
Adding video will push further gains across all modalities.
Modalities reinforce each other: knowing what a cat looks like makes you better at writing about cats.
The endpoint: robotics, where all modalities fuse into physical action.

Multiple Learning Methods in Harmony

Current AI systems mostly apply these methods one at a time, in separate training stages. Human learning uses all of them simultaneously:

Human learning	AI equivalent
Survival instincts (encoded in DNA)	Pre-training
Parent pointing at objects (good/bad)	Supervised learning
Falling down and getting hurt	Reinforcement learning reward signal
Observing other people/babies	Unsupervised learning

The Velocity Problem: Skills Have Short Half-Lives

Don't optimize for knowing the current best RAG technique. Optimize for understanding the problem space deeply enough to quickly evaluate and adopt new techniques as they emerge. The course teaches breadth — pull depth when you need it.

Key Trends Shaping AI Progress

Now

Scaling Plateau Question

Are marginal returns on compute diminishing? GPT updates showing smaller capability jumps. Unresolved.

Near Term

Architecture Search

Labs hiring thousands to find the next transformer. Brain doesn't use backprop — huge unexplored architecture space.

Medium Term

Multi-Modality Gains

Each added modality improves all others. Cross-modal reinforcement compounds. Endpoint: robotics.

Ongoing

Methods in Harmony

Human learning blends supervised, RL, unsupervised, and meta-learning simultaneously. AI will converge here.

Multi-Modality — Each Addition Improves Everything

Text

Baseline LLM capability

Text + Images

Image understanding improves text reasoning

+ Audio

Temporal understanding unlocked

+ Video

Spatial + temporal reasoning

→ Robotics

All modalities fuse into physical action


Quick Reference: Decision Framework

Decision Flowchart — Start Here

graph TD
    P{What is the problem?} --> A[Model lacks domain knowledge]
    P --> B[Output format inconsistent]
    P --> C[Multi-step reasoning fails]
    P --> D[Need real-world actions]
    P --> E[Tasks are parallel]
    P --> F[Considering fine-tuning]
    A --> A1([Use RAG])
    B --> B1([Few-shot prompting])
    C --> C1([Chain-of-Thought])
    D --> D1([Agentic workflow])
    E --> E1([Multi-agent system])
    F --> F1([Try prompting + RAG first])
    classDef prob fill:#1c2333,stroke:#d4a843,color:#e6edf3
    classDef sol fill:#0d2520,stroke:#0d9488,color:#e6edf3
    classDef warn fill:#1a1208,stroke:var(--accent),color:#e6edf3
    class P,A,B,C,D,E,F prob
    class A1,B1,C1,D1,E1 sol
    class F1 warn

Full Decision Table

Problem	First Approach
Model doesn't know your domain data	RAG
Need consistent output format	Prompt engineering + few-shot
Multi-step reasoning failing	Chain-of-Thought
Complex task with multiple sub-tasks	Prompt chaining
Need to take actions in the world	Agentic workflow with tools
Tasks that can run in parallel	Multi-agent system
Evaluating subjective output quality	LLM judges with rubrics
Evaluating objective correctness	Python assertion-based evals
System behaving unexpectedly	LLM traces + component debugging
Considering fine-tuning	Don't — try prompting + RAG first


Go Deeper — Topic Guides

Each topic in this guide has a standalone deep dive with extended examples, implementation patterns, and decision frameworks. New guides publish as the space develops.

Available Now

Section 05 · Agentic AI

Agentic AI Workflows → Memory types, tool integration, MCP architecture, the autonomy spectrum, and reliability patterns for production agents.

More Deep Dives

Section 02 · Prompt Engineering

Prompt Engineering Guide → Chain-of-thought, few-shot patterns, structured output, and production-grade prompt systems.

Section 03 · Comparison

RAG vs Fine-Tuning vs Prompting → Decision framework for choosing the right technique for your use case and budget.

Section 04 · RAG

Retrieval-Augmented Generation → Chunking, embeddings, HyDE, and hybrid retrieval for production knowledge pipelines.

Section 06 · Evals

LLM Evaluation (Evals) → Traces, LLM judges, benchmark design, and the eval-driven development loop.

Section 07 · Multi-Agent

Multi-Agent AI Systems → Orchestrator-subagent patterns, shared state, and coordination without deadlock.


Key Takeaways

01 Start simple. Prompt engineering before RAG, RAG before fine-tuning, single agent before multi-agent.
02 Measure everything. LLM traces + evals from day one. You can't debug what you can't observe.
03 Decompose tasks. Break workflows into testable components; eval each one separately.
04 Use LLM judges. For subjective quality at scale, LLMs grading LLMs works surprisingly well with rubrics.
05 Prefer RAG over fine-tuning. RAG is debuggable, updatable, and cheaper in almost every scenario.
06 MCP is an emerging agent interface standard. Treat agents as typed tools; reuse them across systems.
07 Multi-agent = parallelism + specialization. Not magic, just good engineering decomposition.
08 Stay broad. AI techniques have short half-lives. Pull depth when you need it; don't accumulate it in advance.

Based in Kansas City? We help businesses implement these techniques — RAG, agents, evals — without the trial and error. Oaken AI consulting for Kansas City businesses.Work With Us →