Stanford CS230 · Guest Lecture · April 2026

Building with LLMs:
A Practical Study Guide

109-minute lecture distilled — prompting, RAG, fine-tuning, agentic workflows, evals, and multi-agent systems with animated visualizations.

FREE TOOL — BUILT FROM THIS LECTURE
We turned these learnings into a Claude Code agent auditor
Scans your Claude Code setup for autonomy risks, missing observability hooks, and rule coverage gaps across all four failure modes. One command, three-page report.
Audit Your Agent →
Free Tool
Claude Workspace Optimizer
Automatically configure Claude Code for your stack. Rules, hooks, and memory set up in minutes.
01

LLM Limitations — What You're Working Around

Understanding the constraints shapes every design decision. All techniques in this guide exist to solve one or more of these four problems.

The Four Root Limitations
Domain Knowledge Gaps
Base models lack proprietary data, recent events, and internal docs.
Context Window Limits
Can't hold arbitrarily long history — forces architectural choices.
Hallucinations
Generates plausible-sounding but incorrect outputs with confidence.
Difficulty of Control
Hard to get consistent, structured outputs reliably across inputs.
Which Limitation Drives Which Technique
graph TD
  A[Domain Knowledge Gap] -->|solve with| B[RAG]
  C[Context Window Limit] -->|solve with| D[Chunking and Memory]
  E[Hallucinations] -->|reduce with| F[Grounding + Evals]
  G[Control Difficulty] -->|solve with| H[Prompt Engineering]
  classDef lim fill:#1c2333,stroke:var(--accent),color:#e6edf3
  classDef sol fill:#0d2520,stroke:#0d9488,color:#e6edf3
  class A,C,E,G lim
  class B,D,F,H sol
These four problems are the root motivation for prompt engineering, RAG, fine-tuning, and agentic workflows. Every technique in this guide is solving one or more of them.
02

Prompt Engineering

Zero-Shot vs Few-Shot

  • Zero-shot: give the LLM a task description only, no examples. Always start here.
  • Few-shot: provide 2–5 input/output examples before the actual task.
  • Few-shot helps when the model needs to understand output format, tone, or domain-specific patterns.
  • Rule of thumb: try zero-shot first; add examples only when output quality is inconsistent.
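
A minimal sketch of the two prompt styles, assuming a hypothetical sentiment-labeling task. The `build_prompt` helper and the example pairs are illustrative and not from the lecture.

```python
def build_prompt(task, query, examples=()):
    """Return a zero-shot prompt, or a few-shot prompt when examples are given."""
    lines = [task, ""]
    for text, label in examples:  # each example is an (input, output) pair
        lines += [f"Input: {text}", f"Output: {label}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)


TASK = "Classify the sentiment of the input as positive or negative."
QUERY = "The battery died after a week."

# Zero-shot: task description only.
zero_shot = build_prompt(TASK, QUERY)

# Few-shot: the same task, preceded by worked examples that pin down the format.
few_shot = build_prompt(
    TASK,
    QUERY,
    examples=[
        ("Arrived early and works great.", "positive"),
        ("Screen cracked on day one.", "negative"),
    ],
)
print(few_shot)
```

Because both prompts come from one function, moving from zero-shot to few-shot is a data change (adding examples), not a rewrite.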

Chain-of-Thought (CoT)

  • Instruct the model to "think step by step" before giving a final answer.
  • Often substantially improves accuracy on reasoning, math, and multi-step logic tasks.
  • Works because intermediate tokens give the model space to work through the problem.
  • Can be explicit ("let's think step by step") or via few-shot examples that show reasoning.

Prompt Chaining

  • Break a complex task into sequential prompts, each with a focused sub-task.
  • Output of step N becomes input to step N+1.
  • Benefits: each step is easier to debug, you can inject validation between steps, and shorter, focused prompts tend to perform better.
  • Use cases: research → draft → edit pipelines; extract → classify → respond flows.
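
A sketch of an extract → classify → respond chain with a validation check between steps. The `llm` parameter is an assumed interface (any function from prompt string to response string), and `fake_llm` is a stand-in so the example runs offline.

```python
from typing import Callable


def run_chain(llm: Callable[[str], str], email: str) -> str:
    """Extract -> classify -> respond; each step gets one focused prompt."""
    order_id = llm(
        f"Extract the order ID from this email. Reply with the ID only.\n\n{email}"
    ).strip()
    if not order_id.isdigit():  # validation injected between steps
        raise ValueError(f"unexpected order ID: {order_id!r}")

    intent = llm(
        f"Classify the intent as TRACKING, REFUND, or ADDRESS_CHANGE.\n\n{email}"
    ).strip()

    # Outputs of the earlier steps become inputs to the final step.
    return llm(f"Write a short reply about order {order_id} for a {intent} request.")


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (hypothetical canned answers)."""
    if prompt.startswith("Extract"):
        return "4821"
    if prompt.startswith("Classify"):
        return "REFUND"
    return "DRAFT: " + prompt


reply = run_chain(fake_llm, "Hi, I'd like a refund for order #4821.")
print(reply)
```

If the extraction step returns something that is not an order ID, the chain stops there instead of drafting a reply around bad data.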

Prompt Templates

  • Parameterize prompts so the structure is stable — only the inputs change.
  • Makes prompts testable, versionable, and reusable across similar tasks.
  • Store templates separately from business logic.
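
One way to parameterize a prompt is the standard library's `string.Template`. The template text and the `render` helper below are hypothetical; a real project might keep templates in versioned files.

```python
from string import Template

# Templates live apart from business logic, e.g. in their own module or file.
SUMMARIZE_V1 = Template(
    "Summarize the following $doc_type in at most $max_words words.\n\n$text"
)


def render(template: Template, **params: str) -> str:
    # substitute() raises KeyError when a parameter is missing, so a broken
    # call site fails loudly instead of sending a half-filled prompt.
    return template.substitute(**params)


prompt = render(
    SUMMARIZE_V1,
    doc_type="support ticket",
    max_words="50",
    text="Order 4821 arrived damaged.",
)
print(prompt)
```

Naming the template with a version suffix makes it straightforward to test two versions side by side.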

LLM Judges / Evals with Promptfoo

Use an LLM to evaluate whether another LLM's output meets quality criteria. Define rubrics: what does "good" look like for this task? Useful for subjective quality (tone, helpfulness, completeness) where rule-based checks fail.

  • Promptfoo: open-source tool for running LLM evals at scale.
  • Define test cases with expected behaviors; run across model versions to detect regressions.
  • Generates quality reports automatically.
Which Prompt Technique to Use — Decision Tree
graph TD
  START([New Task]) --> Q1{Output quality OK?}
  Q1 -->|Yes| SHIP([Ship Zero-Shot])
  Q1 -->|No| Q2{Format or style issue?}
  Q2 -->|Yes| FS[Add Few-Shot Examples]
  Q2 -->|No| Q3{Multi-step reasoning?}
  Q3 -->|Yes| COT[Chain-of-Thought]
  Q3 -->|No| Q4{Complex pipeline?}
  Q4 -->|Yes| CHAIN[Prompt Chaining]
  Q4 -->|No| FS
  classDef dec fill:#1c2333,stroke:#d4a843,color:#e6edf3
  classDef sol fill:#0d2520,stroke:#0d9488,color:#e6edf3
  classDef term fill:#1a110a,stroke:var(--accent),color:#e6edf3
  class Q1,Q2,Q3,Q4 dec
  class FS,COT,CHAIN sol
  class START,SHIP term
Chain-of-Thought — Live Example
Prompt
"A store sells items at $12, $8, $15. Customer pays $50. Think step by step, then give the change."
Model thinking
Step 1: Total cost = $12 + $8 + $15 = $35
Step 2: Customer paid $50
Step 3: Change = $50 − $35 = $15
Answer
The change is $15.

Without CoT, a model is more likely to make an arithmetic slip, especially as problems get longer. CoT works because the intermediate tokens are themselves computation: each step the model writes becomes context it can condition on for the next step.

Deep Dive: Prompt Engineering Guide. Chain-of-thought, few-shot patterns, structured output, and system prompt design, with code examples.
03

Fine-Tuning — When (Not) to Use It

Core recommendation: avoid fine-tuning. The lecturer strongly recommends against it for most teams. Prompt engineering and RAG cover the vast majority of cases.

Why to Avoid It

  • Overfitting risk: fine-tuning on small datasets leads to narrow behavior that fails on edge cases.
  • Goes stale quickly: a fine-tuned model becomes outdated as base models improve, so you're constantly re-tuning.
  • Data collection is hard: requires high-quality labeled examples that are expensive to produce.
  • Prompt engineering often wins: a well-crafted prompt frequently achieves what fine-tuning would.

The Slack Fine-Tuning Failure (Case Study)

Slack attempted to fine-tune a model for internal use. The fine-tuned model regressed on general capabilities, became brittle on edge cases, and required ongoing maintenance as base models updated. The effort wasn't worth the outcome.

When Fine-Tuning Is Justified

  • Very specific output format that prompts can't reliably produce.
  • Latency or cost requirements that call for a smaller model, fine-tuned to reach the quality bar.
  • You have thousands of high-quality labeled examples.
  • Task is stable — won't change as base models improve.
Should I Fine-Tune? — Decision Tree
graph TD
  A([Considering Fine-Tuning]) --> B[Try Prompt Engineering First]
  B --> C{Works?}
  C -->|Yes| D([Ship It])
  C -->|No| E[Try RAG]
  E --> F{Works?}
  F -->|Yes| D
  F -->|No| G{1000+ quality examples?}
  G -->|No| H([Collect data first])
  G -->|Yes| I{Task stable?}
  I -->|No| J([Wait for better base model])
  I -->|Yes| K([Fine-tune])
  classDef dec fill:#1c2333,stroke:#d4a843,color:#e6edf3
  classDef ship fill:#0d2520,stroke:#0d9488,color:#e6edf3
  classDef warn fill:#1a1208,stroke:#d4a843,color:#e6edf3
  classDef start fill:#1a110a,stroke:var(--accent),color:#e6edf3
  class C,F,G,I dec
  class D,K ship
  class H,J warn
  class A start
RAG vs Fine-Tuning — Head to Head
Concern | RAG | Fine-Tuning
Fresh Data | Update index | Retrain
Debuggability | Inspect docs | Opaque
Cost | Inference + retrieve | Train + infer
Coverage | Scales freely | Fixed to training
Deep Dive: RAG vs Fine-Tuning vs Prompt Engineering. Decision framework for choosing the right technique, covering cost, latency, and accuracy tradeoffs.
04

Retrieval-Augmented Generation (RAG)

RAG solves the domain knowledge problem: instead of baking proprietary knowledge into the model, you retrieve it at query time.

Vector Databases

  • Store documents as embeddings — dense numerical representations of meaning.
  • Semantic search: find documents similar in meaning, not just keyword matches.
  • Common options: Pinecone, Weaviate, pgvector (Postgres), Chroma.
  • Embedding model choice matters — different models have different tradeoffs for your domain.
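
A toy illustration of semantic search by cosine similarity. The three-dimensional vectors are hand-written stand-ins for real embedding-model output, and the in-memory `INDEX` dict stands in for a vector database. None of these values come from the lecture.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Document text -> embedding (made-up numbers for illustration only).
INDEX = {
    "Refunds are issued within 5 business days.": [0.9, 0.1, 0.0],
    "Our office is closed on public holidays.": [0.1, 0.9, 0.1],
    "Return shipping labels are free.": [0.7, 0.2, 0.3],
}


def top_k(query_vec, k=2):
    """Return the k documents whose embeddings are closest to the query."""
    ranked = sorted(INDEX, key=lambda doc: cosine(query_vec, INDEX[doc]), reverse=True)
    return ranked[:k]


# Pretend this is the embedding of "how do I get my money back?"
results = top_k([0.8, 0.1, 0.1])
print(results)
```

The query shares no keywords with the refund document, yet it ranks first because the vectors point in nearly the same direction. That is the property a real embedding model is meant to provide.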

Chunking Strategy

  • Split source documents into chunks before embedding.
  • Chunk size is critical: too small loses context; too large dilutes relevance signal.
  • Common approaches: fixed-size with overlap, sentence-boundary aware, paragraph-level.
  • Experiment for your specific domain — no universal right answer.
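
A sketch of the fixed-size-with-overlap approach, counting characters for simplicity. Production pipelines often count tokens instead, and the default sizes here are arbitrary assumptions.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    if not text:
        return []
    step = size - overlap  # how far each window advances
    # Stop once the remaining text is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]


chunks = chunk_text("abcdefghij", size=4, overlap=1)
print(chunks)
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.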

HyDE — Hypothetical Document Embeddings

Problem: a short question and the passage that answers it often sit far apart in embedding space, because they differ in length, vocabulary, and style. A user asking "why is the sky blue?" doesn't match well against a physics textbook paragraph.

  1. Ask the LLM to generate a hypothetical answer to the query.
  2. Embed that hypothetical answer.
  3. Use that embedding for retrieval instead of the original query.

The hypothetical answer matches the document's "answer language" better than the question does, even when some of its details are wrong. This can measurably improve retrieval quality on knowledge-intensive tasks, at the cost of one extra LLM call per query.
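
The three steps above can be sketched as one function. The `generate`, `embed`, and `search` callables are assumed interfaces for an LLM, an embedding model, and a vector store. The `fake_*` functions are offline stand-ins, not real APIs.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def hyde_retrieve(
    query: str,
    generate: Callable[[str], str],              # LLM call (assumed interface)
    embed: Callable[[str], Vector],              # embedding model (assumed interface)
    search: Callable[[Vector, int], list[str]],  # vector-store lookup (assumed interface)
    k: int = 3,
) -> list[str]:
    hypothetical = generate(f"Write a short passage that answers: {query}")  # step 1
    vector = embed(hypothetical)                                             # step 2
    return search(vector, k)                                                 # step 3


HYPOTHETICAL = "Rayleigh scattering favors short wavelengths."


def fake_generate(prompt: str) -> str:
    return HYPOTHETICAL


def fake_embed(text: str) -> Vector:
    return [float(len(text))]  # toy embedding: just the text length


def fake_search(vector: Vector, k: int) -> list[str]:
    return [f"doc near {vector[0]:.0f}"][:k]


docs = hyde_retrieve("why is the sky blue?", fake_generate, fake_embed, fake_search)
print(docs)
```

The original query never reaches `embed`. Only the generated passage does, which is the whole point of the technique.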

RAG Pipeline — Indexing + Query Phases
graph TD
  subgraph INDEX["Indexing Phase (offline)"]
    D[Source Documents] --> CH[Chunk Text]
    CH --> EMB[Embed Chunks]
    EMB --> VDB[(Vector Database)]
  end
  subgraph QUERY["Query Phase (real-time)"]
    Q[User Query] --> QE[Embed Query]
    QE --> SR[Semantic Search]
    SR --> VDB
    VDB --> TOP[Top-K Docs Retrieved]
    TOP --> CTX[Inject into Context]
    CTX --> LLM[LLM]
    LLM --> ANS[Grounded Answer]
  end
  classDef idx fill:#0d2520,stroke:#0d9488,color:#e6edf3
  classDef qry fill:#1a110a,stroke:var(--accent),color:#e6edf3
  classDef db fill:#1c2333,stroke:#d4a843,color:#e6edf3
  class D,CH,EMB idx
  class Q,QE,SR,TOP,CTX,LLM,ANS qry
  class VDB db
HyDE — Bridging the Semantic Space Gap
[Figure: the query embedding sits in "query space", away from the documents in "answer space"; the HyDE embedding bridges the gap and gives a better match.]
Standard query → mismatch · HyDE answer embedding → retrieves better documents

RAG vs Fine-Tuning Comparison

Concern | RAG | Fine-Tuning
Fresh data | Easy: update the index | Hard: retrain required
Debugging | Inspect retrieved docs | Opaque
Cost | Inference + retrieval | Training + inference
Coverage | Scales to large knowledge bases | Fixed to training distribution
Deep Dive: Retrieval-Augmented Generation Deep Dive. Chunking strategies, embedding models, HyDE, hybrid search, and reranking for production pipelines.
05

Agentic Workflows

Andrew Ng's definition: An agentic workflow is one where an LLM iterates on its output — it can plan, act, observe results, and revise. The LLM isn't just doing one forward pass; it's deciding what to do next based on what happened.

Memory Types

  • Working memory (short-term): the current context window. What the agent knows "right now." Limited by context window size.
  • Archival memory (long-term): external storage (databases, vector stores, files) the agent can read/write. Enables persistence across sessions. This is what lets an agent "remember" past interactions.

Tools

Tools transform the LLM from a question-answerer into an actor.

  • Web search: access current information.
  • Code execution: run Python, test logic.
  • Database read/write: query and update persistent state.
  • API calls: interact with external services (calendar, email, CRM).
  • File system: read documents, save outputs.

MCP — Model Context Protocol

  • Standardized protocol for agents to call tools and other services.
  • Treats tools as typed interfaces: here's what this tool accepts, here's what it returns.
  • Enables tool reuse across different agents and systems.
  • In multi-agent systems: agents communicate with each other via MCP — an agent is just another tool.
  • Anthropic-backed standard gaining adoption across the ecosystem.
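
To make "tools as typed interfaces" concrete, here is a plain-Python sketch. It is not the MCP SDK: MCP describes tool inputs with JSON Schema, while this example substitutes Python types, and the `Tool` class and `get_order_status` tool are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    input_schema: dict[str, type]  # parameter name -> expected Python type
    handler: Callable[..., Any]

    def call(self, **kwargs: Any) -> Any:
        """Check arguments against the declared schema, then run the handler."""
        for param, expected in self.input_schema.items():
            if not isinstance(kwargs.get(param), expected):
                raise TypeError(f"{self.name}: {param!r} must be {expected.__name__}")
        return self.handler(**kwargs)


get_order_status = Tool(
    name="get_order_status",
    description="Look up the shipping status of an order.",
    input_schema={"order_id": int},
    handler=lambda order_id: {"order_id": order_id, "status": "shipped"},
)

print(get_order_status.call(order_id=4821))
```

Because the contract (name, description, input types) travels with the tool, any agent that understands the contract can reuse it. A sub-agent can be wrapped the same way, with its handler being another agent's entry point.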

Autonomy Spectrum

Level | Description
Low autonomy | Human approves every action
Medium | Agent acts, human reviews outputs
High autonomy | Agent runs end-to-end, humans get alerts on exceptions
Start with low autonomy, earn trust, then expand. Don't give full autonomy to a system you haven't validated.
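
A sketch of how the three levels might gate a single action. The `execute` function, the `approve` callback, and the two queues are hypothetical, and treating a raised Python exception as the "exception" that triggers an alert is a simplification made for this example.

```python
from enum import Enum
from typing import Callable, List, Optional


class Autonomy(Enum):
    LOW = "human approves every action"
    MEDIUM = "agent acts, human reviews outputs"
    HIGH = "agent runs end-to-end, humans get alerts on exceptions"


def execute(
    action: Callable[[], str],
    level: Autonomy,
    approve: Callable[[], bool],
    review_queue: List[str],
    alerts: List[str],
) -> Optional[str]:
    """Run one agent action under the given autonomy level."""
    if level is Autonomy.LOW and not approve():
        return None  # blocked: the human said no
    try:
        result = action()
    except Exception as exc:
        alerts.append(f"action failed: {exc}")  # failures are surfaced at every level
        return None
    if level is Autonomy.MEDIUM:
        review_queue.append(result)  # queued for after-the-fact review
    return result


review_queue: List[str] = []
alerts: List[str] = []
blocked = execute(lambda: "refund issued", Autonomy.LOW, lambda: False, review_queue, alerts)
reviewed = execute(lambda: "refund issued", Autonomy.MEDIUM, lambda: False, review_queue, alerts)
print(blocked, reviewed, review_queue)
```

Keeping the level as an explicit parameter means expanding autonomy later is a configuration change rather than a code rewrite.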
The Agent Loop
graph TD
  GOAL([User Goal]) --> PLAN[Plan Next Action]
  PLAN --> TOOL[Select Tool]
  TOOL --> ACT[Execute Action]
  ACT --> OBS[Observe Result]
  OBS --> EVAL{Goal Achieved?}
  EVAL -->|No - revise| PLAN
  EVAL -->|Yes| RESP([Return to User])
  MEM1[(Working Memory)] -.->|read/write| PLAN
  MEM2[(Archival Memory)] -.->|read/write| ACT
  classDef loop fill:#1c2333,stroke:var(--accent),color:#e6edf3
  classDef mem fill:#0d2520,stroke:#0d9488,color:#e6edf3
  classDef term fill:#1a110a,stroke:var(--accent),color:#e6edf3
  classDef dec fill:#1a1208,stroke:#d4a843,color:#e6edf3
  class PLAN,TOOL,ACT,OBS loop
  class MEM1,MEM2 mem
  class GOAL,RESP term
  class EVAL dec
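
The loop in the diagram can be sketched in a few lines. In a real agent, `plan` would be an LLM call. Here it is a scripted stand-in, and the `lookup_order` tool is hypothetical.

```python
def run_agent(goal, plan, tools, max_steps=5):
    """plan(goal, history) returns (tool_name, args), or None once the goal is met."""
    history = []  # working memory for this run
    for _ in range(max_steps):  # hard cap so the loop always terminates
        step = plan(goal, history)
        if step is None:  # goal achieved: stop and return to the user
            break
        tool_name, args = step
        observation = tools[tool_name](**args)  # act, then observe the result
        history.append((tool_name, args, observation))
    return history


def scripted_plan(goal, history):
    """Stand-in planner: look the order up once, then declare the goal met."""
    if not history:
        return ("lookup_order", {"order_id": 4821})
    return None


TOOLS = {"lookup_order": lambda order_id: "shipped"}

history = run_agent("Where is order 4821?", scripted_plan, TOOLS)
print(history)
```

The `max_steps` cap is a design choice worth keeping even with a capable planner, since it bounds cost when the agent fails to converge.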
Working Memory vs Archival Memory
Working Memory (short-term)
System: You are a helpful assistant...
User: Process order #4821
Agent: Looking up order #4821...
Context window usage
72% full — lost on session end
Archival Memory (long-term)
user_prefs: {theme: dark, lang: en}
last_order: {id: 4820, status: shipped}
session_2026-04-10: resolved issue
session_2026-04-09: asked about returns
→ persists across all sessions
Deep Dive: Agentic AI Workflows Deep Dive. Memory types, tool integration, MCP architecture, the autonomy spectrum, and production reliability patterns.
06

Evals — Customer Support Agent Case Study

The lecture walked through building evals for a customer support AI agent handling email inquiries: order tracking, address changes, refund requests.

Step 1: Task Decomposition

Break the agent's workflow into discrete, testable components. Each step is a potential failure point — evals should exist at each one.

  1. Read and parse the customer email.
  2. Identify the intent (what is the customer asking?).
  3. Extract entities (order ID, address, email address).
  4. Look up relevant information (order status, account data).
  5. Draft a response.
  6. Update database if needed.
  7. Send the email.

LLM Judges with Rubrics

  • Write a rubric: "A good response is polite, addresses all questions, provides the order status, and includes a next step."
  • Ask an LLM to grade the output 1–5 on each rubric criterion.
  • Use multiple LLM judges and average their scores for reliability.
  • Works better than rule-based checks for open-ended text.
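
A sketch of rubric grading with several judges. Each judge is assumed to be a function from prompt string to a single-digit reply. The two lambda judges are stand-ins for real model calls, and the prompt wording is illustrative.

```python
from statistics import mean
from typing import Callable

RUBRIC = [
    "polite",
    "addresses all questions",
    "provides the order status",
    "includes a next step",
]


def judge_prompt(response: str, criterion: str) -> str:
    return (
        f"Rate the response from 1 to 5 on this criterion: {criterion}.\n"
        f"Reply with a single digit.\n\nResponse:\n{response}"
    )


def score(response: str, judges: list[Callable[[str], str]]) -> dict[str, float]:
    """Average each criterion's 1-5 score across all judges."""
    results = {}
    for criterion in RUBRIC:
        scores = [int(judge(judge_prompt(response, criterion)).strip()) for judge in judges]
        if any(not 1 <= s <= 5 for s in scores):
            raise ValueError(f"out-of-range score for {criterion!r}: {scores}")
        results[criterion] = mean(scores)
    return results


judges = [lambda prompt: "4", lambda prompt: "5"]  # stand-ins for two LLM judges
scores = score("Hi! Order 4821 shipped today. Reply if it has not arrived by Friday.", judges)
print(scores)
```

Scoring one criterion per call keeps each judge prompt narrow, and the per-criterion averages show which part of the rubric a response fails.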

Objective vs Subjective Evals

  • Objective (code/rules): did the agent extract the correct order ID? Did it look up the right account? Was the database updated correctly? These are pure Python assertions.
  • Subjective (LLM judges): is the response tone appropriate? Did the email address all the customer's questions? Is the response concise without being curt?
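
An objective eval can be as small as a table of cases and an assertion. The regex extractor below is a toy stand-in for the agent's entity-extraction step, and the test emails are made up.

```python
import re
from typing import Optional


def extract_order_id(email: str) -> Optional[str]:
    """Toy extractor standing in for the agent's entity-extraction step."""
    match = re.search(r"order\s*#?(\d+)", email, flags=re.IGNORECASE)
    return match.group(1) if match else None


# (input email, expected order ID) pairs, including a case with no ID at all.
CASES = [
    ("Where is my order #4821?", "4821"),
    ("Please refund Order 77 today.", "77"),
    ("I need to change my address.", None),
]


def run_evals() -> float:
    """Return the fraction of cases where extraction matches the expected ID."""
    passed = sum(extract_order_id(email) == expected for email, expected in CASES)
    return passed / len(CASES)


accuracy = run_evals()
assert accuracy == 1.0, f"entity extraction accuracy dropped to {accuracy:.0%}"
print(f"entity extraction accuracy: {accuracy:.0%}")
```

The same case table works unchanged when the regex is swapped for an LLM call, which makes it a cheap regression check across model versions.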

Component-Based vs End-to-End

  • Component-based: test each step in isolation. Faster, easier to pinpoint failures. Use during development. A typical finding: "The intent classifier is wrong 30% of the time on refund requests."
  • End-to-end: test the full workflow with real inputs. Catches emergent failures from step interactions. Slower, more expensive. Use before deploys.
  • Use both. Component evals for development; end-to-end for regression testing.

LLM Traces Are Non-Negotiable

If you're interviewing with an AI startup, ask: "Do you have LLM traces?" Without traces, debugging a multi-step agent is nearly impossible.

Trace tooling options include LangSmith, Braintrust, Helicone, and Arize. A useful trace captures:

  • Full chain of every prompt sent.
  • Every LLM response at each step.
  • Tool call inputs/outputs.
  • Timing data per component.
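
A minimal home-grown sketch of what a trace records, using a decorator. This is not how any of the tools above work internally. The `traced` decorator and the in-memory `TRACE` list are assumptions for illustration.

```python
import functools
import time

TRACE: list[dict] = []  # in production this would be sent to a tracing backend


def traced(step_name):
    """Record inputs, output, and duration for every call to the wrapped step."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            output = fn(*args, **kwargs)
            TRACE.append({
                "step": step_name,
                "input": {"args": args, "kwargs": kwargs},
                "output": output,
                "seconds": time.perf_counter() - start,
            })
            return output
        return wrapper
    return decorator


@traced("identify_intent")
def identify_intent(email: str) -> str:
    # Stand-in for an LLM call.
    return "REFUND" if "refund" in email.lower() else "OTHER"


identify_intent("I want a refund for order 4821")
print(TRACE)
```

Wrapping each pipeline step this way produces one record per step, so a wrong final answer can be traced back to the first step whose output looks off.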
Customer Support Agent — Steps + Eval Checkpoints
graph TD
  A[Customer Email] --> B[Parse Email]
  B --> C[Identify Intent]
  C --> D[Extract Entities]
  D --> E[DB Lookup]
  E --> F[Draft Response]
  F --> G[Update Database]
  G --> H[Send Email]
  B -.->|Objective eval| E1[Parsing accuracy]
  C -.->|Objective eval| E2[Intent accuracy]
  D -.->|Objective eval| E3[Entity extraction]
  E -.->|Objective eval| E4[Lookup correctness]
  F -.->|LLM judge| E5[Tone + completeness]
  G -.->|Objective eval| E6[DB state correct?]
  classDef step fill:#1a110a,stroke:var(--accent),color:#e6edf3
  classDef ev fill:#0d2520,stroke:#0d9488,color:#aaa
  class A,B,C,D,E,F,G,H step
  class E1,E2,E3,E4,E5,E6 ev
Eval Taxonomy — 2×2 Matrix
Axes: scope (component-based or end-to-end) × grading (objective or subjective).
Objective + Component
Unit tests — "did intent classifier return REFUND?" Python assertions. Fastest to run.
Subjective + Component
LLM judge on draft tone. Rubric score 1-5 per criterion.
Objective + End-to-End
Full workflow test — was DB updated correctly? Did customer get right info?
Subjective + End-to-End
Human raters score full conversation. Most expensive, catches the most edge cases.
Deep Dive: LLM Evaluation (Evals) Deep Dive. Deterministic evals, LLM-as-judge, traces, benchmark design, and the eval-driven development loop.
07

Multi-Agent Systems

Why Multi-Agent?

  • Parallelism: run independent tasks simultaneously — climate, security, and energy agents all run at once.
  • Specialization: each agent is optimized for a narrow domain — easier to tune, test, and debug.
  • Reusability: a design agent built for the marketing team can also serve the product team.

Interaction Patterns

Pattern | How it works | When to use
Hierarchical | Orchestrator delegates to sub-agents; user only talks to orchestrator | Recommended default: cleaner UX, single point of control
Flat (peer-to-peer) | Agents communicate directly with each other | When two agents are tightly coupled and share state constantly
Hybrid | Hierarchical backbone with peer connections for coupled agents | Complex systems with both independent and dependent agents

Smart Home Case Study (In-Class Exercise)

  • Biometric/location agent: tracks where residents are in the home.
  • Climate agent: controls temperature, access to room sensors and thermostats.
  • Energy agent: monitors usage, can cut power to non-essential systems.
  • Security agent: manages entry, identifies who's entering, sets permissions per person (parent vs child access rules).
  • Fridge/grocery agent: monitors contents via camera, has access to grocery delivery API.
  • Weather agent: pulls external APIs, adjusts blinds/temperature based on conditions.
  • Orchestrator: single user-facing interface, delegates to all other agents.
Key insight: the orchestrator communicates with sub-agents via MCP — each sub-agent is just another typed tool interface. An agent is a tool.
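
A sketch of the hierarchical pattern with parallel sub-agents, using `asyncio.gather`. The two sub-agents return canned strings and sleep briefly to mimic latency. In a real system the orchestrator's LLM would decide which agents are `needed`.

```python
import asyncio


async def climate_agent(request: str) -> str:
    await asyncio.sleep(0.01)  # stands in for model and tool latency
    return "climate: set bedroom to 19C"


async def security_agent(request: str) -> str:
    await asyncio.sleep(0.01)
    return "security: doors locked"


# The orchestrator sees each sub-agent as a named callable, i.e. a tool.
SUB_AGENTS = {"climate": climate_agent, "security": security_agent}


async def orchestrate(request: str, needed: list[str]) -> list[str]:
    """Fan one request out to the named sub-agents and run them concurrently."""
    return await asyncio.gather(*(SUB_AGENTS[name](request) for name in needed))


results = asyncio.run(orchestrate("I'm going to bed", ["climate", "security"]))
print(results)
```

`asyncio.gather` returns results in the order the agents were listed, so the orchestrator can match each reply to its agent without extra bookkeeping.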
Smart Home Multi-Agent Hierarchy
graph TD
  USER([User]) --> ORC[Orchestrator Agent]
  ORC --> CL[Climate Agent]
  ORC --> SEC[Security Agent]
  ORC --> EN[Energy Agent]
  ORC --> GR[Grocery Agent]
  ORC --> WX[Weather Agent]
  EN <-->|share state| CL
  CL -.->|MCP| THERM[(Thermostats)]
  SEC -.->|MCP| CAM[(Cameras)]
  GR -.->|MCP| SHOP[(Grocery API)]
  WX -.->|MCP| API[(Weather API)]
  classDef orch fill:#1a110a,stroke:var(--accent),color:#e6edf3
  classDef agent fill:#0d2520,stroke:#0d9488,color:#e6edf3
  classDef tool fill:#1c2333,stroke:#d4a843,color:#aaa
  classDef user fill:#0d1520,stroke:#5b9bd5,color:#e6edf3
  class ORC orch
  class CL,SEC,EN,GR,WX agent
  class THERM,CAM,SHOP,API tool
  class USER user
Flat vs Hierarchical — Communication Patterns
Flat — All Peers Communicate
Climate
Security
Energy
Grocery
Weather
Lights

All ↔ all. Flexible, but a full mesh of N agents needs N(N-1)/2 links (15 for the six above), which grows as N². Hard to debug.

Hierarchical — Recommended
Orchestrator
Climate
Security
Energy

User talks to orchestrator only. Easier to audit and extend.

Deep Dive: Multi-Agent AI Systems Deep Dive. Orchestrator-subagent patterns, shared state, coordination without deadlock, and production deployment.
08

What's Next in AI — Trends to Watch

Are We Plateauing?

Ilya Sutskever (OpenAI co-founder) raised this question publicly. Recent GPT releases have shown smaller capability jumps than earlier generations, which suggests marginal returns on raw compute may be diminishing. The counter-argument is that scaling laws still hold as long as compute and energy keep scaling. The question is unresolved.

Architecture Search as the Unlock

Most current LLMs are transformer-based. The transformer was a breakthrough — but not the end state. The human brain doesn't use transformers or backpropagation, yet learns faster, from less data, with far lower energy. This gap represents massive opportunity in architecture search. Labs are hiring thousands of engineers to run this search. The next "transformer discovery" — a new architecture that reduces compute by 10x — could restart exponential improvement.

Multi-Modality as a Source of Gain

  • Adding image understanding made text models better at text.
  • Adding audio will make image+text models better at everything.
  • Adding video will push further gains across all modalities.
  • Modalities reinforce each other: knowing what a cat looks like makes you better at writing about cats.
  • The endpoint: robotics, where all modalities fuse into physical action.

Multiple Learning Methods in Harmony

Current AI systems mostly apply these learning methods separately. Human learning uses them all simultaneously:

Human learning | AI equivalent
Survival instincts (encoded in DNA) | Pre-training
Parent pointing at objects (good/bad) | Supervised learning
Falling down and getting hurt | Reinforcement learning reward signal
Observing other people/babies | Unsupervised learning

The Velocity Problem: Skills Have Short Half-Lives

Don't optimize for knowing the current best RAG technique. Optimize for understanding the problem space deeply enough to quickly evaluate and adopt new techniques as they emerge. The course teaches breadth — pull depth when you need it.
Key Trends Shaping AI Progress
Now
Scaling Plateau Question
Are marginal returns on compute diminishing? GPT updates showing smaller capability jumps. Unresolved.
Near Term
Architecture Search
Labs hiring thousands to find the next transformer. Brain doesn't use backprop — huge unexplored architecture space.
Medium Term
Multi-Modality Gains
Each added modality improves all others. Cross-modal reinforcement compounds. Endpoint: robotics.
Ongoing
Methods in Harmony
Human learning blends supervised, RL, unsupervised, and meta-learning simultaneously. AI will converge here.
Multi-Modality — Each Addition Improves Everything
Text
Baseline LLM capability
Text + Images
Image understanding improves text reasoning
+ Audio
Temporal understanding unlocked
+ Video
Spatial + temporal reasoning
→ Robotics
All modalities fuse into physical action

Quick Reference: Decision Framework

Decision Flowchart — Start Here
graph TD
  P{What is the problem?} --> A[Model lacks domain knowledge]
  P --> B[Output format inconsistent]
  P --> C[Multi-step reasoning fails]
  P --> D[Need real-world actions]
  P --> E[Tasks are parallel]
  P --> F[Considering fine-tuning]
  A --> A1([Use RAG])
  B --> B1([Few-shot prompting])
  C --> C1([Chain-of-Thought])
  D --> D1([Agentic workflow])
  E --> E1([Multi-agent system])
  F --> F1([Try prompting + RAG first])
  classDef prob fill:#1c2333,stroke:#d4a843,color:#e6edf3
  classDef sol fill:#0d2520,stroke:#0d9488,color:#e6edf3
  classDef warn fill:#1a1208,stroke:var(--accent),color:#e6edf3
  class P,A,B,C,D,E,F prob
  class A1,B1,C1,D1,E1 sol
  class F1 warn
Full Decision Table
Problem | First Approach
Model doesn't know your domain data | RAG
Need consistent output format | Prompt engineering + few-shot
Multi-step reasoning failing | Chain-of-Thought
Complex task with multiple sub-tasks | Prompt chaining
Need to take actions in the world | Agentic workflow with tools
Tasks that can run in parallel | Multi-agent system
Evaluating subjective output quality | LLM judges with rubrics
Evaluating objective correctness | Python assertion-based evals
System behaving unexpectedly | LLM traces + component debugging
Considering fine-tuning | Don't; try prompting + RAG first

Go Deeper — Topic Guides

Each topic in this guide has a standalone deep dive with extended examples, implementation patterns, and decision frameworks. New guides publish as the space develops.

Key Takeaways

  • 01 Start simple. Prompt engineering before RAG, RAG before fine-tuning, single agent before multi-agent.
  • 02 Measure everything. LLM traces + evals from day one. You can't debug what you can't observe.
  • 03 Decompose tasks. Break workflows into testable components; eval each one separately.
  • 04 Use LLM judges. For subjective quality at scale, LLMs grading LLMs works surprisingly well with rubrics.
  • 05 Prefer RAG over fine-tuning. RAG is debuggable, updatable, and cheaper in almost every scenario.
  • 06 MCP is the agent interface standard. Treat agents as typed tools; reuse them across systems.
  • 07 Multi-agent = parallelism + specialization. Not magic, just good engineering decomposition.
  • 08 Stay broad. AI techniques have short half-lives. Pull depth when you need it, don't accumulate in advance.
Based in Kansas City? We help businesses implement these techniques (RAG, agents, evals) without the trial and error. Oaken AI consulting for Kansas City businesses. Work With Us →