109-minute lecture distilled — prompting, RAG, fine-tuning, agentic workflows, evals, and multi-agent systems with animated visualizations.
Understanding the constraints shapes every design decision. All techniques in this guide exist to solve one or more of these four problems.
The Four Root Limitations
Domain Knowledge Gaps
Base models lack proprietary data, recent events, and internal docs.
Context Window Limits
Can't hold arbitrarily long history — forces architectural choices.
Hallucinations
Generates plausible-sounding but incorrect outputs with confidence.
Difficulty of Control
Hard to get consistent, structured outputs reliably across inputs.
Which Limitation Drives Which Technique
graph TD
A[Domain Knowledge Gap] -->|solve with| B[RAG]
C[Context Window Limit] -->|solve with| D[Chunking and Memory]
E[Hallucinations] -->|reduce with| F[Grounding + Evals]
G[Control Difficulty] -->|solve with| H[Prompt Engineering]
classDef lim fill:#1c2333,stroke:var(--accent),color:#e6edf3
classDef sol fill:#0d2520,stroke:#0d9488,color:#e6edf3
class A,C,E,G lim
class B,D,F,H sol
These four problems are the root motivation for prompt engineering, RAG, fine-tuning, and agentic workflows.
Zero-shot: give the LLM a task description only, no examples. Always start here.
Few-shot: provide 2–5 input/output examples before the actual task.
Few-shot helps when the model needs to understand output format, tone, or domain-specific patterns.
Rule of thumb: try zero-shot first; add examples only when output quality is inconsistent.
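As a rough illustration of the difference, the sketch below builds a zero-shot and a few-shot prompt from the same task description. The function name `build_prompt` and the "Input:/Output:" layout are assumptions made for this example, not a format prescribed by the lecture.

```python
def build_prompt(task, query, examples=()):
    """Return a zero-shot prompt when `examples` is empty, a few-shot prompt otherwise."""
    parts = [task.strip(), ""]
    # Each example is an (input, output) pair shown before the real query.
    for example_input, example_output in examples:
        parts += [f"Input: {example_input}", f"Output: {example_output}", ""]
    parts += [f"Input: {query}", "Output:"]
    return "\n".join(parts)


task = "Classify the sentiment of the review as POSITIVE or NEGATIVE."
query = "Battery died after two days."

zero_shot = build_prompt(task, query)
few_shot = build_prompt(
    task,
    query,
    examples=[
        ("Works exactly as described.", "POSITIVE"),
        ("Arrived broken and support never replied.", "NEGATIVE"),
    ],
)
print(few_shot)
```

Starting from `zero_shot` and switching to `few_shot` only when outputs are inconsistent mirrors the rule of thumb above.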
Chain-of-Thought (CoT)
Instruct the model to "think step by step" before giving a final answer.
Often substantially improves performance on reasoning, math, and multi-step logic tasks.
Works because intermediate tokens give the model space to work through the problem.
Can be explicit ("let's think step by step") or via few-shot examples that show reasoning.
Prompt Chaining
Break a complex task into sequential prompts, each with a focused sub-task.
Output of step N becomes input to step N+1.
Benefits: easier debugging, room to inject validation between steps, and shorter, more focused prompts that tend to perform better.
Use cases: research → draft → edit pipelines; extract → classify → respond flows.
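A minimal sketch of the chaining idea, assuming the model is available as a plain `prompt -> text` callable. `run_chain`, `fake_llm`, and the `{input}` placeholder are hypothetical names for this example; `fake_llm` stands in for a real model call.

```python
def run_chain(llm, steps, initial_input, validate=None):
    """Run prompts in order; the output of step N becomes the input to step N+1."""
    current = initial_input
    for template in steps:
        current = llm(template.format(input=current))
        # Optional validation hook between steps.
        if validate is not None and not validate(current):
            raise ValueError(f"Step output failed validation: {current!r}")
    return current


calls = []  # every prompt the stand-in model receives, kept for inspection


def fake_llm(prompt):
    """Stand-in for a real model call: records the prompt, returns a canned reply."""
    calls.append(prompt)
    return f"result of step {len(calls)}"


steps = [
    "Extract the key facts.\n{input}",
    "Classify the request type.\n{input}",
    "Write a reply.\n{input}",
]
final = run_chain(
    fake_llm,
    steps,
    "Where is my order #123?",
    validate=lambda output: bool(output.strip()),
)
print(final)
```

Because each step is a separate call, a failing step can be inspected in isolation through `calls`.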
Prompt Templates
Parameterize prompts so the structure is stable — only the inputs change.
Makes prompts testable, versionable, and reusable across similar tasks.
Store templates separately from business logic.
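One way to sketch a parameterized prompt is with the standard library's `string.Template`. The template text, the `SUMMARY_PROMPT_V1` name, and the `render_summary_prompt` helper are illustrative assumptions.

```python
from string import Template

# The template lives apart from business logic and carries a version in its name.
SUMMARY_PROMPT_V1 = Template(
    "Summarize the following $doc_type in at most $max_sentences sentences.\n\n"
    "$text"
)


def render_summary_prompt(doc_type, text, max_sentences=3):
    # substitute() raises KeyError if a placeholder is left unfilled,
    # which makes the template easy to cover with a unit test.
    return SUMMARY_PROMPT_V1.substitute(
        doc_type=doc_type, max_sentences=max_sentences, text=text
    )


prompt = render_summary_prompt("support ticket", "Customer cannot log in since Monday.")
print(prompt)
```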
LLM Judges / Evals with Promptfoo
Use an LLM to evaluate whether another LLM's output meets quality criteria. Define rubrics: what does "good" look like for this task? Useful for subjective quality (tone, helpfulness, completeness) where rule-based checks fail.
Promptfoo: open-source tool for running LLM evals at scale.
Define test cases with expected behaviors; run across model versions to detect regressions.
Generates quality reports automatically.
Which Prompt Technique to Use — Decision Tree
graph TD
START([New Task]) --> Q1{Output quality OK?}
Q1 -->|Yes| SHIP([Ship Zero-Shot])
Q1 -->|No| Q2{Format or style issue?}
Q2 -->|Yes| FS[Add Few-Shot Examples]
Q2 -->|No| Q3{Multi-step reasoning?}
Q3 -->|Yes| COT[Chain-of-Thought]
Q3 -->|No| Q4{Complex pipeline?}
Q4 -->|Yes| CHAIN[Prompt Chaining]
Q4 -->|No| FS
classDef dec fill:#1c2333,stroke:#d4a843,color:#e6edf3
classDef sol fill:#0d2520,stroke:#0d9488,color:#e6edf3
classDef term fill:#1a110a,stroke:var(--accent),color:#e6edf3
class Q1,Q2,Q3,Q4 dec
class FS,COT,CHAIN sol
class START,SHIP term
Chain-of-Thought — Live Example
Prompt
"A store sells items at $12, $8, $15. Customer pays $50. Think step by step, then give the change."
Model thinking
Step 1: Total cost = $12 + $8 + $15 = $35
Step 2: Customer paid $50
Step 3: Change = $50 − $35 = $15
Answer
The change is $15.
Without CoT, the model is more likely to make an arithmetic slip on problems like this. It works because the intermediate tokens are themselves the computation: each step is written out, and the next step is conditioned on it.
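The example above can be wrapped in a small helper that appends the step-by-step instruction and then pulls out the final answer. This is a sketch: the "Answer:" line convention and the names `with_cot` and `extract_answer` are assumptions for this example, and `completion` is a hand-written stand-in for a model response.

```python
import re


def with_cot(question):
    """Append an explicit chain-of-thought instruction to a question."""
    return (
        f"{question}\n"
        "Think step by step, then give the final answer on a line starting with 'Answer:'."
    )


def extract_answer(completion):
    """Return the text after the first 'Answer:' line, or None if there is none."""
    match = re.search(r"^Answer:\s*(.+)$", completion, flags=re.MULTILINE)
    return match.group(1).strip() if match else None


# Hand-written stand-in for what a model might return.
completion = """Step 1: Total cost = $12 + $8 + $15 = $35
Step 2: Customer paid $50
Step 3: Change = $50 - $35 = $15
Answer: $15"""

# Independent check of the arithmetic in the worked example.
expected_change = 50 - sum([12, 8, 15])
print(extract_answer(completion), expected_change)
```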
Core recommendation: avoid fine-tuning. The lecturer strongly recommends against it for most teams. Prompt engineering and RAG cover the vast majority of cases.
Why to Avoid It
Overfitting risk: fine-tuning on small datasets leads to narrow behavior that fails on edge cases.
Stale quickly: a fine-tuned model becomes outdated as base models improve; you're constantly re-tuning.
Data collection is hard: requires high-quality labeled examples that are expensive to produce.
Prompt engineering often wins: a well-crafted prompt frequently achieves what fine-tuning would.
The Slack Fine-Tuning Failure (Case Study)
Slack attempted to fine-tune a model for internal use. The fine-tuned model regressed on general capabilities, became brittle on edge cases, and required ongoing maintenance as base models updated. The effort wasn't worth the outcome.
When Fine-Tuning Is Justified
Very specific output format that prompts can't reliably produce.
Latency or cost requirements that call for a smaller model, fine-tuned to reach the quality bar.
You have thousands of high-quality labeled examples.
Task is stable — won't change as base models improve.
Should I Fine-Tune? — Decision Tree
graph TD
A([Considering Fine-Tuning]) --> B[Try Prompt Engineering First]
B --> C{Works?}
C -->|Yes| D([Ship It])
C -->|No| E[Try RAG]
E --> F{Works?}
F -->|Yes| D
F -->|No| G{1000+ quality examples?}
G -->|No| H([Collect data first])
G -->|Yes| I{Task stable?}
I -->|No| J([Wait for better base model])
I -->|Yes| K([Fine-tune])
classDef dec fill:#1c2333,stroke:#d4a843,color:#e6edf3
classDef ship fill:#0d2520,stroke:#0d9488,color:#e6edf3
classDef warn fill:#1a1208,stroke:#d4a843,color:#e6edf3
classDef start fill:#1a110a,stroke:var(--accent),color:#e6edf3
class C,F,G,I dec
class D,K ship
class H,J warn
class A start
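The decision tree above can also be read as a short function. This is only a restatement of the diagram; the function and parameter names are hypothetical, and the 1000-example threshold is taken from the diagram's "1000+ quality examples?" node.

```python
def fine_tuning_recommendation(prompting_works, rag_works, num_quality_examples, task_is_stable):
    """Walk the fine-tuning decision tree from top to bottom."""
    if prompting_works:
        return "ship it (prompt engineering)"
    if rag_works:
        return "ship it (RAG)"
    if num_quality_examples < 1000:
        return "collect data first"
    if not task_is_stable:
        return "wait for better base model"
    return "fine-tune"


print(fine_tuning_recommendation(False, False, 5000, True))
```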
RAG solves the domain knowledge problem: instead of baking proprietary knowledge into the model, you retrieve it at query time.
Vector Databases
Store documents as embeddings — dense numerical representations of meaning.
Semantic search: find documents similar in meaning, not just keyword matches.
Common options: Pinecone, Weaviate, pgvector (Postgres), Chroma.
Embedding model choice matters — different models have different tradeoffs for your domain.
Chunking Strategy
Split source documents into chunks before embedding.
Chunk size is critical: too small loses context; too large dilutes relevance signal.
Common approaches: fixed-size with overlap, sentence-boundary aware, paragraph-level.
Experiment for your specific domain — no universal right answer.
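A minimal sketch of the fixed-size-with-overlap approach, measured in characters for simplicity (real pipelines often count tokens instead). The default sizes are arbitrary placeholders, not recommendations.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end, to avoid emitting a redundant tail.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
```

The overlap means a sentence cut at one chunk boundary still appears whole in a neighbouring chunk, at the cost of storing some text twice.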
HyDE — Hypothetical Document Embeddings
Problem: a short question and the passage that answers it are worded very differently, so their embeddings can land far apart even though they share one vector space. A user asking "why is the sky blue?" doesn't match well against a physics textbook paragraph.
Ask the LLM to generate a hypothetical answer to the query.
Embed that hypothetical answer.
Use that embedding for retrieval instead of the original query.
The hypothetical answer matches the document's "answer language" better than the question. Measurably improves retrieval quality on knowledge-intensive tasks.
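The three steps can be sketched with toy components. Everything here is a stand-in: `embed` is a bag-of-words counter rather than a real embedding model, `fake_llm` returns a canned hypothetical answer, and the two documents are made up for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)  # Counter yields 0 for missing tokens
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hyde_retrieve(query, documents, generate_hypothetical_answer, top_k=1):
    """Embed a hypothetical answer instead of the raw query, then rank documents."""
    hypothetical = generate_hypothetical_answer(query)
    query_vector = embed(hypothetical)
    ranked = sorted(documents, key=lambda doc: cosine(query_vector, embed(doc)), reverse=True)
    return ranked[:top_k]


def fake_llm(query):
    """Canned stand-in for an LLM asked to draft an answer to the query."""
    return "The sky looks blue because air molecules scatter short blue wavelengths of sunlight."


documents = [
    "Rayleigh scattering: air molecules scatter short wavelengths of sunlight more strongly than long ones.",
    "Blue whales are the largest animals known to have lived on Earth.",
]
query = "why is the sky blue?"
plain_top = hyde_retrieve(query, documents, lambda q: q)  # raw query, no HyDE
hyde_top = hyde_retrieve(query, documents, fake_llm)
print(plain_top[0])
print(hyde_top[0])
```

In this toy setup the raw question shares only surface words with the whale sentence, while the hypothetical answer shares vocabulary with the physics passage.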
RAG Pipeline — Indexing + Query Phases
graph TD
subgraph INDEX["Indexing Phase (offline)"]
D[Source Documents] --> CH[Chunk Text]
CH --> EMB[Embed Chunks]
EMB --> VDB[(Vector Database)]
end
subgraph QUERY["Query Phase (real-time)"]
Q[User Query] --> QE[Embed Query]
QE --> SR[Semantic Search]
SR --> VDB
VDB --> TOP[Top-K Docs Retrieved]
TOP --> CTX[Inject into Context]
CTX --> LLM[LLM]
LLM --> ANS[Grounded Answer]
end
classDef idx fill:#0d2520,stroke:#0d9488,color:#e6edf3
classDef qry fill:#1a110a,stroke:var(--accent),color:#e6edf3
classDef db fill:#1c2333,stroke:#d4a843,color:#e6edf3
class D,CH,EMB idx
class Q,QE,SR,TOP,CTX,LLM,ANS qry
class VDB db
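Both phases of the diagram can be sketched in a few lines. The `InMemoryIndex` class stands in for a vector database, `embed` is a toy bag-of-words function rather than a real embedding model, and the prompt wording is an assumption for this example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryIndex:
    """Stand-in for a vector database: stores (chunk, vector) pairs in a list."""

    def __init__(self):
        self.entries = []

    def add(self, chunk):  # indexing phase (offline)
        self.entries.append((chunk, embed(chunk)))

    def search(self, query, top_k=2):  # query phase (real-time)
        query_vector = embed(query)
        ranked = sorted(self.entries, key=lambda entry: cosine(query_vector, entry[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]


def build_grounded_prompt(query, index, top_k=2):
    """Inject the retrieved chunks into the context ahead of the question."""
    context = "\n".join(f"- {chunk}" for chunk in index.search(query, top_k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


index = InMemoryIndex()
for chunk in [
    "Refunds are issued within 5 business days.",
    "Our office is closed on public holidays.",
    "Refund requests need the original order ID.",
]:
    index.add(chunk)

prompt = build_grounded_prompt("How long do refunds take?", index, top_k=1)
print(prompt)
```

The resulting `prompt` is what would be sent to the LLM to produce the grounded answer.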
Andrew Ng's definition: An agentic workflow is one where an LLM iterates on its output — it can plan, act, observe results, and revise. The LLM isn't just doing one forward pass; it's deciding what to do next based on what happened.
Memory Types
Working memory (short-term): the current context window. What the agent knows "right now." Limited by context window size.
Archival memory (long-term): external storage (databases, vector stores, files) the agent can read/write. Enables persistence across sessions. This is what lets an agent "remember" past interactions.
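A rough sketch of the two memory types, assuming a bounded message buffer for working memory and a plain dictionary as a stand-in for external storage. The `AgentMemory` class and its method names are hypothetical; a real archive would persist to a database, vector store, or file.

```python
from collections import deque


class AgentMemory:
    def __init__(self, working_capacity=3):
        # Working memory: bounded like a context window; oldest items fall off.
        self.working = deque(maxlen=working_capacity)
        # Archival memory: dict stand-in for external, persistent storage.
        self.archive = {}

    def observe(self, message):
        self.working.append(message)

    def remember(self, key, value):
        self.archive[key] = value

    def recall(self, key, default=None):
        return self.archive.get(key, default)


memory = AgentMemory(working_capacity=2)
for message in ["hi", "my order is #123", "where is it?"]:
    memory.observe(message)
memory.remember("order_id", "#123")

print(list(memory.working))
print(memory.recall("order_id"))
```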
Tools
Tools transform the LLM from a question-answerer into an actor.
Web search: access current information.
Code execution: run Python, test logic.
Database read/write: query and update persistent state.
API calls: interact with external services (calendar, email, CRM).
File system: read documents, save outputs.
MCP — Model Context Protocol
Standardized protocol for agents to call tools and other services.
Treats tools as typed interfaces: here's what this tool accepts, here's what it returns.
Enables tool reuse across different agents and systems.
In multi-agent systems: agents communicate with each other via MCP — an agent is just another tool.
Anthropic-backed standard gaining adoption across the ecosystem.
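To make "tools as typed interfaces" concrete, here is a plain-Python illustration. It is not the MCP SDK or wire format: `ToolSpec`, its fields, and `lookup_order` are hypothetical names invented for this sketch, which only shows declaring what a tool accepts and checking calls against that declaration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    input_schema: dict  # parameter name -> expected Python type
    handler: Callable[..., Any]

    def call(self, **kwargs):
        """Validate arguments against the declared schema, then run the handler."""
        missing = set(self.input_schema) - set(kwargs)
        if missing:
            raise TypeError(f"missing arguments: {sorted(missing)}")
        for key, value in kwargs.items():
            expected = self.input_schema.get(key)
            if expected is None:
                raise TypeError(f"unexpected argument: {key}")
            if not isinstance(value, expected):
                raise TypeError(f"{key} must be {expected.__name__}")
        return self.handler(**kwargs)


def lookup_order(order_id):
    """Hypothetical tool body; a real one would query an order system."""
    return {"order_id": order_id, "status": "shipped"}


order_tool = ToolSpec(
    name="lookup_order",
    description="Return the status of an order.",
    input_schema={"order_id": str},
    handler=lookup_order,
)
result = order_tool.call(order_id="A-1")
print(result)
```

Because the spec is plain data plus a handler, the same `order_tool` could be handed to any agent that understands the interface.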
Autonomy Spectrum
Low autonomy: human approves every action.
Medium autonomy: agent acts, human reviews outputs.
High autonomy: agent runs end-to-end, humans get alerts on exceptions.
Start low autonomy, earn trust, expand. Don't give full autonomy to a system you haven't validated.
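The three levels can be sketched as a gate around each action. The names `Autonomy` and `execute`, and the list-based review queue and alert log, are assumptions for this example; note that this sketch alerts on exceptions at every level, not only at high autonomy.

```python
from enum import Enum


class Autonomy(Enum):
    LOW = "low"        # human approves every action
    MEDIUM = "medium"  # agent acts, human reviews outputs
    HIGH = "high"      # agent runs end-to-end, humans alerted on exceptions


def execute(name, action, autonomy, approve, review_queue, alerts):
    """Run `action` (a zero-argument callable) under the given autonomy level."""
    if autonomy is Autonomy.LOW and not approve(name):
        return None  # rejected by the human, so nothing runs
    try:
        result = action()
    except Exception as error:
        alerts.append(f"{name} failed: {error}")  # surfaced to a human
        return None
    if autonomy is Autonomy.MEDIUM:
        review_queue.append((name, result))  # queued for after-the-fact review
    return result


review_queue, alerts = [], []
deny = lambda name: False

blocked = execute("send_email", lambda: "sent", Autonomy.LOW, deny, review_queue, alerts)
reviewed = execute("send_email", lambda: "sent", Autonomy.MEDIUM, deny, review_queue, alerts)
failed = execute("update_db", lambda: 1 / 0, Autonomy.HIGH, deny, review_queue, alerts)
print(blocked, reviewed, failed, review_queue, alerts)
```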
The Agent Loop
graph TD
GOAL([User Goal]) --> PLAN[Plan Next Action]
PLAN --> TOOL[Select Tool]
TOOL --> ACT[Execute Action]
ACT --> OBS[Observe Result]
OBS --> EVAL{Goal Achieved?}
EVAL -->|No - revise| PLAN
EVAL -->|Yes| RESP([Return to User])
MEM1[(Working Memory)] -.->|read/write| PLAN
MEM2[(Archival Memory)] -.->|read/write| ACT
classDef loop fill:#1c2333,stroke:var(--accent),color:#e6edf3
classDef mem fill:#0d2520,stroke:#0d9488,color:#e6edf3
classDef term fill:#1a110a,stroke:var(--accent),color:#e6edf3
classDef dec fill:#1a1208,stroke:#d4a843,color:#e6edf3
class PLAN,TOOL,ACT,OBS loop
class MEM1,MEM2 mem
class GOAL,RESP term
class EVAL dec
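The loop in the diagram can be sketched as follows. The planner here is a scripted stand-in for an LLM, the two tools are trivial lambdas, and all names (`run_agent`, `scripted_plan`) are hypothetical.

```python
def run_agent(goal, plan, tools, goal_achieved, max_steps=5):
    """Plan, act, observe, and repeat until the goal check passes."""
    history = []  # working memory: (tool, argument, observation) per step
    for _ in range(max_steps):
        tool_name, argument = plan(goal, history)       # plan + select tool
        observation = tools[tool_name](argument)        # execute action
        history.append((tool_name, argument, observation))  # observe result
        if goal_achieved(history):
            return observation, history
    raise RuntimeError("goal not achieved within max_steps")


def scripted_plan(goal, history):
    """Stand-in for an LLM planner: search first, then summarize the result."""
    if not history:
        return "search", goal
    return "summarize", history[-1][2]


tools = {
    "search": lambda query: f"notes on {query}",
    "summarize": lambda text: f"summary: {text}",
}

answer, history = run_agent(
    "agent loops",
    scripted_plan,
    tools,
    goal_achieved=lambda steps: steps[-1][0] == "summarize",
)
print(answer)
```

The `max_steps` cap is a design choice of this sketch: it keeps a planner that never satisfies the goal check from looping forever.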
Look up relevant information (order status, account data).
Draft a response.
Update database if needed.
Send the email.
LLM Judges with Rubrics
Write a rubric: "A good response is polite, addresses all questions, provides the order status, and includes a next step."
Ask an LLM to grade the output 1–5 on each rubric criterion.
Use multiple LLM judges and average for reliability.
Works better than rule-based checks for open-ended text.
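A sketch of rubric scoring with several judges averaged per criterion. Each judge is assumed to be a callable `(output, criterion) -> score` that would wrap an LLM call in practice; the two judges here are fixed-score stand-ins, and the rubric criteria are taken from the example rubric above.

```python
from statistics import mean

RUBRIC = [
    "polite",
    "addresses all questions",
    "provides the order status",
    "includes a next step",
]


def judge_output(output, judges, rubric=RUBRIC):
    """Average each criterion's 1-5 score across all judges."""
    scores = {}
    for criterion in rubric:
        votes = [judge(output, criterion) for judge in judges]
        if any(not 1 <= vote <= 5 for vote in votes):
            raise ValueError(f"score out of range for {criterion!r}: {votes}")
        scores[criterion] = mean(votes)
    return scores


# Stand-ins for LLM judges; real ones would prompt a model with the rubric.
strict_judge = lambda output, criterion: 3
lenient_judge = lambda output, criterion: 5

scores = judge_output("Your order A-1 has shipped.", [strict_judge, lenient_judge])
print(scores)
```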
Objective vs Subjective Evals
Objective (code/rules): did the agent extract the correct order ID? Did it look up the right account? Was the database updated correctly? These are pure Python assertions.
Subjective (LLM judges): is the response tone appropriate? Did the email address all the customer's questions? Is the response concise without being curt?
Component-Based vs End-to-End
Component-based: test each step in isolation. Faster, easier to pinpoint failures. Use during development. "The intent classifier is wrong 30% of the time on refund requests."
End-to-end: test the full workflow with real inputs. Catches emergent failures from step interactions. Slower, more expensive. Use before deploys.
Use both. Component evals for development; end-to-end for regression testing.
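A component-level objective eval might look like the sketch below, which reports accuracy per intent so a weak spot such as refund requests stands out. The keyword-based `classify_intent` is a deliberately crude stand-in for the component under test, and the labeled cases are invented for the example.

```python
from collections import defaultdict


def classify_intent(email):
    """Crude stand-in for the intent classifier being evaluated."""
    text = email.lower()
    if "refund" in text:
        return "REFUND"
    if "where" in text or "status" in text:
        return "ORDER_STATUS"
    return "OTHER"


def accuracy_by_label(classifier, labeled_cases):
    """Return the fraction of correct predictions for each expected label."""
    totals, correct = defaultdict(int), defaultdict(int)
    for text, expected in labeled_cases:
        totals[expected] += 1
        correct[expected] += classifier(text) == expected  # True counts as 1
    return {label: correct[label] / totals[label] for label in totals}


cases = [
    ("I want a refund for order 42", "REFUND"),
    ("Please send my money back", "REFUND"),
    ("Where is my package?", "ORDER_STATUS"),
]
report = accuracy_by_label(classify_intent, cases)
print(report)
```

The same labeled cases can be rerun after every classifier change, which is what makes component evals useful during development.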
LLM Traces Are Non-Negotiable
If you're interviewing with an AI startup, ask: "Do you have LLM traces?" Without traces, debugging a multi-step agent is nearly impossible.
Good trace tooling: LangSmith, Braintrust, Helicone, Arize. A useful trace captures:
Full chain of every prompt sent.
Every LLM response at each step.
Tool call inputs/outputs.
Timing data per component.
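To show what such a trace holds, here is a minimal in-process recorder. It is a sketch only and not the API of any of the tools named above; the `Trace` class and the span fields are assumptions for this example.

```python
import time
from contextlib import contextmanager


class Trace:
    """Collects one record per step: name, inputs, output, and elapsed seconds."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, name, inputs):
        record = {"name": name, "inputs": inputs, "output": None, "seconds": None}
        start = time.perf_counter()
        try:
            yield record  # the caller fills in record["output"]
        finally:
            # Timing is recorded even if the step raises.
            record["seconds"] = time.perf_counter() - start
            self.spans.append(record)


trace = Trace()

with trace.span("lookup_order", {"order_id": "A-1"}) as record:
    record["output"] = {"status": "shipped"}  # stand-in for a tool call

with trace.span("draft_response", {"prompt": "Write a reply about order A-1"}) as record:
    record["output"] = "Your order A-1 has shipped."  # stand-in for an LLM response

for span in trace.spans:
    print(span["name"], span["inputs"], span["output"], f"{span['seconds']:.6f}s")
```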
Customer Support Agent — Steps + Eval Checkpoints
graph TD
A[Customer Email] --> B[Parse Email]
B --> C[Identify Intent]
C --> D[Extract Entities]
D --> E[DB Lookup]
E --> F[Draft Response]
F --> G[Update Database]
G --> H[Send Email]
B -.->|Objective eval| E1[Parsing accuracy]
C -.->|Objective eval| E2[Intent accuracy]
D -.->|Objective eval| E3[Entity extraction]
E -.->|Objective eval| E4[Lookup correctness]
F -.->|LLM judge| E5[Tone + completeness]
G -.->|Objective eval| E6[DB state correct?]
classDef step fill:#1a110a,stroke:var(--accent),color:#e6edf3
classDef ev fill:#0d2520,stroke:#0d9488,color:#aaa
class A,B,C,D,E,F,G,H step
class E1,E2,E3,E4,E5,E6 ev
Eval Taxonomy — 2×2 Matrix
Axes: Component-based vs End-to-end; Objective vs Subjective.
Objective + Component
Unit tests — "did intent classifier return REFUND?" Python assertions. Fastest to run.
Subjective + Component
LLM judge on draft tone. Rubric score 1-5 per criterion.
Objective + End-to-End
Full workflow test — was DB updated correctly? Did customer get right info?
Subjective + End-to-End
Human raters score full conversation. Most expensive, catches the most edge cases.
Ilya Sutskever (OpenAI co-founder) publicly raised the question of whether scaling is hitting a plateau. Recent GPT releases showed a smaller capability jump than previous generations, so marginal returns on raw compute may be diminishing. Counter-argument: scaling laws still hold if compute and energy continue to scale. The question is unresolved.
Architecture Search as the Unlock
Most current LLMs are transformer-based. The transformer was a breakthrough — but not the end state. The human brain doesn't use transformers or backpropagation, yet learns faster, from less data, with far lower energy. This gap represents massive opportunity in architecture search. Labs are hiring thousands of engineers to run this search. The next "transformer discovery" — a new architecture that reduces compute by 10x — could restart exponential improvement.
Multi-Modality as a Source of Gain
Adding image understanding made text models better at text.
Adding audio will make image+text models better at everything.
Adding video will push further gains across all modalities.
Modalities reinforce each other: knowing what a cat looks like makes you better at writing about cats.
The endpoint: robotics, where all modalities fuse into physical action.
Multiple Learning Methods in Harmony
Current AI uses individual methods in isolation. Human learning uses all simultaneously:
Human learning → AI equivalent
Survival instincts (encoded in DNA) → Pre-training
Parent pointing at objects (good/bad) → Supervised learning
Falling down and getting hurt → Reinforcement learning reward signal
Observing other people/babies → Unsupervised learning
The Velocity Problem: Skills Have Short Half-Lives
Don't optimize for knowing the current best RAG technique. Optimize for understanding the problem space deeply enough to quickly evaluate and adopt new techniques as they emerge. The course teaches breadth — pull depth when you need it.
Key Trends Shaping AI Progress
Now: Scaling Plateau Question. Are marginal returns on compute diminishing? GPT updates are showing smaller capability jumps. Unresolved.
Near term: Architecture Search. Labs are hiring thousands to find the next transformer. The brain doesn't use backprop, which points to a huge unexplored architecture space.
Medium term: Multi-Modality Gains. Each added modality improves all others. Cross-modal reinforcement compounds. Endpoint: robotics.
Ongoing: Methods in Harmony. Human learning blends supervised, RL, unsupervised, and meta-learning simultaneously. AI will converge here.
Multi-Modality — Each Addition Improves Everything
Text: baseline LLM capability.
Text + Images: image understanding improves text reasoning.
+ Audio: temporal understanding unlocked.
+ Video: spatial + temporal reasoning.
→ Robotics: all modalities fuse into physical action.
graph TD
P{What is the problem?} --> A[Model lacks domain knowledge]
P --> B[Output format inconsistent]
P --> C[Multi-step reasoning fails]
P --> D[Need real-world actions]
P --> E[Tasks are parallel]
P --> F[Considering fine-tuning]
A --> A1([Use RAG])
B --> B1([Few-shot prompting])
C --> C1([Chain-of-Thought])
D --> D1([Agentic workflow])
E --> E1([Multi-agent system])
F --> F1([Try prompting + RAG first])
classDef prob fill:#1c2333,stroke:#d4a843,color:#e6edf3
classDef sol fill:#0d2520,stroke:#0d9488,color:#e6edf3
classDef warn fill:#1a1208,stroke:var(--accent),color:#e6edf3
class P,A,B,C,D,E,F prob
class A1,B1,C1,D1,E1 sol
class F1 warn
01 Start simple. Prompt engineering before RAG, RAG before fine-tuning, single agent before multi-agent.
02 Measure everything. LLM traces + evals from day one. You can't debug what you can't observe.
03 Decompose tasks. Break workflows into testable components; eval each one separately.
04 Use LLM judges. For subjective quality at scale, LLMs grading LLMs works surprisingly well with rubrics.
05 Prefer RAG over fine-tuning. RAG is debuggable, updatable, and cheaper in almost every scenario.
06 MCP is the agent interface standard. Treat agents as typed tools; reuse them across systems.
07 Multi-agent = parallelism + specialization. Not magic, just good engineering decomposition.
08 Stay broad. AI techniques have short half-lives. Pull depth when you need it; don't accumulate it in advance.