Mar 08, 2026
How I Built a Lightweight AI Agent Framework in Python (And Battle-Tested It in One Morning)
I built MAF — a minimal AI agent framework in Python with one core loop, typed tool schemas, and JSONL traces. Here's how to build an AI agent from scratch, what broke against real APIs, and why minimal beats monolithic.
Most AI agent frameworks want you to learn a new religion before you ship anything. I wanted to build AI agent workflows that could run shell commands, read files, hit APIs, and report back — without importing half the internet. So I built MAF: the Minimal Agent Framework. Python 3.10+. MIT license. One loop you can read in 15 minutes.
Then I pointed it at real APIs and broke it. Here's the full story.
Why Build AI Agent Frameworks From Scratch?
This is the question people skip. A practical process for building an AI agent has to start from constraints, not abstractions. "Just use LangChain." "Just use CrewAI." Sure — if you enjoy frameworks that have more abstractions than your problem has requirements.
LangChain has over 3,000 files in its repository. Three thousand. That's not a tool, it's a bureaucracy. You want to add a custom tool? Good luck navigating the chain-of-responsibility-factory-adapter-pattern they've built to manage tool registration. CrewAI went a different route: they built a plugin system for their plugin system. I'm not exaggerating — you configure agents that configure other agents, through YAML that generates Python that calls an orchestrator.
The complexity isn't free. Every abstraction layer is a layer you can't debug when things go sideways. And things always go sideways with LLMs. The model hallucinates a tool name. The JSON comes back malformed. The agent loops forever because the stop condition is buried six files deep. When that happens in LangChain, you're reading stack traces through five layers of middleware. When it happens in MAF, you open one file and read the loop.
I decided to build AI agent infrastructure from scratch because I needed three things no existing framework gave me simultaneously: total transparency over what the model sees and does, real budget controls that actually stop runaway agents, and traces I could replay deterministically. Minimal doesn't mean limited. It means you can hold the entire system in your head.
How Does an AI Agent Workflow Loop Work?
Every agent framework — no matter how many files it ships — eventually reduces to the same loop. MAF just makes that loop explicit and keeps it in one place.
Here's the core:
```python
while budget_remaining(state):
    result = call_model(state)
    if is_final_answer(result):
        return result.content
    tool_output = execute_tool(result.action)
    state.history.append(tool_output)
```
That's it. Four steps, repeated until done:
- Check the budget. Has the agent exceeded its max steps? Hit the wall-clock timeout? If yes, halt with a reason. No silent infinite loops.
- Call the model. Send the full conversation history — system prompt, user request, every prior tool call and result. The model either returns a final answer or requests a tool action.
- Check for completion. If the model says it's done, return the answer. This is an explicit check, not a heuristic.
- Execute the tool. Run the requested tool, capture the output, append it to state. The model sees this on the next iteration.
Budget controls are first-class. You configure max_steps (hard cap on iterations) and max_seconds (wall-clock timeout). When either trips, the agent halts and tells you why it stopped — not just that it stopped. This matters when you're running agents autonomously. An agent that silently spins forever is worse than one that fails loudly.
The state object carries everything: the conversation history, budget counters, timestamps, and the halt reason if one triggers. Nothing is hidden in framework internals. You can serialize the state, inspect it, replay it.
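The state object and budget check described above can be sketched in a few lines. The names here (`AgentState`, `budget_remaining`, `halt_reason`) mirror the loop shown earlier but are illustrative, not MAF's exact API:

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field

# Hypothetical sketch of the state object described above. Field names are
# illustrative; MAF's real state also carries the full conversation history.
@dataclass
class AgentState:
    history: list = field(default_factory=list)   # prior tool calls and results
    step: int = 0                                 # iterations completed so far
    max_steps: int = 10                           # hard cap on loop iterations
    max_seconds: float = 60.0                     # wall-clock timeout
    started_at: float = field(default_factory=time.monotonic)
    halt_reason: str | None = None                # why the loop stopped, if it did

def budget_remaining(state: AgentState) -> bool:
    """Return True while the agent may take another step; record why it halts."""
    if state.step >= state.max_steps:
        state.halt_reason = f"max_steps ({state.max_steps}) exceeded"
        return False
    if time.monotonic() - state.started_at >= state.max_seconds:
        state.halt_reason = f"max_seconds ({state.max_seconds}) exceeded"
        return False
    return True
```

The point of the sketch: the halt reason is written into the state the moment a budget trips, so "why did it stop?" is always answerable from the serialized state alone.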
This is the same loop that powered our real CLI analytics pipeline — MAF orchestrating datafast-cli and seo-cli against live data. One loop, no magic.
How to Build AI Agent Tools with Typed Schemas
Tools in MAF are typed JSON schemas. No decorators, no base classes, no registration ceremony. You define what the tool accepts, and the framework validates it before execution.
Here's what a tool schema looks like:
```json
{
  "name": "shell.exec",
  "description": "Execute a shell command and return stdout/stderr",
  "parameters": {
    "type": "object",
    "properties": {
      "command": {
        "type": "string",
        "description": "The shell command to execute"
      },
      "timeout_seconds": {
        "type": "integer",
        "description": "Max execution time in seconds",
        "default": 30
      }
    },
    "required": ["command"]
  }
}
```
MAF ships four built-in tools:
- shell.exec — Run shell commands, capture stdout and stderr. This is how the agent interacts with CLI tools like trustmrr-cli and datafast-cli.
- fs (read/write/list) — Filesystem access within a sandboxed root. The agent can read data files, write reports, list directories — but never escape the boundary you set.
- http.fetch — Make HTTP requests to URLs on an explicit allowlist. No rogue API calls.
- kv — A key-value store for the agent to stash intermediate results between steps. Simple but surprisingly useful for multi-step analysis.
The typed schema approach has a critical advantage: when the model sends garbage, you know immediately. If the model hallucinates a parameter name, sends a string where an integer was expected, or invents a tool that doesn't exist — validation catches it before anything executes. The error message tells you exactly what went wrong: which field, which type, what the model actually sent. Compare this to frameworks where tool calls pass through three layers of string interpolation before anyone checks whether the input makes sense. By the time you get an error in LangChain, you're five stack frames deep and the original malformed input is long gone.
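To make the "fail before execution" idea concrete, here is a minimal stdlib-only validator in the same spirit. It is a sketch, not MAF's actual code, and real JSON Schema validation covers much more (nested objects, defaults, formats):

```python
# Minimal sketch of pre-execution validation for tool calls. Not MAF's
# actual validator; it only checks required fields and top-level types.
TYPE_MAP = {"string": str, "integer": int, "number": (int, float),
            "boolean": bool, "object": dict, "array": list}

def validate_args(schema: dict, args: dict) -> list:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    params = schema["parameters"]
    props = params.get("properties", {})
    for name in params.get("required", []):
        if name not in args:
            errors.append(f"missing required field '{name}'")
    for name, value in args.items():
        if name not in props:
            errors.append(f"unknown field '{name}'")
            continue
        expected = TYPE_MAP[props[name]["type"]]
        if not isinstance(value, expected):
            errors.append(f"field '{name}': expected {props[name]['type']}, "
                          f"got {type(value).__name__} ({value!r})")
    return errors
```

Run it against the `shell.exec` schema above with `{"timeout_seconds": "30"}` and you get two precise errors: the missing `command` field and the string-where-integer-expected type mismatch. That is the whole argument for typed schemas in one function.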
Adding custom tools follows the same pattern. Define a JSON schema, implement the handler function, register it. No base class inheritance, no decorator chains, no YAML configuration files that reference other YAML files. If you can write a function and a JSON object, you can build a tool.
Sandboxing works the same way — explicit boundaries, not implicit trust. You set a filesystem root and the agent can't read outside it. You set a URL allowlist and the agent can't fetch anything else. You set a tool allowlist and the agent can only use what you've approved. Simple rules, enforced consistently.
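The filesystem boundary in particular takes only a few lines of `pathlib`. This is an illustrative sketch of the rule, not MAF's implementation; `resolve_in_sandbox` is a hypothetical name:

```python
from pathlib import Path

# Sketch of the sandboxed-root rule: resolve the requested path and refuse
# anything that escapes the root. Illustrative, not MAF's actual code.
def resolve_in_sandbox(root: str, requested: str) -> Path:
    root_path = Path(root).resolve()
    target = (root_path / requested).resolve()  # collapses any '..' segments
    if not target.is_relative_to(root_path):    # Python 3.9+
        raise PermissionError(f"path escapes sandbox: {requested}")
    return target
```

Resolving before checking is the important part: a naive string-prefix check passes `"../etc/passwd"` through; resolution collapses the `..` first, so the escape is caught.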
What Makes JSONL Traces Essential for AI Agent Debugging?
Every MAF run produces a JSONL trace file. Every model call, every tool result, every timestamp — logged as structured data, one JSON object per line. This isn't optional logging you can turn on. It's baked in.
Here's a snippet from an actual trace:
```
{"ts":"2025-06-19T08:41:02Z","event":"model_call","step":1,"prompt_tokens":847,"model":"gemini-2.5-flash"}
{"ts":"2025-06-19T08:41:03Z","event":"tool_exec","step":1,"tool":"shell.exec","args":{"command":"datafast kpi --site starkslab.com --days 30 --format json"}}
{"ts":"2025-06-19T08:41:06Z","event":"tool_result","step":1,"stdout_bytes":2341,"exit_code":0}
{"ts":"2025-06-19T08:41:07Z","event":"model_call","step":2,"prompt_tokens":3188,"model":"gemini-2.5-flash"}
```
Why does this matter? Three reasons.
Deterministic replay. You can take a trace and feed it back to reconstruct exactly what the agent saw at each step. When a run produces a weird result, you don't guess — you replay.
Cost tracking. Every model call logs token counts. When you're running agents autonomously, you need to know what they're costing you. Not approximately — exactly.
Debug without printf. When the agent does something unexpected at step 4, you open the trace and read steps 1-3. You see exactly what the model received, what it returned, and what the tool produced. No breakpoints, no log-level tuning. Just structured data.
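Because the trace is plain JSONL, cost questions reduce to a few lines of parsing. A sketch, assuming the event and field names shown in the snippet above:

```python
import json

# Sketch: total the prompt tokens recorded in a JSONL trace. The event and
# field names ("model_call", "prompt_tokens") follow the snippet above.
def total_prompt_tokens(trace_lines):
    total = 0
    for line in trace_lines:
        event = json.loads(line)
        if event.get("event") == "model_call":
            total += event.get("prompt_tokens", 0)
    return total
```

Point it at a trace file with `total_prompt_tokens(open("run.jsonl"))` and you have an exact token bill for the run, no provider dashboard required.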
This is the kind of thing that sounds like a nice-to-have until you're debugging an agent that ran at 3 AM via OpenClaw's autonomous scheduling. Then it's the only thing standing between you and blind guesswork.
How We Battle-Tested the AI Agent Framework Against Real APIs
Theory is comfortable. Production is where frameworks actually prove themselves. So I pointed MAF at a real task: autonomous KPI analysis for starkslab.com using datafast-cli and seo-cli.
The agent's job: pull 30 days of analytics, run SEO checks, and produce a complete KPI report. No hand-holding. No intermediate prompts. Just a goal and a set of tools.
Here's what happened:
- Step 1: Agent called `datafast kpi --site starkslab.com --days 30 --format json` to pull traffic and engagement data.
- Step 2: Agent called `seo-cli audit --url starkslab.com --format json` to run a technical SEO check.
- Step 3: Agent analyzed both outputs, cross-referencing traffic trends with SEO issues.
- Step 4: Agent called `shell.exec` to format the report as markdown.
- Step 5: Agent returned the final KPI report as its answer.
5 steps. 24 seconds. Complete KPI report with real numbers. Not a toy demo — real APIs returning real data about a real site. The JSONL trace confirmed every step, every token count, every tool execution time.
This is what I mean by battle-tested. The agent framework didn't just run — it ran against production APIs, handled real output formats, and made sensible analytical decisions. The loop held. The budget controls held. The traces captured everything.
What surprised me most: the agent's tool-call sequence was exactly what a human analyst would do. Pull the data, audit the site, cross-reference, format, report. No wasted steps. No hallucinated API calls. The tight tool schema meant the agent couldn't drift into nonsense — every tool call was validated before execution, and the budget cap meant it couldn't spiral into an infinite analysis loop. Five steps was all it needed.
What Went Wrong (And How Codex Fixed It in an Hour)
Two bugs. Both caught during the battle test, both fixed within an hour. Here's what happened.
Bug #1: Gemini's missing type field. The OpenAI tool-calling spec requires a type field in the action JSON. Gemini's API (via the OpenAI-compatible endpoint) sometimes omits it. MAF's validation rejected the response as malformed. The fix: infer the action type from context when the field is missing. Five lines of Python. Not a hack — a reasonable fallback for a known provider quirk.
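The fallback might look something like this sketch (not the actual patch; in OpenAI-style tool calling the call payload sits under a `function` key, which is what the inference leans on):

```python
# Sketch of the "infer the missing type" fallback described above.
# Not MAF's actual five-line fix; field names follow the OpenAI tool-call shape.
def normalize_action(action: dict) -> dict:
    if "type" not in action:
        if "function" in action or "name" in action:
            # Provider omitted "type" but the payload is clearly a tool call.
            return {**action, "type": "function"}
        raise ValueError(f"cannot infer action type: {action!r}")
    return action
```

The shape of the fix matters more than the lines: tolerate a known provider quirk at the boundary, keep strict validation everywhere else.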
Bug #2: No endpoint override in the CLI. MAF is OpenAI-compatible — it works with OpenAI, Gemini (via their OpenAI endpoint), Groq, Cerebras, and local models. But the CLI only had --model and --api-key flags. There was no --endpoint flag to point at a non-OpenAI base URL. If you wanted to use Gemini, you had to set an environment variable. Added --endpoint and --api-key as proper CLI flags.
Both bugs were fixed by Codex (the coding agent), not by hand. I described the issue, Codex produced the fix, and the PRs merged within an hour. That's the workflow: MAF runs the agent, the agent hits an edge case, Codex fixes MAF, and the next run works. The tools sharpen each other.
Neither bug was a design flaw. The core loop never broke. The architecture held — it was the edges (provider quirks, CLI ergonomics) that needed filing down. That's what battle-testing is for.
How to Build AI Agent Systems From Scratch: Getting Started
If you want to build an AI agent from scratch, here's the fastest path with MAF.
Prerequisites: Python 3.10+ and an API key for any OpenAI-compatible provider (OpenAI, Gemini, Groq, Cerebras, or a local model).
Install:
```shell
pip install maf-agent
```
Run your first agent:
```shell
maf run --model gpt-4o --prompt "List the Python files in the current directory and summarize what each one does"
```
The agent will use shell.exec to run ls *.py, then fs.read to examine each file, and return a summary. You'll see the JSONL trace in your working directory afterward.
Configure budget controls:
```shell
maf run --model gpt-4o --max-steps 10 --max-seconds 60 --prompt "Analyze the git log for the last week and identify the biggest changes"
```
Use a non-OpenAI provider:
```shell
maf run --model gemini-2.5-flash --endpoint https://generativelanguage.googleapis.com/v1beta/openai --api-key $GEMINI_KEY --prompt "Run the test suite and report failures"
```
Start small. One tool, one task, one trace to read. Once you understand the loop, everything else is just configuration. Open the JSONL trace after your first run. Read each line. You'll understand more about how AI agents work from those 10-15 lines of structured data than from any tutorial that hand-waves over the internals.
The Case for Building Small
The AI agent ecosystem has a complexity addiction. Every new framework ships with more abstractions, more configuration, more indirection. MAF is a bet that the opposite works better: one loop, four tools, typed schemas, JSONL traces, and real budget controls.
You can build AI agent systems that do real work — not just answer questions, but run commands, hit APIs, analyze data, and produce reports. You don't need 3,000 files to do it. You need a loop that's transparent enough to trust and constrained enough to control.
MAF is MIT licensed, open source, and small enough to read in one sitting. If you've been wanting to build an AI agent framework that you actually understand, start here. The traces don't lie, and the loop is always one file away.