Mar 08, 2026
How I Built a Lightweight AI Agent Framework in Python (And Battle-Tested It in One Morning)
I built MAF — a minimal AI agent framework in Python with one core loop, typed tool schemas, and JSONL traces. Here's how to build an AI agent from scratch, what broke against real APIs, and why minimal beats monolithic.
Most AI agent frameworks want you to learn a new religion before you ship anything. I wanted to build AI agent workflows that could run shell commands, read files, hit APIs, and report back — without importing half the internet. So I built MAF: the Minimal Agent Framework. Python 3.10+. MIT license. One loop you can read in 15 minutes.
Then I pointed it at real APIs and broke it. Here's the full story.
Why Build AI Agent Frameworks From Scratch?
This is the question people skip. A practical process for building an AI agent has to start from constraints, not abstractions. "Just use LangChain." "Just use CrewAI." Sure — if you enjoy frameworks that have more abstractions than your problem has requirements.
LangChain has over 3,000 files in its repository. Three thousand. That's not a tool, it's a bureaucracy. You want to add a custom tool? Good luck navigating the chain-of-responsibility-factory-adapter-pattern they've built to manage tool registration. CrewAI went a different route: they built a plugin system for their plugin system. I'm not exaggerating — you configure agents that configure other agents, through YAML that generates Python that calls an orchestrator.
The complexity isn't free. Every abstraction layer is a layer you can't debug when things go sideways. And things always go sideways with LLMs. The model hallucinates a tool name. The JSON comes back malformed. The agent loops forever because the stop condition is buried six files deep. When that happens in LangChain, you're reading stack traces through five layers of middleware. When it happens in MAF, you open one file and read the loop.
I decided to build AI agent infrastructure from scratch because I needed three things no existing framework gave me simultaneously: total transparency over what the model sees and does, real budget controls that actually stop runaway agents, and traces I could replay deterministically. Minimal doesn't mean limited. It means you can hold the entire system in your head.
How Does an AI Agent Workflow Loop Work?
Every agent framework — no matter how many files it ships — eventually reduces to the same loop. MAF just makes that loop explicit and keeps it in one place.
Here's the core:
```python
while budget_remaining(state):
    result = call_model(state)
    if is_final_answer(result):
        return result.content
    tool_output = execute_tool(result.action)
    state.history.append(tool_output)
```
That's it. Four steps, repeated until done:
- Check the budget. Has the agent exceeded its max steps? Hit the wall-clock timeout? If yes, halt with a reason. No silent infinite loops.
- Call the model. Send the full conversation history — system prompt, user request, every prior tool call and result. The model either returns a final answer or requests a tool action.
- Check for completion. If the model says it's done, return the answer. This is an explicit check, not a heuristic.
- Execute the tool. Run the requested tool, capture the output, append it to state. The model sees this on the next iteration.
Budget controls are first-class. You configure max_steps (hard cap on iterations) and max_seconds (wall-clock timeout). When either trips, the agent halts and tells you why it stopped — not just that it stopped. This matters when you're running agents autonomously. An agent that silently spins forever is worse than one that fails loudly.
The state object carries everything: the conversation history, budget counters, timestamps, and the halt reason if one triggers. Nothing is hidden in framework internals. You can serialize the state, inspect it, replay it.
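The state object and budget check described above can be sketched in a few lines. The names here (`AgentState`, `budget_remaining`, `halt_reason`) mirror the loop shown earlier but are illustrative, not MAF's exact API:

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field

# Hypothetical sketch of the state object described above. Field names are
# illustrative; MAF's real state also carries the full conversation history.
@dataclass
class AgentState:
    history: list = field(default_factory=list)   # prior tool calls and results
    step: int = 0                                 # iterations completed so far
    max_steps: int = 10                           # hard cap on loop iterations
    max_seconds: float = 60.0                     # wall-clock timeout
    started_at: float = field(default_factory=time.monotonic)
    halt_reason: str | None = None                # why the loop stopped, if it did

def budget_remaining(state: AgentState) -> bool:
    """Return True while the agent may take another step; record why it halts."""
    if state.step >= state.max_steps:
        state.halt_reason = f"max_steps ({state.max_steps}) exceeded"
        return False
    if time.monotonic() - state.started_at >= state.max_seconds:
        state.halt_reason = f"max_seconds ({state.max_seconds}) exceeded"
        return False
    return True
```

The point of the sketch: the halt reason is written into the state the moment a budget trips, so "why did it stop?" is always answerable from the serialized state alone.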
This is the same loop that powered our real CLI analytics pipeline — MAF orchestrating datafast-cli and seo-cli against live data. One loop, no magic.
How to Build AI Agent Tools with Typed Schemas
Tools in MAF are typed JSON schemas. No decorators, no base classes, no registration ceremony. You define what the tool accepts, and the framework validates it before execution.
Here's what a tool schema looks like:
```json
{
  "name": "shell.exec",
  "description": "Execute a shell command and return stdout/stderr",
  "parameters": {
    "type": "object",
    "properties": {
      "command": {
        "type": "string",
        "description": "The shell command to execute"
      },
      "timeout_seconds": {
        "type": "integer",
        "description": "Max execution time in seconds",
        "default": 30
      }
    },
    "required": ["command"]
  }
}
```
MAF ships four built-in tools:
- shell.exec — Run shell commands, capture stdout and stderr. This is how the agent interacts with CLI tools like trustmrr-cli and datafast-cli.
- fs (read/write/list) — Filesystem access within a sandboxed root. The agent can read data files, write reports, list directories — but never escape the boundary you set.
- http.fetch — Make HTTP requests to URLs on an explicit allowlist. No rogue API calls.
- kv — A key-value store for the agent to stash intermediate results between steps. Simple but surprisingly useful for multi-step analysis.
The typed schema approach has a critical advantage: when the model sends garbage, you know immediately. If the model hallucinates a parameter name, sends a string where an integer was expected, or invents a tool that doesn't exist — validation catches it before anything executes. The error message tells you exactly what went wrong: which field, which type, what the model actually sent. Compare this to frameworks where tool calls pass through three layers of string interpolation before anyone checks whether the input makes sense. By the time you get an error in LangChain, you're five stack frames deep and the original malformed input is long gone.
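To make the "fail before execution" idea concrete, here is a minimal stdlib-only validator in the same spirit. It is a sketch, not MAF's actual code, and real JSON Schema validation covers much more (nested objects, defaults, formats):

```python
# Minimal sketch of pre-execution validation for tool calls. Not MAF's
# actual validator; it only checks required fields and top-level types.
TYPE_MAP = {"string": str, "integer": int, "number": (int, float),
            "boolean": bool, "object": dict, "array": list}

def validate_args(schema: dict, args: dict) -> list:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    params = schema["parameters"]
    props = params.get("properties", {})
    for name in params.get("required", []):
        if name not in args:
            errors.append(f"missing required field '{name}'")
    for name, value in args.items():
        if name not in props:
            errors.append(f"unknown field '{name}'")
            continue
        expected = TYPE_MAP[props[name]["type"]]
        if not isinstance(value, expected):
            errors.append(f"field '{name}': expected {props[name]['type']}, "
                          f"got {type(value).__name__} ({value!r})")
    return errors
```

Run it against the `shell.exec` schema above with `{"timeout_seconds": "30"}` and you get two precise errors: the missing `command` field and the string-where-integer-expected type mismatch. That is the whole argument for typed schemas in one function.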
Adding custom tools follows the same pattern. Define a JSON schema, implement the handler function, register it. No base class inheritance, no decorator chains, no YAML configuration files that reference other YAML files. If you can write a function and a JSON object, you can build a tool.
Sandboxing works the same way — explicit boundaries, not implicit trust. You set a filesystem root and the agent can't read outside it. You set a URL allowlist and the agent can't fetch anything else. You set a tool allowlist and the agent can only use what you've approved. Simple rules, enforced consistently.
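The filesystem boundary in particular takes only a few lines of `pathlib`. This is an illustrative sketch of the rule, not MAF's implementation; `resolve_in_sandbox` is a hypothetical name:

```python
from pathlib import Path

# Sketch of the sandboxed-root rule: resolve the requested path and refuse
# anything that escapes the root. Illustrative, not MAF's actual code.
def resolve_in_sandbox(root: str, requested: str) -> Path:
    root_path = Path(root).resolve()
    target = (root_path / requested).resolve()  # collapses any '..' segments
    if not target.is_relative_to(root_path):    # Python 3.9+
        raise PermissionError(f"path escapes sandbox: {requested}")
    return target
```

Resolving before checking is the important part: a naive string-prefix check passes `"../etc/passwd"` through; resolution collapses the `..` first, so the escape is caught.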
What Makes JSONL Traces Essential for AI Agent Debugging?
Every MAF run produces a JSONL trace file. Every model call, every tool result, every timestamp — logged as structured data, one JSON object per line. This isn't optional logging you can turn on. It's baked in.
Here's a snippet from an actual trace:
```
{"ts":"2025-06-19T08:41:02Z","event":"model_call","step":1,"prompt_tokens":847,"model":"gemini-2.5-flash"}
{"ts":"2025-06-19T08:41:03Z","event":"tool_exec","step":1,"tool":"shell.exec","args":{"command":"datafast kpi --site starkslab.com --days 30 --format json"}}
{"ts":"2025-06-19T08:41:06Z","event":"tool_result","step":1,"stdout_bytes":2341,"exit_code":0}
{"ts":"2025-06-19T08:41:07Z","event":"model_call","step":2,"prompt_tokens":3188,"model":"gemini-2.5-flash"}
```
Why does this matter? Three reasons.
Deterministic replay. You can take a trace and feed it back to reconstruct exactly what the agent saw at each step. When a run produces a weird result, you don't guess — you replay.
Cost tracking. Every model call logs token counts. When you're running agents autonomously, you need to know what they're costing you. Not approximately — exactly.
Debug without printf. When the agent does something unexpected at step 4, you open the trace and read steps 1-3. You see exactly what the model received, what it returned, and what the tool produced. No breakpoints, no log-level tuning. Just structured data.
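Because the trace is plain JSONL, cost questions reduce to a few lines of parsing. A sketch, assuming the event and field names shown in the snippet above:

```python
import json

# Sketch: total the prompt tokens recorded in a JSONL trace. The event and
# field names ("model_call", "prompt_tokens") follow the snippet above.
def total_prompt_tokens(trace_lines):
    total = 0
    for line in trace_lines:
        event = json.loads(line)
        if event.get("event") == "model_call":
            total += event.get("prompt_tokens", 0)
    return total
```

Point it at a trace file with `total_prompt_tokens(open("run.jsonl"))` and you have an exact token bill for the run, no provider dashboard required.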
This is the kind of thing that sounds like a nice-to-have until you're debugging an agent that ran at 3 AM via OpenClaw's autonomous scheduling. Then it's the only thing standing between you and blind guesswork.
How We Battle-Tested the AI Agent Framework Against Real APIs
Theory is comfortable. Production is where frameworks actually prove themselves. So I pointed MAF at a real task: autonomous KPI analysis for starkslab.com using datafast-cli and seo-cli.
The agent's job: pull 30 days of analytics, run SEO checks, and produce a complete KPI report. No hand-holding. No intermediate prompts. Just a goal and a set of tools.
Here's what happened:
- Step 1: Agent called `datafast kpi --site starkslab.com --days 30 --format json` to pull traffic and engagement data.
- Step 2: Agent called `seo-cli audit --url starkslab.com --format json` to run a technical SEO check.
- Step 3: Agent analyzed both outputs, cross-referencing traffic trends with SEO issues.
- Step 4: Agent called `shell.exec` to format the report as markdown.
- Step 5: Agent returned the final KPI report as its answer.
5 steps. 24 seconds. Complete KPI report with real numbers. Not a toy demo — real APIs returning real data about a real site. The JSONL trace confirmed every step, every token count, every tool execution time.
This is what I mean by battle-tested. The agent framework didn't just run — it ran against production APIs, handled real output formats, and made sensible analytical decisions. The loop held. The budget controls held. The traces captured everything.
What surprised me most: the agent's tool-call sequence was exactly what a human analyst would do. Pull the data, audit the site, cross-reference, format, report. No wasted steps. No hallucinated API calls. The tight tool schema meant the agent couldn't drift into nonsense — every tool call was validated before execution, and the budget cap meant it couldn't spiral into an infinite analysis loop. Five steps was all it needed.
What Went Wrong (And How Codex Fixed It in an Hour)
Two bugs. Both caught during the battle test, both fixed within an hour. Here's what happened.
Bug #1: Gemini's missing type field. The OpenAI tool-calling spec requires a type field in the action JSON. Gemini's API (via the OpenAI-compatible endpoint) sometimes omits it. MAF's validation rejected the response as malformed. The fix: infer the action type from context when the field is missing. Five lines of Python. Not a hack — a reasonable fallback for a known provider quirk.
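The fallback might look something like this sketch (not the actual patch; in OpenAI-style tool calling the call payload sits under a `function` key, which is what the inference leans on):

```python
# Sketch of the "infer the missing type" fallback described above.
# Not MAF's actual five-line fix; field names follow the OpenAI tool-call shape.
def normalize_action(action: dict) -> dict:
    if "type" not in action:
        if "function" in action or "name" in action:
            # Provider omitted "type" but the payload is clearly a tool call.
            return {**action, "type": "function"}
        raise ValueError(f"cannot infer action type: {action!r}")
    return action
```

The shape of the fix matters more than the lines: tolerate a known provider quirk at the boundary, keep strict validation everywhere else.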
Bug #2: No endpoint override in the CLI. MAF is OpenAI-compatible — it works with OpenAI, Gemini (via their OpenAI endpoint), Groq, Cerebras, and local models. But the CLI only had --model and --api-key flags. There was no --endpoint flag to point at a non-OpenAI base URL. If you wanted to use Gemini, you had to set an environment variable. Added --endpoint and --api-key as proper CLI flags.
Both bugs were fixed by Codex (the coding agent), not by hand. I described the issue, Codex produced the fix, and the PRs merged within an hour. That's the workflow: MAF runs the agent, the agent hits an edge case, Codex fixes MAF, and the next run works. The tools sharpen each other.
Neither bug was a design flaw. The core loop never broke. The architecture held — it was the edges (provider quirks, CLI ergonomics) that needed filing down. That's what battle-testing is for.
How to Build AI Agent Systems From Scratch: Getting Started
If you want to build an AI agent from scratch, here's the fastest path with MAF.
Prerequisites: Python 3.10+ and an API key for any OpenAI-compatible provider (OpenAI, Gemini, Groq, Cerebras, or a local model).
Install:
```shell
pip install maf-agent
```
Run your first agent:
```shell
maf run --model gpt-4o --prompt "List the Python files in the current directory and summarize what each one does"
```
The agent will use shell.exec to run ls *.py, then fs.read to examine each file, and return a summary. You'll see the JSONL trace in your working directory afterward.
Configure budget controls:
```shell
maf run --model gpt-4o --max-steps 10 --max-seconds 60 --prompt "Analyze the git log for the last week and identify the biggest changes"
```
Use a non-OpenAI provider:
```shell
maf run --model gemini-2.5-flash --endpoint https://generativelanguage.googleapis.com/v1beta/openai --api-key $GEMINI_KEY --prompt "Run the test suite and report failures"
```
Start small. One tool, one task, one trace to read. Once you understand the loop, everything else is just configuration. Open the JSONL trace after your first run. Read each line. You'll understand more about how AI agents work from those 10-15 lines of structured data than from any tutorial that hand-waves over the internals.
The Case for Building Small
The AI agent ecosystem has a complexity addiction. Every new framework ships with more abstractions, more configuration, more indirection. MAF is a bet that the opposite works better: one loop, four tools, typed schemas, JSONL traces, and real budget controls.
You can build AI agent systems that do real work — not just answer questions, but run commands, hit APIs, analyze data, and produce reports. You don't need 3,000 files to do it. You need a loop that's transparent enough to trust and constrained enough to control.
MAF is MIT licensed, open source, and small enough to read in one sitting. If you've been wanting to build an AI agent framework that you actually understand, start here. The traces don't lie, and the loop is always one file away.