
Build Log

The messy, honest version of development.

Not polished release notes. Real decisions, wrong turns, and why things took longer than expected. Updated as we build.

Build log entries

  1. 27 bugs, 8 features, zero regressions

    Big session. Ran a full end-to-end audit of the entire app, closed 27 bugs, then started on 8 features pulled from competitive research. 2,523 tests passing at the end.

    The security ones matter most. Three sandbox escape paths in GrepTool, GlobTool, and DiffFileTool -- arguments weren't being re-anchored to the workspace root, so a carefully crafted path could read outside it. Command injection in RunTestsTool: test arguments flowed straight into the shell. All four closed before anything touched a release tag.
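    The re-anchoring fix boils down to resolving every tool-supplied path against the workspace root and rejecting anything that lands outside it. A minimal sketch of that pattern (illustrative names only, not Bodega's actual GrepTool/GlobTool code):

```typescript
import * as path from "node:path";

// Resolve a tool-supplied path against the workspace root and reject
// anything that escapes it, e.g. "../../etc/passwd" or an absolute path.
// Illustrative sketch -- not the actual tool implementation.
function resolveInWorkspace(workspaceRoot: string, userPath: string): string | null {
  const root = path.resolve(workspaceRoot);
  const resolved = path.resolve(root, userPath);
  // A relative path starting with ".." means the target is outside the root.
  const rel = path.relative(root, resolved);
  if (rel.startsWith("..") || path.isAbsolute(rel)) return null;
  return resolved;
}
```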

    The "why is this broken" ones. Image attachments via the file picker rejected every image as "binary." Chat mode's agent was going rogue with docs attached: trying code tools, burning iterations, then hallucinating a fake answer. Research mode was firing web searches against the embedded doc content instead of the actual question. All three were integration bugs that passed unit tests but fell apart under real usage.

    The off-by-ones. Circuit breaker was tripping at exactly 80% of budget, killing valid conversations at the threshold. Budget enforcement blocked at the limit instead of when exceeded. Air-gap mode wasn't disabling the cloud boost toggle in the UI -- it was enforcing air-gap at the network layer, but the toggle was still live. These are the bugs that make people lose trust. Fixed.
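    The budget off-by-one is literally one comparison operator. A hypothetical check illustrating the difference (not Bodega's actual enforcement code):

```typescript
// Blocking "at the limit" would use >=, which kills the request that exactly
// spends the budget. Blocking "when exceeded" uses >, as described above.
function isOverBudget(usedTokens: number, budget: number): boolean {
  return usedTokens > budget; // allow usage == budget; block only past it
}
```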

    Onboarding completion never persisted, so the checklist reappeared every restart. Session deletion orphaned memory entries. The memory route crashed on corrupt JSON. The write mutex had no timeout and could hang the app forever. Plus 363 lines of dead CSS from themes we removed weeks ago. And 12 more across the stack.

    What I shipped on top of that:

    • Chat mode now has read-only code tools (grep, glob, file read, code search). You can ask the model about your codebase without switching to Code mode. This was the #1 feature request after the last audit.
    • FIM inline completions now default ON. They were behind an opt-in toggle that nobody found. Better first-run experience.
    • Smart tool stripping. Some small models repeatedly try to call tools that are blocked for their context. After a few failures, the tool list gets removed entirely and the model is forced to respond as text. Reduces the "agent flailing" loop that kills small-model usability.
    • Settings progressive disclosure. New users see Theme, Models, Profile, Boost. That's it. Everything advanced lives behind a toggle. The settings page was turning into a wall.
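    The smart tool stripping behavior can be sketched in a few lines. This is an assumed shape, not Bodega's actual request API: after N consecutive failed tool calls, the tool list is dropped from the next request so the model has no choice but to answer as text.

```typescript
// Hypothetical request shape for illustration.
interface LlmRequest {
  prompt: string;
  tools?: { name: string }[];
}

const MAX_TOOL_FAILURES = 3; // threshold is illustrative

// After repeated failures, remove the tools field entirely so a small
// model stops flailing and is forced to respond as plain text.
function applyToolStripping(req: LlmRequest, consecutiveToolFailures: number): LlmRequest {
  if (consecutiveToolFailures >= MAX_TOOL_FAILURES) {
    const { tools, ...rest } = req;
    return rest;
  }
  return req;
}
```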

    Next session: cloud provider as primary (not just boost -- for users without a local GPU), auto-verify after code changes (run tests and build, feed errors back), panel hand-off buttons ("Fix this" from Debug or Research jumps into Agent), and conversation export (JSON + ZIP backup).

  2. The agent that learns mid-run

    First time I watched Qwen 3.5 hallucinate spawn_worker three times in one run, I knew the cross-session learning system wouldn't be enough. Today we closed two gaps.

    Within-loop learning. Before today, Bodega's agent learning was cross-session only. The model makes a mistake, we record it, and the next session gets a rule injected into its context: "don't do that." Works great. 5-stage pipeline, Bayesian confidence tracking, hard pre-execution blocking. But if the model hallucinates a tool in iteration 3 of a 15-iteration run, it could repeat that mistake in iterations 4, 5, and 6 before the session ends and the rule kicks in.

    Now it can't. SessionRuleBuffer records failures in-memory during the run. Iteration 3 fails → iteration 4 sees a SESSION RULES block and the tool is hard-blocked before execution. Max 3 temp rules per session to avoid prompt instability. Rules are ephemeral: injected before each LLM call, stripped after. One new file, 76 lines.
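    A minimal sketch of what a buffer like that might look like, assuming invented method names (the real SessionRuleBuffer is 76 lines and surely differs):

```typescript
// In-memory, per-session rule buffer: records tool failures mid-run,
// hard-blocks the offending tool, and caps rules to avoid prompt instability.
class SessionRuleBuffer {
  static readonly MAX_RULES = 3; // cap from the entry above
  private rules: string[] = [];
  private blocked = new Set<string>();

  recordFailure(toolName: string, reason: string): void {
    if (this.rules.length >= SessionRuleBuffer.MAX_RULES) return;
    this.blocked.add(toolName);
    this.rules.push(`Do not call ${toolName}: ${reason}`);
  }

  // Checked before execution -- the hard block.
  isBlocked(toolName: string): boolean {
    return this.blocked.has(toolName);
  }

  // Ephemeral: injected before each LLM call, stripped after.
  toPromptBlock(): string {
    return this.rules.length ? `SESSION RULES:\n${this.rules.join("\n")}` : "";
  }
}
```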

    Rule confidence persistence. We had a Bayesian tracker computing which learned rules were actually working (Beta distribution, alpha/beta posteriors). It flagged rules with confidence below 0.3 for demotion. The math was there. The DB write wasn't. Rules that looked good on paper but triggered false positives were accumulating forever. 20-line fix: shouldDemote() now calls deactivateRule() and the rule goes inactive in SQLite.
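    The demotion math is small: with a Beta posterior, the mean is alpha / (alpha + beta). A sketch of the check described above, with illustrative names and the 0.3 threshold from the text:

```typescript
// Beta-posterior confidence for a learned rule: alpha tracks successes
// (plus prior), beta tracks failures/false positives (plus prior).
interface RuleStats {
  alpha: number;
  beta: number;
}

function confidence(r: RuleStats): number {
  return r.alpha / (r.alpha + r.beta);
}

// Rules below the threshold get flagged for demotion -- the fix described
// above makes this actually deactivate the rule in SQLite.
function shouldDemote(r: RuleStats, threshold = 0.3): boolean {
  return confidence(r) < threshold;
}
```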

  3. 156 E2E tests, Playwright Electron, dark installer, Sentry crash reporting

    Spent the day doing something most indie devs skip: writing real end-to-end tests that launch the actual Electron app and exercise every feature through the UI.

    Not unit tests. Not API tests. Full Playwright Electron tests that boot the app, switch between modes, send messages to local LLMs, verify tool calls hit the disk, and confirm that context survives compaction. 156 tests across 21 spec files. 98.7% pass rate.

    What the tests found: compact was using a stale model config, agent panels weren't auto-scrolling, reasoning-only models showed blank responses, and the auto-router was sending simple prompts to tiny models. All fixed. Also shipped dark-themed NSIS installer with matching uninstaller, crash reporting via Sentry (opt-in, respects air-gap), TopBar layout rearranged per feedback.

  4. Loop write guard, approval card fix, E2E Round 2

    Two things were driving me crazy about the agentic loop. One: the agent would write a file, re-verify, decide it wasn't done, and write the file again. And again. Same content, same path, different iteration. Two: approval cards would appear mid-stream and you'd never see them because they rendered outside the scroll container.

    Both fixed. The repeat-write guard now tracks writes per file path across the loop -- after 3 writes to the same file in a single session, it injects a system message, marks the deliverable satisfied, and breaks the cycle. Approval cards moved inside the scroll container so they actually travel with the content. E2E Round 2 ran 31 tests. Found 11 bugs across todo_write registration, model routing, panel scroll, web search iteration caps, and VRAM warning noise. All 11 fixed and committed.
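    The repeat-write guard is essentially a per-path counter with a trip wire. A sketch under assumed names (not the actual implementation):

```typescript
// Counts writes per file path within a single agentic run; trips on the
// third write to the same path so the loop can break the rewrite cycle.
class RepeatWriteGuard {
  static readonly MAX_WRITES_PER_FILE = 3;
  private writes = new Map<string, number>();

  // Returns true when the loop should stop rewriting this file.
  recordWrite(filePath: string): boolean {
    const n = (this.writes.get(filePath) ?? 0) + 1;
    this.writes.set(filePath, n);
    return n >= RepeatWriteGuard.MAX_WRITES_PER_FILE;
  }
}
```

    On a trip, the caller would inject the system message and mark the deliverable satisfied, as described above.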

  5. Chat → Runtime → Loop → QEL

    Shipped the Runtime Layer today. This one is more architectural than visible, but it matters.

    The problem: before each agentic loop, there were ~150 lines of conditionals scattered across the chat orchestrator. Is this session in a panel? What iteration budget applies? Does this model support tools? What happens after 3 consecutive failures? These questions were answered in different places with inconsistent logic.

    RuntimeLayer.ts consolidates all of that into a single typed LoopPolicy that gets produced before the loop starts. The classify() call looks at the request, the model's capability profile, the panel context, and the session failure history -- then produces a LoopPolicy with a single executionLane value.

    Four execution lanes:

    • advisory -- bypasses the loop entirely, single LLM call, no tools. Fast. For panels that just need a quick answer.
    • guided -- up to 8 iterations, limited tool set. For supervised agent work.
    • restricted -- panel-constrained tool allowlist. Research panel only gets research tools.
    • full -- complete tool access, computed iteration budget. Normal code mode.

    The capability detection piece is new: CapabilityProfile reads the model's known abilities (tool calling tier: native/xml/weak/none; structured output; reasoning) and can downgrade the lane automatically if the model can't support what was requested. No more sending tool calls to a model that'll ignore them.

    Dynamic failure tracking: if a session sees 3 consecutive tool failures, the lane downgrades automatically for the rest of the session. The model gets fewer chances to break things.
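    Putting the lanes and both automatic downgrades together, classification might look something like this. The types and downgrade rules here are a simplified guess at the shape, not RuntimeLayer.ts itself:

```typescript
type Lane = "advisory" | "guided" | "restricted" | "full";

// Simplified capability profile -- the real one also covers structured
// output and reasoning, per the entry above.
interface CapabilityProfile {
  toolCalling: "native" | "xml" | "weak" | "none";
}

function classify(
  requested: Lane,
  caps: CapabilityProfile,
  consecutiveToolFailures: number,
): Lane {
  let lane = requested;
  // A model that can't call tools at all can only give advisory answers.
  if (caps.toolCalling === "none" && lane !== "advisory") lane = "advisory";
  // A weak tool caller gets the supervised lane at most (assumed rule).
  else if (caps.toolCalling === "weak" && lane === "full") lane = "guided";
  // Three consecutive tool failures downgrade for the rest of the session.
  if (consecutiveToolFailures >= 3 && lane === "full") lane = "guided";
  return lane;
}
```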

  6. Mar 24-26 -- Phase 9A through 9E

    Shipped the full memory system this week. Five phases in three days. This is the one I'm most proud of so far.

    The problem: every agentic loop iteration starts from scratch. The model has no memory of which files you've been editing together, what patterns you prefer, what errors you've hit before. Every session is day one.

    Phase 9 changes that. Here's what we built:

    • 9A -- HeuristicExtractor wired into the post-loop processor. After every agentic iteration, it extracts facts from what the agent observed and stores them in SQLite. Compression ratio confirmed at 5x+ on real sessions.
    • 9B -- FileAffinityTracker (tracks which files you co-edit, how often, how recently) + ImportGraphExtractor (static import graph for JS/TS/Python/Rust/Go). The context assembler uses both to inject the right files into the next session without you having to specify them.
    • 9C -- LLMObserver -- a second-pass LLM call that extracts implicit facts from assistant turns. Things the heuristic extractor misses. Runs async post-loop on hardware that can afford it, falls back to heuristic-only on low VRAM.
    • 9D -- Memory time decay. Observations have configurable half-lives by type. Stale memories fade instead of polluting context forever. BM25 relevance scoring added alongside recency decay.
    • 9E -- Evaluation harness. 25 scenarios covering injection, retrieval, dedup, decay, and cross-session recall. Memory metrics API exposed for debugging.
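    The 9D decay is standard half-life math: an observation's weight halves every half-life. A sketch of that, with the relevance combination shown as a simple product (the real scoring alongside BM25 may differ):

```typescript
// Exponential time decay: weight halves every halfLifeDays.
function decayWeight(ageDays: number, halfLifeDays: number): number {
  return Math.pow(0.5, ageDays / halfLifeDays);
}

// Illustrative combination of a relevance score (e.g. BM25) with recency
// decay -- an assumed formula, not the exact one in the pipeline.
function score(relevance: number, ageDays: number, halfLifeDays: number): number {
  return relevance * decayWeight(ageDays, halfLifeDays);
}
```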

    Total: 8 new service files, 2 new tools (CreateDocument, DeepResearch), memory pipeline fully wired end to end. This is what makes Bodega feel like it knows you over time.

  7. Mar 23 -- 30 bugs, one session

    Ran what we're calling Operation Fumigate last Sunday. The goal: clear every known bug before the next beta tag. Final count: 30 bugs fixed in one session.

    It was deliberately parallel. Stood up 4 squads, each with a defined scope and a dedicated branch. No overlap, no conflicts.

    • Squad 1 hit the code editor and FIM (fill-in-middle): 9 bugs. Monaco diff decoration bugs, inline fix streaming edge cases, FIM fence stripping failures.
    • Squad 2 took terminal and the Problems panel: 7 bugs. Terminal duplicate input handlers, xterm focus tracking using the wrong event, OSC 133 command block edge cases.
    • Squad 3 handled streaming and session infrastructure: 8 bugs. Double SSE events, streaming interrupted on panel navigation, session data leaks, permission mode enforcement in chat mode.
    • Squad 4 closed out settings, memory, and project management: 6 bugs. Settings not persisting across restarts, memory rate limit bypasses, orphaned settings keys.

    All 4 squads merged to dev by end of day. Doc sweep ran afterward -- all counts, changelogs, and references updated to match. Tagged beta.6 that same evening.

    The thing that made this work: clear blast radius per squad, no shared files, every fix against a real test case. 30 bugs with no regressions.

  8. Mar 17-18 -- Brain MCP + 15-agent team

    This is the part that doesn't look like normal solo indie dev.

    I've been building with an AI agent team. Not AI-assisted -- an actual team of 15 specialized agents coordinated through a shared memory system called Bodega Brain. Each agent has a defined role, its own identity file, and stays in its lane.

    The roster: Co-Dev (lead), Architect (structural health), Engineer (implementation), Fixer (bugs), Sentinel (security scanning), Scout (competitive intel), Strategist (product direction), QA Engineer, Doc Guardian, Performance Profiler, Integration Tester, Release Manager, Reviewer, UX Auditor, Writer.

    Each one runs on its own git branch. Co-Dev reviews their work, creates PRs, merges after CI passes. I have final say on anything touching main. It's a proper dev workflow, just with agents instead of contractors.

    The Brain is how they coordinate -- a shared system with messaging, task queues, workspace claiming, decision logging, and a live dashboard. When two agents might conflict on the same files, they claim workspaces and check for conflicts before starting.

    This session: 8 PRs reviewed, 5 merged to dev (LSP integration, unified model hub, god-file splits, security hardening, test coverage). The acceleration this enables is real. Phase 0-3 of the V2 overhaul shipped in 48 hours.

  9. QEL ships

    Spent the last few days hardening what I'm calling the QEL -- Quality Enforcement Layer. This was the biggest early architecture decision and it's worth explaining why it exists.

    Most AI coding assistants work like this: you ask a question, the model responds, done. There's no verification that what was produced actually matches what was asked. No check that the code compiles. No detection of stubs. The model hallucinates a solution and calls it a day.

    QEL changes that. Every agentic loop iteration runs three passes: contract extraction (what did the user actually ask for?), completion verification (did the response satisfy it?), and a mode firewall that prevents the wrong class of task from sneaking through. There's a test suite with a letter-grade output system -- the agent has to get an A or B before the response goes out.
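    The letter-grade gate at the end of those passes can be sketched as a simple threshold map. The score cutoffs here are invented for illustration; only the "must get an A or B" rule is from the entry:

```typescript
// Map a verification score in [0, 1] to a letter grade. Thresholds are
// hypothetical -- the real QEL grading may use different cutoffs.
function grade(score: number): "A" | "B" | "C" | "D" | "F" {
  if (score >= 0.9) return "A";
  if (score >= 0.8) return "B";
  if (score >= 0.7) return "C";
  if (score >= 0.6) return "D";
  return "F";
}

// The response only goes out on an A or B.
function canShip(score: number): boolean {
  const g = grade(score);
  return g === "A" || g === "B";
}
```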

    The architecture underneath is Express + SQLite for the backend, with a streaming pipeline that pushes Server-Sent Events to the frontend in real time. 15 defined SSE event types covering everything from tool calls to plan approvals to QEL verification results.

    The other decision I made early: no god files. I've worked on enough codebases that became unmaintainable from one class doing everything. Bodega has hard line limits: 700 lines for service files, 400 for React components. When something hits the limit, it splits. This decision has already paid off four times.

    Current state: QEL shipping, 630 tests passing, agentic loop running on Ollama and OpenAI-compatible providers.

  10. Initial commit day

    Started building Bodega One. Here's what it is and why I'm building it.

    It's a local-first AI desktop IDE. Two modes: Chat Mode for general AI conversation, Code Mode for agentic software development. Runs entirely on your machine. No cloud dependency unless you want one.

    I got tired of tools that route everything through someone else's servers. Not because I have something to hide -- because I don't want to depend on a company's uptime, rate limits, or pricing decisions to do my work. Your code, your hardware, your data.

    The tech stack: Electron 40 for the desktop shell, React 19 + TypeScript on the frontend, Express + SQLite on the backend. It supports Ollama out of the box, with OpenAI-compatible endpoints as a fallback for when you need a heavier model.

    The thing I kept noticing with other AI coding tools is that they're mostly fancy autocomplete with a chat window bolted on. What I wanted was something that could actually reason about what it's doing -- extract requirements from what you ask, verify its own output, and refuse to ship half-finished work. That's the Quality Enforcement Layer. More on that later.

    First commit dropped today. It's rough but it runs. The bones are there.

    Building this in public. Wins, bugs, architecture decisions -- all of it.

For polished release notes, see the Changelog · Join Discord

Follow the build.

Beta is live now for the first 200 users. Join the waitlist for full launch.