Josh Kappler

I build autonomous
AI agents.

I build production AI agents from scratch. I write the orchestration layer myself: tool loops, state machines, memory, and multi-provider routing. No LangChain, no CrewAI. Before engineering I grew a YouTube channel (Boffy) to 2.1M subscribers, so I know how to carry long projects through to the end and explain technical things in a way people actually want to watch.

Open to founding, forward-deployed, applied-AI, and DevRel roles · San Francisco or remote

View Résumé · Live Demo · Book a Call · YouTube · GitHub · LinkedIn · Email

01 / Projects

What I have built

Everything here was built from scratch. I write the orchestration layer myself. No LangChain, no CrewAI, no agent frameworks.

memo-engine

Deal Analysis + Investment Memo Platform

Live demo → · GitHub →

Live demo: the AMC Entertainment run, end to end

memo-engine is an AI deal-analysis and investment-memo platform built for a private credit investment firm. It takes a messy deal data room (PDFs, Excel models, Word drafts, Outlook emails, scans) and produces an institutional-format credit memo where every claim is cited back to the exact source page or cell. Reasoning passes run on Claude Fable 5 over agentic RAG with pgvector; parsing, extraction, drafting, and export run as durable workflow steps. The client build is under NDA. The public demo is the same system pointed at public data: an 80-file SEC data room for AMC Entertainment, ingested and analyzed end to end, browsable down to each citation.

Next.js 16 · TypeScript · Anthropic SDK · Claude Fable 5 · Postgres · pgvector · Voyage AI · Vercel Workflow DevKit

Technical Details

Contextual retrieval: Sonnet 4.6 writes a context prefix for each chunk with the full document in view; the first 400K chars are cached via ephemeral prompt caching, so every call after the first reads them at the $0.30/M cached rate

Voyage AI voyage-3 embeddings (1024-dim) batched by byte budget (≤400KB, ≤96 items) to respect the 320K-token-per-batch cap on dense financial text

Forced tool_use with Zod-to-JSON-schema for ~40-field structured extraction: credit snapshot, capital structure, financials, covenants, management, comps, scenarios

Reasoning and extraction split by API constraint: Fable 5 thinks through the deal (thinking is always on), then Sonnet runs the forced tool_use extraction, because the API rejects thinking combined with forced tool choice

Durable pipeline orchestration via Vercel Workflow DevKit: parse, analysis, research, internal memo, and external memo each run as a step with its own 800s budget

Multi-format export: PDF via @sparticuz/chromium + puppeteer-core (Vercel-compatible headless Chromium), Excel with ExcelJS formulas and sensitivity tables, DOCX, ZIP bundle
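
The byte-budget batching described above can be sketched in a few lines. This is a minimal illustration, not the project's code: the function name `batch_by_budget` is hypothetical, "400KB" is assumed to mean 400,000 bytes, and the UTF-8 byte count stands in as a cheap proxy for tokens. The embedding call itself is left out.

```python
from typing import Iterable, Iterator, List


def batch_by_budget(
    texts: Iterable[str],
    max_bytes: int = 400_000,  # assumed reading of "≤400KB"
    max_items: int = 96,       # item cap from the text
) -> Iterator[List[str]]:
    """Yield batches that stay under both the byte budget and the item cap."""
    batch: List[str] = []
    batch_bytes = 0
    for text in texts:
        size = len(text.encode("utf-8"))
        # Close the current batch if this chunk would break either limit.
        if batch and (batch_bytes + size > max_bytes or len(batch) >= max_items):
            yield batch
            batch, batch_bytes = [], 0
        # A single chunk larger than the budget still goes out, alone in its batch.
        batch.append(text)
        batch_bytes += size
    if batch:
        yield batch
```

Each yielded batch would then go to the embedding API as one request. Counting bytes rather than tokens avoids running a tokenizer on every chunk, at the cost of a conservative limit.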

claim-wright

Full-Stack Claim Adjudication · Built in 36 Hours

GitHub→

The claim workspace, batch history, and calibration workbench (sample data)

claim-wright is a fully working full-stack claim-adjudication system built end to end in a single 36-hour sprint. It reads the documents behind a security-deposit insurance claim (lease, tenant ledger, deposit-waiver addendum, move-out itemization, repair invoices) and recommends a payout capped at the policy benefit, or a decline, with a line-by-line audit trail behind every dollar. The split is the whole point: Claude Opus 4.8 does the reading and extracts structured facts, then a pure-Python deterministic engine applies the caps, rules, and eligibility gates, so a payout can never be a number the model invented. On a held-out test split it lands within $250 of the human adjudicator on 91% of claims with a median error of $0, at about $0.33 per claim.

Python 3.13 · Anthropic SDK · Claude Opus 4.8 · Pydantic · Django-Ninja · React 19 · Vite · SQLite

Technical Details

Full stack in a weekend: a Python adjudication core, a Django-Ninja API, a React 19 single-page app, and a packaged desktop build, all shipped end to end in 36 hours

Model reads, engine decides: forced tool_use extraction pulls charges, ledger balance, and eligibility, then a pure-Python function computes the payout, so every dollar traces back to code and a document line and nothing is hallucinated

91% of claims within $250 of the human decision, median error of $0, mean absolute error of $62, at about $0.33 per claim on the held-out test split

Multi-user with per-tenant SQLite databases and workspace sharing: a run can be snapshotted and shared read-only into a space, copied on share so a viewer never touches the originator's live data

Security hardening throughout: master-approved signup, PBKDF2 passwords, session tokens stored only as SHA-256 hashes, and an allow-list column projection that structurally blocks the human-answer fields from ever reaching the model

Built-in white-hat security pass: the code is reviewed by autohack, my own autonomous bug-hunter, which traces user input to sinks and has a second model try to disprove each finding

Hybrid document reading routes each PDF page by text density: about 75% read free with pure-Python pdfminer, scanned pages go to vision, and no poppler or tesseract binaries means the same code runs everywhere including the desktop build

Calibration with zero API calls: extractions are stored and the engine is a pure function, so a candidate rulebook (JSON, not code) re-scores against the human decisions by replaying stored reads
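
The "model reads, engine decides" split and the zero-API-call calibration can be illustrated with a small pure function. This is a sketch under stated assumptions, not the real engine: the field names, the rulebook keys (`policy_benefit_cents`, `charge_caps_cents`), and the rule that allowed charges are added to the unpaid ledger balance are all hypothetical. Amounts are integer cents to keep the arithmetic exact.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Charge:
    kind: str
    amount_cents: int
    source: str  # document and line the model read the charge from


@dataclass(frozen=True)
class Extraction:
    """Structured facts stored after the model has read the documents."""
    charges: Tuple[Charge, ...]
    ledger_balance_cents: int
    eligible: bool


def adjudicate(extraction: Extraction, rulebook: Dict) -> Tuple[int, List[str]]:
    """Pure function: the same extraction and rulebook always give the same payout."""
    trail: List[str] = []
    if not extraction.eligible:
        trail.append("decline: eligibility gate failed")
        return 0, trail
    total = extraction.ledger_balance_cents
    trail.append(f"unpaid ledger balance: {total}")
    caps = rulebook.get("charge_caps_cents", {})
    for charge in extraction.charges:
        # Hypothetical rule: each charge kind may carry its own cap.
        allowed = min(charge.amount_cents, caps.get(charge.kind, charge.amount_cents))
        total += allowed
        trail.append(f"{charge.kind}: allowed {allowed} of {charge.amount_cents} ({charge.source})")
    benefit = rulebook["policy_benefit_cents"]
    payout = min(total, benefit)  # the payout can never exceed the policy benefit
    trail.append(f"payout: {payout} (policy benefit {benefit})")
    return payout, trail


def share_within(
    stored: Sequence[Extraction],
    human_cents: Sequence[int],
    rulebook: Dict,
    tolerance_cents: int = 25_000,  # $250, the tolerance quoted in the text
) -> float:
    """Re-score a candidate rulebook by replaying stored reads, with no model calls."""
    hits = sum(
        abs(adjudicate(e, rulebook)[0] - h) <= tolerance_cents
        for e, h in zip(stored, human_cents)
    )
    return hits / len(stored)
```

Because `adjudicate` touches nothing but its arguments, trying a new rulebook is a loop over stored extractions, and every trail line points back to either a rule or a document line.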

autohack

Autonomous Security Agent

GitHub→

The real-time hunt dashboard (sample data)

autohack is a 5-package TypeScript monorepo that polls four bounty platforms, spawns hour-long Claude sessions to hunt for vulnerabilities, validates its own findings through adversarial review, and submits reports without human intervention. A separate Sonnet pass compresses verbose findings before submission. The system writes hunt outcomes, near-misses, and triager feedback to a JSON memory store so every future session starts with context from every past one. The same harness also runs a bounty agent on the Algora platform: it spawns Claude Code sessions for long autonomous runs, executes the test suite, opens PRs, and addresses review feedback on its own.

TypeScript · Anthropic SDK · Next.js 15 · SQLite · Drizzle · tRPC

Technical Details

12-state finding lifecycle from discovery through submission across HackerOne, Immunefi, Huntr, and an aggregator covering Bugcrowd, Intigriti, and YesWeHack

Adversarial review: a separate Claude instance scores findings on a 0-15 binary rubric. Anything below 8 is rejected before it reaches a triager

Ephemeral prompt caching cuts input tokens by roughly 90% across repeated hunt sessions, with a local backend fallback for development

Cross-process coordination via lock files, shared runtime-override JSON with a 2-second TTL cache, and stale-PID detection on startup

Error classification (transient, permanent, validation, timeout) decides whether to retry, skip, or kill the hunt

Real-time tRPC dashboard with xterm.js terminal streaming live Claude tool calls and reasoning
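
The adversarial-review gate can be sketched as a scoring function. This assumes the "0-15 binary rubric" means 15 pass/fail checks summed into a score; the check names and the `Review` type are hypothetical, and in the real system the answers would come from the reviewing model rather than a hand-built dict.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

RUBRIC_SIZE = 15    # assumed: one point per pass/fail check
PASS_THRESHOLD = 8  # from the text: anything below 8 is rejected


@dataclass(frozen=True)
class Review:
    score: int
    accepted: bool
    failed_checks: Tuple[str, ...]


def review_finding(checks: Dict[str, bool]) -> Review:
    """Score a finding from the reviewer's pass/fail answers and apply the gate."""
    if len(checks) != RUBRIC_SIZE:
        raise ValueError(f"expected {RUBRIC_SIZE} checks, got {len(checks)}")
    score = sum(1 for passed in checks.values() if passed)
    failed = tuple(name for name, passed in checks.items() if not passed)
    return Review(score=score, accepted=score >= PASS_THRESHOLD, failed_checks=failed)
```

Keeping the failed check names alongside the score gives the memory store something concrete to record for near-misses.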

pinch

Claude Code, Driven From an Apple Watch

GitHub→

Real screens from the watchOS app

pinch drives a real Claude Code session from an Apple Watch over cellular. A native watchOS SwiftUI app is the thin client; a Node and TypeScript backend runs the Claude Agent SDK against the live repos on my Mac, and a tunnel exposes it to the wrist. watchOS refuses WebSockets on the watch's network path, so the transport is plain HTTP request/response with a short-poll loop instead of a socket. Prompts go through a durable on-device outbox that retries until a confirmed 2xx, and the backend dedups by client prompt id so an at-least-once retry never double-runs a turn. Session state is recorded durably, so a backend restart or idle sweep revives the same conversation with full context through the SDK's resume.

Swift · SwiftUI · watchOS · TypeScript · Claude Agent SDK · Node.js · ngrok

Technical Details

HTTP request/response with a short-poll loop instead of a socket: watchOS refuses URLSessionWebSocketTask on the watch's cellular path, so the watch polls /api/* while the browser simulator keeps the WebSocket, both driving one shared session lifecycle

Durable persisted outbox on the watch: a prompt is removed only on a confirmed 2xx, drained FIFO single-flight with Sending / Sent / Not sent states, and the backend dedups by client prompt id so a retry can never double-run a turn

Session resume across restarts: our session id maps to the SDK session id in a durable record, so an idle-swept or restarted backend rebuilds the conversation with options.resume and Claude keeps full context

Poll-cursor invariant kills duplicate replies: a resumed session continues its persisted cursor while a revived session resets to zero on a backend reset signal, so the event log never re-delivers history

Watch-aware output: a cached system-prompt append tells the model it is speaking to a wrist screen with text-to-speech, so replies stay plain-text and brief without touching tools, edits, or rigor

Stable ngrok static domain with bearer-token auth on every request; the watch can restart the backend from Settings, which builds first and only swaps the process if the build succeeds
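
The outbox-plus-dedup pairing is what makes at-least-once delivery safe, and it can be shown in miniature. This is an in-memory Python sketch of the idea only: the real client is Swift with a persisted outbox, and the `Backend` and `Outbox` classes and their method names are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List, Set, Tuple


class Backend:
    """Runs each client prompt id at most once, however many times it arrives."""

    def __init__(self) -> None:
        self.seen: Set[str] = set()
        self.turns_run: List[str] = []

    def submit(self, prompt_id: str, text: str) -> int:
        if prompt_id not in self.seen:
            self.seen.add(prompt_id)
            self.turns_run.append(text)  # stands in for running the agent turn
        return 200  # a duplicate is acknowledged, not re-run


class Outbox:
    """FIFO queue that drops a prompt only after a confirmed 2xx."""

    def __init__(self, send: Callable[[str, str], int]) -> None:
        self.send = send
        self.queue: Deque[Tuple[str, str]] = deque()

    def enqueue(self, prompt_id: str, text: str) -> None:
        self.queue.append((prompt_id, text))

    def drain(self) -> None:
        """Send from the head, one at a time; stop at the first failure to keep order."""
        while self.queue:
            prompt_id, text = self.queue[0]
            try:
                status = self.send(prompt_id, text)
            except ConnectionError:
                return  # leave the prompt queued for the next drain
            if not 200 <= status < 300:
                return
            self.queue.popleft()
```

The interesting case is a lost response: the backend ran the turn but the watch never saw the 2xx, so the prompt is sent again and the backend's seen-set turns the retry into a no-op.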

fleetview

Terminal Cockpit for Parallel AI Agents

GitHub→

The control center and a tab of parallel Claude agents

fleetview is a terminal control center for running a fleet of Claude Code agents in parallel and keeping every machine I work on in sync. It is a single WezTerm + Zellij window: a folder picker launches 1 to 8 agents into named tabs, live gauges track the 5-hour and weekly rate limits, and a floating pane inspects what each subagent is doing. The whole TUI is zero-dependency Node, built-ins and raw ANSI only, with nothing to install beyond the Node that Claude Code already needs. One button syncs every repo across machines, and a background daemon keeps a fresh machine converging to my full project set on its own, updating the tool itself along the way.

Node.js · Zellij · WezTerm · ANSI TUI · launchd · GitHub CLI · chezmoi

Technical Details

Zero npm dependencies: the entire tabbed TUI is Node built-ins and raw ANSI escape codes, and every script is self-locating, so the folder can be cloned anywhere and just runs

Launches 1 to 8 Claude Code agents from a folder picker into a Zellij tab layout, each tab named after its repo, with live 5-hour and weekly rate-limit gauges parsed from the status line

Conflict-safe multi-machine Git sync: pull clones missing repos and fast-forwards the rest, while push refuses any non-fast-forward, so a stale device is skipped rather than overwritten and dirty or diverged repos are left untouched

Always-on background daemon self-updates the tool, then clones and fast-forwards every GitHub repo at login and every 10 minutes, so a new machine converges to the full project set with nothing to run by hand

Nest-aware sync finds clones one level deep and updates them in place instead of re-cloning top-level duplicates

Cross-platform wiring: on Windows an AutoHotkey hotkey plus a watchdog kills the headless Zellij server when the window closes, on macOS a launchd agent runs the sync, and the OS-level config lives in a chezmoi dotfiles repo that points back at the app
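
The conflict-safe sync rules reduce to a small decision table. The sketch below is illustrative Python rather than the tool's Node code, and the `RepoState` fields and action names are hypothetical; it only decides what to do and runs no Git commands.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoState:
    cloned: bool
    dirty: bool = False  # uncommitted changes in the working tree
    ahead: int = 0       # local commits not on the remote
    behind: int = 0      # remote commits not in the local branch


def plan_pull(state: RepoState) -> str:
    """Clone what is missing, fast-forward what is clean and strictly behind."""
    if not state.cloned:
        return "clone"
    if state.dirty or (state.ahead and state.behind):
        return "skip"  # dirty or diverged repos are left untouched
    if state.behind:
        return "fast-forward"
    return "up-to-date"


def plan_push(state: RepoState) -> str:
    """Push only when the result would be a fast-forward on the remote."""
    if not state.cloned or state.dirty:
        return "skip"
    if state.behind:
        return "skip"  # stale or diverged: refuse rather than overwrite
    if state.ahead:
        return "push"
    return "up-to-date"
```

With plain Git, the same guarantees come from `git merge --ff-only` on pull and from never passing a force flag on push, so the worst outcome of a sync is a skipped repo.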

property-leads

Autonomous Lead-Finding Pipeline

property-leads is an autonomous lead-finding pipeline for a real-estate cash buyer, built as private client work with no public repository. A four-agent chain runs on an hourly cron: a Haiku orchestrator, Sonnet research and scoring, and a Sonnet writer gated by a Haiku reviewer. Outreach runs at about $0.22 per 33-property batch.

Next.js 16 · Anthropic SDK · Neon Postgres · Drizzle · Apify · Resend · Leaflet

Technical Details

4-stage agent pipeline with tiered models: Haiku for orchestration and the outreach reviewer, Sonnet for research, scoring, and draft writing

Research agent folds FEMA flood zones and municipal violation and permit data into a single maximum allowable offer (MAO) with cited reasoning per property

Scoring returns 0-100 with hot/warm/cold tiering and a breakdown so an analyst can disagree with the model in one read

Outreach has a Sonnet drafter and a separate Haiku reviewer that can block or rewrite a draft before it reaches Resend. emailPolicy defaults to off so test runs never blast

Scheduling is three knobs on a versioned config row: pause, interval in minutes, and time-of-day with IANA timezone. Vercel cron fires hourly and the route gates itself

Idempotent ALTER TABLE migration runner, fingerprint-based dedup across runs, Nominatim geocoding queue with a hard 1 req/sec rate limit
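
Fingerprint-based dedup across runs can be sketched as follows. The choice of fields (address plus postal code), the normalization, and the function names are assumptions for illustration; in the pipeline the seen-set would live in Postgres rather than in memory.

```python
import hashlib
from typing import Dict, List, Set


def fingerprint(address: str, postal_code: str) -> str:
    """Hash a normalized form so cosmetic differences map to the same lead."""
    normalized = " ".join(address.lower().split()) + "|" + postal_code.strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def new_leads(batch: List[Dict[str, str]], seen: Set[str]) -> List[Dict[str, str]]:
    """Return only leads whose fingerprint has not appeared in this or earlier runs."""
    fresh = []
    for lead in batch:
        fp = fingerprint(lead["address"], lead["postal_code"])
        if fp not in seen:
            seen.add(fp)  # record it so later runs skip the same property
            fresh.append(lead)
    return fresh
```

Deduplicating on a fingerprint rather than a source id means the same property scraped from two listings, or scraped again next hour, is researched and contacted once.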

survival-station

Offline-First AI Survival Appliance

GitHub→

An air-gapped PC, built for a non-technical user, that runs with no internet on solar power. Windows Smart App Control on the machine blocked unsigned binaries, which ruled out numpy, torch, and Open WebUI, so a stdlib-only Python server proxies a local Ollama for streaming chat and multimodal photo identification. Answers are grounded with retrieval over an offline Kiwix encyclopedia; the lookup runs in parallel with the model call and fails open. Offline maps run on a pure-PowerShell PMTiles server with a MapLibre viewer.

Python (stdlib) · Ollama · gemma3 / moondream · Kiwix · PowerShell · MapLibre

Technical Details

Stdlib-only Python web server: streaming NDJSON chat, multimodal photo input, and a single-file inline UI, with zero pip packages because Smart App Control blocked unsigned binaries during the build

Retrieval grounding over an offline Kiwix library: the lookup runs in parallel with the model answer, carries its own short timeout, and fails open so a slow or missing library never blocks the chat

Hand-rolled HTMLParser scrapes Kiwix search results and maps ZIM slugs to readable source labels (Medical Encyclopedia, iFixit, Prepper Pack, Wikipedia)

Local multimodal vision via gemma3:4b and moondream for plant and wound identification, behind a hard safety prompt that refuses to ever call a wild plant edible from a photo alone

Pure-PowerShell PMTiles server resolves map tiles out of a single binary archive over HTTP range requests, paired with a vendored MapLibre viewer for fully offline maps

Localhost-only with no runtime telemetry; about 87 GB of encyclopedias, tiles, and models stays out of git and rebuilds from a manifest (aria2 download list plus extract scripts)
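
The parallel, fail-open retrieval pattern fits in the standard library. This is a minimal sketch, not the appliance's server code: `grounded_answer`, `ask_model`, and `lookup` are hypothetical names, and the two callables stand in for the Ollama request and the Kiwix search.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def grounded_answer(
    question: str,
    ask_model: Callable[[str], str],
    lookup: Callable[[str], List[str]],
    lookup_timeout: float = 0.5,  # assumed value; the text only says "short"
) -> Tuple[str, List[str]]:
    """Run the model and the library lookup side by side; never let the lookup block."""
    pool = ThreadPoolExecutor(max_workers=2)
    try:
        answer_future = pool.submit(ask_model, question)
        sources_future = pool.submit(lookup, question)
        try:
            sources = sources_future.result(timeout=lookup_timeout)
        except Exception:
            # Fail open: a slow, broken, or missing library yields no sources.
            sources = []
        return answer_future.result(), sources
    finally:
        # Do not wait for a lookup that is still running past its timeout.
        pool.shutdown(wait=False)
```

The lookup's timeout is independent of the model call, so the worst a slow library can do is return an answer without source labels.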

sniply

Live Two-Sided Marketplace

Live site → · GitHub →

The live marketplace: discovery, map filters, and a pro profile

A booking marketplace for barbers and stylists, live at sniply.biz. Customers find pros by map, specialty, and availability; pros run their book, services, and hours from a dashboard. This is the one project here with no AI in the runtime path. It exists to prove the unglamorous parts: real auth, race-condition-safe booking, and a test suite that covers both sides of the product.

Next.js 16 · React 19 · TypeScript · PostgreSQL · Tailwind 4 · Playwright

Technical Details

Double-booking prevention with PostgreSQL advisory locks: pg_advisory_xact_lock on barber + date serializes concurrent booking requests, which row locks alone cannot do for empty slots

Custom HMAC-SHA256 session auth with timing-safe comparison, httpOnly cookies, and role separation between customers and pros

291 test cases across 29 files, including 54 Playwright end-to-end tests covering browse, booking, messaging, reviews, and the pro dashboard

Map-based discovery with Leaflet plus filters for hair type, specialty, and availability windows

In-app messaging threads, verified reviews with pro replies, and a typed data access layer from rows to API responses

Seed data for 22 pro profiles so local dev and demos work without production data
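
HMAC-SHA256 session signing with a timing-safe comparison looks roughly like this. The sketch is in Python rather than the site's TypeScript, the token layout (`user_id.role.mac`) and function names are hypothetical, and expiry and cookie handling are left out.

```python
import hashlib
import hmac
from typing import Optional, Tuple


def _mac(payload: str, secret: bytes) -> str:
    return hmac.new(secret, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def sign_session(user_id: str, role: str, secret: bytes) -> str:
    """Build a token whose payload cannot be changed without the server secret."""
    payload = f"{user_id}.{role}"
    return f"{payload}.{_mac(payload, secret)}"


def verify_session(token: str, secret: bytes) -> Optional[Tuple[str, str]]:
    """Return (user_id, role) for a valid token, or None for anything else."""
    try:
        user_id, role, mac = token.rsplit(".", 2)
    except ValueError:
        return None  # not enough parts to be a token
    expected = _mac(f"{user_id}.{role}", secret)
    # compare_digest takes time independent of where the two values differ.
    if not hmac.compare_digest(mac.encode("utf-8"), expected.encode("utf-8")):
        return None
    return user_id, role
```

Putting the role inside the signed payload is one way to get customer/pro separation: a customer cannot promote a token to `pro` without invalidating the signature.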

02 / YouTube

2.1M

subscribers on YouTube

270M+

Total Views

136+

Videos

7+

Years

I have been creating content on YouTube for over seven years under the name Boffy. I grew the channel from zero to 2.1 million subscribers, mostly gaming. No team at first. A lot of the videos were technical: game modding, and how PC parts like graphics cards change the way a game runs. I learned how to make that watchable for a huge audience.

Eventually I hired editors and designers, negotiated sponsorships with RedMagic, Wargaming, GeoGuessr, and others, and spent a lot of time in analytics trying to figure out what was actually working.

Running a YouTube channel at this scale is mostly a feedback loop. You put something out, look at how people respond, and adjust. It is the same instinct I bring to shipping and explaining software.

Brand Partnerships

RedMagic · Wargaming · GeoGuessr · YouTooz · Factor · GamerSupps · Ellify

03 / About

How I got here

I build AI agents from scratch. I write the orchestration layer myself: tool-use loops, state machines, memory management, and multi-provider routing. Every system in the project list was built solo, with no LangChain, no CrewAI, and no agent frameworks.

Before this I spent seven years growing a YouTube channel from zero to 2.1 million subscribers, mostly gaming. A lot of it meant taking something complicated, like a modding workflow or why one graphics card beats another, and making a general audience actually want to sit through it. That is the same skill developer advocacy runs on, which is why it interests me as much as engineering does.

What I work with

TypeScript · Python · Next.js · PostgreSQL · SQLite · Zod · Pydantic · Anthropic SDK · Claude Fable 5 · Groq · OpenRouter

How I build

Hand-rolled orchestration, no LangChain, no CrewAI
Claude Code as primary dev tool
Model tiering per step: Fable 5 reasons, Sonnet extracts, Haiku routes
Multi-provider LLM routing (Claude, Groq, OpenRouter, Ollama)
Full-stack: backend, frontend, dashboards, deployment
State machines for agent lifecycle management
Recording outcomes and feeding them back into future runs

04 / Contact

Get in touch.

I'm looking for AI engineering or developer-advocacy roles at early-stage startups in the Bay Area or remote. I spent seven years making gaming videos for a 2.1M-subscriber audience, a lot of it explaining game modding and PC hardware, and now I build AI agents from scratch. If you're building something interesting, I want to hear about it.

Joshua.Kappler@gmail.com · Book a call · Live demo · LinkedIn

Josh Kappler · 2026

YouTube · GitHub
