feat(executor): add LocalRunner and OpenAI-compat LLM client

Phase 1 of "local OSS models as agents" plan. Adds a third Runner backed by any OpenAI-compatible HTTP server (Ollama, vLLM, LM Studio, llama.cpp), and migrates the Gemini-CLI classifier to route through the same client when configured.

Two-layer split: internal/llm.Client is the workhorse (HTTP, no Pool, no DB) used directly by the classifier and any future internal helper that needs cheap reasoning. internal/executor.LocalRunner is a thin adapter implementing Runner for user-facing tasks. This avoids Pool reentrancy/deadlock when sub-second internal calls fire from inside Pool.execute().

Highlights:
- internal/retry: relocated runWithBackoff/IsRateLimitError/ParseRetryAfter into a shared package reused by executor and llm.
- internal/llm: Chat (non-streaming) and ChatStream (SSE) over /chat/completions with optional bearer auth, json_object response format, retry on 429/503, Retry-After parsing.
- internal/executor/LocalRunner: streams deltas into stdout.log in the same stream-json envelope ClaudeRunner emits, then writes one consolidated assistant block plus a result terminator so existing parsers (extractSummary, ParseChangestatFromOutput) work unchanged.
- internal/executor/Classifier: gains optional LLM field; uses json_object response format (no markdown-fence cleanup needed). Falls back to Gemini-CLI subprocess when LLM is nil.
- Pool.skipClassification: now skips only when the requested agent type is registered, so unknown types still reach the load balancer.
- Storage: additive tokens_in/tokens_out ALTERs on executions; CLI runners record cost_usd as before, LocalRunner records 0 + tokens.
- Config: [local_model] section (endpoint, model, timeout_seconds, default_temperature, api_key). Empty endpoint = no LocalRunner registered, classifier falls back to Gemini.

Pre-existing test issues fixed in passing:
- claude_test.go setupSandbox callsites updated to current signature.
- gemini_test.go TestParseGeminiStream skipped (asserts unimplemented GeminiRunner stream-error parsing; tracked separately).

Plan: docs/plans/local-oss-runner.md.

https://claude.ai/code/session_017Edeq947TpSm1vQTxMhi1J
author: Claude <noreply@anthropic.com> 2026-04-28 09:24:43 +0000
committer: Claude <noreply@anthropic.com> 2026-04-28 09:24:43 +0000
commit: 0865afc43be562dbe14528e4299b9e213b54cc93 (patch)
tree: 3ffb11207fb6b9866b5a2477bba7abe38964f83a /docs/plans/local-oss-runner.md
parent: c2aa026f6ce1c9e216b99d74f294fc133d5fcddd (diff)
1 files changed, 185 insertions, 0 deletions
diff --git a/docs/plans/local-oss-runner.md b/docs/plans/local-oss-runner.md
new file mode 100644
index 0000000..de16e05
--- /dev/null
+++ b/docs/plans/local-oss-runner.md
@@ -0,0 +1,185 @@
+# Local OSS Models as a Third Runner
+
+## Context
+
+Today the executor only knows about subprocess CLI agents (Claude, with a stubbed Gemini). Internal LLM-shaped work — model classification, free-form prompt elaboration, webhook CI summarization, execution summary — either shells out to the `gemini` CLI (`internal/executor/classifier.go:60`) or sits in `internal/api/elaborate.go` doing the same. That's expensive in latency and dollars for what are essentially helper completions, and there's no path to keep "internal" reasoning private/local.
+
+This change adds a local OSS model backend (any OpenAI-compatible HTTP server: Ollama, vLLM, LM Studio, llama.cpp server) as a first-class third Runner alongside Claude and Gemini. The unified harness model wins over a separate "internal LLM service" because it preserves a single `Runner` abstraction, single `executions` table, and one set of pool semantics (rate-limit handling, observability, WebSocket events) for any task whose `agent.type == "local"`.
+
+Outcome: a `LocalRunner` for user-facing tasks, plus a lower-level `LocalLLMClient` that internal helpers call directly without paying Pool/Execution overhead. First migration target is the classifier (sub-second, high-volume, lowest blast radius). Elaboration, webhook summarization, and execution summary follow in subsequent passes using the same client.
+
+## Architectural decision: two layers, one backend
+
+`LocalRunner implements Runner` is the user-visible contract. But the classifier runs *inside* `Pool.execute()` (at `internal/executor/executor.go:437`), so submitting recursively to `Pool` would deadlock against `workCh`'s slot accounting and pollute the `executions` table with sub-second rows for every classification.
+
+Resolution: introduce a `LocalLLMClient` (HTTP, no Pool, no DB) as the workhorse. `LocalRunner` is a thin adapter over it for full Pool-managed executions. Internal callers — classifier now, elaborate/webhook/summary later — call `LocalLLMClient` directly. Two code paths to local, but path lengths are wildly unequal (the runner is ~150 lines of glue) and they share one HTTP round-tripper.
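
The two-layer split described above can be pictured with a small sketch. Everything in it is a simplified stand-in rather than the project's code: `Client.Chat` takes a plain string instead of a context and request, `executionLog` is a hypothetical substitute for the `executions` table, and the Pool itself is left out. It only illustrates that the classifier and the runner share one client while only the runner records an execution.

```go
package main

import "fmt"

// Client stands in for llm.Client. The real Chat takes a context and a
// ChatRequest; a plain prompt string keeps this sketch short.
type Client struct {
	chat func(prompt string) (string, error)
}

func (c *Client) Chat(prompt string) (string, error) { return c.chat(prompt) }

// executionLog is a hypothetical stand-in for the executions table.
type executionLog struct{ rows []string }

// LocalRunner is the Pool-facing adapter: each Run records one execution.
type LocalRunner struct {
	Client *Client
	Log    *executionLog
}

func (r *LocalRunner) Run(instructions string) (string, error) {
	out, err := r.Client.Chat(instructions)
	if err != nil {
		return "", err
	}
	r.Log.rows = append(r.Log.rows, "local: "+instructions)
	return out, nil
}

// Classifier calls the client directly: no Pool slot, no executions row.
type Classifier struct{ LLM *Client }

func (c *Classifier) Classify(task string) (string, error) {
	return c.LLM.Chat("classify: " + task)
}

// demo runs one classification and one user-facing task against the same
// client and reports how many execution rows were recorded.
func demo() (int, error) {
	client := &Client{chat: func(p string) (string, error) { return "ok: " + p, nil }}
	log := &executionLog{}
	classifier := &Classifier{LLM: client}
	runner := &LocalRunner{Client: client, Log: log}

	if _, err := classifier.Classify("fix the build"); err != nil {
		return 0, err
	}
	if _, err := runner.Run("Summarize: 2+2"); err != nil {
		return 0, err
	}
	return len(log.rows), nil
}

func main() {
	rows, err := demo()
	fmt.Println(rows, err)
}
```

Two calls reach the client, but only the runner's call leaves a row behind, which is the property the plan relies on to keep sub-second classifications out of the `executions` table.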
+
+Capabilities (e.g. "this runner can edit code, that one can't") are deferred. `LocalRunner` simply leaves `SandboxDir` empty; the Pool already tolerates that. Revisit only when a third non-coding runner appears.
+
+## End state
+
+- **`internal/llm`** (new package) — `LocalLLMClient` with `Chat` and `ChatStream` over OpenAI-compat `/chat/completions`. Handles retries via the existing backoff helper, JSON mode, SSE streaming, optional bearer token.
+- **`internal/executor/local.go`** (new) — `LocalRunner` implements `Runner`. Streams response deltas into the same stream-json envelope Claude uses (`{"type":"assistant","message":{"content":[{"type":"text","text":"..."}]}}`) so existing parsers (`internal/executor/summary.go:13`, `internal/task/changestats.go`) keep working unchanged.
+- **`Classifier`** (`internal/executor/classifier.go`) — now holds a `*llm.Client`. When set, classification goes through it with `response_format: json_object`; markdown-fence cleanup is skipped on this path. Gemini-CLI path stays as a fallback when `[local_model]` config is empty.
+- **Storage** — `executions.tokens_in` and `tokens_out` added (additive `ALTER`, schema pattern at `internal/storage/db.go:78-89`). `cost_usd` stays 0 for local. `session_id`/`sandbox_dir` remain nullable; `LocalRunner` simply doesn't populate them.
+- **`AgentConfig`** — adds `Temperature *float64` (pointer so `nil` means "unset" and an explicit `0` is preserved) and `MaxTokens int` at `internal/task/task.go:30`. Existing Claude-shaped fields (`PermissionMode`, `AllowedTools`, etc.) are silently ignored by `LocalRunner`.
+- **Config** — new `[local_model]` TOML section in `internal/config/config.go:18`: `endpoint`, `model`, `timeout_seconds`, `default_temperature`, `api_key`. If `endpoint` is empty, the runner is not registered and the classifier falls back to Gemini-CLI.
+- **Routing** — `executor.go:428`'s hardcoded `t.Agent.Type == "claude" || == "gemini"` widens to include `"local"` (or, cleaner, becomes `t.Agent.Type != ""`).
+- **Wiring** — `cmd/claudomator/main.go`, `internal/cli/serve.go:60-78`, and `internal/cli/run.go:75-90` build the `*llm.Client` from config and register both `runners["local"]` and `pool.Classifier.LLM`.
+- **GeminiRunner** (`internal/executor/gemini.go`) — kept and finished alongside as a separate concern. The shared backoff helper move (below) and the `LogPather` interface it already implements (`gemini.go:26`) are unaffected. Real subprocess invocation replacing the simulated stdout block at `gemini.go:107-116` is a follow-up commit, not gated by this change.
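
As an illustration of the stream-json envelope quoted in the `internal/executor/local.go` bullet above, here is a minimal sketch of a writer that emits one such line per delta. The helper names `assistantLine` and `writeAssistantDelta` are hypothetical; only the JSON shape comes from the plan.

```go
package main

import (
	"encoding/json"
	"io"
	"os"
)

// These types mirror the envelope quoted in the plan:
// {"type":"assistant","message":{"content":[{"type":"text","text":"..."}]}}
type contentBlock struct {
	Type string `json:"type"`
	Text string `json:"text"`
}

type assistantMessage struct {
	Content []contentBlock `json:"content"`
}

type assistantEvent struct {
	Type    string           `json:"type"`
	Message assistantMessage `json:"message"`
}

// assistantLine (hypothetical helper) encodes one text delta as a single
// JSON line without the trailing newline.
func assistantLine(text string) (string, error) {
	ev := assistantEvent{
		Type:    "assistant",
		Message: assistantMessage{Content: []contentBlock{{Type: "text", Text: text}}},
	}
	b, err := json.Marshal(ev)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// writeAssistantDelta (hypothetical helper) appends one envelope line to w,
// for example the file behind stdout.log.
func writeAssistantDelta(w io.Writer, text string) error {
	line, err := assistantLine(text)
	if err != nil {
		return err
	}
	_, err = io.WriteString(w, line+"\n")
	return err
}

func main() {
	if err := writeAssistantDelta(os.Stdout, "## Summary\nAll done."); err != nil {
		panic(err)
	}
}
```

Going through `encoding/json` rather than string concatenation means newlines and quotes inside a delta are escaped, so each event stays on one line for line-oriented parsers.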
+
+Shared utility move: `runWithBackoff` currently lives at `internal/executor/ratelimit.go:60`. Move it to a new tiny `internal/retry` package so both `internal/executor` and `internal/llm` use it. One-line change at the existing call site in `claude.go`.
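
For readers unfamiliar with the helper being moved, a minimal sketch of a retry-with-backoff loop follows. The signature, the doubling policy, and the substrings matched by `isRateLimitError` are all assumptions made for illustration; the project's `runWithBackoff` may differ on each point.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
	"time"
)

// isRateLimitError is a stand-in for the executor's matcher. The substrings
// checked here are assumptions, not the project's actual list.
func isRateLimitError(err error) bool {
	if err == nil {
		return false
	}
	msg := strings.ToLower(err.Error())
	return strings.Contains(msg, "rate limit") || strings.Contains(msg, "429")
}

// RunWithBackoff calls fn until it succeeds, fails with a non-rate-limit
// error, or has been retried maxRetries times. The delay doubles per retry.
func RunWithBackoff(ctx context.Context, maxRetries int, base time.Duration, fn func() error) error {
	delay := base
	for attempt := 0; ; attempt++ {
		err := fn()
		if err == nil || !isRateLimitError(err) || attempt >= maxRetries {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
		delay *= 2
	}
}

// countAttempts runs a function that is rate limited `failures` times before
// succeeding and reports how many calls were made in total.
func countAttempts(failures, maxRetries int) (int, error) {
	calls := 0
	err := RunWithBackoff(context.Background(), maxRetries, time.Millisecond, func() error {
		calls++
		if calls <= failures {
			return errors.New("HTTP 429: rate limit exceeded")
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := countAttempts(2, 3)
	fmt.Println(calls, err)
}
```

Keeping the loop free of any executor or HTTP types is what lets one package serve both `internal/executor` and `internal/llm`.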
+
+## Migration phases
+
+**Phase 1 — this pass. Classifier swap.** All the `internal/llm` + `internal/executor/local.go` + `Classifier` work above. Gated by config: if `[local_model].endpoint` is unset, behavior is unchanged. Net new files; no breaking changes to existing runners.
+
+**Phase 2 — task elaboration.** `internal/api/elaborate.go:208-275` currently has Claude and Gemini paths. Add `elaborateWithLocal`; new try-order is local → claude → gemini, controlled by a `prefer_local_for_elaborate` config flag. `Server` (`internal/api/server.go:76`) gains an `llm *llm.Client` field passed via `NewServer`.
+
+**Phase 3 — webhook CI summarization.** `createCIFailureTask` at `internal/api/webhook.go:154` builds task instructions from a hardcoded template. Add an optional summarization step calling `s.llm.Chat` over the fetched workflow logs to produce a tighter `instructions` body. Pure additive.
+
+**Phase 4 — execution summary.** `extractSummary` (`internal/executor/summary.go:13`) is text-pattern based. Add `summarizeExecution(ctx, *llm.Client, stdoutPath) string` that synthesizes a summary when no `## Summary` section exists. Hook lives in `Pool.handleRunResult` at `executor.go:347-355`; pass `*llm.Client` through `Pool` construction.
+
+## Critical files
+
+**New:**
+- `internal/llm/client.go` — `Client`, `Chat`, `ChatStream`, request/response types
+- `internal/llm/client_test.go` — `httptest`-driven coverage
+- `internal/executor/local.go` — `LocalRunner`
+- `internal/executor/local_test.go` — runner tests with stub `*llm.Client`
+- `internal/retry/backoff.go` — relocated `runWithBackoff`
+
+**Modified:**
+- `internal/executor/classifier.go` — add `LLM *llm.Client` field, route through it when set, keep Gemini fallback path
+- `internal/executor/classifier_test.go` — add httptest-backed test
+- `internal/executor/executor.go:428` — broaden `skipClassification` predicate
+- `internal/executor/ratelimit.go` — remove `runWithBackoff` (moved); update import in `claude.go`
+- `internal/task/task.go:30-43` — add `Temperature`, `MaxTokens` to `AgentConfig`
+- `internal/config/config.go:18-52` — add `LocalModel` struct + field to `Config`
+- `internal/storage/db.go:78-89` — two additive `ALTER` migrations; add `TokensIn`/`TokensOut` to `Execution` struct; update SELECT/INSERT/UPDATE SQL in same file
+- `internal/cli/serve.go:60-78`, `internal/cli/run.go:75-90`, `cmd/claudomator/main.go` — build client, register runner, wire classifier
+
+## Reuse, not reinvention
+
+- `runWithBackoff` (`internal/executor/ratelimit.go:60`) → relocated and reused by `LocalLLMClient`
+- `isRateLimitError`/`isQuotaExhausted` (`executor.go:271-283`) → emit compatible error strings from `LocalLLMClient` so Pool's existing rate-limit handling treats local 429/503 identically
+- Stream-json envelope from `claude.go:600` parsing → `LocalRunner` writes the same envelope so `extractSummary` and `ParseChangestatFromFile` work unchanged
+- Existing nullable `session_id`/`sandbox_dir` columns → no schema rework needed for non-coding runners
+- `LogPather` interface (`executor.go:38`) → `LocalRunner` implements it for log path pre-population, just like `GeminiRunner` already does
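
The second bullet above depends on the client turning HTTP 429/503 responses into errors the Pool already recognizes. A minimal sketch of that mapping, including `Retry-After` parsing, might look like the following. The error wording and both function names are assumptions; the strings that the real `isRateLimitError` matches are not shown in this plan.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// parseRetryAfter accepts both forms HTTP allows for Retry-After:
// a number of seconds, or an HTTP-date that is compared against now.
func parseRetryAfter(value string, now time.Time) (time.Duration, bool) {
	value = strings.TrimSpace(value)
	if value == "" {
		return 0, false
	}
	if secs, err := strconv.Atoi(value); err == nil {
		if secs < 0 {
			return 0, false
		}
		return time.Duration(secs) * time.Second, true
	}
	if when, err := http.ParseTime(value); err == nil {
		if d := when.Sub(now); d > 0 {
			return d, true
		}
		return 0, true // date already passed: retry immediately
	}
	return 0, false
}

// rateLimitError returns nil for statuses other than 429 and 503. The
// "rate limit" wording is an assumed convention, chosen so that a
// substring-based matcher would treat local and CLI errors alike.
func rateLimitError(status int, retryAfter string) error {
	if status != http.StatusTooManyRequests && status != http.StatusServiceUnavailable {
		return nil
	}
	msg := fmt.Sprintf("rate limit: local model returned HTTP %d", status)
	if d, ok := parseRetryAfter(retryAfter, time.Now()); ok {
		msg += fmt.Sprintf(" (retry after %s)", d)
	}
	return errors.New(msg)
}

func main() {
	fmt.Println(rateLimitError(http.StatusTooManyRequests, "1"))
	fmt.Println(rateLimitError(http.StatusOK, ""))
}
```

If the Pool matches on different substrings, only the message built in `rateLimitError` needs to change; the parsing is independent of it.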
+
+## Verification
+
+**Unit tests:**
+- `internal/llm/client_test.go`: httptest server returns canned chat-completion JSON; assert `Chat` returns parsed `Content`, prompt/output tokens, model. Second test: SSE stream (data: lines, terminating `data: [DONE]`); assert `onDelta` called per chunk and final `ChatResponse` aggregated. Third: HTTP 429 with `Retry-After: 1` → assert one retry then success.
+- `internal/executor/classifier_test.go`: httptest backend returning JSON-mode response → assert `Classification` parsed correctly. Existing mock-binary test stays for the Gemini fallback path.
+- `internal/executor/local_test.go`: stub `*llm.Client` returning fixed text → `Run` writes correct stream-json envelope to `stdout.log`; verify `extractSummary` finds `## Summary` from that envelope.
+- `go test -race ./...` passes (Pool reentrancy is the risk this design avoids; race detector would catch slips).
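
The SSE case in the first bullet can be made concrete with a small parser sketch. It assumes the OpenAI-style chunk shape (`choices[].delta.content`) and reads from any `io.Reader`, so the same function can be fed an `httptest` response body or a string. The names `readSSE` and `collect` are hypothetical, and treating a stream that ends without `data: [DONE]` as an error is a design choice of this sketch, not something the plan specifies.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// streamChunk covers only the fields this sketch reads from an
// OpenAI-style streaming chunk.
type streamChunk struct {
	Choices []struct {
		Delta struct {
			Content string `json:"content"`
		} `json:"delta"`
	} `json:"choices"`
}

// readSSE scans "data:" lines, calls onDelta for each non-empty text delta,
// stops at "data: [DONE]", and returns the aggregated text.
func readSSE(r io.Reader, onDelta func(string)) (string, error) {
	var full strings.Builder
	sc := bufio.NewScanner(r) // default 64 KiB line limit is enough for a sketch
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, "data:") {
			continue // blank separators, comments, other SSE fields
		}
		payload := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
		if payload == "[DONE]" {
			return full.String(), nil
		}
		var chunk streamChunk
		if err := json.Unmarshal([]byte(payload), &chunk); err != nil {
			return full.String(), fmt.Errorf("decode chunk: %w", err)
		}
		for _, c := range chunk.Choices {
			if c.Delta.Content != "" {
				onDelta(c.Delta.Content)
				full.WriteString(c.Delta.Content)
			}
		}
	}
	if err := sc.Err(); err != nil {
		return full.String(), err
	}
	return full.String(), io.ErrUnexpectedEOF // stream ended without [DONE]
}

// collect feeds a literal stream through readSSE and counts onDelta calls.
func collect(stream string) (string, int, error) {
	calls := 0
	text, err := readSSE(strings.NewReader(stream), func(string) { calls++ })
	return text, calls, err
}

func main() {
	stream := "data: {\"choices\":[{\"delta\":{\"content\":\"Hel\"}}]}\n\n" +
		"data: {\"choices\":[{\"delta\":{\"content\":\"lo\"}}]}\n\n" +
		"data: [DONE]\n"
	fmt.Println(collect(stream))
}
```

A test along the lines of the bullet would assert that the callback fired once per chunk and that the aggregated text equals the concatenation of the deltas.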
+
+**Manual end-to-end against Ollama:**
+1. `ollama pull llama3.1:8b && ollama serve`
+2. Add to `~/.claudomator/config.toml`:
+   ```toml
+   [local_model]
+   endpoint = "http://localhost:11434/v1"
+   model = "llama3.1:8b"
+   ```
+3. `./claudomator serve` → submit a normal task → observe a single classification request hit Ollama (no `gemini` subprocess spawned) and a model selection logged at `executor.go:440`.
+4. Submit a task with `agent.type = "local"`, `instructions = "Summarize: 2+2"`. Expect `READY`/`COMPLETED` execution, populated `stdout.log` with stream-json text deltas, `cost_usd = 0`, non-zero `tokens_out` in the `executions` row.
+5. Stop Ollama → submit another task → classifier should fall back to `gemini` invocation (or fail with a rate-limit-style error if no Gemini binary present). This confirms the runtime-failure fallback path; the `endpoint == ""` path is confirmed separately by removing the `[local_model]` section and repeating step 3.
+
+**Build sanity:** `go build ./...` and `go test -race ./...` (CGo / `gcc` required per CLAUDE.md).
+
+---
+
+# Phase 1 — Focused Plan
+
+This is the only phase we execute in this pass. Phases 2–4 will get their own focused plans when we start them; the sketches above are forward intent, not commitments.
+
+## Phase 1 scope (what ships)
+
+- New `internal/llm` package with `Client.Chat` and `Client.ChatStream`
+- New `internal/executor/local.go` with `LocalRunner` implementing `Runner`
+- New `internal/retry` package holding the relocated `runWithBackoff`
+- Classifier (`internal/executor/classifier.go`) routes through `*llm.Client` when configured; Gemini-CLI fallback preserved
+- Two additive `executions` migrations: `tokens_in`, `tokens_out`
+- `AgentConfig` gains `Temperature *float64`, `MaxTokens int`
+- `Config` gains `[local_model]` section (`endpoint`, `model`, `timeout_seconds`, `default_temperature`, `api_key`)
+- `executor.go:428` `skipClassification` predicate broadens to all non-empty agent types
+- Wiring in `cmd/claudomator/main.go`, `internal/cli/serve.go`, `internal/cli/run.go`
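
The reason `Temperature` is a pointer while `MaxTokens` is a plain `int` shows up when the config is serialized. The sketch below assumes JSON tags with `omitempty` (the plan only says "appropriate yaml/json tags") and uses a hypothetical `withTemperature` helper; it shows that a nil pointer is omitted while an explicit `0` survives encoding.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AgentConfig shows only the two new fields; the tag names are assumptions.
type AgentConfig struct {
	Temperature *float64 `json:"temperature,omitempty"`
	MaxTokens   int      `json:"max_tokens,omitempty"`
}

// encode renders the config as JSON for comparison.
func encode(c AgentConfig) string {
	b, err := json.Marshal(c)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

// withTemperature (hypothetical helper) returns a config whose temperature
// is explicitly set, even when the value is 0.
func withTemperature(v float64) AgentConfig {
	return AgentConfig{Temperature: &v}
}

func main() {
	fmt.Println(encode(AgentConfig{}))      // unset: field omitted
	fmt.Println(encode(withTemperature(0))) // explicit zero: field kept
}
```

With a plain `float64`, both cases would encode identically, and a caller asking for deterministic output at temperature 0 could not be told apart from one who left the field blank.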
+
+## Phase 1 explicit non-goals
+
+- No changes to `internal/api/elaborate.go` (Phase 2)
+- No changes to `internal/api/webhook.go` (Phase 3)
+- No changes to `internal/executor/summary.go` summary-generation logic (Phase 4)
+- No GeminiRunner completion work (cost parsing, sandbox, real subprocess) — separate parallel commit
+- No frontend changes — UI still says "Auto / Claude / Gemini"; "Local" dropdown option deferred until token telemetry surfaces
+- No capabilities interface on `Runner`
+- No new `executions` columns beyond the two token counters
+
+## Phase 1 task list (in execution order)
+
+1. **Persist this plan to the workspace.** Copy `/root/.claude/plans/major-revision-we-re-going-quizzical-newell.md` to `docs/plans/local-oss-runner.md`. This is the durable record that lives with the codebase. Phase 2/3/4 focused plans will be appended to the same file when started.
+
+2. **Create branch.** `git checkout -b claude/local-oss-model-agents-MEBqj` (already designated; create if it doesn't exist).
+
+3. **`internal/retry/backoff.go`** — relocate `runWithBackoff` from `internal/executor/ratelimit.go:60`. Update the existing call site in `internal/executor/claude.go` to import from the new path. Keep the signature and behavior unchanged, apart from exporting the function as `RunWithBackoff` so other packages can call it. Run `go build ./...` and `go test ./internal/executor/...` to confirm zero behavioral change.
+
+4. **`internal/llm/client.go`** — implement the package. Types from the design:
+   - `Client{Endpoint, Model, HTTPClient, APIKey, Logger}`
+   - `ChatRequest{Model, Messages, Temperature, MaxTokens, ResponseJSON, Stream}`
+   - `Message{Role, Content}`
+   - `ChatResponse{Content, PromptTokens, OutputTokens, Model, FinishReason}`
+   - `Chat(ctx, req)` — POSTs `/chat/completions`, wraps in `retry.RunWithBackoff`, maps 429/503 to `isRateLimitError`-compatible error strings
+   - `ChatStream(ctx, req, onDelta)` — same endpoint with `stream: true`, parses SSE `data:` lines, calls `onDelta(text)` per chunk, terminates on `data: [DONE]`, aggregates final response
+
+5. **`internal/llm/client_test.go`** — three tests:
+   - Canned chat-completion JSON → assert `Chat` returns parsed `Content`, prompt/output tokens, model
+   - SSE stream of `data:` lines terminated by `data: [DONE]` → assert `onDelta` called per chunk, final `ChatResponse` aggregated
+   - HTTP 429 with `Retry-After: 1` → assert one retry then success
+
+6. **`internal/storage/db.go:78-89`** — append two `ALTER TABLE executions ADD COLUMN` migrations for `tokens_in INTEGER` and `tokens_out INTEGER`. Add `TokensIn`, `TokensOut int64` to `Execution` struct. Update SELECT, INSERT, UPDATE SQL in the same file. Existing `isColumnExistsError` swallows duplicate-column errors so re-running is safe.
+
+7. **`internal/task/task.go:30-43`** — add `Temperature *float64` and `MaxTokens int` to `AgentConfig` with appropriate yaml/json tags. `Temperature` is a pointer so `nil` means "unset, use server default" while an explicit `0` is still sent.
+
+8. **`internal/config/config.go:18-52`** — add `LocalModel` struct (`Endpoint`, `Model`, `TimeoutSeconds`, `DefaultTemperature`, `APIKey`) and `LocalModel LocalModel` field on `Config`. `Default()` leaves `Endpoint` empty.
+
+9. **`internal/executor/local.go`** — `LocalRunner` struct with `Client *llm.Client`, `Logger`, `LogDir`. Implement `Run(ctx, *task.Task, *storage.Execution) error`:
+   - Build messages from `t.Agent.SystemPromptAppend` + `Instructions`
+   - Call `Client.ChatStream` with `onDelta` writing `{"type":"assistant","message":{"content":[{"type":"text","text":"<delta>"}]}}` lines to `e.StdoutPath`
+   - On completion, write a final `{"type":"result", ...}` line so existing parsers see a recognizable terminator
+   - Set `e.TokensIn`, `e.TokensOut`, `e.CostUSD = 0`, `e.Status = "READY"`
+   - Implement `LogPather` so log paths pre-populate consistently with other runners
+
+10. **`internal/executor/local_test.go`** — runner tests with a stub `*llm.Client` (use a small interface or test-injected `HTTPClient`):
+    - Stub returns fixed text containing a `## Summary` section
+    - Assert `Run` writes correct stream-json envelope to `stdout.log`
+    - Assert `extractSummary(stdoutPath)` (from `internal/executor/summary.go`) finds the summary
+    - Assert `e.TokensOut > 0` and `e.CostUSD == 0`
+
+11. **`internal/executor/classifier.go`** — add `LLM *llm.Client` field on `Classifier`. In `Classify`, when `c.LLM != nil`, use `LLM.Chat` with `ResponseJSON: true`, skip the markdown-fence cleanup. When nil, fall through to the existing Gemini-CLI subprocess path. Existing prompt template stays (already lists Claude+Gemini models, which is what the classifier still picks among).
+
+12. **`internal/executor/classifier_test.go`** — add httptest-backed test for the LLM path. Existing mock-binary test (if present) stays for the Gemini fallback path.
+
+13. **`internal/executor/executor.go:428`** — change `skipClassification := t.Agent.Type == "claude" || t.Agent.Type == "gemini"` to `skipClassification := t.Agent.Type != ""`. This generalizes correctly: any explicitly-set agent type skips selection; unset still goes through `pickAgent` + `Classifier`.
+
+14. **Wire registration** in three files:
+    - `cmd/claudomator/main.go` — build `*llm.Client` from `cfg.LocalModel` if `Endpoint != ""`, pass to pool construction
+    - `internal/cli/serve.go:60-78` — register `runners["local"] = &executor.LocalRunner{...}`, set `pool.Classifier = &executor.Classifier{LLM: localClient, GeminiBinaryPath: cfg.GeminiBinaryPath}`
+    - `internal/cli/run.go:75-90` — same pattern
+
+15. **`go test -race ./...`** — full suite passes. The race detector is the safety net for the reentrancy-avoidance design.
+
+16. **Manual smoke test against Ollama** — five steps documented in the Verification section above. Confirm the fallback path by stopping Ollama mid-session and watching classification fall back to Gemini.
+
+17. **Commit and push** to `claude/local-oss-model-agents-MEBqj`. Single commit covering Phase 1, message in the form: `feat(executor): add LocalRunner and OpenAI-compat LLM client`. Body describes the two-layer split (Client + Runner), the classifier swap, and the config gating.
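
Task 11 in the list above distinguishes two parsing paths in the classifier. A minimal sketch of that branch follows, assuming a hypothetical `Classification` shape with `agent_type` and `model` fields (the real struct is not shown in this plan). The fence marker is built at runtime only so the example does not contain a literal Markdown fence.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// fence is the Markdown code-fence marker (three backticks).
var fence = strings.Repeat("`", 3)

// Classification is a hypothetical shape; the field names are assumptions.
type Classification struct {
	AgentType string `json:"agent_type"`
	Model     string `json:"model"`
}

// stripFences removes one surrounding Markdown fence, with an optional
// language tag on the opening line, if present.
func stripFences(s string) string {
	s = strings.TrimSpace(s)
	if !strings.HasPrefix(s, fence) {
		return s
	}
	nl := strings.Index(s, "\n")
	if nl < 0 {
		return ""
	}
	s = strings.TrimSpace(s[nl+1:])
	return strings.TrimSpace(strings.TrimSuffix(s, fence))
}

// parseClassification skips fence cleanup when the response came from a
// JSON-mode request and applies it on the CLI fallback path.
func parseClassification(raw string, jsonMode bool) (Classification, error) {
	if !jsonMode {
		raw = stripFences(raw)
	}
	var c Classification
	if err := json.Unmarshal([]byte(raw), &c); err != nil {
		return Classification{}, fmt.Errorf("parse classification: %w", err)
	}
	return c, nil
}

// fenced wraps body the way a CLI model might wrap its answer.
func fenced(body string) string {
	return fence + "json\n" + body + "\n" + fence
}

func main() {
	body := `{"agent_type":"claude","model":"example-model"}`
	fromLLM, errLLM := parseClassification(body, true)
	fromCLI, errCLI := parseClassification(fenced(body), false)
	fmt.Println(fromLLM == fromCLI, errLLM, errCLI)
}
```

Skipping the cleanup in JSON mode also makes a misbehaving server visible: a fenced reply on that path fails to parse instead of being silently repaired.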
+
+## Stop conditions for Phase 1
+
+- All unit tests pass under `-race`
+- `go build ./...` clean
+- Smoke test against a running Ollama instance produces a `READY` execution with non-zero `tokens_out` and `cost_usd = 0`
+- Smoke test with `[local_model]` empty produces unchanged behavior (Gemini classifier path, no LocalRunner registered)
+- Branch pushed to remote
+
+After Phase 1 lands, we stop and decide whether to begin Phase 2 (elaboration). At that point we'll write a Phase 2 focused plan in `docs/plans/local-oss-runner.md`.