docs/plans/local-oss-runner.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385

# Local OSS Models as a Third Runner

## Context

Today the executor only knows about subprocess CLI agents (Claude, with a stubbed Gemini). Internal LLM-shaped work — model classification, free-form prompt elaboration, webhook CI summarization, execution summary — either shells out to the `gemini` CLI (`internal/executor/classifier.go:60`) or sits in `internal/api/elaborate.go` doing the same. That's expensive in latency and dollars for what are essentially helper completions, and there's no path to keep "internal" reasoning private/local.

This change adds a local OSS model backend (any OpenAI-compatible HTTP server: Ollama, vLLM, LM Studio, llama.cpp server) as a first-class third Runner alongside Claude and Gemini. The unified harness model wins over a separate "internal LLM service" because it preserves a single `Runner` abstraction, single `executions` table, and one set of pool semantics (rate-limit handling, observability, WebSocket events) for any task whose `agent.type == "local"`.

Outcome: a `LocalRunner` for user-facing tasks, plus a lower-level `LocalLLMClient` that internal helpers call directly without paying Pool/Execution overhead. First migration target is the classifier (sub-second, high-volume, lowest blast radius). Elaboration, webhook summarization, and execution summary follow in subsequent passes using the same client.

## Architectural decision: two layers, one backend

`LocalRunner implements Runner` is the user-visible contract. But the classifier runs *inside* `Pool.execute()` (at `internal/executor/executor.go:437`), so submitting recursively to `Pool` would deadlock against `workCh`'s slot accounting and pollute the `executions` table with sub-second rows for every classification.

Resolution: introduce a `LocalLLMClient` (HTTP, no Pool, no DB) as the workhorse. `LocalRunner` is a thin adapter over it for full Pool-managed executions. Internal callers — classifier now, elaborate/webhook/summary later — call `LocalLLMClient` directly. Two code paths to local, but path lengths are wildly unequal (the runner is ~150 lines of glue) and they share one HTTP round-tripper.

Capabilities (e.g. "this runner can edit code, that one can't") are deferred. `LocalRunner` simply leaves `SandboxDir` empty; the Pool already tolerates that. Revisit only when a third non-coding runner appears.

## End state

- **`internal/llm`** (new package) — `LocalLLMClient` with `Chat` and `ChatStream` over OpenAI-compat `/chat/completions`. Handles retries via the existing backoff helper, JSON mode, SSE streaming, optional bearer token.
- **`internal/executor/local.go`** (new) — `LocalRunner` implements `Runner`. Streams response deltas into the same stream-json envelope Claude uses (`{"type":"assistant","message":{"content":[{"type":"text","text":"..."}]}}`) so existing parsers (`internal/executor/summary.go:13`, `internal/task/changestats.go`) keep working unchanged.
- **`Classifier`** (`internal/executor/classifier.go`) — now holds a `*llm.Client`. When set, classification goes through it with `response_format: json_object`; markdown-fence cleanup is skipped on this path. Gemini-CLI path stays as a fallback when `[local_model]` config is empty.
- **Storage** — `executions.tokens_in` and `tokens_out` added (additive `ALTER`, schema pattern at `internal/storage/db.go:78-89`). `cost_usd` stays 0 for local. `session_id`/`sandbox_dir` remain nullable; `LocalRunner` simply doesn't populate them.
- **`AgentConfig`** — adds `Temperature *float64` (pointer so 0 means "unset") and `MaxTokens int` at `internal/task/task.go:30`. Existing Claude-shaped fields (`PermissionMode`, `AllowedTools`, etc.) are silently ignored by `LocalRunner`.
- **Config** — new `[local_model]` TOML section in `internal/config/config.go:18`: `endpoint`, `model`, `timeout_seconds`, `default_temperature`, `api_key`. If `endpoint` is empty, the runner is not registered and the classifier falls back to Gemini-CLI.
- **Routing** — `executor.go:428`'s hardcoded `t.Agent.Type == "claude" || == "gemini"` widens to include `"local"` (or, cleaner, becomes `t.Agent.Type != ""`).
- **Wiring** — `cmd/claudomator/main.go`, `internal/cli/serve.go:60-78`, and `internal/cli/run.go:75-90` build the `*llm.Client` from config and register both `runners["local"]` and `pool.Classifier.LLM`.
- **GeminiRunner** (`internal/executor/gemini.go`) — kept and finished alongside as a separate concern. The shared backoff helper move (below) and the `LogPather` interface it already implements (`gemini.go:26`) are unaffected. Real subprocess invocation replacing the simulated stdout block at `gemini.go:107-116` is a follow-up commit, not gated by this change.

Shared utility move: `runWithBackoff` currently lives at `internal/executor/ratelimit.go:60`. Move it to a new tiny `internal/retry` package so both `internal/executor` and `internal/llm` use it. One-line change at the existing call site in `claude.go`.

## Migration phases

**Phase 1 — this pass. Classifier swap.** All the `internal/llm` + `internal/executor/local.go` + `Classifier` work above. Gated by config: if `[local_model].endpoint` is unset, behavior is unchanged. Net new files; no breaking changes to existing runners.

**Phase 2 — task elaboration.** `internal/api/elaborate.go:208-275` currently has Claude and Gemini paths. Add `elaborateWithLocal`; new try-order is local → claude → gemini, controlled by a `prefer_local_for_elaborate` config flag. `Server` (`internal/api/server.go:76`) gains an `llm *llm.Client` field passed via `NewServer`.

**Phase 3 — webhook CI summarization.** `createCIFailureTask` at `internal/api/webhook.go:154` builds task instructions from a hardcoded template. Add an optional summarization step calling `s.llm.Chat` over the fetched workflow logs to produce a tighter `instructions` body. Pure additive.

**Phase 4 — execution summary.** `extractSummary` (`internal/executor/summary.go:13`) is text-pattern based. Add `summarizeExecution(ctx, *llm.Client, stdoutPath) string` that synthesizes a summary when no `## Summary` section exists. Hook lives in `Pool.handleRunResult` at `executor.go:347-355`; pass `*llm.Client` through `Pool` construction.

## Critical files

**New:**
- `internal/llm/client.go` — `Client`, `Chat`, `ChatStream`, request/response types
- `internal/llm/client_test.go` — `httptest`-driven coverage
- `internal/executor/local.go` — `LocalRunner`
- `internal/executor/local_test.go` — runner tests with stub `*llm.Client`
- `internal/retry/backoff.go` — relocated `runWithBackoff`

**Modified:**
- `internal/executor/classifier.go` — add `LLM *llm.Client` field, route through it when set, keep Gemini fallback path
- `internal/executor/classifier_test.go` — add httptest-backed test
- `internal/executor/executor.go:428` — broaden `skipClassification` predicate
- `internal/executor/ratelimit.go` — remove `runWithBackoff` (moved); update import in `claude.go`
- `internal/task/task.go:30-43` — add `Temperature`, `MaxTokens` to `AgentConfig`
- `internal/config/config.go:18-52` — add `LocalModel` struct + field to `Config`
- `internal/storage/db.go:78-89` — two additive `ALTER` migrations; add `TokensIn`/`TokensOut` to `Execution` struct; update SELECT/INSERT/UPDATE SQL in same file
- `internal/cli/serve.go:60-78`, `internal/cli/run.go:75-90`, `cmd/claudomator/main.go` — build client, register runner, wire classifier

## Reuse, not reinvention

- `runWithBackoff` (`internal/executor/ratelimit.go:60`) → relocated and reused by `LocalLLMClient`
- `isRateLimitError`/`isQuotaExhausted` (`executor.go:271-283`) → emit compatible error strings from `LocalLLMClient` so Pool's existing rate-limit handling treats local 429/503 identically
- Stream-json envelope from `claude.go:600` parsing → `LocalRunner` writes the same envelope so `extractSummary` and `ParseChangestatFromFile` work unchanged
- Existing nullable `session_id`/`sandbox_dir` columns → no schema rework needed for non-coding runners
- `LogPather` interface (`executor.go:38`) → `LocalRunner` implements it for log path pre-population, just like `GeminiRunner` already does

## Verification

**Unit tests:**
- `internal/llm/client_test.go`: httptest server returns canned chat-completion JSON; assert `Chat` returns parsed `Content`, prompt/output tokens, model. Second test: SSE stream (data: lines, terminating `data: [DONE]`); assert `onDelta` called per chunk and final `ChatResponse` aggregated. Third: HTTP 429 with `Retry-After: 1` → assert one retry then success.
- `internal/executor/classifier_test.go`: httptest backend returning JSON-mode response → assert `Classification` parsed correctly. Existing mock-binary test stays for the Gemini fallback path.
- `internal/executor/local_test.go`: stub `*llm.Client` returning fixed text → `Run` writes correct stream-json envelope to `stdout.log`; verify `extractSummary` finds `## Summary` from that envelope.
- `go test -race ./...` passes (Pool reentrancy is the risk this design avoids; race detector would catch slips).

**Manual end-to-end against Ollama:**
1. `ollama pull llama3.1:8b && ollama serve`
2. Add to `~/.claudomator/config.toml`:
   ```toml
   [local_model]
   endpoint = "http://localhost:11434/v1"
   model = "llama3.1:8b"
   ```
3. `./claudomator serve` → submit a normal task → observe a single classification request hit Ollama (no `gemini` subprocess spawned) and a model selection logged at `executor.go:440`.
4. Submit a task with `agent.type = "local"`, `instructions = "Summarize: 2+2"`. Expect `READY`/`COMPLETED` execution, populated `stdout.log` with stream-json text deltas, `cost_usd = 0`, non-zero `tokens_out` in the `executions` row.
5. Stop Ollama → submit another task → classifier should fall back to `gemini` invocation (or fail with a rate-limit-style error if no Gemini binary present). Confirms the `endpoint == ""` and runtime-failure fallback paths both work.

**Build sanity:** `go build ./...` and `go test -race ./...` (CGo / `gcc` required per CLAUDE.md).

---

# Phase 1 — Focused Plan

This is the only phase we execute in this pass. Phases 2–4 will get their own focused plans when we start them; the sketches above are forward intent, not commitments.

## Phase 1 scope (what ships)

- New `internal/llm` package with `Client.Chat` and `Client.ChatStream`
- New `internal/executor/local.go` with `LocalRunner` implementing `Runner`
- New `internal/retry` package holding the relocated `runWithBackoff`
- Classifier (`internal/executor/classifier.go`) routes through `*llm.Client` when configured; Gemini-CLI fallback preserved
- Two additive `executions` migrations: `tokens_in`, `tokens_out`
- `AgentConfig` gains `Temperature *float64`, `MaxTokens int`
- `Config` gains `[local_model]` section (`endpoint`, `model`, `timeout_seconds`, `default_temperature`, `api_key`)
- `executor.go:428` `skipClassification` predicate broadens to all non-empty agent types
- Wiring in `cmd/claudomator/main.go`, `internal/cli/serve.go`, `internal/cli/run.go`

## Phase 1 explicit non-goals

- No changes to `internal/api/elaborate.go` (Phase 2)
- No changes to `internal/api/webhook.go` (Phase 3)
- No changes to `internal/executor/summary.go` summary-generation logic (Phase 4)
- No GeminiRunner completion work (cost parsing, sandbox, real subprocess) — separate parallel commit
- No frontend changes — UI still says "Auto / Claude / Gemini"; "Local" dropdown option deferred until token telemetry surfaces
- No capabilities interface on `Runner`
- No new `executions` columns beyond the two token counters

## Phase 1 task list (in execution order)

1. **Persist this plan to the workspace.** Copy `/root/.claude/plans/major-revision-we-re-going-quizzical-newell.md` to `docs/plans/local-oss-runner.md`. This is the durable record that lives with the codebase. Phase 2/3/4 focused plans will be appended to the same file when started.

2. **Create branch.** `git checkout -b claude/local-oss-model-agents-MEBqj` (already designated; create if it doesn't exist).

3. **`internal/retry/backoff.go`** — relocate `runWithBackoff` from `internal/executor/ratelimit.go:60`. Update the existing call site in `internal/executor/claude.go` to import from the new path. Keep all signature and behavior unchanged. Run `go build ./...` and `go test ./internal/executor/...` to confirm zero behavioral change.

4. **`internal/llm/client.go`** — implement the package. Types from the design:
   - `Client{Endpoint, Model, HTTPClient, APIKey, Logger}`
   - `ChatRequest{Model, Messages, Temperature, MaxTokens, ResponseJSON, Stream}`
   - `Message{Role, Content}`
   - `ChatResponse{Content, PromptTokens, OutputTokens, Model, FinishReason}`
   - `Chat(ctx, req)` — POSTs `/chat/completions`, wraps in `retry.RunWithBackoff`, maps 429/503 to `isRateLimitError`-compatible error strings
   - `ChatStream(ctx, req, onDelta)` — same endpoint with `stream: true`, parses SSE `data:` lines, calls `onDelta(text)` per chunk, terminates on `data: [DONE]`, aggregates final response

5. **`internal/llm/client_test.go`** — three tests:
   - Canned chat-completion JSON → assert `Chat` returns parsed `Content`, prompt/output tokens, model
   - SSE stream of `data:` lines terminated by `data: [DONE]` → assert `onDelta` called per chunk, final `ChatResponse` aggregated
   - HTTP 429 with `Retry-After: 1` → assert one retry then success

6. **`internal/storage/db.go:78-89`** — append two `ALTER TABLE executions ADD COLUMN` migrations for `tokens_in INTEGER` and `tokens_out INTEGER`. Add `TokensIn`, `TokensOut int64` to `Execution` struct. Update SELECT, INSERT, UPDATE SQL in the same file. Existing `isColumnExistsError` swallows duplicate-column errors so re-running is safe.

7. **`internal/task/task.go:30-43`** — add `Temperature *float64` and `MaxTokens int` to `AgentConfig` with appropriate yaml/json tags. Pointer for Temperature so 0 means "unset, use server default."

8. **`internal/config/config.go:18-52`** — add `LocalModel` struct (`Endpoint`, `Model`, `TimeoutSeconds`, `DefaultTemperature`, `APIKey`) and `LocalModel LocalModel` field on `Config`. `Default()` leaves `Endpoint` empty.

9. **`internal/executor/local.go`** — `LocalRunner` struct with `Client *llm.Client`, `Logger`, `LogDir`. Implement `Run(ctx, *task.Task, *storage.Execution) error`:
   - Build messages from `t.Agent.SystemPromptAppend` + `Instructions`
   - Call `Client.ChatStream` with `onDelta` writing `{"type":"assistant","message":{"content":[{"type":"text","text":"<delta>"}]}}` lines to `e.StdoutPath`
   - On completion, write a final `{"type":"result", ...}` line so existing parsers see a recognizable terminator
   - Set `e.TokensIn`, `e.TokensOut`, `e.CostUSD = 0`, `e.Status = "READY"`
   - Implement `LogPather` so log paths pre-populate consistently with other runners

10. **`internal/executor/local_test.go`** — runner tests with a stub `*llm.Client` (use a small interface or test-injected `HTTPClient`):
    - Stub returns fixed text containing a `## Summary` section
    - Assert `Run` writes correct stream-json envelope to `stdout.log`
    - Assert `extractSummary(stdoutPath)` (from `internal/executor/summary.go`) finds the summary
    - Assert `e.TokensOut > 0` and `e.CostUSD == 0`

11. **`internal/executor/classifier.go`** — add `LLM *llm.Client` field on `Classifier`. In `Classify`, when `c.LLM != nil`, use `LLM.Chat` with `ResponseJSON: true`, skip the markdown-fence cleanup. When nil, fall through to the existing Gemini-CLI subprocess path. Existing prompt template stays (already lists Claude+Gemini models, which is what the classifier still picks among).

12. **`internal/executor/classifier_test.go`** — add httptest-backed test for the LLM path. Existing mock-binary test (if present) stays for the Gemini fallback path.

13. **`internal/executor/executor.go:428`** — change `skipClassification := t.Agent.Type == "claude" || t.Agent.Type == "gemini"` to `skipClassification := t.Agent.Type != ""`. This generalizes correctly: any explicitly-set agent type skips selection; unset still goes through `pickAgent` + `Classifier`.

14. **Wire registration** in three files:
    - `cmd/claudomator/main.go` — build `*llm.Client` from `cfg.LocalModel` if `Endpoint != ""`, pass to pool construction
    - `internal/cli/serve.go:60-78` — register `runners["local"] = &executor.LocalRunner{...}`, set `pool.Classifier = &executor.Classifier{LLM: localClient, GeminiBinaryPath: cfg.GeminiBinaryPath}`
    - `internal/cli/run.go:75-90` — same pattern

15. **`go test -race ./...`** — full suite passes. The race detector is the safety net for the reentrancy-avoidance design.

16. **Manual smoke test against Ollama** — five steps documented in the Verification section above. Confirm the fallback path by stopping Ollama mid-session and watching classification fall back to Gemini.

17. **Commit and push** to `claude/local-oss-model-agents-MEBqj`. Single commit covering Phase 1, message in the form: `feat(executor): add LocalRunner and OpenAI-compat LLM client`. Body describes the two-layer split (Client + Runner), the classifier swap, and the config gating.

## Stop conditions for Phase 1

- All unit tests pass under `-race`
- `go build ./...` clean
- Smoke test against a running Ollama instance produces a `READY` execution with non-zero `tokens_out` and `cost_usd = 0`
- Smoke test with `[local_model]` empty produces unchanged behavior (Gemini classifier path, no LocalRunner registered)
- Branch pushed to remote

After Phase 1 lands, we stop and decide whether to begin Phase 2 (elaboration). At that point we'll write a Phase 2 focused plan in `docs/plans/local-oss-runner.md`.

---

# Post-epic follow-up: deep cleanup

After all four phases land, plan and execute a deep cleanup pass. Things noticed in flight that we deliberately did not chase mid-epic:

- **Sandbox/git tests fail in this environment** because `git commit` invokes a signing server that returns 400 ("missing source"). Affected: `TestSandboxCloneSource_*`, `TestSetupSandbox_*`, `TestTeardownSandbox_*`, `TestBlockedError_IncludesSandboxDir`, `TestClaudeRunner_Run_StaleSandboxDir_ClonesAfresh`. Fix: set `commit.gpgsign=false` in test setup so sandbox tests run hermetically.
- **`TestParseGeminiStream_ParsesStructuredOutput` is currently `t.Skip`** as a pre-existing gemini-stub gap. Either implement result-error/cost parsing in `parseGeminiStream` or delete the test until the stub is finished.
- **`TestPool_ActivePerAgent_DeletesZeroEntries` flakes** under `-race` when run with the full suite (passes in isolation and on `-count=3`). Likely goroutine-ordering in the `activePerAgent` map cleanup path. Audit dispatch/finish ordering.
- **`setupSandbox` test signature drift** was just fixed; audit other tests for similar staleness from prior refactors.
- **Pre-existing `executor` tests didn't compile on trunk** until the setupSandbox fix landed. Verify CI reality — is it green via something we're missing, or quietly broken?
- **GeminiRunner is still simulated** (`gemini.go:107-116`). Decide: finish it (real subprocess + cost parsing + sandbox) or delete it and leave only Claude + Local.
- **Frontend "Local" agent option** — UI dropdown still says "Auto / Claude / Gemini". Add Local once token telemetry has a place to render.
- **Audit `*_test.go` for `t.Skip` and other dormant breakage** before shipping more code on top.
- **`TestGeminiLogs_ParsedCorrectly`** in `internal/api` returns 404 from `GET /log` for a gemini execution — pre-existing on Phase 1 baseline. Some routing or log-path resolution mismatch specific to gemini executions. Likely related to the GeminiRunner stub status above.

Goal: clean `go test -race ./...` with zero skips and zero environmental failures on whatever platform CI runs on.

## Cleanup pass — DONE

All eight items in the cleanup queue above have been addressed in the post-epic cleanup commit. Summary of fixes:

- `gitSafe` now disables `commit.gpgsign` and `tag.gpgsign` so sandbox tests pass on hosts with surprise signing config; matching `safe.directory=*` literals in test helpers updated for parity.
- Real bug found and fixed: `setupSandbox(...)` callsites in `claude.go` used `sandboxDir, err := ...` which shadowed the outer variable. `BlockedError.SandboxDir` was always empty as a result; `TestBlockedError_IncludesSandboxDir` now passes for the right reason.
- `parseGeminiStream` now parses `result` events for `is_error`/`total_cost_usd` and returns errors/cost accordingly; `TestParseGeminiStream_ParsesStructuredOutput` is unskipped.
- `GeminiRunner.Run` propagates parsed cost to `Execution.CostUSD`.
- `TestGeminiLogs_ParsedCorrectly` test URL fixed (`/api/tasks/{id}/executions/{exec-id}/log` → `/api/executions/{id}/log`, matching the actual route).
- `TestPool_ActivePerAgent_DeletesZeroEntries` flake root-caused: `handleRunResult` was sending on `resultCh` before `execute()`'s deferred cleanup ran, so consumers could observe a zero-count map entry. Extracted `decActiveAgent(agentType, *cleaned)` helper, called explicitly before each `resultCh` send, defer becomes no-op via the cleaned flag. Verified clean over `-count=10` under `-race`.
- `TestSandboxCloneSource_FallsBackToOrigin` updated to use a local-FS origin URL, matching `sandboxCloneSource`'s actual semantics (it filters non-local URLs to avoid network clones).
- All bare repos in tests created with `git init --bare -b main` so `HEAD` symbolically points at `main` (not the default `master`), unblocking the `git log` queries the tests perform after pushing.

Test-suite state after cleanup: `go test -race ./...` is green across all packages with zero `t.Skip` calls and zero excluded tests.

Items not chased (deferred deliberately):
- **GeminiRunner is still simulated** (`gemini.go` `Run` writes hardcoded stream data instead of executing the binary). The result/cost parsing now exists, so finishing the runner is a smaller, contained change. Kept on the queue but doesn't block anything else.
- **Frontend "Local" agent option** — UI dropdown still says "Auto / Claude / Gemini". Pending token telemetry surface.
- **`storage.db.go:706` TODO comment** — minor logger plumbing nit. Skipping unless it blocks something.

## Deferred work — DONE

Follow-up commit closed the three deferred items above:

- `GeminiRunner.execOnce` now invokes the actual `gemini` binary via `exec.CommandContext`, mirroring the `ClaudeRunner` pattern: pipe stdout to `parseGeminiStream`, kill the process group on context cancel, capture stderr to file, surface exit codes. Hardcoded simulation removed.
- Test infrastructure bug uncovered and fixed in passing: the mock gemini script in `testServerWithGeminiMockRunner` was using `"\``json\`"` which bash interpreted as command substitution, so the script always produced empty output. Switched to a single-quoted heredoc. The bug was masked previously because the runner ignored the script entirely.
- Frontend `index.html` dropdown gains a `Local` option. No JS branching changes needed — the value flows through to `agent.type` verbatim and downstream display reads the type string as-is.
- Stale debug-comment scaffolding around `storage.db.go:706` deleted.

---

# Phase 2 — Focused Plan (Elaboration)

## Phase 2 scope

`internal/api/elaborate.go` currently has two paths: Claude and Gemini. Add a third (local) and make it the preferred path when local model is configured. Try-order: local → claude → gemini, with each next attempt only on hard failure of the prior.

Second-cheapest, second-highest-volume LLM call after classification (one per task creation, sub-second target). Routing through local removes another cost line and lets elaboration work offline.

## What ships

- `Server` (`internal/api/server.go`) gains `llm *llm.Client` threaded through `NewServer`
- `internal/api/elaborate.go` gains `elaborateWithLocal(ctx, *llm.Client, input string) (string, error)`
- Dispatch in `Server.elaborate` reorders to: local → claude → gemini, gated by `PreferLocalForElaborate`
- `Config` gains `PreferLocalForElaborate bool`, defaulted true when `LocalModel.Endpoint != ""`
- Wiring in `internal/cli/serve.go` passes the LLM client into `NewServer`

## Explicit non-goals

- No prompt rework — reuse existing elaboration prompt template verbatim
- No streaming the response into SSE/WebSocket (one-shot RPC)
- No changes to webhook (Phase 3) or summary (Phase 4)
- No UI changes — `/elaborate` endpoint signature stays the same

## Task list

1. Read `internal/api/elaborate.go` end-to-end: dispatch site, Claude path, Gemini path, prompt template
2. Read `internal/api/server.go` `NewServer` signature and `Server` fields
3. Thread `llm *llm.Client` through `NewServer` and update callers (`internal/cli/serve.go`)
4. Implement `elaborateWithLocal` using the same prompt template as Claude/Gemini, returning `(string, error)`
5. Add `PreferLocalForElaborate bool` to `config.Config`, default true when local endpoint configured
6. Reorder dispatch: `if s.llm != nil && cfg.PreferLocalForElaborate { try local; else fall through }` then existing claude → gemini chain
7. httptest-based unit test for `elaborateWithLocal`
8. Dispatch fallback test: local fails → claude attempted
9. `go build ./... && go test -race ./...`
10. Commit Phase 2 on the same branch
11. Push

## Stop conditions

- Tests green under `-race`
- `prefer_local_for_elaborate=false` short-circuits to Claude path (preserves current behavior when user opts out)
- Local-failure fallback to Claude verified by test
- Branch pushed

---

# Phase 3 — Focused Plan (CI Failure Triage)

## Scope adjustment from the original sketch

The original Phase 3 sketch was "summarize fetched workflow logs". Fetching GitHub workflow logs requires authenticated GitHub API access (PAT or app token), which is out of scope and would balloon this phase into a GitHub-integration epic. Narrow Phase 3 to **project-context-based triage** — use signals we already have without new dependencies.

What we have at webhook time: `repository.full_name`, `branch`, `SHA`, `check_name`, `html_url`, plus (when matched) a project directory we can read locally.

What the LLM can do with that: produce a tighter, project-aware investigation prompt that names the recent commits, points at suspect files, and gives the agent better starting hypotheses than the current generic 4-step template.

## What ships

- New helper `enrichCIInstructions(ctx, *llm.Client, ciContext, projectDir, fallback string) string`
- `createCIFailureTask` calls it when `s.llm != nil`; on any error, returns the existing hardcoded template (truly additive — webhook tests for the no-LLM path stay passing unchanged)
- Helper uses: recent git log (last 5 commits from project_dir if it's a git repo), CLAUDE.md content if present, plus all webhook metadata
- One configuration knob: reuse `LocalModel.UseForElaborate()` semantics? No — separate flag. Add `LocalModel.PreferForCITriage *bool` defaulting true when endpoint set, opt-out symmetrical with `PreferForElaborate`.

## Explicit non-goals

- No GitHub API integration (no log fetching, no auth)
- No changes to webhook routing, signature validation, project matching, or task scheduling
- No changes to the task schema (instructions stays a string)
- No streaming — one-shot LLM call, sub-2s target

## Task list

1. Add `LocalModel.PreferForCITriage *bool` and `UseForCITriage()` helper, mirroring elaborate
2. Add `enrichCIInstructions` in `internal/api/webhook.go` (or `webhook_llm.go` if it grows)
3. Read recent git log from project_dir via `git log --oneline -n 5` (best-effort, swallow errors)
4. Read CLAUDE.md from project_dir (best-effort)
5. Build a focused prompt: "CI just failed on this project. Here's metadata + recent commits + project context. Produce a 6-12 line investigation plan that names suspect files/commits when you can, otherwise gives concrete starting steps." Plain text out, not JSON.
6. Update `createCIFailureTask` to call enrichment when `s.llm != nil && cfg.LocalModel.UseForCITriage()`. Note: the server doesn't currently see the cfg directly — pass the gate as a setter `SetCITriageEnabled(bool)` from serve.go, OR (simpler) just gate on `s.llm != nil` and let users opt out by not calling `SetLLM`. Going with the simpler option since it matches the elaborate split: same `s.llm` for both, server doesn't track per-feature gates.
7. Wiring in `serve.go`: when `cfg.LocalModel.Endpoint != ""`, `SetLLM(localClient)`. (Already done in Phase 2.) Per-feature opt-out via the `PreferFor*` config flags is read at wire time and could conditionally not call SetLLM, but that gives elaborate/CI an all-or-nothing toggle which is wrong. Better: introduce a separate setter `SetLLMForCITriage` so each feature can be controlled independently.

   Actually, simplest and cleanest: keep one `SetLLM` setter, and gate each call site (`elaborateWithLocal`, `enrichCIInstructions`) by reading a per-feature config flag passed via separate setters. That's getting fiddly. Step back.

   **Final decision:** the per-feature gate doesn't pull its weight in Phase 3. Ship it as: `s.llm != nil` enables both elaborate and CI triage. Users who want elaborate-yes/CI-triage-no can revisit later. The deferred per-feature toggles get added in the post-epic cleanup along with token telemetry — there's no real demand for the asymmetric case yet.

   Revised: drop `PreferForCITriage` entirely; ship a simpler thing.
8. Tests:
   - `enrichCIInstructions` with stub LLM returns the LLM body
   - `enrichCIInstructions` with failing LLM returns `fallback` unchanged
   - `enrichCIInstructions` includes recent git log when project_dir is a real git repo (use `t.TempDir()` + `git init` + a commit)
   - Webhook handler test: LLM configured → instructions reflect LLM output
   - Webhook handler test: LLM not configured → instructions match the existing template byte-for-byte (regression guard)
9. `go build ./... && go test -race ./...`
10. Commit as Phase 3 on the same branch
11. Push

## Stop conditions

- All new tests green under `-race`
- Existing webhook tests pass byte-for-byte when LLM not configured
- Build clean; pushed

---

# Phase 4 — Focused Plan (Execution Summary)

## Scope

`extractSummary` in `internal/executor/summary.go` is text-pattern based: it returns the body following the last `## Summary` heading in any assistant text block. When the agent didn't write one, summary stays empty. This is fine for Claude (which is prompted to write a summary), but not for arbitrary local-runner outputs, and not for cases where Claude exits early or hits a budget cap before the summary section.

Phase 4 adds an LLM-based fallback: when `extractSummary` returns "" and the Pool has an LLM client, synthesize a 2-4 sentence summary from the tail of the stdout log.

## What ships

- New `synthesizeSummary(ctx, *llm.Client, stdoutPath string) string` in `internal/executor/summary.go`. Reads the last ~16 KB of the stdout log, strips stream-json envelopes to extract just the text content, and asks the LLM to summarize.
- New `LLM *llm.Client` field on `executor.Pool` (wired identically to `Classifier.LLM` in Phase 1).
- Hook into `Pool.handleRunResult` at the existing summary block: after `extractSummary` returns "", call `synthesizeSummary` if `p.LLM != nil`.
- Wiring in `cmd/claudomator/main.go` (none — main.go is a thin wrapper), `internal/cli/serve.go`, `internal/cli/run.go`: pass `localClient` to Pool.

## Explicit non-goals

- No changes to the Claude prompt or the `## Summary` extraction (that path stays primary)
- No changes to the storage schema (summary is already a `tasks.summary` TEXT column via `UpdateTaskSummary`)
- No streaming the summary — one-shot 2-4 sentence completion
- No new config knob for "prefer local for summary" — same `s.llm`/`p.LLM` gate applies; users opt out by not setting LocalModel.Endpoint
- No retroactive backfill of summaries on existing executions

## Task list

1. Add `LLM *llm.Client` field on `executor.Pool` (matches the `Classifier` pattern from Phase 1)
2. Implement `synthesizeSummary(ctx, *llm.Client, stdoutPath) string` in `internal/executor/summary.go`. Reads last ~16 KB, parses each line as a stream-json event, joins the assistant text content, calls `Chat` with a 6-second timeout asking for 2-4 sentences plain text. Returns "" on any error so the caller's existing empty-summary path stays unchanged.
3. Modify `Pool.handleRunResult`: after `extractSummary` returns empty, if `p.LLM != nil`, try `synthesizeSummary(ctx, p.LLM, exec.StdoutPath)`. If it returns non-empty, persist via `UpdateTaskSummary`.
4. Wire `Pool.LLM = localClient` in `internal/cli/serve.go` and `internal/cli/run.go`
5. Tests in `internal/executor/summary_test.go` (or a new file):
   - `synthesizeSummary` with stub LLM: stdout.log containing stream-json text → assistant content extracted → LLM called → returned summary
   - `synthesizeSummary` with no `## Summary` heading anywhere → still produces synthesized summary
   - `synthesizeSummary` LLM failure → returns ""
   - `synthesizeSummary` empty stdout file → returns ""
   - Pool integration test: LocalRunner produces a stdout with no `## Summary` section, Pool's LLM is set, after handleRunResult the task's summary is non-empty
6. `go build ./... && go test -race ./...`
7. Commit as Phase 4 on the branch
8. Push

## Stop conditions

- New tests green under `-race`
- Existing tests unchanged (the extractSummary primary path keeps winning whenever a `## Summary` heading exists)
- Build clean; pushed
- Epic complete: `## Local OSS Models as a Third Runner` shipped end-to-end

After Phase 4 lands, execute the post-epic deep cleanup using the queue at the top of this section.