# Claudomator: Development Narrative

This document is a chronological engineering history of the Claudomator project,
reconstructed from the git log, ADRs, and source code.

---

## 1. Initial commit — core scaffolding (2e2b218)

The project started with a single commit that established the full skeleton:
task model, executor, API server, CLI, storage layer, and reporter. The Go module
was `github.com/thepeterstone/claudomator`. The initial `Task` struct had a
`ClaudeConfig` field (later renamed to `AgentConfig`) holding the model,
instructions, `working_dir`, budget, permission mode, and tool lists. SQLite was
chosen as the storage backend (see ADR-001). The executor pool used a bounded
goroutine model. The API server was plain `net/http` with no external framework.
The CLI was built with Cobra.

## 2. JSON tags, module rename, gitignore (8ee1fb5, 46ba3f5, 2bf317d)

Early housekeeping: added JSON struct tags to all exported types, renamed the Go
module to its final identifier, and set up the `.gitignore` to exclude the compiled
binary and local Claude settings.

## 3. Verbose flag, logs CLI command (0377c06, f27d4f7)

Added `--verbose` to the Claude subprocess invocation and a `logs` CLI subcommand
for tailing execution output.

## 4. Embedded web UI and HTTP wiring (135d8eb)

The first web UI was embedded into the binary using `go:embed`. This made the
binary fully self-contained: no separate static file server was needed.

## 5. CLAUDE.md, clickable fold, subtask support (bdcc33f, 3881f80, 704d007)

Added the project-level `CLAUDE.md` guidance document. Added a clickable fold to
the web UI to expand hidden completed/failed tasks. Added `parent_task_id` to the
`Task` struct, `ListSubtasks` to storage, and `UpdateTask` — the foundational
subtask plumbing.

## 6. Dependency waiting and planning preamble (f527972)

The executor gained dependency waiting: tasks with `depends_on` now block in a
polling loop until all dependencies reach `COMPLETED`. Any dependency entering a
terminal failure state (`FAILED`, `TIMED_OUT`, `CANCELLED`, `BUDGET_EXCEEDED`)
immediately fails the waiting task.

The planning preamble was also introduced here — a system prompt prefix injected
into every task's instructions that explains to the agent how to write question
files, how to break tasks into subtasks via the `claudomator` CLI, and how to
commit all changes in git sandboxes.

## 7. Elaborate, logs-stream, templates, subtask-list endpoints (74cc740)

The API gained several new endpoints:
- `POST /api/elaborate` — calls Claude to expand a brief task description into
  structured YAML.
- `GET /api/executions/{id}/stream` — live-streams the execution log.
- `GET /api/templates` / `POST /api/templates` — task template CRUD (later removed).
- `GET /api/tasks/{id}/subtasks` — lists subtasks for a parent task.

## 8. Web UI: tabs, new task modal, templates panel (e8d1b80)

The web UI got a tabbed layout (Running / Done / Templates), a modal for creating
new tasks with AI-drafted instructions, and a templates panel. This was the first
version of the UI that matched the current design.

## 9. READY state for human-in-the-loop review (6511d6e)

A critical design point: when a top-level task's runner exits successfully, the
task does not go straight to `COMPLETED`. Instead it transitions to `READY`,
where it waits for the operator to review the agent's output and explicitly
accept or reject it. `READY → COMPLETED` requires `POST /api/tasks/{id}/accept`.
`READY → PENDING` (for re-running) requires `POST /api/tasks/{id}/reject`.

This is specific to top-level tasks. Subtasks (`parent_task_id != ""`) bypass READY
and go directly to `COMPLETED` — only the root task requires human sign-off.
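
As a rough sketch of the rule at this point in the history, the post-run state depends only on whether the task has a parent. The `Task` struct and `stateAfterSuccess` below are simplified, hypothetical stand-ins; section 32 later refines the behaviour for parents that have subtasks.

```go
package main

import "fmt"

// Task is a reduced stand-in for the real struct; only the field the rule
// depends on is modelled here.
type Task struct {
	ID           string
	ParentTaskID string
}

// stateAfterSuccess returns the state a task enters when its runner exits
// cleanly: subtasks complete immediately, root tasks wait for human review.
// (Hypothetical helper name.)
func stateAfterSuccess(t Task) string {
	if t.ParentTaskID != "" {
		return "COMPLETED"
	}
	return "READY"
}

func main() {
	fmt.Println(stateAfterSuccess(Task{ID: "root"}))                        // READY
	fmt.Println(stateAfterSuccess(Task{ID: "child", ParentTaskID: "root"})) // COMPLETED
}
```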

## 10. Fix working_dir failures, hardcoded /root removed (3962597)

Early deployments hardcoded `/root` as the base path for `working_dir`. This was
removed. `working_dir` is now validated to exist before the subprocess starts.

## 11. Scripts, debug-execution, deploy (2bbae74, f7c6de4)

Added the `scripts/` directory with `debug-execution` (inspects a specific
execution's logs) and `deploy` (builds and deploys the binary to the production
server). Added a CLI `start` command and the `version` package.

## 12. Rescue from recovery branch — question/answer, rate limiting, start-next-task (cf83444)

A batch of features rescued from a detached-work branch:
- **Question/answer flow (`BLOCKED` state)**: agents can write a `question.json`
  file before exiting. The pool detects this and transitions the task to `BLOCKED`,
  storing the question for the user. `POST /api/tasks/{id}/answer` resumes the
  Claude session with the user's answer injected as the next message.
- **Rate limiting**: the pool tracks which agents are rate-limited and when.
  `isRateLimitError` and `isQuotaExhausted` distinguish transient throttles from
  5-hour quota exhaustion. The per-agent `rateLimited` map stores the deadline.
- **Start-next-task script**: a shell script that picks the highest-priority pending
  task and starts it.

## 13. Accept/Reject for READY tasks, Start Next button in UI (9e790e3)

The web UI gained explicit Accept/Reject buttons for tasks in the `READY` state
and a "Start Next" button in the header that triggers the `start-next-task` script.

## 14. Stream-level failure detection when claude exits 0 (4c0ee5c)

Claude can exit 0 even when the task actually failed — for example when the
permission mode denies a tool_use and Claude exits politely. `parseStream` was
updated to detect `is_error: true` in the result message and
`tool_result.is_error: true` with permission-denial text, returning an error in
both cases so the task goes to `FAILED` rather than silently succeeding.

## 15. Persist log paths at CreateExecution time (f8b5f25)

Previously, `StdoutPath`, `StderrPath`, and `ArtifactDir` were only written to the
execution record at `UpdateExecution` time (after the subprocess finished). This
prevented live log tailing. Introduced the `LogPather` interface: runners that
implement `ExecLogDir(execID)` allow the pool to pre-populate paths before calling
`CreateExecution`, making them available for streaming before the process ends.

## 16. bypassPermissions as executor default (a33211d)

`permission_mode` defaults to `bypassPermissions` when not set in the task YAML.
This was a deliberate trade-off: unattended automation needs to proceed without
tool-use confirmation prompts. Operators can override per-task via `permission_mode`.

## 17. Cancel endpoint and pool cancel mechanism (3672981)

`POST /api/tasks/{id}/cancel` was implemented. The pool maintains a `cancels` map
from task ID to context cancel functions. Cancellation sends `SIGKILL` to the
entire process group (via `syscall.Kill(-pgid, syscall.SIGKILL)`) to reap MCP
servers and bash children that the claude subprocess spawned.

## 18. BLOCKED state, session resume, fix: persist session_id (7466b17, 40d9ace)

The full BLOCKED cycle was wired end-to-end:
1. Agent writes `question.json` to `$CLAUDOMATOR_QUESTION_FILE` and exits.
2. Runner detects the file and returns `*BlockedError`.
3. Pool transitions task to `BLOCKED` and stores the question JSON.
4. User answers via `POST /api/tasks/{id}/answer`.
5. Pool calls `SubmitResume` with a new `Execution` carrying `ResumeSessionID`
   and `ResumeAnswer`.
6. Runner invokes `claude --resume <session-id> -p <answer>`.

A bug was found and fixed: `session_id` was not persisted in `UpdateExecution`,
causing the BLOCKED → answer → resume cycle to fail because `GetLatestExecution`
returned no session ID.

## 19. context.Background for resume execution; CANCELLED→QUEUED restart (7d4890c)

Resume executions now use `context.Background()` instead of inheriting a potentially
stale context. `CANCELLED → QUEUED` was added as a valid transition so cancelled
tasks can be manually restarted.

## 20. git sandbox execution, project_dir rename (1f36e23)

The `working_dir` field was renamed to `project_dir` across all layers (task YAML,
storage, API, UI). When `project_dir` is set, the runner no longer executes
directly in that directory. Instead it:

1. Detects whether `project_dir` is a git repo (initialising one if not).
2. Clones the repo into `/tmp/claudomator-sandbox-*` (using `--no-hardlinks`
   to avoid permission issues with mixed-owner `.git/objects`).
3. Runs the agent in the sandbox clone.
4. After the agent exits, verifies no uncommitted changes remain and pushes
   new commits to the canonical bare repo.
5. Removes the sandbox.

On BLOCKED, the sandbox is preserved so the agent can resume where it left off
in the same working tree.

Concurrent push conflicts (two sandboxes pushing at the same time) are handled
by a fetch-rebase-retry sequence.

## 21. Storage: enforce valid state transitions in UpdateTaskState (8777bf2)

`storage.DB.UpdateTaskState` now calls `task.ValidTransition` before writing. If
the transition is not allowed by the state machine, the function returns an error
and no write occurs. This is the enforcement point for the state machine invariants.
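
A sketch of validate-before-write, using an in-memory map in place of SQLite. The transition table lists only transitions that this document mentions explicitly and is deliberately incomplete; the `DB` type here is a hypothetical stand-in for `storage.DB`.

```go
package main

import "fmt"

type State string

// validTransitions is a partial table: only transitions named in this
// narrative are included. The real state machine has more entries.
var validTransitions = map[State]map[State]bool{
	"READY":     {"COMPLETED": true, "PENDING": true},
	"CANCELLED": {"QUEUED": true},
	"BLOCKED":   {"READY": true},
}

// ValidTransition reports whether from -> to is allowed.
func ValidTransition(from, to State) bool {
	return validTransitions[from][to]
}

// DB is an in-memory stand-in for the storage layer.
type DB struct {
	states map[string]State
}

// UpdateTaskState rejects disallowed transitions before any write happens.
func (db *DB) UpdateTaskState(id string, to State) error {
	from, ok := db.states[id]
	if !ok {
		return fmt.Errorf("task %s not found", id)
	}
	if !ValidTransition(from, to) {
		return fmt.Errorf("invalid transition %s -> %s", from, to)
	}
	db.states[id] = to
	return nil
}

func main() {
	db := &DB{states: map[string]State{"t1": "READY"}}
	fmt.Println(db.UpdateTaskState("t1", "QUEUED"))    // invalid transition READY -> QUEUED
	fmt.Println(db.UpdateTaskState("t1", "COMPLETED")) // <nil>
}
```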

## 22. Executor internal dispatch queue; remove at-capacity rejection (2cf6d97)

The previous pool rejected `Submit` when all slots were taken. This was replaced
with an internal `workCh` channel and a `dispatch` goroutine: tasks submitted
while the pool is at capacity are buffered in the channel and picked up as soon
as a slot opens. `Submit` now only returns an error if the channel itself is full
(which requires an enormous backlog).

## 23. API hardening — WebSocket auth, per-IP rate limiter, script registry (363fc9e, 417034b, 181a376)

Several API reliability improvements:
- WebSocket connections now require an API token (if `SetAPIToken` was called) and
  are capped at a configurable maximum number of clients. A ping/pong keepalive
  prevents stale connections from accumulating.
- A per-IP rate limiter was added to the `/api/elaborate` and `/api/validate`
  endpoints to prevent abuse.
- The scripts endpoints were collapsed into a generic `ScriptRegistry`: instead of
  individual handlers per script, a single handler dispatches to registered scripts
  by name.

## 24. API: extend executions and log streaming endpoints (7914153)

`GET /api/executions` gained filtering and sorting. `GET /api/executions/{id}/logs`
was added for fetching completed log files. Live streaming via SSE and the log
tail endpoint were polished.

## 25. CLI: newLogger, shared HTTP client, report command (1ce83b6)

CLI utilities consolidated: a shared logger constructor (`newLogger`), a shared
HTTP client, a default server URL (`http://localhost:8484`). Added the `report`
CLI subcommand for fetching execution summaries from the server.

## 26. Generic agent architecture — transition from Claude-only (306482d to f2d6822)

This was a major refactor over several commits:
1. `ClaudeConfig` was renamed to `AgentConfig` with a new `Type` field (`"claude"`,
   `"gemini"`, etc.).
2. `Pool` was changed from holding a single `ClaudeRunner` to holding a
   `map[string]Runner` — one runner per agent type.
3. `GeminiRunner` was implemented, mirroring `ClaudeRunner` but invoking the
   `gemini` CLI.
4. The storage layer, API handlers, elaborate/validate endpoints, and all tests
   were updated to use `AgentConfig`.
5. The web UI was updated to expose agent type selection.

## 27. Gemini-based task classification and explicit load balancing (406247b)

`Classifier` and `pickAgent` were introduced to automate agent and model selection:

- **`pickAgent(SystemStatus)`** — explicit load balancing: picks the available
  (non-rate-limited) agent with the fewest active tasks. Falls back to fewest-active
  if all agents are rate-limited.
- **`Classifier`** — calls the Gemini CLI with a meta-prompt asking it to pick
  the best model for the task. This is intentionally model-picks-model: use a fast,
  cheap classifier to avoid wasting expensive tokens.

After this commit the flow is: `execute()` → pick agent → call classifier → set
`t.Agent.Type` and `t.Agent.Model` → dispatch to runner.
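
The load-balancing rule can be sketched as below. `pickAgent` and `SystemStatus` are named in the narrative, but the fields of `SystemStatus` and the alphabetical tie-break are assumptions made so the example is deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// SystemStatus is a guess at the shape of the real type.
type SystemStatus struct {
	ActiveTasks map[string]int  // running task count per agent type
	RateLimited map[string]bool // agents currently rate-limited
}

// pickAgent returns the non-rate-limited agent with the fewest active tasks,
// falling back to fewest-active overall when every agent is rate-limited.
// Ties are broken alphabetically (an assumption).
func pickAgent(s SystemStatus) string {
	names := make([]string, 0, len(s.ActiveTasks))
	for name := range s.ActiveTasks {
		names = append(names, name)
	}
	sort.Strings(names)

	pick := func(skipLimited bool) string {
		best := ""
		for _, n := range names {
			if skipLimited && s.RateLimited[n] {
				continue
			}
			if best == "" || s.ActiveTasks[n] < s.ActiveTasks[best] {
				best = n
			}
		}
		return best
	}

	if agent := pick(true); agent != "" {
		return agent
	}
	return pick(false)
}

func main() {
	status := SystemStatus{
		ActiveTasks: map[string]int{"claude": 2, "gemini": 1},
		RateLimited: map[string]bool{"gemini": true},
	}
	fmt.Println(pickAgent(status)) // claude
}
```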

## 28. ADR-003: Security Model (93a4c85)

The security model was documented formally: no auth, permissive CORS, `bypassPermissions`
as default, and the known risk inventory (see `docs/adr/003-security-model.md`).

## 29. Various web UI improvements (91fd904, 7b53b9e, 560f42b, cdfdc30)

Running tasks became the default view. A "Running view" showing currently running
tasks alongside the 24h execution history was added. Agent type and model were
surfaced on running task cards. The Done/Interrupted tabs were filtered to 24h.

## 30. Quota exhaustion detection from stream (076c0fa)

Previously, quota exhaustion (the 5-hour usage limit) was treated identically to
generic failures. `isQuotaExhausted` was introduced to distinguish it: quota
exhaustion maps to `BUDGET_EXCEEDED` and sets a 5-hour rate-limit deadline on the
agent, rather than failing the task with a generic error.

## 31. Sandbox fixes — push via bare repo, fetch/rebase (cfbcc7b, f135ab8, 07061ac)

The sandbox teardown strategy was revised: instead of pushing directly into the
working copy (which fails for non-bare repos), the sandbox pushes to a bare repo
(`remote "local"` or `remote "origin"`) and the working copy is pulled separately
by the developer. This avoids permission errors from mixed-owner `.git/objects`.
The `--no-hardlinks` clone flag was added to prevent object sharing.

## 32. BLOCKED→READY for parent tasks with subtasks (441ed9e, c8e3b46)

When a top-level task exits the runner successfully but has subtasks, it transitions
to `BLOCKED` (waiting for subtasks to finish) rather than `READY`. A new
`maybeUnblockParent` function is called every time a subtask completes: if all
siblings are `COMPLETED`, the parent transitions `BLOCKED → READY` and is
presented for operator review.
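
A simplified sketch of the unblock check, with an in-memory store in place of the database. `maybeUnblockParent` and `ListSubtasks` are named in the narrative; the signatures, the `Store` type, and the guard against parents with no subtasks are assumptions.

```go
package main

import "fmt"

type Task struct {
	ID           string
	ParentTaskID string
	State        string
}

// Store is an in-memory stand-in for the storage layer.
type Store struct {
	tasks map[string]*Task
}

// ListSubtasks returns all tasks whose parent is parentID.
func (s *Store) ListSubtasks(parentID string) []*Task {
	var out []*Task
	for _, t := range s.tasks {
		if t.ParentTaskID == parentID {
			out = append(out, t)
		}
	}
	return out
}

// maybeUnblockParent moves a BLOCKED parent to READY once every subtask is
// COMPLETED. It reports whether the transition happened.
func maybeUnblockParent(s *Store, parentID string) bool {
	parent, ok := s.tasks[parentID]
	if !ok || parent.State != "BLOCKED" {
		return false
	}
	subs := s.ListSubtasks(parentID)
	if len(subs) == 0 {
		return false // assumption: a parent blocked on a question is left alone
	}
	for _, st := range subs {
		if st.State != "COMPLETED" {
			return false
		}
	}
	parent.State = "READY"
	return true
}

func main() {
	s := &Store{tasks: map[string]*Task{
		"p":  {ID: "p", State: "BLOCKED"},
		"c1": {ID: "c1", ParentTaskID: "p", State: "COMPLETED"},
		"c2": {ID: "c2", ParentTaskID: "p", State: "RUNNING"},
	}}
	fmt.Println(maybeUnblockParent(s, "p")) // false: c2 is still running
	s.tasks["c2"].State = "COMPLETED"
	fmt.Println(maybeUnblockParent(s, "p")) // true: parent is now READY
}
```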

## 33. Stale RUNNING task recovery on server startup (9159572)

`Pool.RecoverStaleRunning()` was added and called from `cli.serve`. It queries for
tasks still in `RUNNING` state (left over from a previous server crash) and marks
them `FAILED`, closing their open execution records. This prevents stuck tasks
after server restarts.

## 34. API: configurable mockRunner, async error-path tests (b33566b)

The `api` test suite was hardened with a configurable `mockRunner` that can be
injected into the test server. Async error paths (runner returns an error, DB
update fails mid-execution) are now exercised in tests.

## 35. Storage: missing indexes, ListRecentExecutions tests, DeleteTask atomicity (8b6c97e, 3610409)

Several storage correctness fixes:
- `idx_tasks_state`, `idx_tasks_parent_task_id`, `idx_executions_status`,
  `idx_executions_task_id`, and `idx_executions_start_time` indexes were added.
- `ListRecentExecutions` had an off-by-one that caused it to miss recent executions;
  tests were added to catch this.
- `DeleteTask` was made atomic using a recursive CTE to delete the task and all
  its subtasks in a single transaction.

## 36. API: validate ?state= param, standardize operation response shapes (933af81)

`GET /api/tasks?state=XYZ` now validates the state value. All mutating operation
responses (`/run`, `/cancel`, `/accept`, `/reject`, `/answer`) were standardised
to return `{"status": "ok"}` on success.

## 37. Re-classify on manual restart; handleRunResult extraction (0676f0f, 7d6943c)

Tasks that are manually restarted (from `FAILED`, `CANCELLED`, etc.) now go through
classification again so they pick up the latest agent/model selection logic. The
post-run error classification block was extracted into `handleRunResult` — a shared
helper called by both `execute` and `executeResume` — eliminating 60+ lines of
duplication.

## 38. Legacy Claude field removed (b4371d0, a782bbf)

The last remnants of the original `ClaudeConfig` type and backward-compat `working_dir`
shim were removed. The schema is now fully generic.

## 39. Kill-goroutine safety documentation, goroutine-leak test (3b4c50e)

A documented invariant was added to the `execOnce` goroutine that kills the
subprocess process group: it cannot block indefinitely. Tests were added to verify
no goroutine leak occurs when a task is cancelled.

## 40. Rate-limit avoidance in classifier; model list updates (8ec366d, fc1459b)

The classifier call is now skipped when the selected agent is rate-limited,
avoiding a redundant Gemini API call for an agent already known to be unavailable.
The model list was updated to Claude 4.x (`claude-sonnet-4-6`, `claude-opus-4-6`,
`claude-haiku-4-5-20251001`) and current Gemini models (`gemini-2.5-flash-lite`,
`gemini-2.5-flash`, `gemini-2.5-pro`).

## 41. Map leak fixes — activePerAgent and rateLimited (7c7dd2b)

Two map leak bugs were fixed in the pool:
- `activePerAgent[agentType]` was decremented but never deleted when the count hit
  zero, so inactive agents accumulated as dead entries.
- Expired `rateLimited[agentType]` entries were not deleted, so the map grew
  without bound over long runs.

## 42. Sandbox teardown: remove working-copy pull, retry push on concurrent rejection (5c85624)

Sandbox teardown no longer runs `git pull` in the working copy (the pull was
failing due to mixed-owner object dirs). The retry-push-on-rejection path was
tightened to detect `"fetch first"` and `"non-fast-forward"` as the rejection
signals.

## 43. Explicit load balancing separated from classification (e033504)

Previously the `Classifier` both picked the agent and selected the model. This was
split: `pickAgent` is deterministic code that picks the agent from the registered
runners using the load-balancing algorithm. The `Classifier` only picks the model
for the already-selected agent. This makes load balancing reliable and fast even
when the Gemini classifier is unavailable.

## 44. Session ID fix on second block-and-resume cycle (65c7638)

A bug was found where the second BLOCKED→answer→resume cycle passed the wrong
`--resume` session ID to Claude. The fix ensures that resume executions propagate
the original session ID rather than the new execution's UUID.

## 45. validTransitions promoted to package-level var (3226af3)

`validTransitions` was promoted to a package-level variable in `internal/task/task.go`
for clarity and potential reuse outside the package. ADR-002 was updated to reflect
the current state machine including the `BLOCKED→READY` transition for parent tasks.

---

## Feature Summary (current state)

| Feature | Status |
|---|---|
| Task YAML parsing, batch files | Done |
| SQLite persistence | Done |
| REST API (CRUD + lifecycle) | Done |
| WebSocket real-time events | Done |
| Claude subprocess execution | Done |
| Gemini subprocess execution | Done |
| Explicit load balancing (pickAgent) | Done |
| Gemini-based model classification | Done |
| BLOCKED / question-answer / resume | Done |
| git sandbox isolation | Done |
| Subtask creation and unblocking | Done |
| READY state / human accept-reject | Done |
| Rate-limit and quota tracking | Done |
| Stale RUNNING recovery on startup | Done |
| Per-IP rate limiter on elaborate | Done |
| Web UI (PWA) | Done |
| Push notifications (PWA) | Planned |

--- 2026-03-16T00:56:20Z ---
Convert sudoku to Rust

--- 2026-03-16T01:14:27Z ---
For claudomator tasks that are ready, check the deployed server version against their fix commit

--- 2026-03-16T01:17:00Z ---
For every claudomator task that is ready, display on the task whether the currently deployed server includes the commit which fixes that task