diff options
| -rw-r--r-- | docs/RAW_NARRATIVE.md | 390 | ||||
| -rw-r--r-- | docs/adr/004-multi-agent-routing-and-classification.md | 107 | ||||
| -rw-r--r-- | docs/adr/005-sandbox-execution-model.md | 130 |
3 files changed, 627 insertions, 0 deletions
diff --git a/docs/RAW_NARRATIVE.md b/docs/RAW_NARRATIVE.md new file mode 100644 index 0000000..a9c1492 --- /dev/null +++ b/docs/RAW_NARRATIVE.md @@ -0,0 +1,390 @@ +# Claudomator: Development Narrative + +This document is a chronological engineering history of the Claudomator project, +reconstructed from the git log, ADRs, and source code. + +--- + +## 1. Initial commit — core scaffolding (2e2b218) + +The project started with a single commit that established the full skeleton: +task model, executor, API server, CLI, storage layer, and reporter. The Go module +was `github.com/thepeterstone/claudomator`. The initial `Task` struct had a +`ClaudeConfig` field (later renamed to `AgentConfig`) holding the model, +instructions, `working_dir`, budget, permission mode, and tool lists. SQLite was +chosen as the storage backend (see ADR-001). The executor pool used a bounded +goroutine model. The API server was plain `net/http` with no external framework. +The CLI was Cobra. + +## 2. JSON tags, module rename, gitignore (8ee1fb5, 46ba3f5, 2bf317d) + +Early housekeeping: added JSON struct tags to all exported types, renamed the Go +module to its final identifier, and set up the `.gitignore` to exclude the compiled +binary and local Claude settings. + +## 3. Verbose flag, logs CLI command (0377c06, f27d4f7) + +Added `--verbose` to the Claude subprocess invocation and a `logs` CLI subcommand +for tailing execution output. + +## 4. Embedded web UI and HTTP wiring (135d8eb) + +The first web UI was embedded into the binary using `go:embed`. This made the +binary fully self-contained: no separate static file server was needed. + +## 5. CLAUDE.md, clickable fold, subtask support (bdcc33f, 3881f80, 704d007) + +Added the project-level `CLAUDE.md` guidance document. Added a clickable fold to +the web UI to expand hidden completed/failed tasks. Added `parent_task_id` to the +`Task` struct, `ListSubtasks` to storage, and `UpdateTask` — the foundational +subtask plumbing. + +## 6. Dependency waiting and planning preamble (f527972) + +The executor gained dependency waiting: tasks with `depends_on` now block in a +polling loop until all dependencies reach `COMPLETED`. Any dependency entering a +terminal failure state (`FAILED`, `TIMED_OUT`, `CANCELLED`, `BUDGET_EXCEEDED`) +immediately fails the waiting task. + +The planning preamble was also introduced here — a system prompt prefix injected +into every task's instructions that explains to the agent how to write question +files, how to break tasks into subtasks via the `claudomator` CLI, and how to +commit all changes in git sandboxes. + +## 7. Elaborate, logs-stream, templates, subtask-list endpoints (74cc740) + +The API gained several new endpoints: +- `POST /api/elaborate` — calls Claude to expand a brief task description into + structured YAML. +- `GET /api/executions/{id}/stream` — live-streams the execution log. +- `GET /api/templates` / `POST /api/templates` — task template CRUD (later removed). +- `GET /api/tasks/{id}/subtasks` — lists subtasks for a parent task. + +## 8. Web UI: tabs, new task modal, templates panel (e8d1b80) + +The web UI got a tabbed layout (Running / Done / Templates), a modal for creating +new tasks with AI-drafted instructions, and a templates panel. This was the first +version of the UI that matched the current design. + +## 9. READY state for human-in-the-loop review (6511d6e) + +A critical design point: when a top-level task's runner exits successfully, the +task does not immediately go to `COMPLETED`. Instead it transitions to `READY`, +meaning it paused for the operator to review the agent's output and explicitly +accept or reject it. `READY → COMPLETED` requires `POST /api/tasks/{id}/accept`. +`READY → PENDING` (for re-running) requires `POST /api/tasks/{id}/reject`. + +This is specific to top-level tasks. Subtasks (`parent_task_id != ""`) bypass READY +and go directly to `COMPLETED` — only the root task requires human sign-off. + +## 10. Fix working_dir failures, hardcoded /root removed (3962597) + +Early deployments hardcoded `/root` as the base path for `working_dir`. This was +removed. `working_dir` is now validated to exist before the subprocess starts. + +## 11. Scripts, debug-execution, deploy (2bbae74, f7c6de4) + +Added the `scripts/` directory with `debug-execution` (inspects a specific +execution's logs) and `deploy` (builds and deploys the binary to the production +server). Added a CLI `start` command and the `version` package. + +## 12. Rescue from recovery branch — question/answer, rate limiting, start-next-task (cf83444) + +A batch of features rescued from a detached-work branch: +- **Question/answer flow (`BLOCKED` state)**: agents can write a `question.json` + file before exiting. The pool detects this and transitions the task to `BLOCKED`, + storing the question for the user. `POST /api/tasks/{id}/answer` resumes the + Claude session with the user's answer injected as the next message. +- **Rate limiting**: the pool tracks which agents are rate-limited and when. + `isRateLimitError` and `isQuotaExhausted` distinguish transient throttles from + 5-hour quota exhaustion. The per-agent `rateLimited` map stores the deadline. +- **Start-next-task script**: a shell script that picks the highest-priority pending + task and starts it. + +## 13. Accept/Reject for READY tasks, Start Next button in UI (9e790e3) + +The web UI gained explicit Accept/Reject buttons for tasks in the `READY` state +and a "Start Next" button in the header that triggers the `start-next-task` script. + +## 14. Stream-level failure detection when claude exits 0 (4c0ee5c) + +Claude can exit 0 even when the task actually failed — for example when the +permission mode denies a tool_use and Claude exits politely. `parseStream` was +updated to detect `is_error: true` in the result message and +`tool_result.is_error: true` with permission-denial text, returning an error in +both cases so the task goes to `FAILED` rather than silently succeeding. + +## 15. Persist log paths at CreateExecution time (f8b5f25) + +Previously, `StdoutPath`, `StderrPath`, and `ArtifactDir` were only written to the +execution record at `UpdateExecution` time (after the subprocess finished). This +prevented live log tailing. Introduced the `LogPather` interface: runners that +implement `ExecLogDir(execID)` allow the pool to pre-populate paths before calling +`CreateExecution`, making them available for streaming before the process ends. + +## 16. bypassPermissions as executor default (a33211d) + +`permission_mode` defaults to `bypassPermissions` when not set in the task YAML. +This was a deliberate trade-off: unattended automation needs to proceed without +tool-use confirmation prompts. Operators can override per-task via `permission_mode`. + +## 17. Cancel endpoint and pool cancel mechanism (3672981) + +`POST /api/tasks/{id}/cancel` was implemented. The pool maintains a `cancels` map +from taskID to context cancel functions. Cancellation sends a SIGKILL to the +entire process group (via `syscall.Kill(-pgid, SIGKILL)`) to reap MCP servers and +bash children that the claude subprocess spawned. + +## 18. BLOCKED state, session resume, fix: persist session_id (7466b17, 40d9ace) + +The full BLOCKED cycle was wired end-to-end: +1. Agent writes `question.json` to `$CLAUDOMATOR_QUESTION_FILE` and exits. +2. Runner detects the file and returns `*BlockedError`. +3. Pool transitions task to `BLOCKED` and stores the question JSON. +4. User answers via `POST /api/tasks/{id}/answer`. +5. Pool calls `SubmitResume` with a new `Execution` carrying `ResumeSessionID` + and `ResumeAnswer`. +6. Runner invokes `claude --resume <session-id> -p <answer>`. + +A bug was found and fixed: `session_id` was not persisted in `UpdateExecution`, +causing the BLOCKED → answer → resume cycle to fail because `GetLatestExecution` +returned no session ID. + +## 19. Context.Background for resume execution; CANCELLED→QUEUED restart (7d4890c) + +Resume executions now use `context.Background()` instead of inheriting a potentially +stale context. `CANCELLED → QUEUED` was added as a valid transition so cancelled +tasks can be manually restarted. + +## 20. git sandbox execution, project_dir rename (1f36e23) + +The `working_dir` field was renamed to `project_dir` across all layers (task YAML, +storage, API, UI). When `project_dir` is set, the runner no longer executes +directly in that directory. Instead it: + +1. Detects whether `project_dir` is a git repo (initialising one if not). +2. Clones the repo into `/tmp/claudomator-sandbox-*` (using `--no-hardlinks` + to avoid permission issues with mixed-owner `.git/objects`). +3. Runs the agent in the sandbox clone. +4. After the agent exits, verifies no uncommitted changes remain and pushes + new commits to the canonical bare repo. +5. Removes the sandbox. + +On BLOCKED, the sandbox is preserved so the agent can resume where it left off +in the same working tree. + +Concurrent push conflicts (two sandboxes pushing at the same time) are handled +by a fetch-rebase-retry sequence. + +## 21. Storage: enforce valid state transitions in UpdateTaskState (8777bf2) + +`storage.DB.UpdateTaskState` now calls `task.ValidTransition` before writing. If +the transition is not allowed by the state machine, the function returns an error +and no write occurs. This is the enforcement point for the state machine invariants. + +## 22. Executor internal dispatch queue; remove at-capacity rejection (2cf6d97) + +The previous pool rejected `Submit` when all slots were taken. This was replaced +with an internal `workCh` channel and a `dispatch` goroutine: tasks submitted +while the pool is at capacity are buffered in the channel and picked up as soon +as a slot opens. `Submit` now only returns an error if the channel itself is full +(which requires an enormous backlog). + +## 23. API hardening — WebSocket auth, per-IP rate limiter, script registry (363fc9e, 417034b, 181a376) + +Several API reliability improvements: +- WebSocket connections now require an API token (if `SetAPIToken` was called) and + are capped at a configurable maximum number of clients. A ping/pong keepalive + prevents stale connections from accumulating. +- A per-IP rate limiter was added to the `/api/elaborate` and `/api/validate` + endpoints to prevent abuse. +- The scripts endpoints were collapsed into a generic `ScriptRegistry`: instead of + individual handlers per script, a single handler dispatches to registered scripts + by name. + +## 24. API: extend executions and log streaming endpoints (7914153) + +`GET /api/executions` gained filtering and sorting. `GET /api/executions/{id}/logs` +was added for fetching completed log files. Live streaming via SSE and the log +tail endpoint were polished. + +## 25. CLI: newLogger, shared HTTP client, report command (1ce83b6) + +CLI utilities consolidated: a shared logger constructor (`newLogger`), a shared +HTTP client, a default server URL (`http://localhost:8484`). Added the `report` +CLI subcommand for fetching execution summaries from the server. + +## 26. Generic agent architecture — transition from Claude-only (306482d to f2d6822) + +This was a major refactor over several commits: +1. `ClaudeConfig` was renamed to `AgentConfig` with a new `Type` field (`"claude"`, + `"gemini"`, etc.). +2. `Pool` was changed from holding a single `ClaudeRunner` to holding a + `map[string]Runner` — one runner per agent type. +3. `GeminiRunner` was implemented, mirroring `ClaudeRunner` but invoking the + `gemini` CLI. +4. The storage layer, API handlers, elaborate/validate endpoints, and all tests + were updated to use `AgentConfig`. +5. The web UI was updated to expose agent type selection. + +## 27. Gemini-based task classification and explicit load balancing (406247b) + +`Classifier` and `pickAgent` were introduced to automate agent and model selection: + +- **`pickAgent(SystemStatus)`** — explicit load balancing: picks the available + (non-rate-limited) agent with the fewest active tasks. Falls back to fewest-active + if all agents are rate-limited. +- **`Classifier`** — calls the Gemini CLI with a meta-prompt asking it to pick + the best model for the task. This is intentionally model-picks-model: use a fast, + cheap classifier to avoid wasting expensive tokens. + +After this commit the flow is: `execute()` → pick agent → call classifier → set +`t.Agent.Type` and `t.Agent.Model` → dispatch to runner. + +## 28. ADR-003: Security Model (93a4c85) + +The security model was documented formally: no auth, permissive CORS, `bypassPermissions` +as default, and the known risk inventory (see `docs/adr/003-security-model.md`). + +## 29. Various web UI improvements (91fd904, 7b53b9e, 560f42b, cdfdc30) + +Running tasks became the default view. A "Running view" showing currently running +tasks alongside the 24h execution history was added. Agent type and model were +surfaced on running task cards. The Done/Interrupted tabs were filtered to 24h. + +## 30. Quota exhaustion detection from stream (076c0fa) + +Previously, quota exhaustion (the 5-hour usage limit) was treated identically to +generic failures. `isQuotaExhausted` was introduced to distinguish it: quota +exhaustion maps to `BUDGET_EXCEEDED` and sets a 5-hour rate-limit deadline on the +agent, rather than failing the task with a generic error. + +## 31. Sandbox fixes — push via bare repo, fetch/rebase (cfbcc7b, f135ab8, 07061ac) + +The sandbox teardown strategy was revised: instead of pushing directly into the +working copy (which fails for non-bare repos), the sandbox pushes to a bare repo +(`remote "local"` or `remote "origin"`) and the working copy is pulled separately +by the developer. This avoids permission errors from mixed-owner `.git/objects`. +The `--no-hardlinks` clone flag was added to prevent object sharing. + +## 32. BLOCKED→READY for parent tasks with subtasks (441ed9e, c8e3b46) + +When a top-level task exits the runner successfully but has subtasks, it transitions +to `BLOCKED` (waiting for subtasks to finish) rather than `READY`. A new +`maybeUnblockParent` function is called every time a subtask completes: if all +siblings are `COMPLETED`, the parent transitions `BLOCKED → READY` and is +presented for operator review. + +## 33. Stale RUNNING task recovery on server startup (9159572) + +`Pool.RecoverStaleRunning()` was added and called from `cli.serve`. It queries for +tasks still in `RUNNING` state (left over from a previous server crash) and marks +them `FAILED`, closing their open execution records. This prevents stuck tasks +after server restarts. + +## 34. API: configurable mockRunner, async error-path tests (b33566b) + +The `api` test suite was hardened with a configurable `mockRunner` that can be +injected into the test server. Async error paths (runner returns an error, DB +update fails mid-execution) were now exercised in tests. + +## 35. Storage: missing indexes, ListRecentExecutions tests, DeleteTask atomicity (8b6c97e, 3610409) + +Several storage correctness fixes: +- `idx_tasks_state`, `idx_tasks_parent_task_id`, `idx_executions_status`, + `idx_executions_task_id`, and `idx_executions_start_time` indexes were added. +- `ListRecentExecutions` had an off-by-one that caused it to miss recent executions; + tests were added to catch this. +- `DeleteTask` was made atomic using a recursive CTE to delete the task and all + its subtasks in a single transaction. + +## 36. API: validate ?state= param, standardize operation response shapes (933af81) + +`GET /api/tasks?state=XYZ` now validates the state value. All mutating operation +responses (`/run`, `/cancel`, `/accept`, `/reject`, `/answer`) were standardised +to return `{"status": "ok"}` on success. + +## 37. Re-classify on manual restart; handleRunResult extraction (0676f0f, 7d6943c) + +Tasks that are manually restarted (from `FAILED`, `CANCELLED`, etc.) now go through +classification again so they pick up the latest agent/model selection logic. The +post-run error classification block was extracted into `handleRunResult` — a shared +helper called by both `execute` and `executeResume` — eliminating 60+ lines of +duplication. + +## 38. Legacy Claude field removed (b4371d0, a782bbf) + +The last remnants of the original `ClaudeConfig` type and backward-compat `working_dir` +shim were removed. The schema is now fully generic. + +## 39. Kill-goroutine safety documentation, goroutine-leak test (3b4c50e) + +A documented invariant was added to the `execOnce` goroutine that kills the +subprocess process group: it cannot block indefinitely. Tests were added to verify +no goroutine leak occurs when a task is cancelled. + +## 40. Rate-limit avoidance in classifier; model list updates (8ec366d, fc1459b) + +The classifier now skips calling itself if the selected agent is rate-limited, +avoiding a redundant Gemini API call when the rate-limited agent is already known. +The model list was updated to Claude 4.x (`claude-sonnet-4-6`, `claude-opus-4-6`, +`claude-haiku-4-5-20251001`) and current Gemini models (`gemini-2.5-flash-lite`, +`gemini-2.5-flash`, `gemini-2.5-pro`). + +## 41. Map leak fixes — activePerAgent and rateLimited (7c7dd2b) + +Two map leak bugs were fixed in the pool: +- `activePerAgent[agentType]` was decremented but never deleted when the count hit + zero, so inactive agents accumulated as dead entries. +- Expired `rateLimited[agentType]` entries were not deleted, so the map grew + unboundedly over long runs. + +## 42. Sandbox teardown: remove working-copy pull, retry push on concurrent rejection (5c85624) + +The sandbox teardown removed the `git pull` into the working copy (which was failing +due to mixed-owner object dirs). The retry-push-on-rejection path was tightened to +detect `"fetch first"` and `"non-fast-forward"` as the rejection signals. + +## 43. Explicit load balancing separated from classification (e033504) + +Previously the `Classifier` both picked the agent and selected the model. This was +split: `pickAgent` is deterministic code that picks the agent from the registered +runners using the load-balancing algorithm. The `Classifier` only picks the model +for the already-selected agent. This makes load balancing reliable and fast even +when the Gemini classifier is unavailable. + +## 44. Session ID fix on second block-and-resume cycle (65c7638) + +A bug was found where the second BLOCKED→answer→resume cycle passed the wrong +`--resume` session ID to Claude. The fix ensures that resume executions propagate +the original session ID rather than the new execution's UUID. + +## 45. validTransitions promoted to package-level var (3226af3) + +`validTransitions` was promoted to a package-level variable in `internal/task/task.go` +for clarity and potential reuse outside the package. ADR-002 was updated to reflect +the current state machine including the `BLOCKED→READY` transition for parent tasks. + +--- + +## Feature Summary (current state) + +| Feature | Status | +|---|---| +| Task YAML parsing, batch files | Done | +| SQLite persistence | Done | +| REST API (CRUD + lifecycle) | Done | +| WebSocket real-time events | Done | +| Claude subprocess execution | Done | +| Gemini subprocess execution | Done | +| Explicit load balancing (pickAgent) | Done | +| Gemini-based model classification | Done | +| BLOCKED / question-answer / resume | Done | +| git sandbox isolation | Done | +| Subtask creation and unblocking | Done | +| READY state / human accept-reject | Done | +| Rate-limit and quota tracking | Done | +| Stale RUNNING recovery on startup | Done | +| Per-IP rate limiter on elaborate | Done | +| Web UI (PWA) | Done | +| Push notifications (PWA) | Planned | diff --git a/docs/adr/004-multi-agent-routing-and-classification.md b/docs/adr/004-multi-agent-routing-and-classification.md new file mode 100644 index 0000000..7afb10d --- /dev/null +++ b/docs/adr/004-multi-agent-routing-and-classification.md @@ -0,0 +1,107 @@ +# ADR-004: Multi-Agent Routing and Gemini-Based Classification + +## Status +Accepted + +## Context + +Claudomator started as a Claude-only system. As Gemini became a viable coding +agent, the architecture needed to support multiple agent backends without requiring +operators to manually select an agent or model for each task. + +Two distinct problems needed solving: + +1. **Which agent should run this task?** — Claude and Gemini have different API + quotas and rate limits. When Claude is rate-limited, tasks should flow to + Gemini automatically. +2. **Which model tier should the agent use?** — Both agents offer a spectrum from + fast/cheap to slow/powerful models. Using the wrong tier wastes money or + produces inferior results. + +## Decision + +The two problems are solved independently: + +### Agent selection: explicit load balancing in code (`pickAgent`) + +`pickAgent(SystemStatus)` selects the agent with the fewest active tasks, +preferring non-rate-limited agents. The algorithm is: + +1. First pass: consider only non-rate-limited agents; pick the one with the + fewest active tasks (alphabetical tie-break for determinism). +2. Fallback: if all agents are rate-limited, pick the least-active regardless + of rate-limit status. + +This is deterministic code, not an AI call. It runs in-process with no I/O and +cannot fail in ways that would block task execution. + +### Model selection: Gemini-based classifier (`Classifier`) + +Once the agent is selected, `Classifier.Classify` invokes the Gemini CLI with +`gemini-2.5-flash-lite` to select the best model tier for the task. The classifier +receives the task name, instructions, and the required agent type, and returns +a `Classification` with `agent_type`, `model`, and `reason`. + +The classifier uses a cheap, fast model for classification to minimise the cost +overhead. The response is parsed from JSON, with fallback handling for markdown +code blocks and credential noise in the output. + +### Separation of concerns + +These two decisions were initially merged (the classifier picked both agent and +model). They were separated in commit `e033504` because: + +- Load balancing must be reliable even when the Gemini API is unavailable. +- Classifier failures are non-fatal: if classification fails, the pool logs the + error and proceeds with the agent's default model. + +### Re-classification on manual restart + +When an operator manually restarts a task from a non-`QUEUED` state (e.g. `FAILED`, +`CANCELLED`), the task goes through `execute()` again and is re-classified. This +ensures restarts pick up any changes to agent availability or rate-limit status. + +## Rationale + +- **AI-picks-model**: the model selection decision is genuinely complex and + subjective. Using an AI classifier avoids hardcoding heuristics that would need + constant tuning. +- **Code-picks-agent**: load balancing is a scheduling problem with measurable + inputs (active task counts, rate-limit deadlines). Delegating this to an AI + would introduce unnecessary non-determinism and latency. +- **Gemini for classification**: using Gemini to classify Claude tasks (and vice + versa) prevents circular dependencies. Using the cheapest available Gemini model + keeps classification cost negligible. + +## Alternatives Considered + +- **Operator always picks agent and model**: too much manual overhead. Operators + should be able to submit tasks without knowing which agent is currently + rate-limited. +- **Single classifier picks both agent and model**: rejected after operational + experience showed that load balancing needs to work even when the Gemini API + is unavailable or returning errors. +- **Round-robin agent selection**: simpler but does not account for rate limits + or imbalanced task durations. + +## Consequences + +- Agent selection is deterministic and testable without mocking AI APIs. +- Classification failures are logged but non-fatal; the task runs with the + agent's default model. +- The classifier adds ~1–2 seconds of latency to task start (one Gemini API call). +- Tasks with `agent.type` pre-set in YAML still go through load balancing; + `pickAgent` may override the requested type if the requested type is not a + registered runner. This is by design: the operator's type hint is overridden + by the load balancer to ensure tasks are always routable. + +## Relevant Code Locations + +| Concern | File | +|---|---| +| `pickAgent` | `internal/executor/executor.go` | +| `Classifier` | `internal/executor/classifier.go` | +| Load balancing in `execute()` | `internal/executor/executor.go` | +| Re-classification gate | `internal/api/server.go` (handleRunTask) | +| `pickAgent` tests | `internal/executor/executor_test.go` | +| `Classifier` mock test | `internal/executor/classifier_test.go` | diff --git a/docs/adr/005-sandbox-execution-model.md b/docs/adr/005-sandbox-execution-model.md new file mode 100644 index 0000000..b374561 --- /dev/null +++ b/docs/adr/005-sandbox-execution-model.md @@ -0,0 +1,130 @@ +# ADR-005: Git Sandbox Execution Model + +## Status +Accepted + +## Context + +Tasks that modify source code need a safe execution environment. Running agents +directly in the canonical working copy creates several problems: + +1. **Concurrent corruption**: multiple agents running in the same directory stomp + on each other's changes. +2. **Partial work leaks**: if a task is cancelled mid-run, half-written files + remain in the working tree, blocking other work. +3. **No rollback**: a failed agent may leave the codebase in a broken state with + no clean way to recover without manual `git reset`. +4. **Audit trail**: changes made by an agent should be visible as discrete, + attributable git commits — not as an anonymous blob of working-tree noise. + +## Decision + +When a task has `agent.project_dir` set, `ClaudeRunner.Run` executes the agent +in an isolated git clone (a "sandbox") rather than in the project directory +directly. + +### Sandbox lifecycle + +``` +project_dir (canonical working copy) + | + | git clone --no-hardlinks <clone-source> /tmp/claudomator-sandbox-* + | + v +sandbox (temp clone) + | + | agent runs here; commits its changes + | + | git push (to bare repo "local" or "origin") + | +teardown: verify no uncommitted changes, remove sandbox dir +``` + +### Clone source + +`sandboxCloneSource` prefers a remote named `"local"` (a local bare repo). +If not present it falls back to the `"origin"` remote. Using a bare repo +accepts pushes cleanly; pushing to a non-bare working copy fails when the +receiving branch is checked out. + +### Uncommitted-change enforcement + +Before teardown, the runner runs `git status --porcelain` in the sandbox. +If any uncommitted changes are detected, the task is failed with an error +message listing the files. The sandbox is **preserved** so the operator +can inspect or recover the partial work. The error message includes the +sandbox path. + +### Concurrent push conflicts + +If two sandboxes try to push at the same time, the second push is rejected +(`"fetch first"` or `"non-fast-forward"` in the error output). The runner +detects these signals and performs a fetch → rebase → retry sequence, up to +a fixed retry limit, before giving up. + +### BLOCKED state and sandbox preservation + +When an agent exits with a `question.json` file (entering the `BLOCKED` +state), the sandbox is **not** torn down. The preserved sandbox allows the +resumed execution to pick up the same working tree state, including any +in-progress file changes made before the agent asked its question. + +Resume executions (`SubmitResume`) skip sandbox setup entirely and run +directly in `project_dir`, passing `--resume <session-id>` to the agent +so Claude can continue its previous conversation. + +### Session ID propagation on resume + +A subtle bug was found and fixed: when a resumed execution is itself blocked +again (a second BLOCKED→answer→resume cycle), the new execution record must +carry the **original** `ResumeSessionID`, not the new execution's own UUID. +If the wrong ID is used, `claude --resume` fails with "No conversation found". +The fix is in `ClaudeRunner.Run`: if `e.ResumeSessionID != ""`, use it as +`e.SessionID` rather than `e.ID`. + +## Rationale + +- **`--no-hardlinks`**: git defaults to hardlinking objects between clone and + source when both are on the same filesystem. This causes permission errors + when the source is owned by a different user (e.g. `www-data` vs. `root`). + The flag forces a full copy. +- **Bare repo for push target**: non-bare repos reject pushes to checked-out + branches. A bare repo (`git init --bare`) accepts all pushes safely. +- **Preserve sandbox on failure**: partial agent work may be valuable for + debugging or resumption. Destroying it immediately on failure was considered + and rejected. +- **Agent must commit**: requiring the agent to commit all changes before + exiting ensures git history is always the source of truth. The enforcement + check (uncommitted files → FAILED) makes this invariant observable. + +## Alternatives Considered + +- **Run directly in working copy**: rejected because of concurrent corruption + and partial-work leakage. +- **Copy files instead of git clone**: rejected because the agent needs a + working git history (for `git log`, `git blame`, and to push commits back). +- **Docker/container isolation**: considered for stronger isolation but + rejected due to operational complexity, dependency on container runtime, + and inability to use the host's claude/gemini credentials. + +## Consequences + +- Tasks without `project_dir` are unaffected; they run in whatever working + directory the server process inherited. +- If a sandbox's push repeatedly fails (e.g. due to a bare repo that is + itself broken), the task is failed with the sandbox preserved. +- If `/tmp` runs out of space (many large sandboxes), tasks will fail at + clone time. This is a known operational risk with no current mitigation. +- The `project_dir` field in task YAML must point to a git repository with + a configured `"local"` or `"origin"` remote that accepts pushes. + +## Relevant Code Locations + +| Concern | File | +|---|---| +| Sandbox setup/teardown | `internal/executor/claude.go` | +| `setupSandbox`, `teardownSandbox` | `internal/executor/claude.go` | +| `sandboxCloneSource` | `internal/executor/claude.go` | +| Resume skips sandbox | `internal/executor/claude.go` (Run) | +| Session ID propagation fix | `internal/executor/claude.go` (Run) | +| Sandbox tests | `internal/executor/claude_test.go` | |
