summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
-rw-r--r--docs/RAW_NARRATIVE.md390
-rw-r--r--docs/adr/004-multi-agent-routing-and-classification.md107
-rw-r--r--docs/adr/005-sandbox-execution-model.md130
3 files changed, 627 insertions, 0 deletions
diff --git a/docs/RAW_NARRATIVE.md b/docs/RAW_NARRATIVE.md
new file mode 100644
index 0000000..a9c1492
--- /dev/null
+++ b/docs/RAW_NARRATIVE.md
@@ -0,0 +1,390 @@
+# Claudomator: Development Narrative
+
+This document is a chronological engineering history of the Claudomator project,
+reconstructed from the git log, ADRs, and source code.
+
+---
+
+## 1. Initial commit — core scaffolding (2e2b218)
+
+The project started with a single commit that established the full skeleton:
+task model, executor, API server, CLI, storage layer, and reporter. The Go module
+was `github.com/thepeterstone/claudomator`. The initial `Task` struct had a
+`ClaudeConfig` field (later renamed to `AgentConfig`) holding the model,
+instructions, `working_dir`, budget, permission mode, and tool lists. SQLite was
+chosen as the storage backend (see ADR-001). The executor pool used a bounded
+goroutine model. The API server was plain `net/http` with no external framework.
+The CLI was Cobra.
+
+## 2. JSON tags, module rename, gitignore (8ee1fb5, 46ba3f5, 2bf317d)
+
+Early housekeeping: added JSON struct tags to all exported types, renamed the Go
+module to its final identifier, and set up the `.gitignore` to exclude the compiled
+binary and local Claude settings.
+
+## 3. Verbose flag, logs CLI command (0377c06, f27d4f7)
+
+Added `--verbose` to the Claude subprocess invocation and a `logs` CLI subcommand
+for tailing execution output.
+
+## 4. Embedded web UI and HTTP wiring (135d8eb)
+
+The first web UI was embedded into the binary using `go:embed`. This made the
+binary fully self-contained: no separate static file server was needed.
+
+## 5. CLAUDE.md, clickable fold, subtask support (bdcc33f, 3881f80, 704d007)
+
+Added the project-level `CLAUDE.md` guidance document. Added a clickable fold to
+the web UI to expand hidden completed/failed tasks. Added `parent_task_id` to the
+`Task` struct, `ListSubtasks` to storage, and `UpdateTask` — the foundational
+subtask plumbing.
+
+## 6. Dependency waiting and planning preamble (f527972)
+
+The executor gained dependency waiting: tasks with `depends_on` now block in a
+polling loop until all dependencies reach `COMPLETED`. Any dependency entering a
+terminal failure state (`FAILED`, `TIMED_OUT`, `CANCELLED`, `BUDGET_EXCEEDED`)
+immediately fails the waiting task.
+
+The planning preamble was also introduced here — a system prompt prefix injected
+into every task's instructions that explains to the agent how to write question
+files, how to break tasks into subtasks via the `claudomator` CLI, and how to
+commit all changes in git sandboxes.
+
+## 7. Elaborate, logs-stream, templates, subtask-list endpoints (74cc740)
+
+The API gained several new endpoints:
+- `POST /api/elaborate` — calls Claude to expand a brief task description into
+ structured YAML.
+- `GET /api/executions/{id}/stream` — live-streams the execution log.
+- `GET /api/templates` / `POST /api/templates` — task template CRUD (later removed).
+- `GET /api/tasks/{id}/subtasks` — lists subtasks for a parent task.
+
+## 8. Web UI: tabs, new task modal, templates panel (e8d1b80)
+
+The web UI got a tabbed layout (Running / Done / Templates), a modal for creating
+new tasks with AI-drafted instructions, and a templates panel. This was the first
+version of the UI that matched the current design.
+
+## 9. READY state for human-in-the-loop review (6511d6e)
+
+A critical design point: when a top-level task's runner exits successfully, the
+task does not immediately go to `COMPLETED`. Instead it transitions to `READY`,
+meaning it paused for the operator to review the agent's output and explicitly
+accept or reject it. `READY → COMPLETED` requires `POST /api/tasks/{id}/accept`.
+`READY → PENDING` (for re-running) requires `POST /api/tasks/{id}/reject`.
+
+This is specific to top-level tasks. Subtasks (`parent_task_id != ""`) bypass READY
+and go directly to `COMPLETED` — only the root task requires human sign-off.
+
+## 10. Fix working_dir failures, hardcoded /root removed (3962597)
+
+Early deployments hardcoded `/root` as the base path for `working_dir`. This was
+removed. `working_dir` is now validated to exist before the subprocess starts.
+
+## 11. Scripts, debug-execution, deploy (2bbae74, f7c6de4)
+
+Added the `scripts/` directory with `debug-execution` (inspects a specific
+execution's logs) and `deploy` (builds and deploys the binary to the production
+server). Added a CLI `start` command and the `version` package.
+
+## 12. Rescue from recovery branch — question/answer, rate limiting, start-next-task (cf83444)
+
+A batch of features rescued from a detached-work branch:
+- **Question/answer flow (`BLOCKED` state)**: agents can write a `question.json`
+ file before exiting. The pool detects this and transitions the task to `BLOCKED`,
+ storing the question for the user. `POST /api/tasks/{id}/answer` resumes the
+ Claude session with the user's answer injected as the next message.
+- **Rate limiting**: the pool tracks which agents are rate-limited and when.
+ `isRateLimitError` and `isQuotaExhausted` distinguish transient throttles from
+ 5-hour quota exhaustion. The per-agent `rateLimited` map stores the deadline.
+- **Start-next-task script**: a shell script that picks the highest-priority pending
+ task and starts it.
+
+## 13. Accept/Reject for READY tasks, Start Next button in UI (9e790e3)
+
+The web UI gained explicit Accept/Reject buttons for tasks in the `READY` state
+and a "Start Next" button in the header that triggers the `start-next-task` script.
+
+## 14. Stream-level failure detection when claude exits 0 (4c0ee5c)
+
+Claude can exit 0 even when the task actually failed — for example when the
+permission mode denies a tool_use and Claude exits politely. `parseStream` was
+updated to detect `is_error: true` in the result message and
+`tool_result.is_error: true` with permission-denial text, returning an error in
+both cases so the task goes to `FAILED` rather than silently succeeding.
+
+## 15. Persist log paths at CreateExecution time (f8b5f25)
+
+Previously, `StdoutPath`, `StderrPath`, and `ArtifactDir` were only written to the
+execution record at `UpdateExecution` time (after the subprocess finished). This
+prevented live log tailing. Introduced the `LogPather` interface: runners that
+implement `ExecLogDir(execID)` allow the pool to pre-populate paths before calling
+`CreateExecution`, making them available for streaming before the process ends.
+
+## 16. bypassPermissions as executor default (a33211d)
+
+`permission_mode` defaults to `bypassPermissions` when not set in the task YAML.
+This was a deliberate trade-off: unattended automation needs to proceed without
+tool-use confirmation prompts. Operators can override per-task via `permission_mode`.
+
+## 17. Cancel endpoint and pool cancel mechanism (3672981)
+
+`POST /api/tasks/{id}/cancel` was implemented. The pool maintains a `cancels` map
+from taskID to context cancel functions. Cancellation sends a SIGKILL to the
+entire process group (via `syscall.Kill(-pgid, SIGKILL)`) to reap MCP servers and
+bash children that the claude subprocess spawned.
+
+## 18. BLOCKED state, session resume, fix: persist session_id (7466b17, 40d9ace)
+
+The full BLOCKED cycle was wired end-to-end:
+1. Agent writes `question.json` to `$CLAUDOMATOR_QUESTION_FILE` and exits.
+2. Runner detects the file and returns `*BlockedError`.
+3. Pool transitions task to `BLOCKED` and stores the question JSON.
+4. User answers via `POST /api/tasks/{id}/answer`.
+5. Pool calls `SubmitResume` with a new `Execution` carrying `ResumeSessionID`
+ and `ResumeAnswer`.
+6. Runner invokes `claude --resume <session-id> -p <answer>`.
+
+A bug was found and fixed: `session_id` was not persisted in `UpdateExecution`,
+causing the BLOCKED → answer → resume cycle to fail because `GetLatestExecution`
+returned no session ID.
+
+## 19. Context.Background for resume execution; CANCELLED→QUEUED restart (7d4890c)
+
+Resume executions now use `context.Background()` instead of inheriting a potentially
+stale context. `CANCELLED → QUEUED` was added as a valid transition so cancelled
+tasks can be manually restarted.
+
+## 20. git sandbox execution, project_dir rename (1f36e23)
+
+The `working_dir` field was renamed to `project_dir` across all layers (task YAML,
+storage, API, UI). When `project_dir` is set, the runner no longer executes
+directly in that directory. Instead it:
+
+1. Detects whether `project_dir` is a git repo (initialising one if not).
+2. Clones the repo into `/tmp/claudomator-sandbox-*` (using `--no-hardlinks`
+ to avoid permission issues with mixed-owner `.git/objects`).
+3. Runs the agent in the sandbox clone.
+4. After the agent exits, verifies no uncommitted changes remain and pushes
+ new commits to the canonical bare repo.
+5. Removes the sandbox.
+
+On BLOCKED, the sandbox is preserved so the agent can resume where it left off
+in the same working tree.
+
+Concurrent push conflicts (two sandboxes pushing at the same time) are handled
+by a fetch-rebase-retry sequence.
+
+## 21. Storage: enforce valid state transitions in UpdateTaskState (8777bf2)
+
+`storage.DB.UpdateTaskState` now calls `task.ValidTransition` before writing. If
+the transition is not allowed by the state machine, the function returns an error
+and no write occurs. This is the enforcement point for the state machine invariants.
+
+## 22. Executor internal dispatch queue; remove at-capacity rejection (2cf6d97)
+
+The previous pool rejected `Submit` when all slots were taken. This was replaced
+with an internal `workCh` channel and a `dispatch` goroutine: tasks submitted
+while the pool is at capacity are buffered in the channel and picked up as soon
+as a slot opens. `Submit` now only returns an error if the channel itself is full
+(which requires an enormous backlog).
+
+## 23. API hardening — WebSocket auth, per-IP rate limiter, script registry (363fc9e, 417034b, 181a376)
+
+Several API reliability improvements:
+- WebSocket connections now require an API token (if `SetAPIToken` was called) and
+ are capped at a configurable maximum number of clients. A ping/pong keepalive
+ prevents stale connections from accumulating.
+- A per-IP rate limiter was added to the `/api/elaborate` and `/api/validate`
+ endpoints to prevent abuse.
+- The scripts endpoints were collapsed into a generic `ScriptRegistry`: instead of
+ individual handlers per script, a single handler dispatches to registered scripts
+ by name.
+
+## 24. API: extend executions and log streaming endpoints (7914153)
+
+`GET /api/executions` gained filtering and sorting. `GET /api/executions/{id}/logs`
+was added for fetching completed log files. Live streaming via SSE and the log
+tail endpoint were polished.
+
+## 25. CLI: newLogger, shared HTTP client, report command (1ce83b6)
+
+CLI utilities consolidated: a shared logger constructor (`newLogger`), a shared
+HTTP client, a default server URL (`http://localhost:8484`). Added the `report`
+CLI subcommand for fetching execution summaries from the server.
+
+## 26. Generic agent architecture — transition from Claude-only (306482d to f2d6822)
+
+This was a major refactor over several commits:
+1. `ClaudeConfig` was renamed to `AgentConfig` with a new `Type` field (`"claude"`,
+ `"gemini"`, etc.).
+2. `Pool` was changed from holding a single `ClaudeRunner` to holding a
+ `map[string]Runner` — one runner per agent type.
+3. `GeminiRunner` was implemented, mirroring `ClaudeRunner` but invoking the
+ `gemini` CLI.
+4. The storage layer, API handlers, elaborate/validate endpoints, and all tests
+ were updated to use `AgentConfig`.
+5. The web UI was updated to expose agent type selection.
+
+## 27. Gemini-based task classification and explicit load balancing (406247b)
+
+`Classifier` and `pickAgent` were introduced to automate agent and model selection:
+
+- **`pickAgent(SystemStatus)`** — explicit load balancing: picks the available
+ (non-rate-limited) agent with the fewest active tasks. Falls back to fewest-active
+ if all agents are rate-limited.
+- **`Classifier`** — calls the Gemini CLI with a meta-prompt asking it to pick
+ the best model for the task. This is intentionally model-picks-model: use a fast,
+ cheap classifier to avoid wasting expensive tokens.
+
+After this commit the flow is: `execute()` → pick agent → call classifier → set
+`t.Agent.Type` and `t.Agent.Model` → dispatch to runner.
+
+## 28. ADR-003: Security Model (93a4c85)
+
+The security model was documented formally: no auth, permissive CORS, `bypassPermissions`
+as default, and the known risk inventory (see `docs/adr/003-security-model.md`).
+
+## 29. Various web UI improvements (91fd904, 7b53b9e, 560f42b, cdfdc30)
+
+Running tasks became the default view. A "Running view" showing currently running
+tasks alongside the 24h execution history was added. Agent type and model were
+surfaced on running task cards. The Done/Interrupted tabs were filtered to 24h.
+
+## 30. Quota exhaustion detection from stream (076c0fa)
+
+Previously, quota exhaustion (the 5-hour usage limit) was treated identically to
+generic failures. `isQuotaExhausted` was introduced to distinguish it: quota
+exhaustion maps to `BUDGET_EXCEEDED` and sets a 5-hour rate-limit deadline on the
+agent, rather than failing the task with a generic error.
+
+## 31. Sandbox fixes — push via bare repo, fetch/rebase (cfbcc7b, f135ab8, 07061ac)
+
+The sandbox teardown strategy was revised: instead of pushing directly into the
+working copy (which fails for non-bare repos), the sandbox pushes to a bare repo
+(`remote "local"` or `remote "origin"`) and the working copy is pulled separately
+by the developer. This avoids permission errors from mixed-owner `.git/objects`.
+The `--no-hardlinks` clone flag was added to prevent object sharing.
+
+## 32. BLOCKED→READY for parent tasks with subtasks (441ed9e, c8e3b46)
+
+When a top-level task exits the runner successfully but has subtasks, it transitions
+to `BLOCKED` (waiting for subtasks to finish) rather than `READY`. A new
+`maybeUnblockParent` function is called every time a subtask completes: if all
+siblings are `COMPLETED`, the parent transitions `BLOCKED → READY` and is
+presented for operator review.
+
+## 33. Stale RUNNING task recovery on server startup (9159572)
+
+`Pool.RecoverStaleRunning()` was added and called from `cli.serve`. It queries for
+tasks still in `RUNNING` state (left over from a previous server crash) and marks
+them `FAILED`, closing their open execution records. This prevents stuck tasks
+after server restarts.
+
+## 34. API: configurable mockRunner, async error-path tests (b33566b)
+
+The `api` test suite was hardened with a configurable `mockRunner` that can be
+injected into the test server. Async error paths (runner returns an error, DB
+update fails mid-execution) were now exercised in tests.
+
+## 35. Storage: missing indexes, ListRecentExecutions tests, DeleteTask atomicity (8b6c97e, 3610409)
+
+Several storage correctness fixes:
+- `idx_tasks_state`, `idx_tasks_parent_task_id`, `idx_executions_status`,
+ `idx_executions_task_id`, and `idx_executions_start_time` indexes were added.
+- `ListRecentExecutions` had an off-by-one that caused it to miss recent executions;
+ tests were added to catch this.
+- `DeleteTask` was made atomic using a recursive CTE to delete the task and all
+ its subtasks in a single transaction.
+
+## 36. API: validate ?state= param, standardize operation response shapes (933af81)
+
+`GET /api/tasks?state=XYZ` now validates the state value. All mutating operation
+responses (`/run`, `/cancel`, `/accept`, `/reject`, `/answer`) were standardised
+to return `{"status": "ok"}` on success.
+
+## 37. Re-classify on manual restart; handleRunResult extraction (0676f0f, 7d6943c)
+
+Tasks that are manually restarted (from `FAILED`, `CANCELLED`, etc.) now go through
+classification again so they pick up the latest agent/model selection logic. The
+post-run error classification block was extracted into `handleRunResult` — a shared
+helper called by both `execute` and `executeResume` — eliminating 60+ lines of
+duplication.
+
+## 38. Legacy Claude field removed (b4371d0, a782bbf)
+
+The last remnants of the original `ClaudeConfig` type and backward-compat `working_dir`
+shim were removed. The schema is now fully generic.
+
+## 39. Kill-goroutine safety documentation, goroutine-leak test (3b4c50e)
+
+A documented invariant was added to the `execOnce` goroutine that kills the
+subprocess process group: it cannot block indefinitely. Tests were added to verify
+no goroutine leak occurs when a task is cancelled.
+
+## 40. Rate-limit avoidance in classifier; model list updates (8ec366d, fc1459b)
+
+The classifier now skips calling itself if the selected agent is rate-limited,
+avoiding a redundant Gemini API call when the rate-limited agent is already known.
+The model list was updated to Claude 4.x (`claude-sonnet-4-6`, `claude-opus-4-6`,
+`claude-haiku-4-5-20251001`) and current Gemini models (`gemini-2.5-flash-lite`,
+`gemini-2.5-flash`, `gemini-2.5-pro`).
+
+## 41. Map leak fixes — activePerAgent and rateLimited (7c7dd2b)
+
+Two map leak bugs were fixed in the pool:
+- `activePerAgent[agentType]` was decremented but never deleted when the count hit
+ zero, so inactive agents accumulated as dead entries.
+- Expired `rateLimited[agentType]` entries were not deleted, so the map grew
+ unboundedly over long runs.
+
+## 42. Sandbox teardown: remove working-copy pull, retry push on concurrent rejection (5c85624)
+
+The sandbox teardown removed the `git pull` into the working copy (which was failing
+due to mixed-owner object dirs). The retry-push-on-rejection path was tightened to
+detect `"fetch first"` and `"non-fast-forward"` as the rejection signals.
+
+## 43. Explicit load balancing separated from classification (e033504)
+
+Previously the `Classifier` both picked the agent and selected the model. This was
+split: `pickAgent` is deterministic code that picks the agent from the registered
+runners using the load-balancing algorithm. The `Classifier` only picks the model
+for the already-selected agent. This makes load balancing reliable and fast even
+when the Gemini classifier is unavailable.
+
+## 44. Session ID fix on second block-and-resume cycle (65c7638)
+
+A bug was found where the second BLOCKED→answer→resume cycle passed the wrong
+`--resume` session ID to Claude. The fix ensures that resume executions propagate
+the original session ID rather than the new execution's UUID.
+
+## 45. validTransitions promoted to package-level var (3226af3)
+
+`validTransitions` was promoted to a package-level variable in `internal/task/task.go`
+for clarity and potential reuse outside the package. ADR-002 was updated to reflect
+the current state machine including the `BLOCKED→READY` transition for parent tasks.
+
+---
+
+## Feature Summary (current state)
+
+| Feature | Status |
+|---|---|
+| Task YAML parsing, batch files | Done |
+| SQLite persistence | Done |
+| REST API (CRUD + lifecycle) | Done |
+| WebSocket real-time events | Done |
+| Claude subprocess execution | Done |
+| Gemini subprocess execution | Done |
+| Explicit load balancing (pickAgent) | Done |
+| Gemini-based model classification | Done |
+| BLOCKED / question-answer / resume | Done |
+| git sandbox isolation | Done |
+| Subtask creation and unblocking | Done |
+| READY state / human accept-reject | Done |
+| Rate-limit and quota tracking | Done |
+| Stale RUNNING recovery on startup | Done |
+| Per-IP rate limiter on elaborate | Done |
+| Web UI (PWA) | Done |
+| Push notifications (PWA) | Planned |
diff --git a/docs/adr/004-multi-agent-routing-and-classification.md b/docs/adr/004-multi-agent-routing-and-classification.md
new file mode 100644
index 0000000..7afb10d
--- /dev/null
+++ b/docs/adr/004-multi-agent-routing-and-classification.md
@@ -0,0 +1,107 @@
+# ADR-004: Multi-Agent Routing and Gemini-Based Classification
+
+## Status
+Accepted
+
+## Context
+
+Claudomator started as a Claude-only system. As Gemini became a viable coding
+agent, the architecture needed to support multiple agent backends without requiring
+operators to manually select an agent or model for each task.
+
+Two distinct problems needed solving:
+
+1. **Which agent should run this task?** — Claude and Gemini have different API
+ quotas and rate limits. When Claude is rate-limited, tasks should flow to
+ Gemini automatically.
+2. **Which model tier should the agent use?** — Both agents offer a spectrum from
+ fast/cheap to slow/powerful models. Using the wrong tier wastes money or
+ produces inferior results.
+
+## Decision
+
+The two problems are solved independently:
+
+### Agent selection: explicit load balancing in code (`pickAgent`)
+
+`pickAgent(SystemStatus)` selects the agent with the fewest active tasks,
+preferring non-rate-limited agents. The algorithm is:
+
+1. First pass: consider only non-rate-limited agents; pick the one with the
+ fewest active tasks (alphabetical tie-break for determinism).
+2. Fallback: if all agents are rate-limited, pick the least-active regardless
+ of rate-limit status.
+
+This is deterministic code, not an AI call. It runs in-process with no I/O and
+cannot fail in ways that would block task execution.
+
+### Model selection: Gemini-based classifier (`Classifier`)
+
+Once the agent is selected, `Classifier.Classify` invokes the Gemini CLI with
+`gemini-2.5-flash-lite` to select the best model tier for the task. The classifier
+receives the task name, instructions, and the required agent type, and returns
+a `Classification` with `agent_type`, `model`, and `reason`.
+
+The classifier uses a cheap, fast model for classification to minimise the cost
+overhead. The response is parsed from JSON, with fallback handling for markdown
+code blocks and credential noise in the output.
+
+### Separation of concerns
+
+These two decisions were initially merged (the classifier picked both agent and
+model). They were separated in commit `e033504` because:
+
+- Load balancing must be reliable even when the Gemini API is unavailable.
+- Classifier failures are non-fatal: if classification fails, the pool logs the
+ error and proceeds with the agent's default model.
+
+### Re-classification on manual restart
+
+When an operator manually restarts a task from a non-`QUEUED` state (e.g. `FAILED`,
+`CANCELLED`), the task goes through `execute()` again and is re-classified. This
+ensures restarts pick up any changes to agent availability or rate-limit status.
+
+## Rationale
+
+- **AI-picks-model**: the model selection decision is genuinely complex and
+ subjective. Using an AI classifier avoids hardcoding heuristics that would need
+ constant tuning.
+- **Code-picks-agent**: load balancing is a scheduling problem with measurable
+ inputs (active task counts, rate-limit deadlines). Delegating this to an AI
+ would introduce unnecessary non-determinism and latency.
+- **Gemini for classification**: using Gemini to classify Claude tasks (and vice
+ versa) prevents circular dependencies. Using the cheapest available Gemini model
+ keeps classification cost negligible.
+
+## Alternatives Considered
+
+- **Operator always picks agent and model**: too much manual overhead. Operators
+ should be able to submit tasks without knowing which agent is currently
+ rate-limited.
+- **Single classifier picks both agent and model**: rejected after operational
+ experience showed that load balancing needs to work even when the Gemini API
+ is unavailable or returning errors.
+- **Round-robin agent selection**: simpler but does not account for rate limits
+ or imbalanced task durations.
+
+## Consequences
+
+- Agent selection is deterministic and testable without mocking AI APIs.
+- Classification failures are logged but non-fatal; the task runs with the
+ agent's default model.
+- The classifier adds ~1–2 seconds of latency to task start (one Gemini API call).
+- Tasks with `agent.type` pre-set in YAML still go through load balancing;
+ `pickAgent` may override the requested type if the requested type is not a
+ registered runner. This is by design: the operator's type hint is overridden
+ by the load balancer to ensure tasks are always routable.
+
+## Relevant Code Locations
+
+| Concern | File |
+|---|---|
+| `pickAgent` | `internal/executor/executor.go` |
+| `Classifier` | `internal/executor/classifier.go` |
+| Load balancing in `execute()` | `internal/executor/executor.go` |
+| Re-classification gate | `internal/api/server.go` (handleRunTask) |
+| `pickAgent` tests | `internal/executor/executor_test.go` |
+| `Classifier` mock test | `internal/executor/classifier_test.go` |
diff --git a/docs/adr/005-sandbox-execution-model.md b/docs/adr/005-sandbox-execution-model.md
new file mode 100644
index 0000000..b374561
--- /dev/null
+++ b/docs/adr/005-sandbox-execution-model.md
@@ -0,0 +1,130 @@
+# ADR-005: Git Sandbox Execution Model
+
+## Status
+Accepted
+
+## Context
+
+Tasks that modify source code need a safe execution environment. Running agents
+directly in the canonical working copy creates several problems:
+
+1. **Concurrent corruption**: multiple agents running in the same directory stomp
+ on each other's changes.
+2. **Partial work leaks**: if a task is cancelled mid-run, half-written files
+ remain in the working tree, blocking other work.
+3. **No rollback**: a failed agent may leave the codebase in a broken state with
+ no clean way to recover without manual `git reset`.
+4. **Audit trail**: changes made by an agent should be visible as discrete,
+ attributable git commits — not as an anonymous blob of working-tree noise.
+
+## Decision
+
+When a task has `agent.project_dir` set, `ClaudeRunner.Run` executes the agent
+in an isolated git clone (a "sandbox") rather than in the project directory
+directly.
+
+### Sandbox lifecycle
+
+```
+project_dir (canonical working copy)
+ |
+ | git clone --no-hardlinks <clone-source> /tmp/claudomator-sandbox-*
+ |
+ v
+sandbox (temp clone)
+ |
+ | agent runs here; commits its changes
+ |
+ | git push (to bare repo "local" or "origin")
+ |
+teardown: verify no uncommitted changes, remove sandbox dir
+```
+
+### Clone source
+
+`sandboxCloneSource` prefers a remote named `"local"` (a local bare repo).
+If not present it falls back to the `"origin"` remote. Using a bare repo
+accepts pushes cleanly; pushing to a non-bare working copy fails when the
+receiving branch is checked out.
+
+### Uncommitted-change enforcement
+
+Before teardown, the runner runs `git status --porcelain` in the sandbox.
+If any uncommitted changes are detected, the task is failed with an error
+message listing the files. The sandbox is **preserved** so the operator
+can inspect or recover the partial work. The error message includes the
+sandbox path.
+
+### Concurrent push conflicts
+
+If two sandboxes try to push at the same time, the second push is rejected
+(`"fetch first"` or `"non-fast-forward"` in the error output). The runner
+detects these signals and performs a fetch → rebase → retry sequence, up to
+a fixed retry limit, before giving up.
+
+### BLOCKED state and sandbox preservation
+
+When an agent exits with a `question.json` file (entering the `BLOCKED`
+state), the sandbox is **not** torn down. The preserved sandbox allows the
+resumed execution to pick up the same working tree state, including any
+in-progress file changes made before the agent asked its question.
+
+Resume executions (`SubmitResume`) skip sandbox setup entirely and run
+directly in `project_dir`, passing `--resume <session-id>` to the agent
+so Claude can continue its previous conversation.
+
+### Session ID propagation on resume
+
+A subtle bug was found and fixed: when a resumed execution is itself blocked
+again (a second BLOCKED→answer→resume cycle), the new execution record must
+carry the **original** `ResumeSessionID`, not the new execution's own UUID.
+If the wrong ID is used, `claude --resume` fails with "No conversation found".
+The fix is in `ClaudeRunner.Run`: if `e.ResumeSessionID != ""`, use it as
+`e.SessionID` rather than `e.ID`.
+
+## Rationale
+
+- **`--no-hardlinks`**: git defaults to hardlinking objects between clone and
+ source when both are on the same filesystem. This causes permission errors
+ when the source is owned by a different user (e.g. `www-data` vs. `root`).
+ The flag forces a full copy.
+- **Bare repo for push target**: non-bare repos reject pushes to checked-out
+ branches. A bare repo (`git init --bare`) accepts all pushes safely.
+- **Preserve sandbox on failure**: partial agent work may be valuable for
+ debugging or resumption. Destroying it immediately on failure was considered
+ and rejected.
+- **Agent must commit**: requiring the agent to commit all changes before
+ exiting ensures git history is always the source of truth. The enforcement
+ check (uncommitted files → FAILED) makes this invariant observable.
+
+## Alternatives Considered
+
+- **Run directly in working copy**: rejected because of concurrent corruption
+ and partial-work leakage.
+- **Copy files instead of git clone**: rejected because the agent needs a
+ working git history (for `git log`, `git blame`, and to push commits back).
+- **Docker/container isolation**: considered for stronger isolation but
+ rejected due to operational complexity, dependency on container runtime,
+ and inability to use the host's claude/gemini credentials.
+
+## Consequences
+
+- Tasks without `project_dir` are unaffected; they run in whatever working
+ directory the server process inherited.
+- If a sandbox's push repeatedly fails (e.g. due to a bare repo that is
+ itself broken), the task is failed with the sandbox preserved.
+- If `/tmp` runs out of space (many large sandboxes), tasks will fail at
+ clone time. This is a known operational risk with no current mitigation.
+- The `project_dir` field in task YAML must point to a git repository with
+ a configured `"local"` or `"origin"` remote that accepts pushes.
+
+## Relevant Code Locations
+
+| Concern | File |
+|---|---|
+| Sandbox setup/teardown | `internal/executor/claude.go` |
+| `setupSandbox`, `teardownSandbox` | `internal/executor/claude.go` |
+| `sandboxCloneSource` | `internal/executor/claude.go` |
+| Resume skips sandbox | `internal/executor/claude.go` (Run) |
+| Session ID propagation fix | `internal/executor/claude.go` (Run) |
+| Sandbox tests | `internal/executor/claude_test.go` |