3 files changed, 240 insertions, 1 deletions
diff --git a/docs/adr/002-task-state-machine.md b/docs/adr/002-task-state-machine.md
index 1d41619..310c337 100644
--- a/docs/adr/002-task-state-machine.md
+++ b/docs/adr/002-task-state-machine.md
@@ -23,7 +23,7 @@ execution (subprocess), user interaction (review, Q&A), retries, and cancellatio
 | `BUDGET_EXCEEDED` | Exceeded `max_budget_usd` (terminal) |
 | `BLOCKED` | Agent paused and wrote a question file; awaiting user answer |
 
-Terminal states with no outgoing transitions: `COMPLETED`, `CANCELLED`, `BUDGET_EXCEEDED`.
+True terminal state (no outgoing transitions): `COMPLETED`. All other non-success states (`CANCELLED`, `FAILED`, `TIMED_OUT`, `BUDGET_EXCEEDED`) may transition back to `QUEUED` to restart or retry.
 
 ## State Transition Diagram
 
@@ -77,6 +77,8 @@ Terminal states with no outgoing transitions: `COMPLETED`, `CANCELLED`, `BUDGET_
 | `READY` | `PENDING` | `POST /api/tasks/{id}/reject` (with optional comment) |
 | `FAILED` | `QUEUED` | Retry (manual re-run via `POST /api/tasks/{id}/run`) |
 | `TIMED_OUT` | `QUEUED` | `POST /api/tasks/{id}/resume` (resumes with session ID) |
+| `CANCELLED` | `QUEUED` | Restart (manual re-run via `POST /api/tasks/{id}/run`) |
+| `BUDGET_EXCEEDED` | `QUEUED` | Retry (manual re-run via `POST /api/tasks/{id}/run`) |
 | `BLOCKED` | `QUEUED` | `POST /api/tasks/{id}/answer` (resumes with user answer) |
 | `BLOCKED` | `READY` | All subtasks reached `COMPLETED` (parent task unblocked by subtask completion watcher) |
 
diff --git a/docs/adr/004-multi-agent-routing-and-classification.md b/docs/adr/004-multi-agent-routing-and-classification.md
new file mode 100644
index 0000000..7afb10d
--- /dev/null
+++ b/docs/adr/004-multi-agent-routing-and-classification.md
@@ -0,0 +1,107 @@
+# ADR-004: Multi-Agent Routing and Gemini-Based Classification
+
+## Status
+Accepted
+
+## Context
+
+Claudomator started as a Claude-only system. As Gemini became a viable coding
+agent, the architecture needed to support multiple agent backends without requiring
+operators to manually select an agent or model for each task.
+
+Two distinct problems needed solving:
+
+1. **Which agent should run this task?** — Claude and Gemini have different API
+   quotas and rate limits. When Claude is rate-limited, tasks should flow to
+   Gemini automatically.
+2. **Which model tier should the agent use?** — Both agents offer a spectrum from
+   fast/cheap to slow/powerful models. Using the wrong tier wastes money or
+   produces inferior results.
+
+## Decision
+
+The two problems are solved independently:
+
+### Agent selection: explicit load balancing in code (`pickAgent`)
+
+`pickAgent(SystemStatus)` selects the agent with the fewest active tasks,
+preferring non-rate-limited agents. The algorithm is:
+
+1. First pass: consider only non-rate-limited agents; pick the one with the
+   fewest active tasks (alphabetical tie-break for determinism).
+2. Fallback: if all agents are rate-limited, pick the least-active regardless
+   of rate-limit status.
+
+This is deterministic code, not an AI call. It runs in-process with no I/O and
+cannot fail in ways that would block task execution.
+
+### Model selection: Gemini-based classifier (`Classifier`)
+
+Once the agent is selected, `Classifier.Classify` invokes the Gemini CLI with
+`gemini-2.5-flash-lite` to select the best model tier for the task. The classifier
+receives the task name, instructions, and the required agent type, and returns
+a `Classification` with `agent_type`, `model`, and `reason`.
+
+The classifier uses a cheap, fast model for classification to minimise the cost
+overhead. The response is parsed from JSON, with fallback handling for markdown
+code blocks and credential noise in the output.
+
+### Separation of concerns
+
+These two decisions were initially merged (the classifier picked both agent and
+model). They were separated in commit `e033504` because:
+
+- Load balancing must be reliable even when the Gemini API is unavailable.
+- Classifier failures are non-fatal: if classification fails, the pool logs the
+  error and proceeds with the agent's default model.
+
+### Re-classification on manual restart
+
+When an operator manually restarts a task from a non-`QUEUED` state (e.g. `FAILED`,
+`CANCELLED`), the task goes through `execute()` again and is re-classified. This
+ensures restarts pick up any changes to agent availability or rate-limit status.
+
+## Rationale
+
+- **AI-picks-model**: the model selection decision is genuinely complex and
+  subjective. Using an AI classifier avoids hardcoding heuristics that would need
+  constant tuning.
+- **Code-picks-agent**: load balancing is a scheduling problem with measurable
+  inputs (active task counts, rate-limit deadlines). Delegating this to an AI
+  would introduce unnecessary non-determinism and latency.
+- **Gemini for classification**: using Gemini to classify Claude tasks (and vice
+  versa) prevents circular dependencies. Using the cheapest available Gemini model
+  keeps classification cost negligible.
+
+## Alternatives Considered
+
+- **Operator always picks agent and model**: too much manual overhead. Operators
+  should be able to submit tasks without knowing which agent is currently
+  rate-limited.
+- **Single classifier picks both agent and model**: rejected after operational
+  experience showed that load balancing needs to work even when the Gemini API
+  is unavailable or returning errors.
+- **Round-robin agent selection**: simpler but does not account for rate limits
+  or imbalanced task durations.
+
+## Consequences
+
+- Agent selection is deterministic and testable without mocking AI APIs.
+- Classification failures are logged but non-fatal; the task runs with the
+  agent's default model.
+- The classifier adds ~1–2 seconds of latency to task start (one Gemini API call).
+- Tasks with `agent.type` pre-set in YAML still go through load balancing;
+  `pickAgent` may override the requested type if the requested type is not a
+  registered runner. This is by design: the operator's type hint is overridden
+  by the load balancer to ensure tasks are always routable.
+
+## Relevant Code Locations
+
+| Concern | File |
+|---|---|
+| `pickAgent` | `internal/executor/executor.go` |
+| `Classifier` | `internal/executor/classifier.go` |
+| Load balancing in `execute()` | `internal/executor/executor.go` |
+| Re-classification gate | `internal/api/server.go` (handleRunTask) |
+| `pickAgent` tests | `internal/executor/executor_test.go` |
+| `Classifier` mock test | `internal/executor/classifier_test.go` |
diff --git a/docs/adr/005-sandbox-execution-model.md b/docs/adr/005-sandbox-execution-model.md
new file mode 100644
index 0000000..b374561
--- /dev/null
+++ b/docs/adr/005-sandbox-execution-model.md
@@ -0,0 +1,130 @@
+# ADR-005: Git Sandbox Execution Model
+
+## Status
+Accepted
+
+## Context
+
+Tasks that modify source code need a safe execution environment. Running agents
+directly in the canonical working copy creates several problems:
+
+1. **Concurrent corruption**: multiple agents running in the same directory stomp
+   on each other's changes.
+2. **Partial work leaks**: if a task is cancelled mid-run, half-written files
+   remain in the working tree, blocking other work.
+3. **No rollback**: a failed agent may leave the codebase in a broken state with
+   no clean way to recover without manual `git reset`.
+4. **Audit trail**: changes made by an agent should be visible as discrete,
+   attributable git commits — not as an anonymous blob of working-tree noise.
+
+## Decision
+
+When a task has `agent.project_dir` set, `ClaudeRunner.Run` executes the agent
+in an isolated git clone (a "sandbox") rather than in the project directory
+directly.
+
+### Sandbox lifecycle
+
+```
+project_dir (canonical working copy)
+    |
+    | git clone --no-hardlinks <clone-source> /tmp/claudomator-sandbox-*
+    |
+    v
+sandbox (temp clone)
+    |
+    | agent runs here; commits its changes
+    |
+    | git push (to bare repo "local" or "origin")
+    |
+teardown: verify no uncommitted changes, remove sandbox dir
+```
+
+### Clone source
+
+`sandboxCloneSource` prefers a remote named `"local"` (a local bare repo).
+If not present it falls back to the `"origin"` remote. Using a bare repo
+accepts pushes cleanly; pushing to a non-bare working copy fails when the
+receiving branch is checked out.
+
+### Uncommitted-change enforcement
+
+Before teardown, the runner runs `git status --porcelain` in the sandbox.
+If any uncommitted changes are detected, the task is failed with an error
+message listing the files. The sandbox is **preserved** so the operator
+can inspect or recover the partial work. The error message includes the
+sandbox path.
+
+### Concurrent push conflicts
+
+If two sandboxes try to push at the same time, the second push is rejected
+(`"fetch first"` or `"non-fast-forward"` in the error output). The runner
+detects these signals and performs a fetch → rebase → retry sequence, up to
+a fixed retry limit, before giving up.
+
+### BLOCKED state and sandbox preservation
+
+When an agent exits with a `question.json` file (entering the `BLOCKED`
+state), the sandbox is **not** torn down. The preserved sandbox allows the
+resumed execution to pick up the same working tree state, including any
+in-progress file changes made before the agent asked its question.
+
+Resume executions (`SubmitResume`) skip sandbox setup entirely and run
+directly in `project_dir`, passing `--resume <session-id>` to the agent
+so Claude can continue its previous conversation.
+
+### Session ID propagation on resume
+
+A subtle bug was found and fixed: when a resumed execution is itself blocked
+again (a second BLOCKED→answer→resume cycle), the new execution record must
+carry the **original** `ResumeSessionID`, not the new execution's own UUID.
+If the wrong ID is used, `claude --resume` fails with "No conversation found".
+The fix is in `ClaudeRunner.Run`: if `e.ResumeSessionID != ""`, use it as
+`e.SessionID` rather than `e.ID`.
+
+## Rationale
+
+- **`--no-hardlinks`**: git defaults to hardlinking objects between clone and
+  source when both are on the same filesystem. This causes permission errors
+  when the source is owned by a different user (e.g. `www-data` vs. `root`).
+  The flag forces a full copy.
+- **Bare repo for push target**: non-bare repos reject pushes to checked-out
+  branches. A bare repo (`git init --bare`) accepts all pushes safely.
+- **Preserve sandbox on failure**: partial agent work may be valuable for
+  debugging or resumption. Destroying it immediately on failure was considered
+  and rejected.
+- **Agent must commit**: requiring the agent to commit all changes before
+  exiting ensures git history is always the source of truth. The enforcement
+  check (uncommitted files → FAILED) makes this invariant observable.
+
+## Alternatives Considered
+
+- **Run directly in working copy**: rejected because of concurrent corruption
+  and partial-work leakage.
+- **Copy files instead of git clone**: rejected because the agent needs a
+  working git history (for `git log`, `git blame`, and to push commits back).
+- **Docker/container isolation**: considered for stronger isolation but
+  rejected due to operational complexity, dependency on container runtime,
+  and inability to use the host's claude/gemini credentials.
+
+## Consequences
+
+- Tasks without `project_dir` are unaffected; they run in whatever working
+  directory the server process inherited.
+- If a sandbox's push repeatedly fails (e.g. due to a bare repo that is
+  itself broken), the task is failed with the sandbox preserved.
+- If `/tmp` runs out of space (many large sandboxes), tasks will fail at
+  clone time. This is a known operational risk with no current mitigation.
+- The `project_dir` field in task YAML must point to a git repository with
+  a configured `"local"` or `"origin"` remote that accepts pushes.
+
+## Relevant Code Locations
+
+| Concern | File |
+|---|---|
+| Sandbox setup/teardown | `internal/executor/claude.go` |
+| `setupSandbox`, `teardownSandbox` | `internal/executor/claude.go` |
+| `sandboxCloneSource` | `internal/executor/claude.go` |
+| Resume skips sandbox | `internal/executor/claude.go` (Run) |
+| Session ID propagation fix | `internal/executor/claude.go` (Run) |
+| Sandbox tests | `internal/executor/claude_test.go` |