diff options
Diffstat (limited to 'docs/adr')
| -rw-r--r-- | docs/adr/002-task-state-machine.md | 4 | ||||
| -rw-r--r-- | docs/adr/004-multi-agent-routing-and-classification.md | 107 | ||||
| -rw-r--r-- | docs/adr/005-sandbox-execution-model.md | 130 |
3 files changed, 240 insertions, 1 deletions
diff --git a/docs/adr/002-task-state-machine.md b/docs/adr/002-task-state-machine.md index 1d41619..310c337 100644 --- a/docs/adr/002-task-state-machine.md +++ b/docs/adr/002-task-state-machine.md @@ -23,7 +23,7 @@ execution (subprocess), user interaction (review, Q&A), retries, and cancellatio | `BUDGET_EXCEEDED` | Exceeded `max_budget_usd` (terminal) | | `BLOCKED` | Agent paused and wrote a question file; awaiting user answer | -Terminal states with no outgoing transitions: `COMPLETED`, `CANCELLED`, `BUDGET_EXCEEDED`. +True terminal state (no outgoing transitions): `COMPLETED`. All other non-success states (`CANCELLED`, `FAILED`, `TIMED_OUT`, `BUDGET_EXCEEDED`) may transition back to `QUEUED` to restart or retry. ## State Transition Diagram @@ -77,6 +77,8 @@ Terminal states with no outgoing transitions: `COMPLETED`, `CANCELLED`, `BUDGET_ | `READY` | `PENDING` | `POST /api/tasks/{id}/reject` (with optional comment) | | `FAILED` | `QUEUED` | Retry (manual re-run via `POST /api/tasks/{id}/run`) | | `TIMED_OUT` | `QUEUED` | `POST /api/tasks/{id}/resume` (resumes with session ID) | +| `CANCELLED` | `QUEUED` | Restart (manual re-run via `POST /api/tasks/{id}/run`) | +| `BUDGET_EXCEEDED` | `QUEUED` | Retry (manual re-run via `POST /api/tasks/{id}/run`) | | `BLOCKED` | `QUEUED` | `POST /api/tasks/{id}/answer` (resumes with user answer) | | `BLOCKED` | `READY` | All subtasks reached `COMPLETED` (parent task unblocked by subtask completion watcher) | diff --git a/docs/adr/004-multi-agent-routing-and-classification.md b/docs/adr/004-multi-agent-routing-and-classification.md new file mode 100644 index 0000000..7afb10d --- /dev/null +++ b/docs/adr/004-multi-agent-routing-and-classification.md @@ -0,0 +1,107 @@ +# ADR-004: Multi-Agent Routing and Gemini-Based Classification + +## Status +Accepted + +## Context + +Claudomator started as a Claude-only system. As Gemini became a viable coding +agent, the architecture needed to support multiple agent backends without requiring +operators to manually select an agent or model for each task. + +Two distinct problems needed solving: + +1. **Which agent should run this task?** — Claude and Gemini have different API + quotas and rate limits. When Claude is rate-limited, tasks should flow to + Gemini automatically. +2. **Which model tier should the agent use?** — Both agents offer a spectrum from + fast/cheap to slow/powerful models. Using the wrong tier wastes money or + produces inferior results. + +## Decision + +The two problems are solved independently: + +### Agent selection: explicit load balancing in code (`pickAgent`) + +`pickAgent(SystemStatus)` selects the agent with the fewest active tasks, +preferring non-rate-limited agents. The algorithm is: + +1. First pass: consider only non-rate-limited agents; pick the one with the + fewest active tasks (alphabetical tie-break for determinism). +2. Fallback: if all agents are rate-limited, pick the least-active regardless + of rate-limit status. + +This is deterministic code, not an AI call. It runs in-process with no I/O and +cannot fail in ways that would block task execution. + +### Model selection: Gemini-based classifier (`Classifier`) + +Once the agent is selected, `Classifier.Classify` invokes the Gemini CLI with +`gemini-2.5-flash-lite` to select the best model tier for the task. The classifier +receives the task name, instructions, and the required agent type, and returns +a `Classification` with `agent_type`, `model`, and `reason`. + +The classifier uses a cheap, fast model for classification to minimise the cost +overhead. The response is parsed from JSON, with fallback handling for markdown +code blocks and credential noise in the output. + +### Separation of concerns + +These two decisions were initially merged (the classifier picked both agent and +model). They were separated in commit `e033504` because: + +- Load balancing must be reliable even when the Gemini API is unavailable. +- Classifier failures are non-fatal: if classification fails, the pool logs the + error and proceeds with the agent's default model. + +### Re-classification on manual restart + +When an operator manually restarts a task from a non-`QUEUED` state (e.g. `FAILED`, +`CANCELLED`), the task goes through `execute()` again and is re-classified. This +ensures restarts pick up any changes to agent availability or rate-limit status. + +## Rationale + +- **AI-picks-model**: the model selection decision is genuinely complex and + subjective. Using an AI classifier avoids hardcoding heuristics that would need + constant tuning. +- **Code-picks-agent**: load balancing is a scheduling problem with measurable + inputs (active task counts, rate-limit deadlines). Delegating this to an AI + would introduce unnecessary non-determinism and latency. +- **Gemini for classification**: using Gemini to classify Claude tasks (and vice + versa) prevents circular dependencies. Using the cheapest available Gemini model + keeps classification cost negligible. + +## Alternatives Considered + +- **Operator always picks agent and model**: too much manual overhead. Operators + should be able to submit tasks without knowing which agent is currently + rate-limited. +- **Single classifier picks both agent and model**: rejected after operational + experience showed that load balancing needs to work even when the Gemini API + is unavailable or returning errors. +- **Round-robin agent selection**: simpler but does not account for rate limits + or imbalanced task durations. + +## Consequences + +- Agent selection is deterministic and testable without mocking AI APIs. +- Classification failures are logged but non-fatal; the task runs with the + agent's default model. +- The classifier adds ~1–2 seconds of latency to task start (one Gemini API call). +- Tasks with `agent.type` pre-set in YAML still go through load balancing; + `pickAgent` may override the requested type if the requested type is not a + registered runner. This is by design: the operator's type hint is overridden + by the load balancer to ensure tasks are always routable. + +## Relevant Code Locations + +| Concern | File | +|---|---| +| `pickAgent` | `internal/executor/executor.go` | +| `Classifier` | `internal/executor/classifier.go` | +| Load balancing in `execute()` | `internal/executor/executor.go` | +| Re-classification gate | `internal/api/server.go` (handleRunTask) | +| `pickAgent` tests | `internal/executor/executor_test.go` | +| `Classifier` mock test | `internal/executor/classifier_test.go` | diff --git a/docs/adr/005-sandbox-execution-model.md b/docs/adr/005-sandbox-execution-model.md new file mode 100644 index 0000000..b374561 --- /dev/null +++ b/docs/adr/005-sandbox-execution-model.md @@ -0,0 +1,130 @@ +# ADR-005: Git Sandbox Execution Model + +## Status +Accepted + +## Context + +Tasks that modify source code need a safe execution environment. Running agents +directly in the canonical working copy creates several problems: + +1. **Concurrent corruption**: multiple agents running in the same directory stomp + on each other's changes. +2. **Partial work leaks**: if a task is cancelled mid-run, half-written files + remain in the working tree, blocking other work. +3. **No rollback**: a failed agent may leave the codebase in a broken state with + no clean way to recover without manual `git reset`. +4. **Audit trail**: changes made by an agent should be visible as discrete, + attributable git commits — not as an anonymous blob of working-tree noise. + +## Decision + +When a task has `agent.project_dir` set, `ClaudeRunner.Run` executes the agent +in an isolated git clone (a "sandbox") rather than in the project directory +directly. + +### Sandbox lifecycle + +``` +project_dir (canonical working copy) + | + | git clone --no-hardlinks <clone-source> /tmp/claudomator-sandbox-* + | + v +sandbox (temp clone) + | + | agent runs here; commits its changes + | + | git push (to bare repo "local" or "origin") + | +teardown: verify no uncommitted changes, remove sandbox dir +``` + +### Clone source + +`sandboxCloneSource` prefers a remote named `"local"` (a local bare repo). +If not present it falls back to the `"origin"` remote. Using a bare repo +accepts pushes cleanly; pushing to a non-bare working copy fails when the +receiving branch is checked out. + +### Uncommitted-change enforcement + +Before teardown, the runner runs `git status --porcelain` in the sandbox. +If any uncommitted changes are detected, the task is failed with an error +message listing the files. The sandbox is **preserved** so the operator +can inspect or recover the partial work. The error message includes the +sandbox path. + +### Concurrent push conflicts + +If two sandboxes try to push at the same time, the second push is rejected +(`"fetch first"` or `"non-fast-forward"` in the error output). The runner +detects these signals and performs a fetch → rebase → retry sequence, up to +a fixed retry limit, before giving up. + +### BLOCKED state and sandbox preservation + +When an agent exits with a `question.json` file (entering the `BLOCKED` +state), the sandbox is **not** torn down. The preserved sandbox allows the +resumed execution to pick up the same working tree state, including any +in-progress file changes made before the agent asked its question. + +Resume executions (`SubmitResume`) skip sandbox setup entirely and run +directly in `project_dir`, passing `--resume <session-id>` to the agent +so Claude can continue its previous conversation. + +### Session ID propagation on resume + +A subtle bug was found and fixed: when a resumed execution is itself blocked +again (a second BLOCKED→answer→resume cycle), the new execution record must +carry the **original** `ResumeSessionID`, not the new execution's own UUID. +If the wrong ID is used, `claude --resume` fails with "No conversation found". +The fix is in `ClaudeRunner.Run`: if `e.ResumeSessionID != ""`, use it as +`e.SessionID` rather than `e.ID`. + +## Rationale + +- **`--no-hardlinks`**: git defaults to hardlinking objects between clone and + source when both are on the same filesystem. This causes permission errors + when the source is owned by a different user (e.g. `www-data` vs. `root`). + The flag forces a full copy. +- **Bare repo for push target**: non-bare repos reject pushes to checked-out + branches. A bare repo (`git init --bare`) accepts all pushes safely. +- **Preserve sandbox on failure**: partial agent work may be valuable for + debugging or resumption. Destroying it immediately on failure was considered + and rejected. +- **Agent must commit**: requiring the agent to commit all changes before + exiting ensures git history is always the source of truth. The enforcement + check (uncommitted files → FAILED) makes this invariant observable. + +## Alternatives Considered + +- **Run directly in working copy**: rejected because of concurrent corruption + and partial-work leakage. +- **Copy files instead of git clone**: rejected because the agent needs a + working git history (for `git log`, `git blame`, and to push commits back). +- **Docker/container isolation**: considered for stronger isolation but + rejected due to operational complexity, dependency on container runtime, + and inability to use the host's claude/gemini credentials. + +## Consequences + +- Tasks without `project_dir` are unaffected; they run in whatever working + directory the server process inherited. +- If a sandbox's push repeatedly fails (e.g. due to a bare repo that is + itself broken), the task is failed with the sandbox preserved. +- If `/tmp` runs out of space (many large sandboxes), tasks will fail at + clone time. This is a known operational risk with no current mitigation. +- The `project_dir` field in task YAML must point to a git repository with + a configured `"local"` or `"origin"` remote that accepts pushes. + +## Relevant Code Locations + +| Concern | File | +|---|---| +| Sandbox setup/teardown | `internal/executor/claude.go` | +| `setupSandbox`, `teardownSandbox` | `internal/executor/claude.go` | +| `sandboxCloneSource` | `internal/executor/claude.go` | +| Resume skips sandbox | `internal/executor/claude.go` (Run) | +| Session ID propagation fix | `internal/executor/claude.go` (Run) | +| Sandbox tests | `internal/executor/claude_test.go` | |
