docs/adr/004-multi-agent-routing-and-classification.md


# ADR-004: Multi-Agent Routing and Gemini-Based Classification

## Status
Accepted

## Context

Claudomator started as a Claude-only system. As Gemini became a viable coding
agent, the architecture needed to support multiple agent backends without requiring
operators to manually select an agent or model for each task.

Two distinct problems needed solving:

1. **Which agent should run this task?** — Claude and Gemini have different API
   quotas and rate limits. When Claude is rate-limited, tasks should flow to
   Gemini automatically.
2. **Which model tier should the agent use?** — Both agents offer a spectrum from
   fast/cheap to slow/powerful models. Using the wrong tier wastes money or
   produces inferior results.

## Decision

The two problems are solved independently:

### Agent selection: explicit load balancing in code (`pickAgent`)

`pickAgent(SystemStatus)` selects the agent with the fewest active tasks,
preferring non-rate-limited agents. The algorithm is:

1. First pass: consider only non-rate-limited agents; pick the one with the
   fewest active tasks (alphabetical tie-break for determinism).
2. Fallback: if all agents are rate-limited, pick the least-active regardless
   of rate-limit status.

This is deterministic code, not an AI call. It runs in-process with no I/O and
cannot fail in ways that would block task execution.
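
The two-pass selection above can be sketched as follows. This is a minimal illustration, not the code in `internal/executor/executor.go`: the fields of `SystemStatus` (`ActiveTasks`, `RateLimited`) and the helper `leastActive` are assumptions made for the example, and the real struct may track rate-limit deadlines rather than booleans.

```go
package main

import (
	"fmt"
	"sort"
)

// SystemStatus is a hypothetical shape used only for this sketch.
type SystemStatus struct {
	ActiveTasks map[string]int  // agent name -> number of running tasks
	RateLimited map[string]bool // agent name -> currently rate-limited
}

// pickAgent returns the least-active agent, preferring agents that are
// not rate-limited. It returns "" when no agents are known.
func pickAgent(s SystemStatus) string {
	names := make([]string, 0, len(s.ActiveTasks))
	for name := range s.ActiveTasks {
		names = append(names, name)
	}
	sort.Strings(names) // alphabetical order makes ties deterministic

	// leastActive scans the sorted names and keeps the first agent with
	// the strictly lowest task count among those accepted by include.
	leastActive := func(include func(string) bool) string {
		best := ""
		for _, name := range names {
			if !include(name) {
				continue
			}
			if best == "" || s.ActiveTasks[name] < s.ActiveTasks[best] {
				best = name
			}
		}
		return best
	}

	// First pass: only agents that are not rate-limited.
	if agent := leastActive(func(n string) bool { return !s.RateLimited[n] }); agent != "" {
		return agent
	}
	// Fallback: every agent is rate-limited, so ignore rate-limit status.
	return leastActive(func(string) bool { return true })
}

func main() {
	status := SystemStatus{
		ActiveTasks: map[string]int{"claude": 1, "gemini": 3},
		RateLimited: map[string]bool{"claude": true},
	}
	// claude is less busy but rate-limited, so gemini is chosen.
	fmt.Println(pickAgent(status)) // gemini
}
```

Because the function reads only in-memory state, it can be unit-tested with plain struct literals and no mocks.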

### Model selection: Gemini-based classifier (`Classifier`)

Once the agent is selected, `Classifier.Classify` invokes the Gemini CLI with
`gemini-2.5-flash-lite` to select the best model tier for the task. The classifier
receives the task name, instructions, and the required agent type, and returns
a `Classification` with `agent_type`, `model`, and `reason`.

The classifier deliberately runs on a cheap, fast model to minimise cost
overhead. The response is parsed as JSON, with fallback handling for output
that is wrapped in markdown code blocks or mixed with credential noise.

### Separation of concerns

These two decisions were initially merged (the classifier picked both agent and
model). They were separated in commit `e033504` because:

- Load balancing must be reliable even when the Gemini API is unavailable.
- Classifier failures are non-fatal: if classification fails, the pool logs the
  error and proceeds with the agent's default model.

### Re-classification on manual restart

When an operator manually restarts a task from a non-`QUEUED` state (e.g. `FAILED`,
`CANCELLED`), the task goes through `execute()` again, so an agent is selected
afresh and the task is re-classified. Restarts therefore pick up any changes to
agent availability or rate-limit status.

## Rationale

- **AI-picks-model**: the model selection decision is genuinely complex and
  subjective. Using an AI classifier avoids hardcoding heuristics that would need
  constant tuning.
- **Code-picks-agent**: load balancing is a scheduling problem with measurable
  inputs (active task counts, rate-limit deadlines). Delegating this to an AI
  would introduce unnecessary non-determinism and latency.
- **Gemini for classification**: using Gemini to classify Claude tasks (and vice
  versa) prevents circular dependencies. Using the cheapest available Gemini model
  keeps classification cost negligible.

## Alternatives Considered

- **Operator always picks agent and model**: too much manual overhead. Operators
  should be able to submit tasks without knowing which agent is currently
  rate-limited.
- **Single classifier picks both agent and model**: rejected after operational
  experience showed that load balancing needs to work even when the Gemini API
  is unavailable or returning errors.
- **Round-robin agent selection**: simpler but does not account for rate limits
  or imbalanced task durations.

## Consequences

- Agent selection is deterministic and testable without mocking AI APIs.
- Classification failures are logged but non-fatal; the task runs with the
  agent's default model.
- The classifier adds ~1–2 seconds of latency to task start (one Gemini API call).
- Tasks with `agent.type` pre-set in YAML still go through load balancing.
  `pickAgent` may override the requested type if it is not a registered runner.
  This is by design: the operator's type is treated as a hint, and the load
  balancer has the final say so that tasks are always routable.

## Relevant Code Locations

| Concern | File |
|---|---|
| `pickAgent` | `internal/executor/executor.go` |
| `Classifier` | `internal/executor/classifier.go` |
| Load balancing in `execute()` | `internal/executor/executor.go` |
| Re-classification gate | `internal/api/server.go` (handleRunTask) |
| `pickAgent` tests | `internal/executor/executor_test.go` |
| `Classifier` mock test | `internal/executor/classifier_test.go` |