summaryrefslogtreecommitdiff
path: root/docs/adr/002-task-state-machine.md
blob: 1d4161984818aef6670b3e7ed8ab6ded45514082 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
# ADR-002: Task State Machine Design

## Status
Accepted

## Context
Claudomator tasks move through a well-defined lifecycle: from creation through
queuing, execution, and final resolution. The lifecycle must handle asynchronous
execution (subprocess), user interaction (review, Q&A), retries, and cancellation.

## States

| State | Meaning |
|---|---|
| `PENDING` | Task created; not yet submitted for execution |
| `QUEUED` | Submitted to executor pool; goroutine slot may be waiting |
| `RUNNING` | Claude subprocess is actively executing |
| `READY` | Top-level task completed execution; awaiting user accept/reject |
| `COMPLETED` | Task is fully done (terminal) |
| `FAILED` | Execution error occurred; eligible for retry (terminal if retries exhausted) |
| `TIMED_OUT` | Task exceeded configured timeout; resumable or retryable (terminal if not resumed) |
| `CANCELLED` | Explicitly cancelled by user or API (terminal) |
| `BUDGET_EXCEEDED` | Exceeded `max_budget_usd` (terminal) |
| `BLOCKED` | Agent paused and wrote a question file; awaiting user answer |

Terminal states with no outgoing transitions: `COMPLETED`, `CANCELLED`, `BUDGET_EXCEEDED`.

## State Transition Diagram

```
                     ┌─────────┐
                     │ PENDING │◄───────────────────────────────┐
                     └────┬────┘                                │
          POST /run        │              POST /reject           │
          POST /cancel     │                                     │
                      ┌────▼────┐                        ┌──────┴─────┐
             ┌────────┤  QUEUED ├─────────────┐          │   READY    │◄─────────┐
             │        └────┬────┘             │          └──────┬─────┘          │
     POST /cancel          │                  │    POST /accept │                 │
             │        pool picks up           │                 ▼                 │
             ▼             ▼                  │          ┌─────────────┐          │
      ┌──────────┐   ┌─────────┐             │          │  COMPLETED  │          │
      │CANCELLED │◄──┤ RUNNING ├──────────────┘          └─────────────┘          │
      └──────────┘   └────┬────┘                                                  │
                          │                                                        │
          ┌───────────────┼───────────────────┬───────────────┐                  │
          │               │                   │               │                  │
          ▼               ▼                   ▼               ▼                  │
    ┌──────────┐   ┌──────────────┐   ┌─────────────┐  ┌─────────┐              │
    │  FAILED  │   │  TIMED_OUT   │   │    BUDGET   │  │ BLOCKED │──all subtasks─┘
    └────┬─────┘   └──────┬───────┘   │  _EXCEEDED  │  └────┬────┘  COMPLETED
         │                │           └─────────────┘       │
  retry  │        resume/ │                          POST /answer
         │        retry   │                                  │
         └────────┬───────┘                                  │
                  ▼                                           │
               QUEUED ◄────────────────────────────────────┘
```

### Transition Table

| From | To | Trigger |
|---|---|---|
| `PENDING` | `QUEUED` | `POST /api/tasks/{id}/run` |
| `PENDING` | `CANCELLED` | `POST /api/tasks/{id}/cancel` |
| `QUEUED` | `RUNNING` | Pool goroutine starts execution |
| `QUEUED` | `CANCELLED` | `POST /api/tasks/{id}/cancel` |
| `RUNNING` | `READY` | Runner exits 0, no question file, top-level task (`parent_task_id == ""`), and task has no subtasks |
| `RUNNING` | `BLOCKED` | Runner exits 0, no question file, top-level task (`parent_task_id == ""`), and task has subtasks |
| `RUNNING` | `COMPLETED` | Runner exits 0, no question file, subtask (`parent_task_id != ""`) |
| `RUNNING` | `FAILED` | Runner exits non-zero or stream signals `is_error: true` |
| `RUNNING` | `TIMED_OUT` | Context deadline exceeded (`context.DeadlineExceeded`) |
| `RUNNING` | `CANCELLED` | Context cancelled (`context.Canceled`) |
| `RUNNING` | `BUDGET_EXCEEDED` | `--max-budget-usd` exceeded (signalled by runner) |
| `RUNNING` | `BLOCKED` | Runner exits 0 but left a `question.json` file in log dir |
| `READY` | `COMPLETED` | `POST /api/tasks/{id}/accept` |
| `READY` | `PENDING` | `POST /api/tasks/{id}/reject` (with optional comment) |
| `FAILED` | `QUEUED` | Retry (manual re-run via `POST /api/tasks/{id}/run`) |
| `TIMED_OUT` | `QUEUED` | `POST /api/tasks/{id}/resume` (resumes with session ID) |
| `BLOCKED` | `QUEUED` | `POST /api/tasks/{id}/answer` (resumes with user answer) |
| `BLOCKED` | `READY` | All subtasks reached `COMPLETED` (parent task unblocked by subtask completion watcher) |

## Implementation

**Validation:** `task.ValidTransition(from, to State) bool`
(`internal/task/task.go:93`) — called by API handlers before every state change.

**State writes:** `storage.DB.UpdateTaskState(id, state)` — single source of
write; called by both API handlers and the executor pool.

**Execution outcome → state mapping** (in `executor.Pool.execute` and `executeResume`):

```
runner.Run() returns nil   AND parent_task_id == "" AND no subtasks → READY
runner.Run() returns nil   AND parent_task_id == "" AND has subtasks → BLOCKED  (awaiting subtask completion)
runner.Run() returns nil   AND parent_task_id != ""                  → COMPLETED
runner.Run() returns *BlockedError                                   → BLOCKED  (question stored)
ctx.Err() == DeadlineExceeded                                        → TIMED_OUT
ctx.Err() == Canceled                                                → CANCELLED
any other error                                                      → FAILED
all subtasks reach COMPLETED  (from BLOCKED parent)                  → READY
```

## Key Invariants

1. **`READY` is top-level only.** Subtasks (tasks with `parent_task_id != ""`) skip
   `READY` and go directly to `COMPLETED` on success. This allows parent review
   flows without requiring per-subtask acknowledgement.

2. **`BLOCKED` requires a `session_id`.** The executor persists `session_id` in
   the execution record at start time. `POST /answer` retrieves it via
   `GetLatestExecution` to resume the correct Claude session.

3. **`DELETE` is blocked for `RUNNING` and `QUEUED` tasks.** (`server.go:127`)

4. **`CANCELLED` is reachable from `PENDING`, `QUEUED`, and `RUNNING`.** For
   `RUNNING` tasks, `pool.Cancel(taskID)` sends a context cancellation signal;
   the state transition happens asynchronously when the goroutine exits. For
   `PENDING`/`QUEUED` tasks, `UpdateTaskState` is called directly.

5. **Dependency waiting.** Tasks with `depends_on` stay conceptually in `QUEUED`
   (goroutine running but not executing) while `waitForDependencies` polls until
   all dependencies reach `COMPLETED`. If any dependency reaches a terminal
   failure state, the waiting task transitions to `FAILED`.

6. **Retry re-enters at `QUEUED`.** `FAILED → QUEUED` and `TIMED_OUT → QUEUED`
   are the only back-edges. Retry is manual (caller must call `/run` again); the
   `RetryConfig.MaxAttempts` field exists but enforcement is left to callers.

7. **Parent task with subtasks goes `BLOCKED` on runner exit.** A top-level task
   that has subtasks transitions to `BLOCKED` (not `READY`) when its runner exits
   successfully. It unblocks to `READY` only when all its subtasks reach
   `COMPLETED`. This allows the parent to dispatch subtasks and wait for them
   before presenting the result for user review.

## Side Effects on Transition

| Transition | Side effects |
|---|---|
| → `QUEUED` | `pool.Submit()` called; execution record created in DB with log paths |
| → `RUNNING` | `store.UpdateTaskState`; cancel func registered in `pool.cancels` |
| → `BLOCKED` | `store.UpdateTaskQuestion(taskID, questionJSON)`; session_id in execution |
| → `READY/COMPLETED/FAILED/TIMED_OUT/CANCELLED/BUDGET_EXCEEDED` | Execution end time set; `store.UpdateExecution`; result broadcast via WebSocket (`pool.resultCh → forwardResults → hub.Broadcast`) |
| `READY → PENDING` (reject) | `store.RejectTask(id, comment)` — stores `rejection_comment` |
| `BLOCKED → QUEUED` (answer) | `store.UpdateTaskQuestion(taskID, "")` clears question; new `storage.Execution` created with `ResumeSessionID` + `ResumeAnswer` |

## WebSocket Events

Task lifecycle changes produce WebSocket broadcasts to all connected clients:

- `task_completed` — emitted on any terminal or quasi-terminal transition (READY,
  COMPLETED, FAILED, TIMED_OUT, CANCELLED, BUDGET_EXCEEDED, BLOCKED)
- `task_question` — emitted by `BroadcastQuestion` when an agent uses
  `AskUserQuestion` (currently unused in the file-based flow; the file-based
  mechanism is the primary BLOCKED path)

## Known Limitations and Edge Cases

- **`BUDGET_EXCEEDED` transition.** `BUDGET_EXCEEDED` appears in `terminalFailureStates`
  (used by `waitForDependencies`) but has no outgoing transitions in `ValidTransition`,
  making it permanently terminal. There is no `/resume` endpoint for it.

- **Retry enforcement.** `RetryConfig.MaxAttempts` is stored but not enforced by
  the pool. The API allows unlimited manual retries via `POST /run` from `FAILED`.

- **`TIMED_OUT` resumability.** Only `POST /api/tasks/{id}/resume` resumes timed-out
  tasks. Unlike BLOCKED, there is no question/answer — the resume message is
  hardcoded: _"Your previous execution timed out. Please continue where you left off."_

- **Concurrent cancellation race.** If a task transitions `RUNNING → COMPLETED`
  and `POST /cancel` is called in the same window, `pool.Cancel()` may return
  `true` (cancel func still registered) even though the goroutine is finishing.
  The goroutine's `ctx.Err()` check wins; the task ends in `COMPLETED`.

## Relevant Code Locations

| Concern | File | Lines |
|---|---|---|
| State constants | `internal/task/task.go` | 7–18 |
| `ValidTransition` | `internal/task/task.go` | 93–109 |
| State machine tests | `internal/task/task_test.go` | 8–72 |
| Pool execute | `internal/executor/executor.go` | 194–303 |
| Pool executeResume | `internal/executor/executor.go` | 116–185 |
| Dependency wait | `internal/executor/executor.go` | 305–340 |
| `BlockedError` | `internal/executor/claude.go` | 31–37 |
| Question file detection | `internal/executor/claude.go` | 103–110 |
| API state transitions | `internal/api/server.go` | 138–415 |