diff options
Diffstat (limited to 'docs/adr/002-task-state-machine.md')
| -rw-r--r-- | docs/adr/002-task-state-machine.md | 177 |
1 files changed, 177 insertions, 0 deletions
diff --git a/docs/adr/002-task-state-machine.md b/docs/adr/002-task-state-machine.md new file mode 100644 index 0000000..debf7b0 --- /dev/null +++ b/docs/adr/002-task-state-machine.md @@ -0,0 +1,177 @@ +# ADR-002: Task State Machine Design + +## Status +Accepted + +## Context +Claudomator tasks move through a well-defined lifecycle: from creation through +queuing, execution, and final resolution. The lifecycle must handle asynchronous +execution (subprocess), user interaction (review, Q&A), retries, and cancellation. + +## States + +| State | Meaning | +|---|---| +| `PENDING` | Task created; not yet submitted for execution | +| `QUEUED` | Submitted to executor pool; goroutine slot may be waiting | +| `RUNNING` | Claude subprocess is actively executing | +| `READY` | Top-level task completed execution; awaiting user accept/reject | +| `COMPLETED` | Task is fully done (terminal) | +| `FAILED` | Execution error occurred; eligible for retry (terminal if retries exhausted) | +| `TIMED_OUT` | Task exceeded configured timeout; resumable or retryable (terminal if not resumed) | +| `CANCELLED` | Explicitly cancelled by user or API (terminal) | +| `BUDGET_EXCEEDED` | Exceeded `max_budget_usd` (terminal) | +| `BLOCKED` | Agent paused and wrote a question file; awaiting user answer | + +Terminal states with no outgoing transitions: `COMPLETED`, `CANCELLED`, `BUDGET_EXCEEDED`. + +## State Transition Diagram + +``` + ┌─────────┐ + │ PENDING │◄───────────────────────────────┐ + └────┬────┘ │ + POST /run │ POST /reject │ + POST /cancel │ │ + ┌────▼────┐ ┌──────┴─────┐ + ┌────────┤ QUEUED ├─────────────┐ │ READY │ + │ └────┬────┘ │ └──────┬─────┘ + POST /cancel │ │ POST /accept │ + │ pool picks up │ ▼ + ▼ ▼ │ ┌─────────────┐ + ┌──────────┐ ┌─────────┐ │ │ COMPLETED │ + │CANCELLED │◄──┤ RUNNING ├──────────────┘ └─────────────┘ + └──────────┘ └────┬────┘ + │ + ┌───────────────┼───────────────────┬───────────────┐ + │ │ │ │ + ▼ ▼ ▼ ▼ + ┌──────────┐ ┌──────────────┐ ┌─────────────┐ ┌─────────┐ + │ FAILED │ │ TIMED_OUT │ │ BUDGET │ │ BLOCKED │ + └────┬─────┘ └──────┬───────┘ │ _EXCEEDED │ └────┬────┘ + │ │ └─────────────┘ │ + retry │ resume/ │ POST /answer + │ retry │ │ + └────────┬───────┘ │ + ▼ │ + QUEUED ◄────────────────────────────────────┘ +``` + +### Transition Table + +| From | To | Trigger | +|---|---|---| +| `PENDING` | `QUEUED` | `POST /api/tasks/{id}/run` | +| `PENDING` | `CANCELLED` | `POST /api/tasks/{id}/cancel` | +| `QUEUED` | `RUNNING` | Pool goroutine starts execution | +| `QUEUED` | `CANCELLED` | `POST /api/tasks/{id}/cancel` | +| `RUNNING` | `READY` | Runner exits 0, no question file, top-level task (`parent_task_id == ""`) | +| `RUNNING` | `COMPLETED` | Runner exits 0, no question file, subtask (`parent_task_id != ""`) | +| `RUNNING` | `FAILED` | Runner exits non-zero or stream signals `is_error: true` | +| `RUNNING` | `TIMED_OUT` | Context deadline exceeded (`context.DeadlineExceeded`) | +| `RUNNING` | `CANCELLED` | Context cancelled (`context.Canceled`) | +| `RUNNING` | `BUDGET_EXCEEDED` | `--max-budget-usd` exceeded (signalled by runner) | +| `RUNNING` | `BLOCKED` | Runner exits 0 but left a `question.json` file in log dir | +| `READY` | `COMPLETED` | `POST /api/tasks/{id}/accept` | +| `READY` | `PENDING` | `POST /api/tasks/{id}/reject` (with optional comment) | +| `FAILED` | `QUEUED` | Retry (manual re-run via `POST /api/tasks/{id}/run`) | +| `TIMED_OUT` | `QUEUED` | `POST /api/tasks/{id}/resume` (resumes with session ID) | +| `BLOCKED` | `QUEUED` | `POST /api/tasks/{id}/answer` (resumes with user answer) | + +## Implementation + +**Validation:** `task.ValidTransition(from, to State) bool` +(`internal/task/task.go:93`) — called by API handlers before every state change. + +**State writes:** `storage.DB.UpdateTaskState(id, state)` — single source of +write; called by both API handlers and the executor pool. + +**Execution outcome → state mapping** (in `executor.Pool.execute` and `executeResume`): + +``` +runner.Run() returns nil AND parent_task_id == "" → READY +runner.Run() returns nil AND parent_task_id != "" → COMPLETED +runner.Run() returns *BlockedError → BLOCKED (question stored) +ctx.Err() == DeadlineExceeded → TIMED_OUT +ctx.Err() == Canceled → CANCELLED +any other error → FAILED +``` + +## Key Invariants + +1. **`READY` is top-level only.** Subtasks (tasks with `parent_task_id != ""`) skip + `READY` and go directly to `COMPLETED` on success. This allows parent review + flows without requiring per-subtask acknowledgement. + +2. **`BLOCKED` requires a `session_id`.** The executor persists `session_id` in + the execution record at start time. `POST /answer` retrieves it via + `GetLatestExecution` to resume the correct Claude session. + +3. **`DELETE` is blocked for `RUNNING` and `QUEUED` tasks.** (`server.go:127`) + +4. **`CANCELLED` is reachable from `PENDING`, `QUEUED`, and `RUNNING`.** For + `RUNNING` tasks, `pool.Cancel(taskID)` sends a context cancellation signal; + the state transition happens asynchronously when the goroutine exits. For + `PENDING`/`QUEUED` tasks, `UpdateTaskState` is called directly. + +5. **Dependency waiting.** Tasks with `depends_on` stay conceptually in `QUEUED` + (goroutine running but not executing) while `waitForDependencies` polls until + all dependencies reach `COMPLETED`. If any dependency reaches a terminal + failure state, the waiting task transitions to `FAILED`. + +6. **Retry re-enters at `QUEUED`.** `FAILED → QUEUED` and `TIMED_OUT → QUEUED` + are the only back-edges. Retry is manual (caller must call `/run` again); the + `RetryConfig.MaxAttempts` field exists but enforcement is left to callers. + +## Side Effects on Transition + +| Transition | Side effects | +|---|---| +| → `QUEUED` | `pool.Submit()` called; execution record created in DB with log paths | +| → `RUNNING` | `store.UpdateTaskState`; cancel func registered in `pool.cancels` | +| → `BLOCKED` | `store.UpdateTaskQuestion(taskID, questionJSON)`; session_id in execution | +| → `READY/COMPLETED/FAILED/TIMED_OUT/CANCELLED/BUDGET_EXCEEDED` | Execution end time set; `store.UpdateExecution`; result broadcast via WebSocket (`pool.resultCh → forwardResults → hub.Broadcast`) | +| `READY → PENDING` (reject) | `store.RejectTask(id, comment)` — stores `rejection_comment` | +| `BLOCKED → QUEUED` (answer) | `store.UpdateTaskQuestion(taskID, "")` clears question; new `storage.Execution` created with `ResumeSessionID` + `ResumeAnswer` | + +## WebSocket Events + +Task lifecycle changes produce WebSocket broadcasts to all connected clients: + +- `task_completed` — emitted on any terminal or quasi-terminal transition (READY, + COMPLETED, FAILED, TIMED_OUT, CANCELLED, BUDGET_EXCEEDED, BLOCKED) +- `task_question` — emitted by `BroadcastQuestion` when an agent uses + `AskUserQuestion` (currently unused in the file-based flow; the file-based + mechanism is the primary BLOCKED path) + +## Known Limitations and Edge Cases + +- **`BUDGET_EXCEEDED` transition.** `BUDGET_EXCEEDED` appears in `terminalFailureStates` + (used by `waitForDependencies`) but has no outgoing transitions in `ValidTransition`, + making it permanently terminal. There is no `/resume` endpoint for it. + +- **Retry enforcement.** `RetryConfig.MaxAttempts` is stored but not enforced by + the pool. The API allows unlimited manual retries via `POST /run` from `FAILED`. + +- **`TIMED_OUT` resumability.** Only `POST /api/tasks/{id}/resume` resumes timed-out + tasks. Unlike BLOCKED, there is no question/answer — the resume message is + hardcoded: _"Your previous execution timed out. Please continue where you left off."_ + +- **Concurrent cancellation race.** If a task transitions `RUNNING → COMPLETED` + and `POST /cancel` is called in the same window, `pool.Cancel()` may return + `true` (cancel func still registered) even though the goroutine is finishing. + The goroutine's `ctx.Err()` check wins; the task ends in `COMPLETED`. + +## Relevant Code Locations + +| Concern | File | Lines | +|---|---|---| +| State constants | `internal/task/task.go` | 7–18 | +| `ValidTransition` | `internal/task/task.go` | 93–109 | +| State machine tests | `internal/task/task_test.go` | 8–72 | +| Pool execute | `internal/executor/executor.go` | 194–303 | +| Pool executeResume | `internal/executor/executor.go` | 116–185 | +| Dependency wait | `internal/executor/executor.go` | 305–340 | +| `BlockedError` | `internal/executor/claude.go` | 31–37 | +| Question file detection | `internal/executor/claude.go` | 103–110 | +| API state transitions | `internal/api/server.go` | 138–415 | |
