summaryrefslogtreecommitdiff
path: root/docs/adr
diff options
context:
space:
mode:
Diffstat (limited to 'docs/adr')
-rw-r--r--docs/adr/002-task-state-machine.md177
1 files changed, 177 insertions, 0 deletions
diff --git a/docs/adr/002-task-state-machine.md b/docs/adr/002-task-state-machine.md
new file mode 100644
index 0000000..debf7b0
--- /dev/null
+++ b/docs/adr/002-task-state-machine.md
@@ -0,0 +1,177 @@
+# ADR-002: Task State Machine Design
+
+## Status
+Accepted
+
+## Context
+Claudomator tasks move through a well-defined lifecycle: from creation through
+queuing, execution, and final resolution. The lifecycle must handle asynchronous
+execution (subprocess), user interaction (review, Q&A), retries, and cancellation.
+
+## States
+
+| State | Meaning |
+|---|---|
+| `PENDING` | Task created; not yet submitted for execution |
+| `QUEUED` | Submitted to executor pool; goroutine slot may be waiting |
+| `RUNNING` | Claude subprocess is actively executing |
+| `READY` | Top-level task completed execution; awaiting user accept/reject |
+| `COMPLETED` | Task is fully done (terminal) |
+| `FAILED` | Execution error occurred; eligible for retry (terminal if retries exhausted) |
+| `TIMED_OUT` | Task exceeded configured timeout; resumable or retryable (terminal if not resumed) |
+| `CANCELLED` | Explicitly cancelled by user or API (terminal) |
+| `BUDGET_EXCEEDED` | Exceeded `max_budget_usd` (terminal) |
+| `BLOCKED` | Agent paused and wrote a question file; awaiting user answer |
+
+Terminal states with no outgoing transitions: `COMPLETED`, `CANCELLED`, `BUDGET_EXCEEDED`.
+
+## State Transition Diagram
+
+```
+ ┌─────────┐
+ │ PENDING │◄───────────────────────────────┐
+ └────┬────┘ │
+ POST /run │ POST /reject │
+ POST /cancel │ │
+ ┌────▼────┐ ┌──────┴─────┐
+ ┌────────┤ QUEUED ├─────────────┐ │ READY │
+ │ └────┬────┘ │ └──────┬─────┘
+ POST /cancel │ │ POST /accept │
+ │ pool picks up │ ▼
+ ▼ ▼ │ ┌─────────────┐
+ ┌──────────┐ ┌─────────┐ │ │ COMPLETED │
+ │CANCELLED │◄──┤ RUNNING ├──────────────┘ └─────────────┘
+ └──────────┘ └────┬────┘
+ │
+ ┌───────────────┼───────────────────┬───────────────┐
+ │ │ │ │
+ ▼ ▼ ▼ ▼
+ ┌──────────┐ ┌──────────────┐ ┌─────────────┐ ┌─────────┐
+ │ FAILED │ │ TIMED_OUT │ │ BUDGET │ │ BLOCKED │
+ └────┬─────┘ └──────┬───────┘ │ _EXCEEDED │ └────┬────┘
+ │ │ └─────────────┘ │
+ retry │ resume/ │ POST /answer
+ │ retry │ │
+ └────────┬───────┘ │
+ ▼ │
+ QUEUED ◄────────────────────────────────────┘
+```
+
+### Transition Table
+
+| From | To | Trigger |
+|---|---|---|
+| `PENDING` | `QUEUED` | `POST /api/tasks/{id}/run` |
+| `PENDING` | `CANCELLED` | `POST /api/tasks/{id}/cancel` |
+| `QUEUED` | `RUNNING` | Pool goroutine starts execution |
+| `QUEUED` | `CANCELLED` | `POST /api/tasks/{id}/cancel` |
+| `RUNNING` | `READY` | Runner exits 0, no question file, top-level task (`parent_task_id == ""`) |
+| `RUNNING` | `COMPLETED` | Runner exits 0, no question file, subtask (`parent_task_id != ""`) |
+| `RUNNING` | `FAILED` | Runner exits non-zero or stream signals `is_error: true` |
+| `RUNNING` | `TIMED_OUT` | Context deadline exceeded (`context.DeadlineExceeded`) |
+| `RUNNING` | `CANCELLED` | Context cancelled (`context.Canceled`) |
+| `RUNNING` | `BUDGET_EXCEEDED` | `--max-budget-usd` exceeded (signalled by runner) |
+| `RUNNING` | `BLOCKED` | Runner exits 0 but left a `question.json` file in log dir |
+| `READY` | `COMPLETED` | `POST /api/tasks/{id}/accept` |
+| `READY` | `PENDING` | `POST /api/tasks/{id}/reject` (with optional comment) |
+| `FAILED` | `QUEUED` | Retry (manual re-run via `POST /api/tasks/{id}/run`) |
+| `TIMED_OUT` | `QUEUED` | `POST /api/tasks/{id}/resume` (resumes with session ID) |
+| `BLOCKED` | `QUEUED` | `POST /api/tasks/{id}/answer` (resumes with user answer) |
+
+## Implementation
+
+**Validation:** `task.ValidTransition(from, to State) bool`
+(`internal/task/task.go:93`) — called by API handlers before every state change.
+
+**State writes:** `storage.DB.UpdateTaskState(id, state)` — single source of
+write; called by both API handlers and the executor pool.
+
+**Execution outcome → state mapping** (in `executor.Pool.execute` and `executeResume`):
+
+```
+runner.Run() returns nil AND parent_task_id == "" → READY
+runner.Run() returns nil AND parent_task_id != "" → COMPLETED
+runner.Run() returns *BlockedError → BLOCKED (question stored)
+ctx.Err() == DeadlineExceeded → TIMED_OUT
+ctx.Err() == Canceled → CANCELLED
+any other error → FAILED
+```
+
+## Key Invariants
+
+1. **`READY` is top-level only.** Subtasks (tasks with `parent_task_id != ""`) skip
+ `READY` and go directly to `COMPLETED` on success. This allows parent review
+ flows without requiring per-subtask acknowledgement.
+
+2. **`BLOCKED` requires a `session_id`.** The executor persists `session_id` in
+ the execution record at start time. `POST /answer` retrieves it via
+ `GetLatestExecution` to resume the correct Claude session.
+
+3. **`DELETE` is blocked for `RUNNING` and `QUEUED` tasks.** (`server.go:127`)
+
+4. **`CANCELLED` is reachable from `PENDING`, `QUEUED`, and `RUNNING`.** For
+ `RUNNING` tasks, `pool.Cancel(taskID)` sends a context cancellation signal;
+ the state transition happens asynchronously when the goroutine exits. For
+ `PENDING`/`QUEUED` tasks, `UpdateTaskState` is called directly.
+
+5. **Dependency waiting.** Tasks with `depends_on` stay conceptually in `QUEUED`
+ (goroutine running but not executing) while `waitForDependencies` polls until
+ all dependencies reach `COMPLETED`. If any dependency reaches a terminal
+ failure state, the waiting task transitions to `FAILED`.
+
+6. **Retry re-enters at `QUEUED`.** `FAILED → QUEUED` and `TIMED_OUT → QUEUED`
+ are the only back-edges. Retry is manual (caller must call `/run` again); the
+ `RetryConfig.MaxAttempts` field exists but enforcement is left to callers.
+
+## Side Effects on Transition
+
+| Transition | Side effects |
+|---|---|
+| → `QUEUED` | `pool.Submit()` called; execution record created in DB with log paths |
+| → `RUNNING` | `store.UpdateTaskState`; cancel func registered in `pool.cancels` |
+| → `BLOCKED` | `store.UpdateTaskQuestion(taskID, questionJSON)`; session_id in execution |
+| → `READY/COMPLETED/FAILED/TIMED_OUT/CANCELLED/BUDGET_EXCEEDED` | Execution end time set; `store.UpdateExecution`; result broadcast via WebSocket (`pool.resultCh → forwardResults → hub.Broadcast`) |
+| `READY → PENDING` (reject) | `store.RejectTask(id, comment)` — stores `rejection_comment` |
+| `BLOCKED → QUEUED` (answer) | `store.UpdateTaskQuestion(taskID, "")` clears question; new `storage.Execution` created with `ResumeSessionID` + `ResumeAnswer` |
+
+## WebSocket Events
+
+Task lifecycle changes produce WebSocket broadcasts to all connected clients:
+
+- `task_completed` — emitted on any terminal or quasi-terminal transition (READY,
+ COMPLETED, FAILED, TIMED_OUT, CANCELLED, BUDGET_EXCEEDED, BLOCKED)
+- `task_question` — emitted by `BroadcastQuestion` when an agent uses
+ `AskUserQuestion` (currently unused in the file-based flow; the file-based
+ mechanism is the primary BLOCKED path)
+
+## Known Limitations and Edge Cases
+
+- **`BUDGET_EXCEEDED` transition.** `BUDGET_EXCEEDED` appears in `terminalFailureStates`
+ (used by `waitForDependencies`) but has no outgoing transitions in `ValidTransition`,
+ making it permanently terminal. There is no `/resume` endpoint for it.
+
+- **Retry enforcement.** `RetryConfig.MaxAttempts` is stored but not enforced by
+ the pool. The API allows unlimited manual retries via `POST /run` from `FAILED`.
+
+- **`TIMED_OUT` resumability.** Only `POST /api/tasks/{id}/resume` resumes timed-out
+ tasks. Unlike BLOCKED, there is no question/answer — the resume message is
+ hardcoded: _"Your previous execution timed out. Please continue where you left off."_
+
+- **Concurrent cancellation race.** If a task transitions `RUNNING → COMPLETED`
+ and `POST /cancel` is called in the same window, `pool.Cancel()` may return
+ `true` (cancel func still registered) even though the goroutine is finishing.
+ The goroutine's `ctx.Err()` check wins; the task ends in `COMPLETED`.
+
+## Relevant Code Locations
+
+| Concern | File | Lines |
+|---|---|---|
+| State constants | `internal/task/task.go` | 7–18 |
+| `ValidTransition` | `internal/task/task.go` | 93–109 |
+| State machine tests | `internal/task/task_test.go` | 8–72 |
+| Pool execute | `internal/executor/executor.go` | 194–303 |
+| Pool executeResume | `internal/executor/executor.go` | 116–185 |
+| Dependency wait | `internal/executor/executor.go` | 305–340 |
+| `BlockedError` | `internal/executor/claude.go` | 31–37 |
+| Question file detection | `internal/executor/claude.go` | 103–110 |
+| API state transitions | `internal/api/server.go` | 138–415 |