# Task Checker Agent and Story Ship Gate — Design Spec

**Date:** 2026-04-04
**Status:** Approved

---

## Goal

Reduce per-task human review burden by running an independent checker agent after every top-level task completes. If the checker passes, the task auto-accepts. Human attention is required only when the checker flags a problem. Stories accumulate completed tasks until all are done, then surface a single "Ship" action for human approval before deploy.

---

## Context

The current flow requires a human to manually accept every READY task. For stories with many tasks this is friction: the human has to review each task before `checkStoryCompletion` can fire and ship the story. In addition, `checkStoryCompletion` currently triggers deploy automatically, with no human gate at the story level.

ADR-007 describes a post-deploy validation agent (runs after merge, verifies the live deployment). This spec adds a pre-ship, per-task checker that is independent of the implementor and runs asynchronously.

---

## Design

### Two-tier validation

| Tier | When | Author | Gate |
|---|---|---|---|
| **Checker** (this spec) | After task → READY | Different agent from implementor | Auto-accepts on pass; leaves READY + attaches report on fail |
| **Post-deploy validation** (existing) | After story → DEPLOYED | Separate validation task | Story → REVIEW_READY or NEEDS_FIX |

### Full story flow after this change

```
Task runs → READY
  → checker task spawned (async, independent)
  → checker passes: task → COMPLETED (silent)
  → checker fails:  task stays READY, checker_report attached

All story top-level tasks COMPLETED
  → story → SHIPPABLE (human gate — no auto-deploy)

Human clicks "Ship" on SHIPPABLE story
  → merge branch to main + run deploy script → DEPLOYED
  → post-deploy validation task created → VALIDATING
  → REVIEW_READY | NEEDS_FIX
```
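
The SHIPPABLE condition in the flow above can be sketched as a pure predicate over a story's tasks. This is a minimal illustration, not the project's code: the `Task` struct, the state constants, and `storyShippable` are hypothetical names, and treating a story with no top-level tasks as not shippable is an assumption.

```go
package main

import "fmt"

// State names mirror the flow above; the Go identifiers are hypothetical.
const (
	StateReady     = "READY"
	StateCompleted = "COMPLETED"
)

// Task holds only the fields this sketch needs.
type Task struct {
	ID           string
	ParentTaskID string // non-empty for subtasks
	State        string
}

// storyShippable reports whether every top-level task is COMPLETED.
// Assumption: a story with no top-level tasks is not shippable.
func storyShippable(tasks []Task) bool {
	topLevel := 0
	for _, t := range tasks {
		if t.ParentTaskID != "" {
			continue // only top-level tasks gate the story
		}
		topLevel++
		if t.State != StateCompleted {
			return false
		}
	}
	return topLevel > 0
}

func main() {
	tasks := []Task{
		{ID: "t1", State: StateCompleted},
		{ID: "t2", State: StateReady}, // checker failed or has not finished
	}
	fmt.Println(storyShippable(tasks)) // false
	tasks[1].State = StateCompleted
	fmt.Println(storyShippable(tasks)) // true
}
```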

---

## Data Model

### `tasks` table — three new columns

| Column | Type | Purpose |
|---|---|---|
| `acceptance_criteria` | `TEXT` | Criteria for the checker. Empty = use task instructions as spec. |
| `checker_for_task_id` | `TEXT` | Set on checker tasks. Points to the task being checked. |
| `checker_report` | `TEXT` | Populated when checker fails. Shown in the UI on the READY task. |

Checker tasks have `checker_for_task_id` set and no `story_id`. They do not appear in story task lists and do not affect `checkStoryCompletion`.
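
As a rough sketch of how the three columns might surface in Go, the struct below adds matching fields and shows why checker tasks drop out of story task lists. The `Task` struct here is a trimmed-down stand-in, and `IsChecker` and `storyTasks` are hypothetical helpers, not existing functions.

```go
package main

import "fmt"

// Task is a hypothetical, trimmed-down stand-in for the real task struct.
type Task struct {
	ID                 string
	StoryID            string
	AcceptanceCriteria string // acceptance_criteria column
	CheckerForTaskID   string // checker_for_task_id column
	CheckerReport      string // checker_report column
}

// IsChecker reports whether t is a checker task.
func (t Task) IsChecker() bool { return t.CheckerForTaskID != "" }

// storyTasks returns the tasks that belong to storyID. Checker tasks carry
// no story_id, so the StoryID match already excludes them; the IsChecker
// test is a defensive extra.
func storyTasks(all []Task, storyID string) []Task {
	var out []Task
	for _, t := range all {
		if t.StoryID == storyID && !t.IsChecker() {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	all := []Task{
		{ID: "t1", StoryID: "s1"},
		{ID: "c1", CheckerForTaskID: "t1"}, // checker for t1, no story
	}
	for _, t := range storyTasks(all, "s1") {
		fmt.Println(t.ID)
	}
}
```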

---

## Acceptance Criteria Source

**Story tasks:** The elaborator generates `acceptance_criteria` alongside each task's instructions. Example:

```
Run the full test suite and verify all tests pass.
Confirm the /api/tasks endpoint returns repository_url in the response body.
```

**Standalone tasks (no story):** `acceptance_criteria` is empty. The checker uses the task's own `instructions` field as the specification.
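
The fallback rule amounts to a single branch. A minimal sketch, where `checkerCriteria` is a hypothetical helper name:

```go
package main

import "fmt"

// checkerCriteria returns the text the checker validates against:
// the acceptance criteria when present, otherwise the task instructions.
func checkerCriteria(acceptanceCriteria, instructions string) string {
	if acceptanceCriteria == "" {
		return instructions
	}
	return acceptanceCriteria
}

func main() {
	// Standalone task: no criteria, so the instructions act as the spec.
	fmt.Println(checkerCriteria("", "task instructions"))
	// Story task: the elaborator-generated criteria win.
	fmt.Println(checkerCriteria("Run the full test suite and verify all tests pass.", "task instructions"))
}
```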

---

## Checker Task

### Creation

The executor pool spawns a checker when a top-level task (no `parent_task_id`, no `checker_for_task_id`) transitions to READY. The pool constructs and submits the checker task immediately:

```
Name:               "Check: <original task name>"
checker_for_task_id: <original task ID>
story_id:           (empty — checker is not a story task)
repository_url:     same as the checked task
agent.type:         claude
agent.instructions: (see below)
agent.max_budget_usd: 0.50
timeout:            10m
retry.max_attempts: 1
```

**Do not spawn a checker if:**
- `t.ParentTaskID != ""` (subtasks go directly to COMPLETED, never READY)
- `t.CheckerForTaskID != ""` (never check a checker)
- A checker task already exists for this task (query by `checker_for_task_id` before spawning)
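
The spawn rules and the checker configuration above could be sketched as follows. All type and function names (`Task`, `CheckerTask`, `shouldSpawnChecker`, `newCheckerTask`) are hypothetical, and the `checkerExists` flag stands in for the result of the `checker_for_task_id` lookup.

```go
package main

import (
	"fmt"
	"time"
)

// Task holds only the fields this sketch needs.
type Task struct {
	ID, Name, StoryID, ParentTaskID, CheckerForTaskID, RepositoryURL string
}

// CheckerTask mirrors the configuration listed in the spec.
type CheckerTask struct {
	Name             string
	CheckerForTaskID string
	StoryID          string
	RepositoryURL    string
	AgentType        string
	MaxBudgetUSD     float64
	Timeout          time.Duration
	MaxAttempts      int
}

// shouldSpawnChecker applies the three "do not spawn" rules.
func shouldSpawnChecker(t Task, checkerExists bool) bool {
	if t.ParentTaskID != "" {
		return false // subtasks never reach READY
	}
	if t.CheckerForTaskID != "" {
		return false // never check a checker
	}
	return !checkerExists
}

// newCheckerTask builds the checker for a task that just became READY.
func newCheckerTask(t Task) CheckerTask {
	return CheckerTask{
		Name:             "Check: " + t.Name,
		CheckerForTaskID: t.ID,
		StoryID:          "", // checker is not a story task
		RepositoryURL:    t.RepositoryURL,
		AgentType:        "claude",
		MaxBudgetUSD:     0.50,
		Timeout:          10 * time.Minute,
		MaxAttempts:      1,
	}
}

func main() {
	t := Task{ID: "t1", Name: "Add repository_url to task struct", StoryID: "s1"}
	if shouldSpawnChecker(t, false) {
		c := newCheckerTask(t)
		fmt.Println(c.Name, c.Timeout, c.MaxBudgetUSD)
	}
}
```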

### Instructions template

```
You are validating a completed task. Do not make any changes to the code or repository.

Task: <name>
Instructions given to the implementor:
<task instructions>

Acceptance criteria:
<acceptance_criteria, or task instructions if acceptance_criteria is empty>

Steps:
1. Clone the repository and review the changes made.
2. Verify each acceptance criterion is met. Run tests or make HTTP requests as needed.
3. If all criteria are satisfied, exit normally (success).
4. If any criterion is not met, first write a brief summary of what failed, then use the Bash tool to exit with a non-zero code:
   bash -c "exit 1"
```
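
One way to fill in the placeholders is Go's standard `text/template` package. The spec does not say how the prompt is rendered, so this is only a sketch: `checkerPrompt`, `promptData`, and `renderCheckerPrompt` are hypothetical names, and the "Steps" section is left out of the constant for brevity.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// checkerPrompt is an abbreviated copy of the template above (Steps omitted).
const checkerPrompt = `You are validating a completed task. Do not make any changes to the code or repository.

Task: {{.Name}}
Instructions given to the implementor:
{{.Instructions}}

Acceptance criteria:
{{.Criteria}}
`

// promptData carries the values substituted into the template.
// Criteria is expected to already hold the fallback to the instructions.
type promptData struct {
	Name, Instructions, Criteria string
}

// renderCheckerPrompt fills the template and returns the prompt text.
func renderCheckerPrompt(d promptData) (string, error) {
	tmpl, err := template.New("checker").Parse(checkerPrompt)
	if err != nil {
		return "", err
	}
	var b strings.Builder
	if err := tmpl.Execute(&b, d); err != nil {
		return "", err
	}
	return b.String(), nil
}

func main() {
	out, err := renderCheckerPrompt(promptData{
		Name:         "Add repository_url to task struct",
		Instructions: "...",
		Criteria:     "Run go test ./... and verify all tests pass.",
	})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Print(out)
}
```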

### Completion handling

In `executor.Pool.handleRunResult`, after determining the task outcome, check `t.CheckerForTaskID`:

- **Checker succeeded (exit 0):** Call `store.UpdateTaskState(t.CheckerForTaskID, task.StateCompleted)`. Then call `checkStoryCompletion` for the checked task's story (if any).
- **Checker failed (exit non-0 or error):** Extract the execution summary and call `store.UpdateTaskCheckerReport(t.CheckerForTaskID, summary)`. The checked task stays READY; the report surfaces in the UI.

The checker task itself always resolves to COMPLETED or FAILED through the normal state machine — no special states needed.
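
The two branches can be sketched against a small storage interface. This is an illustration under stated assumptions: the `Store` interface, the in-memory `memStore`, the `handleCheckerResult` function, and the `onAccepted` callback (standing in for the `checkStoryCompletion` call) are all hypothetical, and the real method signatures may differ.

```go
package main

import "fmt"

const StateCompleted = "COMPLETED"

// Task holds only the fields this sketch needs.
type Task struct {
	ID, CheckerForTaskID string
}

// Store is the assumed subset of storage methods used here.
type Store interface {
	UpdateTaskState(id, state string) error
	UpdateTaskCheckerReport(id, report string) error
}

// memStore is an in-memory stand-in for the real database.
type memStore struct {
	states  map[string]string
	reports map[string]string
}

func (m *memStore) UpdateTaskState(id, state string) error {
	m.states[id] = state
	return nil
}

func (m *memStore) UpdateTaskCheckerReport(id, report string) error {
	m.reports[id] = report
	return nil
}

// handleCheckerResult applies a finished checker's outcome to the checked task.
func handleCheckerResult(s Store, checker Task, passed bool, summary string, onAccepted func(taskID string)) error {
	checkedID := checker.CheckerForTaskID
	if checkedID == "" {
		return nil // not a checker task; nothing to do
	}
	if !passed {
		// The checked task stays READY; only the report is attached.
		return s.UpdateTaskCheckerReport(checkedID, summary)
	}
	if err := s.UpdateTaskState(checkedID, StateCompleted); err != nil {
		return err
	}
	onAccepted(checkedID) // stand-in for checkStoryCompletion
	return nil
}

func main() {
	s := &memStore{states: map[string]string{"t1": "READY", "t2": "READY"}, reports: map[string]string{}}
	notify := func(id string) { fmt.Println("check story completion for", id) }

	_ = handleCheckerResult(s, Task{ID: "c1", CheckerForTaskID: "t1"}, true, "", notify)
	_ = handleCheckerResult(s, Task{ID: "c2", CheckerForTaskID: "t2"}, false, "tests failed", notify)
	fmt.Println(s.states["t1"], s.states["t2"], s.reports["t2"])
}
```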

---

## Story Ship Gate

### Remove auto-deploy from `checkStoryCompletion`

Current code at the end of `checkStoryCompletion`:
```go
go p.triggerStoryDeploy(ctx, storyID)  // REMOVE THIS
```

After this change, `checkStoryCompletion` only transitions the story to SHIPPABLE. Deploy is triggered explicitly by the human.

### New endpoint

`POST /api/stories/{id}/ship`

- Verifies story state is SHIPPABLE. Returns 409 otherwise.
- Calls `p.triggerStoryDeploy(ctx, storyID)` (existing function, no changes needed).
- Returns 202 Accepted immediately; deploy runs async.
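
A minimal handler sketch for this endpoint, assuming Go 1.22 or later for method and wildcard patterns in `net/http`. The `lookupState` and `deploy` hooks are hypothetical stand-ins for the story lookup and `triggerStoryDeploy`, and the 404 for an unknown story is an assumption the spec does not state.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

const StateShippable = "SHIPPABLE"

// shipStatus maps a story's current state to the response code.
func shipStatus(state string) int {
	if state != StateShippable {
		return http.StatusConflict // 409: story is not SHIPPABLE
	}
	return http.StatusAccepted // 202: deploy started asynchronously
}

// shipHandler builds the POST /api/stories/{id}/ship handler.
func shipHandler(lookupState func(id string) (string, bool), deploy func(id string)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		id := r.PathValue("id")
		state, ok := lookupState(id)
		if !ok {
			http.Error(w, "story not found", http.StatusNotFound) // assumption
			return
		}
		code := shipStatus(state)
		if code == http.StatusAccepted {
			go deploy(id) // return immediately; deploy runs in the background
		}
		w.WriteHeader(code)
	}
}

func main() {
	states := map[string]string{"s1": StateShippable, "s2": "DEPLOYED"}
	mux := http.NewServeMux()
	mux.Handle("POST /api/stories/{id}/ship", shipHandler(
		func(id string) (string, bool) { s, ok := states[id]; return s, ok },
		func(id string) {}, // no-op deploy for the demo
	))
	for _, id := range []string{"s1", "s2"} {
		rec := httptest.NewRecorder()
		req := httptest.NewRequest(http.MethodPost, "/api/stories/"+id+"/ship", nil)
		mux.ServeHTTP(rec, req)
		fmt.Println(id, rec.Code)
	}
}
```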

### UI

The stories panel shows a **"Ship"** button on any story in SHIPPABLE state. No other UI changes required for the story panel. The button calls `POST /api/stories/{id}/ship`.

---

## Task Card UI

When `checker_report` is non-empty on a READY task:

- Show a warning badge on the task card (e.g. "⚠ Checker failed")
- Expand the card or side panel to show the report text
- Human can still accept or reject the task manually regardless of checker result

When a checker is pending/running for a READY task:

- Show a subtle indicator (e.g. "Checking…") — optional, nice to have

---

## Elaborator Changes

The story elaboration endpoint currently returns a list of tasks with `name` and `instructions`. Add `acceptance_criteria` to each task in the elaborated output:

```json
{
  "tasks": [
    {
      "name": "Add repository_url to task struct",
      "instructions": "...",
      "acceptance_criteria": "Run go test ./... and verify all tests pass. Confirm Task struct has RepositoryURL field with correct json tag."
    }
  ]
}
```

The elaborator prompt should instruct the LLM to write acceptance criteria that are concrete and verifiable by a separate agent: specific commands to run, specific API responses to check, specific file contents to verify. Vague criteria like "code looks good" are not acceptable.
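
On the consuming side, the extended output could be decoded with `encoding/json` as sketched below. The struct and function names are hypothetical, and rejecting an elaborated task with empty criteria is an assumption. Whether criteria are concrete enough cannot be checked mechanically, so this only guards against a missing field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ElaboratedTask matches one entry of the "tasks" array shown above.
type ElaboratedTask struct {
	Name               string `json:"name"`
	Instructions       string `json:"instructions"`
	AcceptanceCriteria string `json:"acceptance_criteria"`
}

// Elaboration is the top-level elaborator response.
type Elaboration struct {
	Tasks []ElaboratedTask `json:"tasks"`
}

// parseElaboration decodes the response and rejects tasks without criteria.
func parseElaboration(data []byte) (Elaboration, error) {
	var e Elaboration
	if err := json.Unmarshal(data, &e); err != nil {
		return Elaboration{}, err
	}
	for _, t := range e.Tasks {
		if t.AcceptanceCriteria == "" {
			return Elaboration{}, fmt.Errorf("task %q has no acceptance_criteria", t.Name)
		}
	}
	return e, nil
}

func main() {
	raw := []byte(`{"tasks":[{"name":"Add repository_url to task struct","instructions":"...","acceptance_criteria":"Run go test ./... and verify all tests pass."}]}`)
	e, err := parseElaboration(raw)
	if err != nil {
		fmt.Println("invalid elaboration:", err)
		return
	}
	fmt.Println(e.Tasks[0].Name, "->", e.Tasks[0].AcceptanceCriteria)
}
```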

---

## Storage

New methods on `storage.DB`:

```go
// UpdateTaskCheckerReport sets the checker_report field on a task.
func (s *DB) UpdateTaskCheckerReport(id, report string) error

// GetCheckerTask returns the checker task for a given task ID, or nil if none exists.
func (s *DB) GetCheckerTask(checkedTaskID string) (*task.Task, error)
```

Migration: `ALTER TABLE tasks ADD COLUMN acceptance_criteria TEXT NOT NULL DEFAULT ''`, same for `checker_for_task_id` and `checker_report`.
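
Spelled out, the migration is three statements of the same shape. The sketch below only generates them; how they are executed (driver, migration runner) is not specified here, and `checkerMigrations` is a hypothetical helper name.

```go
package main

import "fmt"

// checkerMigrations returns one ALTER TABLE statement per new column.
func checkerMigrations() []string {
	cols := []string{"acceptance_criteria", "checker_for_task_id", "checker_report"}
	stmts := make([]string, 0, len(cols))
	for _, c := range cols {
		stmts = append(stmts, "ALTER TABLE tasks ADD COLUMN "+c+" TEXT NOT NULL DEFAULT ''")
	}
	return stmts
}

func main() {
	for _, s := range checkerMigrations() {
		fmt.Println(s)
	}
}
```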

---

## What This Does Not Change

- Post-deploy validation flow (DEPLOYED → VALIDATING → REVIEW_READY/NEEDS_FIX) is unchanged.
- Subtask handling is unchanged — subtasks never go READY, so they never get checkers.
- The `handleAcceptTask` endpoint remains — humans can still manually accept READY tasks.
- The `handleRejectTask` endpoint remains — humans can still manually reject.
- Checker tasks are subject to normal rate-limiting, retry, and budget enforcement.

---

## Out of Scope

- Checker result overriding the human (human can always accept/reject regardless)
- Parallel checker runs (one checker per task, no re-run unless task is re-run)
- Configurable checker agent type per project
- Checker budget/timeout configuration beyond the defaults above