| author | Peter Stone <thepeterstone@gmail.com> | 2026-04-04 07:40:45 +0000 |
|---|---|---|
| committer | Peter Stone <thepeterstone@gmail.com> | 2026-04-04 07:40:45 +0000 |
| commit | 68e43ca49a3002218ec057296c14f1c02d9f3237 | |
| tree | 2c4624bd2d300b9d539333c518f7660eebfc5221 | |
| parent | 774a33431f7ae8a54082f5bca5db0019d6459a60 | |
docs: task checker agent and story ship gate design spec
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| -rw-r--r-- | docs/superpowers/specs/2026-04-04-task-checker-story-ship.md | 222 |
1 files changed, 222 insertions, 0 deletions
diff --git a/docs/superpowers/specs/2026-04-04-task-checker-story-ship.md b/docs/superpowers/specs/2026-04-04-task-checker-story-ship.md
new file mode 100644
index 0000000..1be2f3c
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-04-task-checker-story-ship.md
@@ -0,0 +1,222 @@
+# Task Checker Agent and Story Ship Gate — Design Spec
+
+**Date:** 2026-04-04
+**Status:** Approved
+
+---
+
+## Goal
+
+Reduce per-task human review burden by running an independent checker agent after every top-level task completes. If the checker passes, the task auto-accepts. Human attention is required only when the checker flags a problem. Stories accumulate completed tasks until all are done, then surface a single "Ship" action for human approval before deploy.
+
+---
+
+## Context
+
+The current flow requires a human to manually accept every READY task. For stories with many tasks this is friction — the human has to review each one before `checkStoryCompletion` can fire and ship the story. Additionally, `checkStoryCompletion` currently auto-triggers deploy with no human gate at the story level.
+
+ADR-007 describes a post-deploy validation agent (runs after merge, verifies the live deployment). This spec adds a pre-ship, per-task checker that is independent of the implementor and runs asynchronously.
+
+---
+
+## Design
+
+### Two-tier validation
+
+| Tier | When | Author | Gate |
+|---|---|---|---|
+| **Checker** (this spec) | After task → READY | Different agent from implementor | Auto-accepts on pass; leaves READY + attaches report on fail |
+| **Post-deploy validation** (existing) | After story → DEPLOYED | Separate validation task | Story → REVIEW_READY or NEEDS_FIX |
+
+### Full story flow after this change
+
+```
+Task runs → READY
+  → checker task spawned (async, independent)
+    → checker passes: task → COMPLETED (silent)
+    → checker fails: task stays READY, checker_report attached
+
+All story top-level tasks COMPLETED
+  → story → SHIPPABLE (human gate — no auto-deploy)
+
+Human clicks "Ship" on SHIPPABLE story
+  → merge branch to main + run deploy script → DEPLOYED
+  → post-deploy validation task created → VALIDATING
+    → REVIEW_READY | NEEDS_FIX
+```
+
+---
+
+## Data Model
+
+### `tasks` table — three new columns
+
+| Column | Type | Purpose |
+|---|---|---|
+| `acceptance_criteria` | `TEXT` | Criteria for the checker. Empty = use task instructions as spec. |
+| `checker_for_task_id` | `TEXT` | Set on checker tasks. Points to the task being checked. |
+| `checker_report` | `TEXT` | Populated when checker fails. Shown in the UI on the READY task. |
+
+Checker tasks have `checker_for_task_id` set and no `story_id`. They do not appear in story task lists and do not affect `checkStoryCompletion`.
+
+---
+
+## Acceptance Criteria Source
+
+**Story tasks:** The elaborator generates `acceptance_criteria` alongside each task's instructions. Example:
+
+```
+Run the full test suite and verify all tests pass.
+Confirm the /api/tasks endpoint returns repository_url in the response body.
+```
+
+**Standalone tasks (no story):** `acceptance_criteria` is empty. The checker uses the task's own `instructions` field as the specification.
+
+---
+
+## Checker Task
+
+### Creation
+
+Spawned by the executor pool when a top-level task (no `parent_task_id`, no `checker_for_task_id`) transitions to READY. The pool constructs and submits a checker task immediately:
+
+```
+Name: "Check: <original task name>"
+checker_for_task_id: <original task ID>
+story_id: (empty — checker is not a story task)
+repository_url: same as the checked task
+agent.type: claude
+agent.instructions: (see below)
+agent.max_budget_usd: 0.50
+timeout: 10m
+retry.max_attempts: 1
+```
+
+**Do not spawn a checker if:**
+- `t.ParentTaskID != ""` (subtasks go directly to COMPLETED, never READY)
+- `t.CheckerForTaskID != ""` (never check a checker)
+- A checker task already exists for this task (query by `checker_for_task_id` before spawning)
+
+### Instructions template
+
+```
+You are validating a completed task. Do not make any changes to the code or repository.
+
+Task: <name>
+Instructions given to the implementor:
+<task instructions>
+
+Acceptance criteria:
+<acceptance_criteria, or task instructions if acceptance_criteria is empty>
+
+Steps:
+1. Clone the repository and review the changes made.
+2. Verify each acceptance criterion is met. Run tests or make HTTP requests as needed.
+3. If all criteria are satisfied, exit normally (success).
+4. If any criterion is not met, use the Bash tool to exit with a non-zero code:
+   bash -c "exit 1"
+   Before exiting, write a brief summary of what failed.
+```
+
+### Completion handling
+
+In `executor.Pool.handleRunResult`, after determining the task outcome, check `t.CheckerForTaskID`:
+
+- **Checker succeeded (exit 0):** Call `store.UpdateTaskState(t.CheckerForTaskID, task.StateCompleted)`. Then call `checkStoryCompletion` for the checked task's story (if any).
+- **Checker failed (exit non-0 or error):** Extract the execution summary and call `store.UpdateTaskCheckerReport(t.CheckerForTaskID, summary)`. The checked task stays READY; the report surfaces in the UI.
+
+The checker task itself always resolves to COMPLETED or FAILED through the normal state machine — no special states needed.
+
+---
+
+## Story Ship Gate
+
+### Remove auto-deploy from `checkStoryCompletion`
+
+Current code at the end of `checkStoryCompletion`:
+```go
+go p.triggerStoryDeploy(ctx, storyID) // REMOVE THIS
+```
+
+After this change, `checkStoryCompletion` only transitions the story to SHIPPABLE. Deploy is triggered explicitly by the human.
+
+### New endpoint
+
+`POST /api/stories/{id}/ship`
+
+- Verifies story state is SHIPPABLE. Returns 409 otherwise.
+- Calls `p.triggerStoryDeploy(ctx, storyID)` (existing function, no changes needed).
+- Returns 202 Accepted immediately; deploy runs async.
+
+### UI
+
+The stories panel shows a **"Ship"** button on any story in SHIPPABLE state. No other UI changes required for the story panel. The button calls `POST /api/stories/{id}/ship`.
+
+---
+
+## Task Card UI
+
+When `checker_report` is non-empty on a READY task:
+
+- Show a warning badge on the task card (e.g. "⚠ Checker failed")
+- Expand the card or side panel to show the report text
+- Human can still accept or reject the task manually regardless of checker result
+
+When a checker is pending/running for a READY task:
+
+- Show a subtle indicator (e.g. "Checking…") — optional, nice to have
+
+---
+
+## Elaborator Changes
+
+The story elaboration endpoint currently returns a list of tasks with `name` and `instructions`. Add `acceptance_criteria` to each task in the elaborated output:
+
+```json
+{
+  "tasks": [
+    {
+      "name": "Add repository_url to task struct",
+      "instructions": "...",
+      "acceptance_criteria": "Run go test ./... and verify all tests pass. Confirm Task struct has RepositoryURL field with correct json tag."
+    }
+  ]
+}
+```
+
+The elaborator prompt should instruct the LLM to write acceptance criteria that are concrete and verifiable by a separate agent: specific commands to run, specific API responses to check, specific file contents to verify. Vague criteria like "code looks good" are not acceptable.
+
+---
+
+## Storage
+
+New methods on `storage.DB`:
+
+```go
+// UpdateTaskCheckerReport sets the checker_report field on a task.
+func (s *DB) UpdateTaskCheckerReport(id, report string) error
+
+// GetCheckerTask returns the checker task for a given task ID, or nil if none exists.
+func (s *DB) GetCheckerTask(checkedTaskID string) (*task.Task, error)
+```
+
+Migration: `ALTER TABLE tasks ADD COLUMN acceptance_criteria TEXT NOT NULL DEFAULT ''`, same for `checker_for_task_id` and `checker_report`.
+
+---
+
+## What This Does Not Change
+
+- Post-deploy validation flow (DEPLOYED → VALIDATING → REVIEW_READY/NEEDS_FIX) is unchanged.
+- Subtask handling is unchanged — subtasks never go READY, so they never get checkers.
+- The `handleAcceptTask` endpoint remains — humans can still manually accept READY tasks.
+- The `handleRejectTask` endpoint remains — humans can still manually reject.
+- Checker tasks are subject to normal rate-limiting, retry, and budget enforcement.
+
+---
+
+## Out of Scope
+
+- Checker result overriding the human (human can always accept/reject regardless)
+- Parallel checker runs (one checker per task, no re-run unless task is re-run)
+- Configurable checker agent type per project
+- Checker budget/timeout configuration beyond the defaults above
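
Note on the checker rules in the spec above: the three "do not spawn" conditions and the pass/fail handling can be sketched in a few lines of Go. This is a minimal illustration and not the project's code. The `Task` struct, the `hasChecker` callback (standing in for a lookup by `checker_for_task_id`), and both function names are assumptions made for the example. Only the field names and state names come from the spec.

```go
package main

import "fmt"

// Task is a hypothetical, trimmed-down stand-in for the project's task type.
// Only the fields the spec mentions are modelled here.
type Task struct {
	ID               string
	ParentTaskID     string
	CheckerForTaskID string
	State            string
	CheckerReport    string
}

// shouldSpawnChecker applies the three "do not spawn" rules from the spec.
// hasChecker is an assumed callback that reports whether a checker task
// already exists for the given task ID.
func shouldSpawnChecker(t Task, hasChecker func(id string) bool) bool {
	if t.ParentTaskID != "" {
		return false // subtasks never reach READY
	}
	if t.CheckerForTaskID != "" {
		return false // never check a checker
	}
	return !hasChecker(t.ID) // at most one checker per task
}

// applyCheckerResult mirrors the completion handling described in the spec:
// a pass moves the checked task to COMPLETED, a fail leaves its state alone
// and attaches the checker's summary as the report.
func applyCheckerResult(checked *Task, passed bool, summary string) {
	if passed {
		checked.State = "COMPLETED"
		return
	}
	checked.CheckerReport = summary
}

func main() {
	existing := map[string]bool{"t3": true}
	hasChecker := func(id string) bool { return existing[id] }

	fmt.Println(shouldSpawnChecker(Task{ID: "t1"}, hasChecker))
	fmt.Println(shouldSpawnChecker(Task{ID: "t2", ParentTaskID: "t1"}, hasChecker))
	fmt.Println(shouldSpawnChecker(Task{ID: "t3"}, hasChecker))

	checked := Task{ID: "t1", State: "READY"}
	applyCheckerResult(&checked, false, "2 tests failed")
	fmt.Println(checked.State, checked.CheckerReport)
}
```

In the real pool the existence check and the spawn would presumably need to happen atomically (or be made idempotent) so that two concurrent READY transitions cannot create two checkers. The spec does not say how that is handled.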
