docs: task checker agent and story ship gate design spec

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
author: Peter Stone <thepeterstone@gmail.com> 2026-04-04 07:40:45 +0000
committer: Peter Stone <thepeterstone@gmail.com> 2026-04-04 07:40:45 +0000
commit: 68e43ca49a3002218ec057296c14f1c02d9f3237 (patch)
tree: 2c4624bd2d300b9d539333c518f7660eebfc5221
parent: 774a33431f7ae8a54082f5bca5db0019d6459a60 (diff)
1 files changed, 222 insertions, 0 deletions
diff --git a/docs/superpowers/specs/2026-04-04-task-checker-story-ship.md b/docs/superpowers/specs/2026-04-04-task-checker-story-ship.md
new file mode 100644
index 0000000..1be2f3c
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-04-task-checker-story-ship.md
@@ -0,0 +1,222 @@
+# Task Checker Agent and Story Ship Gate — Design Spec
+
+**Date:** 2026-04-04
+**Status:** Approved
+
+---
+
+## Goal
+
+Reduce per-task human review burden by running an independent checker agent after every top-level task completes. If the checker passes, the task auto-accepts. Human attention is required only when the checker flags a problem. Stories accumulate completed tasks until all are done, then surface a single "Ship" action for human approval before deploy.
+
+---
+
+## Context
+
+The current flow requires a human to manually accept every READY task. For stories with many tasks this creates friction: the human must review each one before `checkStoryCompletion` can fire and ship the story. In addition, `checkStoryCompletion` currently triggers deploy automatically, with no human gate at the story level.
+
+ADR-007 describes a post-deploy validation agent (runs after merge, verifies the live deployment). This spec adds a pre-ship, per-task checker that is independent of the implementor and runs asynchronously.
+
+---
+
+## Design
+
+### Two-tier validation
+
+| Tier | When | Run by | Outcome |
+|---|---|---|---|
+| **Checker** (this spec) | After task → READY | Different agent from implementor | Auto-accepts on pass; on fail, leaves the task READY and attaches a report |
+| **Post-deploy validation** (existing) | After story → DEPLOYED | Separate validation task | Story → REVIEW_READY or NEEDS_FIX |
+
+### Full story flow after this change
+
+```
+Task runs → READY
+  → checker task spawned (async, independent)
+  → checker passes: task → COMPLETED (silent)
+  → checker fails:  task stays READY, checker_report attached
+
+All story top-level tasks COMPLETED
+  → story → SHIPPABLE (human gate — no auto-deploy)
+
+Human clicks "Ship" on SHIPPABLE story
+  → merge branch to main + run deploy script → DEPLOYED
+  → post-deploy validation task created → VALIDATING
+  → REVIEW_READY | NEEDS_FIX
+```
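
The completion rule in the flow above (a story becomes SHIPPABLE only once all of its top-level tasks are COMPLETED) can be sketched as a small predicate. This is a minimal illustration, not the project's `checkStoryCompletion`: the `Task` struct and `storyIsShippable` are hypothetical names, and treating a story with no top-level tasks as not shippable is an assumption.

```go
package main

import "fmt"

// Task is a hypothetical, trimmed-down view of a story task.
type Task struct {
	ID           string
	ParentTaskID string // non-empty for subtasks
	State        string // e.g. "READY", "COMPLETED"
}

// storyIsShippable reports whether every top-level task is COMPLETED.
// Subtasks are skipped. A story with no top-level tasks is treated as
// not shippable, which is an assumption rather than something the spec states.
func storyIsShippable(tasks []Task) bool {
	topLevel := 0
	for _, t := range tasks {
		if t.ParentTaskID != "" {
			continue
		}
		topLevel++
		if t.State != "COMPLETED" {
			return false
		}
	}
	return topLevel > 0
}

func main() {
	tasks := []Task{
		{ID: "a", State: "COMPLETED"},
		{ID: "b", State: "READY"}, // checker failed or is still running
	}
	fmt.Println("shippable:", storyIsShippable(tasks))

	tasks[1].State = "COMPLETED" // checker passed or a human accepted
	fmt.Println("shippable:", storyIsShippable(tasks))
}
```

Checker tasks never reach this predicate because they carry no `story_id` and so are not part of the story's task list.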
+
+---
+
+## Data Model
+
+### `tasks` table — three new columns
+
+| Column | Type | Purpose |
+|---|---|---|
+| `acceptance_criteria` | `TEXT` | Criteria for the checker. Empty = use task instructions as spec. |
+| `checker_for_task_id` | `TEXT` | Set on checker tasks. Points to the task being checked. |
+| `checker_report` | `TEXT` | Populated when checker fails. Shown in the UI on the READY task. |
+
+Checker tasks have `checker_for_task_id` set and no `story_id`. They do not appear in story task lists and do not affect `checkStoryCompletion`.
+
+---
+
+## Acceptance Criteria Source
+
+**Story tasks:** The elaborator generates `acceptance_criteria` alongside each task's instructions. Example:
+
+```
+Run the full test suite and verify all tests pass.
+Confirm the /api/tasks endpoint returns repository_url in the response body.
+```
+
+**Standalone tasks (no story):** `acceptance_criteria` is empty. The checker uses the task's own `instructions` field as the specification.
+
+---
+
+## Checker Task
+
+### Creation
+
+The executor pool spawns a checker when a top-level task (no `parent_task_id`, no `checker_for_task_id`) transitions to READY. It constructs and submits the checker task immediately:
+
+```
+Name:               "Check: <original task name>"
+checker_for_task_id: <original task ID>
+story_id:           (empty — checker is not a story task)
+repository_url:     same as the checked task
+agent.type:         claude
+agent.instructions: (see below)
+agent.max_budget_usd: 0.50
+timeout:            10m
+retry.max_attempts: 1
+```
+
+**Do not spawn a checker if:**
+- `t.ParentTaskID != ""` (subtasks go directly to COMPLETED, never READY)
+- `t.CheckerForTaskID != ""` (never check a checker)
+- A checker task already exists for this task (query by `checker_for_task_id` before spawning)
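
The creation rules above could look roughly like the following. This is a sketch under assumptions: `Task` is a flattened, hypothetical stand-in for the real task type (the spec only gives field names such as `ParentTaskID` and `CheckerForTaskID`), the `lookup` argument stands in for `storage.DB.GetCheckerTask`, and agent instructions and ID assignment are left out.

```go
package main

import (
	"fmt"
	"time"
)

// Task is a hypothetical, flattened stand-in for the real task type.
type Task struct {
	ID               string
	Name             string
	ParentTaskID     string
	CheckerForTaskID string
	StoryID          string
	RepositoryURL    string
	AgentType        string
	MaxBudgetUSD     float64
	Timeout          time.Duration
	MaxAttempts      int
}

// shouldSpawnChecker applies the three guards: no subtasks, no checkers
// of checkers, and no second checker for a task that already has one.
func shouldSpawnChecker(t Task, lookup func(checkedTaskID string) (*Task, error)) (bool, error) {
	if t.ParentTaskID != "" || t.CheckerForTaskID != "" {
		return false, nil
	}
	existing, err := lookup(t.ID)
	if err != nil {
		return false, err
	}
	return existing == nil, nil
}

// newCheckerTask fills in the fixed fields from the creation block.
func newCheckerTask(checked Task) Task {
	return Task{
		Name:             "Check: " + checked.Name,
		CheckerForTaskID: checked.ID,
		StoryID:          "", // checker is not a story task
		RepositoryURL:    checked.RepositoryURL,
		AgentType:        "claude",
		MaxBudgetUSD:     0.50,
		Timeout:          10 * time.Minute,
		MaxAttempts:      1,
	}
}

func main() {
	done := Task{ID: "t1", Name: "Add repository_url", StoryID: "s1", RepositoryURL: "https://example.com/repo.git"}
	none := func(string) (*Task, error) { return nil, nil }

	if ok, err := shouldSpawnChecker(done, none); err == nil && ok {
		c := newCheckerTask(done)
		fmt.Println(c.Name, c.CheckerForTaskID, c.Timeout)
	}
}
```

Doing the existence lookup last keeps the two cheap field checks from touching storage at all.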
+
+### Instructions template
+
+```
+You are validating a completed task. Do not make any changes to the code or repository.
+
+Task: <name>
+Instructions given to the implementor:
+<task instructions>
+
+Acceptance criteria:
+<acceptance_criteria, or task instructions if acceptance_criteria is empty>
+
+Steps:
+1. Clone the repository and review the changes made.
+2. Verify each acceptance criterion is met. Run tests or make HTTP requests as needed.
+3. If all criteria are satisfied, exit normally (success).
+4. If any criterion is not met, first write a brief summary of what failed, then use the Bash tool to exit with a non-zero code:
+   bash -c "exit 1"
+```
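
Filling the template, including the fallback to the task's own instructions when `acceptance_criteria` is empty, might be done as below. This is a sketch: `buildCheckerInstructions` is a hypothetical helper, the template is abbreviated (the Steps section is omitted), and treating whitespace-only criteria as empty is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// checkerTemplate is an abbreviated copy of the instructions template.
const checkerTemplate = `You are validating a completed task. Do not make any changes to the code or repository.

Task: %s
Instructions given to the implementor:
%s

Acceptance criteria:
%s
`

// buildCheckerInstructions renders the template, falling back to the
// task instructions when no acceptance criteria were provided.
func buildCheckerInstructions(name, instructions, acceptanceCriteria string) string {
	criteria := acceptanceCriteria
	if strings.TrimSpace(criteria) == "" {
		criteria = instructions
	}
	return fmt.Sprintf(checkerTemplate, name, instructions, criteria)
}

func main() {
	// Standalone task: no acceptance criteria, so the instructions are reused.
	fmt.Print(buildCheckerInstructions("Fix login bug", "Make /login return 200 for valid users.", ""))
}
```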
+
+### Completion handling
+
+In `executor.Pool.handleRunResult`, after determining the task outcome, check `t.CheckerForTaskID`:
+
+- **Checker succeeded (exit 0):** Call `store.UpdateTaskState(t.CheckerForTaskID, task.StateCompleted)`. Then call `checkStoryCompletion` for the checked task's story (if any).
+- **Checker failed (non-zero exit or error):** Extract the execution summary and call `store.UpdateTaskCheckerReport(t.CheckerForTaskID, summary)`. The checked task stays READY; the report surfaces in the UI.
+
+The checker task itself always resolves to COMPLETED or FAILED through the normal state machine — no special states needed.
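
The two branches above can be sketched against a small store interface. Everything here is illustrative: `Store`, `memStore`, and `handleCheckerResult` are hypothetical, states are plain strings rather than the project's `task.State` values, and the follow-up call to `checkStoryCompletion` is left out.

```go
package main

import "fmt"

const StateCompleted = "COMPLETED"

// Task carries only the fields this sketch needs.
type Task struct {
	ID               string
	CheckerForTaskID string
}

// Store is a hypothetical subset of the storage layer.
type Store interface {
	UpdateTaskState(id, state string) error
	UpdateTaskCheckerReport(id, report string) error
}

// handleCheckerResult applies a finished checker's outcome to the task it checked.
// Non-checker tasks are ignored.
func handleCheckerResult(s Store, t Task, succeeded bool, summary string) error {
	if t.CheckerForTaskID == "" {
		return nil
	}
	if succeeded {
		return s.UpdateTaskState(t.CheckerForTaskID, StateCompleted)
	}
	// On failure the checked task's state is left untouched (it stays READY).
	return s.UpdateTaskCheckerReport(t.CheckerForTaskID, summary)
}

// memStore is an in-memory Store used only to exercise the sketch.
type memStore struct {
	states  map[string]string
	reports map[string]string
}

func newMemStore() *memStore {
	return &memStore{states: map[string]string{}, reports: map[string]string{}}
}

func (m *memStore) UpdateTaskState(id, state string) error {
	m.states[id] = state
	return nil
}

func (m *memStore) UpdateTaskCheckerReport(id, report string) error {
	m.reports[id] = report
	return nil
}

func main() {
	s := newMemStore()
	s.states["t1"], s.states["t2"] = "READY", "READY"

	_ = handleCheckerResult(s, Task{ID: "c1", CheckerForTaskID: "t1"}, true, "")
	_ = handleCheckerResult(s, Task{ID: "c2", CheckerForTaskID: "t2"}, false, "go test ./... failed")

	fmt.Println(s.states["t1"], s.states["t2"], s.reports["t2"])
}
```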
+
+---
+
+## Story Ship Gate
+
+### Remove auto-deploy from `checkStoryCompletion`
+
+Current code at the end of `checkStoryCompletion`:
+```go
+go p.triggerStoryDeploy(ctx, storyID)  // REMOVE THIS
+```
+
+After this change, `checkStoryCompletion` only transitions the story to SHIPPABLE. Deploy is triggered explicitly by the human.
+
+### New endpoint
+
+`POST /api/stories/{id}/ship`
+
+- Verifies story state is SHIPPABLE. Returns 409 Conflict otherwise.
+- Calls `p.triggerStoryDeploy(ctx, storyID)` (existing function, no changes needed).
+- Returns 202 Accepted immediately; deploy runs async.
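
A handler with this behaviour might look like the sketch below. The `shipDeps` hooks, the manual path parsing, and the 404 for an unknown story are assumptions made for illustration; only the 409 and 202 responses and the asynchronous deploy come from the spec.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// shipDeps holds hypothetical hooks into the story store and deploy trigger.
type shipDeps struct {
	storyState    func(id string) (state string, found bool)
	triggerDeploy func(id string) // stands in for p.triggerStoryDeploy
}

func shipHandler(d shipDeps) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		// Extract {id} from /api/stories/{id}/ship without relying on a router.
		id := strings.TrimSuffix(strings.TrimPrefix(r.URL.Path, "/api/stories/"), "/ship")
		state, found := d.storyState(id)
		if !found {
			http.Error(w, "story not found", http.StatusNotFound) // assumption: not in the spec
			return
		}
		if state != "SHIPPABLE" {
			http.Error(w, "story is not SHIPPABLE", http.StatusConflict)
			return
		}
		go d.triggerDeploy(id) // deploy runs asynchronously
		w.WriteHeader(http.StatusAccepted)
	}
}

// shipStatus sends one request through the handler and returns the status code.
func shipStatus(d shipDeps, method, path string) int {
	rec := httptest.NewRecorder()
	shipHandler(d)(rec, httptest.NewRequest(method, path, nil))
	return rec.Code
}

func main() {
	states := map[string]string{"s1": "SHIPPABLE", "s2": "DEPLOYED"}
	d := shipDeps{
		storyState:    func(id string) (string, bool) { s, ok := states[id]; return s, ok },
		triggerDeploy: func(id string) {},
	}
	fmt.Println(shipStatus(d, http.MethodPost, "/api/stories/s1/ship"))
	fmt.Println(shipStatus(d, http.MethodPost, "/api/stories/s2/ship"))
}
```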
+
+### UI
+
+The stories panel shows a **"Ship"** button on any story in SHIPPABLE state; the button calls `POST /api/stories/{id}/ship`. No other changes to the stories panel are required.
+
+---
+
+## Task Card UI
+
+When `checker_report` is non-empty on a READY task:
+
+- Show a warning badge on the task card (e.g. "⚠ Checker failed")
+- Expand the card or side panel to show the report text
+- Human can still accept or reject the task manually regardless of checker result
+
+When a checker is pending/running for a READY task:
+
+- Show a subtle indicator (e.g. "Checking…") — optional, nice to have
+
+---
+
+## Elaborator Changes
+
+The story elaboration endpoint currently returns a list of tasks with `name` and `instructions`. Add `acceptance_criteria` to each task in the elaborated output:
+
+```json
+{
+  "tasks": [
+    {
+      "name": "Add repository_url to task struct",
+      "instructions": "...",
+      "acceptance_criteria": "Run go test ./... and verify all tests pass. Confirm Task struct has RepositoryURL field with correct json tag."
+    }
+  ]
+}
+```
+
+The elaborator prompt should instruct the LLM to write acceptance criteria that are concrete and verifiable by a separate agent: specific commands to run, specific API responses to check, specific file contents to verify. Vague criteria like "code looks good" are not acceptable.
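
On the consuming side, the elaborated output could be decoded as follows. The type and function names are hypothetical, and flagging tasks with blank criteria is an assumption about how a caller might enforce the guidance above, not behaviour the spec requires.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// ElaboratedTask mirrors one entry of the "tasks" array.
type ElaboratedTask struct {
	Name               string `json:"name"`
	Instructions       string `json:"instructions"`
	AcceptanceCriteria string `json:"acceptance_criteria"`
}

// Elaboration mirrors the top-level elaborator response.
type Elaboration struct {
	Tasks []ElaboratedTask `json:"tasks"`
}

// parseElaboration decodes the response and returns the names of tasks
// whose acceptance criteria are missing or blank.
func parseElaboration(data []byte) (Elaboration, []string, error) {
	var e Elaboration
	if err := json.Unmarshal(data, &e); err != nil {
		return Elaboration{}, nil, err
	}
	var missing []string
	for _, t := range e.Tasks {
		if strings.TrimSpace(t.AcceptanceCriteria) == "" {
			missing = append(missing, t.Name)
		}
	}
	return e, missing, nil
}

func main() {
	raw := []byte(`{"tasks":[{"name":"Add repository_url to task struct","instructions":"...","acceptance_criteria":"Run go test ./... and verify all tests pass."}]}`)
	e, missing, err := parseElaboration(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(len(e.Tasks), "task(s),", len(missing), "without criteria")
}
```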
+
+---
+
+## Storage
+
+New methods on `storage.DB`:
+
+```go
+// UpdateTaskCheckerReport sets the checker_report field on a task.
+func (s *DB) UpdateTaskCheckerReport(id, report string) error
+
+// GetCheckerTask returns the checker task for a given task ID, or nil if none exists.
+func (s *DB) GetCheckerTask(checkedTaskID string) (*task.Task, error)
+```
+
+Migration: one `ALTER TABLE` statement per new column, e.g. `ALTER TABLE tasks ADD COLUMN acceptance_criteria TEXT NOT NULL DEFAULT ''`, with the same type and default for `checker_for_task_id` and `checker_report`.
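
Spelled out, the migration amounts to three statements. The helper below only generates the SQL text; how the project actually registers and runs migrations is not shown in the spec, so `checkerMigrations` is a hypothetical name.

```go
package main

import "fmt"

// checkerColumns lists the three new columns on the tasks table.
var checkerColumns = []string{"acceptance_criteria", "checker_for_task_id", "checker_report"}

// checkerMigrations returns one ALTER TABLE statement per new column.
func checkerMigrations() []string {
	stmts := make([]string, 0, len(checkerColumns))
	for _, col := range checkerColumns {
		stmts = append(stmts, fmt.Sprintf("ALTER TABLE tasks ADD COLUMN %s TEXT NOT NULL DEFAULT ''", col))
	}
	return stmts
}

func main() {
	for _, stmt := range checkerMigrations() {
		fmt.Println(stmt)
	}
}
```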
+
+---
+
+## What This Does Not Change
+
+- Post-deploy validation flow (DEPLOYED → VALIDATING → REVIEW_READY/NEEDS_FIX) is unchanged.
+- Subtask handling is unchanged — subtasks never go READY, so they never get checkers.
+- The `handleAcceptTask` endpoint remains — humans can still manually accept READY tasks.
+- The `handleRejectTask` endpoint remains — humans can still manually reject.
+- Checker tasks are subject to normal rate-limiting, retry, and budget enforcement.
+
+---
+
+## Out of Scope
+
+- Checker result overriding the human (human can always accept/reject regardless)
+- Parallel checker runs (one checker per task, no re-run unless task is re-run)
+- Configurable checker agent type per project
+- Checker budget/timeout configuration beyond the defaults above