1 files changed, 155 insertions, 0 deletions
diff --git a/claudomator/REFACTOR_PLAN.md b/claudomator/REFACTOR_PLAN.md
new file mode 100644
index 0000000..f4497db
--- /dev/null
+++ b/claudomator/REFACTOR_PLAN.md
@@ -0,0 +1,155 @@
+# Claudomator Refactor Plan
+
+**Date:** 2026-03-18
+**Scope:** Architectural integrity, test coverage gaps, and potential bugs
+
+---
+
+## Executive Summary
+
+The Claudomator codebase is well-structured with clear package separation and a sound architecture. The analysis identified **1 critical bug**, **2 medium-priority issues**, and **8 lower-priority improvements** across ~7,300 LOC. Test coverage is roughly 65% but the storage and API database integration tests cannot run without CGO enabled.
+
+---
+
+## Critical Bugs
+
+### CRIT-1: Race Condition / Panic in `QuestionRegistry.Answer()`
+
+**File:** `internal/executor/question.go:63`
+
+The channel send occurs **outside** the mutex. If the corresponding `AskQuestion()` goroutine times out and closes or abandons the `AnswerCh` between the `Unlock()` and the send, the server panics with a send on closed channel.
+
+```go
+// Current (unsafe)
+pq.AnswerCh <- answer  // outside lock — goroutine receiving may have exited
+
+// Fix: use select to avoid blocking on a dead channel
+select {
+case pq.AnswerCh <- answer:
+    return true
+default:
+    return false
+}
+```
+
+**Action:** Guard the channel send with a `select`/`default` and add a test with `-race` flag that cancels the question context before answering.
+
+---
+
+## Medium-Priority Issues
+
+### MED-1: Sandbox Directories Leak on Execution Failure
+
+**File:** `internal/executor/claude.go:163-166`
+
+When an execution fails and `project_dir` is set, the sandbox is intentionally preserved for debugging. There is no cleanup mechanism. Over time, `/tmp` accumulates preserved sandboxes and can exhaust disk space.
+
+**Action:** Implement a cleanup policy — either a max-age sweep on `serve` startup, or a dedicated `/api/admin/cleanup-sandboxes` endpoint. Document the sandbox location pattern and manual cleanup steps.
+
+### MED-2: VAPID Private Key Not Validated on Load
+
+**File:** `internal/cli/serve.go:57`
+
+The code validates the public VAPID key but not the private key. A corrupted or swapped private key in the database causes push notifications to fail silently at runtime with no regeneration triggered.
+
+**Action:** Add `notify.ValidateVAPIDPrivateKey()` (or a combined `ValidateVAPIDKeyPair()`) and include it in the load check so a bad private key triggers regeneration, mirroring the existing public-key logic.
+
+---
+
+## Minor Issues
+
+### MIN-1: `strconv.Atoi` Errors Silently Ignored in Changestats Parser
+
+**File:** `internal/task/changestats.go:26,28,31`
+
+Conversion errors from `strconv.Atoi` are discarded with `_`. The regex guarantees digits are present, so this rarely matters, but if the format ever changes the function returns zeroed stats with no diagnostic.
+
+**Action:** Check the errors and log a warning when conversion unexpectedly fails.
+
+### MIN-2: `json.Marshal` Error Ignored in `ResetTaskForRetry`
+
+**File:** `internal/storage/db.go:258`
+
+`json.Marshal(t.Agent)` error is discarded. If marshaling ever fails, the stored config JSON would be empty/corrupted, making the retried task unusable.
+
+**Action:** Return the error to the caller.
+
+### MIN-3: `os.Remove` Errors Silently Ignored
+
+**File:** `internal/executor/claude.go:172,190`
+
+Cleanup of temp files (`questionFile`, `summaryFile`) ignores all errors. While `ENOENT` is benign, other errors (permissions, I/O) should be logged as warnings to surface sandbox issues.
+
+**Action:** Log `!os.IsNotExist(err)` errors at `Warn` level.
+
+### MIN-4: Transaction Rollback Errors Not Logged
+
+**File:** `internal/storage/db.go:214,242,492,614`
+
+All `defer tx.Rollback()` calls use `//nolint:errcheck`. A failed rollback can indicate disk-full or corruption and should at least be logged.
+
+**Action:** Replace with a logging closure that suppresses `sql.ErrTxDone` but logs anything else.
+
+### MIN-5: Duplicate Error-Exit Boilerplate in `executor.go`
+
+**File:** `internal/executor/executor.go` (multiple locations)
+
+The pattern of "update task state → decrement activePerAgent → release slot" is repeated 3+ times across `execute()` and `executeResume()` with subtle differences that make the code hard to audit.
+
+**Action:** Extract a private `p.abortExecution(t, agentType, err)` helper to consolidate this pattern and make the concurrency invariants explicit.
+
+### MIN-6: Storage and API DB Integration Tests Disabled
+
+**Context:** `internal/storage/db_test.go`, `internal/api/elaborate_test.go`, `internal/api/executions_test.go`, `internal/api/drops_test.go`
+
+These tests fail when `CGO_ENABLED=0` because `go-sqlite3` requires CGo. The CI/build environment apparently compiles without CGo, disabling ~50 test cases.
+
+**Action:** Ensure the build and test environment has `gcc` installed and `CGO_ENABLED=1` (or equivalent). Document this requirement in `CLAUDE.md` and the `Makefile`/`scripts/`.
+
+### MIN-7: Missing Tests for `changestats.go`, `deployment.go`, `task_view.go`
+
+**Files:** `internal/api/changestats.go`, `internal/api/deployment.go`, `internal/api/task_view.go`
+
+These files have no dedicated test coverage. `changestats.go` in particular contains the double-write logic described in `CLAUDE.md` that would benefit from regression tests.
+
+**Action:** Add unit tests for each file; at minimum cover the happy path and key error branches.
+
+### MIN-8: Classifier Fallback Parsing Is Fragile
+
+**File:** `internal/executor/classifier.go:91`
+
+When the Gemini CLI returns non-JSON output, the code falls back to treating the raw string as the response. A subsequent JSON unmarshal attempt on an arbitrary string can silently produce wrong results.
+
+**Action:** Separate the fallback path clearly, log a warning when JSON parsing fails, and add a test for the non-JSON fallback.
+
+---
+
+## Architecture / Observability Improvements (Low Priority)
+
+These are quality-of-life improvements rather than bugs.
+
+| ID | Area | Suggestion |
+|----|------|-----------|
+| OBS-1 | Metrics | Expose pool utilization, execution duration, cost-per-task, and retry rate via a `/metrics` (Prometheus) or `/api/stats` endpoint |
+| OBS-2 | Health check | Enhance `/api/health` to include pool queue depth and DB connectivity |
+| ARCH-1 | Sandbox package | Extract sandbox setup/teardown from `claude.go` into `internal/sandbox` to reduce the file's complexity and enable independent testing |
+| ARCH-2 | Typed WebSocket messages | Replace `interface{}` in `internal/api/websocket.go` with a typed discriminated union to prevent accidental protocol drift |
+| ARCH-3 | Prepared statements | Add prepared statements for the hot `GetTask` / `UpdateTaskState` queries to reduce per-call parsing overhead under load |
+
+---
+
+## Summary Table
+
+| ID | Severity | File | Description |
+|----|----------|------|-------------|
+| CRIT-1 | Critical | executor/question.go | Panic: channel send outside lock |
+| MED-1 | Medium | executor/claude.go | Sandbox dir leak on failure |
+| MED-2 | Medium | cli/serve.go | VAPID private key not validated |
+| MIN-1 | Minor | task/changestats.go | strconv errors silently ignored |
+| MIN-2 | Minor | storage/db.go | json.Marshal error ignored in retry |
+| MIN-3 | Minor | executor/claude.go | os.Remove errors silently ignored |
+| MIN-4 | Minor | storage/db.go | Rollback errors not logged |
+| MIN-5 | Minor | executor/executor.go | Duplicate error-exit boilerplate |
+| MIN-6 | Minor | All test files | CGo required — DB tests disabled |
+| MIN-7 | Minor | api/changestats.go et al | Missing test files |
+| MIN-8 | Minor | executor/classifier.go | Fragile classifier fallback |