From 0bb4c1923b864c6d47e1f18c379bee77bbdb2980 Mon Sep 17 00:00:00 2001 From: Claudomator Agent Date: Wed, 18 Mar 2026 10:16:18 +0000 Subject: docs: add REFACTOR_PLAN.md with codebase analysis findings Analyzed claudomator for architectural integrity, test coverage gaps, and bugs. Documents 1 critical race condition (QuestionRegistry.Answer panics on closed channel), 2 medium issues (sandbox leak, VAPID private key validation), and 8 minor issues covering error handling, test coverage gaps, and code duplication. 11 discrete subtasks created in Claudomator for each actionable item. Co-Authored-By: Claude Sonnet 4.6 --- claudomator/REFACTOR_PLAN.md | 155 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 155 insertions(+) create mode 100644 claudomator/REFACTOR_PLAN.md diff --git a/claudomator/REFACTOR_PLAN.md b/claudomator/REFACTOR_PLAN.md new file mode 100644 index 0000000..f4497db --- /dev/null +++ b/claudomator/REFACTOR_PLAN.md @@ -0,0 +1,155 @@ +# Claudomator Refactor Plan + +**Date:** 2026-03-18 +**Scope:** Architectural integrity, test coverage gaps, and potential bugs + +--- + +## Executive Summary + +The Claudomator codebase is well-structured with clear package separation and a sound architecture. The analysis identified **1 critical bug**, **2 medium-priority issues**, and **8 lower-priority improvements** across ~7,300 LOC. Test coverage is roughly 65% but the storage and API database integration tests cannot run without CGO enabled. + +--- + +## Critical Bugs + +### CRIT-1: Race Condition / Panic in `QuestionRegistry.Answer()` + +**File:** `internal/executor/question.go:63` + +The channel send occurs **outside** the mutex. If the corresponding `AskQuestion()` goroutine times out and closes or abandons the `AnswerCh` between the `Unlock()` and the send, the server panics with a send on closed channel. + +```go +// Current (unsafe) +pq.AnswerCh <- answer // outside lock — goroutine receiving may have exited + +// Fix: use select to avoid blocking on a dead channel +select { +case pq.AnswerCh <- answer: + return true +default: + return false +} +``` + +**Action:** Guard the channel send with a `select`/`default` and add a test with `-race` flag that cancels the question context before answering. + +--- + +## Medium-Priority Issues + +### MED-1: Sandbox Directories Leak on Execution Failure + +**File:** `internal/executor/claude.go:163-166` + +When an execution fails and `project_dir` is set, the sandbox is intentionally preserved for debugging. There is no cleanup mechanism. Over time, `/tmp` accumulates preserved sandboxes and can exhaust disk space. + +**Action:** Implement a cleanup policy — either a max-age sweep on `serve` startup, or a dedicated `/api/admin/cleanup-sandboxes` endpoint. Document the sandbox location pattern and manual cleanup steps. + +### MED-2: VAPID Private Key Not Validated on Load + +**File:** `internal/cli/serve.go:57` + +The code validates the public VAPID key but not the private key. A corrupted or swapped private key in the database causes push notifications to fail silently at runtime with no regeneration triggered. + +**Action:** Add `notify.ValidateVAPIDPrivateKey()` (or a combined `ValidateVAPIDKeyPair()`) and include it in the load check so a bad private key triggers regeneration, mirroring the existing public-key logic. + +--- + +## Minor Issues + +### MIN-1: `strconv.Atoi` Errors Silently Ignored in Changestats Parser + +**File:** `internal/task/changestats.go:26,28,31` + +Conversion errors from `strconv.Atoi` are discarded with `_`. The regex guarantees digits are present, so this rarely matters, but if the format ever changes the function returns zeroed stats with no diagnostic. + +**Action:** Check the errors and log a warning when conversion unexpectedly fails. + +### MIN-2: `json.Marshal` Error Ignored in `ResetTaskForRetry` + +**File:** `internal/storage/db.go:258` + +`json.Marshal(t.Agent)` error is discarded. If marshaling ever fails, the stored config JSON would be empty/corrupted, making the retried task unusable. + +**Action:** Return the error to the caller. + +### MIN-3: `os.Remove` Errors Silently Ignored + +**File:** `internal/executor/claude.go:172,190` + +Cleanup of temp files (`questionFile`, `summaryFile`) ignores all errors. While `ENOENT` is benign, other errors (permissions, I/O) should be logged as warnings to surface sandbox issues. + +**Action:** Log `!os.IsNotExist(err)` errors at `Warn` level. + +### MIN-4: Transaction Rollback Errors Not Logged + +**File:** `internal/storage/db.go:214,242,492,614` + +All `defer tx.Rollback()` calls use `//nolint:errcheck`. A failed rollback can indicate disk-full or corruption and should at least be logged. + +**Action:** Replace with a logging closure that suppresses `sql.ErrTxDone` but logs anything else. + +### MIN-5: Duplicate Error-Exit Boilerplate in `executor.go` + +**File:** `internal/executor/executor.go` (multiple locations) + +The pattern of "update task state → decrement activePerAgent → release slot" is repeated 3+ times across `execute()` and `executeResume()` with subtle differences that make the code hard to audit. + +**Action:** Extract a private `p.abortExecution(t, agentType, err)` helper to consolidate this pattern and make the concurrency invariants explicit. + +### MIN-6: Storage and API DB Integration Tests Disabled + +**Context:** `internal/storage/db_test.go`, `internal/api/elaborate_test.go`, `internal/api/executions_test.go`, `internal/api/drops_test.go` + +These tests fail when `CGO_ENABLED=0` because `go-sqlite3` requires CGo. The CI/build environment apparently compiles without CGo, disabling ~50 test cases. + +**Action:** Ensure the build and test environment has `gcc` installed and `CGO_ENABLED=1` (or equivalent). Document this requirement in `CLAUDE.md` and the `Makefile`/`scripts/`. + +### MIN-7: Missing Tests for `changestats.go`, `deployment.go`, `task_view.go` + +**Files:** `internal/api/changestats.go`, `internal/api/deployment.go`, `internal/api/task_view.go` + +These files have no dedicated test coverage. `changestats.go` in particular contains the double-write logic described in `CLAUDE.md` that would benefit from regression tests. + +**Action:** Add unit tests for each file; at minimum cover the happy path and key error branches. + +### MIN-8: Classifier Fallback Parsing Is Fragile + +**File:** `internal/executor/classifier.go:91` + +When the Gemini CLI returns non-JSON output, the code falls back to treating the raw string as the response. A subsequent JSON unmarshal attempt on an arbitrary string can silently produce wrong results. + +**Action:** Separate the fallback path clearly, log a warning when JSON parsing fails, and add a test for the non-JSON fallback. + +--- + +## Architecture / Observability Improvements (Low Priority) + +These are quality-of-life improvements rather than bugs. + +| ID | Area | Suggestion | +|----|------|-----------| +| OBS-1 | Metrics | Expose pool utilization, execution duration, cost-per-task, and retry rate via a `/metrics` (Prometheus) or `/api/stats` endpoint | +| OBS-2 | Health check | Enhance `/api/health` to include pool queue depth and DB connectivity | +| ARCH-1 | Sandbox package | Extract sandbox setup/teardown from `claude.go` into `internal/sandbox` to reduce the file's complexity and enable independent testing | +| ARCH-2 | Typed WebSocket messages | Replace `interface{}` in `internal/api/websocket.go` with a typed discriminated union to prevent accidental protocol drift | +| ARCH-3 | Prepared statements | Add prepared statements for the hot `GetTask` / `UpdateTaskState` queries to reduce per-call parsing overhead under load | + +--- + +## Summary Table + +| ID | Severity | File | Description | +|----|----------|------|-------------| +| CRIT-1 | Critical | executor/question.go | Panic: channel send outside lock | +| MED-1 | Medium | executor/claude.go | Sandbox dir leak on failure | +| MED-2 | Medium | cli/serve.go | VAPID private key not validated | +| MIN-1 | Minor | task/changestats.go | strconv errors silently ignored | +| MIN-2 | Minor | storage/db.go | json.Marshal error ignored in retry | +| MIN-3 | Minor | executor/claude.go | os.Remove errors silently ignored | +| MIN-4 | Minor | storage/db.go | Rollback errors not logged | +| MIN-5 | Minor | executor/executor.go | Duplicate error-exit boilerplate | +| MIN-6 | Minor | All test files | CGo required — DB tests disabled | +| MIN-7 | Minor | api/changestats.go et al | Missing test files | +| MIN-8 | Minor | executor/classifier.go | Fragile classifier fallback | -- cgit v1.2.3