claudomator/REFACTOR_PLAN.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155

# Claudomator Refactor Plan

**Date:** 2026-03-18
**Scope:** Architectural integrity, test coverage gaps, and potential bugs

---

## Executive Summary

The Claudomator codebase is well-structured with clear package separation and a sound architecture. The analysis identified **1 critical bug**, **2 medium-priority issues**, and **8 lower-priority improvements** across ~7,300 LOC. Test coverage is roughly 65% but the storage and API database integration tests cannot run without CGO enabled.

---

## Critical Bugs

### CRIT-1: Race Condition / Panic in `QuestionRegistry.Answer()`

**File:** `internal/executor/question.go:63`

The channel send occurs **outside** the mutex. If the corresponding `AskQuestion()` goroutine times out and closes or abandons the `AnswerCh` between the `Unlock()` and the send, the server panics with a send on closed channel.

```go
// Current (unsafe)
pq.AnswerCh <- answer  // outside lock — goroutine receiving may have exited

// Fix: use select to avoid blocking on a dead channel
select {
case pq.AnswerCh <- answer:
    return true
default:
    return false
}
```

**Action:** Guard the channel send with a `select`/`default` and add a test with `-race` flag that cancels the question context before answering.

---

## Medium-Priority Issues

### MED-1: Sandbox Directories Leak on Execution Failure

**File:** `internal/executor/claude.go:163-166`

When an execution fails and `project_dir` is set, the sandbox is intentionally preserved for debugging. There is no cleanup mechanism. Over time, `/tmp` accumulates preserved sandboxes and can exhaust disk space.

**Action:** Implement a cleanup policy — either a max-age sweep on `serve` startup, or a dedicated `/api/admin/cleanup-sandboxes` endpoint. Document the sandbox location pattern and manual cleanup steps.

### MED-2: VAPID Private Key Not Validated on Load

**File:** `internal/cli/serve.go:57`

The code validates the public VAPID key but not the private key. A corrupted or swapped private key in the database causes push notifications to fail silently at runtime with no regeneration triggered.

**Action:** Add `notify.ValidateVAPIDPrivateKey()` (or a combined `ValidateVAPIDKeyPair()`) and include it in the load check so a bad private key triggers regeneration, mirroring the existing public-key logic.

---

## Minor Issues

### MIN-1: `strconv.Atoi` Errors Silently Ignored in Changestats Parser

**File:** `internal/task/changestats.go:26,28,31`

Conversion errors from `strconv.Atoi` are discarded with `_`. The regex guarantees digits are present, so this rarely matters, but if the format ever changes the function returns zeroed stats with no diagnostic.

**Action:** Check the errors and log a warning when conversion unexpectedly fails.

### MIN-2: `json.Marshal` Error Ignored in `ResetTaskForRetry`

**File:** `internal/storage/db.go:258`

`json.Marshal(t.Agent)` error is discarded. If marshaling ever fails, the stored config JSON would be empty/corrupted, making the retried task unusable.

**Action:** Return the error to the caller.

### MIN-3: `os.Remove` Errors Silently Ignored

**File:** `internal/executor/claude.go:172,190`

Cleanup of temp files (`questionFile`, `summaryFile`) ignores all errors. While `ENOENT` is benign, other errors (permissions, I/O) should be logged as warnings to surface sandbox issues.

**Action:** Log `!os.IsNotExist(err)` errors at `Warn` level.

### MIN-4: Transaction Rollback Errors Not Logged

**File:** `internal/storage/db.go:214,242,492,614`

All `defer tx.Rollback()` calls use `//nolint:errcheck`. A failed rollback can indicate disk-full or corruption and should at least be logged.

**Action:** Replace with a logging closure that suppresses `sql.ErrTxDone` but logs anything else.

### MIN-5: Duplicate Error-Exit Boilerplate in `executor.go`

**File:** `internal/executor/executor.go` (multiple locations)

The pattern of "update task state → decrement activePerAgent → release slot" is repeated 3+ times across `execute()` and `executeResume()` with subtle differences that make the code hard to audit.

**Action:** Extract a private `p.abortExecution(t, agentType, err)` helper to consolidate this pattern and make the concurrency invariants explicit.

### MIN-6: Storage and API DB Integration Tests Disabled

**Context:** `internal/storage/db_test.go`, `internal/api/elaborate_test.go`, `internal/api/executions_test.go`, `internal/api/drops_test.go`

These tests fail when `CGO_ENABLED=0` because `go-sqlite3` requires CGo. The CI/build environment apparently compiles without CGo, disabling ~50 test cases.

**Action:** Ensure the build and test environment has `gcc` installed and `CGO_ENABLED=1` (or equivalent). Document this requirement in `CLAUDE.md` and the `Makefile`/`scripts/`.

### MIN-7: Missing Tests for `changestats.go`, `deployment.go`, `task_view.go`

**Files:** `internal/api/changestats.go`, `internal/api/deployment.go`, `internal/api/task_view.go`

These files have no dedicated test coverage. `changestats.go` in particular contains the double-write logic described in `CLAUDE.md` that would benefit from regression tests.

**Action:** Add unit tests for each file; at minimum cover the happy path and key error branches.

### MIN-8: Classifier Fallback Parsing Is Fragile

**File:** `internal/executor/classifier.go:91`

When the Gemini CLI returns non-JSON output, the code falls back to treating the raw string as the response. A subsequent JSON unmarshal attempt on an arbitrary string can silently produce wrong results.

**Action:** Separate the fallback path clearly, log a warning when JSON parsing fails, and add a test for the non-JSON fallback.

---

## Architecture / Observability Improvements (Low Priority)

These are quality-of-life improvements rather than bugs.

| ID | Area | Suggestion |
|----|------|-----------|
| OBS-1 | Metrics | Expose pool utilization, execution duration, cost-per-task, and retry rate via a `/metrics` (Prometheus) or `/api/stats` endpoint |
| OBS-2 | Health check | Enhance `/api/health` to include pool queue depth and DB connectivity |
| ARCH-1 | Sandbox package | Extract sandbox setup/teardown from `claude.go` into `internal/sandbox` to reduce the file's complexity and enable independent testing |
| ARCH-2 | Typed WebSocket messages | Replace `interface{}` in `internal/api/websocket.go` with a typed discriminated union to prevent accidental protocol drift |
| ARCH-3 | Prepared statements | Add prepared statements for the hot `GetTask` / `UpdateTaskState` queries to reduce per-call parsing overhead under load |

---

## Summary Table

| ID | Severity | File | Description |
|----|----------|------|-------------|
| CRIT-1 | Critical | executor/question.go | Panic: channel send outside lock |
| MED-1 | Medium | executor/claude.go | Sandbox dir leak on failure |
| MED-2 | Medium | cli/serve.go | VAPID private key not validated |
| MIN-1 | Minor | task/changestats.go | strconv errors silently ignored |
| MIN-2 | Minor | storage/db.go | json.Marshal error ignored in retry |
| MIN-3 | Minor | executor/claude.go | os.Remove errors silently ignored |
| MIN-4 | Minor | storage/db.go | Rollback errors not logged |
| MIN-5 | Minor | executor/executor.go | Duplicate error-exit boilerplate |
| MIN-6 | Minor | All test files | CGo required — DB tests disabled |
| MIN-7 | Minor | api/changestats.go et al | Missing test files |
| MIN-8 | Minor | executor/classifier.go | Fragile classifier fallback |