# ADR-005: Git Sandbox Execution Model

## Status
Accepted

## Context

Tasks that modify source code need a safe execution environment. Running agents
directly in the canonical working copy creates several problems:

1. **Concurrent corruption**: multiple agents running in the same directory stomp
   on each other's changes.
2. **Partial work leaks**: if a task is cancelled mid-run, half-written files
   remain in the working tree, blocking other work.
3. **No rollback**: a failed agent may leave the codebase in a broken state with
   no clean way to recover without manual `git reset`.
4. **Audit trail**: changes made by an agent should be visible as discrete,
   attributable git commits — not as an anonymous blob of working-tree noise.

## Decision

When a task has `agent.project_dir` set, `ClaudeRunner.Run` executes the agent
in an isolated git clone (a "sandbox") rather than in the project directory
directly.

### Sandbox lifecycle

```
project_dir (canonical working copy)
    |
    | git clone --no-hardlinks <clone-source> /tmp/claudomator-sandbox-*
    |
    v
sandbox (temp clone)
    |
    | agent runs here; commits its changes
    |
    | git push (to bare repo "local" or "origin")
    |
teardown: verify no uncommitted changes, remove sandbox dir
```

### Clone source

`sandboxCloneSource` prefers a remote named `"local"` (a local bare repo).
If not present it falls back to the `"origin"` remote. Using a bare repo
accepts pushes cleanly; pushing to a non-bare working copy fails when the
receiving branch is checked out.

### Uncommitted-change enforcement

Before teardown, the runner runs `git status --porcelain` in the sandbox.
If any uncommitted changes are detected, the task is failed with an error
message listing the files. The sandbox is **preserved** so the operator
can inspect or recover the partial work. The error message includes the
sandbox path.

### Concurrent push conflicts

If two sandboxes try to push at the same time, the second push is rejected
(`"fetch first"` or `"non-fast-forward"` in the error output). The runner
detects these signals and performs a fetch → rebase → retry sequence, up to
a fixed retry limit, before giving up.

### BLOCKED state and sandbox preservation

When an agent exits with a `question.json` file (entering the `BLOCKED`
state), the sandbox is **not** torn down. The preserved sandbox allows the
resumed execution to pick up the same working tree state, including any
in-progress file changes made before the agent asked its question.

**Known Risk: Resume skips sandbox.** Current implementation of
Resume executions (`SubmitResume`) skips sandbox setup entirely and runs
directly in `project_dir`. This is a significant behavioral divergence: if a 
resumed task makes further changes, they land directly in the canonical working 
copy, reintroducing the concurrent corruption and partial-work leak risks 
identified in the Context section. A future iteration should ensure resumed 
tasks pick up the preserved sandbox instead.

### Session ID propagation on resume

A subtle bug was found and fixed: when a resumed execution is itself blocked
again (a second BLOCKED→answer→resume cycle), the new execution record must
carry the **original** `ResumeSessionID`, not the new execution's own UUID.
If the wrong ID is used, `claude --resume` fails with "No conversation found".
The fix is in `ClaudeRunner.Run`: if `e.ResumeSessionID != ""`, use it as
`e.SessionID` rather than `e.ID`.

## Rationale

- **`--no-hardlinks`**: git defaults to hardlinking objects between clone and
  source when both are on the same filesystem. This causes permission errors
  when the source is owned by a different user (e.g. `www-data` vs. `root`).
  The flag forces a full copy.
- **Bare repo for push target**: non-bare repos reject pushes to checked-out
  branches. A bare repo (`git init --bare`) accepts all pushes safely.
- **Preserve sandbox on failure**: partial agent work may be valuable for
  debugging or resumption. Destroying it immediately on failure was considered
  and rejected.
- **Agent must commit**: requiring the agent to commit all changes before
  exiting ensures git history is always the source of truth. The enforcement
  check (uncommitted files → FAILED) makes this invariant observable.

## Alternatives Considered

- **Run directly in working copy**: rejected because of concurrent corruption
  and partial-work leakage.
- **Copy files instead of git clone**: rejected because the agent needs a
  working git history (for `git log`, `git blame`, and to push commits back).
- **Docker/container isolation**: considered for stronger isolation but
  rejected due to operational complexity, dependency on container runtime,
  and inability to use the host's claude/gemini credentials.

## Consequences

- Tasks without `project_dir` are unaffected; they run in whatever working
  directory the server process inherited.
- If a sandbox's push repeatedly fails (e.g. due to a bare repo that is
  itself broken), the task is failed with the sandbox preserved.
- **If `/tmp` runs out of space** (many large sandboxes), tasks will fail at
  clone time. This is a known operational risk. Mitigations such as periodic
  cleanup of old sandboxes (cron) or pre-clone disk space checks are required
  as follow-up items.
- **The `project_dir` field in task YAML** must point to a git repository with
  a configured `"local"` or `"origin"` remote that accepts pushes. If neither
  remote exists or the push is rejected for other reasons, the task will be
  marked as `FAILED` and the sandbox will be preserved for manual recovery.


## Relevant Code Locations

| Concern | File |
|---|---|
| Sandbox setup/teardown | `internal/executor/claude.go` |
| `setupSandbox`, `teardownSandbox` | `internal/executor/claude.go` |
| `sandboxCloneSource` | `internal/executor/claude.go` |
| Resume skips sandbox | `internal/executor/claude.go` (Run) |
| Session ID propagation fix | `internal/executor/claude.go` (Run) |
| Sandbox tests | `internal/executor/claude_test.go` |