GripProbe Reports

Saved Review Notes (2026-05-11)

Questions to Keep

Suggested Corrections for This Document

GripProbe-CLI: Technical System Description

1. Purpose and Positioning

GripProbe is a benchmark harness for CLI agents that work with local or remote LLM backends (primarily Ollama in the current configuration).
The system evaluates not only textual answers, but real side effects in the workspace (created/modified files, applied patch, executed shell command, HTTP fetch result, etc.).

The core comparison axis is:

This is important: the same model can behave very differently depending on the CLI agent (for example gptme, continue-cli, codex, opencode, aider) and execution mode (tool / markdown where supported).


2. High-Level Architecture

Main modules:

Specs are data-driven:


3. Data Model and Specs

Pydantic schemas are defined in gripprobe/models.py.

Key entities:

The system is strongly declarative: adding new tests/models/cli_agent combinations is usually a spec change first, not a code change.


4. Execution Pipeline (Single Run)

Command run and suite executor run-suite both call runner.run(...).

For each run:

  1. Load and resolve specs (model, cli_agent, backend, tests, formats).
  2. Create run directory:
    • results/runs/<RUN_ID>/cases/...
    • results/runs/<RUN_ID>/reports/...
  3. Resolve runtime metadata:
    • CLI agent executable/version
    • selected environment metadata
    • runtime probes (load/memory/GPU; Ollama /api/ps when applicable)
  4. For each case:
    • prepare warmup and measured workspaces,
    • run CLI agent adapter twice (warmup + measured),
    • validate measured workspace,
    • classify status/trajectory/invoked/match,
    • persist case.json and text artifacts.
  5. Build per-run summaries:
    • reports/summary.md
    • reports/summary.html
    • manifest.json

RUN_ID is UTC-based (YYYYMMDDTHHMMSSZ).


5. Warmup/Measured Design

Every case runs in two phases:

This allows:

The adapters store command strings and phase timestamps in metadata (warmup_command, measured_command, *_started_at, *_finished_at).


6. Adapter Layer

All CLI agents implement CliAgentAdapter (gripprobe/adapters/base.py; legacy alias ShellAdapter), with one contract: run_case(case, model_spec, test_spec) -> CaseResult.

Current adapters:

Common behavior:

CLI-agent-specific nuances are intentionally preserved, because compatibility differences are the benchmark target, not noise.


7. Validation System

Validation is file/effect based (validator_runner.py):

Each test can combine multiple validators; final pass requires all validators to pass.

Observed/expected text artifacts are written per case:

This keeps failures explainable and rebuildable.


8. Built-In Dynamic Test Fixtures (Web/Scenario)

runner.py can launch ephemeral local HTTP challenge servers per case:

Dynamic validator patching inserts measured nonce/query/path expectations into active TestSpec at runtime.
This prevents static prompt memorization and enforces real fetch behavior.


9. Status, Trajectory, and Failure Semantics

Case status (CaseResult.status) includes:

Additional dimensions:

Failure reason is inferred separately (failure_reason.py), e.g.:


10. Rebuild and Resume

Rebuild

rebuild-reports regenerates run reports from results/runs/<RUN_ID>.
With --recompute-case-json, case status and metadata are recalculated from artifacts/validators if needed.

This is required when raw cases exist but report files are missing or stale.

Resume

run-suite --resume-suite resumes at case key granularity:

Completed keys are loaded from existing results/runs/*/cases/*/case.json (or fallback manifest expansion).
Only missing cases are executed.


11. Reporting Layers

Run-level report (diagnostic)

Generated per run:

Used to investigate individual failures, timings, traces, and per-case detail pages.

Aggregate report (publication/comparison)

Generated by aggregate-reports:

The aggregate groups rows by:

It computes:

It also provides filters/sorting and links to per-case aggregate detail pages and source run summaries.


12. Privacy and Sanitization

Two layers have different policy:

Sanitization includes:

System-message stripping from transcripts is available and used by default for sensitive prompt minimization (strip_system_messages_from_transcripts).


13. Operational Notes and Limits


14. Practical Lifecycle

Typical workflow:

  1. validate specs.
  2. run (single point) or run-suite (matrix).
  3. inspect run-level summaries for debugging.
  4. optionally rebuild-reports for repaired/updated interpretation.
  5. aggregate-reports for cross-run comparison/publication.

This separation keeps debugging forensics and publication artifacts decoupled, while preserving reproducibility.