- Should trajectory / run_consistency be normalized for all adapters, or explicitly documented as gptme/rebuild-specific behavior?
- Are SHELL_ERROR and SKIPPED planned for active runtime use, or should they be marked as reserved statuses?
- trajectory and consistency analysis are currently computed in runtime mainly for gptme (and in rebuild --recompute-case-json), while other adapters usually keep default trajectory=clean.
- HARNESS_ERROR paths”.
- TIMEOUT may still have match_percent=100 when artifact is reached before timeout, but status remains TIMEOUT.
- SHELL_ERROR / SKIPPED as reserved unless they become actively emitted in runtime branches.
- gptme and in rebuild (unless --keep-system-messages).
- cases/* remain diagnostic.
- backfill-model-hashes and list-runs to operational lifecycle section.

GripProbe is a benchmark harness for CLI agents that work with local or remote LLM backends (primarily Ollama in the current configuration).
The system evaluates not only textual answers but also real side effects in the workspace (created or modified files, an applied patch, an executed shell command, an HTTP fetch result, etc.).
The core comparison axis is:
This is important: the same model can behave very differently depending on the CLI agent (for example gptme, continue-cli, codex, opencode, aider) and execution mode (tool / markdown where supported).
Main modules:
- gripprobe/cli.py: entrypoint and CLI commands.
- gripprobe/spec_loader.py: loading YAML specs.
- gripprobe/runner.py: main execution pipeline.
- gripprobe/adapters/*.py: CLI-agent-specific adapters.
- gripprobe/validator_runner.py + gripprobe/validators/*: result validation layer.
- gripprobe/reporters/*: per-run reports (summary.md, summary.html, case pages).
- gripprobe/rebuild.py: rebuild run reports from existing artifacts.
- gripprobe/aggregate.py: cross-run aggregate report builder.
- gripprobe/suite_runner.py: matrix/suite orchestration and resume logic.

Specs are data-driven:
- specs/models/*.yaml
- specs/cli_agents/*.yaml (with compatibility fallback to specs/shells/*.yaml)
- specs/tests/*.yaml
- specs/suites/*.yaml
- specs/hardware_profiles.yaml

Pydantic schemas are defined in gripprobe/models.py.
Key entities:
- TestSpec: prompt, validators, tags (sanity, non_sanity, multilingual, etc.), allowed tools, supported CLI agents/formats.
- ModelSpec / BackendSpec: model identity, backend mapping (model_id, cli_agent_model_id), supported formats, optional policy overrides.
- CliAgentSpec: executable, default args, default tools, supported formats, timeout.
- SuiteSpec: reusable run matrix (cli_agents/models/tests/formats) with optional explicit matrix overrides.
- CaseResult: normalized case outcome: status, trajectory, invocation signal, match %, timings, metadata.

The system is strongly declarative: adding a new test, model, or cli_agent combination is usually a spec change first, not a code change.
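As a rough illustration of the CaseResult shape described above, here is a minimal sketch. It uses standard-library dataclasses instead of Pydantic so it runs without dependencies. The class name CaseResultSketch and the fields duration_seconds and metadata are assumptions for illustration; the status, invoked, and trajectory values are the ones listed in this document.

```python
from dataclasses import dataclass, field
from typing import Any, Literal

# Allowed values as listed in this document.
Status = Literal[
    "PASS", "FAIL", "TIMEOUT", "NO_TOOL_CALL", "TOOL_UNSUPPORTED",
    "SHELL_ERROR", "HARNESS_ERROR", "SKIPPED",
]
Invoked = Literal["yes", "no", "maybe"]
Trajectory = Literal["clean", "recovered", "violated"]


@dataclass
class CaseResultSketch:
    """Hypothetical stand-in for CaseResult; not the real schema."""

    status: Status
    invoked: Invoked = "maybe"
    trajectory: Trajectory = "clean"
    match_percent: int = 0
    duration_seconds: float = 0.0  # assumed name for the timing field
    metadata: dict[str, Any] = field(default_factory=dict)


# A timed-out case can still carry a full match, as noted in this document.
result = CaseResultSketch(
    status="TIMEOUT",
    match_percent=100,
    metadata={"measured_command": "<agent command line>"},
)
```

Literal types only document the allowed values for type checkers; a real Pydantic model would also enforce them at load time.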
Command run and suite executor run-suite both call runner.run(...).
For each run:
- model, cli_agent, backend, tests, formats).
- results/runs/<RUN_ID>/cases/...
- results/runs/<RUN_ID>/reports/...
- /api/ps when applicable)
- case.json and text artifacts.
- reports/summary.md
- reports/summary.html
- manifest.json

RUN_ID is UTC-based (YYYYMMDDTHHMMSSZ).
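A run identifier in the YYYYMMDDTHHMMSSZ form can be produced as in the sketch below. The function name make_run_id is hypothetical; only the format comes from this document.

```python
from datetime import datetime, timezone


def make_run_id(now=None):
    """Return a UTC timestamp formatted as YYYYMMDDTHHMMSSZ."""
    moment = now if now is not None else datetime.now(timezone.utc)
    # Convert to UTC so the trailing "Z" is always truthful.
    return moment.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


fixed = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(make_run_id(fixed))  # 20240102T030405Z
```

Identifiers in this form sort lexicographically in chronological order, which is convenient when listing results/runs/.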
Every case runs in two phases:
This allows:
The adapters store command strings and phase timestamps in metadata (warmup_command, measured_command, *_started_at, *_finished_at).
All CLI agents implement CliAgentAdapter (gripprobe/adapters/base.py; legacy alias ShellAdapter), with one contract: run_case(case, model_spec, test_spec) -> CaseResult.
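The single-method contract could look roughly like the sketch below. Only the run_case(case, model_spec, test_spec) signature and the legacy alias idea come from this document; the class names CliAgentAdapterSketch, CaseResultStub, and EchoAdapter are hypothetical, and the real base class may differ.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class CaseResultStub:
    """Hypothetical minimal result type for this sketch."""

    status: str
    trajectory: str = "clean"


class CliAgentAdapterSketch(ABC):
    """One contract for every CLI agent: run a case, return a result."""

    @abstractmethod
    def run_case(self, case, model_spec, test_spec) -> CaseResultStub:
        ...


# A legacy alias can simply point at the same class.
ShellAdapterSketch = CliAgentAdapterSketch


class EchoAdapter(CliAgentAdapterSketch):
    """Toy adapter that always reports success; for illustration only."""

    def run_case(self, case, model_spec, test_spec) -> CaseResultStub:
        return CaseResultStub(status="PASS")


outcome = EchoAdapter().run_case(case=None, model_spec=None, test_spec=None)
```

Keeping the contract this narrow lets the runner treat every agent identically while each adapter keeps its own command-line quirks.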
Current adapters:
- GptmeAdapter
- ContinueCliAdapter
- CodexAdapter
- OpencodeAdapter
- AiderAdapter

Common behavior:
- HOME, XDG_*, TMPDIR),

CLI-agent-specific nuances are intentionally preserved, because compatibility differences are the benchmark target, not noise.
Validation is file/effect based (validator_runner.py):
- file_equals
- patch_applied
- web_nonce_proof
- web_search_result
- weekly_plan_task

Each test can combine multiple validators; final pass requires all validators to pass.
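The all-validators-must-pass rule can be sketched as follows. The validator name file_equals comes from the list above, but its signature here, the Validator type, and the case_passes helper are assumptions made for illustration.

```python
import tempfile
from pathlib import Path
from typing import Callable

# In this sketch a validator is any callable that inspects the workspace.
Validator = Callable[[Path], bool]


def file_equals(relative_path: str, expected: str) -> Validator:
    """Build a validator that checks a file's exact text content."""

    def check(workspace: Path) -> bool:
        target = workspace / relative_path
        return target.is_file() and target.read_text(encoding="utf-8") == expected

    return check


def case_passes(workspace: Path, validators: list[Validator]) -> bool:
    """A case passes only if every validator passes.

    Note: an empty validator list passes vacuously in this sketch.
    """
    return all(validator(workspace) for validator in validators)


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    (workspace / "a.txt").write_text("hello", encoding="utf-8")
    passed = case_passes(workspace, [file_equals("a.txt", "hello")])
    failed = case_passes(
        workspace,
        [file_equals("a.txt", "hello"), file_equals("b.txt", "world")],
    )
```

Here passed is True and failed is False, because the second validator in the failing list points at a file that was never created.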
Observed/expected text artifacts are written per case:
- expected.txt
- observed.txt

This keeps failures explainable and rebuildable.
runner.py can launch ephemeral local HTTP challenge servers per case:
Dynamic validator patching inserts the measured nonce/query/path expectations into the active TestSpec at runtime.
This prevents static prompt memorization and enforces real fetch behavior.
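The nonce idea can be illustrated with a small standard-library sketch: a throwaway local HTTP server returns a fresh random nonce, so only a client that really performs the fetch can report it. The function name start_challenge_server and the /challenge path are hypothetical, and the direct fetch below stands in for the CLI agent; the real challenge servers in runner.py may work differently.

```python
import secrets
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def start_challenge_server(nonce: str) -> HTTPServer:
    """Serve the nonce on an OS-assigned local port in a background thread."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = nonce.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the sketch quiet

    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


nonce = secrets.token_hex(8)  # fresh per case, so it cannot be memorized
server = start_challenge_server(nonce)
url = f"http://127.0.0.1:{server.server_address[1]}/challenge"

# Bypass any configured proxy so the request stays on localhost.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
try:
    with opener.open(url, timeout=5) as response:
        fetched = response.read().decode("utf-8")
finally:
    server.shutdown()
    server.server_close()

proof_ok = fetched == nonce
```

Because the nonce is generated only after the case starts, a validator that compares the observed value with it cannot be satisfied by an answer produced without a real fetch.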
Case status (CaseResult.status) includes:
- PASS, FAIL, TIMEOUT, NO_TOOL_CALL, TOOL_UNSUPPORTED,
- SHELL_ERROR, HARNESS_ERROR, SKIPPED.

Additional dimensions:
- invoked: yes | no | maybe
- trajectory: clean | recovered | violated
- match_percent: in practice currently binary (0 or 100) for most tests

Failure reason is inferred separately (failure_reason.py), e.g.:
- tool unsupported by backend
- answered without invoking tool

rebuild-reports regenerates run reports from results/runs/<RUN_ID>.
With --recompute-case-json, case status and metadata are recalculated from artifacts/validators if needed.
This is required when raw cases exist but report files are missing or stale.
run-suite --resume-suite resumes at case key granularity:
- (cli_agent, model, backend, format, test)

Completed keys are loaded from existing results/runs/*/cases/*/case.json (or fallback manifest expansion).
Only missing cases are executed.
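Resume at case-key granularity amounts to a set difference between planned and completed keys. The sketch below assumes the five-part key given above; CaseKey and missing_cases are hypothetical names, and loading completed keys from case.json files is left out.

```python
from typing import Iterable, NamedTuple


class CaseKey(NamedTuple):
    cli_agent: str
    model: str
    backend: str
    format: str
    test: str


def missing_cases(
    planned: Iterable[CaseKey], completed: Iterable[CaseKey]
) -> list[CaseKey]:
    """Return planned keys with no completed result, keeping plan order."""
    done = set(completed)
    return [key for key in planned if key not in done]


planned = [
    CaseKey("gptme", "model-a", "ollama", "tool", "file_write"),
    CaseKey("gptme", "model-a", "ollama", "markdown", "file_write"),
    CaseKey("aider", "model-a", "ollama", "tool", "file_write"),
]
completed = [planned[0]]

todo = missing_cases(planned, completed)
print([f"{key.cli_agent}/{key.format}" for key in todo])
# ['gptme/markdown', 'aider/tool']
```

The model and test names in the example are placeholders. Preserving plan order keeps resumed runs predictable when the suite matrix is large.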
Generated per run:
- results/runs/<RUN_ID>/reports/summary.html
- results/runs/<RUN_ID>/reports/summary.md

These are used to investigate individual failures, timings, traces, and per-case detail pages.
Generated by aggregate-reports:
- .../reports/summary.html
- .../reports/summary.md

The aggregate groups rows by:
It computes:
It also provides filters/sorting and links to per-case aggregate detail pages and source run summaries.
Two layers have different policy:
- results/runs/*: internal/raw diagnostics (not aggressively sanitized).

Sanitization includes:
- $HOME, $USER),

System-message stripping from transcripts is available and enabled by default to minimize exposure of sensitive prompts (strip_system_messages_from_transcripts).
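The sanitization rules are only partly listed here; the surviving fragment mentions $HOME and $USER. Assuming the intent is to replace the concrete home directory and user name with those placeholders, a minimal sketch could look like this. The function sanitize_text is hypothetical and is not the project's sanitizer.

```python
def sanitize_text(text: str, home: str, user: str) -> str:
    """Replace a concrete home path and user name with placeholders.

    Assumption: the home path is replaced first, because it usually
    contains the user name and should collapse to a single "$HOME".
    """
    if home:  # guard: replacing "" would insert text between all characters
        text = text.replace(home, "$HOME")
    if user:
        text = text.replace(user, "$USER")
    return text


raw = "/home/alice/work/out.txt owned by alice"
print(sanitize_text(raw, home="/home/alice", user="alice"))
# $HOME/work/out.txt owned by $USER
```

Plain substring replacement can over-match short user names inside unrelated words, so a production sanitizer would likely need stricter matching.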
- case.json + artifacts).

Typical workflow:
1. validate specs.
2. run (single point) or run-suite (matrix).
3. rebuild-reports for repaired/updated interpretation.
4. aggregate-reports for cross-run comparison/publication.

This separation keeps debugging forensics and publication artifacts decoupled, while preserving reproducibility.