GripProbe is a compatibility benchmark for local CLI AI agents. It checks whether LLM-powered agents actually execute commands, modify files, and produce verifiable artifacts in real CLI environments.
GripProbe measures observable side effects in real CLI agent environments.
The first published report covers 23 supported models and 4 CLI agents.
The goal is not to measure general intelligence or produce a model-only leaderboard. GripProbe evaluates tool-use reliability in real local CLI setups and verifies whether local LLMs actually use tools, modify files, and execute commands.
Results are environment-specific and may vary by runtime, quantization, prompt formatting, hardware, seed, and model version.
Typical tool-calling benchmarks evaluate parsed tool calls in synthetic, API-like settings. GripProbe instead distinguishes textual success (the agent says the task is done) from executed success (the side effect is observable in the environment). It is aimed at CLI agent/model/backend compatibility rather than model-only ranking.
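The difference between textual success and executed success can be illustrated with a minimal Python sketch. This is not GripProbe's actual implementation: the `Outcome` type, the `classify` helper, the keyword list, and the `hello.txt` task are all hypothetical and serve only to show the idea of checking an artifact on disk instead of trusting the agent's reply.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Outcome:
    textual_success: bool   # the agent's reply claims the task was done
    executed_success: bool  # the expected artifact actually exists on disk


def classify(reply: str, artifact: Path, expected: str) -> Outcome:
    """Compare what the agent said with what it actually did (hypothetical check)."""
    # Naive keyword match; a real harness would likely be stricter.
    claims_done = any(word in reply.lower() for word in ("created", "done", "written"))
    # Executed success requires the file to exist with the expected content.
    really_done = artifact.is_file() and artifact.read_text().strip() == expected
    return Outcome(claims_done, really_done)


def demo() -> tuple[Outcome, Outcome]:
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "hello.txt"
        # Case 1: the agent only talks about the file; nothing was written.
        talk_only = classify("I have created hello.txt.", target, "hello")
        # Case 2: the file really exists with the expected content.
        target.write_text("hello\n")
        executed = classify("Done.", target, "hello")
    return talk_only, executed


talk_only, executed = demo()
```

In this sketch the first case counts as a textual success but an executed failure, which is the gap a side-effect check is meant to expose.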
These initial results suggest that CLI agent choice is a first-order variable. The same model can perform strongly in one CLI agent and fail badly in another, which supports GripProbe’s focus on CLI agent × model × backend compatibility rather than model-only ranking.
In the current published runs, continue-cli is the strongest CLI agent.
It delivers the highest pass rates across multiple local 9B–12B class models and is currently the most consistent CLI agent in the published report.
Other CLI agents show lower pass rates in the same published environment, which is why the report should be read as compatibility evidence rather than a universal ranking.
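Reading results as CLI agent × model × backend compatibility can be sketched as a pass rate computed per combination rather than per model. The snippet below is an assumption about how such an aggregation might look, not GripProbe's reporting code; the `pass_rates` function and the `agent-a`, `model-x`, and `backend-1` names are placeholders, and the run records are made-up illustration data, not published results.

```python
from collections import defaultdict


def pass_rates(runs):
    """Return the pass rate for each (agent, model, backend) combination."""
    totals = defaultdict(lambda: [0, 0])  # key -> [passed, total]
    for agent, model, backend, passed in runs:
        cell = totals[(agent, model, backend)]
        cell[0] += int(passed)
        cell[1] += 1
    return {key: passed / total for key, (passed, total) in totals.items()}


# Placeholder records: the same model under two different CLI agents.
runs = [
    ("agent-a", "model-x", "backend-1", True),
    ("agent-a", "model-x", "backend-1", True),
    ("agent-b", "model-x", "backend-1", False),
    ("agent-b", "model-x", "backend-1", True),
]

rates = pass_rates(runs)
```

Keying on the full triple keeps one model's results under different agents separate, so a strong cell for one agent is not averaged together with a weak cell for another.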
Within continue-cli, the top result comes from local/aravhawk/qwen3.5-opus-4.6-text:9b, followed by the official local/qwen3.5:9b.
This suggests that fine-tuning or distillation can materially affect CLI execution reliability, not just general chat quality.
Within codex, local/aravhawk/qwen3.5-opus-4.6-text:9b shows stronger results when the patch-apply workflow is available.
For public recommendations, the official local/qwen3.5:9b with continue-cli is the safer reference point.
Its results are close to the top of the current report, while its provenance and licensing surface are clearer than third-party distilled variants.
These observations are preliminary. GripProbe results are environment-specific and may change with runtime, model version, backend, quantization, prompt formatting, hardware, CLI agent version, and future benchmark coverage.
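Because results depend on so many environment variables, it can help to store them next to every result. The following Python sketch shows one possible record; the `RunEnvironment` class, its field names, and the example values are hypothetical and do not describe GripProbe's actual result schema.

```python
import json
import platform
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunEnvironment:
    """Hypothetical record of the factors that can change a result."""
    agent: str
    agent_version: str
    model: str
    backend: str
    quantization: str
    seed: int
    # Captured from the host when the module is loaded.
    hardware: str = platform.machine()
    python_runtime: str = platform.python_version()


def to_json(env: RunEnvironment) -> str:
    """Serialize with sorted keys so two environments can be diffed as text."""
    return json.dumps(asdict(env), sort_keys=True)


# Placeholder values for illustration only.
env = RunEnvironment(
    agent="example-agent",
    agent_version="0.0.0",
    model="example-model:9b",
    backend="example-backend",
    quantization="q4",
    seed=0,
)
record = json.loads(to_json(env))
```

Two runs whose serialized environments differ should not be compared as if they measured the same setup.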
Privacy and publication policy:
- results/runs/...
- results/aggregate/...
- results/aggregate/... only
- results/runs/... as internal diagnostic data

If you would like to add a new test, refine an existing one, support another shell or backend, or suggest improvements to the benchmark, reporting, or documentation, contributions are welcome. Feel free to open an issue or submit a pull request.