GripProbe Reports

Feature Request Spec: Event-Based Tool Evaluation via Proxy + Wrapper Telemetry

Use this feature request before implementing proxy/wrapper telemetry for real tool-call observation. Scope: local workflow only (no PR requirements).

1. Change Summary

2. Problem and Goal

3. Scope

4. Definitions and Trigger Rules

5. Current Behavior (As-Is)

6. Proposed Behavior (To-Be)

7. Data and Contracts

Statuses matrix:

| validators | process | tool_event_verdict | reason | status |
|---|---|---|---|---|
| any | timeout | any | any | TIMEOUT |
| any | shell_error | any | process_error | SHELL_ERROR |
| any | harness_capture_failed | any | capture_missing | HARNESS_ERROR |
| pass | nonzero_exit | confirmed_tool_use | validators_pass_after_nonzero | SHELL_ERROR |
| pass | ok | confirmed_tool_use | any | PASS |
| pass | ok | no_tool_event_observed | structured_event_absent | PASS_WITH_POLICY_VIOLATION |
| pass | ok | tool_event_not_observable | parser_not_capable_for_shell | PASS_WITH_POLICY_VIOLATION |
| pass | ok | tool_event_inconclusive | proxy_error | PASS_WITH_POLICY_VIOLATION |
| pass | ok | tool_event_inconclusive | source_parse_inconclusive | PASS_WITH_POLICY_VIOLATION |
| fail | ok | confirmed_tool_use | any | FAIL |
| fail | ok | no_tool_event_observed | structured_event_absent | NO_TOOL_CALL |
| fail | ok | tool_event_not_observable | parser_not_capable_for_shell | FAIL |
| fail | ok | tool_event_inconclusive | source_parse_inconclusive/proxy_error | FAIL |
| fail | ok | tool_event_inconclusive | wrapper_parse_error/capture_missing | HARNESS_ERROR |

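The evaluator must not treat parser_not_capable_for_shell the same as no_tool_event_observed: when validators fail, the former yields FAIL, while the latter (with reason structured_event_absent) yields NO_TOOL_CALL.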

8. Design Options and Tradeoffs

Option A: Wrapper parsing only

Option B: Protocol proxy only

Option C: Hybrid (wrapper baseline + optional proxy hardening)

Selected option

Why selected:

9. Privacy and Safety

10. Implementation Plan

  1. Extend the CaseResult.status enum, or verify that it already supports PASS_WITH_POLICY_VIOLATION and SHELL_ERROR.
  2. Replace adapter-local final classification with event_evaluator output after validators and telemetry extraction.
  3. Update aggregate cell label/class/score logic to render PASS_WITH_POLICY_VIOLATION separately from PASS.
  4. Add mandatory wrapper event extraction and event schema.
  5. Add optional per-phase proxy lifecycle in runner and adapter routing hooks.
  6. Add event_evaluator that derives canonical status, trajectory, invoked, scoring fields, and failure_reason.
  7. Add metadata/artifact persistence and report rendering.
  8. Add tests for event extraction, evaluator verdicts, proxy skip/error paths, and new taxonomy.
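Steps 1 and 3 of the plan above could look roughly like the following sketch. It assumes the status is modelled as a string-valued enum and that report cells are rendered from a label plus a CSS class. `Status`, `cell_style`, and the label and class strings are hypothetical and do not reflect the existing CaseResult or report code. Scoring is left out because the plan does not specify it.

```python
from enum import Enum
from typing import Tuple


class Status(str, Enum):
    # Values taken from the statuses matrix; the real enum may differ.
    PASS = "PASS"
    PASS_WITH_POLICY_VIOLATION = "PASS_WITH_POLICY_VIOLATION"
    FAIL = "FAIL"
    NO_TOOL_CALL = "NO_TOOL_CALL"
    TIMEOUT = "TIMEOUT"
    SHELL_ERROR = "SHELL_ERROR"
    HARNESS_ERROR = "HARNESS_ERROR"


def cell_style(status: Status) -> Tuple[str, str]:
    """Return a (label, css_class) pair for an aggregate report cell."""
    css_class = "cell-" + status.value.lower().replace("_", "-")
    if status is Status.PASS_WITH_POLICY_VIOLATION:
        # Rendered separately from PASS so policy violations stay visible.
        return "PASS (policy violation)", css_class
    return status.value, css_class
```

Deriving the CSS class from the enum value means a newly added status gets its own class instead of falling back to the class used for PASS.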

11. Acceptance Criteria (Must Be Testable)

12. Test Plan

13. Observability and Reporting Impact

14. Rollout and Rollback

15. Local Review Checklist (No PR)

16. Implementation Notes