Use this feature request before implementing proxy/wrapper telemetry for real tool-call observation. Scope: local workflow only (no PR requirements).
tool_telemetry_proxy_wrapperinvoked and trajectory are currently heuristic and adapter-specific; structured tool evidence is inconsistent across shells.tool_call_start/tool_call_result/model_response events from real runs and use them as evidence for a new event-based case evaluator.trajectory/invoked/match compatibility is not required; all reports will be regenerated under the new result contract.stdout/stderr/transcript/json).event_evaluator that derives verdict fields from validators, process result, and telemetry/proxy events.failure_reason and trajectory with event-based evaluation.runner (orchestration, proxy lifecycle, telemetry persistence)adapters (base URL/env wiring where needed)event_evaluator (new canonical verdict layer)reporters (case/aggregate telemetry display)rebuild (recompute from new artifacts; legacy fallback only if explicitly implemented)specs/CLI flags (feature toggles)docsA: structured protocol/proxy events.B: structured agent output events.C: wrapper parsing from unstructured stdout/stderr markers.D: prompt-level self-report (fallback only, optional, for future).telemetry_proxy_mode: off|auto|force
always extract wrapper events after warmup/measured phasesif telemetry_proxy_mode != off and backend/protocol supports routing then run phase-scoped proxy captureproxy lifecycle is per phase: start/stop once for warmup, then start/stop once for measureddisabled, unsupported_shell, unsupported_backend, capture_missing, wrapper_parse_error, budget_exceededproxy_bind_error, proxy_connect_error if proxy mode is auto|forceD must never produce confirmed_tool_use for full PASS.D may only explain/debug inconclusive cases; validators pass + tier D only evidence maxes at PASS_WITH_POLICY_VIOLATION.warmup then measured.trace_analysis.status: derived by event_evaluator from validator result, process result, and telemetry events.trajectory: derived from event timeline (clean|recovered|violated based on failed tool results, retries, timeout, and recovery).invoked: derived from event evidence, not from shell-specific text heuristics.match_percent: removed from the new result contract; reports use artifact_match and overall_score.failure_reason: derived from event taxonomy, for example no_tool_call_observed, tool_call_without_result, tool_result_error, tool_retry_loop, artifact_mismatch_after_success, protocol_malformed, backend_tool_unsupported.PASS: validators pass and tool evidence is confirmedPASS_WITH_POLICY_VIOLATION: validators pass, but required tool evidence is missing, inconclusive, or not observable.FAIL: validators fail after observed execution.NO_TOOL_CALL: validators fail and observable telemetry shows no tool event.TIMEOUT: measured process timed out.TOOL_UNSUPPORTED: shell/backend cannot execute the required tool capability.HARNESS_ERROR: GripProbe infrastructure failed.SHELL_ERROR: CLI agent process failed before a reliable measured result could be evaluated.
PASS counts as pass.PASS_WITH_POLICY_VIOLATION is not a full pass and contributes 0.8 to overall_score, but 0.0 to strict_pass_score.strict_pass_score.event_evaluator reads validator output, process output, and normalized events, then writes the canonical verdict fields.--telemetry-proxy off|auto|forceevent_type: tool_call_start | tool_call_result | model_responserun_id, case_id, phase, event_id, trace_idtool_call_id, response_id (where applicable)source_tier: A|B|C|Dpayload (redacted fields only)timestamp, sequencesource: wrapper|proxy|agent_outputstatus: started|success|error|timeout|unknownlatency_ms, exit_code, error_type where applicableraw_artifact_refredaction_status: raw_internal|redacted|summary_onlycases/<id>/artifacts/events.warmup.jsonlcases/<id>/artifacts/events.measured.jsonlcases/<id>/artifacts/events.summary.jsoncases/<id>/artifacts/proxy.warmup.http.jsonl (if proxy enabled)cases/<id>/artifacts/proxy.measured.http.jsonl (if proxy enabled)proxy.warmup.http.jsonl, proxy.measured.http.jsonl):
x_gripprobe_.x_gripprobe_timestampx_gripprobe_methodx_gripprobe_pathx_gripprobe_queryx_gripprobe_upstream_urlx_gripprobe_duration_msx_gripprobe_requestx_gripprobe_responsex_gripprobe_tool_call_countx_gripprobe_tool_call_nonstructured_countx_gripprobe_tool_namesx_gripprobe_tool_names_nonstructuredx_gripprobe_tool_call_ids_nonstructuredx_gripprobe_tool_result_countx_gripprobe_proxy_errorevent_capture_status: collected|partial|missing|wrapper_parse_errorskipped; unsupported parser capability must produce tool_event_not_observable, not skipped capture.telemetry_proxy_status: collected|skipped|errortelemetry_capture_skip_reason: str | nulltelemetry_proxy_skip_reason: str | nulltelemetry_source_tier: A|B|C|D|nonetelemetry_event_count: inttelemetry_tool_call_count: inttelemetry_proxy_tool_call_nonstructured_count: inttelemetry_tool_result_count: inttelemetry_invoked_confidence: floattelemetry_retry_loop_detected: booltool_event_verdict: confirmed_tool_use|no_tool_event_observed|tool_event_not_observable|tool_event_inconclusivetool_event_not_observable: shell/backend may execute tools, but GripProbe cannot observe structured tool evidence.tool_event_verdict_reason: none|parser_not_capable_for_shell|capture_missing|proxy_error|structured_event_absent|wrapper_parse_error|source_parse_inconclusivecapture_missing: telemetry artifacts were not produced or are unreadable.artifact_missing: expected case output artifact for validator/evaluator is missing.capture_missing means evaluator cannot trust telemetry; if required wrapper artifacts are missing, status is HARNESS_ERROR.strict_pass_score: float where only PASS contributes 1.0.overall_score: float where PASS_WITH_POLICY_VIOLATION contributes 0.8.strict_pass_score; overall_score is secondary.nonzero_exit is treated as SHELL_ERROR even when validators pass, because the CLI agent process did not complete cleanly.Statuses matrix: | validators | process | tool_event_verdict | reason | status | |—|—|—|—|—| | any | timeout | any | any | TIMEOUT | | any | shell_error | any | process_error | SHELL_ERROR | | any | harness_capture_failed | any | capture_missing | HARNESS_ERROR | | pass | nonzero_exit | confirmed_tool_use | validators_pass_after_nonzero | SHELL_ERROR | | pass | ok | confirmed_tool_use | any | PASS | | pass | ok | no_tool_event_observed | structured_event_absent | PASS_WITH_POLICY_VIOLATION | | pass | ok | tool_event_not_observable | parser_not_capable_for_shell | PASS_WITH_POLICY_VIOLATION | | pass | ok | tool_event_inconclusive | proxy_error | PASS_WITH_POLICY_VIOLATION | | pass | ok | tool_event_inconclusive | source_parse_inconclusive | PASS_WITH_POLICY_VIOLATION | | fail | ok | confirmed_tool_use | any | FAIL | | fail | ok | no_tool_event_observed | structured_event_absent | NO_TOOL_CALL | | fail | ok | tool_event_not_observable | parser_not_capable_for_shell | FAIL | | fail | ok | tool_event_inconclusive | source_parse_inconclusive/proxy_error | FAIL | | fail | ok | tool_event_inconclusive | wrapper_parse_error/capture_missing | HARNESS_ERROR |
verdict_source: event_evaluatorartifact_match: floattool_invocation_match: float where 1.0 means required tool evidence confirmed, 0.0 means missing/not observable/inconclusive.protocol_match: float | nullstrict_pass_score: floatoverall_score: floatevaluator_reason_code: str
artifact_missing: expected validator/evaluator artifact is missing; this is not a tool-event verdict reason.evaluator_reason_text: strproxy_error: optional evidence source failed; never overrides confirmed lower-tier evidence.wrapper_parse_error: mandatory wrapper extraction failed; status is HARNESS_ERROR.source_parse_inconclusive: artifact was readable, but shell output is not structured enough; max status is PASS_WITH_POLICY_VIOLATION when validators pass.validators_pass + no_tool_event_observed produces PASS_WITH_POLICY_VIOLATION.validators_fail + no_tool_event_observed produces NO_TOOL_CALL.proxy_error only causes tool_event_inconclusive when no lower-tier source can confirm required tool use.HARNESS_ERROR if artifacts are missing/corrupt, not agent policy violation.timeout > shell_error > harness_capture_failed > nonzero_exit > validator/tool verdict.nonzero_exit with readable artifacts and passing validators is SHELL_ERROR, not PASS_WITH_POLICY_VIOLATION.capture_missing: required artifacts are absent or unreadable.no_tool_event_observed: telemetry was captured with sufficient observability, but no tool event was found.parser_not_capable_for_shell: shell/output mode cannot provide reliable structured tool evidence. Evaluator may use validators/process result for artifact correctness, but final status cannot be PASS; if validators pass, maximum status is PASS_WITH_POLICY_VIOLATION.validators_pass + tool_event_not_observable(reason=parser_not_capable_for_shell) produces PASS_WITH_POLICY_VIOLATION.Evaluator must not treat parser_not_capable_for_shell the same as no_tool_event_observed.
Why selected:
proxy.warmup.http.jsonl and proxy.measured.http.jsonl are internal-only under results/runs.CaseResult.status enum supports PASS_WITH_POLICY_VIOLATION and SHELL_ERROR.event_evaluator output after validators and telemetry extraction.PASS_WITH_POLICY_VIOLATION separately from PASS.event_evaluator that derives canonical status, trajectory, invoked, scoring fields, and failure_reason.auto mode and failure-safe; proxy errors are represented as evaluator evidence, not harness crashes.event_evaluator.run, rebuild-reports, aggregate-reports) work with the new result contract.validators_pass + tool_event_not_observable(reason=parser_not_capable_for_shell) produces PASS_WITH_POLICY_VIOLATION.validators_pass + no_tool_event_observed produces PASS_WITH_POLICY_VIOLATION.validators_fail + no_tool_event_observed produces NO_TOOL_CALL.validators_pass + confirmed_tool_use produces PASS.validators_pass + tool_event_inconclusive produces PASS_WITH_POLICY_VIOLATION.proxy_error does not downgrade PASS when wrapper/agent-output telemetry confirms tool use.PASS_WITH_POLICY_VIOLATION is rendered separately and is not counted as full PASS in aggregate scoring.shell_error produces SHELL_ERROR.nonzero_exit + validators_pass + confirmed_tool_use produces SHELL_ERROR.harness_capture_failed + capture_missing produces HARNESS_ERROR.strict_pass_score.PASS_WITH_POLICY_VIOLATION contributes 0.8 to overall_score and 0.0 to strict_pass_score.wrapper_parse_error produces HARNESS_ERROR.artifacts/events.warmup.jsonlartifacts/events.measured.jsonlartifacts/proxy.warmup.http.jsonlartifacts/proxy.measured.http.jsonlproxy.warmup.http.jsonl and proxy.measured.http.jsonl use prefixed proxy keys only (x_gripprobe_*), including tool counters (x_gripprobe_tool_call_count, x_gripprobe_tool_result_count).event_evaluatorwrapper|proxy with tier A|B|C|Devents.*.jsonl, events.summary.json, proxy.warmup.http.jsonl, proxy.measured.http.jsonl.events.warmup.jsonl, events.measured.jsonl, proxy.warmup.http.jsonl, proxy.measured.http.jsonl) when presentauto2026-05-14: in current MVP implementation, --telemetry-proxy force is treated as a strict requirement.status=HARNESS_ERRORinvoked=nomatch_percent=0failure_reason=proxy_required_but_not_availabletelemetry_proxy_status=errortelemetry_proxy_skip_reason=unsupported_backend (until active proxy capture is implemented)tool_event_verdict=tool_event_inconclusivetool_event_verdict_reason=proxy_error--telemetry-proxy auto remains non-fatal (skipped when proxy routing is unsupported).