Field Manual

Provenance

Provenance is the record of where a piece of evidence originated and of every transformation applied to it afterward. It lets investigators show why a finding should be trusted.

Why it matters

Agentic evidence is often exported, normalized, redacted, summarized, and correlated. Each transformation should be visible.

Evidence to look for

  • Original path or source system
  • Collection time
  • Collector
  • Hash or fingerprint
  • Normalization step
  • Redaction method

Common pitfalls

  • Losing links to raw evidence
  • Treating summaries as originals
  • Failing to mark uncertainty

The evidence transformation pipeline

Agent evidence typically passes through several transformation stages before it reaches an analyst or SIEM. Each stage is a point where data can be lost, modified, or enriched. Understanding this pipeline is essential for defending the reliability of findings.

From raw session to normalized record

Raw session records from different agents are parsed and converted into a common normalized schema. This process extracts key fields (tool name, arguments, result, timestamps, session ID) into a uniform structure. The normalized record carries provenance metadata that links back to the original source.

Normalized record provenance fields
JSON
{
  "provenance": {
    "source_path_hash": "sha256:",
    "source_event_id": "evt-7",
    "offset": "42"
  },
  "meta": {
    "extensions": {
      "legacy_record_kind": "tool_call",
      "lossy_fields": ["call_id", "is_error"]
    }
  }
}

The provenance object links the normalized record back to the original source file (via path hash), the original event ID, and the record offset. The meta.extensions object records what kind of raw record this was and which fields were lost during conversion.
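As an illustration of how an analyst might read these fields, here is a minimal Python sketch. It assumes the record shape shown above; the helper name `summarize_provenance` is hypothetical, and the hash value is left truncated exactly as in the example.

```python
import json

# Sample record mirroring the example above. The hash is truncated in the
# source and is deliberately left as-is here.
RECORD_JSON = """
{
  "provenance": {
    "source_path_hash": "sha256:",
    "source_event_id": "evt-7",
    "offset": "42"
  },
  "meta": {
    "extensions": {
      "legacy_record_kind": "tool_call",
      "lossy_fields": ["call_id", "is_error"]
    }
  }
}
"""


def summarize_provenance(record: dict) -> dict:
    """Collect the fields needed to trace a normalized record to its source.

    Hypothetical helper: the field names come from the example record,
    the function itself is an assumption for illustration.
    """
    prov = record.get("provenance", {})
    ext = record.get("meta", {}).get("extensions", {})
    return {
        "source_path_hash": prov.get("source_path_hash"),
        "source_event_id": prov.get("source_event_id"),
        # The example stores the offset as a string, so convert it here.
        "offset": int(prov["offset"]) if "offset" in prov else None,
        "record_kind": ext.get("legacy_record_kind"),
        "lossy_fields": list(ext.get("lossy_fields", [])),
    }


summary = summarize_provenance(json.loads(RECORD_JSON))
```

Surfacing `lossy_fields` next to the source reference keeps the known gaps of a record visible wherever the record is cited.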

What gets lost during normalization

Normalization is lossy by design. Which fields are lost during conversion to the common schema depends on the source agent:

  • call_id: Lost for most agents during legacy conversion. Only Copilot preserves call_id natively in its process logs. Without call_id, tool call/result pairing relies on sequential ordering.
  • is_error: Lost for most agents during legacy conversion. Error state information in tool results is not preserved in the normalized schema.
  • workspace: Lost during legacy conversion for most agents. The workspace path provides important context for where the agent was operating.
  • content_parts: Structured content (images, multi-part messages) is flattened to plain text for most agents. Formatting and metadata are lost.
  • user_intent: Copilot process logs do not preserve user intent at all. The user prompt is not visible in the process log format.
  • assistant_text: Some agent formats do not preserve the full assistant text alongside tool calls. The reasoning or explanation text may be lost.
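The fallback to sequential ordering when call_id is missing can be sketched as follows. This is a heuristic sketch under stated assumptions, not the normalizer's actual logic: the `kind` key and the `tool_result` value are hypothetical, and the function pairs each call with the next unclaimed result in stream order.

```python
from typing import Optional


def pair_calls_sequentially(records: list[dict]) -> list[tuple[dict, Optional[dict]]]:
    """Pair each tool call with the next unclaimed tool result, in order.

    Without call_id this is an inference: interleaved or parallel calls
    can be paired incorrectly, so results should be marked as inferred.
    """
    pairs: list[tuple[dict, Optional[dict]]] = []
    pending: list[dict] = []  # calls still waiting for a result (FIFO)
    for rec in records:
        kind = rec.get("kind")  # hypothetical field name
        if kind == "tool_call":
            pending.append(rec)
        elif kind == "tool_result" and pending:
            pairs.append((pending.pop(0), rec))
    # Calls that never received a result are kept, paired with None.
    pairs.extend((call, None) for call in pending)
    return pairs


records = [
    {"kind": "tool_call", "tool": "read_file"},
    {"kind": "tool_result", "result": "ok"},
    {"kind": "tool_call", "tool": "shell"},
]
pairs = pair_calls_sequentially(records)
```

Because the pairing is inferred rather than observed, any finding that depends on it should say so explicitly.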

Redaction in emitted events

When evidence is emitted as telemetry events (for SIEM ingestion or local review), sensitive content is redacted. Redaction is intentionally aggressive to prevent credential leakage in telemetry logs.

Redacted evidence in timeline export
TEXT
Evidence item:
  field: "arguments"
  redacted_value: "cat [redacted-secret] && curl [controlled-domain]"
  hash: "sha256:"

Emitted evidence items contain the field name, a redacted value where secrets and domains are masked, and a SHA-256 hash of the original value. The hash allows verification without exposing the raw content.
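A minimal sketch of how such an item could be assembled is shown below. The function name, the caller-supplied mask table, and the choice to hash the UTF-8 bytes of the raw value are all assumptions; the actual redaction rules and hash input encoding are not described here.

```python
import hashlib


def make_evidence_item(field: str, raw_value: str, masks: dict[str, str]) -> dict:
    """Build an evidence item: field name, masked value, hash of the raw value.

    `masks` maps sensitive substrings to placeholders. A real redactor would
    detect secrets itself; a caller-supplied table keeps the sketch simple.
    """
    redacted = raw_value
    for sensitive, placeholder in masks.items():
        redacted = redacted.replace(sensitive, placeholder)
    # Assumption: the hash covers the UTF-8 bytes of the unredacted value.
    digest = hashlib.sha256(raw_value.encode("utf-8")).hexdigest()
    return {"field": field, "redacted_value": redacted, "hash": f"sha256:{digest}"}


# Illustrative raw value; the path and domain are made up for the example.
item = make_evidence_item(
    "arguments",
    "cat /tmp/example.key && curl example.test",
    {"/tmp/example.key": "[redacted-secret]", "example.test": "[controlled-domain]"},
)
```

The hash is computed before masking, so it commits to the original content while only the masked form leaves the host.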

Hashing for verification

SHA-256 hashes are used throughout the provenance chain to verify evidence integrity without exposing raw content:

  • Source path hash: The original file path is hashed to create a verifiable reference without leaking local filesystem structure.
  • Evidence value hash: Each evidence item (tool arguments, command content, tool results) is hashed. The hash allows an analyst to verify that a specific raw value matches the redacted telemetry without exposing the raw value in the event stream.
  • Event ID hash: Correlation events hash their source event IDs to create cross-references without exposing internal identifiers.
  • Correlation hash: Cross-session correlation events use hashed event IDs to link related detections across sessions.
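The verification step these hashes enable can be sketched in a few lines of Python. The `sha256:` prefix follows the examples in this section; the helper names, the UTF-8 encoding of the input, and the sample path are assumptions for illustration.

```python
import hashlib
import hmac


def sha256_ref(value: str) -> str:
    """Return a 'sha256:<hex>' reference for a string (UTF-8 assumed)."""
    return "sha256:" + hashlib.sha256(value.encode("utf-8")).hexdigest()


def matches_hash(raw_value: str, recorded_hash: str) -> bool:
    """Check a candidate raw value against a hash recorded in telemetry."""
    candidate = sha256_ref(raw_value)
    # Constant-time comparison; not strictly required here, but a safe default.
    return hmac.compare_digest(candidate.encode("utf-8"), recorded_hash.encode("utf-8"))


# Hypothetical source path, used only to demonstrate the round trip.
source_path = "/home/analyst/sessions/session-01.jsonl"
path_ref = sha256_ref(source_path)
```

An analyst holding the raw file can recompute the reference and compare it with the one in the event stream, while the event stream itself never carries the path.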

Defending the evidence chain

When presenting findings from agent evidence, the evidence chain must be defensible. Each step from raw session to final finding should be traceable:

  • Raw source: Identify the original session store file, its path (or path hash), and the agent that produced it.
  • Collection: Record who collected the evidence, when, and with what tools. Preserve the raw files by making forensic copies.
  • Normalization: Document which normalization schema was used and which fields were lost. The lossy_fields metadata in normalized records provides this.
  • Redaction: Document what was redacted and why. Emitted events should include hash values that link back to the original content.
  • Detection: Document which rules fired, their categories, severity, and score. Include the rule ID and version.
  • Correlation: If findings span multiple sessions, document the correlation method and the evidence links.
  • Uncertainty: Mark any gaps, missing fields, or uncertain conclusions explicitly. Do not present inference as observation.
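The checklist above can be turned into a simple completeness check before a finding is presented. The sketch below is one possible shape, not a prescribed format: the `chain` and `session_ids` keys and the stage names are hypothetical, chosen to mirror the list.

```python
# Stages every finding is expected to document (names mirror the list above).
REQUIRED_STAGES = [
    "raw_source",
    "collection",
    "normalization",
    "redaction",
    "detection",
    "uncertainty",
]


def missing_stages(finding: dict) -> list[str]:
    """Return the evidence-chain stages that a finding leaves undocumented."""
    required = list(REQUIRED_STAGES)
    # Correlation only needs documenting when the finding spans sessions.
    if len(finding.get("session_ids", [])) > 1:
        required.append("correlation")
    chain = finding.get("chain", {})
    return [stage for stage in required if not chain.get(stage)]


# Illustrative finding with made-up notes.
finding = {
    "session_ids": ["s-1", "s-2"],
    "chain": {
        "raw_source": "session store file, referenced by path hash",
        "collection": "collected by analyst A with tool T",
        "normalization": "lossy_fields: call_id, is_error",
        "detection": "rule ID and version recorded",
    },
}
gaps = missing_stages(finding)
```

For the sample finding the check reports `redaction`, `uncertainty`, and `correlation` as undocumented, which are the gaps a reviewer would challenge first.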