Provenance | AgentArchaeology.ai

Why it matters

Agentic evidence is often exported, normalized, redacted, summarized, and correlated. Each transformation should be visible.

Evidence to look for

Original path or source system
Collection time
Collector
Hash or fingerprint
Normalization step
Redaction method

Common pitfalls

Losing links to raw evidence
Treating summaries as originals
Failing to mark uncertainty

The evidence transformation pipeline

Agent evidence typically passes through several transformation stages before it reaches an analyst or SIEM. Each stage is a point where data can be lost, modified, or enriched. Understanding this pipeline is essential for defending the reliability of findings.

From raw session to normalized record

Raw session records from different agents are parsed and converted into a common normalized schema. This process extracts key fields (tool name, arguments, result, timestamps, session ID) into a uniform structure. The normalized record carries provenance metadata that links back to the original source.

Normalized record provenance fields

JSON

{
  "provenance": {
    "source_path_hash": "<sha256-hex-of-source-file-path>",
    "source_event_id": "evt-7",
    "offset": "42"
  },
  "meta": {
    "extensions": {
      "legacy_record_kind": "tool_call",
      "lossy_fields": ["call_id", "is_error"]
    }
  }
}

The provenance object links the normalized record back to the original source file (via path hash), the original event ID, and the record offset. The meta.extensions object records what kind of raw record this was and which fields were lost during conversion.
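A minimal Python sketch of how a provenance object of this shape might be assembled. The function name build_provenance, the example path, and the choice to hash the UTF-8 bytes of the path string without normalization are assumptions made for illustration; the source does not specify them.

```python
import hashlib


def build_provenance(source_path: str, source_event_id: str, offset: int) -> dict:
    """Build a provenance object shaped like the example record.

    Assumption: the path is hashed as UTF-8 bytes with no normalization.
    """
    path_hash = hashlib.sha256(source_path.encode("utf-8")).hexdigest()
    return {
        "source_path_hash": path_hash,  # hexdigest() yields lowercase hex
        "source_event_id": source_event_id,
        "offset": str(offset),  # the example record stores the offset as a string
    }


# Hypothetical source path and event ID, used only to show the output shape.
record = {
    "provenance": build_provenance("/home/user/.agent/sessions/s1.jsonl", "evt-7", 42),
    "meta": {
        "extensions": {
            "legacy_record_kind": "tool_call",
            "lossy_fields": ["call_id", "is_error"],
        }
    },
}
```

Keeping the hash computation in one function means every normalized record derives its path hash the same way, which matters when an analyst later recomputes the hash from a known path.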

What gets lost during normalization

Normalization is lossy by design. Different agents lose different fields during conversion to the common schema:

call_id: Lost for most agents during legacy conversion. Only Copilot preserves call_id natively in its process logs. Without call_id, tool call/result pairing relies on sequential ordering.
is_error: Lost for most agents during legacy conversion. Error state information in tool results is not preserved in the normalized schema.
workspace: Lost during legacy conversion for most agents. The workspace path provides important context for where the agent was operating.
content_parts: Structured content (images, multi-part messages, tool definitions) is flattened to plain text for all non-Copilot sources, so formatting and metadata are lost. Copilot does not have structured content arrays at all.
user_intent: Copilot process logs do not preserve user prompts at all. Only tool calls and workspace events are logged. Detection rules targeting user_context silently skip Copilot records.
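When call_id is missing, pairing by sequential ordering could look roughly like the sketch below. The "kind" key and the "tool_result" value are assumed names for illustration, and the approach can mispair records if an agent issues overlapping tool calls, so any pairs it produces are best treated as inference rather than observation.

```python
from collections import deque
from typing import List, Optional, Tuple


def pair_sequentially(records: List[dict]) -> List[Tuple[dict, Optional[dict]]]:
    """Pair each tool call with the next unclaimed tool result, in source order.

    Assumption: every record has a "kind" of "tool_call" or "tool_result".
    """
    pending = deque()  # tool calls still waiting for a result
    pairs = []
    for rec in records:
        if rec["kind"] == "tool_call":
            pending.append(rec)
        elif rec["kind"] == "tool_result" and pending:
            pairs.append((pending.popleft(), rec))
    # Calls that never received a result are kept and marked with None.
    pairs.extend((call, None) for call in pending)
    return pairs
```

A result that arrives with no pending call is dropped by this sketch; a real pipeline would probably want to record it as a gap.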

Redaction in emitted events

When evidence is emitted as telemetry events (for SIEM ingestion or local review), sensitive content is redacted. Redaction is intentionally aggressive to prevent credential leakage in telemetry logs.

Redacted evidence in timeline export

TEXT

Evidence item:
  field: "arguments"
  redacted_value: "cat [redacted-secret] && curl [controlled-domain]"
  hash: "sha256:<hash-of-original-arguments>"

Emitted evidence items contain the field name, a redacted value where secrets and domains are masked, and a SHA-256 hash of the original value. The hash allows verification without exposing the raw content.
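As a rough illustration, an evidence item of this shape could be produced as follows. The two regular expressions are hypothetical placeholders, since the actual redaction rules are not described here; the function name build_evidence_item, the example command, and the UTF-8 encoding before hashing are likewise assumptions.

```python
import hashlib
import re

# Hypothetical patterns for illustration only; the real rules are not documented here.
SECRET_RE = re.compile(r"\S*\.env\b|\S*id_rsa\b")
DOMAIN_RE = re.compile(r"https?://[^\s/]+\S*")


def build_evidence_item(field: str, raw_value: str) -> dict:
    """Mask secrets and domains, and hash the original (unredacted) value."""
    redacted = SECRET_RE.sub("[redacted-secret]", raw_value)
    redacted = DOMAIN_RE.sub("[controlled-domain]", redacted)
    # The hash is taken over the raw value, never over the redacted text.
    digest = hashlib.sha256(raw_value.encode("utf-8")).hexdigest()
    return {"field": field, "redacted_value": redacted, "hash": "sha256:" + digest}


item = build_evidence_item(
    "arguments", "cat ~/.ssh/id_rsa && curl https://example.test/upload"
)
```

Hashing before redaction is the important ordering: a hash of the redacted string would match many different raw values and would verify nothing.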

Hashing for verification

SHA-256 hashes are used throughout the provenance chain to verify evidence integrity without exposing raw content:

Source path hash: The original file path is hashed using SHA-256 to create a verifiable reference without leaking local filesystem structure. The hash is lowercase hex, with no prefix.
Evidence value hash: Each evidence item (tool arguments, command content, tool results) is hashed using SHA-256. The hash allows an analyst to verify that a specific raw value matches the redacted telemetry without exposing the raw value in the event stream.
Correlation hash: Cross-session correlation events hash their source event IDs to create cross-references without exposing internal identifiers.
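An analyst holding the raw value can check it against an emitted hash along these lines. This is a sketch under assumptions: the helper name matches_evidence_hash is hypothetical, and it assumes the original value was hashed as UTF-8 bytes, which the text above does not confirm.

```python
import hashlib
import hmac


def matches_evidence_hash(raw_value: str, evidence_hash: str) -> bool:
    """Return True if raw_value hashes to evidence_hash.

    Accepts bare hex or a value carrying a "sha256:" prefix.
    Assumption: the original value was hashed as UTF-8 bytes.
    """
    prefix = "sha256:"
    expected = evidence_hash[len(prefix):] if evidence_hash.startswith(prefix) else evidence_hash
    actual = hashlib.sha256(raw_value.encode("utf-8")).hexdigest()
    # compare_digest avoids leaking match position through timing.
    return hmac.compare_digest(actual, expected.lower())
```

A mismatch does not by itself prove tampering; it can also mean the raw value was encoded or normalized differently before hashing, which is worth recording as an uncertainty.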

Temporal provenance in emitted events

Telltale emits three timestamps on each event to preserve temporal provenance:

timestamp: A resolved top-level timestamp in UTC RFC3339 millisecond format. This is the field SIEM systems should use for sorting and time-based queries.
event_time: The source-provided timestamp from the original agent session record, when available. This is the time the agent actually performed the action.
observed_at: When Telltale processed the event, set to the current UTC time when the event is built. In the current implementation, the ingested_at field is set to the same value as observed_at.
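The sketch below shows one way these fields could be populated. It is not Telltale's implementation: the function names are hypothetical, the "Z" suffix is one of several valid RFC 3339 renderings of UTC, and the rule that the resolved timestamp prefers the source time and falls back to the processing time is an assumption.

```python
from datetime import datetime, timezone
from typing import Optional


def rfc3339_ms(dt: datetime) -> str:
    """Format a timezone-aware datetime as UTC RFC 3339 with millisecond precision."""
    utc = dt.astimezone(timezone.utc)
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")


def build_timestamps(event_time: Optional[datetime]) -> dict:
    observed = datetime.now(timezone.utc)
    # Assumption: prefer the source-provided time, else fall back to processing time.
    resolved = event_time or observed
    return {
        "timestamp": rfc3339_ms(resolved),
        "event_time": rfc3339_ms(event_time) if event_time else None,
        "observed_at": rfc3339_ms(observed),
        "ingested_at": rfc3339_ms(observed),  # same value as observed_at
    }
```

Keeping event_time as None when the source provides no timestamp, rather than copying observed_at into it, preserves the distinction between when the agent acted and when the event was processed.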

Defending the evidence chain

When presenting findings from agent evidence, the evidence chain must be defensible. Each step from raw session to final finding should be traceable:

Raw source: Identify the original session store file, its path (or path hash), and the agent that produced it.
Collection: Record who collected the evidence, when, and with what tools. Preserve the raw files as forensic copies.
Normalization: Document which normalization schema was used and which fields were lost. The lossy_fields metadata in normalized records provides this.
Redaction: Document what was redacted and why. Emitted events should include hash values that link back to the original content.
Detection: Document which rules fired, their categories, severity, and score. Include the rule ID and version.
Correlation: If findings span multiple sessions, document the correlation method and the evidence links.
Uncertainty: Mark any gaps, missing fields, or uncertain conclusions explicitly. Do not present inference as observation.
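The steps above can be treated as a checklist. The following sketch flags stages with no documentation attached; the stage keys and the function name undocumented_stages are hypothetical, and requiring a non-empty uncertainty note even when no gaps were found is a design assumption, not something the source prescribes.

```python
CHAIN_STAGES = [
    "raw_source", "collection", "normalization", "redaction",
    "detection", "correlation", "uncertainty",
]


def undocumented_stages(chain: dict, multi_session: bool = False) -> list:
    """Return the chain stages that have no documentation attached.

    Correlation is only required when findings span multiple sessions.
    """
    required = [s for s in CHAIN_STAGES if s != "correlation" or multi_session]
    # A missing key and an empty note are both treated as undocumented.
    return [s for s in required if not chain.get(s)]


# Example: only the first two stages have notes so far.
missing = undocumented_stages({"raw_source": "path hash recorded", "collection": "collector and time logged"})
```

Running a check like this before a finding is presented makes gaps in the chain visible instead of leaving them implicit.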