Sources & Methodology

DisclosureFeed is a real-time, machine-readable intelligence feed of formally filed cybersecurity and data-breach disclosures, drawn from public regulatory sources worldwide.

Source list

We ingest only primary regulatory sources, plus press reports that are explicitly declared as supplementary. Current Tier-1 sources (V1):

US SEC EDGAR — 8-K Item 1.05, 8-K Item 8.01, 10-K Item 1C, 10-Q cyber references
US California Attorney General — breach notification portal
US Maine Attorney General — breach notice archive (legacy + modern portals)
US HHS Office for Civil Rights — HIPAA breach report ("Wall of Shame")

Extraction pipeline

Each source document passes through:

Sanitization — Unicode normalization, dangerous-HTML stripping, bidi override removal, prompt-injection guard.
Triage — Claude Haiku 4.5 classifier decides if the document is a breach disclosure and selects an extraction template.
Structured extraction — Claude Sonnet 4.6 with Instructor + Pydantic schema validation produces a BreachDisclosure v1 object with per-field source-span citations and per-field confidence scores.
Hard-pass review — Claude Opus 4.7 with extended thinking re-extracts any document where Pass-2 overall confidence is < 0.80 or any single field is < 0.70.
Entity resolution — name + LEI/CIK lookup via the GLEIF and SEC EDGAR registries.
Cross-jurisdiction dedup — same incident filed across multiple jurisdictions (SEC + state AG + OCR) is grouped under a canonical incident id.
PII redaction — natural-person names appearing in incident narratives are replaced with type-tags before customer-visible fields are emitted.
Human review queue — any record below the confidence threshold is reviewed by a DisclosureFeed operator.
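
Two of the steps above are concrete enough to sketch in code: the Unicode and bidi part of sanitization, and the rule that routes a document to hard-pass review. The sketch below is illustrative only and is not the production pipeline. It assumes NFKC as the normalization form, assumes that the full set of bidirectional control characters is stripped (the text only says "bidi override removal"), and assumes that "Pass-2" refers to the structured-extraction step. The function names are hypothetical.

```python
import unicodedata

# Bidirectional control characters: embeddings and overrides (U+202A-U+202E)
# plus isolates (U+2066-U+2069). Stripping the whole set is an assumption.
BIDI_CONTROLS = {chr(cp) for cp in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))}

OVERALL_THRESHOLD = 0.80  # overall-confidence cutoff stated in the hard-pass rule
FIELD_THRESHOLD = 0.70    # single-field cutoff stated in the hard-pass rule


def strip_bidi_and_normalize(text: str) -> str:
    """Normalize to NFKC (assumed form) and drop bidi control characters."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in normalized if ch not in BIDI_CONTROLS)


def needs_hard_pass(overall_confidence: float, field_confidence: dict[str, float]) -> bool:
    """Return True when the document should be re-extracted in the hard pass."""
    if overall_confidence < OVERALL_THRESHOLD:
        return True
    return any(score < FIELD_THRESHOLD for score in field_confidence.values())
```

Both comparisons are strict, matching the "< 0.80" and "< 0.70" wording: a document sitting exactly on a threshold is not re-extracted.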

AI-assisted output disclosure (EU AI Act Art. 50)

Every record carries ai_assisted: true, and every API response envelope carries meta.ai_assisted: true. Every dashboard view shows the AI-assistance notice inline and in the footer. The transparency obligations of EU AI Act Article 50 apply from August 2, 2026.
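
A consumer who wants to confirm the flags before storing a response could check them along these lines. This is a minimal sketch: only ai_assisted and meta.ai_assisted come from the text above, while the "data" key holding the record list and the function name are hypothetical.

```python
from typing import Any


def has_ai_disclosure(envelope: dict[str, Any]) -> bool:
    """Check that the envelope and every record in it carry ai_assisted: true.

    "data" as the key holding the record list is an assumed name.
    """
    if envelope.get("meta", {}).get("ai_assisted") is not True:
        return False
    return all(record.get("ai_assisted") is True for record in envelope.get("data", []))
```

The check uses "is True" so that a truthy placeholder such as the string "true" does not pass.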

Accuracy SLOs

Victim entity resolution: ≥ 99% accuracy
Dates: ≥ 98% accuracy (±1 day)
Numeric counts: ≥ 95% accuracy
Attack vector classification: ≥ 90% accuracy (the vector is often not disclosed in the source filing)
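
As an illustration of what a tolerance such as ±1 day means in practice, the sketch below scores extracted dates against a labeled reference set. How DisclosureFeed actually measures these SLOs is not stated here, so the reference set, the scoring rule, and the function name are all assumptions.

```python
from datetime import date, timedelta


def date_accuracy(pairs: list[tuple[date, date]], tolerance_days: int = 1) -> float:
    """Fraction of (extracted, reference) date pairs within the tolerance."""
    if not pairs:
        raise ValueError("need at least one labeled pair")
    tolerance = timedelta(days=tolerance_days)
    hits = sum(1 for extracted, reference in pairs if abs(extracted - reference) <= tolerance)
    return hits / len(pairs)
```

Under this reading, an extracted date that is one day off still counts as correct, and the 98% target applies to the resulting fraction.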

Corrections

Email corrections@disclosurefeed.com. The SLA is 48 hours from receipt.

Provenance

Every record carries:

source.url — the originating regulator-published URL
source.filed_at — the regulator-recorded filing timestamp
source.raw_hash — SHA-256 of the source body at ingestion time
Extraction metadata, served by GET /v1/disclosures/{id}/extraction: model, provider, extracted_at, system_prompt_version, token counts, cost_usd
overall_confidence, on the same response (per-field confidence is computed at extraction time but not yet persisted; persisting it is on the roadmap)
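
The source.raw_hash field lets a consumer check whether the regulator's page still matches what was ingested. A minimal sketch, assuming the hash is a hex-encoded SHA-256 digest of the exact response bytes (the encoding and the precise bytes hashed are not specified here, and the function name is hypothetical):

```python
import hashlib


def matches_raw_hash(source_body: bytes, raw_hash: str) -> bool:
    """Compare SHA-256 of a fetched body with a record's source.raw_hash.

    Assumes raw_hash is a hex digest; case and surrounding whitespace
    are ignored.
    """
    digest = hashlib.sha256(source_body).hexdigest()
    return digest == raw_hash.strip().lower()
```

Because the hash is taken at ingestion time, a mismatch does not by itself indicate an extraction error. It can simply mean the regulator has edited or re-rendered the page since then.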