Sources & Methodology
DisclosureFeed is a real-time, machine-readable intelligence feed of formally-filed cybersecurity / data-breach disclosures from public regulatory sources worldwide.
Source list
We ingest only primary regulatory sources plus declared press supplementation. Current Tier-1 sources (V1):
- US SEC EDGAR — 8-K Item 1.05, 8-K Item 8.01, 10-K Item 1C, 10-Q cyber references
- US California Attorney General — breach notification portal
- US Maine Attorney General — breach notice archive (legacy + modern portals)
- US HHS Office for Civil Rights — HIPAA breach report ("Wall of Shame")
Extraction pipeline
Each source document passes through:
- Sanitization — Unicode normalization, dangerous-HTML stripping, bidi override removal, prompt-injection guard.
- Triage — Claude Haiku 4.5 classifier decides if the document is a breach disclosure and selects an extraction template.
- Structured extraction — Claude Sonnet 4.6 with Instructor + Pydantic schema validation produces a
BreachDisclosure v1object with per-field source-span citations and per-field confidence scores. - Hard-pass review — Claude Opus 4.7 with extended thinking re-extracts any document where Pass-2 overall confidence is < 0.80 or any single field is < 0.70.
- Entity resolution — name + LEI/CIK lookup via the GLEIF and SEC EDGAR registries.
- Cross-jurisdiction dedup — same incident filed across multiple jurisdictions (SEC + state AG + OCR) is grouped under a canonical incident id.
- PII redaction — natural-person names appearing in incident narratives are replaced with type-tags before customer-visible fields are emitted.
- Human review queue — any record below the confidence threshold is reviewed by a DisclosureFeed operator.
AI-assisted output disclosure (EU AI Act Art. 50)
Every record carries ai_assisted: true. Every API response envelope carries meta.ai_assisted: true. Every dashboard view shows the disclosure inline + in the footer. EU AI Act Article 50 takes effect August 2, 2026.
Accuracy SLOs
- Victim entity resolution: ≥ 99% accuracy
- Dates: ≥ 98% accuracy (±1 day)
- Numeric counts: ≥ 95% accuracy
- Attack vector classification: ≥ 90% (often not disclosed)
Corrections
Email corrections@disclosurefeed.com — 48-hour SLA from receipt.
Provenance
Every record carries:
source.url— the originating regulator-published URLsource.filed_at— the regulator-recorded filing timestampsource.raw_hash— SHA-256 of the source body at ingestion timeGET /v1/disclosures/{id}/extraction→model,provider,extracted_at,system_prompt_version, token counts,cost_usdoverall_confidenceon the same response (per-field confidence is on the roadmap; tracked at extraction time but not yet persisted)