Validated against 3,034 emails from the Enron investigation (allen-p custodian). Ground truth filtered by custodian for accurate precision/recall measurement.
Micro-tier benchmark on a single custodian. Recall metrics reflect single-query performance - in practice, reviewers run multiple queries per topic, and cumulative recall across a review session is significantly higher. Results will vary with corpus size and composition.
Hybrid search finds over half of all relevant documents in a single query.
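The distinction between single-query and session-level recall can be made concrete: cumulative recall is the recall of the *union* of documents returned across all queries in a session. A minimal sketch (the doc IDs and relevant set below are hypothetical):

```python
def recall(retrieved, relevant):
    """Fraction of relevant doc IDs found by one query's result set."""
    return len(set(retrieved) & set(relevant)) / len(relevant)

def cumulative_recall(query_results, relevant):
    """Recall of the union of results across multiple queries."""
    found = set()
    for retrieved in query_results:
        found |= set(retrieved)
    return len(found & set(relevant)) / len(relevant)

relevant = {1, 2, 3, 4, 5, 6, 7, 8}
q1 = [1, 2, 3, 4, 9]       # first query finds half the relevant docs
q2 = [3, 4, 5, 6, 10]      # second query overlaps but adds two more
print(recall(q1, relevant))                    # 0.5
print(cumulative_recall([q1, q2], relevant))   # 0.75
```

This is why single-query recall understates what a reviewer actually achieves over a session.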
Keyword, semantic, hybrid, RAG fusion, and full pipeline. Each optimized for different use cases.
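The document doesn't specify how the hybrid mode merges keyword and semantic results; a common technique for this is reciprocal rank fusion (RRF), sketched here as an illustration rather than the tool's actual implementation:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score each doc by the sum of
    1 / (k + rank) over every ranked list it appears in (rank is 1-based),
    then sort by that score. k=60 is the value from the original RRF paper."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword  = ["d3", "d1", "d7"]   # e.g. a BM25-style keyword ranking
semantic = ["d1", "d5", "d3"]   # e.g. an embedding-similarity ranking
print(rrf_fuse([keyword, semantic]))  # d1 and d3 rise to the top
```

RRF needs only ranks, not comparable scores, which is what makes it attractive for fusing lexical and vector results.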
From raw email upload to fully searchable, triaged corpus with embeddings and AI relevance scores.
Embedding + LLM triage costs. Semantic and hybrid search modes are free after initial embedding.
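Embedding cost scales with token volume, so it can be estimated up front. The rate below is an assumption (OpenAI has priced text-embedding-3-small at $0.02 per 1M tokens; verify against current pricing), and the 500-tokens-per-email figure is purely illustrative:

```python
# Illustrative cost math, not the tool's billing logic.
# Assumed rate for text-embedding-3-small; check current OpenAI pricing.
EMBED_RATE_PER_1M_TOKENS = 0.02

def embedding_cost(num_docs, avg_tokens_per_doc,
                   rate=EMBED_RATE_PER_1M_TOKENS):
    """One-time cost to embed a corpus, in dollars."""
    return num_docs * avg_tokens_per_doc / 1_000_000 * rate

# 3,034 emails at a hypothetical 500 tokens each:
print(f"${embedding_cost(3034, 500):.4f}")  # $0.0303
```

After this one-time spend, semantic and hybrid queries only read the stored vectors, which is why they incur no further API cost.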
4 parallel workers on a Hetzner AX41 bare-metal server (64GB RAM)
| Stage | Duration | Documents | Details |
|---|---|---|---|
| Ingest | 2.9s | 3,034 | 1,054 docs/sec throughput |
| Process | 10m 4s | 3,034 | Text extraction, metadata, MIME detection |
| Embed | 7m 54s | 3,034 | OpenAI text-embedding-3-small (1536 dims) |
| Triage | ~2m | 3,034 | AI relevance scoring with citations |
| Total | ~18 min | 3,034 | End-to-end with 4 parallel workers |
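Once the Embed stage has run, semantic search reduces to a nearest-neighbor lookup over the stored vectors. A minimal sketch of that lookup, with toy 3-dimensional vectors standing in for the real 1536-dimensional text-embedding-3-small output:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_search(query_vec, index, top_k=5):
    """Rank stored (doc_id, vector) pairs by similarity to the query."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:top_k]

# Hypothetical index entries; real vectors would have 1536 components.
index = [("email-1", [1.0, 0.0, 0.0]),
         ("email-2", [0.7, 0.7, 0.0])]
print(semantic_search([1.0, 0.1, 0.0], index, top_k=1))
```

A brute-force scan like this is fine at micro-tier scale; larger corpora typically sit behind an approximate nearest-neighbor index instead.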
Micro-tier benchmark - ground truth: custodian-filtered Enron dataset
```
cli bench run --tier micro
```

The demo matter contains the exact same Enron dataset. Search, triage, and export it yourself.