BENCHMARK / METHODOLOGY FIRST

A four-item smoke eval, not a benchmark win.

The v0.6 report shows that the retrieval and Evidence Ledger path works end to end on the public demo fixture. It does not evaluate Lore on LOCOMO or LongMemEval, and it does not compare Lore against Mem0, Letta, or Zep at production scale.

Recall@5: 1.000 (4 gold memories found), vs baseline
Precision@5: 0.200 (small corpus, k=5), vs baseline
MRR: 0.875 (one rank-2 tie), vs baseline
Stale-hit: 0.000 (no stale demo rows), vs baseline
p95 latency: pending (not measured via MCP), vs baseline
Dataset
4 public demo memories and 4 questions from examples/demo-dataset/eval/lore-demo-eval-dataset.json.
Method
Write demo memories, query top 5, compute Recall@5, Precision@5, MRR, and stale-hit rate using the same formulas as @lore/eval.
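The metric step above can be sketched as follows. This is a minimal illustration using the standard information-retrieval definitions, which are assumed (not confirmed) to match @lore/eval. The type EvalItem, the helper names, the memory ids, and the per-question rankings are all hypothetical; the rankings were chosen only so that one gold memory lands at rank 2, consistent with the "one rank-2 tie" note. Dividing precision by k rather than by the number of rows returned, and counting a stale hit per query rather than per row, are also assumptions.

```typescript
// Standard IR definitions; ASSUMED, not confirmed, to match @lore/eval.
type EvalItem = {
  gold: string[];      // ids of the memories that answer the question
  retrieved: string[]; // ranked ids returned by the system, best first
};

const mean = (xs: number[]): number =>
  xs.length === 0 ? 0 : xs.reduce((sum, x) => sum + x, 0) / xs.length;

// Fraction of gold memories that appear in the top k.
function recallAtK(item: EvalItem, k: number): number {
  const topK = item.retrieved.slice(0, k);
  const hits = item.gold.filter((id) => topK.indexOf(id) !== -1).length;
  return item.gold.length === 0 ? 0 : hits / item.gold.length;
}

// Gold hits in the top k, divided by k (assumption: not by rows returned).
function precisionAtK(item: EvalItem, k: number): number {
  const topK = item.retrieved.slice(0, k);
  return topK.filter((id) => item.gold.indexOf(id) !== -1).length / k;
}

// 1 / rank of the first gold memory in the top k, or 0 if none is found.
function reciprocalRank(item: EvalItem, k: number): number {
  const firstHit = item.retrieved
    .slice(0, k)
    .reduce(
      (best, id, i) =>
        best === 0 && item.gold.indexOf(id) !== -1 ? i + 1 : best,
      0
    );
  return firstHit === 0 ? 0 : 1 / firstHit;
}

// 1 if any top-k row is stale, else 0 (assumed per-query definition).
function staleHit(item: EvalItem, staleIds: string[], k: number): number {
  const topK = item.retrieved.slice(0, k);
  return topK.some((id) => staleIds.indexOf(id) !== -1) ? 1 : 0;
}

// Hypothetical rankings: three gold memories at rank 1, one at rank 2.
const items: EvalItem[] = [
  { gold: ["m1"], retrieved: ["m1", "m2", "m3", "m4"] },
  { gold: ["m2"], retrieved: ["m2", "m1", "m3", "m4"] },
  { gold: ["m3"], retrieved: ["m3", "m1", "m2", "m4"] },
  { gold: ["m4"], retrieved: ["m1", "m4", "m2", "m3"] },
];
const staleIds: string[] = []; // the demo fixture has no stale rows
const k = 5;

const summary = {
  recallAt5: mean(items.map((it) => recallAtK(it, k))),
  precisionAt5: mean(items.map((it) => precisionAtK(it, k))),
  mrr: mean(items.map((it) => reciprocalRank(it, k))),
  staleHitRate: mean(items.map((it) => staleHit(it, staleIds, k))),
};

console.log(summary);
```

Under these hypothetical rankings the four values agree with the reported 1.000, 0.200, 0.875, and 0.000. With one gold memory per question, Precision@5 cannot exceed 1/5 when the denominator is k, which is why 0.200 is the ceiling here rather than a weak result.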
Reproduce
Run pnpm eval:report -- --project-id demo-private --public-safe to generate the report via the public-safe path; use the demo dataset for a smoke run.
Limitations
With only 4 memories and k = 5, the top-5 window covers the entire corpus, so any system that returns every item scores Recall@5 = 1.0 regardless of ranking quality. The result is useful as a pipeline smoke test, not a scale claim.
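The limitation above can be checked with a deliberately useless retriever. The sketch below is illustrative only: the corpus ids and the function ignoreQueryRetriever are hypothetical, and each question is assumed to have exactly one gold memory. The retriever ignores the query and returns the corpus in reverse order, yet still reaches perfect recall at k = 5.

```typescript
// Hypothetical 4-memory corpus; one gold memory per question (assumption).
const corpus: string[] = ["m1", "m2", "m3", "m4"];
const goldPerQuestion: string[] = ["m1", "m2", "m3", "m4"];
const topK = 5;

// A retriever that ignores the query: always the corpus in reverse order.
function ignoreQueryRetriever(): string[] {
  return [...corpus].reverse().slice(0, topK);
}

const average = (xs: number[]): number =>
  xs.reduce((sum, x) => sum + x, 0) / xs.length;

// Recall@5 per question: 1 if the gold id is anywhere in the returned rows.
const recalls: number[] = goldPerQuestion.map((gold) =>
  ignoreQueryRetriever().indexOf(gold) !== -1 ? 1 : 0
);

// Reciprocal rank per question: 1 / (position of the gold id, 1-based).
const reciprocalRanks: number[] = goldPerQuestion.map((gold) => {
  const position = ignoreQueryRetriever().indexOf(gold);
  return position === -1 ? 0 : 1 / (position + 1);
});

const naiveRecall = average(recalls);
const naiveMrr = average(reciprocalRanks);

console.log({ naiveRecall, naiveMrr });
```

Recall@5 comes out at 1.0 even though the retriever never looks at the question, while MRR drops to roughly 0.52. On a corpus this small, rank-sensitive metrics such as MRR carry more signal than Recall@5.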
Next
Run a 50+ memory dataset, measure direct API latency, and only then compare against public benchmark literature.
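Once direct API latency is sampled, the pending p95 figure could be derived along these lines. This is a sketch under assumptions: the nearest-rank percentile definition is one common choice and not necessarily the one the report will use, the function name percentile is hypothetical, and the sample values are made up for illustration rather than measured.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) {
    throw new Error("percentile needs at least one sample");
  }
  const sorted = [...samplesMs].sort((a, b) => a - b);
  // Integer arithmetic first to avoid floating-point drift in the rank.
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  const value = sorted[rank - 1];
  if (value === undefined) {
    throw new Error("p must be between 0 and 100");
  }
  return value;
}

// Made-up latencies in milliseconds, NOT measurements from Lore.
const exampleLatenciesMs: number[] = [
  12, 14, 11, 13, 15, 12, 16, 13, 14, 90,
  12, 13, 11, 15, 14, 13, 12, 16, 14, 13,
];

const p50 = percentile(exampleLatenciesMs, 50);
const p95 = percentile(exampleLatenciesMs, 95);

console.log({ p50, p95 });
```

With only a few dozen samples a single slow call can dominate the tail, so a p95 figure is more meaningful once the 50+ memory run yields a larger sample of queries.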

Raw data is tracked in the launch workspace and should be moved into a public repo path only after a second reviewer reproduces the run from a clean checkout.