BENCHMARK / METHODOLOGY FIRST
A four-item smoke eval, not a benchmark win.
The v0.6 report proves the retrieval and Evidence Ledger path works on the public demo fixture. It does not compare Lore against LOCOMO, LongMemEval, Mem0, Letta, or Zep at production scale.
Recall@5: 1.000 (4 gold memories found vs baseline)
Precision@5: 0.200 (small corpus, k=5 vs baseline)
MRR: 0.875 (one rank-2 tie vs baseline)
Stale-hit: 0.000 (no stale demo rows vs baseline)
p95 latency: pending (not measured via MCP vs baseline)
- Dataset
- 4 public demo memories and 4 questions from examples/demo-dataset/eval/lore-demo-eval-dataset.json.
- Method
- Write demo memories, query top 5, compute Recall@5, Precision@5, MRR, and stale-hit rate using the same formulas as @lore/eval.
- Reproduce
- pnpm eval:report -- --project-id demo-private --public-safe for the public-safe report path; use the demo dataset for a smoke run.
- Limitations
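- With only 4 memories and k=5, every top-5 result can contain the entire corpus, so any system that returns all items scores Recall@5 = 1.0 regardless of ranking quality. The result is useful as a pipeline smoke test, not a scale claim.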
- Next
- Run a 50+ memory dataset, measure direct API latency, and only then compare against public benchmark literature.
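The Method and Limitations items above can be illustrated with a minimal sketch of the four metrics. This is not the @lore/eval implementation: the `EvalItem` shape, the `evaluate` function, and the memory ids are hypothetical. It assumes Precision@k divides by k (which matches the reported 0.200 when one gold memory sits in the top 5), that stale-hit rate is the fraction of queries whose top k contains at least one stale memory, and that retrieved ids are unique. The toy rankings are invented so that three questions hit at rank 1 and one at rank 2, mirroring the reported figures.

```typescript
// Hypothetical shape for illustration; not the actual @lore/eval types.
interface EvalItem {
  goldIds: string[];      // memories that count as correct for this question
  retrievedIds: string[]; // ranked retrieval result, best first, unique ids
  staleIds?: string[];    // memories considered stale for this question
}

interface EvalSummary {
  recallAtK: number;
  precisionAtK: number;
  mrr: number;
  staleHitRate: number;
}

function mean(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((a, b) => a + b, 0) / values.length;
}

function evaluate(items: EvalItem[], k: number): EvalSummary {
  const recalls: number[] = [];
  const precisions: number[] = [];
  const reciprocalRanks: number[] = [];
  const staleHits: number[] = [];

  for (const item of items) {
    const topK = item.retrievedIds.slice(0, k);
    const gold = new Set(item.goldIds);
    const stale = new Set(item.staleIds ?? []);
    const hits = topK.filter((id) => gold.has(id)).length;

    // Recall@k: share of gold memories that appear in the top k.
    recalls.push(gold.size === 0 ? 0 : hits / gold.size);
    // Precision@k: assumption, the denominator is k even if fewer rows return.
    precisions.push(hits / k);
    // Reciprocal rank of the first gold memory, 0 if none is in the top k.
    const firstHit = topK.findIndex((id) => gold.has(id));
    reciprocalRanks.push(firstHit === -1 ? 0 : 1 / (firstHit + 1));
    // Stale hit: assumption, 1 if any stale memory appears in the top k.
    staleHits.push(topK.some((id) => stale.has(id)) ? 1 : 0);
  }

  return {
    recallAtK: mean(recalls),
    precisionAtK: mean(precisions),
    mrr: mean(reciprocalRanks),
    staleHitRate: mean(staleHits),
  };
}

// Invented toy data: 4 memories, 4 questions, one gold memory each.
const demo: EvalItem[] = [
  { goldIds: ["m1"], retrievedIds: ["m1", "m2", "m3", "m4"] },
  { goldIds: ["m2"], retrievedIds: ["m2", "m1", "m3", "m4"] },
  { goldIds: ["m3"], retrievedIds: ["m3", "m1", "m2", "m4"] },
  { goldIds: ["m4"], retrievedIds: ["m1", "m4", "m2", "m3"] }, // gold at rank 2
];

const summary = evaluate(demo, 5);
console.log(
  summary.recallAtK.toFixed(3),
  summary.precisionAtK.toFixed(3),
  summary.mrr.toFixed(3),
  summary.staleHitRate.toFixed(3),
); // prints: 1.000 0.200 0.875 0.000

// Reversing every ranking degrades MRR but leaves Recall@5 at 1.0,
// because the top 5 still contains the whole 4-memory corpus.
const reversed = demo.map((item) => ({
  ...item,
  retrievedIds: [...item.retrievedIds].reverse(),
}));
const reversedSummary = evaluate(reversed, 5);
```

The reversed run is the Limitations point in miniature: at this corpus size Recall@5 cannot distinguish a good ranker from a bad one, so only MRR moves.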
Raw data is tracked in the launch workspace and should be moved into a public repo path only after a second reviewer reproduces the run from a clean checkout.