Autonomous research agents that learn from runs

Give your research agent a second brain

LoopGraph is the memory layer for autonomous ML research. It remembers every experiment, ranks methods by what actually worked, and turns papers into the next winning edit.

~12
experiments / hour unattended
10.5–13.0%
fewer tokens vs. baseline
9.5–17.3%
faster wall-clock time
FTS5 + SQLite
local, auditable experiment memory

Most agents forget everything

Today’s coding agents start from scratch on every run. Grep is useful, but it is not memory: agents need active retrieval, code maps, provenance, and a record of which ideas worked.

Re-running failed ideas

Without memory, agents repeat experiments that already flopped.

Literature stays separate

Papers live in PDFs; agents can’t turn “use SwiGLU” into a working code change.

No provenance

When something works, no one knows which method or prompt caused it.

# Without LoopGraph: every run is a blank slate
agent.run() # propose → edit → train → evaluate → forget
# With LoopGraph: every run learns from the last
agent.use(LoopGraph()) # retrieve ranked methods → edit → train → log → improve

How it works

LoopGraph wraps your existing training loop and adds a retrieval layer that gets smarter with every experiment.

1

Seed the knowledge base

Ingest papers from arXiv or GitHub, or use the curated method pack covering SwiGLU, GQA, Muon LR, sliding-window attention, and more.

2

Run experiments

The agent edits train.py, trains for a fixed 5-minute budget, and logs the result to results.tsv and ResearchFS.

3

Rank what works

ResearchFS scores each method by success rate, bits-per-byte (BPB) delta, query fatigue, and recency, so the agent reuses winners.

4

Keep improving

Every retrieval, outcome, and token cost is stored locally. The next experiment starts smarter than the last.
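The ranking described in step 3 could be sketched roughly as follows. This is an illustration only, not LoopGraph's actual scoring code: the `MethodStats` fields, the weights, the fatigue penalty, and the 7-day recency half-life are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class MethodStats:
    """Hypothetical per-method record; field names are illustrative."""
    name: str
    wins: int                   # experiments where validation BPB improved
    trials: int                 # total experiments that used the method
    mean_bpb_delta: float       # baseline BPB minus method BPB (positive = better)
    recent_queries: int         # how often the method was retrieved lately
    days_since_last_win: float


def score(m: MethodStats, half_life_days: float = 7.0) -> float:
    """Combine success rate, BPB delta, fatigue, and recency into one number."""
    # Laplace-smoothed win rate, so an untested method starts at 0.5.
    success_rate = (m.wins + 1) / (m.trials + 2)
    # Only reward improvements; regressions add nothing (assumed weight of 10).
    gain = 1.0 + 10.0 * max(m.mean_bpb_delta, 0.0)
    # Fatigue: methods suggested many times recently are down-weighted.
    fatigue = 1.0 / (1.0 + m.recent_queries)
    # Recency: exponential decay with an assumed half-life.
    recency = 0.5 + 0.5 * 0.5 ** (m.days_since_last_win / half_life_days)
    return success_rate * gain * fatigue * recency


def rank(methods: list[MethodStats]) -> list[str]:
    """Return method names, best first."""
    return [m.name for m in sorted(methods, key=score, reverse=True)]
```

Multiplying the factors means a heavily fatigued method drops in rank even when its win rate is high, which is one way to keep the agent from retrying the same idea on every run.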

Meet ResearchFS

A local SQLite knowledge store with full-text search, empirical scoring, and an agent-native SDK. It turns a passive code search into active experiment memory.

🔍

FTS5 hybrid search

Combine full-text search with empirical rankings to surface the right method at the right time.
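One way to combine FTS5 relevance with empirical rankings is sketched below, using Python's built-in `sqlite3` module. The table names, the `win_rate` column, and the `0.5 + win_rate` weighting are assumptions for illustration and are not taken from ResearchFS itself. The sketch also assumes an SQLite build with FTS5 enabled, which most Python distributions ship.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Full-text index over method descriptions (needs FTS5).
    CREATE VIRTUAL TABLE methods_fts USING fts5(name, description);
    -- Hypothetical table of empirical results per method.
    CREATE TABLE method_stats (name TEXT PRIMARY KEY, win_rate REAL);
""")
conn.executemany(
    "INSERT INTO methods_fts (name, description) VALUES (?, ?)",
    [
        ("swiglu", "gated MLP activation for transformer feed-forward layers"),
        ("gelu", "smooth MLP activation for transformer feed-forward layers"),
        ("gqa", "grouped-query attention shares key and value heads"),
    ],
)
# Illustrative win rates, not measured results.
conn.executemany(
    "INSERT INTO method_stats VALUES (?, ?)",
    [("swiglu", 0.75), ("gelu", 0.25), ("gqa", 0.50)],
)


def hybrid_search(query: str, limit: int = 5) -> list[tuple[str, float]]:
    """Rank FTS5 matches by text relevance scaled by empirical win rate."""
    return conn.execute(
        """
        SELECT methods_fts.name,
               -- bm25() is more negative for better matches, so negate it.
               -bm25(methods_fts) * (0.5 + s.win_rate) AS hybrid
        FROM methods_fts
        JOIN method_stats AS s ON s.name = methods_fts.name
        WHERE methods_fts MATCH ?
        ORDER BY hybrid DESC
        LIMIT ?
        """,
        (query, limit),
    ).fetchall()
```

Calling `hybrid_search("MLP activation")` returns both MLP methods, with the one that has the higher win rate first even though the two match the text equally well.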

📊

Success-weighted scoring

Methods are ranked by prior BPB improvement, win rate, and fatigue — not just keyword match.

🔗

Full provenance

Every method links back to its paper, chunk, retrieval event, and experiment outcome.
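The provenance chain (paper, chunk, method, retrieval event, experiment outcome) maps naturally onto foreign keys in SQLite. The schema below is a hypothetical sketch: the table and column names are assumptions, not the actual ResearchFS layout, and the inserted rows are placeholders rather than real results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE papers (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE chunks (id INTEGER PRIMARY KEY,
                         paper_id INTEGER REFERENCES papers(id), text TEXT);
    CREATE TABLE methods (id INTEGER PRIMARY KEY, name TEXT,
                          chunk_id INTEGER REFERENCES chunks(id));
    CREATE TABLE retrievals (id INTEGER PRIMARY KEY,
                             method_id INTEGER REFERENCES methods(id), query TEXT);
    CREATE TABLE experiments (id INTEGER PRIMARY KEY,
                              retrieval_id INTEGER REFERENCES retrievals(id),
                              val_bpb REAL);

    -- Placeholder rows for illustration only.
    INSERT INTO papers VALUES (1, 'Example paper (placeholder)');
    INSERT INTO chunks VALUES (1, 1, 'placeholder chunk describing SwiGLU');
    INSERT INTO methods VALUES (1, 'swiglu', 1);
    INSERT INTO retrievals VALUES (1, 1, 'MLP SwiGLU');
    INSERT INTO experiments VALUES (1, 1, 2.0);
""")


def trace(experiment_id: int) -> dict:
    """Walk from an experiment outcome back to the paper that inspired it."""
    row = conn.execute(
        """
        SELECT e.val_bpb, r.query, m.name, c.text, p.title
        FROM experiments AS e
        JOIN retrievals AS r ON r.id = e.retrieval_id
        JOIN methods    AS m ON m.id = r.method_id
        JOIN chunks     AS c ON c.id = m.chunk_id
        JOIN papers     AS p ON p.id = c.paper_id
        WHERE e.id = ?
        """,
        (experiment_id,),
    ).fetchone()
    keys = ("val_bpb", "query", "method", "chunk", "paper")
    return dict(zip(keys, row)) if row else {}
```

With this layout, `trace(1)` answers "which method and which query produced this result?" in a single join, and an unknown experiment id returns an empty dict.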

# Query ResearchFS from your agent
from pathlib import Path

from loopgraph import ResearchFSClient

train_py = Path("train.py").read_text()  # current training script
client = ResearchFSClient("researchfs.db")
context = client.suggest_query(
    goal="improve validation bpb",
    current_code=train_py,
)
# Returns ranked method pack + experiment brief

Benchmarks from local smoke-test runs

Benchmarks: faster, cheaper, and honest about model quality

LoopGraph was compared against a baseline autonomous agent on the nanochat BPB benchmark. Same run harness, same fixed training budget; one agent received ResearchFS retrieval context and one did not.

GPT-4o-mini BPB delta
0.02454 lower
LoopGraph best BPB: 2.180112 vs. baseline 2.204652. Lower BPB is better.
Token savings
10.5–13.0%
LoopGraph vs. baseline: 20,986 vs. 24,127 tokens on GPT-4o-mini; 28,431 vs. 31,755 on GLM 5.2.
Wall-clock savings
9.5–17.3%
LoopGraph completed faster across both comparison runs.
Query reuse
Run 3
ResearchFS reused the prior query MLP SwiGLU during the GLM run.
Run         | Baseline best BPB | LoopGraph best BPB | Tokens saved | Time saved | Outcome
GPT-4o-mini | 2.204652          | 2.180112           | 13.0%        | 17.3%      | LoopGraph wins BPB, tokens, and time
GLM 5.2     | 2.128834          | 2.153019           | 10.5%        | 9.5%       | Baseline wins BPB; LoopGraph wins efficiency

Honest read: these are local agent-loop smoke tests, not SOTA claims. Published B200 baselines such as Recursive’s optimized_from_karpathy report mean BPB around 0.9109, while these quick local LoopGraph runs are around 2.13–2.20. The benchmark signal today is efficiency and retrieval behavior; the next milestone is closing the quality gap.
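The headline percentages can be reproduced from the raw numbers quoted above. The short check below assumes that the smaller token count in each pair belongs to the LoopGraph run and that savings are measured relative to the baseline; the helper name `pct_saved` is made up for this example.

```python
def pct_saved(baseline: float, loopgraph: float) -> float:
    """Percentage reduction relative to the baseline."""
    return 100.0 * (baseline - loopgraph) / baseline


# (baseline tokens, LoopGraph tokens) as quoted in the benchmark cards.
token_counts = {
    "GPT-4o-mini": (24_127, 20_986),
    "GLM 5.2": (31_755, 28_431),
}

for run, (baseline, loopgraph) in token_counts.items():
    print(f"{run}: {pct_saved(baseline, loopgraph):.1f}% fewer tokens")

# BPB improvement on GPT-4o-mini (lower BPB is better).
bpb_improvement = 2.204652 - 2.180112
print(f"GPT-4o-mini BPB improvement: {bpb_improvement:.5f}")
```

The printed values match the 13.0% and 10.5% token savings and the 0.02454 BPB difference reported for GPT-4o-mini.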

Built for researchers, not tourists

Single-GPU real training

Runs on real PyTorch + nanochat, not a toy environment. Tested on H100; MPS/CPU fallbacks included.

🔒

Local-first memory

Your experiment history, papers, and API calls stay in a local SQLite database — no cloud lock-in.

🧩

Provider agnostic

Plug in OpenAI, OpenRouter, Venice, or local models. Switch models without rewriting the loop.

🧪

Tested retrieval logic

1,693 lines of tests cover parsing, ranking, fatigue, deduplication, and harness-agent behavior.

Stop letting your agent forget

Join the early access list. We’re working with ML teams to turn LoopGraph into the default memory layer for autonomous research.