From Grep to Semantic Search: How OpenClaw Builds Memory for AI Agents

Last week I updated OpenClaw to a new version. Everything worked — until I tried memory_search. Error. Missing API token for the embedding model.

Digging into the config, I found it: older versions used simple grep to search memory; the new version switched to vector embedding, which needs an API key. Instead of just adding the token and moving on, I read the source to understand why.

The answer: grep only works when you already know where to look and which exact words to search for.

The Problem: Grep Is the Wrong Lookup

The old approach was straightforward: the agent called read() on each file, scanned the contents, found relevant info. This is a linear scan — every file read in full from top to bottom.

The problem isn't speed. The problem is the agent needs to know which file holds the information ahead of time — and there's no mechanism to know that.

# Agent needs: "AWS instance ID"
read("MEMORY.md")             # Maybe there, maybe not
read("memory/2026-05-14.md")  # Maybe has AWS migration info
read("memory/2026-05-28.md")  # Has SSH tunnel info
...                           # Dozens of daily notes

Without an index, the agent tries each file in turn, or falls back to grep. But grep is still an exact text match: searching "AWS setup" won't surface "EC2 instance in Singapore" even though it's relevant.
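To make the exact-match limitation concrete, here is a minimal sketch. The note contents and the `grepNotes` helper are made up for illustration; they are not OpenClaw code.

```typescript
// Hypothetical notes; file names follow the article, contents are invented.
const notes: Record<string, string> = {
  "memory/2026-05-14.md": "Moved the EC2 instance in Singapore to a new VPC.",
  "memory/2026-05-28.md": "Opened an SSH tunnel to the staging box.",
};

// grep-style lookup: case-insensitive substring match over every file.
function grepNotes(query: string): string[] {
  const needle = query.toLowerCase();
  return Object.entries(notes)
    .filter(([, text]) => text.toLowerCase().includes(needle))
    .map(([file]) => file);
}

// No file contains the literal phrase, so the relevant note is missed.
const awsHits = grepNotes("AWS setup");
// The note is found only when the query repeats its exact wording.
const ec2Hits = grepNotes("EC2 instance");
console.log(awsHits, ec2Hits);
```

The second query succeeds only because it reuses the note's own words, which is exactly what the agent cannot know in advance.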

Architecture: Hybrid Retrieval Pipeline

Instead of reading files on demand, all memory is indexed upfront and searched when needed.

graph TD
    A[Memory Files] --> B[Chunking]
    B --> C[Embedding]
    C --> D[(SQLite + sqlite-vec)]
    Q[Query] --> V[Vector Search 70%]
    Q --> K[BM25 Search 30%]
    D --> V
    D --> K
    V --> H[Hybrid Scoring]
    K --> H
    H --> R[Top-K to Agent Context]

Chunking: Why Overlap?

Each memory file is split into ~400-token chunks with 80-token overlap — because meaning often lives at the boundary between adjacent passages. Hard-cutting without overlap can split a sentence mid-thought, breaking the context that makes search work.
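A sliding window is one simple way to get this kind of overlap. The sketch below is an assumption about the mechanics, not OpenClaw's actual chunker: `chunkWithOverlap` is a hypothetical name, and pre-split string tokens stand in for real tokenizer output.

```typescript
// Split a token list into fixed-size windows that share `overlap` tokens
// with the previous window. Defaults mirror the ~400 / 80 figures above.
function chunkWithOverlap(tokens: string[], size = 400, overlap = 80): string[][] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("require size > 0 and 0 <= overlap < size");
  }
  const step = size - overlap; // how far the window advances each time
  const chunks: string[][] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + size));
    if (start + size >= tokens.length) break; // this window reached the end
  }
  return chunks;
}

// 1000 fake tokens: windows start at 0, 320 and 640.
const demoTokens = Array.from({ length: 1000 }, (_, i) => `w${i}`);
const demoChunks = chunkWithOverlap(demoTokens);
console.log(demoChunks.map((c) => c.length)); // [400, 400, 360]
```

The last 80 tokens of each window reappear as the first 80 of the next one, so a sentence that straddles a boundary is still whole in at least one chunk.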

Embedding: Text as Coordinates

An embedding model takes text and returns a float array — the numeric "fingerprint" of its meaning. Text with similar meaning produces arrays that are close together in that space.

This is why "EC2 instance in Singapore" can match the query "AWS setup". The two phrases share no words, but their embeddings land close together because the model has learned that they are related.
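"Close together" is usually measured with cosine similarity. The article does not say which metric OpenClaw uses, so treat this as a sketch of the common choice. The three-dimensional vectors are invented toy values; real embeddings have hundreds of dimensions.

```typescript
// Cosine similarity: 1 means same direction, 0 means unrelated (orthogonal).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors, made up for illustration only.
const queryVec = [0.9, 0.1, 0.2]; // "AWS setup"
const ec2Vec = [0.8, 0.2, 0.1];   // "EC2 instance in Singapore"
const sshVec = [0.1, 0.9, 0.3];   // "SSH tunnel"

const simEc2 = cosineSimilarity(queryVec, ec2Vec);
const simSsh = cosineSimilarity(queryVec, sshVec);
console.log(simEc2 > simSsh); // true: the EC2 note is the closer match
```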

OpenClaw supports multiple providers, chosen automatically from config:

Provider	Speed	Cost	Quality
Local (GGUF / node-llama-cpp)	~50 tok/s	Free	Low
OpenAI (text-embedding-3-small)	~1000 tok/s	$0.02/1M tokens ($0.01 via Batch API)	High
Gemini / Voyage	~800 tok/s	Plan-dependent, fallback chain	High
BM25 only	—	Free	Keywords only

I'm currently running embeddinggemma, a GGUF model that runs locally via node-llama-cpp, so the setup needs no GPU and no API key.

Storage: SQLite Per Agent

Vectors are stored in SQLite with the sqlite-vec extension, at ~/.openclaw/memory/<agentId>.sqlite. Each agent has its own database — no shared index, no server.

SHA-256 caching means OpenClaw skips re-embedding chunks that are already indexed. Only when file contents actually change are chunks recomputed.
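The caching idea can be sketched with a content hash as the cache key. Everything here except Node's `createHash` is hypothetical: `embeddingCache`, `fakeEmbed`, and `embedWithCache` are illustrative names, and the real cache presumably lives in SQLite rather than in memory.

```typescript
import { createHash } from "node:crypto";

// Hypothetical in-memory cache: content hash -> embedding vector.
const embeddingCache = new Map<string, number[]>();
let embedCalls = 0;

// Stand-in for a real embedding call; returns a fake one-element vector.
function fakeEmbed(text: string): number[] {
  embedCalls++;
  return [text.length];
}

function sha256(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Embed a chunk only if its exact content has not been seen before.
function embedWithCache(chunk: string): number[] {
  const key = sha256(chunk);
  const cached = embeddingCache.get(key);
  if (cached !== undefined) return cached;
  const vector = fakeEmbed(chunk);
  embeddingCache.set(key, vector);
  return vector;
}

embedWithCache("EC2 instance in Singapore");
embedWithCache("EC2 instance in Singapore"); // unchanged content: cache hit
embedWithCache("EC2 instance in Tokyo");     // changed content: re-embedded
console.log(embedCalls); // 2
```

Keying on the hash of the content rather than on the file name means an untouched chunk costs nothing on re-index, even if the file around it changed.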

Hybrid Search: Two Branches Combined

When the agent calls memory_search("AWS setup"), the pipeline runs two branches in parallel and combines their scores with a weighted sum:

finalScore = 0.7 × vectorScore + 0.3 × bm25Score

Vector search (70%) retrieves by meaning — matching even when different words are used. BM25 search (30%) retrieves by keyword — exact match. Top-K chunks are injected into context before the LLM call. The full pipeline runs in under 100ms.
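The weighted sum is easy to try by hand. This sketch assumes both branch scores are already normalized to the 0 to 1 range, which the article does not confirm; `ScoredChunk` and `hybridRank` are hypothetical names and the scores are invented.

```typescript
interface ScoredChunk {
  id: string;
  vectorScore: number; // assumed normalized to [0, 1]
  bm25Score: number;   // assumed normalized to [0, 1]
}

// Combine both branches with the 0.7 / 0.3 weights and keep the top K.
function hybridRank(
  chunks: ScoredChunk[],
  topK: number,
  vectorWeight = 0.7,
  textWeight = 0.3,
): { id: string; finalScore: number }[] {
  return chunks
    .map((c) => ({
      id: c.id,
      finalScore: vectorWeight * c.vectorScore + textWeight * c.bm25Score,
    }))
    .sort((x, y) => y.finalScore - x.finalScore)
    .slice(0, topK);
}

const ranked = hybridRank(
  [
    { id: "a", vectorScore: 0.9, bm25Score: 0.0 }, // strong semantic match only
    { id: "b", vectorScore: 0.4, bm25Score: 1.0 }, // strong keyword match
    { id: "c", vectorScore: 0.2, bm25Score: 0.1 }, // weak on both
  ],
  2,
);
console.log(ranked.map((r) => r.id)); // ["a", "b"]
```

Chunk "b" would rank poorly on vector score alone; the keyword branch lifts it into the top 2.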

Why Hybrid, Not Pure Vector?

Many systems use only vector search. OpenClaw chose hybrid for one practical reason: vector search is weak at exact match.

When you search for "PR #17", the embedding model doesn't treat it as a specific identifier. It matches chunks about pull requests in general, which is useless when you need exactly PR #17.

BM25 handles exact match well — it calculates term frequency and inverse document frequency, returning chunks that contain "PR #17" precisely. Conversely, BM25 is blind to paraphrase: "AWS infrastructure" and "cloud setup on Amazon" have zero keyword overlap, but vector search understands they're related.
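For readers who have not seen BM25 written out, here is a compact sketch of the standard formula with the common defaults k1 = 1.2 and b = 0.75. It is not OpenClaw's implementation: the whitespace tokenizer, the function names, and the sample documents are all assumptions for illustration.

```typescript
// Naive tokenizer (an assumption): lowercase, split on whitespace.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
}

// Score every document against the query with BM25.
function bm25Scores(query: string, docs: string[], k1 = 1.2, b = 0.75): number[] {
  const docTokens = docs.map(tokenize);
  const avgLen =
    docTokens.reduce((sum, d) => sum + d.length, 0) / Math.max(docs.length, 1);
  const queryTerms = Array.from(new Set(tokenize(query)));
  return docTokens.map((tokens) => {
    let score = 0;
    for (const term of queryTerms) {
      const tf = tokens.filter((t) => t === term).length; // term frequency
      if (tf === 0) continue;
      const df = docTokens.filter((d) => d.includes(term)).length; // document frequency
      const idf = Math.log(1 + (docs.length - df + 0.5) / (df + 0.5));
      const lengthNorm = 1 - b + (b * tokens.length) / avgLen;
      score += (idf * tf * (k1 + 1)) / (tf + k1 * lengthNorm);
    }
    return score;
  });
}

const prScores = bm25Scores("PR #17", [
  "merged pr #17 after review",     // contains both query terms
  "general notes on pull requests", // contains neither
  "pr #12 needs a rebase",          // contains only "pr"
]);
console.log(prScores);
```

The rare token "#17" appears in one document only, so its inverse document frequency is high and the first document wins clearly, while the paraphrase "pull requests" scores zero.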

Hybrid gets the strengths of both.

Two Important Optimizations

Hybrid search solves the lookup problem. Two smaller issues surface in practice.

Problem 1: Top results often repeat. When memory has many chunks about "AWS EC2 setup", the top 3 can be the same information rewritten three times. The agent gets redundant context with no diversity.

Problem 2: Old memory is less relevant than new memory. A note from 3 months ago is usually less useful than one from yesterday.

Temporal Decay

Score chunks from older files lower using:

decayFactor = Math.pow(0.5, daysSinceCreation / halfLifeDays)
// halfLifeDays = 30 (default)
// Today's note:  decay = 1.0
// 30 days ago:   decay = 0.5
// 60 days ago:   decay = 0.25

MEMORY.md — long-term memory — is exempt from decay since it doesn't go stale the way daily notes do.
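Putting the formula and the exemption together gives something like the sketch below. How OpenClaw actually identifies the exempt file is not stated in the article; matching on the file name "MEMORY.md" is an assumption, as are the `decayFactor` and `applyDecay` names.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Halve the weight every `halfLifeDays`; future dates are clamped to age 0.
function decayFactor(ageDays: number, halfLifeDays = 30): number {
  return Math.pow(0.5, Math.max(ageDays, 0) / halfLifeDays);
}

// Apply decay to a chunk's score unless it comes from long-term memory.
function applyDecay(
  score: number,
  filePath: string,
  createdAt: Date,
  now: Date,
  halfLifeDays = 30,
): number {
  if (filePath === "MEMORY.md") return score; // assumed exemption check
  const ageDays = (now.getTime() - createdAt.getTime()) / MS_PER_DAY;
  return score * decayFactor(ageDays, halfLifeDays);
}

const created = new Date("2026-05-14T00:00:00Z");
const today = new Date("2026-06-13T00:00:00Z"); // exactly 30 days later
const dailyScore = applyDecay(0.8, "memory/2026-05-14.md", created, today);
const longTermScore = applyDecay(0.8, "MEMORY.md", created, today);
console.log(dailyScore, longTermScore); // 0.4 0.8
```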

MMR (Maximal Marginal Relevance)

Instead of picking top-K by raw score, MMR balances relevance against diversity:

Without MMR:
1. "AWS EC2 setup" (0.95)
2. "AWS EC2 configuration" (0.94)  // near-duplicate
3. "AWS EC2 instance" (0.93)       // near-duplicate

With MMR (λ=0.7):
1. "AWS EC2 setup" (0.95)
2. "AWS Lambda config" (0.82)      // different topic
3. "AWS S3 bucket" (0.78)          // different topic

The agent gets more diverse context covering more angles within the same token budget.
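The selection loop behind that result can be sketched with the textbook MMR rule: at each step, pick the candidate that maximizes λ × relevance minus (1 − λ) × its highest similarity to anything already selected. The relevance scores below come from the example above; the vectors, ids, and function names are invented, and OpenClaw's real implementation may differ.

```typescript
interface Candidate {
  id: string;
  relevance: number;
  vector: number[]; // toy embedding, made up for illustration
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Greedy MMR: trade relevance against similarity to already-picked items.
function mmrSelect(candidates: Candidate[], k: number, lambda = 0.7): Candidate[] {
  const selected: Candidate[] = [];
  const remaining = [...candidates];
  while (selected.length < k && remaining.length > 0) {
    let bestIndex = 0;
    let bestScore = -Infinity;
    remaining.forEach((cand, i) => {
      const maxSim =
        selected.length === 0
          ? 0
          : Math.max(...selected.map((s) => cosine(cand.vector, s.vector)));
      const score = lambda * cand.relevance - (1 - lambda) * maxSim;
      if (score > bestScore) {
        bestScore = score;
        bestIndex = i;
      }
    });
    selected.push(remaining.splice(bestIndex, 1)[0]);
  }
  return selected;
}

const pool: Candidate[] = [
  { id: "ec2-setup", relevance: 0.95, vector: [1, 0, 0] },
  { id: "ec2-configuration", relevance: 0.94, vector: [0.99, 0.1, 0] },
  { id: "ec2-instance", relevance: 0.93, vector: [0.98, 0, 0.1] },
  { id: "lambda-config", relevance: 0.82, vector: [0.3, 0.9, 0] },
  { id: "s3-bucket", relevance: 0.78, vector: [0.3, 0, 0.9] },
];
const diverse = mmrSelect(pool, 3).map((c) => c.id);
console.log(diverse); // ["ec2-setup", "lambda-config", "s3-bucket"]
```

Setting λ to 1 turns the penalty off and reduces the loop to plain top-K by relevance, which is a quick way to compare the two behaviours on the same data.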

Lessons Learned

The missing API key was what started the debug. Understanding the full pipeline was what made it worthwhile.

The lookup problem is not the speed problem. I first thought grep was slow because it read many files. The real issue wasn't speed — the agent had no mechanism to know where to look. Indexing solves lookup, not speed.

Hybrid beats pure in practice. Pure vector search misses exact matches like "PR #17" or instance IDs. Pure BM25 misses paraphrase. Practical memory systems need both — not either-or.

Simplicity is a technical decision. SQLite instead of a dedicated vector database. One file per agent instead of a distributed cluster. SHA-256 caching instead of real-time sync. Not because of limited capability — but because it's enough and easy to debug when something breaks.

Content assisted by AI (Amy 🌸). Reviewed by the author.