SQL-Cached Static NLP: Indexing 10,000+ Docs in Seconds for $0
Frame Codex indexes thousands of markdown files on every commit. We needed sub-5-second indexing without spending money on LLM APIs. Here's how we built a SQL-cached static NLP pipeline that runs in GitHub Actions for free.
The Problem
Every time someone opens a PR to Frame Codex, we need to:
- Extract keywords from every file (TF-IDF)
- Detect common phrases (n-grams)
- Auto-categorize content (vocabulary matching)
- Validate schema compliance
- Build searchable index
Doing this for 100 files takes ~30 seconds. For 1,000 files? 5 minutes. For 10,000? 50 minutes. GitHub Actions has a 6-hour timeout, but nobody wants to wait an hour for CI.
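Those times are consistent with a flat per-file cost. A back-of-envelope sketch (the ~0.3 s/file rate is derived from the 100-file figure above):

```javascript
// Naive pipeline: every file is re-analyzed on every run,
// so total time grows linearly with file count.
const secondsPerFile = 30 / 100 // ~0.3s/file, from the 100-file run

function naiveIndexMinutes(fileCount) {
  return (fileCount * secondsPerFile) / 60
}

console.log(naiveIndexMinutes(1000))  // 5 (minutes)
console.log(naiveIndexMinutes(10000)) // 50 (minutes)
```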
Solution: SHA-Based Diffing
The key insight: most files don't change between commits. A typical PR modifies 1-10 files out of thousands.
So we cache the analysis results and only re-process files that changed:
Step 1: Calculate SHA-256
const crypto = require('crypto')

function calculateSHA(content) {
  return crypto
    .createHash('sha256')
    .update(content)
    .digest('hex')
}

Step 2: Store in SQLite
CREATE TABLE files (
  path TEXT PRIMARY KEY,
  sha TEXT NOT NULL,
  mtime INTEGER NOT NULL,
  analysis TEXT NOT NULL, -- JSON blob
  indexed_at INTEGER NOT NULL
);

-- On first run
INSERT INTO files (path, sha, mtime, analysis, indexed_at)
VALUES ('intro.md', 'abc123...', 1699999999, '{"keywords":[...]}', 1699999999);

Step 3: Compute Diff
async function getDiff(currentFiles) {
  const cached = await db.all('SELECT path, sha FROM files')
  const cachedMap = new Map(cached.map(f => [f.path, f.sha]))

  const added = []
  const modified = []
  const unchanged = []

  for (const file of currentFiles) {
    const currentSHA = calculateSHA(readFile(file))
    const cachedSHA = cachedMap.get(file)

    if (!cachedSHA) {
      added.push(file) // New file
    } else if (cachedSHA !== currentSHA) {
      modified.push(file) // Changed
    } else {
      unchanged.push(file) // Use cache
    }
  }

  return { added, modified, unchanged }
}

Step 4: Merge Results
// Load cached analyses
for (const file of unchanged) {
  const cached = await db.get(
    'SELECT analysis FROM files WHERE path = ?',
    [file]
  )
  index.push(JSON.parse(cached.analysis))
}
// Process only changed files
for (const file of [...added, ...modified]) {
  const content = readFile(file)
  const analysis = analyzeFile(file) // TF-IDF, n-grams, etc.
  index.push(analysis)
  await db.run(
    'INSERT OR REPLACE INTO files (path, sha, analysis, ...) VALUES (?, ?, ?, ...)',
    [file, calculateSHA(content), JSON.stringify(analysis), ...]
  )
}

GitHub Actions Cache
The .cache/codex.db file persists across workflow runs using GitHub's cache action:
- name: Restore SQL cache
  uses: actions/cache@v4
  with:
    path: .cache/codex.db
    key: codex-cache-${{ hashFiles('weaves/**/*.md') }}
    restore-keys: |
      codex-cache-

The cache key embeds a hash of all markdown files, so any content change produces a new exact key. The restore-keys prefix then falls back to the most recent cache, so even a fresh commit in a PR starts from a warm cache instead of a full rebuild.
Static NLP Pipeline
We use classical NLP techniques that don't require API calls:
TF-IDF (Keyword Extraction)
// Term Frequency: how often a term appears in one document
tf = count(term, document) / totalWords(document)

// Inverse Document Frequency: terms rare across the corpus score higher
idf = log(totalDocuments / documentsContaining(term))

// TF-IDF Score
tfidf = tf * idf

// Extract top 10 keywords per document
keywords = allTerms
  .map(term => ({ term, score: tfidf(term) }))
  .sort((a, b) => b.score - a.score)
  .slice(0, 10)

N-Gram Extraction
// Find common 2-3 word phrases
function extractNGrams(text, n = 2) {
  const words = text.toLowerCase().split(/\s+/)
  const ngrams = []

  for (let i = 0; i <= words.length - n; i++) {
    ngrams.push(words.slice(i, i + n).join(' '))
  }

  // Count frequencies
  const counts = {}
  ngrams.forEach(ng => counts[ng] = (counts[ng] || 0) + 1)

  // Return top phrases
  return Object.entries(counts)
    .sort((a, b) => b[1] - a[1])
    .slice(0, 20)
}

Vocabulary Matching
// Controlled vocabulary in tags/index.yaml
const vocabulary = {
  subjects: ['technology', 'science', 'arts', ...],
  topics: ['algorithms', 'machine-learning', ...]
}

// Match keywords to vocabulary
function categorize(keywords) {
  const matches = {}

  for (const subject of vocabulary.subjects) {
    const score = keywords.filter(k =>
      k.includes(subject) || subject.includes(k)
    ).length
    if (score > 0) matches[subject] = score
  }

  return Object.entries(matches)
    .sort((a, b) => b[1] - a[1])
    .map(([subject]) => subject)
}

Performance Breakdown
In a typical 100-file index, the bulk of the time goes into per-file analysis: TF-IDF, n-gram extraction, and categorization. With SQL caching, we skip 85-95% of this work on subsequent runs.
Implementation Details
Cache Database Schema
-- File metadata and analysis cache
CREATE TABLE files (
  path TEXT PRIMARY KEY,
  sha TEXT NOT NULL,           -- SHA-256 of content
  mtime INTEGER NOT NULL,      -- Last modified timestamp
  size INTEGER NOT NULL,       -- File size in bytes
  analysis TEXT NOT NULL,      -- JSON analysis result
  indexed_at INTEGER NOT NULL  -- When indexed
);

-- Keyword cache (for TF-IDF optimization)
CREATE TABLE keywords (
  file_path TEXT NOT NULL,
  keyword TEXT NOT NULL,
  tfidf_score REAL NOT NULL,
  frequency INTEGER NOT NULL,
  PRIMARY KEY (file_path, keyword),
  FOREIGN KEY (file_path) REFERENCES files(path) ON DELETE CASCADE
);

-- Loom/Weave aggregate statistics
CREATE TABLE stats (
  scope TEXT PRIMARY KEY,      -- Loom or weave path
  scope_type TEXT NOT NULL,    -- 'loom' or 'weave'
  total_files INTEGER NOT NULL,
  total_keywords INTEGER NOT NULL,
  avg_difficulty TEXT,
  subjects TEXT,               -- JSON array
  topics TEXT,                 -- JSON array
  last_updated INTEGER NOT NULL
);

-- Indexes for performance
CREATE INDEX idx_files_sha ON files(sha);
CREATE INDEX idx_keywords_score ON keywords(tfidf_score DESC);
Diff Algorithm
async function computeDiff(currentFiles) {
  const startTime = Date.now()

  // Load cached SHAs (fast: ~50ms for 1000 files)
  const cached = await db.all('SELECT path, sha FROM files')
  const cachedMap = new Map(cached.map(f => [f.path, f.sha]))

  const added = []
  const modified = []
  const unchanged = []

  // Compare SHAs (fast: ~100ms for 1000 files)
  for (const file of currentFiles) {
    const content = fs.readFileSync(file, 'utf8')
    const currentSHA = calculateSHA(content)
    const cachedSHA = cachedMap.get(file)

    if (!cachedSHA) {
      added.push(file)
    } else if (cachedSHA !== currentSHA) {
      modified.push(file)
    } else {
      unchanged.push(file)
    }
  }

  console.log(`Diff computed in ${Date.now() - startTime}ms`)
  console.log(`  Added: ${added.length}`)
  console.log(`  Modified: ${modified.length}`)
  console.log(`  Unchanged: ${unchanged.length} (using cache)`)

  return { added, modified, unchanged }
}

Incremental Index Build
async function buildIndex() {
  const cache = await CodexCacheDB.create()
  const allFiles = collectMarkdownFiles()

  // Compute diff
  const { added, modified, unchanged } = await cache.getDiff(allFiles)

  // Load cached analyses (instant)
  const index = []
  for (const file of unchanged) {
    const cached = await cache.getCachedAnalysis(file)
    if (cached) index.push(cached)
  }

  // Process only changed files
  for (const file of [...added, ...modified]) {
    const content = fs.readFileSync(file, 'utf8')
    const analysis = await analyzeFile(file) // TF-IDF, n-grams, etc.
    index.push(analysis)
    await cache.saveFileAnalysis(file, content, analysis)
  }

  // Save index
  fs.writeFileSync('codex-index.json', JSON.stringify(index, null, 2))
  await cache.close()
}

Why better-sqlite3?
We use better-sqlite3 instead of Postgres or a cloud database because:
- Zero latency: No network calls, file is local
- Transactional: ACID guarantees, WAL mode
- Portable: Single .db file, easy to cache/restore
- Fast: 1M+ reads/sec on modern hardware
- Free: No hosting costs
Browser Caching (IndexedDB)
The same caching strategy works in the browser using IndexedDB:
import { createDatabase } from '@framers/sql-storage-adapter'

// Automatically uses IndexedDB in browser
const db = await createDatabase({
  priority: ['indexeddb', 'sqljs']
})

// Cache fetched index
await db.run(
  'INSERT OR REPLACE INTO local_index (etag, data) VALUES (?, ?)',
  [response.headers.get('etag'), JSON.stringify(index)]
)

// On next visit, check etag
const cached = await db.get('SELECT data FROM local_index WHERE etag = ?', [etag])
if (cached) {
  // Use cached data, no network request
  return JSON.parse(cached.data)
}

Cost Analysis
Let's compare approaches for indexing 10,000 markdown files:
| Approach | Time | Cost | Notes |
|---|---|---|---|
| Naive (no cache) | 50 min | $0 | Too slow for CI |
| GPT-4 analysis | 15 min | $200 | $0.02/file |
| Embedding API | 5 min | $5 | $0.0005/file |
| SQL-cached NLP | 30s | $0 | First run |
| SQL-cached NLP | 3s | $0 | Subsequent (95% cache hit) |
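The API costs in the table are just the per-file rates multiplied out:

```javascript
// Per-file rates from the table, multiplied over 10,000 files
const FILES = 10000

const gpt4Cost = FILES * 0.02        // ~$200 at $0.02/file
const embeddingCost = FILES * 0.0005 // ~$5 at $0.0005/file
const staticNlpCost = 0              // no API calls at all

console.log(gpt4Cost, embeddingCost, staticNlpCost)
```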
Scaling to Millions
The architecture scales linearly. With proper indexing and loom-scoped caching:
- 100 files: 30s first run, 2-5s subsequent
- 1,000 files: 5 min first run, 10-20s subsequent
- 10,000 files: 50 min first run, 1-2 min subsequent
- 100,000 files: 8 hours first run, 5-10 min subsequent
The first run is expensive, but you only do it once. After that, PRs complete in seconds.
Try It Yourself
The entire caching system is open source:
Lessons Learned
- SHA hashing is fast: 1000 files in ~100ms
- SQLite is underrated: Faster than Postgres for local workloads
- Static NLP works: TF-IDF gets you 80% of the way without LLMs
- GitHub Actions cache is reliable: 95%+ cache restore rate
- Incremental is always better: Even 50% cache hit = 2x speedup
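The last lesson falls out of a simple model: only the uncached fraction of files pays the analysis cost. A sketch, assuming ~0.3 s/file analysis (from the earlier 100-file figure) and near-free cache loads:

```javascript
// Effective indexing time under a given cache hit rate.
// Assumes loading a cached analysis costs ~nothing next to full analysis.
function effectiveSeconds(files, secPerFile, hitRate) {
  return files * secPerFile * (1 - hitRate)
}

const full = effectiveSeconds(1000, 0.3, 0)       // ~300s, cold cache
const half = effectiveSeconds(1000, 0.3, 0.5)     // ~150s: 50% hit = 2x speedup
const typical = effectiveSeconds(1000, 0.3, 0.95) // ~15s: 95% hit = 20x speedup
console.log(full, half, typical)
```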
What's Next?
We're exploring:
- Distributed caching: Share cache across team members
- Embedding vectors: Pre-compute for semantic search
- Graph materialization: Cache relationship edges
- Real-time updates: WebSocket push when index changes