SQL-Cached Static NLP: Indexing 10,000+ Docs in Seconds for $0

November 19, 2024 · 12 min read · By Engineering Team

Frame Codex indexes thousands of markdown files on every commit. We needed sub-5-second indexing without spending money on LLM APIs. Here's how we built a SQL-cached static NLP pipeline that runs in GitHub Actions for free.

Performance

Real numbers from production:

  • First run (100 files): 30s
  • Subsequent runs (5 changed files): 2-5s
  • Cache hit rate: 85-95%
  • API costs: $0

The Problem

Every time someone opens a PR to Frame Codex, we need to:

  1. Extract keywords from every file (TF-IDF)
  2. Detect common phrases (n-grams)
  3. Auto-categorize content (vocabulary matching)
  4. Validate schema compliance
  5. Build searchable index

Doing this for 100 files takes ~30 seconds. For 1,000 files? 5 minutes. For 10,000? 50 minutes. GitHub Actions has a 6-hour timeout, but nobody wants to wait an hour for CI.

Solution: SHA-Based Diffing

The key insight: most files don't change between commits. A typical PR modifies 1-10 files out of thousands.

So we cache the analysis results and only re-process files that changed:

Step 1: Calculate SHA-256

const crypto = require('crypto')

function calculateSHA(content) {
  return crypto
    .createHash('sha256')
    .update(content)
    .digest('hex')
}

Step 2: Store in SQLite

CREATE TABLE files (
  path TEXT PRIMARY KEY,
  sha TEXT NOT NULL,
  mtime INTEGER NOT NULL,
  analysis TEXT NOT NULL,  -- JSON blob
  indexed_at INTEGER NOT NULL
);

-- On first run
INSERT INTO files (path, sha, mtime, analysis, indexed_at)
VALUES ('intro.md', 'abc123...', 1699999999, '{"keywords":[...]}', 1699999999);

Step 3: Compute Diff

async function getDiff(currentFiles) {
  const cached = await db.all('SELECT path, sha FROM files')
  const cachedMap = new Map(cached.map(f => [f.path, f.sha]))
  
  const added = []
  const modified = []
  const unchanged = []
  
  for (const file of currentFiles) {
    const currentSHA = calculateSHA(fs.readFileSync(file, 'utf8'))
    const cachedSHA = cachedMap.get(file)
    
    if (!cachedSHA) {
      added.push(file)  // New file
    } else if (cachedSHA !== currentSHA) {
      modified.push(file)  // Changed
    } else {
      unchanged.push(file)  // Use cache
    }
  }
  
  return { added, modified, unchanged }
}

Step 4: Merge Results

// Load cached analyses
for (const file of unchanged) {
  const cached = await db.get(
    'SELECT analysis FROM files WHERE path = ?',
    [file]
  )
  index.push(JSON.parse(cached.analysis))
}

// Process only changed files
for (const file of [...added, ...modified]) {
  const content = fs.readFileSync(file, 'utf8')
  const analysis = analyzeFile(file)  // TF-IDF, n-grams, etc.
  index.push(analysis)
  await db.run(
    'INSERT OR REPLACE INTO files (path, sha, analysis, ...) VALUES (?, ?, ?, ...)',
    [file, calculateSHA(content), JSON.stringify(analysis), ...]
  )
}

GitHub Actions Cache

The .cache/codex.db file persists across workflow runs using GitHub's cache action:

- name: Restore SQL cache
  uses: actions/cache@v4
  with:
    path: .cache/codex.db
    key: codex-cache-${{ hashFiles('weaves/**/*.md') }}
    restore-keys: |
      codex-cache-

The cache key includes a hash of all markdown files. If any file changes, the cache is invalidated and rebuilt. But within a single PR (multiple commits), the cache persists.

Static NLP Pipeline

We use classical NLP techniques that don't require API calls:

TF-IDF (Keyword Extraction)

// Term Frequency
tf = count(term, document) / totalWords(document)

// Inverse Document Frequency
idf = log(totalDocuments / documentsContaining(term))

// TF-IDF Score
tfidf = tf * idf

// Extract top 10 keywords per document
keywords = allTerms
  .map(term => ({ term, score: tfidf(term) }))
  .sort((a, b) => b.score - a.score)
  .slice(0, 10)
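
The pseudocode above can be written as a small self-contained function. This is an illustrative sketch, not the production implementation: the `tokenize` regex and the `topKeywords` name are our assumptions.

```javascript
// Minimal TF-IDF keyword extraction over an in-memory corpus (sketch).
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || []
}

function topKeywords(docs, docIndex, k = 10) {
  const tokenized = docs.map(tokenize)
  const words = tokenized[docIndex]

  // Term frequency in the target document
  const tf = {}
  for (const w of words) tf[w] = (tf[w] || 0) + 1 / words.length

  // Document frequency across the corpus
  const df = {}
  for (const doc of tokenized) {
    for (const w of new Set(doc)) df[w] = (df[w] || 0) + 1
  }

  // Score, sort descending, take top k
  return Object.keys(tf)
    .map(term => ({ term, score: tf[term] * Math.log(docs.length / df[term]) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
}
```

A term that is frequent in one document but rare across the corpus outranks globally common terms, which is exactly the ranking behavior we want from keywords.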

N-Gram Extraction

// Find common n-word phrases (called once with n = 2 and once with n = 3)
function extractNGrams(text, n = 2) {
  const words = text.toLowerCase().split(/\s+/)
  const ngrams = []
  
  for (let i = 0; i <= words.length - n; i++) {
    ngrams.push(words.slice(i, i + n).join(' '))
  }
  
  // Count frequencies
  const counts = {}
  ngrams.forEach(ng => counts[ng] = (counts[ng] || 0) + 1)
  
  // Return top phrases
  return Object.entries(counts)
    .sort((a, b) => b[1] - a[1])
    .slice(0, 20)
}

Vocabulary Matching

// Controlled vocabulary in tags/index.yaml
const vocabulary = {
  subjects: ['technology', 'science', 'arts', ...],
  topics: ['algorithms', 'machine-learning', ...]
}

// Match keywords to vocabulary
function categorize(keywords) {
  const matches = {}
  
  for (const subject of vocabulary.subjects) {
    const score = keywords.filter(k => 
      k.includes(subject) || subject.includes(k)
    ).length
    
    if (score > 0) matches[subject] = score
  }
  
  return Object.entries(matches)
    .sort((a, b) => b[1] - a[1])
    .map(([subject]) => subject)
}

Performance Breakdown

Here's where the time goes in a typical 100-file index:

  • File I/O: 6s (20%)
  • TF-IDF: 15s (50%)
  • N-grams: 6s (20%)
  • Validation: 3s (10%)

With SQL caching, we skip 85-95% of this work on subsequent runs.
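
A back-of-the-envelope model for that claim: if a full pass costs T seconds and the cache hit rate is h, a subsequent run costs roughly (1 - h) × T plus a small diff overhead. The function and the 0.2s overhead default are our simplifications, not measured values.

```javascript
// Rough cost model for an incremental run (illustrative, not measured).
function estimateRunSeconds(fullRunSeconds, cacheHitRate, diffOverheadSeconds = 0.2) {
  return (1 - cacheHitRate) * fullRunSeconds + diffOverheadSeconds
}
```

Plugging in the headline numbers, a 30s full pass at a 90% hit rate lands around 3s, which lines up with the 2-5s subsequent-run figure above.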

Implementation Details

Cache Database Schema

-- File metadata and analysis cache
CREATE TABLE files (
  path TEXT PRIMARY KEY,
  sha TEXT NOT NULL,           -- SHA-256 of content
  mtime INTEGER NOT NULL,       -- Last modified timestamp
  size INTEGER NOT NULL,        -- File size in bytes
  analysis TEXT NOT NULL,       -- JSON analysis result
  indexed_at INTEGER NOT NULL   -- When indexed
);

-- Keyword cache (for TF-IDF optimization)
CREATE TABLE keywords (
  file_path TEXT NOT NULL,
  keyword TEXT NOT NULL,
  tfidf_score REAL NOT NULL,
  frequency INTEGER NOT NULL,
  PRIMARY KEY (file_path, keyword),
  FOREIGN KEY (file_path) REFERENCES files(path) ON DELETE CASCADE
);

-- Loom/Weave aggregate statistics
CREATE TABLE stats (
  scope TEXT PRIMARY KEY,       -- Loom or weave path
  scope_type TEXT NOT NULL,     -- 'loom' or 'weave'
  total_files INTEGER NOT NULL,
  total_keywords INTEGER NOT NULL,
  avg_difficulty TEXT,
  subjects TEXT,                -- JSON array
  topics TEXT,                  -- JSON array
  last_updated INTEGER NOT NULL
);

-- Indexes for performance
CREATE INDEX idx_files_sha ON files(sha);
CREATE INDEX idx_keywords_score ON keywords(tfidf_score DESC);

Diff Algorithm

async function computeDiff(currentFiles) {
  const startTime = Date.now()
  
  // Load cached SHAs (fast: ~50ms for 1000 files)
  const cached = await db.all('SELECT path, sha FROM files')
  const cachedMap = new Map(cached.map(f => [f.path, f.sha]))
  
  const added = []
  const modified = []
  const unchanged = []
  
  // Compare SHAs (fast: ~100ms for 1000 files)
  for (const file of currentFiles) {
    const content = fs.readFileSync(file, 'utf8')
    const currentSHA = calculateSHA(content)
    const cachedSHA = cachedMap.get(file)
    
    if (!cachedSHA) {
      added.push(file)
    } else if (cachedSHA !== currentSHA) {
      modified.push(file)
    } else {
      unchanged.push(file)
    }
  }
  
  console.log(`Diff computed in ${Date.now() - startTime}ms`)
  console.log(`  Added: ${added.length}`)
  console.log(`  Modified: ${modified.length}`)
  console.log(`  Unchanged: ${unchanged.length} (using cache)`)
  
  return { added, modified, unchanged }
}

Incremental Index Build

async function buildIndex() {
  const cache = await CodexCacheDB.create()
  const allFiles = collectMarkdownFiles()
  
  // Compute diff
  const { added, modified, unchanged } = await cache.getDiff(allFiles)
  
  // Load cached analyses (instant)
  const index = []
  for (const file of unchanged) {
    const cached = await cache.getCachedAnalysis(file)
    if (cached) index.push(cached)
  }
  
  // Process only changed files
  for (const file of [...added, ...modified]) {
    const content = fs.readFileSync(file, 'utf8')
    const analysis = await analyzeFile(file)  // TF-IDF, n-grams, etc.
    index.push(analysis)
    await cache.saveFileAnalysis(file, content, analysis)
  }
  
  // Save index
  fs.writeFileSync('codex-index.json', JSON.stringify(index, null, 2))
  
  await cache.close()
}

Why better-sqlite3?

We use better-sqlite3 instead of Postgres or a cloud database because:

  • Zero latency: No network calls, file is local
  • Transactional: ACID guarantees, WAL mode
  • Portable: Single .db file, easy to cache/restore
  • Fast: 1M+ reads/sec on modern hardware
  • Free: No hosting costs

Browser Caching (IndexedDB)

The same caching strategy works in the browser using IndexedDB:

import { createDatabase } from '@framers/sql-storage-adapter'

// Automatically uses IndexedDB in browser
const db = await createDatabase({
  priority: ['indexeddb', 'sqljs']
})

// Cache fetched index
await db.run(
  'INSERT OR REPLACE INTO local_index (etag, data) VALUES (?, ?)',
  [response.headers.get('etag'), JSON.stringify(index)]
)

// On next visit, check etag
const cached = await db.get('SELECT data FROM local_index WHERE etag = ?', [etag])
if (cached) {
  // Use cached data, no network request
  return JSON.parse(cached.data)
}

Cost Analysis

Let's compare approaches for indexing 10,000 markdown files:

| Approach         | Time   | Cost | Notes                      |
|------------------|--------|------|----------------------------|
| Naive (no cache) | 50 min | $0   | Too slow for CI            |
| GPT-4 analysis   | 15 min | $200 | $0.02/file                 |
| Embedding API    | 5 min  | $5   | $0.0005/file               |
| SQL-cached NLP   | 30s    | $0   | First run                  |
| SQL-cached NLP   | 3s     | $0   | Subsequent (95% cache hit) |

Scaling to Millions

The architecture scales linearly. With proper indexing and loom-scoped caching:

  • 100 files: 30s first run, 2-5s subsequent
  • 1,000 files: 5 min first run, 10-20s subsequent
  • 10,000 files: 50 min first run, 1-2 min subsequent
  • 100,000 files: 8 hours first run, 5-10 min subsequent

The first run is expensive, but you only do it once. After that, PRs complete in seconds.

Try It Yourself

The entire caching system is open source.

Lessons Learned

  1. SHA hashing is fast: 1000 files in ~100ms
  2. SQLite is underrated: Faster than Postgres for local workloads
  3. Static NLP works: TF-IDF gets you 80% of the way without LLMs
  4. GitHub Actions cache is reliable: 95%+ cache restore rate
  5. Incremental is always better: Even 50% cache hit = 2x speedup

What's Next?

We're exploring:

  • Distributed caching: Share cache across team members
  • Embedding vectors: Pre-compute for semantic search
  • Graph materialization: Cache relationship edges
  • Real-time updates: WebSocket push when index changes