# Storage Strategy: Valkey vs JSON vs NumPy
This article explains why we use different storage technologies for different types of data in the SEO pipeline and search service.
## The Problem: One Size Doesn't Fit All

Different data has different access patterns:

- **Embeddings** (5K x 384 floats): need fast vector operations (cosine similarity)
- **Phrase mappings** (2,500+ phrases): need human editing and version control
- **Query cache** (live queries): need fast key-value lookups and TTL expiration
- **Product data** (64K products): need structured access and compatibility rules
Using the same storage for all would be inefficient.
## Three Storage Technologies

### 1. NumPy Arrays: Vector Operations

**Use case:** embeddings (products, queries, phrases)

**Why NumPy:**

- **Fast vector math:** optimized C/Fortran libraries for matrix operations
- **Memory-mapped files:** load large arrays without copying them into RAM
- **Batch operations:** process thousands of vectors in milliseconds
- **Standard format:** compatible with ML libraries (scikit-learn, TensorFlow)
**File format:** binary `.npy` files

**Loading:**

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Memory-mapped: the OS pages data in on demand instead of
# reading the whole file into RAM up front
embeddings = np.load('embeddings.npy', mmap_mode='r')

# Cosine similarity between a query vector and all embeddings;
# query_embedding must be 2-D, e.g. shape (1, 384)
similarities = cosine_similarity(query_embedding, embeddings)
```
**Performance:** a full similarity scan across all 5K embeddings completes in milliseconds, fast enough for real-time inference.
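To sanity-check that claim, a hypothetical micro-benchmark along these lines can be run; the file name and shapes come from the sections above, and the random query vector is a stand-in for a real embedding:

```python
import time

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

embeddings = np.load('embeddings.npy', mmap_mode='r')          # (5000, 384)
query = np.random.rand(1, embeddings.shape[1]).astype(np.float32)

start = time.perf_counter()
scores = cosine_similarity(query, embeddings)[0]               # (5000,)
top10 = np.argsort(scores)[::-1][:10]                          # best matches first
print(f"full scan: {(time.perf_counter() - start) * 1000:.1f} ms")
```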
### 2. JSON Files: Human-Editable Data

**Use case:** configuration, mappings, metadata

**Why JSON:**

- **Human-readable:** easy to inspect and debug
- **Version control:** Git diffs show exactly what changed
- **Manual editing:** errors can be fixed without code
- **Universal format:** every language can parse JSON
**What we store:**

- **Phrase-to-filter mappings:** phrases → filter rules
- **Product features:** products → feature dictionaries
- **Query metadata:** queries → clicks, impressions, sources
- **Pipeline configuration:** step parameters, thresholds
**File format:** text `.json` files

**Loading:**

```python
import json

with open('phrase_mappings.json') as f:
    mappings = json.load(f)

# Look up the filter rules for a phrase
filters = mappings['mini pc']  # ['form_factor:mini', 'category:pc']
```
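For context, `phrase_mappings.json` is assumed to be a flat object keyed by phrase; the first entry matches the lookup above, while the second is invented purely for illustration:

```json
{
  "mini pc": ["form_factor:mini", "category:pc"],
  "gaming laptop": ["category:laptop", "use_case:gaming"]
}
```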
### 3. Valkey (Redis): Fast Key-Value Cache

**Use case:** live search, autocomplete, popular queries

**Why Valkey:**

- **In-memory:** microsecond latency for lookups
- **RediSearch:** vector similarity search with indexes
- **TTL expiration:** automatic cache invalidation
- **Pub/sub:** real-time updates across servers
- **Persistence:** optional disk snapshots for durability
**What we store:**

- **Query embeddings cache:** recent queries → embeddings
- **Popular queries:** top 1,000 queries by traffic
- **Autocomplete index:** prefix → query suggestions
- **Filter extraction cache:** query → extracted filters
- **Related searches cache:** query → related queries
**Data structures** (illustrated in the sketch after the loading snippet):

- **Strings:** simple key-value (query → embedding)
- **Sorted sets:** ranked data (popular queries by score)
- **RediSearch indexes:** vector similarity search
- **Hashes:** structured data (query metadata)
**Loading:**

```python
from app.shared.valkey_cache import get_valkey

valkey = get_valkey()
# ... (implementation details omitted)
```
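Since the loading helper above is project-internal, here is a minimal sketch of the data structures listed earlier, assuming a redis-py-compatible client; all key names and values are illustrative:

```python
import numpy as np
import redis  # Valkey speaks the Redis protocol, so redis-py works as a client

valkey = redis.Redis(host='localhost', port=6379)

# String with TTL: cache a query embedding, auto-expired after one hour
embedding = np.random.rand(384).astype(np.float32)  # stand-in vector
valkey.set('emb:mini pc', embedding.tobytes(), ex=3600)

# Sorted set: popular queries ranked by traffic score
valkey.zadd('popular_queries', {'mini pc': 1520, 'gaming laptop': 980})
top = valkey.zrevrange('popular_queries', 0, 9, withscores=True)

# Hash: structured query metadata
valkey.hset('meta:mini pc', mapping={'clicks': 42, 'impressions': 900})
```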
## Decision Matrix

### When to Use NumPy

**Criteria:**

- Data is numeric (floats, ints)
- Need fast vector operations (dot product, cosine similarity)
- Data is read-heavy (rarely updated)
- Data is large (millions of numbers)
- Batch processing (process many items at once)

**Examples:**

- Embeddings (products, queries, phrases)
- Feature vectors
- Similarity matrices
### When to Use JSON

**Criteria:**

- Data is structured (objects, arrays)
- Need human readability
- Need version control (Git)
- Data changes occasionally (manual edits)
- Data is small to medium (<100 MB)

**Examples:**

- Configuration files
- Phrase mappings
- Product metadata
- Pipeline parameters
### When to Use Valkey

**Criteria:**

- Need fast lookups (microseconds)
- Data changes frequently (live queries)
- Need TTL expiration (cache invalidation)
- Need pub/sub (real-time updates)
- Need vector search (RediSearch)

**Examples** (a minimal rate-limiting sketch follows this list):

- Query cache
- Autocomplete
- Popular queries
- Session data
- Rate limiting
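As an example of the TTL criterion in action, a fixed-window rate limiter is a few lines of Valkey; the limit, window, and key prefix here are illustrative, not taken from our service:

```python
import redis

valkey = redis.Redis()

def allow_request(client_id: str, limit: int = 100, window_s: int = 60) -> bool:
    """Fixed-window rate limiting: at most `limit` requests per window."""
    key = f'rate:{client_id}'
    count = valkey.incr(key)            # atomic per-client counter
    if count == 1:
        valkey.expire(key, window_s)    # TTL starts the window on the first hit
    return count <= limit
```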
## Hybrid Approach

We combine all three for optimal performance:

### Pipeline (Offline Processing)

- **NumPy:** compute embeddings and similarity matrices
- **JSON:** store mappings, metadata, and configuration
- **Valkey:** not used (the pipeline runs offline)

### Search Service (Live Queries)

- **NumPy:** load embeddings from disk (memory-mapped)
- **JSON:** load mappings from disk (cached in memory)
- **Valkey:** cache live queries, autocomplete, popular queries
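A rough sketch of how the three layers meet in the search service's read path; the file paths, key names, and cache-miss handling are illustrative, not the real service code:

```python
import json

import numpy as np
import redis

valkey = redis.Redis()                                     # Valkey: live cache
embeddings = np.load('embeddings.npy', mmap_mode='r')      # NumPy: vectors
with open('mappings.json') as f:
    mappings = json.load(f)                                # JSON: static mappings

def query_embedding(query: str) -> np.ndarray | None:
    """Return a cached query embedding, or None on a cache miss."""
    raw = valkey.get(f'emb:{query}')
    return np.frombuffer(raw, dtype=np.float32) if raw else None
```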
## Data Flow

```mermaid
graph LR
    Pipeline[SEO Pipeline<br/>Offline]
    subgraph Storage
        NP[NumPy Files<br/>embeddings.npy]
        JS[JSON Files<br/>mappings.json]
    end
    Search[Search Service<br/>Live]
    VK[Valkey<br/>Cache]
    Pipeline --> NP
    Pipeline --> JS
    NP --> Search
    JS --> Search
    Search --> VK
    VK --> Search
```

## Performance Comparison
**NumPy (memory-mapped):**

- Load time: near-instant (zero-copy mapping into virtual memory)
- Lookup time: near-zero (direct indexing into the mapped array)
- Memory: minimal (served from the OS page cache, not resident in process RAM)

**JSON:**

- Load time: slow (sequential parsing and object instantiation)
- Lookup time: fast (ordinary dict hash lookup)
- Memory: substantial (the entire deserialized structure lives on the heap)

**Valkey:**

- Load time: none (data stays resident in the background service)
- Lookup time: moderate (includes a network round-trip and protocol serialization)
- Memory: externalized (held by the database process, not the application)
**Winner:** NumPy for high-throughput batching; Valkey for distributed single-key access.
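The memory-mapped load claim is easy to verify: `np.load(..., mmap_mode='r')` returns almost immediately because it only maps the file, and disk I/O happens lazily on first access (file name as above):

```python
import time

import numpy as np

start = time.perf_counter()
emb = np.load('embeddings.npy', mmap_mode='r')   # maps the array, no bulk read
print(f"mapped in {(time.perf_counter() - start) * 1000:.2f} ms")

row = np.array(emb[1234])  # first real disk I/O happens here, page by page
```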
## Vector Similarity Search

**NumPy (cosine_similarity):**

- Batch operations: fast (SIMD-vectorized linear algebra)
- Parallelizable: highly (scales across all available CPU cores)
- Memory: linear (scales with embedding count and dimension)

**Valkey (RediSearch):**

- Search speed: superior for single queries (specialized vector indexes such as HNSW)
- Parallelizable: limited (constrained by the engine's threading model)
- Memory: heavy (raw embeddings plus index metadata)
**Winner:** Valkey for low-latency live search; NumPy for heavy offline analytical processing.
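For reference, a RediSearch KNN query via redis-py might look like the sketch below; it assumes an already-created vector index named `idx:queries` with a float32 `embedding` field (index creation omitted; all names are illustrative):

```python
import numpy as np
import redis
from redis.commands.search.query import Query

valkey = redis.Redis()
vec = np.random.rand(384).astype(np.float32)       # stand-in query embedding

q = (
    Query('*=>[KNN 10 @embedding $vec AS score]')  # 10 nearest neighbours
    .sort_by('score')
    .return_fields('query_text', 'score')
    .dialect(2)
)
results = valkey.ft('idx:queries').search(q, query_params={'vec': vec.tobytes()})
for doc in results.docs:
    print(doc.query_text, doc.score)
```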
## Configuration Lookup

**JSON:**

- Load time: variable (proportional to configuration size)
- Lookup time: fast (native dictionary access)
- Edit time: frictionless (edit the text file by hand)

**Valkey:**

- Load time: none (available as soon as the client connects)
- Lookup time: network-bound (dominated by request overhead)
- Edit time: programmatic (requires CLI or client SET commands)
**Winner:** JSON for static configuration; Valkey for dynamic shared state.
## References

### Technologies

- NumPy - array computing library
- JSON - data interchange format
- Valkey - in-memory data store (Redis fork)
- RediSearch - vector search module

### Technical Concepts

- Memory-mapped file - Wikipedia
- Key-value database - Wikipedia
- Vector database - Wikipedia
## Related Articles

- SEO Pipeline Overview - complete pipeline architecture
- Search Service Architecture - live search with Valkey
- Multi-Server Architecture - storage per server
- Incremental Processing - caching strategy
## Summary

We use three storage technologies, each optimized for a different use case:

**NumPy (embeddings):**

- Fast vector operations (cosine similarity)
- Memory-mapped loading
- Batch processing
- Standard ML format

**JSON (configuration):**

- Human-readable (easy debugging)
- Version control (Git diffs)
- Manual editing (no code needed)
- Universal format

**Valkey (live cache):**

- Fast lookups (microseconds)
- Vector search (RediSearch)
- TTL expiration (auto-invalidation)
- Pub/sub (real-time updates)

**Hybrid approach:** use the right tool for each job, and combine them for optimal performance.