Storage Strategy: Valkey vs JSON vs NumPy

This article explains why we use different storage technologies for different types of data in the SEO pipeline and search service.

The Problem: One Size Doesn't Fit All

Different data has different access patterns:

  • Embeddings (5K x 384 floats): Need fast vector operations (cosine similarity)

  • Phrase mappings (2500+ phrases): Need human editing, version control

  • Query cache (live queries): Need fast key-value lookups, TTL expiration

  • Product data (64K products): Need structured access, compatibility rules

Using the same storage for all would be inefficient.

Three Storage Technologies

1. NumPy Arrays: Vector Operations

Use case: Embeddings (products, queries, phrases)

Why NumPy:

  • Fast vector math: Optimized C/Fortran libraries for matrix operations

  • Memory-mapped files: Load large arrays without copying to RAM

  • Batch operations: Process thousands of vectors in milliseconds

  • Standard format: Compatible with ML libraries (scikit-learn, TensorFlow)

File format: Binary .npy files

Loading:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Memory-mapped (doesn't load the entire file into RAM)
embeddings = np.load('embeddings.npy', mmap_mode='r')

# Compare the query against all rows at once;
# query_embedding must be 2-D, shape (1, 384)
similarities = cosine_similarity(query_embedding, embeddings)

Performance: a single vectorized call compares one query against all 5K embeddings in milliseconds, fast enough for real-time inference.
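The same batch pattern extends to top-k retrieval in pure NumPy. A minimal sketch with random stand-in data (sklearn's cosine_similarity computes the same normalized dot product):

```python
import numpy as np

def cosine_top_k(query, embeddings, k=10):
    """Return indices of the k most similar rows, best first."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q                              # cosine similarity per row
    top = np.argpartition(scores, -k)[-k:]      # unordered top-k in O(n)
    return top[np.argsort(scores[top])[::-1]]   # sort only those k

rng = np.random.default_rng(0)
embeddings = rng.random((5000, 384)).astype(np.float32)  # stand-in for embeddings.npy
query = embeddings[42]                                   # a row is most similar to itself

best = cosine_top_k(query, embeddings, k=10)
```

argpartition avoids sorting all 5K scores when only the top handful are needed.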

2. JSON Files: Human-Editable Data

Use case: Configuration, mappings, metadata

Why JSON:

  • Human-readable: Easy to inspect and debug

  • Version control: Git diffs show exactly what changed

  • Manual editing: Can fix errors without code

  • Universal format: Every language can parse JSON

What we store:

  • Phrase-to-filter mappings: phrases → filter rules

  • Product features: products → feature dictionaries

  • Query metadata: queries → clicks, impressions, sources

  • Pipeline configuration: Step parameters, thresholds

File format: Text .json files

Loading:

import json

with open('phrase_mappings.json') as f:
    mappings = json.load(f)

# Access data
filters = mappings['mini pc']  # ['form_factor:mini', 'category:pc']
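Because clean Git diffs are a stated benefit, it helps to write these files deterministically. A sketch (the temp path is just for illustration):

```python
import json
import os
import tempfile

mappings = {'mini pc': ['form_factor:mini', 'category:pc']}

# indent + sort_keys give stable, line-oriented output,
# so a Git diff shows exactly which phrase or filter changed
path = os.path.join(tempfile.gettempdir(), 'phrase_mappings.json')
with open(path, 'w') as f:
    json.dump(mappings, f, indent=2, sort_keys=True, ensure_ascii=False)

with open(path) as f:
    reloaded = json.load(f)  # round-trips losslessly
```

Without sort_keys, dict insertion order leaks into the file and unrelated edits can reshuffle lines, producing noisy diffs.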

3. Valkey (Redis): Fast Key-Value Cache

Use case: Live search, autocomplete, popular queries

Why Valkey:

  • In-memory: Microsecond latency for lookups

  • RediSearch: Vector similarity search with indexes

  • TTL expiration: Automatic cache invalidation

  • Pub/sub: Real-time updates across servers

  • Persistence: Optional disk snapshots for durability
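TTL expiration is worth seeing concretely. This toy in-process cache (hypothetical, not the service code) mimics the SETEX/GET semantics Valkey provides natively:

```python
import time

class TTLCache:
    """Toy cache mirroring Valkey's SETEX/GET with lazy expiration."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def setex(self, key, ttl_seconds, value):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:  # expired: evict lazily on read
            del self._store[key]
            return None
        return value

cache = TTLCache()
cache.setex('emb:mini pc', 60, [0.12, 0.98])
```

Valkey does this server-side (with both lazy and periodic eviction), so every cached query embedding ages out automatically without application code.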

What we store:

  • Query embeddings cache: Recent queries → embeddings

  • Popular queries: Top 1000 queries by traffic

  • Autocomplete index: Prefix → query suggestions

  • Filter extraction cache: Query → extracted filters

  • Related searches cache: Query → related queries
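The autocomplete entry (prefix → query suggestions) can be sketched without a server; in Valkey the same index would live in sorted sets or a suggestion dictionary (the key shapes here are hypothetical):

```python
from collections import defaultdict

def build_prefix_index(queries, max_prefix=10):
    """Map every prefix (up to max_prefix chars) to the queries it matches."""
    index = defaultdict(list)
    for q in queries:
        for i in range(1, min(len(q), max_prefix) + 1):
            index[q[:i]].append(q)
    return index

idx = build_prefix_index(['mini pc', 'mini fridge', 'monitor'])
suggestions = idx['min']  # -> ['mini pc', 'mini fridge']
```

Precomputing prefixes trades memory for O(1) lookups, which is the same trade Valkey makes by holding the index in RAM.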

Data structures:

  • Strings: Simple key-value (query → embedding)

  • Sorted sets: Ranked data (popular queries by score)

  • RediSearch indexes: Vector similarity search

  • Hashes: Structured data (query metadata)

Loading:

from app.shared.valkey_cache import get_valkey

valkey = get_valkey()
# ... (implementation details omitted)
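Since the implementation is omitted, here is the general read-through pattern such a client enables. This is a sketch: it assumes the object returned by get_valkey exposes Redis-style get/setex, and the dict-backed stub below merely stands in for it:

```python
import json

def cached_embedding(client, query, compute_fn, ttl=3600):
    """Return the cached embedding for a query, computing and storing on a miss."""
    key = f'emb:{query}'
    hit = client.get(key)
    if hit is not None:
        return json.loads(hit)          # cache hit: skip the model entirely
    value = compute_fn(query)           # cache miss: compute once...
    client.setex(key, ttl, json.dumps(value))  # ...store with a TTL...
    return value                        # ...and return

class FakeClient:
    """Minimal stand-in for a Valkey client (ignores TTL)."""
    def __init__(self):
        self.data = {}
    def get(self, key):
        return self.data.get(key)
    def setex(self, key, ttl, value):
        self.data[key] = value

client = FakeClient()
emb = cached_embedding(client, 'mini pc', lambda q: [0.1, 0.2])
```

Repeat queries never touch the embedding model; the TTL bounds how stale a cached embedding can get.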

Decision Matrix

When to Use NumPy

Criteria:

  • Data is numeric (floats, ints)

  • Need fast vector operations (dot product, cosine similarity)

  • Data is read-heavy (rarely updated)

  • Data is large (millions of numbers)

  • Batch processing (process many items at once)

Examples:

  • Embeddings (products, queries, phrases)

  • Feature vectors

  • Similarity matrices

When to Use JSON

Criteria:

  • Data is structured (objects, arrays)

  • Need human readability

  • Need version control (Git)

  • Data changes occasionally (manual edits)

  • Data is small-medium (<100 MB)

Examples:

  • Configuration files

  • Phrase mappings

  • Product metadata

  • Pipeline parameters

When to Use Valkey

Criteria:

  • Need fast lookups (microseconds)

  • Data changes frequently (live queries)

  • Need TTL expiration (cache invalidation)

  • Need pub/sub (real-time updates)

  • Need vector search (RediSearch)

Examples:

  • Query cache

  • Autocomplete

  • Popular queries

  • Session data

  • Rate limiting

Hybrid Approach

We combine all three for optimal performance:

Pipeline (Offline Processing)

NumPy: Compute embeddings, similarity matrices

JSON: Store mappings, metadata, configuration

Valkey: Not used (pipeline runs offline)

Search Service (Live Queries)

NumPy: Load embeddings from disk (memory-mapped)

JSON: Load mappings from disk (cached in memory)

Valkey: Cache live queries, autocomplete, popular queries

Data Flow

graph LR
    Pipeline[SEO Pipeline<br/>Offline]
    subgraph Storage
        NP[NumPy Files<br/>embeddings.npy]
        JS[JSON Files<br/>mappings.json]
    end
    Search[Search Service<br/>Live]
    VK[Valkey<br/>Cache]
    Pipeline --> NP
    Pipeline --> JS
    NP --> Search
    JS --> Search
    Search --> VK
    VK --> Search

Performance Comparison

NumPy (memory-mapped):

  • Load time: Near-instant (zero-copy mapping into virtual memory)

  • Lookup time: Near-zero (direct array indexing in process memory)

  • Memory: Minimal (pages are cached by the OS, not resident in process RAM)

JSON:

  • Load time: Slow for large files (sequential parsing plus object construction)

  • Lookup time: Fast (ordinary dictionary access)

  • Memory: High (the entire parsed structure lives on the heap)

Valkey:

  • Load time: None (the server is already running; clients just connect)

  • Lookup time: Moderate (each request pays a network round-trip plus protocol serialization)

  • Memory: External (held by the database process, not the application)

Winner: NumPy for high-throughput batching, Valkey for distributed single-key access
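The load-time gap is easy to demonstrate. A rough sketch with synthetic data (absolute numbers vary by machine; only the ordering matters):

```python
import json
import os
import tempfile
import time

import numpy as np

arr = np.random.rand(100_000)

with tempfile.TemporaryDirectory() as d:
    npy_path = os.path.join(d, 'emb.npy')
    json_path = os.path.join(d, 'emb.json')
    np.save(npy_path, arr)
    with open(json_path, 'w') as f:
        json.dump(arr.tolist(), f)

    t0 = time.perf_counter()
    mmapped = np.load(npy_path, mmap_mode='r')  # header parse only, zero copy
    t_npy = time.perf_counter() - t0

    t0 = time.perf_counter()
    with open(json_path) as f:
        parsed = json.load(f)                   # full parse into Python objects
    t_json = time.perf_counter() - t0

    same = bool(np.allclose(mmapped[:100], parsed[:100]))
```

The memory-mapped load reads only the .npy header, while json.load must tokenize every number and build a Python object for each one.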

Vector Similarity Search

NumPy (cosine_similarity):

  • Batch operations: Rapid (optimized via SIMD/vectorized linear algebra)

  • Parallelizable: High (scales across all available physical CPU cores)

  • Memory: Linear (scales with the number of embeddings times their dimensionality)

Valkey (RediSearch):

  • Search speed: Fast for single queries (approximate nearest-neighbor search via HNSW vector indexes)

  • Parallelizable: Limited (constrained by the engine's threading model)

  • Memory: Intensive (requires raw embeddings plus indexing metadata)

Winner: Valkey for low-latency live search, NumPy for heavy offline batch processing

Configuration Lookup

JSON:

  • Load time: Variable (proportional to configuration complexity)

  • Lookup time: Fast (native dictionary access)

  • Edit time: Frictionless (human-readable text modification)

Valkey:

  • Load time: Immediate (active on connection)

  • Lookup time: Network-bound (dependent on request overhead)

  • Edit time: Programmatic (requires CLI or client SET commands)

Winner: JSON for static configurations, Valkey for dynamic shared state

References

Technologies

  • NumPy - Array computing library

  • JSON - Data interchange format

  • Valkey - In-memory data store (Redis fork)

  • RediSearch - Vector search module

Summary

We use three storage technologies optimized for different use cases:

NumPy (embeddings):

  • Fast vector operations (cosine similarity)

  • Memory-mapped

  • Batch processing

  • Standard ML format

JSON (configuration):

  • Human-readable (easy debugging)

  • Version control (Git diffs)

  • Manual editing (no code needed)

  • Universal format

Valkey (live cache):

  • Fast lookups (microseconds)

  • Vector search (RediSearch)

  • TTL expiration (auto-invalidation)

  • Pub/sub (real-time updates)

Hybrid approach: Use the right tool for each job, combine for optimal performance.

