Query Clustering: All-to-All Similarity with Batched Processing
This article explains how we cluster search queries into semantic groups using all-to-all cosine similarity comparison with batched processing for memory efficiency.
The Problem: Grouping Similar Queries
Users search for the same thing in many ways:
- "mini pc"
- "small computer"
- "compact desktop"
- "tiny pc"
These queries should lead to the same page. We need to cluster semantically similar queries together so we can:
- Create one query page per cluster
- Consolidate traffic across similar searches
- Avoid duplicate content
- Improve SEO by focusing on high-traffic clusters
With large query volumes, we need an efficient clustering algorithm that finds the best matches without exhausting memory.
The Algorithm: All-to-All Batched Comparison
We use an all-to-all similarity comparison where every query is compared to every other query to find its best match. To avoid memory exhaustion with large numbers of queries, we batch the process.
Step 1: Load Embeddings
We load pre-computed embeddings into memory. Each query is represented as a fixed-dimensional vector from a sentence transformer model.
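A minimal sketch of this step. In the pipeline the matrix would come from disk (for example via `np.load`); here we synthesize a small (N, D) float32 matrix so the example is self-contained, and normalize rows up front so that cosine similarity later reduces to a plain dot product:

```python
import numpy as np

# Sketch: in the pipeline the matrix would be loaded from disk
# (e.g. np.load("query_embeddings.npy") -- filename hypothetical);
# here we synthesize a small (N, D) float32 matrix instead.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 8)).astype(np.float32)  # N=6 queries, D=8 dims

# Normalize rows once so cosine similarity is just a dot product later.
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = embeddings / norms

print(embeddings.shape)                                      # (6, 8)
print(np.allclose(np.linalg.norm(embeddings, axis=1), 1.0))  # True
```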
Step 2: Batch Processing
We process queries in batches to avoid memory overflow. Each batch computes cosine similarity scores between its queries and all available queries. By iterating through batches, every query gets compared to all others without loading all similarities into memory simultaneously.
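The batched loop can be sketched as follows; data sizes, batch size, and variable names are illustrative. Each iteration holds only a (batch_size, N) slice of the similarity matrix in memory, and the diagonal entries are masked so a query can never match itself:

```python
import numpy as np

# Illustrative sketch with synthetic, row-normalized embeddings.
rng = np.random.default_rng(0)
N, D, BATCH_SIZE = 10, 8, 4
embeddings = rng.normal(size=(N, D)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

best_match = np.empty(N, dtype=np.int64)
best_score = np.empty(N, dtype=np.float32)

for start in range(0, N, BATCH_SIZE):
    batch = embeddings[start:start + BATCH_SIZE]
    # Cosine similarity of this batch against ALL queries: shape (batch, N).
    # Only batch_size x N scores exist in memory at any time.
    sims = batch @ embeddings.T
    # A query must not match itself; mask the diagonal entries.
    rows = np.arange(batch.shape[0])
    sims[rows, start + rows] = -np.inf
    best_match[start:start + BATCH_SIZE] = sims.argmax(axis=1)
    best_score[start:start + BATCH_SIZE] = sims.max(axis=1)

print(best_match.shape)  # (10,)
```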
Step 3: Find Best Match
For each query in the batch, we find the single most similar query across the full set. That best match becomes the query's cluster center, and the query joins its cluster. If the best similarity score is below the configured threshold, the query remains unclustered for now.
Step 4: Singleton Clusters
After all batches complete, any query that wasn't assigned to a cluster becomes a singleton cluster (a cluster containing only itself). This ensures every query belongs to exactly one cluster.
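Steps 3 and 4 together can be sketched like this. The scores, threshold, and the convention that the lower index acts as the cluster center are all illustrative assumptions, not the pipeline's exact rules:

```python
import numpy as np

# Sketch of steps 3-4 (threshold assignment + singleton fallback).
# Scores, threshold, and the center-selection convention are illustrative.
THRESHOLD = 0.75
best_match = np.array([1, 0, 3, 2, 0])            # most similar other query
best_score = np.array([0.91, 0.91, 0.40, 0.40, 0.80])

clusters = {}        # cluster center index -> list of member indices
assigned = set()

for q, (m, s) in enumerate(zip(best_match, best_score)):
    if s >= THRESHOLD:
        center = min(q, int(m))   # assumed convention: lower index is the center
        clusters.setdefault(center, []).append(q)
        assigned.add(q)

# Step 4: every unassigned query becomes its own singleton cluster.
for q in range(len(best_match)):
    if q not in assigned:
        clusters[q] = [q]

print(clusters)  # {0: [0, 1, 4], 2: [2], 3: [3]}
```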
Configurable Threshold
The similarity threshold determines how strict clustering is:
- Higher threshold: Only queries with cosine similarity above this value join the same cluster; results in more clusters with tighter semantic grouping
- Lower threshold: Queries with lower similarity scores get grouped together; results in fewer, larger clusters with broader semantic coverage
The threshold is configured in the application config file and can be adjusted without changing code.
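To make the effect concrete, here is a tiny illustration; the similarity value and thresholds are made up, not measured:

```python
def joins_cluster(similarity, threshold):
    """A query joins its best match's cluster only above the threshold."""
    return similarity >= threshold

# Illustrative cosine similarity between two related queries.
sim = 0.62
print(joins_cluster(sim, threshold=0.5))  # True  -> grouped (broader clusters)
print(joins_cluster(sim, threshold=0.8))  # False -> stays apart (tighter clusters)
```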
Cluster Ranking
Clusters are ranked by total traffic score from Google Search Console impressions and clicks. For each cluster, the sum of all member queries' traffic scores determines its ranking. High-traffic clusters get priority for query page generation.
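A sketch of the ranking step. The field names, example numbers, and the exact weighting of impressions versus clicks are illustrative assumptions; the pipeline derives its scores from Google Search Console data:

```python
# Rank clusters by the summed traffic score of their member queries.
clusters = {
    "mini pc": [
        {"query": "mini pc", "impressions": 5000, "clicks": 300},
        {"query": "small computer", "impressions": 1200, "clicks": 80},
    ],
    "gaming mouse": [
        {"query": "gaming mouse", "impressions": 900, "clicks": 45},
    ],
}

def traffic_score(q):
    # Illustrative weighting: a click counts more than a raw impression.
    return q["impressions"] + 10 * q["clicks"]

ranked = sorted(
    clusters.items(),
    key=lambda item: sum(traffic_score(q) for q in item[1]),
    reverse=True,
)
print([name for name, _ in ranked])  # ['mini pc', 'gaming mouse']
```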
Output Format
The clustering produces a JSON file with statistics and cluster data. An abridged sketch of the shape (field names other than stats.threshold are illustrative):

{
  "stats": {
    "threshold": "configured_value",
    ...
  },
  "clusters": [
    ...
  ]
}

Clusters are sorted by total score (highest traffic first).
Memory Efficiency
The batched approach keeps memory usage manageable:
Embeddings representation:
embeddings = [query_1_vector, query_2_vector, ..., query_N_vector]
Each vector = [dimension_1, dimension_2, ..., dimension_D]
Batch similarity calculation:
batch_similarities = cosine_similarity(batch_vectors, all_embeddings)
Result shape = (batch_size, num_queries)
Memory comparison:
Batch memory = batch_size × num_queries × bytes_per_value
Full matrix = num_queries × num_queries × bytes_per_value
Batching processes the query dimension in chunks: peak memory drops from num_queries × num_queries values (the full matrix) to batch_size × num_queries values at a time.
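A worked example of the comparison above, using illustrative sizes (100,000 queries, float32 scores, batches of 1,000):

```python
# Memory for the full similarity matrix vs. one batch (illustrative sizes).
num_queries = 100_000
batch_size = 1_000
bytes_per_value = 4  # float32

full_matrix = num_queries * num_queries * bytes_per_value
batch_memory = batch_size * num_queries * bytes_per_value

print(full_matrix // 10**9)   # 40  -> ~40 GB for the full matrix
print(batch_memory // 10**6)  # 400 -> ~400 MB per batch
```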
Blacklist Filtering
Before clustering, we filter out blacklisted queries (brand names, competitors, irrelevant terms). This reduces noise and improves cluster quality.
The blacklist is maintained separately and checked during query loading.
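A minimal sketch of the filtering step; the blacklist entries are made up, and the real matching rules (stored in the separately maintained list) may be richer than a token lookup:

```python
# Illustrative blacklist of brand/competitor terms.
blacklist = {"acme", "competitorco"}

def is_blacklisted(query, blacklist):
    # Drop a query if any of its tokens appears on the blacklist.
    return any(token in blacklist for token in query.lower().split())

raw_queries = ["mini pc", "acme mini pc", "small computer"]
queries = [q for q in raw_queries if not is_blacklisted(q, blacklist)]
print(queries)  # ['mini pc', 'small computer']
```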
Incremental Clustering
When new queries arrive, we don't re-cluster everything:
- Load existing clusters
- Embed new queries
- Compare new queries to existing cluster centers
- Assign to best matching cluster or create new cluster
- Re-rank clusters by updated scores
This incremental approach processes only new data, saving computation time.
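The incremental path can be sketched as follows: compare only the new query vectors against the existing cluster-center vectors. The data here is synthetic and the threshold illustrative:

```python
import numpy as np

# Sketch: assign new queries to existing clusters or open new ones.
THRESHOLD = 0.75
rng = np.random.default_rng(2)

centers = rng.normal(size=(3, 8)).astype(np.float32)   # existing cluster centers
centers /= np.linalg.norm(centers, axis=1, keepdims=True)
new = rng.normal(size=(2, 8)).astype(np.float32)       # newly embedded queries
new /= np.linalg.norm(new, axis=1, keepdims=True)

sims = new @ centers.T            # (num_new, num_centers) cosine similarities
best = sims.argmax(axis=1)
scores = sims.max(axis=1)

for i, (c, s) in enumerate(zip(best, scores)):
    if s >= THRESHOLD:
        print(f"new query {i} -> existing cluster {int(c)}")
    else:
        print(f"new query {i} -> new cluster")
```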
Integration with SEO Pipeline
Query clustering is Step 5 in the SEO pipeline:
- Step 0: Embed Source Data - Products, parts, articles
- Step 1: Fetch Queries - GSC, Google Ads, live, Algolia
- Step 2: Combine Queries - Merge all sources
- Step 3a: Generate Base Phrase Mappings - Initial filters
- Step 3b: Embed Queries - Convert to vectors
- Step 4: Expand Phrase Mappings - Find similar phrases
- Step 5: Cluster Queries ← You are here
- Step 6: Match Products - Query-product matching
- Step 7: Build Query Pages - Generate HTML
- Step 8: Generate Related Searches - Find related queries
- Step 11: Migrate to Valkey - Load into search service
See SEO Pipeline Overview for the complete flow.
Why All-to-All Instead of K-Means?
We previously used k-means clustering with pre-selected cluster centers. The all-to-all approach has advantages:
- Better matches: Every query finds its true best match, not just the nearest of a few pre-selected centers
- No pre-selection bias: Cluster centers emerge naturally from the data
- Adaptive cluster count: Number of clusters adjusts to the data distribution
- Higher quality: Queries cluster with their most similar match, not a compromise
The trade-off is increased computation time, but the quality improvement justifies it.
Performance Characteristics
The clustering process has these characteristics:
- Processing time: Scales with query volume and similarity threshold; larger batches reduce iterations
- Memory usage: Embeddings matrix + active batch + temporary working space; configurable via batch size
- Disk I/O: Single sequential read of the full embeddings array at start; minimal I/O during processing
- CPU usage: Dominated by cosine similarity computations across all batch-to-all comparisons
The process is CPU-bound. Using NumPy with BLAS acceleration speeds up matrix operations significantly.
References
Technical Concepts
- Cosine Similarity - Wikipedia
- K-means Clustering - Wikipedia
- NumPy - Official documentation
- BLAS - Wikipedia
Model Documentation
- all-mpnet-base-v2 - Hugging Face
- Sentence Transformers - Official docs
Related Articles
- Embedding Strategy - How we generate embeddings
- SEO Pipeline Overview - Complete pipeline architecture
- Product Matching Algorithm - Semantic matching
- Related Search Generation - Finding related queries
- Phrase Mapping Expansion - Using embeddings for filters
Summary
We cluster queries using all-to-all cosine similarity comparison with batched processing:
- Batched computation: Similarities computed in configurable batch sizes
- Best match selection: Each query joins the cluster of its most similar match
- Configurable threshold: Adjustable similarity threshold without code changes
- Memory efficient: Batching prevents memory overflow for large datasets
- Traffic-ranked output: Clusters sorted by total impression/click score
- Singleton handling: Unclustered queries become single-member clusters
This approach produces higher-quality clusters than k-means by finding true best matches instead of nearest pre-selected centers. The result is better query pages with more relevant traffic consolidation.