Query Clustering: All-to-All Similarity with Batched Processing
This article explains how we cluster search queries into semantic groups using all-to-all cosine similarity comparison with batched processing for memory efficiency.
The Problem: Grouping Similar Queries
Users search for the same thing in many ways:
- "mini pc"
- "small computer"
- "compact desktop"
- "tiny pc"
These queries should lead to the same page. We need to cluster semantically similar queries together so we can:
- Create one query page per cluster
- Consolidate traffic across similar searches
- Avoid duplicate content
- Improve SEO by focusing on high-traffic clusters
With large query volumes, we need an efficient clustering algorithm that finds the best matches without exhausting memory.
The Algorithm: All-to-All Batched Comparison
We use an all-to-all similarity comparison where every query is compared to every other query to find its best match. To avoid memory exhaustion with large numbers of queries, we batch the process.
Step 1: Load Embeddings
We load pre-computed embeddings into memory. Each query is represented as a fixed-dimensional vector from a sentence transformer model.
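A minimal sketch of this step. In the pipeline the matrix would come from disk (for example via `np.load`); here we synthesize a small (N, D) float32 matrix so the example is self-contained, and normalize rows up front so that cosine similarity later reduces to a plain dot product:

```python
import numpy as np

# Sketch: in the pipeline the matrix would be loaded from disk
# (e.g. np.load("query_embeddings.npy") -- filename hypothetical);
# here we synthesize a small (N, D) float32 matrix instead.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 8)).astype(np.float32)  # N=6 queries, D=8 dims

# Normalize rows once so cosine similarity is just a dot product later.
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = embeddings / norms

print(embeddings.shape)                                      # (6, 8)
print(np.allclose(np.linalg.norm(embeddings, axis=1), 1.0))  # True
```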
Step 2: Batch Processing
We process queries in batches to avoid memory overflow. Each batch computes cosine similarity scores between its queries and all available queries. By iterating through batches, every query gets compared to all others without loading all similarities into memory simultaneously.
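The batched loop can be sketched as follows; data sizes, batch size, and variable names are illustrative. Each iteration holds only a (batch_size, N) slice of the similarity matrix in memory, and the diagonal entries are masked so a query can never match itself:

```python
import numpy as np

# Illustrative sketch with synthetic, row-normalized embeddings.
rng = np.random.default_rng(0)
N, D, BATCH_SIZE = 10, 8, 4
embeddings = rng.normal(size=(N, D)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

best_match = np.empty(N, dtype=np.int64)
best_score = np.empty(N, dtype=np.float32)

for start in range(0, N, BATCH_SIZE):
    batch = embeddings[start:start + BATCH_SIZE]
    # Cosine similarity of this batch against ALL queries: shape (batch, N).
    # Only batch_size x N scores exist in memory at any time.
    sims = batch @ embeddings.T
    # A query must not match itself; mask the diagonal entries.
    rows = np.arange(batch.shape[0])
    sims[rows, start + rows] = -np.inf
    best_match[start:start + BATCH_SIZE] = sims.argmax(axis=1)
    best_score[start:start + BATCH_SIZE] = sims.max(axis=1)

print(best_match.shape)  # (10,)
```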
Step 3: Find Best Match
For each query in the batch, we find the single most similar query across the full set. That best match becomes the query's cluster center, and the query joins its cluster. If the best similarity score is below the configured threshold, the query remains unclustered for now.
Step 4: Singleton Clusters
After all batches complete, any query that wasn't assigned to a cluster becomes a singleton cluster (a cluster containing only itself). This ensures every query belongs to exactly one cluster.
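Steps 3 and 4 together can be sketched like this. The scores, threshold, and the convention that the lower index acts as the cluster center are all illustrative assumptions, not the pipeline's exact rules:

```python
import numpy as np

# Sketch of steps 3-4 (threshold assignment + singleton fallback).
# Scores, threshold, and the center-selection convention are illustrative.
THRESHOLD = 0.75
best_match = np.array([1, 0, 3, 2, 0])            # most similar other query
best_score = np.array([0.91, 0.91, 0.40, 0.40, 0.80])

clusters = {}        # cluster center index -> list of member indices
assigned = set()

for q, (m, s) in enumerate(zip(best_match, best_score)):
    if s >= THRESHOLD:
        center = min(q, int(m))   # assumed convention: lower index is the center
        clusters.setdefault(center, []).append(q)
        assigned.add(q)

# Step 4: every unassigned query becomes its own singleton cluster.
for q in range(len(best_match)):
    if q not in assigned:
        clusters[q] = [q]

print(clusters)  # {0: [0, 1, 4], 2: [2], 3: [3]}
```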
Configurable Threshold
The similarity threshold determines how strict clustering is:
- Higher threshold: Only queries with cosine similarity above this value join the same cluster; results in more clusters with tighter semantic grouping
- Lower threshold: Queries with lower similarity scores get grouped together; results in fewer, larger clusters with broader semantic coverage
The threshold is configured in the application config file and can be adjusted without changing code.
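To make the effect concrete, here is a tiny illustration; the similarity value and thresholds are made up, not measured:

```python
def joins_cluster(similarity, threshold):
    """A query joins its best match's cluster only above the threshold."""
    return similarity >= threshold

# Illustrative cosine similarity between two related queries.
sim = 0.62
print(joins_cluster(sim, threshold=0.5))  # True  -> grouped (broader clusters)
print(joins_cluster(sim, threshold=0.8))  # False -> stays apart (tighter clusters)
```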
Cluster Ranking
Clusters are ranked by total traffic score from Google Search Console impressions and clicks. For each cluster, the sum of all member queries' traffic scores determines its ranking. High-traffic clusters get priority for query page generation.
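A sketch of the ranking step. The field names, example numbers, and the exact weighting of impressions versus clicks are illustrative assumptions; the pipeline derives its scores from Google Search Console data:

```python
# Rank clusters by the summed traffic score of their member queries.
clusters = {
    "mini pc": [
        {"query": "mini pc", "impressions": 5000, "clicks": 300},
        {"query": "small computer", "impressions": 1200, "clicks": 80},
    ],
    "gaming mouse": [
        {"query": "gaming mouse", "impressions": 900, "clicks": 45},
    ],
}

def traffic_score(q):
    # Illustrative weighting: a click counts more than a raw impression.
    return q["impressions"] + 10 * q["clicks"]

ranked = sorted(
    clusters.items(),
    key=lambda item: sum(traffic_score(q) for q in item[1]),
    reverse=True,
)
print([name for name, _ in ranked])  # ['mini pc', 'gaming mouse']
```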
Output Format
The clustering produces a JSON file with statistics and cluster data. An abridged sketch of the shape (field names other than stats.threshold are illustrative):

{
  "stats": {
    "threshold": "configured_value",
    ...
  },
  "clusters": [
    ...
  ]
}

Clusters are sorted by total score (highest traffic first).
Memory Efficiency
The batched approach keeps memory usage manageable:
Embeddings representation:
embeddings = [query_1_vector, query_2_vector, ..., query_N_vector]
Each vector = [dimension_1, dimension_2, ..., dimension_D]
Batch similarity calculation:
batch_similarities = cosine_similarity(batch_vectors, all_embeddings)
Result shape = (batch_size, num_queries)
Memory comparison:
Batch memory = batch_size × num_queries × bytes_per_value
Full matrix = num_queries × num_queries × bytes_per_value
Batching processes the query dimension in chunks: peak memory drops from num_queries × num_queries values (the full matrix) to batch_size × num_queries values at a time.
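A worked example of the comparison above, using illustrative sizes (100,000 queries, float32 scores, batches of 1,000):

```python
# Memory for the full similarity matrix vs. one batch (illustrative sizes).
num_queries = 100_000
batch_size = 1_000
bytes_per_value = 4  # float32

full_matrix = num_queries * num_queries * bytes_per_value
batch_memory = batch_size * num_queries * bytes_per_value

print(full_matrix // 10**9)   # 40  -> ~40 GB for the full matrix
print(batch_memory // 10**6)  # 400 -> ~400 MB per batch
```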
Blacklist Filtering
Before clustering, we filter out blacklisted queries (brand names, competitors, irrelevant terms). This reduces noise and improves cluster quality.
The blacklist is maintained separately and checked during query loading.
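A minimal sketch of the filtering step; the blacklist entries are made up, and the real matching rules (stored in the separately maintained list) may be richer than a token lookup:

```python
# Illustrative blacklist of brand/competitor terms.
blacklist = {"acme", "competitorco"}

def is_blacklisted(query, blacklist):
    # Drop a query if any of its tokens appears on the blacklist.
    return any(token in blacklist for token in query.lower().split())

raw_queries = ["mini pc", "acme mini pc", "small computer"]
queries = [q for q in raw_queries if not is_blacklisted(q, blacklist)]
print(queries)  # ['mini pc', 'small computer']
```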
Incremental Clustering
When new queries arrive, we don't re-cluster everything:
- Load existing clusters
- Embed new queries
- Compare new queries to existing cluster centers
- Assign to best matching cluster or create new cluster
- Re-rank clusters by updated scores
This incremental approach processes only new data, saving computation time.
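The incremental path can be sketched as follows: compare only the new query vectors against the existing cluster-center vectors. The data here is synthetic and the threshold illustrative:

```python
import numpy as np

# Sketch: assign new queries to existing clusters or open new ones.
THRESHOLD = 0.75
rng = np.random.default_rng(2)

centers = rng.normal(size=(3, 8)).astype(np.float32)   # existing cluster centers
centers /= np.linalg.norm(centers, axis=1, keepdims=True)
new = rng.normal(size=(2, 8)).astype(np.float32)       # newly embedded queries
new /= np.linalg.norm(new, axis=1, keepdims=True)

sims = new @ centers.T            # (num_new, num_centers) cosine similarities
best = sims.argmax(axis=1)
scores = sims.max(axis=1)

for i, (c, s) in enumerate(zip(best, scores)):
    if s >= THRESHOLD:
        print(f"new query {i} -> existing cluster {int(c)}")
    else:
        print(f"new query {i} -> new cluster")
```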
Integration with SEO Pipeline
Query clustering is Step 5 in the SEO pipeline:
- Step 0: Embed Source Data - Products, parts, articles
- Step 1: Fetch Queries - GSC, Google Ads, live, Algolia
- Step 2: Combine Queries - Merge all sources
- Step 3a: Generate Base Phrase Mappings - Initial filters
- Step 3b: Embed Queries - Convert to vectors
- Step 4: Expand Phrase Mappings - Find similar phrases
- Step 5: Cluster Queries ← You are here
- Step 6: Match Products - Query-product matching
- Step 7: Build Query Pages - Generate HTML
- Step 8: Generate Related Searches - Find related queries
- Step 11: Migrate to Valkey - Load into search service
See SEO Pipeline Overview for the complete flow.
Why All-to-All Instead of K-Means?
We previously used k-means clustering with pre-selected cluster centers. The all-to-all approach has advantages:
- Better matches: Every query finds its true best match, not just the nearest of a few pre-selected centers
- No pre-selection bias: Cluster centers emerge naturally from the data
- Adaptive cluster count: Number of clusters adjusts to the data distribution
- Higher quality: Queries cluster with their most similar match, not a compromise
The trade-off is increased computation time, but the quality improvement justifies it.
Performance Characteristics
The clustering process has these characteristics:
- Processing time: Scales with query volume and similarity threshold; larger batches reduce iterations
- Memory usage: Embeddings matrix + active batch + temporary working space; configurable via batch size
- Disk I/O: Single sequential read of the full embeddings array at start; minimal I/O during processing
- CPU usage: Dominated by cosine similarity computations across all batch-to-all comparisons
The process is CPU-bound. Using NumPy with BLAS acceleration speeds up matrix operations significantly.
References
Technical Concepts
- Cosine Similarity - Wikipedia
- K-means Clustering - Wikipedia
- NumPy - Official documentation
- BLAS - Wikipedia
Model Documentation
- all-mpnet-base-v2 - Hugging Face
- Sentence Transformers - Official docs
Related Articles
- Embedding Strategy - How we generate embeddings
- SEO Pipeline Overview - Complete pipeline architecture
- Product Matching Algorithm - Semantic matching
- Related Search Generation - Finding related queries
- Phrase Mapping Expansion - Using embeddings for filters
Summary
We cluster queries using all-to-all cosine similarity comparison with batched processing:
- Batched computation: Similarities computed in configurable batch sizes
- Best match selection: Each query joins the cluster of its most similar match
- Configurable threshold: Adjustable similarity threshold without code changes
- Memory efficient: Batching prevents memory overflow for large datasets
- Traffic-ranked output: Clusters sorted by total impression/click score
- Singleton handling: Unclustered queries become single-member clusters
This approach produces higher-quality clusters than k-means by finding true best matches instead of nearest pre-selected centers. The result is better query pages with more relevant traffic consolidation.