Related Search Generation: Three-Tier Semantic Matching
This article explains how we generate related search links for every page on the site using a three-tier strategy that balances semantic relevance, traffic distribution, and query diversity.
The Problem: Relevant Navigation Links
Every page needs related search links to help users discover more content. But which queries should we show?
The challenge is balancing multiple goals:
- Relevance: Links should be semantically related to the current page
- Traffic distribution: High-traffic queries should appear on high-traffic pages
- Query diversity: Each query should be spread across multiple pages, capped at 10 appearances site-wide
- URL uniqueness: Don't link to the same destination twice
- Coverage: Every page should have 8 quality links
A naive approach (pure semantic similarity) might show the same queries on every page. A better approach uses a three-tier strategy with traffic-aware processing.
Three-Tier Strategy
We generate related searches using three tiers of decreasing relevance:
Tier 1: Related Searches (High Similarity)
High semantic similarity queries that are directly related to the page content:
- Example page: "Treo N100 Mini PC"
- Example queries: "n100 mini pc", "mini pc 8gb", "compact desktop n100"
These are the most relevant matches.
Tier 2: Popular Searches (Medium Similarity)
Medium similarity queries that are topically related but less specific:
- Example page: "Treo N100 Mini PC"
- Example queries: "mini pc", "small computer", "fanless pc"
These are broader queries in the same category.
Tier 3: Trending Searches (Fallback)
Global power queries (highest traffic) used when semantic matches are exhausted:
- Example page: Any page
- Example queries: "mini pc", "thin client", "industrial pc", "all in one pc"
These are the most popular queries site-wide.
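The tiered selection above can be sketched as a single function. This is an illustrative sketch, not the production code: `threshold_high`, `threshold_med`, and the duplicate check inside `try_add` stand in for the real gating logic described later in this article.

```python
def select_queries(sorted_sims, queries, power_query_list,
                   threshold_high=0.85, threshold_med=0.60, max_links=8):
    """Sketch of three-tier selection.

    sorted_sims: list of (query_index, similarity), descending by similarity.
    The real try_add_query also enforces usage caps and URL uniqueness.
    """
    selected = []

    def try_add(query):
        if len(selected) >= max_links or query in selected:
            return False
        selected.append(query)
        return True

    # Tier 1: high-similarity matches ("Related Searches")
    for idx, sim in sorted_sims:
        if sim < threshold_high:
            break
        try_add(queries[idx])

    # Tier 2: medium-similarity matches ("Popular Searches")
    if len(selected) < max_links:
        for idx, sim in sorted_sims:
            if sim < threshold_med:
                break
            try_add(queries[idx])

    # Tier 3: global power queries ("Trending Searches")
    if len(selected) < max_links:
        for q in power_query_list:
            try_add(q)

    return selected
```

Note that Tier 2 re-scans from the top of the similarity ranking; the duplicate check makes that harmless, which keeps the loop simple.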
The Algorithm: Traffic-First Processing
Step 1: Collect All Pages
We collect every page on the site:
- Products (/p/): From Step 0 source data
- Parts (/i/): From Step 0 source data
- Articles (/a/): From Step 0 source data
- Families (/f/): From Step 0 source data
- Categories (/c/): From Step 0 source data
- Query pages (/q/): From routing data (Step 7)
- Static pages: Home, about, contact, etc.
This typically yields ~64,000 pages.
Step 2: Load Embeddings
We load pre-computed embeddings for:
- Pages: Reuse from Step 0 (products, parts, articles) and Step 6 (query pages)
- Queries: From Step 3b (all 65K queries)
Reusing embeddings avoids re-embedding unchanged data:
- Step 0 reuse: ~5,000 product/part/article embeddings
- Step 6 reuse: ~12,500 query page embeddings
- New embeddings: ~46,500 remaining pages
Step 3: Load Page Traffic
We query Athena for page views over the last 90 days:
traffic_data = generate_all_pages_traffic_index(lookback_days=90)
This returns a dictionary mapping URLs to view counts:
{
    "/p/Treo-N100-8-256-2H-W6-11P": 5000,
    "/q/mini-pc": 3000,
    "/": 50000
}
Step 4: Sort Pages by Traffic
We process pages in descending traffic order:
# .get() with a default of 0 keeps pages with no recorded traffic from raising KeyError
sorted_pages = sorted(all_pages, key=lambda p: page_traffic.get(p['url'], 0), reverse=True)
This ensures high-traffic pages get first pick of queries.
Step 5: Compute Similarity (Batched)
We process pages in batches of 1,000 for memory efficiency:
batch_page_embeddings = all_page_embeddings[batch_indices]
batch_similarities = util.cos_sim(batch_page_embeddings, query_embeddings)
# ... (implementation details omitted)
This produces a similarity matrix for the batch.
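`util.cos_sim` comes from Sentence Transformers; a plain NumPy equivalent of the batched computation might look like the sketch below (assuming embeddings arrive as 2-D float arrays):

```python
import numpy as np

def batched_cos_sim(page_emb, query_emb, batch_size=1000):
    """Yield cosine-similarity matrices one batch of pages at a time.

    page_emb: (n_pages, dim), query_emb: (n_queries, dim).
    Normalizing rows up front turns cosine similarity into a plain
    matrix product, which BLAS handles efficiently.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    for start in range(0, len(page_emb), batch_size):
        batch = page_emb[start:start + batch_size]
        b = batch / np.linalg.norm(batch, axis=1, keepdims=True)
        yield start, b @ q.T  # shape: (batch, n_queries)
```

Normalizing the query matrix once outside the loop avoids redoing that work for every batch.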
Step 6: Select Queries (Three-Tier)
For each page, we select queries using the three-tier strategy:
Tier 1: High Similarity
for idx, similarity in enumerate(sorted_similarities):
    if similarity < threshold_high:
        break
    # ... (implementation details omitted)
Tier 2: Medium Similarity (0.60-0.85)
if len(selected_queries) < 8:
    for idx, similarity in enumerate(sorted_similarities):
        if similarity < 0.60:
            break
        if try_add_query(queries[idx]):
            continue
Tier 3: Power Queries (Fallback)
if len(selected_queries) < 8:
    for power_query in power_query_list:
        if try_add_query(power_query):
            continue
Step 7: Enforce Constraints
The try_add_query() function enforces multiple constraints:
Query Usage Cap (10 appearances):
if query_usage[query] >= MAX_APPEARANCES:
    return False
URL Uniqueness (no duplicate destinations):
url = get_dest_url(query)
if url in selected_urls:
    return False
selected_urls.add(url)
Self-Link Prevention:
selected_urls = {page_url} # Initialize with current page
Limit to 8 Links:
if len(selected_queries) >= 8:
    return False
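Taken together, these checks can be folded into one gate function. The following is a hypothetical sketch reusing names from the snippets above (`query_usage`, `get_dest_url`, `selected_urls`); the production version presumably also updates the availability mask:

```python
MAX_APPEARANCES = 10
MAX_LINKS = 8

def make_try_add_query(page_url, query_usage, get_dest_url):
    """Build a try_add_query closure enforcing all four constraints."""
    selected_queries = []
    selected_urls = {page_url}  # seeded with the current page: no self-links

    def try_add_query(query):
        if len(selected_queries) >= MAX_LINKS:            # page link limit
            return False
        if query_usage.get(query, 0) >= MAX_APPEARANCES:  # site-wide usage cap
            return False
        url = get_dest_url(query)
        if url in selected_urls:                          # duplicate destination
            return False
        selected_urls.add(url)
        selected_queries.append(query)
        query_usage[query] = query_usage.get(query, 0) + 1
        return True

    return try_add_query, selected_queries
```

Returning `False` instead of raising lets the tier loops simply move on to the next candidate.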
Step 8: Convert Queries to URLs
We use the routing data from Step 7 to convert queries to destination URLs:
query_to_url_map = {}
for route in routing_data['routes']:
    for query in route['queries']:
        query_to_url_map[query] = route['destination']
This creates a fast O(1) lookup for query destinations.
Step 9: Language Propagation (Optional)
If language propagation is enabled, we append the query language to the URL:
lang = detect_query_language(query)
if lang != "en":
    url = f"{url}?lang={lang}"
This ensures users stay in their preferred language when clicking related searches.
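The f-string above assumes the destination URL carries no existing query string. A more defensive variant, shown here as a sketch rather than the production code, uses `urllib.parse` to preserve any parameters already present:

```python
from urllib.parse import urlencode, urlsplit, urlunsplit, parse_qsl

def append_lang(url, lang):
    """Append lang=... to a URL, preserving any existing query string."""
    if lang == "en":
        return url  # English is the default; leave the URL untouched
    scheme, netloc, path, query, frag = urlsplit(url)
    params = parse_qsl(query)
    params.append(("lang", lang))
    return urlunsplit((scheme, netloc, path, urlencode(params), frag))
```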
Step 10: Save to DynamoDB
We save the related searches to DynamoDB using batch writes:
with RelatedSearchWidget.batch_write() as batch:
    for page in batch_pages:
        widget = RelatedSearchWidget(
            page_url=page['url'],
            links=[...],
            tier_label="Related Searches"
        )
        batch.save(widget)
Batch writes cut the number of network round trips to DynamoDB, giving a large throughput improvement over individual writes.
Query Usage Tracking
We track how many times each query has been used:
query_usage = {query: 0 for query in all_queries}
MAX_APPEARANCES = 10
As queries are selected, we increment their usage:
query_usage[query] += 1
Once a query reaches 10 appearances, it's excluded from future selections:
available_mask[query_idx] = False
This ensures query diversity across the site.
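Excluding capped queries before ranking can be done by masking similarity scores. A minimal NumPy sketch (the `available_mask` name is taken from the snippet above; the rest is an assumption):

```python
import numpy as np

def rank_available(similarities, available_mask):
    """Rank query indices by similarity, skipping exhausted queries.

    Setting masked scores to -inf keeps array shapes stable, so the
    same argsort works on every iteration.
    """
    scores = np.where(available_mask, similarities, -np.inf)
    order = np.argsort(-scores)           # descending by similarity
    return order[available_mask[order]]   # drop masked entries
```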
Traffic-First Processing Benefits
Processing pages by traffic (highest first) has advantages:
High-traffic pages get best matches:
- Home page (50K views) → Top 8 most relevant queries
- Popular product (5K views) → Next 8 most relevant queries
- Niche product (100 views) → Remaining relevant queries
Automatic fallback:
- If high-similarity queries are exhausted, medium-similarity queries are used
- If medium-similarity queries are exhausted, power queries are used
Traffic distribution:
- High-traffic queries appear on high-traffic pages
- Low-traffic queries appear on niche pages
- Maximizes overall click-through potential
Static Page Handling
Static pages (home, about, contact) skip Tier 1 and Tier 2, going straight to Tier 3:
if page_type == 'static':
    for power_query in power_query_list:
        try_add_query(power_query)
    tier_label = "Trending Searches"
This ensures static pages show the most popular queries site-wide.
Embedding Reuse Strategy
We reuse embeddings from previous pipeline steps:
Step 0 (Source Data):
- Products: /p/Treo-N100-8-256-2H-W6-11P
- Parts: /i/N100
- Articles: /a/mini-pc-guide
Step 6 (Cluster Queries):
- Query pages: /q/mini-pc (slug → query text)
New Embeddings:
- Remaining pages not in Step 0 or Step 6
This reduces embedding time from ~2 hours to ~20 minutes.
Incremental Embedding
For pages that need new embeddings, we use incremental caching:
new_embeddings = incremental_embed_with_keys(
    items=page_contents,
    keys=page_urls,
    cache_embeddings_path=SEO_PAGE_EMBEDDINGS_PATH,
    cache_keys_path=SEO_PAGE_URLS_PATH,
    model_name='all-mpnet-base-v2'
)
This caches embeddings for future runs. See Embedding Strategy for details.
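A minimal sketch of what such incremental caching might look like. This is hypothetical; the actual `incremental_embed_with_keys` implementation is not shown in this article, and `embed_fn` stands in for whatever model call it wraps:

```python
import numpy as np

def incremental_embed(items, keys, cached_keys, cached_vecs, embed_fn):
    """Embed only keys missing from the cache; reuse the rest.

    Assumes keys are unique and cached_vecs has one row per cached key.
    embed_fn maps a list of texts to an (n, dim) array, e.g. a
    SentenceTransformer encode call.
    """
    index = {k: i for i, k in enumerate(cached_keys)}
    new_items = [it for it, k in zip(items, keys) if k not in index]
    new_keys = [k for k in keys if k not in index]
    if new_items:
        new_vecs = embed_fn(new_items)
        cached_vecs = np.vstack([cached_vecs, new_vecs])
        # New rows are appended after the existing cache rows
        index.update({k: len(cached_keys) + i for i, k in enumerate(new_keys)})
    # Return vectors in the order the caller requested
    return np.stack([cached_vecs[index[k]] for k in keys])
```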
Output Format
The related searches are stored in DynamoDB:
{
    "page_url": "/p/Treo-N100-8-256-2H-W6-11P",
    "tier_label": "Related Searches",
    # ... (implementation details omitted)
}
The web application queries this table to display related searches on each page.
Performance Characteristics
On a typical server:
- Processing time: Varies based on page count
- Memory usage: Depends on embedding size and batch size
- DynamoDB writes: Batch writes for all pages
- Embedding reuse: High percentage from incremental cache
The process is CPU-bound during similarity computation. Using NumPy with BLAS acceleration speeds up matrix operations significantly.
Integration with SEO Pipeline
Related search generation is Step 8 in the SEO pipeline:
- Step 0: Embed Source Data - Products, parts, articles
- Step 1: Fetch Queries - GSC, Google Ads, live, Algolia
- Step 2: Combine Queries - Merge all sources
- Step 3a: Generate Base Phrase Mappings - Initial filters
- Step 3b: Embed Queries - Convert to vectors
- Step 4: Expand Phrase Mappings - Find similar phrases
- Step 5: Cluster Queries - Group into pages
- Step 6: Match Products - Query-product matching
- Step 7: Build Query Pages - Generate HTML
- Step 8: Generate Related Searches ← You are here
- Step 11: Migrate to Valkey - Load into search service
See SEO Pipeline Overview for the complete flow.
Why Three Tiers?
Tier 1 (High Similarity) provides:
- ✅ Maximum relevance
- ✅ Best user experience
- ❌ Limited coverage (not all pages have high-similarity matches)
Tier 2 (Medium Similarity) provides:
- ✅ Good relevance
- ✅ Broader coverage
- ❌ Less specific matches
Tier 3 (Power Queries) provides:
- ✅ Universal coverage (every page gets 8 links)
- ✅ High click-through (popular queries)
- ❌ Lower relevance
Together, they ensure every page has quality links with the best possible relevance.
Statistics and Monitoring
After generation, we calculate statistics:
- Total pages: All pages in the system
- Total outlinks: Pages × links per page
- Queries used: Subset of available queries
- Query distribution: Most queries appear on multiple pages
- Unused queries: Low-traffic or low-similarity queries
We also log category and family page cross-linking for analysis.
References
Technical Concepts
- Cosine Similarity - Wikipedia
- NumPy - Official documentation
- BLAS - Wikipedia
- DynamoDB - AWS documentation
- Athena - AWS documentation
Model Documentation
- all-mpnet-base-v2 - Hugging Face
- Sentence Transformers - Official docs
Related Articles
- Embedding Strategy - How we generate embeddings
- Query Clustering - Grouping similar queries
- Product Matching - Semantic matching
- SEO Pipeline Overview - Complete pipeline architecture
- Embed Source Data - Embedding products, parts, articles
Summary
We generate related searches for all pages using a three-tier strategy:
Three Tiers:
- Tier 1: High similarity - "Related Searches"
- Tier 2: Medium similarity - "Popular Searches"
- Tier 3: Power queries (fallback) - "Trending Searches"
Traffic-First Processing:
- Sort pages by traffic (highest first)
- High-traffic pages get first pick of queries
- Automatic fallback to lower tiers
Constraints:
- Maximum appearances per query: 10 (diversity)
- Maximum links per page: 8 (limit)
- No duplicate destinations (uniqueness)
- No self-links (prevention)
Embedding Reuse:
- High reuse from previous steps
- Incremental cache for new pages
The result is comprehensive related search coverage across the entire site with optimal relevance, traffic distribution, and query diversity.