Related Search Generation: Three-Tier Semantic Matching
This article explains how we generate related search links for every page on the site using a three-tier strategy that balances semantic relevance, traffic distribution, and query diversity.
The Problem: Relevant Navigation Links
Every page needs related search links to help users discover more content. But which queries should we show?
The challenge is balancing multiple goals:
- Relevance: Links should be semantically related to the current page
- Traffic distribution: High-traffic queries should appear on high-traffic pages
- Query diversity: Each query should be spread across multiple pages, capped at 10 appearances site-wide
- URL uniqueness: Don't link to the same destination twice
- Coverage: Every page should have 8 quality links
A naive approach (pure semantic similarity) might show the same queries on every page. A better approach uses a three-tier strategy with traffic-aware processing.
Three-Tier Strategy
We generate related searches using three tiers of decreasing relevance:
Tier 1: Related Searches (High Similarity)
High semantic similarity queries that are directly related to the page content:
- Example page: "Treo N100 Mini PC"
- Example queries: "n100 mini pc", "mini pc 8gb", "compact desktop n100"
These are the most relevant matches.
Tier 2: Popular Searches (Medium Similarity)
Medium similarity queries that are topically related but less specific:
- Example page: "Treo N100 Mini PC"
- Example queries: "mini pc", "small computer", "fanless pc"
These are broader queries in the same category.
Tier 3: Trending Searches (Fallback)
Global power queries (highest traffic) used when semantic matches are exhausted:
- Example page: Any page
- Example queries: "mini pc", "thin client", "industrial pc", "all in one pc"
These are the most popular queries site-wide.
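The tiered selection above can be sketched as a single function. This is an illustrative sketch, not the production code: `threshold_high`, `threshold_med`, and the duplicate check inside `try_add` stand in for the real gating logic described later in this article.

```python
def select_queries(sorted_sims, queries, power_query_list,
                   threshold_high=0.85, threshold_med=0.60, max_links=8):
    """Sketch of three-tier selection.

    sorted_sims: list of (query_index, similarity), descending by similarity.
    The real try_add_query also enforces usage caps and URL uniqueness.
    """
    selected = []

    def try_add(query):
        if len(selected) >= max_links or query in selected:
            return False
        selected.append(query)
        return True

    # Tier 1: high-similarity matches ("Related Searches")
    for idx, sim in sorted_sims:
        if sim < threshold_high:
            break
        try_add(queries[idx])

    # Tier 2: medium-similarity matches ("Popular Searches")
    if len(selected) < max_links:
        for idx, sim in sorted_sims:
            if sim < threshold_med:
                break
            try_add(queries[idx])

    # Tier 3: global power queries ("Trending Searches")
    if len(selected) < max_links:
        for q in power_query_list:
            try_add(q)

    return selected
```

Note that Tier 2 re-scans from the top of the similarity ranking; the duplicate check makes that harmless, which keeps the loop simple.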
The Algorithm: Traffic-First Processing
Step 1: Collect All Pages
We collect every page on the site:
- Products (/p/): From Step 0 source data
- Parts (/i/): From Step 0 source data
- Articles (/a/): From Step 0 source data
- Families (/f/): From Step 0 source data
- Categories (/c/): From Step 0 source data
- Query pages (/q/): From routing data (Step 7)
- Static pages: Home, about, contact, etc.
This typically yields ~64,000 pages.
Step 2: Load Embeddings
We load pre-computed embeddings for:
- Pages: Reuse from Step 0 (products, parts, articles) and Step 6 (query pages)
- Queries: From Step 3b (all 65K queries)
Reusing embeddings avoids re-embedding unchanged data:
- Step 0 reuse: ~5,000 product/part/article embeddings
- Step 6 reuse: ~12,500 query page embeddings
- New embeddings: ~46,500 remaining pages
Step 3: Load Page Traffic
We query Athena for page views over the last 90 days:
traffic_data = generate_all_pages_traffic_index(lookback_days=90)
This returns a dictionary mapping URLs to view counts:
{
    "/p/Treo-N100-8-256-2H-W6-11P": 5000,
    "/q/mini-pc": 3000,
    "/": 50000
}
Step 4: Sort Pages by Traffic
We process pages in descending traffic order:
# .get() with a default of 0 keeps pages with no recorded traffic from raising KeyError
sorted_pages = sorted(all_pages, key=lambda p: page_traffic.get(p['url'], 0), reverse=True)
This ensures high-traffic pages get first pick of queries.
Step 5: Compute Similarity (Batched)
We process pages in batches of 1,000 for memory efficiency:
batch_page_embeddings = all_page_embeddings[batch_indices]
batch_similarities = util.cos_sim(batch_page_embeddings, query_embeddings)
# ... (implementation details omitted)
This produces a similarity matrix for the batch.
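`util.cos_sim` comes from Sentence Transformers; a plain NumPy equivalent of the batched computation might look like the sketch below (assuming embeddings arrive as 2-D float arrays):

```python
import numpy as np

def batched_cos_sim(page_emb, query_emb, batch_size=1000):
    """Yield cosine-similarity matrices one batch of pages at a time.

    page_emb: (n_pages, dim), query_emb: (n_queries, dim).
    Normalizing rows up front turns cosine similarity into a plain
    matrix product, which BLAS handles efficiently.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    for start in range(0, len(page_emb), batch_size):
        batch = page_emb[start:start + batch_size]
        b = batch / np.linalg.norm(batch, axis=1, keepdims=True)
        yield start, b @ q.T  # shape: (batch, n_queries)
```

Normalizing the query matrix once outside the loop avoids redoing that work for every batch.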
Step 6: Select Queries (Three-Tier)
For each page, we select queries using the three-tier strategy:
Tier 1: High Similarity
for idx, similarity in enumerate(sorted_similarities):
    if similarity < threshold_high:
        break
    # ... (implementation details omitted)
Tier 2: Medium Similarity (0.60-0.85)
if len(selected_queries) < 8:
    for idx, similarity in enumerate(sorted_similarities):
        if similarity < 0.60:
            break
        if try_add_query(queries[idx]):
            continue
Tier 3: Power Queries (Fallback)
if len(selected_queries) < 8:
    for power_query in power_query_list:
        if try_add_query(power_query):
            continue
Step 7: Enforce Constraints
The try_add_query() function enforces multiple constraints:
Query Usage Cap (10 appearances):
if query_usage[query] >= MAX_APPEARANCES:
    return False
URL Uniqueness (no duplicate destinations):
url = get_dest_url(query)
if url in selected_urls:
    return False
selected_urls.add(url)
Self-Link Prevention:
selected_urls = {page_url} # Initialize with current page
Limit to 8 Links:
if len(selected_queries) >= 8:
    return False
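Taken together, these checks can be folded into one gate function. The following is a hypothetical sketch reusing names from the snippets above (`query_usage`, `get_dest_url`, `selected_urls`); the production version presumably also updates the availability mask:

```python
MAX_APPEARANCES = 10
MAX_LINKS = 8

def make_try_add_query(page_url, query_usage, get_dest_url):
    """Build a try_add_query closure enforcing all four constraints."""
    selected_queries = []
    selected_urls = {page_url}  # seeded with the current page: no self-links

    def try_add_query(query):
        if len(selected_queries) >= MAX_LINKS:            # page link limit
            return False
        if query_usage.get(query, 0) >= MAX_APPEARANCES:  # site-wide usage cap
            return False
        url = get_dest_url(query)
        if url in selected_urls:                          # duplicate destination
            return False
        selected_urls.add(url)
        selected_queries.append(query)
        query_usage[query] = query_usage.get(query, 0) + 1
        return True

    return try_add_query, selected_queries
```

Returning `False` instead of raising lets the tier loops simply move on to the next candidate.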
Step 8: Convert Queries to URLs
We use the routing data from Step 7 to convert queries to destination URLs:
query_to_url_map = {}
for route in routing_data['routes']:
    for query in route['queries']:
        query_to_url_map[query] = route['destination']
This creates a fast O(1) lookup for query destinations.
Step 9: Language Propagation (Optional)
If language propagation is enabled, we append the query language to the URL:
lang = detect_query_language(query)
if lang != "en":
    url = f"{url}?lang={lang}"
This ensures users stay in their preferred language when clicking related searches.
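The f-string above assumes the destination URL carries no existing query string. A more defensive variant, shown here as a sketch rather than the production code, uses `urllib.parse` to preserve any parameters already present:

```python
from urllib.parse import urlencode, urlsplit, urlunsplit, parse_qsl

def append_lang(url, lang):
    """Append lang=... to a URL, preserving any existing query string."""
    if lang == "en":
        return url  # English is the default; leave the URL untouched
    scheme, netloc, path, query, frag = urlsplit(url)
    params = parse_qsl(query)
    params.append(("lang", lang))
    return urlunsplit((scheme, netloc, path, urlencode(params), frag))
```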
Step 10: Save to DynamoDB
We save the related searches to DynamoDB using batch writes:
with RelatedSearchWidget.batch_write() as batch:
    for page in batch_pages:
        widget = RelatedSearchWidget(
            page_url=page['url'],
            links=[...],
            tier_label="Related Searches"
        )
        batch.save(widget)
Batch writes cut the number of network round trips to DynamoDB, giving a large throughput improvement over individual writes.
Query Usage Tracking
We track how many times each query has been used:
query_usage = {query: 0 for query in all_queries}
MAX_APPEARANCES = 10
As queries are selected, we increment their usage:
query_usage[query] += 1
Once a query reaches 10 appearances, it's excluded from future selections:
available_mask[query_idx] = False
This ensures query diversity across the site.
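Excluding capped queries before ranking can be done by masking similarity scores. A minimal NumPy sketch (the `available_mask` name is taken from the snippet above; the rest is an assumption):

```python
import numpy as np

def rank_available(similarities, available_mask):
    """Rank query indices by similarity, skipping exhausted queries.

    Setting masked scores to -inf keeps array shapes stable, so the
    same argsort works on every iteration.
    """
    scores = np.where(available_mask, similarities, -np.inf)
    order = np.argsort(-scores)           # descending by similarity
    return order[available_mask[order]]   # drop masked entries
```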
Traffic-First Processing Benefits
Processing pages by traffic (highest first) has advantages:
High-traffic pages get best matches:
- Home page (50K views) → Top 8 most relevant queries
- Popular product (5K views) → Next 8 most relevant queries
- Niche product (100 views) → Remaining relevant queries
Automatic fallback:
- If high-similarity queries are exhausted, medium-similarity queries are used
- If medium-similarity queries are exhausted, power queries are used
Traffic distribution:
- High-traffic queries appear on high-traffic pages
- Low-traffic queries appear on niche pages
- Maximizes overall click-through potential
Static Page Handling
Static pages (home, about, contact) skip Tier 1 and Tier 2, going straight to Tier 3:
if page_type == 'static':
    for power_query in power_query_list:
        try_add_query(power_query)
    tier_label = "Trending Searches"
This ensures static pages show the most popular queries site-wide.
Embedding Reuse Strategy
We reuse embeddings from previous pipeline steps:
Step 0 (Source Data):
- Products: /p/Treo-N100-8-256-2H-W6-11P
- Parts: /i/N100
- Articles: /a/mini-pc-guide
Step 6 (Cluster Queries):
- Query pages: /q/mini-pc (slug → query text)
New Embeddings:
- Remaining pages not in Step 0 or Step 6
This reduces embedding time from ~2 hours to ~20 minutes.
Incremental Embedding
For pages that need new embeddings, we use incremental caching:
new_embeddings = incremental_embed_with_keys(
    items=page_contents,
    keys=page_urls,
    cache_embeddings_path=SEO_PAGE_EMBEDDINGS_PATH,
    cache_keys_path=SEO_PAGE_URLS_PATH,
    model_name='all-mpnet-base-v2'
)
This caches embeddings for future runs. See Embedding Strategy for details.
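A minimal sketch of what such incremental caching might look like. This is hypothetical; the actual `incremental_embed_with_keys` implementation is not shown in this article, and `embed_fn` stands in for whatever model call it wraps:

```python
import numpy as np

def incremental_embed(items, keys, cached_keys, cached_vecs, embed_fn):
    """Embed only keys missing from the cache; reuse the rest.

    Assumes keys are unique and cached_vecs has one row per cached key.
    embed_fn maps a list of texts to an (n, dim) array, e.g. a
    SentenceTransformer encode call.
    """
    index = {k: i for i, k in enumerate(cached_keys)}
    new_items = [it for it, k in zip(items, keys) if k not in index]
    new_keys = [k for k in keys if k not in index]
    if new_items:
        new_vecs = embed_fn(new_items)
        cached_vecs = np.vstack([cached_vecs, new_vecs])
        # New rows are appended after the existing cache rows
        index.update({k: len(cached_keys) + i for i, k in enumerate(new_keys)})
    # Return vectors in the order the caller requested
    return np.stack([cached_vecs[index[k]] for k in keys])
```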
Output Format
The related searches are stored in DynamoDB:
{
    "page_url": "/p/Treo-N100-8-256-2H-W6-11P",
    "tier_label": "Related Searches",
    # ... (implementation details omitted)
}
The web application queries this table to display related searches on each page.
Performance Characteristics
On a typical server:
- Processing time: Varies based on page count
- Memory usage: Depends on embedding size and batch size
- DynamoDB writes: Batch writes for all pages
- Embedding reuse: High percentage from incremental cache
The process is CPU-bound during similarity computation. Using NumPy with BLAS acceleration speeds up matrix operations significantly.
Integration with SEO Pipeline
Related search generation is Step 8 in the SEO pipeline:
- Step 0: Embed Source Data - Products, parts, articles
- Step 1: Fetch Queries - GSC, Google Ads, live, Algolia
- Step 2: Combine Queries - Merge all sources
- Step 3a: Generate Base Phrase Mappings - Initial filters
- Step 3b: Embed Queries - Convert to vectors
- Step 4: Expand Phrase Mappings - Find similar phrases
- Step 5: Cluster Queries - Group into pages
- Step 6: Match Products - Query-product matching
- Step 7: Build Query Pages - Generate HTML
- Step 8: Generate Related Searches ← You are here
- Step 11: Migrate to Valkey - Load into search service
See SEO Pipeline Overview for the complete flow.
Why Three Tiers?
Tier 1 (High Similarity) provides:
- ✅ Maximum relevance
- ✅ Best user experience
- ❌ Limited coverage (not all pages have high-similarity matches)
Tier 2 (Medium Similarity) provides:
- ✅ Good relevance
- ✅ Broader coverage
- ❌ Less specific matches
Tier 3 (Power Queries) provides:
- ✅ Universal coverage (every page gets 8 links)
- ✅ High click-through (popular queries)
- ❌ Lower relevance
Together, they ensure every page has quality links with the best possible relevance.
Statistics and Monitoring
After generation, we calculate statistics:
- Total pages: All pages in the system
- Total outlinks: Pages × links per page
- Queries used: Subset of available queries
- Query distribution: Most queries appear on multiple pages
- Unused queries: Low-traffic or low-similarity queries
We also log category and family page cross-linking for analysis.
References
Technical Concepts
- Cosine Similarity - Wikipedia
- NumPy - Official documentation
- BLAS - Wikipedia
- DynamoDB - AWS documentation
- Athena - AWS documentation
Model Documentation
- all-mpnet-base-v2 - Hugging Face
- Sentence Transformers - Official docs
Related Articles
- Embedding Strategy - How we generate embeddings
- Query Clustering - Grouping similar queries
- Product Matching - Semantic matching
- SEO Pipeline Overview - Complete pipeline architecture
- Embed Source Data - Embedding products, parts, articles
Summary
We generate related searches for all pages using a three-tier strategy:
Three Tiers:
- Tier 1: High similarity - "Related Searches"
- Tier 2: Medium similarity - "Popular Searches"
- Tier 3: Power queries (fallback) - "Trending Searches"
Traffic-First Processing:
- Sort pages by traffic (highest first)
- High-traffic pages get first pick of queries
- Automatic fallback to lower tiers
Constraints:
- Maximum appearances per query: 10 (diversity)
- Maximum links per page: 8 (limit)
- No duplicate destinations (uniqueness)
- No self-links (prevention)
Embedding Reuse:
- High reuse from previous steps
- Incremental cache for new pages
The result is comprehensive related search coverage across the entire site with optimal relevance, traffic distribution, and query diversity.