Related Search Generation: Three-Tier Semantic Matching

This article explains how we generate related search links for every page on the site using a three-tier strategy that balances semantic relevance, traffic distribution, and query diversity.

The Problem: Relevant Navigation Links

Every page needs related search links to help users discover more content. But which queries should we show?

The challenge is balancing multiple goals:

  • Relevance: Links should be semantically related to the current page

  • Traffic distribution: High-traffic queries should appear on high-traffic pages

  • Query diversity: Spread queries across the site; each query is capped at 10 appearances

  • URL uniqueness: Don't link to the same destination twice

  • Coverage: Every page should have 8 quality links

A naive approach (pure semantic similarity) might show the same queries on every page. A better approach uses a three-tier strategy with traffic-aware processing.

Three-Tier Strategy

We generate related searches using three tiers of decreasing relevance:

Tier 1: Related Searches (High Similarity)

High semantic similarity queries that are directly related to the page content:

  • Example page: "Treo N100 Mini PC"

  • Example queries: "n100 mini pc", "mini pc 8gb", "compact desktop n100"

These are the most relevant matches.

Tier 2: Popular Searches (Medium Similarity)

Medium similarity queries that are topically related but less specific:

  • Example page: "Treo N100 Mini PC"

  • Example queries: "mini pc", "small computer", "fanless pc"

These are broader queries in the same category.

Tier 3: Trending Searches (Fallback)

Global power queries (highest traffic) used when semantic matches are exhausted:

  • Example page: Any page

  • Example queries: "mini pc", "thin client", "industrial pc", "all in one pc"

These are the most popular queries site-wide.

The Algorithm: Traffic-First Processing

Step 1: Collect All Pages

We collect every page on the site:

  • Products (/p/): From Step 0 source data

  • Parts (/i/): From Step 0 source data

  • Articles (/a/): From Step 0 source data

  • Families (/f/): From Step 0 source data

  • Categories (/c/): From Step 0 source data

  • Query pages (/q/): From routing data (Step 7)

  • Static pages: Home, about, contact, etc.

This typically yields ~64,000 pages.
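As a sketch, the collection step can be expressed as a single merge-and-dedup pass. The input shapes and field names below are hypothetical stand-ins for the actual Step 0 and Step 7 data:

```python
def collect_all_pages(source_pages, query_routes, static_urls):
    """Merge every page source into one list of {'url', 'type'} dicts.

    source_pages: products/parts/articles/families/categories from Step 0.
    query_routes: routing records from Step 7 (each has a 'destination').
    static_urls:  home, about, contact, etc.
    """
    pages = []
    for page in source_pages:
        pages.append({'url': page['url'], 'type': page['type']})
    for route in query_routes:
        pages.append({'url': route['destination'], 'type': 'query'})
    for url in static_urls:
        pages.append({'url': url, 'type': 'static'})
    # Deduplicate by URL, keeping the first occurrence
    seen, unique = set(), []
    for p in pages:
        if p['url'] not in seen:
            seen.add(p['url'])
            unique.append(p)
    return unique
```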

Step 2: Load Embeddings

We load pre-computed embeddings for:

  1. Pages: Reuse from Step 0 (products, parts, articles) and Step 6 (query pages)
  2. Queries: From Step 3b (all 65K queries)

Reusing embeddings avoids re-embedding unchanged data:

  • Step 0 reuse: ~5,000 product/part/article embeddings

  • Step 6 reuse: ~12,500 query page embeddings

  • New embeddings: ~46,500 remaining pages

Step 3: Load Page Traffic

We query Athena for page views over the last 90 days:

traffic_data = generate_all_pages_traffic_index(lookback_days=90)

This returns a dictionary mapping URLs to view counts:

{
  "/p/Treo-N100-8-256-2H-W6-11P": 5000,
  "/q/mini-pc": 3000,
  "/": 50000
}

Step 4: Sort Pages by Traffic

We process pages in descending traffic order:

sorted_pages = sorted(all_pages, key=lambda p: page_traffic[p['url']], reverse=True)

This ensures high-traffic pages get first pick of queries.

Step 5: Compute Similarity (Batched)

We process pages in batches of 1,000 for memory efficiency:

batch_page_embeddings = all_page_embeddings[batch_indices]
batch_similarities = util.cos_sim(batch_page_embeddings, query_embeddings)
# ... (implementation details omitted)

This produces a similarity matrix for the batch.
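The batched computation can be reproduced with plain NumPy (a minimal sketch; `util.cos_sim` from sentence-transformers does the equivalent on tensors). After L2-normalizing both sides, cosine similarity for a whole batch reduces to one matrix multiply:

```python
import numpy as np

def batched_cos_sim(page_embeddings, query_embeddings, batch_size=1000):
    """Yield (batch_start_index, similarity_matrix) per batch of pages.

    Both inputs are 2D arrays of row vectors. Normalizing first means
    the dot product is the cosine similarity, and the matmul is
    BLAS-accelerated.
    """
    q = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)
    for start in range(0, len(page_embeddings), batch_size):
        batch = page_embeddings[start:start + batch_size]
        batch = batch / np.linalg.norm(batch, axis=1, keepdims=True)
        yield start, batch @ q.T   # shape: (pages_in_batch, n_queries)
```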

Step 6: Select Queries (Three-Tier)

For each page, we select queries using the three-tier strategy:

Tier 1: High Similarity (≥ 0.85)

for idx, similarity in enumerate(sorted_similarities):
    if similarity < threshold_high:
        break
    try_add_query(queries[idx])

Tier 2: Medium Similarity (0.60-0.85)

if len(selected_queries) < 8:
    for idx, similarity in enumerate(sorted_similarities):
        if similarity < 0.60:
            break
        try_add_query(queries[idx])

Tier 3: Power Queries (Fallback)

if len(selected_queries) < 8:
    for power_query in power_query_list:
        try_add_query(power_query)

Step 7: Enforce Constraints

The try_add_query() function enforces multiple constraints:

Query Usage Cap (10 appearances):

if query_usage[query] >= MAX_APPEARANCES:
    return False

URL Uniqueness (no duplicate destinations):

url = get_dest_url(query)
if url in selected_urls:
    return False
selected_urls.add(url)

Self-Link Prevention:

selected_urls = {page_url}  # Initialize with current page

Limit to 8 Links:

if len(selected_queries) >= 8:
    return False
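The individual checks above can be combined into a single closure scoped to one page. This is a sketch built from the snippets shown; the factory shape and `query_to_url_map` parameter are illustrative, not the pipeline's actual signature:

```python
MAX_APPEARANCES = 10
MAX_LINKS = 8

def make_try_add_query(page_url, query_usage, query_to_url_map):
    """Return (try_add_query, selected_queries) for one page."""
    selected_queries = []
    selected_urls = {page_url}   # seed with current page: no self-links

    def try_add_query(query):
        if len(selected_queries) >= MAX_LINKS:
            return False         # page already has its 8 links
        if query_usage.get(query, 0) >= MAX_APPEARANCES:
            return False         # query used up site-wide
        url = query_to_url_map.get(query)
        if url is None or url in selected_urls:
            return False         # unroutable, duplicate, or self-link
        selected_urls.add(url)
        selected_queries.append(query)
        query_usage[query] = query_usage.get(query, 0) + 1
        return True

    return try_add_query, selected_queries
```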

Step 8: Convert Queries to URLs

We use the routing data from Step 7 to convert queries to destination URLs:

query_to_url_map = {}
for route in routing_data['routes']:
    for query in route['queries']:
        query_to_url_map[query] = route['destination']

This creates a fast O(1) lookup for query destinations.

Step 9: Language Propagation (Optional)

If language propagation is enabled, we append the query language to the URL:

lang = detect_query_language(query)
if lang != "en":
    url = f"{url}?lang={lang}"

This ensures users stay in their preferred language when clicking related searches.

Step 10: Save to DynamoDB

We save the related searches to DynamoDB using batch writes:

with RelatedSearchWidget.batch_write() as batch:
    for page in batch_pages:
        widget = RelatedSearchWidget(
            page_url=page['url'],
            links=[...],
            tier_label="Related Searches"
        )
        batch.save(widget)

Batch writes group items into far fewer requests (DynamoDB accepts up to 25 items per BatchWriteItem call), which is significantly faster than writing items one at a time.

Query Usage Tracking

We track how many times each query has been used:

query_usage = {query: 0 for query in all_queries}
MAX_APPEARANCES = 10

As queries are selected, we increment their usage:

query_usage[query] += 1

Once a query reaches 10 appearances, it's excluded from future selections:

available_mask[query_idx] = False

This ensures query diversity across the site.

Traffic-First Processing Benefits

Processing pages by traffic (highest first) has advantages:

High-traffic pages get best matches:

  • Home page (50K views) → Top 8 most relevant queries

  • Popular product (5K views) → Next 8 most relevant queries

  • Niche product (100 views) → Remaining relevant queries

Automatic fallback:

  • If high-similarity queries are exhausted, medium-similarity queries are used

  • If medium-similarity queries are exhausted, power queries are used

Traffic distribution:

  • High-traffic queries appear on high-traffic pages

  • Low-traffic queries appear on niche pages

  • Maximizes overall click-through potential

Static Page Handling

Static pages (home, about, contact) skip Tier 1 and Tier 2, going straight to Tier 3:

if page_type == 'static':
    for power_query in power_query_list:
        try_add_query(power_query)
    tier_label = "Trending Searches"

This ensures static pages show the most popular queries site-wide.

Embedding Reuse Strategy

We reuse embeddings from previous pipeline steps:

Step 0 (Source Data):

  • Products: /p/Treo-N100-8-256-2H-W6-11P

  • Parts: /i/N100

  • Articles: /a/mini-pc-guide

Step 6 (Cluster Queries):

  • Query pages: /q/mini-pc (slug → query text)

New Embeddings:

  • Remaining pages not in Step 0 or Step 6

This reduces embedding time from ~2 hours to ~20 minutes.

Incremental Embedding

For pages that need new embeddings, we use incremental caching:

new_embeddings = incremental_embed_with_keys(
    items=page_contents,
    keys=page_urls,
    cache_embeddings_path=SEO_PAGE_EMBEDDINGS_PATH,
    cache_keys_path=SEO_PAGE_URLS_PATH,
    model_name='all-mpnet-base-v2'
)

This caches embeddings for future runs. See Embedding Strategy for details.
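A minimal in-memory version of this pattern looks like the sketch below (`incremental_embed_with_keys` itself persists the cache to the paths shown above; the `embed_fn` parameter is a hypothetical stand-in for the model call):

```python
import numpy as np

def incremental_embed(items, keys, cache, embed_fn):
    """Embed only the keys missing from cache; cache maps key -> vector.

    embed_fn(list_of_items) returns a 2D array, one row per item.
    Returns embeddings for all keys, in input order.
    """
    missing = [(k, item) for k, item in zip(keys, items) if k not in cache]
    if missing:
        new_vecs = embed_fn([item for _, item in missing])
        for (k, _), vec in zip(missing, new_vecs):
            cache[k] = vec
    return np.stack([cache[k] for k in keys])
```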

Output Format

The related searches are stored in DynamoDB:

{
  "page_url": "/p/Treo-N100-8-256-2H-W6-11P",
  "tier_label": "Related Searches",
  "links": [ ... ]
}

The web application queries this table to display related searches on each page.

Performance Characteristics

On a typical server:

  • Processing time: Varies based on page count

  • Memory usage: Depends on embedding size and batch size

  • DynamoDB writes: Batch writes for all pages

  • Embedding reuse: High percentage from incremental cache

The process is CPU-bound during similarity computation. Using NumPy with BLAS acceleration speeds up matrix operations significantly.

Integration with SEO Pipeline

Related search generation is Step 8 in the SEO pipeline:

  1. Step 0: Embed Source Data - Products, parts, articles
  2. Step 1: Fetch Queries - GSC, Google Ads, live, Algolia
  3. Step 2: Combine Queries - Merge all sources
  4. Step 3a: Generate Base Phrase Mappings - Initial filters
  5. Step 3b: Embed Queries - Convert to vectors
  6. Step 4: Expand Phrase Mappings - Find similar phrases
  7. Step 5: Cluster Queries - Group into pages
  8. Step 6: Match Products - Query-product matching
  9. Step 7: Build Query Pages - Generate HTML
  10. Step 8: Generate Related Searches ← You are here
  11. Step 11: Migrate to Valkey - Load into search service

See SEO Pipeline Overview for the complete flow.

Why Three Tiers?

Tier 1 (High Similarity) provides:

  • ✅ Maximum relevance

  • ✅ Best user experience

  • ❌ Limited coverage (not all pages have high-similarity matches)

Tier 2 (Medium Similarity) provides:

  • ✅ Good relevance

  • ✅ Broader coverage

  • ❌ Less specific matches

Tier 3 (Power Queries) provides:

  • ✅ Universal coverage (every page gets 8 links)

  • ✅ High click-through (popular queries)

  • ❌ Lower relevance

Together, they ensure every page has quality links with the best possible relevance.

Statistics and Monitoring

After generation, we calculate statistics:

  • Total pages: All pages in the system

  • Total outlinks: Pages × links per page

  • Queries used: Subset of available queries

  • Query distribution: Most queries appear on multiple pages

  • Unused queries: Low traffic or low similarity queries

We also log category and family page cross-linking for analysis.


Summary

We generate related searches for all pages using a three-tier strategy:

Three Tiers:

  • Tier 1: High similarity - "Related Searches"

  • Tier 2: Medium similarity - "Popular Searches"

  • Tier 3: Power queries (fallback) - "Trending Searches"

Traffic-First Processing:

  • Sort pages by traffic (highest first)

  • High-traffic pages get first pick of queries

  • Automatic fallback to lower tiers

Constraints:

  • Maximum appearances per query (diversity)

  • Maximum links per page (limit)

  • No duplicate destinations (uniqueness)

  • No self-links (prevention)

Embedding Reuse:

  • High reuse from previous steps

  • Incremental cache for new pages

The result is comprehensive related search coverage across the entire site with optimal relevance, traffic distribution, and query diversity.

