SEO Pipeline

This article explains how Thinvent generates query pages from search data. Our SEO pipeline processes queries from multiple sources, clusters them, matches them to products, and generates optimized pages for search engines.

Data Sources

We aggregate queries from multiple sources:

  • Google Search Console (GSC): Organic search queries with impressions and clicks

  • Google Ads: Search terms that triggered paid ads

  • Google Ads Keyword Ideas: Keyword suggestions with search volume

  • Live queries: Real-time search queries from our search service

These sources provide a comprehensive view of what users are searching for, combining historical data with real-time insights.

Pipeline Architecture

The SEO pipeline runs weekly on Sundays and consists of 11 steps:

flowchart TD
    A[Step 0: Embed Source Data] --> B[Step 1a: Fetch GSC]
    A --> C[Step 1b: Fetch Ads]
    A --> D[Step 1c: Fetch Keywords]
    A --> E[Step 1d: Fetch Live]
    
    B --> G[Step 2: Combine Queries]
    C --> G
    D --> G
    E --> G
    
    G --> H[Step 3a: Generate Base Phrases]
    G --> I[Step 3b: Embed Queries]
    
    H --> J[Step 4: Expand Phrase Mappings]
    I --> J
    
    J --> K[Step 5: Cluster Queries]
    I --> K
    
    K --> L[Step 6: Match Source Data]
    A --> L
    
    L --> M[Step 7: Route Queries]
    J --> M
    K --> M
    
    M --> N[Step 8a: Build Query Pages]
    K --> O[Step 8b: Generate Related Searches]
    
    N --> O[Step 8b: Generate Related Searches]

Step-by-Step Process

Step 0: Embed Source Data

We embed product data using SentenceTransformer to create semantic embeddings. These embeddings are used later to match queries to products based on semantic similarity, not just keyword matching.

Steps 1a-1e: Fetch Queries

We fetch queries from multiple sources:

Each source provides different insights into user intent.

Step 2: Combine Queries

We combine all queries into a single dataset, deduplicating and aggregating metrics like impressions and clicks.

Steps 3a-3b: Generate Phrases and Embed

We generate base phrases from product features and embed queries using SentenceTransformer. The phrase mappings are used to extract filters from search queries.

Step 4: Expand Phrase Mappings

We expand phrase mappings by:

  • Resolving memory/storage collisions (e.g., "8GB RAM" vs "8GB storage")

  • Building phrase-to-filter mappings

  • Extracting n-grams from queries

Step 5: Cluster Queries

We cluster similar queries using vector similarity. Queries that are semantically similar are grouped together and will share the same query page.

Step 6: Match Source Data

We match queries to products using:

  • Vector similarity between query embeddings and product embeddings

  • Filter extraction from phrase mappings

  • Product name matching

Step 7: Route Queries

We route queries to appropriate pages:

  • Find family matches (e.g., "Treo" family)

  • Find category matches (e.g., "Mini PC" category)

  • Generate slugs for query pages

Steps 8a-8b: Build Pages and Related Searches

We build query pages and generate related searches:

  • Step 8a: Build query pages with product lists

  • Step 8b: Generate related searches using vector similarity

Query Page Generation

Query pages are generated at /q/<slug> and include:

  • Title: Optimized for search engines

  • Description: AI-generated content

  • Products: Top matching products

  • Filters: Extracted from query

  • Related searches: Semantic similarity matches

AI Content Generation

We use AI to generate query page content:

  • DeepSeek: Product descriptions, query page content

  • System prompts: For caching efficiency

  • Temperature: 0.7 for balanced creativity

The AI generates:

  • Tagline: Short, compelling headline

  • Body: Detailed product information

Multilingual Support

Query pages support multiple languages:

  • English (source)

  • Spanish, French, German, Italian, Portuguese

  • Russian, Hindi, Bengali, Gujarati, Kannada

  • Malayalam, Marathi, Punjabi, Tamil, Telugu

  • Arabic, Chinese, Japanese, Korean

Summary

Our SEO pipeline provides:

  • Comprehensive data: Multiple query sources

  • Semantic matching: Vector similarity for better matches

  • AI content: Automated content generation

  • Multilingual: Support for 15+ languages

  • Automated: Weekly pipeline with checkpoints