SEO Pipeline
This article explains how Thinvent generates query pages from search data. Our SEO pipeline processes queries from multiple sources, clusters them, matches them to products, and generates optimized pages for search engines.
Data Sources
We aggregate queries from multiple sources:
-
Google Search Console (GSC): Organic search queries with impressions and clicks
-
Google Ads: Search terms that triggered paid ads
-
Google Ads Keyword Ideas: Keyword suggestions with search volume
-
Live queries: Real-time search queries from our search service
These sources provide a comprehensive view of what users are searching for, combining historical data with real-time insights.
Pipeline Architecture
The SEO pipeline runs weekly on Sundays and consists of 11 steps:
flowchart TD
A[Step 0: Embed Source Data] --> B[Step 1a: Fetch GSC]
A --> C[Step 1b: Fetch Ads]
A --> D[Step 1c: Fetch Keywords]
A --> E[Step 1d: Fetch Live]
B --> G[Step 2: Combine Queries]
C --> G
D --> G
E --> G
G --> H[Step 3a: Generate Base Phrases]
G --> I[Step 3b: Embed Queries]
H --> J[Step 4: Expand Phrase Mappings]
I --> J
J --> K[Step 5: Cluster Queries]
I --> K
K --> L[Step 6: Match Source Data]
A --> L
L --> M[Step 7: Route Queries]
J --> M
K --> M
M --> N[Step 8a: Build Query Pages]
K --> O[Step 8b: Generate Related Searches]
N --> O[Step 8b: Generate Related Searches]Step-by-Step Process
Step 0: Embed Source Data
We embed product data using SentenceTransformer to create semantic embeddings. These embeddings are used later to match queries to products based on semantic similarity, not just keyword matching.
Steps 1a-1e: Fetch Queries
We fetch queries from multiple sources:
-
Google Search Console (GSC): Organic search queries with performance metrics
-
Google Ads: Search terms that triggered paid ads
-
Google Ads Keyword Ideas: Keyword suggestions with search volume
-
Live queries: Real-time search queries from our search service
Each source provides different insights into user intent.
Step 2: Combine Queries
We combine all queries into a single dataset, deduplicating and aggregating metrics like impressions and clicks.
Steps 3a-3b: Generate Phrases and Embed
We generate base phrases from product features and embed queries using SentenceTransformer. The phrase mappings are used to extract filters from search queries.
Step 4: Expand Phrase Mappings
We expand phrase mappings by:
-
Resolving memory/storage collisions (e.g., "8GB RAM" vs "8GB storage")
-
Building phrase-to-filter mappings
-
Extracting n-grams from queries
Step 5: Cluster Queries
We cluster similar queries using vector similarity. Queries that are semantically similar are grouped together and will share the same query page.
Step 6: Match Source Data
We match queries to products using:
-
Vector similarity between query embeddings and product embeddings
-
Filter extraction from phrase mappings
-
Product name matching
Step 7: Route Queries
We route queries to appropriate pages:
-
Find family matches (e.g., "Treo" family)
-
Find category matches (e.g., "Mini PC" category)
-
Generate slugs for query pages
Steps 8a-8b: Build Pages and Related Searches
We build query pages and generate related searches:
-
Step 8a: Build query pages with product lists
-
Step 8b: Generate related searches using vector similarity
Query Page Generation
Query pages are generated at /q/<slug> and include:
-
Title: Optimized for search engines
-
Description: AI-generated content
-
Products: Top matching products
-
Filters: Extracted from query
-
Related searches: Semantic similarity matches
AI Content Generation
We use AI to generate query page content:
-
DeepSeek: Product descriptions, query page content
-
System prompts: For caching efficiency
-
Temperature: 0.7 for balanced creativity
The AI generates:
-
Tagline: Short, compelling headline
-
Body: Detailed product information
Multilingual Support
Query pages support multiple languages:
-
English (source)
-
Spanish, French, German, Italian, Portuguese
-
Russian, Hindi, Bengali, Gujarati, Kannada
-
Malayalam, Marathi, Punjabi, Tamil, Telugu
-
Arabic, Chinese, Japanese, Korean
Summary
Our SEO pipeline provides:
-
Comprehensive data: Multiple query sources
-
Semantic matching: Vector similarity for better matches
-
AI content: Automated content generation
-
Multilingual: Support for 15+ languages
-
Automated: Weekly pipeline with checkpoints