Query Fetching: Gathering Search Data from Multiple Sources
This article explains how we collect search queries from five different sources to build a comprehensive dataset for the SEO pipeline.
The Problem: Incomplete Query Data
Relying on a single data source gives an incomplete picture:
- Google Search Console: Only shows queries that led to clicks (misses high-impression, low-CTR queries)
- Google Ads: Only shows paid search terms (misses organic traffic)
- Live search logs: Only shows on-site searches (misses external discovery)
- Algolia: Only shows autocomplete queries (misses full searches)
We need to aggregate queries from all sources to understand the complete search landscape.
Overview: Five Query Sources
The pipeline fetches from five sources:
- Google Search Console: Organic search queries with engagement metrics
- Google Ads: Paid search terms with conversion data
- Live search logs: On-site user queries
- Algolia analytics: Autocomplete search queries
- Keyword Planner: Related keyword suggestions
Each source provides different insights; combined they create a comprehensive query dataset.
Step 1: Fetch Queries from Each Source
1.1 Google Search Console (GSC)
Purpose: Collect organic search queries that led to clicks.
Input: Website URL verified in Search Console.
Output: Query list with engagement metrics (queries, clicks, impressions, CTR, position).
Data specifications:
- Query text
- Clicks (number of users who clicked our result)
- Impressions (number of times our result appeared)
- CTR (click-through rate as a percentage)
- Average position in search results
Lookback period: 90 days (maximum for detailed query data)
Filtering: Minimum 1 click (excludes impression-only queries)
Access: Google Search Console API via service account with read-only scope
1.2 Google Ads Search Terms
Purpose: Collect paid search terms that triggered ads.
Input: Google Ads account with manager account access.
Output: Search term list with performance metrics (terms, impressions, clicks, cost).
Data specifications:
- Search term text
- Impressions (number of ad views)
- Clicks (number of ad clicks)
- Cost in micros (1,000,000 micros = 1 currency unit)
Lookback period: 2 years (all historical data available)
Aggregation: Combines data across all authorized accounts
Filtering: Minimum 1 impression
Access: Google Ads API via OAuth2
1.3 Live Search Queries
Purpose: Collect on-site user search queries.
Input: Internal search logs from search service.
Output: Query list with frequency metrics (queries, counts, engagement).
Data specifications:
- Query text
- Frequency (number of times searched)
- Engagement data (searches that led to product interactions)
Lookback period: Varies (typically 30-90 days based on log retention)
Weighting: Queries weighted by engagement (searches with interactions score higher)
Filtering: Deduplicate and count occurrences
1.4 Algolia Top Searches
Purpose: Collect autocomplete searches from Algolia analytics.
Input: Algolia search analytics data.
Output: Top search query list with frequency (top searches, counts).
Data specifications:
- Query text
- Search count (number of times searched)
Lookback period: 90 days
Filtering: Limit to recent high-volume searches
Access: Algolia Analytics API via application credentials
1.5 Keyword Planner Ideas
Purpose: Collect related keyword suggestions.
Input: Seed keywords (product categories, families, common search terms).
Output: Keyword list with metrics (keywords, search volume, competition, bid suggestions).
Data specifications:
- Keyword text
- Average monthly searches
- Competition level (low, medium, high)
- Suggested bid value
Filtering: Relevant keywords only (exclude unrelated suggestions)
Access: Google Ads Keyword Planner API via OAuth2
Step 2: Normalize and Aggregate Queries
2.1 Normalize Query Text
Queries are standardized before aggregation:
- Convert to lowercase
- Normalize whitespace (collapse multiple spaces to single)
- Trim leading/trailing whitespace
- Remove special characters (when applicable)
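These normalization rules can be sketched as a small helper; the function name and the exact special-character rule are illustrative, not taken from the pipeline code:

```python
import re

def normalize_query(query: str) -> str:
    """Standardize a raw query string before aggregation."""
    q = query.lower()                # convert to lowercase
    q = re.sub(r"\s+", " ", q)      # collapse runs of whitespace to one space
    q = q.strip()                    # trim leading/trailing whitespace
    q = re.sub(r"[^\w\s]", "", q)   # drop special characters (the "when applicable" step)
    return q
```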
2.2 Merge and Deduplicate
Queries from all sources are combined:
- Load all source query lists
- Deduplicate by normalized query text
- Merge metrics (combine clicks, impressions, searches)
- Track which sources contributed each query
- Sort by combined engagement score
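A minimal sketch of the merge step, assuming each source has already been parsed into a list of dicts with a normalized `query` field (field and source names are illustrative):

```python
from collections import defaultdict

def merge_sources(source_lists: dict) -> dict:
    """Merge per-source query lists into one deduplicated mapping.

    source_lists: {"gsc": [{"query": ..., "clicks": ..., ...}, ...], ...}
    Assumes query text is already normalized.
    """
    merged = defaultdict(lambda: {
        "clicks": 0, "impressions": 0, "searches": 0, "sources": set(),
    })
    for source, rows in source_lists.items():
        for row in rows:
            entry = merged[row["query"]]          # deduplicate by query text
            entry["clicks"] += row.get("clicks", 0)
            entry["impressions"] += row.get("impressions", 0)
            entry["searches"] += row.get("searches", 0)
            entry["sources"].add(source)          # track contributing sources
    return dict(merged)
```

Sorting by the combined engagement score then happens over the merged dict.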
2.3 Weight by Engagement Signals
Queries are scored based on engagement:
- Clicks from GSC: Highest weight (direct engagement with our content)
- Clicks from Ads: Highest weight (paid intent converted to action)
- Live searches with engagement: Medium weight (on-site interaction)
- Live searches without engagement: Lower weight (search alone)
- Algolia autocomplete: Lower weight (incomplete intent)
- Keyword ideas: Lowest weight (suggested, not observed searches)
The weight formula is configurable; engagement is the primary ranking signal.
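A sketch of the scoring step; the weight values below are illustrative placeholders, since the real formula is configurable:

```python
# Illustrative weights reflecting the ordering above; actual values are config-driven.
WEIGHTS = {
    "gsc_clicks": 1.0,     # highest: direct engagement
    "ads_clicks": 1.0,     # highest: paid intent converted to action
    "live_engaged": 0.6,   # medium: on-site interaction
    "live_plain": 0.3,     # lower: search alone
    "algolia": 0.2,        # lower: incomplete intent
    "keyword_ideas": 0.1,  # lowest: suggested, not observed
}

def engagement_score(query_metrics: dict, weights: dict = WEIGHTS) -> float:
    """Combine per-source counts into a single ranking score."""
    return sum(weights[key] * query_metrics.get(key, 0) for key in weights)
```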
2.4 Filter Low-Quality Queries
Invalid queries are removed:
- Spam: Queries with only numbers, special characters, or injection attempts
- Brand-only: Queries that are just our brand name
- Low score: Queries below the minimum engagement threshold
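The three filters can be sketched as one predicate; the brand name, injection heuristic, and score threshold below are placeholders, not the pipeline's real rules:

```python
import re

def is_valid_query(query: str, score: float,
                   brand: str = "acme", min_score: float = 0.5) -> bool:
    """Drop spam, brand-only, and low-score queries (query assumed normalized)."""
    if not re.search(r"[a-z]", query):      # only digits/symbols: likely spam
        return False
    if re.search(r"[<>;]|--", query):       # crude injection-attempt heuristic
        return False
    if query.strip() == brand:              # brand name alone
        return False
    return score >= min_score               # minimum engagement threshold
```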
Step 3: Technical Implementation
3.1 Query Fetching Process
Authentication and data fetching for each source:
- Obtain credentials (service account, OAuth2, or API key)
- Connect to respective API
- Request data for specified date range
- Parse response into standard JSON format
- Save to local storage
3.2 Google Search Console Client
Process:
- Authenticate with service account credentials
- Query Search Console API for specified site
- Paginate through results (each page returns up to 25,000 results)
- Extract query, clicks, impressions, CTR, position
- Aggregate and save as JSON
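The pagination loop can be sketched as follows. The `fetch_page` callable stands in for the actual Search Analytics API call (e.g. a wrapper around the authenticated `searchanalytics().query()` client); request field names follow the Search Analytics `query` endpoint, but verify against current API documentation:

```python
PAGE_SIZE = 25_000  # Search Analytics API maximum rowLimit per page

def gsc_request_body(start_date: str, end_date: str, start_row: int) -> dict:
    """Request body for one page of query-dimension results."""
    return {
        "startDate": start_date,
        "endDate": end_date,
        "dimensions": ["query"],
        "rowLimit": PAGE_SIZE,
        "startRow": start_row,
    }

def fetch_all_queries(fetch_page, start_date: str, end_date: str) -> list:
    """Paginate until a short (or empty) page signals the last page."""
    rows, start_row = [], 0
    while True:
        page = fetch_page(gsc_request_body(start_date, end_date, start_row))
        rows.extend(page)
        if len(page) < PAGE_SIZE:   # short page: no more results
            break
        start_row += PAGE_SIZE
    return rows
```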
3.3 Google Ads Client
Process:
- Authenticate with OAuth2 credentials
- Enumerate all authorized accounts
- Query each account for search terms
- Aggregate metrics across accounts
- Sort by impressions and save
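A sketch of the report query and the micros conversion. The GAQL below targets the `search_term_view` resource; field names follow the Google Ads API but should be verified against the current API version:

```python
# GAQL for the search terms report; the caller handles account
# enumeration, authentication, and the actual date range.
SEARCH_TERMS_GAQL = """
    SELECT
      search_term_view.search_term,
      metrics.impressions,
      metrics.clicks,
      metrics.cost_micros
    FROM search_term_view
    WHERE segments.date DURING LAST_30_DAYS
"""

def micros_to_currency(cost_micros: int) -> float:
    """Google Ads reports cost in micros: 1,000,000 micros = 1 currency unit."""
    return cost_micros / 1_000_000
```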
3.4 Live Query Log Parsing
Process:
- Read internal search log files
- Parse each log entry (JSON format, one per line)
- Extract query text and engagement signals
- Deduplicate queries and count occurrences
- Weight by engagement and save
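A minimal sketch of the log parser, assuming one JSON object per line; the `query` and `engaged` field names and the engagement weight are illustrative:

```python
import json
from collections import Counter

def parse_live_logs(lines, engaged_weight: float = 2.0) -> Counter:
    """Parse JSON-lines search logs into weighted per-query counts."""
    scores = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue                              # skip malformed log entries
        query = entry.get("query", "").strip().lower()
        if not query:
            continue
        # searches that led to a product interaction score higher
        scores[query] += engaged_weight if entry.get("engaged") else 1.0
    return scores
```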
3.5 Algolia Analytics Fetch
Process:
- Authenticate with Algolia credentials
- Call Analytics API for search data
- Request top searches with date range filter
- Return search count and result metrics
- Save to standard JSON format
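A sketch of the request a client would send. The endpoint and parameter names are taken from the Algolia Analytics REST API (`GET /2/searches` for top searches), but verify them against current documentation before use:

```python
def algolia_top_searches_request(app_id: str, api_key: str, index: str,
                                 start_date: str, end_date: str,
                                 limit: int = 1000):
    """Return (url, params, headers) for a top-searches analytics request."""
    url = "https://analytics.algolia.com/2/searches"
    params = {
        "index": index,            # which search index to report on
        "startDate": start_date,   # e.g. "2024-01-01"
        "endDate": end_date,
        "limit": limit,            # cap to recent high-volume searches
    }
    headers = {
        "X-Algolia-Application-Id": app_id,
        "X-Algolia-API-Key": api_key,
    }
    return url, params, headers
```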
Step 4: Incremental Updates
Query fetching supports incremental updates for faster processing:
First run:
- Fetch all historical data (2 years for Ads, 90 days for GSC)
- Compute embeddings (slower initial setup)
Subsequent runs:
- Fetch only new data since last run
- Reuse cached embeddings where possible
- Incremental updates complete quickly
Optimization logic:
- Check if output exists and is recent
- Skip if no new data available
- Resume from last checkpoint on failure
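The "exists and is recent" check can be sketched as below; the 24-hour threshold is an illustrative default, not the pipeline's actual setting:

```python
import os
import time

def needs_refresh(output_path: str, max_age_hours: float = 24.0) -> bool:
    """Return True when a source should be re-fetched.

    Skips the fetch when the cached output file exists and its
    modification time is within the freshness window.
    """
    if not os.path.exists(output_path):
        return True                               # no cached output yet
    age_hours = (time.time() - os.path.getmtime(output_path)) / 3600
    return age_hours > max_age_hours              # stale: fetch again
```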
Step 5: Integration with Pipeline
Fetched queries feed into subsequent pipeline steps:
5.1 Query Clustering:
- Input: Combined query list with engagement weights
- Process: Group similar queries together using embeddings
- Output: Query clusters for page generation
5.2 Product Matching:
- Input: Query embeddings and clusters
- Process: Match each cluster to relevant products
- Output: Query-to-product mappings
5.3 Query Pages:
- Input: Clustering and product matches
- Process: Generate page content for each cluster
- Output: HTML pages and routing data
5.4 Related Searches:
- Input: Query embeddings and page data
- Process: Find similar queries for each page
- Output: Related search suggestions
See: SEO Pipeline Overview for full pipeline architecture
Data Quality Considerations
Handling Duplicates
Queries may appear in multiple sources with slight variations. Example:
- "mini pc" (GSC)
- "Mini PC" (Ads)
- "mini  pc" (Live, with extra whitespace)
Solution: Normalize before merging (lowercase, trim, collapse whitespace)
Handling Spam
Some sources contain invalid queries:
- Random character strings
- SQL injection attempts
- Extremely long queries
Solution: Filter by engagement signals, character composition, length rules
Handling Seasonality
Query volume varies by season:
- Higher volume during festivals
- Lower volume during holidays
Solution: Use 90-day lookback period to smooth seasonal variations
Performance Characteristics
- Fetch time: Varies by source; parallel fetching reduces overall time
- Query volume: Thousands to millions depending on data sources
- Aggregation time: Scales with total unique queries
- Incremental updates: Significantly faster than a full re-fetch
- Resource usage: Moderate CPU for aggregation, memory for deduplication
Error Handling
The pipeline handles common failures:
- API unavailable: Retry with exponential backoff
- Missing data: Skip and log; continue with other sources
- Invalid entries: Filter out and continue processing
- Partial failures: Save successful data and checkpoint position
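The retry behavior can be sketched as a small wrapper; the attempt count and base delay are illustrative defaults:

```python
import random
import time

def with_retries(fetch, max_attempts: int = 5, base_delay: float = 1.0):
    """Call fetch(), retrying failures with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                   # out of attempts
            # delay doubles each attempt: base, 2*base, 4*base, ...
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```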
See Also
- SEO Pipeline Overview — Complete pipeline architecture
- Query Clustering — How similar queries are grouped
- Product Matching — Matching queries to products
- Embedding Strategy — Query and product embeddings
- Related Search Generation — Building related links
References
APIs and Services
- Google Search Console API — Official documentation
- Google Ads API — Official documentation
- Algolia Analytics API — Official documentation