Query Fetching: Gathering Search Data from Multiple Sources

This article explains how we collect search queries from five different sources to build a comprehensive dataset for the SEO pipeline.

The Problem: Incomplete Query Data

Relying on a single data source gives an incomplete picture:

  • Google Search Console: We keep only queries that led to at least one click (misses high-impression, low-CTR queries)

  • Google Ads: Only shows paid search terms (misses organic traffic)

  • Live search logs: Only shows on-site searches (misses external discovery)

  • Algolia: Only shows autocomplete queries (misses full searches)

We need to aggregate queries from all sources to understand the complete search landscape.

Overview: Five Query Sources

The pipeline fetches from five sources:

  • Google Search Console: Organic search queries with engagement metrics

  • Google Ads: Paid search terms with conversion data

  • Live search logs: On-site user queries

  • Algolia analytics: Autocomplete search queries

  • Keyword Planner: Related keyword suggestions

Each source provides different insights; combined, they form a comprehensive query dataset.

Step 1: Fetch Queries from Each Source

1.1 Google Search Console (GSC)

Purpose: Collect organic search queries that led to clicks.

Input: Website URL verified in Search Console.

Output: Query list with engagement metrics (queries, clicks, impressions, CTR, position).

Data specifications:

  • Query text

  • Clicks (number of times users clicked our result)

  • Impressions (number of times our result appeared)

  • CTR (click-through rate as percentage)

  • Average position in search results

Lookback period: 90 days (maximum for detailed query data)

Filtering: Minimum 1 click (excludes impression-only queries)

Access: Google Search Console API via service account with read-only scope

1.2 Google Ads Search Terms

Purpose: Collect paid search terms that triggered ads.

Input: Google Ads account with manager account access.

Output: Search term list with performance metrics (terms, impressions, clicks, cost).

Data specifications:

  • Search term text

  • Impressions (number of ad views)

  • Clicks (number of ad clicks)

  • Cost in micros (1,000,000 micros = 1 unit of the account currency)

Lookback period: 2 years (all historical data available)

Aggregation: Combines data across all authorized accounts

Filtering: Minimum 1 impression

Access: Google Ads API via OAuth2

1.3 Live Search Queries

Purpose: Collect on-site user search queries.

Input: Internal search logs from search service.

Output: Query list with frequency metrics (queries, counts, engagement).

Data specifications:

  • Query text

  • Frequency (number of times searched)

  • Engagement data (searches that led to product interactions)

Lookback period: Varies (typically 30-90 days based on log retention)

Weighting: Queries weighted by engagement (searches with interactions score higher)

Filtering: Deduplicate and count occurrences

1.4 Algolia Top Searches

Purpose: Collect autocomplete searches from Algolia analytics.

Input: Algolia search analytics data.

Output: Top search query list with frequency (top searches, counts).

Data specifications:

  • Query text

  • Search count (number of times searched)

Lookback period: 90 days

Filtering: Limit to recent high-volume searches

Access: Algolia Analytics API via application credentials

1.5 Keyword Planner Ideas

Purpose: Collect related keyword suggestions.

Input: Seed keywords (product categories, families, common search terms).

Output: Keyword list with metrics (keywords, search volume, competition, bid suggestions).

Data specifications:

  • Keyword text

  • Average monthly searches

  • Competition level (low, medium, high)

  • Suggested bid value

Filtering: Relevant keywords only (exclude unrelated suggestions)

Access: Google Ads Keyword Planner API via OAuth2

Step 2: Normalize and Aggregate Queries

2.1 Normalize Query Text

Queries are standardized before aggregation (a normalization sketch follows the list):

  • Convert to lowercase

  • Normalize whitespace (collapse multiple spaces to single)

  • Trim leading/trailing whitespace

  • Remove special characters (when applicable)
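
A minimal normalization helper might look like the following; the special-character rule is an assumption and would be tuned per catalog:

```python
import re

def normalize_query(text: str) -> str:
    """Lowercase, trim, collapse whitespace, and drop stray punctuation."""
    text = text.lower().strip()
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    text = re.sub(r"[^\w\s-]", "", text)   # drop special characters (tunable)
    return text

assert normalize_query("  Mini   PC! ") == "mini pc"
```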

2.2 Merge and Deduplicate

Queries from all sources are combined (a merge sketch follows the steps):

  1. Load all source query lists
  2. Deduplicate by normalized query text
  3. Merge metrics (combine clicks, impressions, searches)
  4. Track which sources contributed each query
  5. Sort by combined engagement score
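
One way to express the merge is sketched below, assuming each source produces records with a query field plus optional clicks/impressions/searches counts (field names are illustrative), and reusing the normalize_query helper from 2.1:

```python
from collections import defaultdict

def merge_queries(sources: dict[str, list[dict]]) -> list[dict]:
    """Combine per-source query lists into one record per normalized query."""
    merged: dict[str, dict] = defaultdict(
        lambda: {"clicks": 0, "impressions": 0, "searches": 0, "sources": set()}
    )
    for source_name, rows in sources.items():
        for row in rows:
            rec = merged[normalize_query(row["query"])]  # dedupe key
            rec["clicks"] += row.get("clicks", 0)
            rec["impressions"] += row.get("impressions", 0)
            rec["searches"] += row.get("searches", 0)
            rec["sources"].add(source_name)              # provenance
    # Clicks stand in here for the combined engagement score of step 2.3.
    return [
        {"query": q, **rec, "sources": sorted(rec["sources"])}
        for q, rec in sorted(merged.items(),
                             key=lambda kv: kv[1]["clicks"], reverse=True)
    ]
```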

2.3 Weight by Engagement Signals

Queries are scored based on engagement:

  • Clicks from GSC: Highest weight (direct engagement with our content)

  • Clicks from Ads: Highest weight (paid intent that converted to a click)

  • Live searches with engagement: Medium weight (on-site interaction)

  • Live searches without engagement: Lower weight (search alone)

  • Algolia autocomplete: Lower weight (partial, as-you-type intent)

  • Keyword ideas: Lowest weight (suggested, not observed searches)

Weight formula is configurable; engagement is the primary ranking signal.
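
A configurable weighting scheme could be as simple as the sketch below; the numeric weights are invented for illustration and would live in configuration:

```python
# Illustrative weights only; real values are configurable.
SOURCE_WEIGHTS = {
    "gsc_click": 1.0,       # direct engagement with our content
    "ads_click": 1.0,       # paid intent converted to a click
    "live_engaged": 0.6,    # on-site search with interaction
    "live_plain": 0.3,      # on-site search alone
    "algolia": 0.2,         # partial autocomplete intent
    "keyword_idea": 0.1,    # suggested, not observed
}

def engagement_score(signal_counts: dict[str, int]) -> float:
    """Weighted sum over signals, e.g. {"gsc_click": 12, "algolia": 40}."""
    return sum(SOURCE_WEIGHTS.get(s, 0.0) * n for s, n in signal_counts.items())
```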

2.4 Filter Low-Quality Queries

Invalid queries are removed (a filtering sketch follows the list):

  • Spam: Queries with only numbers, special characters, or injection attempts

  • Brand-only: Queries that are just our brand name alone

  • Low score: Queries below minimum engagement threshold
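
A filtering pass in this spirit might look like the following; the brand list, injection check, and threshold are placeholders:

```python
import re

BRAND_TERMS = {"acme"}  # hypothetical brand name

def is_low_quality(query: str, score: float, min_score: float = 0.5) -> bool:
    """True if the query should be dropped before clustering."""
    if re.fullmatch(r"[\d\W_]+", query):        # digits/punctuation only
        return True
    if query in BRAND_TERMS:                    # brand name alone
        return True
    if len(query) > 200 or "select " in query.lower():  # crude injection guard
        return True
    return score < min_score                    # engagement threshold
```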

Step 3: Technical Implementation

3.1 Query Fetching Process

Authentication and data fetching follow the same pattern for each source (a record-format sketch follows the steps):

  1. Obtain credentials (service account, OAuth2, or API key)
  2. Connect to respective API
  3. Request data for specified date range
  4. Parse response into standard JSON format
  5. Save to local storage
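
The exact record layout is not specified here; a plausible per-source JSON line might look like this (field names are assumptions):

```python
import json

record = {
    "query": "mini pc",            # raw query text from the source
    "source": "gsc",               # which fetcher produced it
    "clicks": 42,
    "impressions": 1800,
    "fetched_at": "2024-01-15T00:00:00Z",
}
with open("queries_gsc.jsonl", "a", encoding="utf-8") as fh:  # local storage
    fh.write(json.dumps(record) + "\n")
```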

3.2 Google Search Console Client

Process (sketched below):

  1. Authenticate with service account credentials
  2. Query Search Console API for specified site
  3. Paginate through results (each page returns up to 25,000 rows)
  4. Extract query, clicks, impressions, CTR, position
  5. Aggregate and save as JSON
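
As a sketch, the pagination loop with the official Google API client could look like this; the key-file path, property URL, and date range are placeholders:

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

creds = service_account.Credentials.from_service_account_file(
    "service-account.json",  # placeholder key file
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
gsc = build("searchconsole", "v1", credentials=creds)

rows, start_row = [], 0
while True:
    resp = gsc.searchanalytics().query(
        siteUrl="https://www.example.com/",  # placeholder property
        body={
            "startDate": "2024-01-01",
            "endDate": "2024-03-31",
            "dimensions": ["query"],
            "rowLimit": 25000,               # API page-size maximum
            "startRow": start_row,
        },
    ).execute()
    page = resp.get("rows", [])
    rows.extend(page)        # keys, clicks, impressions, ctr, position
    if len(page) < 25000:    # short page means we're done
        break
    start_row += 25000
```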

3.3 Google Ads Client

Process (sketched below):

  1. Authenticate with OAuth2 credentials
  2. Enumerate all authorized accounts
  3. Query each account for search terms
  4. Aggregate metrics across accounts
  5. Sort by impressions and save
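
A GAQL-based sketch with the google-ads client is shown below; the account list and date range are placeholders, and per-account results would still need aggregating as described above:

```python
from google.ads.googleads.client import GoogleAdsClient

GAQL = """
    SELECT search_term_view.search_term,
           metrics.impressions, metrics.clicks, metrics.cost_micros
    FROM search_term_view
    WHERE segments.date BETWEEN '2022-01-01' AND '2023-12-31'
"""

client = GoogleAdsClient.load_from_storage("google-ads.yaml")
ga_service = client.get_service("GoogleAdsService")
for customer_id in ["1234567890"]:           # placeholder authorized accounts
    for batch in ga_service.search_stream(customer_id=customer_id, query=GAQL):
        for row in batch.results:
            print(row.search_term_view.search_term,
                  row.metrics.impressions,
                  row.metrics.cost_micros / 1_000_000)  # micros -> currency
```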

3.4 Live Query Log Parsing

Process (sketched below):

  1. Read internal search log files
  2. Parse each log entry (JSON format, one per line)
  3. Extract query text and engagement signals
  4. Deduplicate queries and count occurrences
  5. Weight by engagement and save
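
The log-parsing step might be sketched as follows, assuming one JSON object per line with hypothetical query and product_clicked fields, and reusing the normalize_query helper from 2.1:

```python
import json
from collections import Counter

query_counts: Counter = Counter()
engaged_counts: Counter = Counter()

with open("search.log", encoding="utf-8") as fh:  # placeholder log path
    for line in fh:
        try:
            entry = json.loads(line)              # one JSON entry per line
        except json.JSONDecodeError:
            continue                              # skip malformed lines
        q = normalize_query(entry["query"])
        query_counts[q] += 1
        if entry.get("product_clicked"):          # assumed engagement field
            engaged_counts[q] += 1
```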

3.5 Algolia Analytics Fetch

Process (sketched below):

  1. Authenticate with Algolia credentials
  2. Call Analytics API for search data
  3. Request top searches with date range filter
  4. Return search count and result metrics
  5. Save to standard JSON format
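
Against the Algolia Analytics REST API, the top-searches request could look roughly like this; the application ID, API key, and index name are placeholders:

```python
import requests

APP_ID, API_KEY = "YourApplicationID", "YourAnalyticsAPIKey"  # placeholders

resp = requests.get(
    "https://analytics.algolia.com/2/searches",   # top-searches endpoint
    headers={
        "X-Algolia-Application-Id": APP_ID,
        "X-Algolia-API-Key": API_KEY,
    },
    params={
        "index": "products",                      # placeholder index name
        "startDate": "2024-01-01",
        "endDate": "2024-03-31",
        "limit": 1000,
    },
    timeout=30,
)
resp.raise_for_status()
for item in resp.json()["searches"]:
    print(item["search"], item["count"])          # query text and frequency
```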

Step 4: Incremental Updates

Query fetching supports incremental updates for faster processing:

First run:

  • Fetch all historical data (2 years for Ads, 90 days for GSC)

  • Compute embeddings (slower initial setup)

Subsequent runs:

  • Fetch only new data since last run

  • Reuse cached embeddings where possible

  • Incremental updates complete quickly

Optimization logic (a freshness-check sketch follows the list):

  • Check if output exists and is recent

  • Skip if no new data available

  • Resume from last checkpoint on failure
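
The skip-if-fresh check can be as small as the sketch below; the 24-hour window is an assumed default:

```python
import os
import time

def needs_refresh(path: str, max_age_hours: float = 24.0) -> bool:
    """Fetch only when the cached output is missing or stale."""
    if not os.path.exists(path):
        return True                                # first run: fetch everything
    age_seconds = time.time() - os.path.getmtime(path)
    return age_seconds > max_age_hours * 3600      # stale cache: refetch
```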

Step 5: Integration with Pipeline

Fetched queries feed into subsequent pipeline steps:

5.1 Query Clustering:

  • Input: Combined query list with engagement weights

  • Process: Group similar queries together using embeddings

  • Output: Query clusters for page generation

5.2 Product Matching:

  • Input: Query embeddings and clusters

  • Process: Match each cluster to relevant products

  • Output: Query-to-product mappings

5.3 Query Pages:

  • Input: Clustering and product matches

  • Process: Generate page content for each cluster

  • Output: HTML pages and routing data

5.4 Related Searches:

  • Input: Query embeddings and page data

  • Process: Find similar queries for each page

  • Output: Related search suggestions

See: SEO Pipeline Overview for full pipeline architecture

Data Quality Considerations

Handling Duplicates

Queries may appear in multiple sources with slight variations. Example:

  • "mini pc" (GSC)

  • "Mini PC" (Ads)

  • "mini pc" (Live with extra space)

Solution: Normalize before merging (lowercase, trim, collapse whitespace)

Handling Spam

Some sources contain invalid queries:

  • Random character strings

  • SQL injection attempts

  • Extremely long queries

Solution: Filter by engagement signals, character composition, length rules

Handling Seasonality

Query volume varies by season:

  • Higher volume during festival and sales periods

  • Lower volume during vacation lulls

Solution: Use a 90-day lookback so short-term peaks are averaged with surrounding weeks rather than dominating the dataset

Performance Characteristics

  • Fetch time: Varies by source; parallel fetching reduces overall time

  • Query volume: Thousands to millions depending on data sources

  • Aggregation time: Scales with total unique queries

  • Incremental updates: Significantly faster than full re-fetch

  • Resource usage: Moderate CPU for aggregation, memory for deduplication

Error Handling

The pipeline handles common failures (a retry sketch follows the list):

  • API unavailable: Retry with exponential backoff

  • Missing data: Skip and log; continue with other sources

  • Invalid entries: Filter out and continue processing

  • Partial failures: Save successful data and checkpoint position
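
A retry wrapper in this spirit, with invented defaults for attempts and delays:

```python
import random
import time

def fetch_with_backoff(fetch, max_attempts: int = 5):
    """Retry a flaky fetch callable with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise                         # give up; caller checkpoints
            time.sleep(2 ** attempt + random.random())  # 1s, 2s, 4s, ... + jitter
```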
