Query Fetching: Gathering Search Data from Multiple Sources

This article explains how we collect search queries from five different sources to build a comprehensive dataset for the SEO pipeline.

The Problem: Incomplete Query Data

Relying on a single data source gives an incomplete picture:

  • Google Search Console: We keep only queries that led to at least one click (misses high-impression, low-CTR queries)

  • Google Ads: Only shows paid search terms (misses organic traffic)

  • Live search logs: Only shows on-site searches (misses external discovery)

  • Algolia: Only shows autocomplete queries (misses full searches)

We need to aggregate queries from all sources to understand the complete search landscape.

Overview: Five Query Sources

The pipeline fetches from five sources:

  • Google Search Console: Organic search queries with engagement metrics

  • Google Ads: Paid search terms with conversion data

  • Live search logs: On-site user queries

  • Algolia analytics: Autocomplete search queries

  • Keyword Planner: Related keyword suggestions

Each source provides different insights; combined, they form a comprehensive query dataset.

Step 1: Fetch Queries from Each Source

1.1 Google Search Console (GSC)

Purpose: Collect organic search queries that led to clicks.

Input: Website URL verified in Search Console.

Output: Query list with engagement metrics (queries, clicks, impressions, CTR, position).

Data specifications:

  • Query text

  • Clicks (number of times users clicked our result)

  • Impressions (number of times our result appeared)

  • CTR (click-through rate as percentage)

  • Average position in search results

Lookback period: 90 days (maximum for detailed query data)

Filtering: Minimum 1 click (excludes impression-only queries)

Access: Google Search Console API via service account with read-only scope

1.2 Google Ads Search Terms

Purpose: Collect paid search terms that triggered ads.

Input: Google Ads account with manager account access.

Output: Search term list with performance metrics (terms, impressions, clicks, cost).

Data specifications:

  • Search term text

  • Impressions (number of ad views)

  • Clicks (number of ad clicks)

  • Cost in micros (1,000,000 micros = 1 unit of the account currency)

Lookback period: 2 years (all historical data available)

Aggregation: Combines data across all authorized accounts

Filtering: Minimum 1 impression

Access: Google Ads API via OAuth2

1.3 Live Search Queries

Purpose: Collect on-site user search queries.

Input: Internal search logs from search service.

Output: Query list with frequency metrics (queries, counts, engagement).

Data specifications:

  • Query text

  • Frequency (number of times searched)

  • Engagement data (searches that led to product interactions)

Lookback period: Varies (typically 30-90 days based on log retention)

Weighting: Queries weighted by engagement (searches with interactions score higher)

Filtering: Deduplicate and count occurrences

1.4 Algolia Top Searches

Purpose: Collect autocomplete searches from Algolia analytics.

Input: Algolia search analytics data.

Output: Top search query list with frequency (top searches, counts).

Data specifications:

  • Query text

  • Search count (number of times searched)

Lookback period: 90 days

Filtering: Limit to recent high-volume searches

Access: Algolia Analytics API via application credentials

1.5 Keyword Planner Ideas

Purpose: Collect related keyword suggestions.

Input: Seed keywords (product categories, families, common search terms).

Output: Keyword list with metrics (keywords, search volume, competition, bid suggestions).

Data specifications:

  • Keyword text

  • Average monthly searches

  • Competition level (low, medium, high)

  • Suggested bid value

Filtering: Relevant keywords only (exclude unrelated suggestions)

Access: Google Ads Keyword Planner API via OAuth2

Step 2: Normalize and Aggregate Queries

2.1 Normalize Query Text

Queries are standardized before aggregation (a normalization sketch follows the list):

  • Convert to lowercase

  • Normalize whitespace (collapse multiple spaces to single)

  • Trim leading/trailing whitespace

  • Remove special characters (when applicable)
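
A minimal normalization helper might look like the following; the special-character rule is an assumption and would be tuned per catalog:

```python
import re

def normalize_query(text: str) -> str:
    """Lowercase, trim, collapse whitespace, and drop stray punctuation."""
    text = text.lower().strip()
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    text = re.sub(r"[^\w\s-]", "", text)   # drop special characters (tunable)
    return text

assert normalize_query("  Mini   PC! ") == "mini pc"
```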

2.2 Merge and Deduplicate

Queries from all sources are combined (a merge sketch follows the steps):

  1. Load all source query lists
  2. Deduplicate by normalized query text
  3. Merge metrics (combine clicks, impressions, searches)
  4. Track which sources contributed each query
  5. Sort by combined engagement score
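
One way to express the merge is sketched below, assuming each source produces records with a query field plus optional clicks/impressions/searches counts (field names are illustrative), and reusing the normalize_query helper from 2.1:

```python
from collections import defaultdict

def merge_queries(sources: dict[str, list[dict]]) -> list[dict]:
    """Combine per-source query lists into one record per normalized query."""
    merged: dict[str, dict] = defaultdict(
        lambda: {"clicks": 0, "impressions": 0, "searches": 0, "sources": set()}
    )
    for source_name, rows in sources.items():
        for row in rows:
            rec = merged[normalize_query(row["query"])]  # dedupe key
            rec["clicks"] += row.get("clicks", 0)
            rec["impressions"] += row.get("impressions", 0)
            rec["searches"] += row.get("searches", 0)
            rec["sources"].add(source_name)              # provenance
    # Clicks stand in here for the combined engagement score of step 2.3.
    return [
        {"query": q, **rec, "sources": sorted(rec["sources"])}
        for q, rec in sorted(merged.items(),
                             key=lambda kv: kv[1]["clicks"], reverse=True)
    ]
```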

2.3 Weight by Engagement Signals

Queries are scored based on engagement:

  • Clicks from GSC: Highest weight (direct engagement with our content)

  • Clicks from Ads: Highest weight (paid intent that converted to a click)

  • Live searches with engagement: Medium weight (on-site interaction)

  • Live searches without engagement: Lower weight (search alone)

  • Algolia autocomplete: Lower weight (partial, as-you-type intent)

  • Keyword ideas: Lowest weight (suggested, not observed searches)

Weight formula is configurable; engagement is the primary ranking signal.
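
A configurable weighting scheme could be as simple as the sketch below; the numeric weights are invented for illustration and would live in configuration:

```python
# Illustrative weights only; real values are configurable.
SOURCE_WEIGHTS = {
    "gsc_click": 1.0,       # direct engagement with our content
    "ads_click": 1.0,       # paid intent converted to a click
    "live_engaged": 0.6,    # on-site search with interaction
    "live_plain": 0.3,      # on-site search alone
    "algolia": 0.2,         # partial autocomplete intent
    "keyword_idea": 0.1,    # suggested, not observed
}

def engagement_score(signal_counts: dict[str, int]) -> float:
    """Weighted sum over signals, e.g. {"gsc_click": 12, "algolia": 40}."""
    return sum(SOURCE_WEIGHTS.get(s, 0.0) * n for s, n in signal_counts.items())
```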

2.4 Filter Low-Quality Queries

Invalid queries are removed (a filtering sketch follows the list):

  • Spam: Queries with only numbers, special characters, or injection attempts

  • Brand-only: Queries that are just our brand name alone

  • Low score: Queries below minimum engagement threshold
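
A filtering pass in this spirit might look like the following; the brand list, injection check, and threshold are placeholders:

```python
import re

BRAND_TERMS = {"acme"}  # hypothetical brand name

def is_low_quality(query: str, score: float, min_score: float = 0.5) -> bool:
    """True if the query should be dropped before clustering."""
    if re.fullmatch(r"[\d\W_]+", query):        # digits/punctuation only
        return True
    if query in BRAND_TERMS:                    # brand name alone
        return True
    if len(query) > 200 or "select " in query.lower():  # crude injection guard
        return True
    return score < min_score                    # engagement threshold
```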

Step 3: Technical Implementation

3.1 Query Fetching Process

Authentication and data fetching follow the same pattern for each source (a record-format sketch follows the steps):

  1. Obtain credentials (service account, OAuth2, or API key)
  2. Connect to respective API
  3. Request data for specified date range
  4. Parse response into standard JSON format
  5. Save to local storage
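
The exact record layout is not specified here; a plausible per-source JSON line might look like this (field names are assumptions):

```python
import json

record = {
    "query": "mini pc",            # raw query text from the source
    "source": "gsc",               # which fetcher produced it
    "clicks": 42,
    "impressions": 1800,
    "fetched_at": "2024-01-15T00:00:00Z",
}
with open("queries_gsc.jsonl", "a", encoding="utf-8") as fh:  # local storage
    fh.write(json.dumps(record) + "\n")
```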

3.2 Google Search Console Client

Process (sketched below):

  1. Authenticate with service account credentials
  2. Query Search Console API for specified site
  3. Paginate through results (each page returns up to 25,000 rows)
  4. Extract query, clicks, impressions, CTR, position
  5. Aggregate and save as JSON
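
As a sketch, the pagination loop with the official Google API client could look like this; the key-file path, property URL, and date range are placeholders:

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

creds = service_account.Credentials.from_service_account_file(
    "service-account.json",  # placeholder key file
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
gsc = build("searchconsole", "v1", credentials=creds)

rows, start_row = [], 0
while True:
    resp = gsc.searchanalytics().query(
        siteUrl="https://www.example.com/",  # placeholder property
        body={
            "startDate": "2024-01-01",
            "endDate": "2024-03-31",
            "dimensions": ["query"],
            "rowLimit": 25000,               # API page-size maximum
            "startRow": start_row,
        },
    ).execute()
    page = resp.get("rows", [])
    rows.extend(page)        # keys, clicks, impressions, ctr, position
    if len(page) < 25000:    # short page means we're done
        break
    start_row += 25000
```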

3.3 Google Ads Client

Process (sketched below):

  1. Authenticate with OAuth2 credentials
  2. Enumerate all authorized accounts
  3. Query each account for search terms
  4. Aggregate metrics across accounts
  5. Sort by impressions and save
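
A GAQL-based sketch with the google-ads client is shown below; the account list and date range are placeholders, and per-account results would still need aggregating as described above:

```python
from google.ads.googleads.client import GoogleAdsClient

GAQL = """
    SELECT search_term_view.search_term,
           metrics.impressions, metrics.clicks, metrics.cost_micros
    FROM search_term_view
    WHERE segments.date BETWEEN '2022-01-01' AND '2023-12-31'
"""

client = GoogleAdsClient.load_from_storage("google-ads.yaml")
ga_service = client.get_service("GoogleAdsService")
for customer_id in ["1234567890"]:           # placeholder authorized accounts
    for batch in ga_service.search_stream(customer_id=customer_id, query=GAQL):
        for row in batch.results:
            print(row.search_term_view.search_term,
                  row.metrics.impressions,
                  row.metrics.cost_micros / 1_000_000)  # micros -> currency
```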

3.4 Live Query Log Parsing

Process (sketched below):

  1. Read internal search log files
  2. Parse each log entry (JSON format, one per line)
  3. Extract query text and engagement signals
  4. Deduplicate queries and count occurrences
  5. Weight by engagement and save
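
The log-parsing step might be sketched as follows, assuming one JSON object per line with hypothetical query and product_clicked fields, and reusing the normalize_query helper from 2.1:

```python
import json
from collections import Counter

query_counts: Counter = Counter()
engaged_counts: Counter = Counter()

with open("search.log", encoding="utf-8") as fh:  # placeholder log path
    for line in fh:
        try:
            entry = json.loads(line)              # one JSON entry per line
        except json.JSONDecodeError:
            continue                              # skip malformed lines
        q = normalize_query(entry["query"])
        query_counts[q] += 1
        if entry.get("product_clicked"):          # assumed engagement field
            engaged_counts[q] += 1
```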

3.5 Algolia Analytics Fetch

Process (sketched below):

  1. Authenticate with Algolia credentials
  2. Call Analytics API for search data
  3. Request top searches with date range filter
  4. Return search count and result metrics
  5. Save to standard JSON format
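
Against the Algolia Analytics REST API, the top-searches request could look roughly like this; the application ID, API key, and index name are placeholders:

```python
import requests

APP_ID, API_KEY = "YourApplicationID", "YourAnalyticsAPIKey"  # placeholders

resp = requests.get(
    "https://analytics.algolia.com/2/searches",   # top-searches endpoint
    headers={
        "X-Algolia-Application-Id": APP_ID,
        "X-Algolia-API-Key": API_KEY,
    },
    params={
        "index": "products",                      # placeholder index name
        "startDate": "2024-01-01",
        "endDate": "2024-03-31",
        "limit": 1000,
    },
    timeout=30,
)
resp.raise_for_status()
for item in resp.json()["searches"]:
    print(item["search"], item["count"])          # query text and frequency
```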

Step 4: Incremental Updates

Query fetching supports incremental updates for faster processing:

First run:

  • Fetch all historical data (2 years for Ads, 90 days for GSC)

  • Compute embeddings (slower initial setup)

Subsequent runs:

  • Fetch only new data since last run

  • Reuse cached embeddings where possible

  • Incremental updates complete quickly

Optimization logic (a freshness-check sketch follows the list):

  • Check if output exists and is recent

  • Skip if no new data available

  • Resume from last checkpoint on failure
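
The skip-if-fresh check can be as small as the sketch below; the 24-hour window is an assumed default:

```python
import os
import time

def needs_refresh(path: str, max_age_hours: float = 24.0) -> bool:
    """Fetch only when the cached output is missing or stale."""
    if not os.path.exists(path):
        return True                                # first run: fetch everything
    age_seconds = time.time() - os.path.getmtime(path)
    return age_seconds > max_age_hours * 3600      # stale cache: refetch
```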

Step 5: Integration with Pipeline

Fetched queries feed into subsequent pipeline steps:

5.1 Query Clustering:

  • Input: Combined query list with engagement weights

  • Process: Group similar queries together using embeddings

  • Output: Query clusters for page generation

5.2 Product Matching:

  • Input: Query embeddings and clusters

  • Process: Match each cluster to relevant products

  • Output: Query-to-product mappings

5.3 Query Pages:

  • Input: Clustering and product matches

  • Process: Generate page content for each cluster

  • Output: HTML pages and routing data

5.4 Related Searches:

  • Input: Query embeddings and page data

  • Process: Find similar queries for each page

  • Output: Related search suggestions

See: SEO Pipeline Overview for full pipeline architecture

Data Quality Considerations

Handling Duplicates

Queries may appear in multiple sources with slight variations. Example:

  • "mini pc" (GSC)

  • "Mini PC" (Ads)

  • "mini pc" (Live with extra space)

Solution: Normalize before merging (lowercase, trim, collapse whitespace)

Handling Spam

Some sources contain invalid queries:

  • Random character strings

  • SQL injection attempts

  • Extremely long queries

Solution: Filter by engagement signals, character composition, length rules

Handling Seasonality

Query volume varies by season:

  • Higher volume during festival and sales periods

  • Lower volume during vacation lulls

Solution: Use a 90-day lookback so short-term peaks are averaged with surrounding weeks rather than dominating the dataset

Performance Characteristics

  • Fetch time: Varies by source; parallel fetching reduces overall time

  • Query volume: Thousands to millions depending on data sources

  • Aggregation time: Scales with total unique queries

  • Incremental updates: Significantly faster than full re-fetch

  • Resource usage: Moderate CPU for aggregation, memory for deduplication

Error Handling

The pipeline handles common failures (a retry sketch follows the list):

  • API unavailable: Retry with exponential backoff

  • Missing data: Skip and log; continue with other sources

  • Invalid entries: Filter out and continue processing

  • Partial failures: Save successful data and checkpoint position
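
A retry wrapper in this spirit, with invented defaults for attempts and delays:

```python
import random
import time

def fetch_with_backoff(fetch, max_attempts: int = 5):
    """Retry a flaky fetch callable with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise                         # give up; caller checkpoints
            time.sleep(2 ** attempt + random.random())  # 1s, 2s, 4s, ... + jitter
```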
