Phrase-to-Filter Mappings: Semantic Search for Product Filters
This article explains how we generate and expand phrase-to-filter mappings that power our filter extraction system. These mappings connect natural language phrases from search queries to structured product filters.
The Problem: Extracting Filters from Natural Language
When users search for "mini pc with 16gb ram," we need to extract:
- Form Factor: Mini PC
- Main Memory: 16
But users express the same intent in many ways:
- "16gb ram mini pc"
- "mini computer 16 gb memory"
- "small pc 16gb"
- "compact desktop with 16 gigs"
We need a system that maps all these phrase variations to the correct filter values.
Two-Step Process
We generate phrase mappings in two steps:
- Step 3a: Generate base phrases from product features using permutations and feature-specific rules
- Step 4: Expand with semantic similarity using embeddings
This hybrid approach combines rule-based precision with semantic flexibility.
Step 3a: Base Phrase Generation
Feature Metadata
Every product feature has metadata in the feature sequence:
- Heading: Category name (e.g., "Processing" for processor features)
- Unit: Measurement unit (e.g., "GB" for memory, "GHz" for frequency)
This metadata guides phrase generation.
Permutation-Based Generation
For each feature value, we generate all permutations of:
- Heading: "processor"
- Feature key: "series"
- Value: "i5"
- Unit: (none for series)
This produces:
- "processor series i5"
- "series processor i5"
- "i5 processor series"
- "i5 series processor"
- "processor i5"
- "series i5"
- "i5"
All permutations ensure we match queries regardless of word order.
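The permutation step can be sketched in Python. This is a simplified illustration with a hypothetical function name; the real generator layers feature-specific rules on top of these permutations and may prune some orderings:

```python
from itertools import permutations

def generate_base_phrases(heading, key, value, unit=None):
    """Generate word-order permutations of a feature's components.

    Every emitted phrase must contain the value, so "processor series"
    alone is never produced.
    """
    components = [c for c in (heading, key, value) if c]
    if unit:
        # Attach the unit directly to the value ("16" + "gb" -> "16gb").
        components[-1] = f"{value}{unit}"
    phrases = set()
    for size in range(1, len(components) + 1):
        for combo in permutations(components, size):
            if any(value in part for part in combo):
                phrases.add(" ".join(combo))
    return sorted(phrases)

print(generate_base_phrases("processor", "series", "i5"))
```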
Value + Unit Combinations
For features with units (memory, storage, frequency), we generate both spaced and non-spaced variants:
- "16 gb" (spaced)
- "16gb" (no space)
These combine with other components:
- "16gb ram"
- "ram 16gb"
- "processor 16gb"
Port Suffix Rules
For connectivity features (USB, HDMI, DisplayPort), we add port suffixes:
- "usb port"
- "usb ports"
- "2 usb ports"
- "hdmi port"
This matches queries like "mini pc with 2 hdmi ports."
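The port-suffix rule is small enough to sketch directly; the function name and signature here are illustrative, not the production API:

```python
def port_phrases(port_type, count=None):
    """Generate port-suffix phrases for a connectivity feature.

    `port_type` is e.g. "usb" or "hdmi"; `count` is an optional
    numeric value, yielding phrases like "2 usb ports".
    """
    phrases = [f"{port_type} port", f"{port_type} ports"]
    if count is not None:
        phrases.append(f"{count} {port_type} ports")
    return phrases

print(port_phrases("usb", 2))  # ['usb port', 'usb ports', '2 usb ports']
```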
Feature-Specific Rules
Each feature type has custom phrase generation:
Processor Series (i3, i5, N-series):
- Core processor designations generate variants: manufacturer name, core brand, combined forms
- N-series processors (N100, etc.) create similar pattern combinations
- Each generates multiple permutations with "processor"
Main Memory:
- Numeric values generate both spaced and non-spaced GB variants ("16gb", "16 gb")
- Auto-appended with "ram" and "memory" qualifiers ("16gb memory", "16 gb ram")
SSD Storage:
- Numeric values generate GB variants ("512gb", "512 gb")
- Generates TB variants automatically for values ≥ 1024 (e.g., 1024GB → 1TB)
- Auto-appended with "ssd" and "storage" qualifiers ("512gb ssd", "512 gb storage")
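The SSD rules combine unit variants, TB promotion, and qualifiers; a minimal sketch (hypothetical function name, real rules may differ in detail):

```python
def storage_phrases(gb_value):
    """Generate storage phrase variants, promoting >= 1024 GB to TB."""
    variants = [f"{gb_value}gb", f"{gb_value} gb"]
    if gb_value >= 1024 and gb_value % 1024 == 0:
        tb = gb_value // 1024
        variants += [f"{tb}tb", f"{tb} tb"]
    phrases = list(variants)
    # Append "ssd"/"storage" qualifiers to every unit variant.
    for v in variants:
        for qualifier in ("ssd", "storage"):
            phrases.append(f"{v} {qualifier}")
    return phrases

print(storage_phrases(1024))
```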
Form Factor:
- Mini PC → multiple variants including non-spaced form
- All-in-One → "all in one", "aio", abbreviated forms
- Thin Client → variants with/without spacing
- Industrial PC → abbreviated and full forms
Generation:
- Numeric generation values become: "12th gen", "12th generation", "gen 12"
Cores:
- 2 → "dual core"
- 4 → "quad core"
- 6 → "hexa core"
- 8 → "octa core"
- All generate "[n] core processor" variants
Ethernet:
- "1000" → ["gigabit ethernet", "gbe", "1gbps"]
- "2500" → ["2.5gbe", "2.5 gigabit", "2.5gbps"]
Operating System:
- "Windows 11" → ["windows", "windows 11", "win 11"]
- "Ubuntu" → ["linux", "ubuntu"]
- "FreeDOS" → ["freedos", "no os", "without os"]
Whitelist for Standalone Values
Only specific features allow standalone value matching to prevent false matches:
- Series: "i3", "i5", "N100" (but not single letters)
- Processor Model: "N100", "1335U"
- Operating System: "Windows", "Linux", "Ubuntu"
- Generation: "12th", "13th", "14th"
- Processor Brand: "Intel", "ARM"
For example, "2" shouldn't match "2 cores" when the query is "2 hdmi ports." The length check prevents single letters or very short ambiguous values from being matched standalone. Features with explicit unit components, like Ethernet or ports, only generate standalone combinations when qualified with their unit.
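The whitelist plus length check can be sketched as follows; the data structure and function name are hypothetical, and the whitelist entries are taken from the examples above:

```python
STANDALONE_WHITELIST = {
    # Feature -> values allowed to match without a qualifier (sketch).
    "Series": {"i3", "i5", "n100"},
    "Operating System": {"windows", "linux", "ubuntu"},
    "Generation": {"12th", "13th", "14th"},
}

def allow_standalone(feature, value):
    """Permit standalone matching only for whitelisted, non-trivial values."""
    if len(value) < 2:  # single characters are always ambiguous
        return False
    return value.lower() in STANDALONE_WHITELIST.get(feature, set())

print(allow_standalone("Series", "i5"))   # allowed
print(allow_standalone("Cores", "2"))     # not allowed
```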
Collision Prevention
Memory and storage values overlap (both have 64, 128, 256, etc.). Step 4 enforces value-range rules to disambiguate:
- ≤ 64 GB: Main Memory (RAM)
- ≥ 128 GB: SSD Storage
- 65-127 GB range: Skipped (ambiguous)
Phrases with explicit qualifiers ("ram"/"memory" or "ssd"/"storage") override this rule and can match any value.
Additionally, vague general terms that match too many unrelated filters are excluded during collision resolution via post-processing: terms like "connectivity", "audio", "display", "processor", and "physical" create too much ambiguity across multiple facets.
Step 4: Semantic Expansion
N-gram Extraction
We extract frequent phrases from real search queries, ranging from single words (1-grams like "mini") to longer sequences (up to 6-grams like "mini pc with 16gb ram ssd"). Only phrases appearing at least 3 times are retained. This filtering approach reduces noise while preserving genuine user language patterns.
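The extraction step amounts to counting word n-grams and applying a frequency floor; a self-contained sketch (the function name is hypothetical, and real queries would come from search logs):

```python
from collections import Counter

def extract_ngrams(queries, max_n=6, min_count=3):
    """Count word n-grams (1..max_n) across queries; keep frequent ones."""
    counts = Counter()
    for query in queries:
        words = query.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    # Drop rare phrases to reduce noise.
    return {phrase: c for phrase, c in counts.items() if c >= min_count}

queries = ["mini pc 16gb ram", "mini pc ssd", "mini pc 16gb"]
print(extract_ngrams(queries))  # keeps "mini", "pc", "mini pc"; drops rare n-grams
```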
Filter Search Text Generation
For each filter value, we generate search texts:
Main Memory: 16:
- "16gb ram memory"
- "16 main memory"
- "main memory 16"
SSD Storage: 512:
- "512gb storage ssd"
- "512 ssd storage"
- "ssd storage 512"
These search texts represent the filter in embedding space.
Embedding and Matching
We convert both query phrases and filter search texts into embeddings, then compute cosine similarity between them. Phrases with similarity scores above a configured threshold are added to that filter's mapping.
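The matching step reduces to one matrix product when both sides are L2-normalized; a minimal NumPy sketch (the 0.6 threshold is illustrative, not the production value):

```python
import numpy as np

def match_phrases(phrase_vecs, filter_vecs, threshold=0.6):
    """Match phrase embeddings to filter search-text embeddings.

    Rows are L2-normalized so the dot product equals cosine similarity.
    Returns (phrase_index, filter_index, similarity) triples above the
    threshold.
    """
    p = phrase_vecs / np.linalg.norm(phrase_vecs, axis=1, keepdims=True)
    f = filter_vecs / np.linalg.norm(filter_vecs, axis=1, keepdims=True)
    sims = p @ f.T  # shape: (n_phrases, n_filters)
    matches = []
    for i, j in zip(*np.where(sims >= threshold)):
        matches.append((int(i), int(j), float(sims[i, j])))
    return matches
```

Batching the comparison as a single matrix multiply is what lets BLAS-backed NumPy accelerate this step, as noted under performance below.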
Manual Seed Phrases
We inject high-confidence manual seeds before expansion:
"Series:i5": [
  "i5",
  "core i5",
  "intel i5",
  "intel core i5",
  "i5 processor",
  "cpu i5"
]
These seeds represent core phrases we know are correct for each filter. They are always injected with the maximum similarity score of 1.0, anchoring the expansion algorithm as it searches for related phrases with similar meanings.
Incremental Embedding
We cache embeddings for both phrases and filter search texts. When new queries arrive:
- Load existing embeddings
- Embed only new phrases
- Append to cache
This avoids re-embedding unchanged data. See Embedding Strategy for details.
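The cache-and-append flow can be sketched as follows; `embed_fn` stands in for the actual model call (e.g., a sentence-transformers encode), and the JSON cache layout here is illustrative:

```python
import json
from pathlib import Path

def embed_incremental(phrases, cache_path, embed_fn):
    """Embed only phrases missing from the cache, then append them.

    The cache maps phrase -> embedding vector; unchanged phrases are
    never re-embedded.
    """
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}
    new_phrases = [p for p in phrases if p not in cache]
    for phrase, vector in zip(new_phrases, embed_fn(new_phrases)):
        cache[phrase] = vector
    path.write_text(json.dumps(cache))
    return cache
```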
Collision Resolution
After similarity matching completes, we resolve collisions between Memory and Storage filters:
Number-based disambiguation:
- For ambiguous phrases containing numbers: if the value is ≤ 64, keep the phrase only for Memory; if it is > 64, keep it only for Storage
- This prevents "16gb" from matching SSD Storage filters and "512gb" from matching Memory filters
Explicit qualifiers override rules:
- Phrases with "ram", "memory", "cpu", "processor" keywords → Memory only
- Phrases with "ssd", "storage", "disk", "nvme", "drive" keywords → Storage only
- Example: "processor 8gb" maps to Memory despite being numeric
Vague Terms:
- Drop phrases like "connectivity", "audio", "display" that match multiple unrelated filters
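The resolution rules above can be condensed into one decision function; this is a sketch (hypothetical name, keyword sets taken from the lists above), with vague terms checked first so a bare "processor" is dropped rather than routed to Memory:

```python
import re

MEMORY_WORDS = {"ram", "memory", "cpu", "processor"}
STORAGE_WORDS = {"ssd", "storage", "disk", "nvme", "drive"}
VAGUE_TERMS = {"connectivity", "audio", "display", "processor", "physical"}

def resolve_collision(phrase):
    """Route an ambiguous phrase to Memory or Storage, or drop it (None)."""
    if phrase in VAGUE_TERMS:
        return None
    words = set(phrase.split())
    # Explicit qualifiers override the numeric rule.
    if words & MEMORY_WORDS:
        return "memory"
    if words & STORAGE_WORDS:
        return "storage"
    # Number-based disambiguation: <= 64 -> Memory, > 64 -> Storage.
    numbers = [int(n) for n in re.findall(r"\d+", phrase)]
    if numbers:
        return "memory" if max(numbers) <= 64 else "storage"
    return None  # no signal either way

print(resolve_collision("processor 8gb"))  # memory (qualifier wins)
print(resolve_collision("512gb"))          # storage (numeric rule)
```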
Output Format
The final mappings are sorted by similarity (highest first), then length (shortest first):
{
  "Main Memory:16": [
    {"phrase": "16gb ram", "similarity": 1.0},
    ... (remaining entries omitted)
  ]
}
Integration with Filter Extraction
These mappings power the filter extraction algorithm:
- Query arrives: "mini pc with 16gb ram"
- Extract phrases: ["mini pc", "16gb ram", "mini", "pc", "16gb", "ram"]
- Match phrases to filters using mappings
- Return filters:
{"Form Factor": ["Mini PC"], "Main Memory": ["16"]}
See Filter Extraction Algorithm for details.
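The steps above can be sketched as a greedy longest-phrase-first lookup. This is one plausible strategy under assumed data shapes (the real algorithm is described in Filter Extraction Algorithm, and the mapping layout here is simplified):

```python
def extract_filters(query, mappings):
    """Greedy longest-phrase-first lookup against the phrase mappings.

    `mappings` maps phrase -> (facet, value), a simplified view of the
    expanded mapping file.
    """
    words = query.lower().split()
    filters = {}
    used = [False] * len(words)
    # Try longer phrases first so "mini pc" wins over "mini" + "pc".
    for n in range(len(words), 0, -1):
        for i in range(len(words) - n + 1):
            if any(used[i:i + n]):
                continue
            phrase = " ".join(words[i:i + n])
            if phrase in mappings:
                facet, value = mappings[phrase]
                filters.setdefault(facet, []).append(value)
                for j in range(i, i + n):
                    used[j] = True
    return filters

mappings = {
    "mini pc": ("Form Factor", "Mini PC"),
    "16gb ram": ("Main Memory", "16"),
    "16gb": ("Main Memory", "16"),
}
print(extract_filters("mini pc with 16gb ram", mappings))
# {'Form Factor': ['Mini PC'], 'Main Memory': ['16']}
```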
Storage and Distribution
Phrase mappings are persisted as JSON files and used across the system:
- Base mappings file: Generated in Step 3a, contains rule-based phrases
- Expanded mappings file: Generated in Step 4, contains semantic expansion results
The expanded mappings are used by:
- Query page generation: Extract filters from query text
- Search service: Real-time filter extraction API
- Product matching: Filter products by extracted filters
Performance Characteristics
Base Generation (Step 3a):
- Processing time scales with the number of filter values
- Generates a large number of base phrase variants
- Memory usage remains moderate during generation
Semantic Expansion (Step 4):
- Processing time scales with query volume and embedding size
- Output contains significantly more phrases than base generation
- Memory requirements increase due to embeddings storage
The expansion is CPU-bound due to similarity computation. Using NumPy with BLAS acceleration speeds up matrix operations significantly.
Integration with SEO Pipeline
Phrase mapping generation is Steps 3a and 4 in the SEO pipeline:
- Step 0: Embed Source Data - Products, parts, articles
- Step 1: Fetch Queries - GSC, Google Ads, live, Algolia
- Step 2: Combine Queries - Merge all sources
- Step 3a: Generate Base Phrase Mappings ← You are here
- Step 3b: Embed Queries - Convert to vectors
- Step 4: Expand Phrase Mappings ← You are here
- Step 5: Cluster Queries - Group into pages
- Step 6: Match Products - Query-product matching
- Step 7: Build Query Pages - Generate HTML
- Step 8: Generate Related Searches - Find related queries
- Step 11: Migrate to Valkey - Load into search service
See SEO Pipeline Overview for the complete flow.
Why Two Steps?
Base generation (Step 3a) provides:
- Precision: Rule-based phrases match exactly what we expect
- Coverage: Permutations ensure all word orders are covered
- Control: Feature-specific rules handle domain knowledge
Semantic expansion (Step 4) provides:
- Flexibility: Discovers phrases we didn't anticipate
- Real user language: Learns from actual search queries
- Synonyms: Finds equivalent phrases ("16 gigs" for "16gb")
Together, they balance precision and recall.
References
Technical Concepts
- Permutation - Wikipedia
- Cosine Similarity - Wikipedia
- N-gram - Wikipedia
- NumPy - Official documentation
- BLAS - Wikipedia
Model Documentation
- all-mpnet-base-v2 - Hugging Face
- Sentence Transformers - Official documentation
Related Articles
- Embedding Strategy - How we generate embeddings
- Filter Extraction Algorithm - Using mappings to extract filters
- SEO Pipeline Overview - Complete pipeline architecture
- Query Clustering - Grouping similar queries
- Product Matching - Semantic matching
Summary
We generate phrase-to-filter mappings in two steps:
Step 3a (Base Generation):
- Generate all permutations of heading, key, value, unit
- Apply feature-specific rules (processor series, memory, storage, form factor)
- Add port suffixes for connectivity features
- Whitelist standalone values for specific features
- Output base phrase set from rules
Step 4 (Semantic Expansion):
- Extract n-gram phrases from real search queries
- Embed phrases and filter search texts
- Match phrases with similarity scores above threshold
- Inject manual seed phrases to guide expansion
- Resolve memory/storage collisions
- Output expanded and disambiguated phrase set
The result is a comprehensive mapping that handles both expected phrases (via rules) and unexpected variations (via semantic similarity). These mappings power filter extraction across query pages, search, and product matching.