Phrase-to-Filter Mappings: Semantic Search for Product Filters
This article explains how we generate and expand phrase-to-filter mappings that power our filter extraction system. These mappings connect natural language phrases from search queries to structured product filters.
The Problem: Extracting Filters from Natural Language
When users search for "mini pc with 16gb ram," we need to extract:
- Form Factor: Mini PC
- Main Memory: 16
But users express the same intent in many ways:
- "16gb ram mini pc"
- "mini computer 16 gb memory"
- "small pc 16gb"
- "compact desktop with 16 gigs"
We need a system that maps all these phrase variations to the correct filter values.
Two-Step Process
We generate phrase mappings in two steps:
- Step 3a: Generate base phrases from product features using permutations and feature-specific rules
- Step 4: Expand with semantic similarity using embeddings
This hybrid approach combines rule-based precision with semantic flexibility.
Step 3a: Base Phrase Generation
Feature Metadata
Every product feature has metadata in the feature sequence:
- Heading: Category name (e.g., "Processing" for processor features)
- Unit: Measurement unit (e.g., "GB" for memory, "GHz" for frequency)
This metadata guides phrase generation.
Permutation-Based Generation
For each feature value, we generate all permutations of:
- Heading: "processor"
- Feature key: "series"
- Value: "i5"
- Unit: (none for series)
This produces:
- "processor series i5"
- "series processor i5"
- "i5 processor series"
- "i5 series processor"
- "processor i5"
- "series i5"
- "i5"
All permutations ensure we match queries regardless of word order.
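The permutation step can be sketched in Python. This is a simplified illustration with a hypothetical function name; the real generator layers feature-specific rules on top of these permutations and may prune some orderings:

```python
from itertools import permutations

def generate_base_phrases(heading, key, value, unit=None):
    """Generate word-order permutations of a feature's components.

    Every emitted phrase must contain the value, so "processor series"
    alone is never produced.
    """
    components = [c for c in (heading, key, value) if c]
    if unit:
        # Attach the unit directly to the value ("16" + "gb" -> "16gb").
        components[-1] = f"{value}{unit}"
    phrases = set()
    for size in range(1, len(components) + 1):
        for combo in permutations(components, size):
            if any(value in part for part in combo):
                phrases.add(" ".join(combo))
    return sorted(phrases)

print(generate_base_phrases("processor", "series", "i5"))
```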
Value + Unit Combinations
For features with units (memory, storage, frequency), we generate both spaced and non-spaced variants:
- "16 gb" (spaced)
- "16gb" (no space)
These combine with other components:
- "16gb ram"
- "ram 16gb"
- "processor 16gb"
Port Suffix Rules
For connectivity features (USB, HDMI, DisplayPort), we add port suffixes:
- "usb port"
- "usb ports"
- "2 usb ports"
- "hdmi port"
This matches queries like "mini pc with 2 hdmi ports."
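The port-suffix rule is small enough to sketch directly; the function name and signature here are illustrative, not the production API:

```python
def port_phrases(port_type, count=None):
    """Generate port-suffix phrases for a connectivity feature.

    `port_type` is e.g. "usb" or "hdmi"; `count` is an optional
    numeric value, yielding phrases like "2 usb ports".
    """
    phrases = [f"{port_type} port", f"{port_type} ports"]
    if count is not None:
        phrases.append(f"{count} {port_type} ports")
    return phrases

print(port_phrases("usb", 2))  # ['usb port', 'usb ports', '2 usb ports']
```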
Feature-Specific Rules
Each feature type has custom phrase generation:
Processor Series (i3, i5, N-series):
- Core processor designations generate variants: manufacturer name, core brand, combined forms
- N-series processors (N100, etc.) create similar pattern combinations
- Each generates multiple permutations with "processor"
Main Memory:
- Numeric values generate both spaced and non-spaced GB variants ("16gb", "16 gb")
- Auto-appended with "ram" and "memory" qualifiers ("16gb memory", "16 gb ram")
SSD Storage:
- Numeric values generate GB variants ("512gb", "512 gb")
- Generates TB variants automatically for values ≥ 1024 (e.g., 1024GB → 1TB)
- Auto-appended with "ssd" and "storage" qualifiers ("512gb ssd", "512 gb storage")
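The SSD rules combine unit variants, TB promotion, and qualifiers; a minimal sketch (hypothetical function name, real rules may differ in detail):

```python
def storage_phrases(gb_value):
    """Generate storage phrase variants, promoting >= 1024 GB to TB."""
    variants = [f"{gb_value}gb", f"{gb_value} gb"]
    if gb_value >= 1024 and gb_value % 1024 == 0:
        tb = gb_value // 1024
        variants += [f"{tb}tb", f"{tb} tb"]
    phrases = list(variants)
    # Append "ssd"/"storage" qualifiers to every unit variant.
    for v in variants:
        for qualifier in ("ssd", "storage"):
            phrases.append(f"{v} {qualifier}")
    return phrases

print(storage_phrases(1024))
```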
Form Factor:
- Mini PC → multiple variants including non-spaced form
- All-in-One → "all in one", "aio", abbreviated forms
- Thin Client → variants with/without spacing
- Industrial PC → abbreviated and full forms
Generation:
- Numeric generation values become: "12th gen", "12th generation", "gen 12"
Cores:
- 2 → "dual core"
- 4 → "quad core"
- 6 → "hexa core"
- 8 → "octa core"
- All generate "[n] core processor" variants
Ethernet:
- "1000" → ["gigabit ethernet", "gbe", "1gbps"]
- "2500" → ["2.5gbe", "2.5 gigabit", "2.5gbps"]
Operating System:
- "Windows 11" → ["windows", "windows 11", "win 11"]
- "Ubuntu" → ["linux", "ubuntu"]
- "FreeDOS" → ["freedos", "no os", "without os"]
Whitelist for Standalone Values
Only specific features allow standalone value matching to prevent false matches:
- Series: "i3", "i5", "N100" (but not single letters)
- Processor Model: "N100", "1335U"
- Operating System: "Windows", "Linux", "Ubuntu"
- Generation: "12th", "13th", "14th"
- Processor Brand: "Intel", "ARM"
For example, "2" shouldn't match "2 cores" when the query is "2 hdmi ports." The length check prevents single letters or very short ambiguous values from being matched standalone. Features with explicit unit components, like Ethernet or ports, only generate standalone combinations when qualified with their unit.
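The whitelist plus length check can be sketched as follows; the data structure and function name are hypothetical, and the whitelist entries are taken from the examples above:

```python
STANDALONE_WHITELIST = {
    # Feature -> values allowed to match without a qualifier (sketch).
    "Series": {"i3", "i5", "n100"},
    "Operating System": {"windows", "linux", "ubuntu"},
    "Generation": {"12th", "13th", "14th"},
}

def allow_standalone(feature, value):
    """Permit standalone matching only for whitelisted, non-trivial values."""
    if len(value) < 2:  # single characters are always ambiguous
        return False
    return value.lower() in STANDALONE_WHITELIST.get(feature, set())

print(allow_standalone("Series", "i5"))   # allowed
print(allow_standalone("Cores", "2"))     # not allowed
```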
Collision Prevention
Memory and storage values overlap (both have 64, 128, 256, etc.). Step 4 enforces value-range rules to disambiguate:
- ≤ 64 GB: Main Memory (RAM)
- ≥ 128 GB: SSD Storage
- 65-127 GB range: Skipped (ambiguous)
Phrases with explicit qualifiers ("ram"/"memory" or "ssd"/"storage") override this rule and can match any value.
Additionally, vague general terms that match too many unrelated filters are excluded during collision resolution via post-processing: terms like "connectivity", "audio", "display", "processor", and "physical" create too much ambiguity across multiple facets.
Step 4: Semantic Expansion
N-gram Extraction
We extract frequent phrases from real search queries, ranging from single words (1-grams like "mini") to longer sequences (up to 6-grams like "mini pc with 16gb ram ssd"). Only phrases appearing at least 3 times are retained. This filtering approach reduces noise while preserving genuine user language patterns.
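The extraction step amounts to counting word n-grams and applying a frequency floor; a self-contained sketch (the function name is hypothetical, and real queries would come from search logs):

```python
from collections import Counter

def extract_ngrams(queries, max_n=6, min_count=3):
    """Count word n-grams (1..max_n) across queries; keep frequent ones."""
    counts = Counter()
    for query in queries:
        words = query.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    # Drop rare phrases to reduce noise.
    return {phrase: c for phrase, c in counts.items() if c >= min_count}

queries = ["mini pc 16gb ram", "mini pc ssd", "mini pc 16gb"]
print(extract_ngrams(queries))  # keeps "mini", "pc", "mini pc"; drops rare n-grams
```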
Filter Search Text Generation
For each filter value, we generate search texts:
Main Memory: 16:
- "16gb ram memory"
- "16 main memory"
- "main memory 16"
SSD Storage: 512:
- "512gb storage ssd"
- "512 ssd storage"
- "ssd storage 512"
These search texts represent the filter in embedding space.
Embedding and Matching
We convert both query phrases and filter search texts into embeddings, then compute cosine similarity between them. Phrases with similarity scores above a configured threshold are added to that filter's mapping.
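The matching step reduces to one matrix product when both sides are L2-normalized; a minimal NumPy sketch (the 0.6 threshold is illustrative, not the production value):

```python
import numpy as np

def match_phrases(phrase_vecs, filter_vecs, threshold=0.6):
    """Match phrase embeddings to filter search-text embeddings.

    Rows are L2-normalized so the dot product equals cosine similarity.
    Returns (phrase_index, filter_index, similarity) triples above the
    threshold.
    """
    p = phrase_vecs / np.linalg.norm(phrase_vecs, axis=1, keepdims=True)
    f = filter_vecs / np.linalg.norm(filter_vecs, axis=1, keepdims=True)
    sims = p @ f.T  # shape: (n_phrases, n_filters)
    matches = []
    for i, j in zip(*np.where(sims >= threshold)):
        matches.append((int(i), int(j), float(sims[i, j])))
    return matches
```

Batching the comparison as a single matrix multiply is what lets BLAS-backed NumPy accelerate this step, as noted under performance below.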
Manual Seed Phrases
We inject high-confidence manual seeds before expansion:
"Series:i5": [
  "i5",
  "core i5",
  "intel i5",
  "intel core i5",
  "i5 processor",
  "cpu i5"
]
These seeds represent core phrases we know are correct for each filter. They are always injected with the maximum similarity score of 1.0, anchoring the expansion algorithm as it searches for related phrases with similar meanings.
Incremental Embedding
We cache embeddings for both phrases and filter search texts. When new queries arrive:
- Load existing embeddings
- Embed only new phrases
- Append to cache
This avoids re-embedding unchanged data. See Embedding Strategy for details.
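The cache-and-append flow can be sketched as follows; `embed_fn` stands in for the actual model call (e.g., a sentence-transformers encode), and the JSON cache layout here is illustrative:

```python
import json
from pathlib import Path

def embed_incremental(phrases, cache_path, embed_fn):
    """Embed only phrases missing from the cache, then append them.

    The cache maps phrase -> embedding vector; unchanged phrases are
    never re-embedded.
    """
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}
    new_phrases = [p for p in phrases if p not in cache]
    for phrase, vector in zip(new_phrases, embed_fn(new_phrases)):
        cache[phrase] = vector
    path.write_text(json.dumps(cache))
    return cache
```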
Collision Resolution
After similarity matching completes, we resolve collisions between Memory and Storage filters:
Number-based disambiguation:
- For ambiguous phrases containing numbers: if the value is ≤ 64, keep the phrase only for Memory; if it is > 64, keep it only for Storage
- This prevents "16gb" from matching SSD Storage filters and "512gb" from matching Memory filters
Explicit qualifiers override rules:
- Phrases with "ram", "memory", "cpu", "processor" keywords → Memory only
- Phrases with "ssd", "storage", "disk", "nvme", "drive" keywords → Storage only
- Example: "processor 8gb" maps to Memory despite being numeric
Vague Terms:
- Drop phrases like "connectivity", "audio", "display" that match multiple unrelated filters
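The resolution rules above can be condensed into one decision function; this is a sketch (hypothetical name, keyword sets taken from the lists above), with vague terms checked first so a bare "processor" is dropped rather than routed to Memory:

```python
import re

MEMORY_WORDS = {"ram", "memory", "cpu", "processor"}
STORAGE_WORDS = {"ssd", "storage", "disk", "nvme", "drive"}
VAGUE_TERMS = {"connectivity", "audio", "display", "processor", "physical"}

def resolve_collision(phrase):
    """Route an ambiguous phrase to Memory or Storage, or drop it (None)."""
    if phrase in VAGUE_TERMS:
        return None
    words = set(phrase.split())
    # Explicit qualifiers override the numeric rule.
    if words & MEMORY_WORDS:
        return "memory"
    if words & STORAGE_WORDS:
        return "storage"
    # Number-based disambiguation: <= 64 -> Memory, > 64 -> Storage.
    numbers = [int(n) for n in re.findall(r"\d+", phrase)]
    if numbers:
        return "memory" if max(numbers) <= 64 else "storage"
    return None  # no signal either way

print(resolve_collision("processor 8gb"))  # memory (qualifier wins)
print(resolve_collision("512gb"))          # storage (numeric rule)
```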
Output Format
The final mappings are sorted by similarity (highest first), then length (shortest first):
{
  "Main Memory:16": [
    {"phrase": "16gb ram", "similarity": 1.0},
    ... (remaining entries omitted)
  ]
}
Integration with Filter Extraction
These mappings power the filter extraction algorithm:
- Query arrives: "mini pc with 16gb ram"
- Extract phrases: ["mini pc", "16gb ram", "mini", "pc", "16gb", "ram"]
- Match phrases to filters using mappings
- Return filters:
{"Form Factor": ["Mini PC"], "Main Memory": ["16"]}
See Filter Extraction Algorithm for details.
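The steps above can be sketched as a greedy longest-phrase-first lookup. This is one plausible strategy under assumed data shapes (the real algorithm is described in Filter Extraction Algorithm, and the mapping layout here is simplified):

```python
def extract_filters(query, mappings):
    """Greedy longest-phrase-first lookup against the phrase mappings.

    `mappings` maps phrase -> (facet, value), a simplified view of the
    expanded mapping file.
    """
    words = query.lower().split()
    filters = {}
    used = [False] * len(words)
    # Try longer phrases first so "mini pc" wins over "mini" + "pc".
    for n in range(len(words), 0, -1):
        for i in range(len(words) - n + 1):
            if any(used[i:i + n]):
                continue
            phrase = " ".join(words[i:i + n])
            if phrase in mappings:
                facet, value = mappings[phrase]
                filters.setdefault(facet, []).append(value)
                for j in range(i, i + n):
                    used[j] = True
    return filters

mappings = {
    "mini pc": ("Form Factor", "Mini PC"),
    "16gb ram": ("Main Memory", "16"),
    "16gb": ("Main Memory", "16"),
}
print(extract_filters("mini pc with 16gb ram", mappings))
# {'Form Factor': ['Mini PC'], 'Main Memory': ['16']}
```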
Storage and Distribution
Phrase mappings are persisted as JSON files and used across the system:
- Base mappings file: Generated in Step 3a, contains rule-based phrases
- Expanded mappings file: Generated in Step 4, contains semantic expansion results
The expanded mappings are used by:
- Query page generation: Extract filters from query text
- Search service: Real-time filter extraction API
- Product matching: Filter products by extracted filters
Performance Characteristics
Base Generation (Step 3a):
- Processing time scales with the number of filter values
- Generates a large number of base phrase variants
- Memory usage remains moderate during generation
Semantic Expansion (Step 4):
- Processing time scales with query volume and embedding size
- Output contains significantly more phrases than base generation
- Memory requirements increase due to embeddings storage
The expansion is CPU-bound due to similarity computation. Using NumPy with BLAS acceleration speeds up matrix operations significantly.
Integration with SEO Pipeline
Phrase mapping generation is Steps 3a and 4 in the SEO pipeline:
- Step 0: Embed Source Data - Products, parts, articles
- Step 1: Fetch Queries - GSC, Google Ads, live, Algolia
- Step 2: Combine Queries - Merge all sources
- Step 3a: Generate Base Phrase Mappings ← You are here
- Step 3b: Embed Queries - Convert to vectors
- Step 4: Expand Phrase Mappings ← You are here
- Step 5: Cluster Queries - Group into pages
- Step 6: Match Products - Query-product matching
- Step 7: Build Query Pages - Generate HTML
- Step 8: Generate Related Searches - Find related queries
- Step 11: Migrate to Valkey - Load into search service
See SEO Pipeline Overview for the complete flow.
Why Two Steps?
Base generation (Step 3a) provides:
- Precision: Rule-based phrases match exactly what we expect
- Coverage: Permutations ensure all word orders are covered
- Control: Feature-specific rules handle domain knowledge
Semantic expansion (Step 4) provides:
- Flexibility: Discovers phrases we didn't anticipate
- Real user language: Learns from actual search queries
- Synonyms: Finds equivalent phrases ("16 gigs" for "16gb")
Together, they balance precision and recall.
References
Technical Concepts
- Permutation - Wikipedia
- Cosine Similarity - Wikipedia
- N-gram - Wikipedia
- NumPy - Official documentation
- BLAS - Wikipedia
Model Documentation
- all-mpnet-base-v2 - Hugging Face
- Sentence Transformers - Official documentation
Related Articles
- Embedding Strategy - How we generate embeddings
- Filter Extraction Algorithm - Using mappings to extract filters
- SEO Pipeline Overview - Complete pipeline architecture
- Query Clustering - Grouping similar queries
- Product Matching - Semantic matching
Summary
We generate phrase-to-filter mappings in two steps:
Step 3a (Base Generation):
- Generate all permutations of heading, key, value, unit
- Apply feature-specific rules (processor series, memory, storage, form factor)
- Add port suffixes for connectivity features
- Whitelist standalone values for specific features
- Output base phrase set from rules
Step 4 (Semantic Expansion):
- Extract n-gram phrases from real search queries
- Embed phrases and filter search texts
- Match phrases with similarity scores above threshold
- Inject manual seed phrases to guide expansion
- Resolve memory/storage collisions
- Output expanded and disambiguated phrase set
The result is a comprehensive mapping that handles both expected phrases (via rules) and unexpected variations (via semantic similarity). These mappings power filter extraction across query pages, search, and product matching.