Filter Extraction: Regex Word Boundary Matching

This article explains how we extract structured product filters from natural language search queries using word boundary regex matching against phrase-to-filter mappings.

The Problem: From Natural Language to Structured Filters

When a user searches for "mini pc with 16gb ram," we need to extract:

{
  "Form Factor": "Mini PC",
  "Main Memory": "16"
}

These structured filters enable:

Product filtering: Show only matching products
Faceted navigation: Display available filter options
Query page generation: Create SEO-optimized pages
Related searches: Find similar queries

The challenge is handling all the ways users express the same intent.

The Algorithm: Word Boundary Regex Matching

Step 1: Load Phrase Mappings

We load the phrase-to-filter mappings generated by the SEO pipeline. These mappings associate each filter key and value with the phrases that express it.

Because extraction needs to go from phrase to filter, we invert this structure into a dictionary keyed by phrase:

phrase_to_filter = {
    "16gb ram": ("Main Memory", "16"),
    "16 gb ram": ("Main Memory", "16"),
    "mini pc": ("Form Factor", "Mini PC"),
    "mini computer": ("Form Factor", "Mini PC")
}
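A minimal sketch of the inversion, assuming the pipeline output is shaped as filter key → filter value → list of phrases. That layout and the `invert_mappings` name are illustrative assumptions, not the pipeline's documented format:

```python
def invert_mappings(filter_to_phrases):
    """Turn {filter_key: {filter_value: [phrases]}} into {phrase: (key, value)}."""
    phrase_to_filter = {}
    for filter_key, values in filter_to_phrases.items():
        for filter_value, phrases in values.items():
            for phrase in phrases:
                # Lowercase so lookups line up with the normalized query.
                phrase_to_filter[phrase.lower()] = (filter_key, filter_value)
    return phrase_to_filter


# Hypothetical pipeline output, keyed by filter rather than by phrase.
filter_to_phrases = {
    "Main Memory": {"16": ["16gb ram", "16 gb ram"]},
    "Form Factor": {"Mini PC": ["mini pc", "mini computer"]},
}
phrase_to_filter = invert_mappings(filter_to_phrases)
print(phrase_to_filter["mini computer"])  # ('Form Factor', 'Mini PC')
```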

Step 2: Normalize Query

Convert the query to lowercase for case-insensitive matching.

Step 3: Match Phrases with Word Boundaries

For each phrase in the mappings, we check if it appears in the query using word boundary anchors in regular expressions.

The \b anchors ensure we match complete words, not substrings: "mini pc" matches "mini pc with ram" but does NOT match "minipc" (no space), and "16gb" matches "16gb ram" but does NOT match "216gb" (partial match).
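The check for a single phrase can be sketched as follows. The helper name `phrase_in_query` is hypothetical; `re.search()` and `re.escape()` are the standard-library calls referenced later in this article:

```python
import re


def phrase_in_query(phrase, query):
    """Return True if `phrase` occurs in `query` delimited by word boundaries."""
    # re.escape keeps characters such as "." or "+" in a phrase literal.
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, query.lower()) is not None


print(phrase_in_query("mini pc", "Mini PC with RAM"))  # True
print(phrase_in_query("mini pc", "minipc"))            # False
print(phrase_in_query("16gb", "216gb"))                # False
```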

Step 4: Collect All Matches

We iterate through all phrases and collect matching filters. For "mini pc with 16gb ram", this produces filters like {"Form Factor": "Mini PC", "Main Memory": "16"}.
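Putting steps 2 to 4 together, a sketch of the collection loop might look like this. The function name and the two-argument signature are assumptions for illustration; the production function may differ:

```python
import re


def extract_filters(query, phrase_to_filter):
    """Collect every filter whose phrase appears in the query."""
    normalized = query.lower()
    filters = {}
    for phrase, (key, value) in phrase_to_filter.items():
        if re.search(r"\b" + re.escape(phrase) + r"\b", normalized):
            # A later match for the same key overwrites an earlier one.
            filters[key] = value
    return filters


phrase_to_filter = {
    "16gb ram": ("Main Memory", "16"),
    "16 gb ram": ("Main Memory", "16"),
    "mini pc": ("Form Factor", "Mini PC"),
    "mini computer": ("Form Factor", "Mini PC"),
}
result = extract_filters("Mini PC with 16GB RAM", phrase_to_filter)
```

Here `result` holds one entry per matched filter key, regardless of how many phrases matched it.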

Why Word Boundaries?

Word boundaries prevent false matches:

Without word boundaries:

"i5" would match "i5000" (wrong)
"ram" would match "program" (wrong)
"pc" would match "pcie" (wrong)

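With word boundaries:

"i5" matches "i5 processor" ✅ but does NOT match "i5000" ❌
"ram" matches "16gb ram" ✅ but does NOT match "program" ❌

The \b anchor matches only at a position where a word character (letter, digit, or underscore) is adjacent to a non-word character or to the start or end of the string.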

Handling Multiple Matches

If multiple phrases map to the same filter value, the result contains that filter only once:

# Query: "mini pc small computer"
# Both "mini pc" and "small computer" map to "Form Factor:Mini PC"
# Result: {"Form Factor": "Mini PC"} (deduplicated)

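If multiple phrases match different values for the same filter, the last match wins. "Last" refers to the order in which phrases are checked, not to the position of the phrase in the query: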

# Query: "8gb 16gb ram"
# "8gb" → Main Memory:8
# "16gb" → Main Memory:16
# Result: {"Main Memory": "16"} (last match wins)

In practice, users rarely specify conflicting values, so this rule seldom changes the outcome.
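The following sketch illustrates the overwrite behavior under the assumption that phrases are checked in dictionary insertion order. The `extract` helper and the two-entry mapping are hypothetical:

```python
import re

# Insertion order decides which phrase is checked last.
phrase_to_filter = {
    "8gb": ("Main Memory", "8"),
    "16gb": ("Main Memory", "16"),
}


def extract(query):
    filters = {}
    for phrase, (key, value) in phrase_to_filter.items():
        if re.search(r"\b" + re.escape(phrase) + r"\b", query.lower()):
            filters[key] = value  # overwrites any earlier value for this key
    return filters


print(extract("8gb 16gb ram"))  # {'Main Memory': '16'}
print(extract("16gb 8gb ram"))  # {'Main Memory': '16'}
```

Both queries yield the same value because "16gb" is checked after "8gb" in the mapping, whatever the word order in the query.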

Phrase Priority

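Phrases are matched in the order they appear in the mappings, which are sorted by similarity (highest first) and then by length (shortest first). Because we iterate through all phrases rather than stopping at the first match, order does not matter when the matching phrases agree on a value. It matters only when two matching phrases assign different values to the same filter: the phrase checked later overwrites the earlier one, so the lower-ranked phrase wins.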
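The sort order described here can be sketched with a composite sort key. The tuple layout and the similarity scores below are made up for illustration:

```python
# Hypothetical entries: (phrase, similarity score, filter key, filter value)
entries = [
    ("mini computer", 0.82, "Form Factor", "Mini PC"),
    ("mini pc", 0.95, "Form Factor", "Mini PC"),
    ("small pc", 0.82, "Form Factor", "Mini PC"),
]

# Highest similarity first; among equal scores, shortest phrase first.
ordered = sorted(entries, key=lambda e: (-e[1], len(e[0])))
print([e[0] for e in ordered])  # ['mini pc', 'small pc', 'mini computer']
```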

Performance Optimization

Caching

The phrase-to-filter mapping is loaded once at startup and cached in memory, avoiding repeated disk I/O on every request.
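One way to get load-once behavior in Python is `functools.lru_cache`. This is a sketch only: it loads lazily on first use rather than at startup, and the function name and the phrase → [key, value] file layout are assumptions:

```python
import json
from functools import lru_cache


@lru_cache(maxsize=1)
def load_phrase_mappings(path):
    """Read the mappings file once; later calls with the same path hit the cache."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    # JSON has no tuples, so each value arrives as a [key, value] list.
    return {phrase: tuple(pair) for phrase, pair in raw.items()}
```

Repeated calls with the same path return the same in-memory dictionary without touching the disk again.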

Valkey Fallback

We try to load mappings from Valkey (a fork of Redis) first and fall back to JSON files on a miss:

Check Valkey for fast in-memory lookup
Load from JSON if Valkey miss
Store in Valkey with cache expiration

This reduces latency for subsequent requests.
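The three steps above can be sketched as a cache-aside loader. The `client` argument is assumed to expose `get(key)` and `setex(key, seconds, value)` in the style of redis-py compatible clients; the key name and expiration are hypothetical:

```python
import json

CACHE_KEY = "phrase_to_filter"  # hypothetical key name
CACHE_TTL_SECONDS = 3600        # hypothetical expiration


def load_mappings(client, json_path):
    """Try the cache first; on a miss, read the JSON file and repopulate the cache."""
    cached = client.get(CACHE_KEY)
    if cached is not None:
        return json.loads(cached)
    with open(json_path, encoding="utf-8") as f:
        mappings = json.load(f)
    # Store the serialized mappings with an expiration so stale data ages out.
    client.setex(CACHE_KEY, CACHE_TTL_SECONDS, json.dumps(mappings))
    return mappings
```

Passing the client in as a parameter keeps the function testable with an in-memory stand-in.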

Regex Matching

Patterns use re.search() with word boundary anchors to efficiently match complete words without false partial matches.
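This article does not say whether patterns are precompiled. As one possible refinement, the sketch below builds each pattern once with `re.compile()` and reuses it per query; the function names are hypothetical:

```python
import re


def compile_patterns(phrase_to_filter):
    """Precompile one word-boundary pattern per phrase, preserving mapping order."""
    return [
        (re.compile(r"\b" + re.escape(phrase) + r"\b"), key, value)
        for phrase, (key, value) in phrase_to_filter.items()
    ]


def extract_filters(query, compiled):
    normalized = query.lower()
    # Later entries overwrite earlier ones for the same key (last match wins).
    return {key: value for pattern, key, value in compiled if pattern.search(normalized)}


compiled = compile_patterns({
    "16gb ram": ("Main Memory", "16"),
    "mini pc": ("Form Factor", "Mini PC"),
})
filters = extract_filters("Mini PC 16GB RAM", compiled)
```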

Integration with Search Service

Filter extraction is implemented as an API service endpoint that accepts search queries and returns extracted filters.

The service:

Accepts a search query as input
Loads phrase-to-filter mappings into memory
Normalizes and matches phrases with word boundaries
Returns structured filter key-value pairs

The main web server calls this service to extract filters from user queries:

filters = extract_filters_from_query("mini pc 16gb ram")
# Returns: {"Form Factor": "Mini PC", "Main Memory": "16"}

This separation allows:

Independent scaling: Search service can run on a separate server
Caching isolation: Search service manages its own cache
Service restart: Search service can restart without affecting main web server

See Search Service Architecture for details.

Query Logging

Every filter extraction is logged for SEO pipeline consumption. These logs feed back into the SEO pipeline for discovering new query patterns and improving phrase mappings over time.

Use Cases

Query Pages (/q/)

Query pages extract filters from the URL slug to determine which products to display.

Search API (/api/search)

The search API extracts filters from the search query to find and return matching products.

Autocomplete

Autocomplete suggestions extract filters to provide filter previews alongside search suggestions.

Error Handling

Missing Mappings

If phrase mappings are not loaded, we return empty filters. This prevents crashes when the SEO pipeline hasn't run yet.

Invalid Queries

Empty or whitespace-only queries return empty filters.

Regex Errors

We use regex escaping to sanitize phrases before regex matching, preventing syntax errors from special characters in phrases.
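A short illustration of why escaping matters, using made-up phrases. Without `re.escape()`, a "." in a phrase matches any character, and an unbalanced "(" makes the pattern invalid:

```python
import re

phrase = "3.5 inch"
unescaped = r"\b" + phrase + r"\b"
escaped = r"\b" + re.escape(phrase) + r"\b"

print(bool(re.search(unescaped, "375 inch drive")))  # True: "." matched "7"
print(bool(re.search(escaped, "375 inch drive")))    # False
print(bool(re.search(escaped, "3.5 inch drive")))    # True

# An unbalanced parenthesis is a syntax error unless the phrase is escaped.
try:
    re.compile(r"\b" + "usb (type c" + r"\b")
    compiled_ok = True
except re.error:
    compiled_ok = False
print(compiled_ok)  # False
```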

Performance Characteristics

Filter extraction stays cheap mainly because the mappings are cached; the remaining per-query cost is one regex pass over the phrase list:

Mapping load happens once at startup
Per-query extraction is CPU-bound (regex matching)
Memory usage for caches remains manageable
Cache hit rates are high in production

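Word boundary regex is not faster than a plain substring search; it does slightly more work per phrase. We use it for correctness, because the boundary checks reject partial-word matches that a substring search would accept. Per-query cost grows linearly with the number of phrases in the mappings.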

Limitations

Phrase Order Dependency

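The result depends on the order in which phrases are checked. If two overlapping phrases both match and map to different values for the same filter, the one checked last wins:

# Query: "mini pc"
# Phrases: ["mini", "mini pc"]
# If "mini" is checked last and maps to the same filter key, it overwrites the value from "mini pc"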

In practice, this rarely causes problems because:

Among phrases with equal similarity, shorter phrases are checked first, so the longer, more specific phrase is checked later and wins
Overlapping phrases usually map to the same filter value

No Phrase Combination

We don't combine multiple phrases into a single filter value:

# Query: "dual core quad core"
# Result: {"Cores": "4"} (last match wins)
# NOT: {"Cores": ["2", "4"]} (multiple values)

This is intentional: users rarely specify multiple values for the same filter.

No Negation

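We don't handle negation (e.g., "mini pc without Windows"). Such a query is processed like any other, so a mapped phrase that follows "without" still produces a positive filter. Negation is rare in search queries, so it's not currently supported.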

References

Technical Concepts

Regular Expression - Wikipedia
Word Boundary - Wikipedia
Valkey - Official website
Redis - Official website (the project Valkey was forked from)

Python Documentation

re.search() - Python docs
re.escape() - Python docs

Phrase-to-Filter Mappings - How mappings are generated
Search Service Architecture - Standalone search service
Query Pages vs Search Pages - Different page types
Related Search Generation - Using filters for related queries
SEO Pipeline Overview - Complete pipeline architecture

Summary

We extract filters from search queries using word boundary regex matching:

Load mappings: Phrase-to-filter mappings from SEO pipeline
Normalize query: Convert to lowercase
Match phrases: Use \b word boundaries to match complete words
Collect filters: Build dictionary of filter key-value pairs
Cache aggressively: In-memory cache + Valkey fallback
Log queries: Feed back into SEO pipeline

The algorithm is simple, efficient, and handles all phrase variations generated by the SEO pipeline. Word boundaries prevent false matches while allowing flexible phrase matching. The result is robust filter extraction that powers query pages, search, autocomplete, and related searches.
