Feature Extraction: Aggregating Component Properties
This article explains how we extract features from each component of a SKU and aggregate them into a complete product specification.
The Problem: Computing Product Features
A product SKU like Treo-N100-8-256-2H-W6-11P is composed of 7 parts:
- Chassis: Treo
- Board: N100 (Intel N100 processor)
- RAM: 8GB
- Flash: 256GB
- Adapter: 2H (2 HDMI ports)
- WiFi: W6 (WiFi 6)
- OS: 11P (Windows 11 Pro)
Each part provides features. We need to aggregate all features to show complete product specifications.
The Solution: Recursive Feature Aggregation
We recursively traverse the SKU structure, collecting features from each part and its constituents.
Step 1: Split SKU into Parts
part_ids = sku.split("-") # ['Treo', 'N100', '8', '256', '2H', 'W6', '11P']
categories = [Categories.CHASSIS, Categories.BOARD, Categories.RAM, ...]
parts = list(zip(categories, part_ids))
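The split-and-zip step can be sketched as a small self-contained function. This is only an illustration: the `Categories` members beyond CHASSIS, BOARD, and RAM, and their order, are assumptions inferred from the seven parts listed earlier, and `split_sku` is a hypothetical helper name.

```python
from enum import Enum


class Categories(Enum):
    # Assumed member names; only CHASSIS, BOARD and RAM appear in the article.
    CHASSIS = "CHASSIS"
    BOARD = "BOARD"
    RAM = "RAM"
    FLASH = "FLASH"
    ADAPTER = "ADAPTER"
    WIFI = "WIFI"
    OS = "OS"


# Positional order of categories within a SKU (assumed from the example SKU).
SKU_CATEGORIES = list(Categories)


def split_sku(sku: str) -> list[tuple[Categories, str]]:
    """Pair each hyphen-separated part ID with its positional category."""
    part_ids = sku.split("-")
    if len(part_ids) != len(SKU_CATEGORIES):
        raise ValueError(
            f"expected {len(SKU_CATEGORIES)} parts, got {len(part_ids)}: {sku!r}"
        )
    return list(zip(SKU_CATEGORIES, part_ids))


parts = split_sku("Treo-N100-8-256-2H-W6-11P")
```

Checking the part count up front means a malformed SKU fails loudly instead of being silently truncated by `zip`.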
Step 2: Find Features for Each Part
For each part, we look up its data in productdb.json:
def find_features(category, partid):
    part_data = productdb.get(category, {}).get(partid, {})
    features = {}
    # ... (implementation details omitted)
Step 3: Aggregate All Features
all_features = {}
for category, part_id in parts:
    all_features.update(find_features(category, part_id))
Step 4: Format and Group by Heading
Features are grouped by heading using feature_sequence:
feature_sequence = {
    "Processor Model": {"Unit": "", "Heading": "Processing"},
    "Cores": {"Unit": "", "Heading": "Processing"},
    # ... (implementation details omitted)
For each feature in feature_sequence that is also present in all_features, we:
- Look up its heading in feature_sequence
- Format the value with its unit (unless in raw mode)
- Group it under that heading
features = {}
for feature, details in feature_sequence.items():
    if feature in all_features:
        # ... (implementation details omitted)
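Since the loop body is omitted above, here is a minimal sketch of what formatting and grouping could look like. The function name `group_features` is hypothetical, and the "GHz" and "GB" units are assumed from the formatted output shown later in this article; the real implementation may differ.

```python
def group_features(all_features, feature_sequence, raw=False):
    """Group aggregated features under their headings, in feature_sequence order."""
    grouped = {}
    for feature, details in feature_sequence.items():
        if feature not in all_features:
            continue
        value = str(all_features[feature])
        unit = details.get("Unit", "")
        # Append the unit only in formatted mode and only when one is defined.
        if unit and not raw:
            value = f"{value} {unit}"
        grouped.setdefault(details["Heading"], {})[feature] = value
    return grouped


# Illustrative subset of feature_sequence (units assumed).
feature_sequence = {
    "Processor Model": {"Unit": "", "Heading": "Processing"},
    "Cores": {"Unit": "", "Heading": "Processing"},
    "Max Frequency": {"Unit": "GHz", "Heading": "Processing"},
    "Main Memory": {"Unit": "GB", "Heading": "Processing"},
}
all_features = {"Cores": 4, "Max Frequency": 3.4, "Main Memory": 8}

formatted = group_features(all_features, feature_sequence)
raw = group_features(all_features, feature_sequence, raw=True)
```

Iterating over `feature_sequence` rather than `all_features` means the display order is fixed by the sequence definition, regardless of the order in which parts contributed their features.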
Step 5: Add Computed Features
Some features are computed, not provided by parts:
Weight: Net and gross weight from dimensions
weight_dims = get_from_sku_weight_dimensions(sku)
if weight_dims["net_weight"] and weight_dims["gross_weight"]:
    net_kg = weight_dims["net_weight"] / 1000
    gross_kg = weight_dims["gross_weight"] / 1000
    features["Physical"]["Net and Gross Weight"] = f"{net_kg:.2f}kg, {gross_kg:.2f}kg"
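The weight step can also be written as a standalone function that tolerates a missing "Physical" group. This is a sketch under assumptions: `add_weight` is a hypothetical name, and the stored weights are assumed to be in grams because the snippet above divides by 1000 to obtain kilograms.

```python
def add_weight(features, weight_dims):
    """Add a combined net/gross weight entry under the "Physical" heading."""
    net = weight_dims.get("net_weight")
    gross = weight_dims.get("gross_weight")
    # Skip the entry entirely when either weight is missing or zero.
    if net and gross:
        net_kg = net / 1000      # assumed grams -> kilograms
        gross_kg = gross / 1000
        physical = features.setdefault("Physical", {})
        physical["Net and Gross Weight"] = f"{net_kg:.2f}kg, {gross_kg:.2f}kg"
    return features


features = add_weight({}, {"net_weight": 1250, "gross_weight": 1800})
```

Using `setdefault` avoids a `KeyError` when no part contributed a "Physical" feature before the computed weight is added.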
Feature Sequence Structure
The feature_sequence dictionary defines:
- Feature name: Key in productdb "provides" field
- Unit: Appended to value (e.g., "GHz", "GB", "Watts")
- Heading: Group name for display
Headings
Features are grouped into 10 headings:
Processing:
- Generation, Series, Processor Brand, Processor Model
- Cores, Max Frequency, Cache
- Main Memory, SSD Storage
Display:
- Screen Size, Resolution, Brightness, Viewing Angle, Camera
- HDMI, HDMI 2.0, DisplayPort, VGA
Audio:
- Speakers, Speaker Out, Mic In
- Front Speaker Out, Front Mic In
Connectivity:
- USB 3.2 Gen 2, USB 3.2 Gen 1, USB 3.0, USB 2.0, USB C
- microSD Slot, Serial Port, Parallel Port, RS485 Port
Networking:
- Ethernet, Wireless Networking
Power:
- DC Voltage, DC Current, Power Input, Power Consumption, Cable Length
Environmental:
- Operating Temperature, Operating Humidity, Certifications
Physical:
- Form Factor, Thermal Design, Stand
- Dimensions, Packing Dimensions, Weight
- Housing Material, Housing Finish, Housing Colour
- Kensington Lock
Accessories:
- Keyboard and Mouse, VESA Mount
Operating System:
- Operating System, OS Features
Recursive Aggregation Example
Consider a board with constituents:
{
    "N100": {
        "provides": {
            # ... (implementation details omitted)
The RAM part provides:
{
    "4": {
        "provides": {
            "Main Memory": 4
        }
    }
}
Aggregation:
- Start with board N100
- Add board's features: {"Processor Model": "Intel N100", "Cores": 4, "Max Frequency": 3.4}
- Recursively find RAM features: {"Main Memory": 4}
- Multiply by quantity (1): {"Main Memory": 4}
- Merge: {"Processor Model": "Intel N100", "Cores": 4, "Max Frequency": 3.4, "Main Memory": 4}
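The aggregation steps above can be sketched as a recursive function. This is an illustration, not the actual implementation: the in-memory `productdb` below is a hypothetical stand-in for productdb.json, its layout is assumed from the "provides" and "constituents" fragments shown in this article, and plain string category keys are used for brevity.

```python
from numbers import Number

# Hypothetical stand-in for productdb.json (layout assumed, see lead-in).
productdb = {
    "BOARD": {
        "N100": {
            "provides": {
                "Processor Model": "Intel N100",
                "Cores": 4,
                "Max Frequency": 3.4,
            },
            "constituents": [{"Category": "RAM", "PartID": "4", "qty": 1}],
        }
    },
    "RAM": {"4": {"provides": {"Main Memory": 4}}},
}


def find_features(category, part_id, db=productdb):
    """Collect a part's own features, then those of its constituents."""
    part_data = db.get(category, {}).get(part_id, {})
    features = dict(part_data.get("provides", {}))
    for child in part_data.get("constituents", []):
        child_features = find_features(child["Category"], child["PartID"], db)
        qty = child.get("qty", 1)
        for name, value in child_features.items():
            # Only true numbers scale with quantity; bool is excluded because
            # it counts as a Number in Python.
            if isinstance(value, Number) and not isinstance(value, bool):
                value = value * qty
            features[name] = value
    return features


board_features = find_features("BOARD", "N100")
```

The sketch assumes the constituent graph has no cycles; a part that listed itself as a constituent would recurse until Python's recursion limit is hit.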
Numeric Feature Multiplication
When a constituent has quantity > 1, numeric features are multiplied:
Example: Board with 2 RAM modules
{
    "constituents": [
        {
            "Category": "RAM",
            "PartID": "4",
            "qty": 2
        }
    ]
}
Result: Main Memory = 4 × 2 = 8 GB
Non-Numeric Feature Handling
For non-numeric features (strings), the last value wins:
Example: Multiple adapters
{
    "constituents": [
        {"Category": "ADAPTER", "PartID": "2H", "qty": 1},
        {"Category": "ADAPTER", "PartID": "1D", "qty": 1}
    ]
}
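Here 2H provides {"HDMI": "2"} and 1D provides {"DisplayPort": "1"}. Because the keys differ, the aggregation keeps both. If two parts provided the same feature, the value from the later part would replace the earlier one.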
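To make the last-value-wins rule concrete, the following sketch merges feature dictionaries in order and records any values that were replaced. `merge_features` is a hypothetical helper, and the example assumes the 2H adapter contributes the HDMI entry and the 1D adapter the DisplayPort entry.

```python
def merge_features(*feature_dicts):
    """Merge feature dicts left to right; on a shared key the later value wins."""
    merged = {}
    overwritten = {}  # feature name -> list of values that were replaced
    for features in feature_dicts:
        for name, value in features.items():
            if name in merged and merged[name] != value:
                overwritten.setdefault(name, []).append(merged[name])
            merged[name] = value
    return merged, overwritten


# Distinct keys: both entries survive the merge.
merged, overwritten = merge_features({"HDMI": "2"}, {"DisplayPort": "1"})

# Same key: the later value replaces the earlier one.
merged_clash, overwritten_clash = merge_features({"HDMI": "2"}, {"HDMI": "1"})
```

Tracking overwritten values is optional, but it gives a cheap way to spot parts that silently disagree about a string-valued feature.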
Raw vs Formatted Output
The product_features function supports two modes:
Formatted (default):
features = product_features("Treo-N100-8-256-2H-W6-11P")
# {"Processing": {"Cores": "4", "Max Frequency": "3.4 GHz", "Main Memory": "8 GB"}}
Raw (for embeddings):
features = product_features("Treo-N100-8-256-2H-W6-11P", raw=True)
# {"Processing": {"Cores": "4", "Max Frequency": "3.4", "Main Memory": "8"}}
Raw mode omits units for cleaner text in embeddings.
Filtering Facets
Not all features are used for filtering. We exclude:
- OS Features: Too specific
- Dimensions: Not useful for filtering
- Weight: Not useful for filtering
- Cache: Too technical
EXCLUDE_FACETS = {"OS Features", "Dimensions", "Weight", "Cache"}
facets_by_heading = {}
# ... (implementation details omitted)
This generates FACETS_FOR_FILTERING used in filter extraction.
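One way the omitted facet-building step could work is sketched below. The shape of the result (a heading mapped to a list of feature names) and the helper name `build_facets` are assumptions; the article only states that `FACETS_FOR_FILTERING` is generated from `feature_sequence` minus the excluded features.

```python
EXCLUDE_FACETS = {"OS Features", "Dimensions", "Weight", "Cache"}


def build_facets(feature_sequence, exclude=EXCLUDE_FACETS):
    """Map each heading to its filterable feature names, preserving order."""
    facets_by_heading = {}
    for feature, details in feature_sequence.items():
        if feature in exclude:
            continue
        facets_by_heading.setdefault(details["Heading"], []).append(feature)
    return facets_by_heading


# Illustrative subset of feature_sequence.
feature_sequence = {
    "Cores": {"Unit": "", "Heading": "Processing"},
    "Cache": {"Unit": "", "Heading": "Processing"},
    "Weight": {"Unit": "", "Heading": "Physical"},
    "Operating System": {"Unit": "", "Heading": "Operating System"},
}
FACETS_FOR_FILTERING = build_facets(feature_sequence)
```

A heading whose features are all excluded, such as "Physical" in this subset, does not appear in the result at all.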
Debugging Unused Features
The function logs features present in productdb but not in feature_sequence:
unused_features = [
    (feature, str(value))
    for feature, value in all_features.items()
    if feature not in feature_sequence
]
currentapp.log.debug(unused_features)
This helps identify missing feature definitions.
Integration with Pipeline
Feature extraction is used throughout the system:
Product Pages
Display features grouped by heading:
features = product_features(sku)
# Render in template with headings
Embeddings
Extract features for source data embedding:
features = product_features(sku, raw=True)
# Include in product description for embedding
Filtering
Generate filter options from features:
facets = FACETS_FOR_FILTERING
# Use in filter extraction and query pages
Datasheets
Display features in PDF datasheets:
features = product_features(sku)
# Render in datasheet template
Summary
Feature extraction aggregates component properties into product specifications:
Process:
- Split SKU into parts (7 components)
- Recursively find features for each part
- Aggregate features (multiply numeric by quantity)
- Format with units (GHz, GB, Watts)
- Group by heading (Processing, Display, Connectivity, etc.)
- Add computed features (weight from dimensions)
Output:
- Formatted: "4 Cores, 3.4 GHz, 8 GB RAM"
- Raw: "4, 3.4, 8" (for embeddings)
- Grouped: 10 headings with features
Benefits:
- Automatic aggregation (no manual entry)
- Consistent formatting (units from feature_sequence)
- Recursive composition (parts can have constituents)
- Flexible output (formatted or raw)
This enables automatic product specification generation from SKU structure alone.