Feature Extraction: Aggregating Component Properties

This article explains how we extract and aggregate features from a SKU's structure to produce complete product specifications.

The Problem: Computing Product Features

A product SKU like Treo-N100-8-256-2H-W6-11P is composed of 7 parts:

Chassis: Treo
Board: N100 (Intel N100 processor)
RAM: 8GB
Flash: 256GB
Adapter: 2H (2 HDMI ports)
WiFi: W6 (WiFi 6)
OS: 11P (Windows 11 Pro)

Each part provides features. We need to aggregate all features to show complete product specifications.

The Solution: Recursive Feature Aggregation

We recursively traverse the SKU structure, collecting features from each part and its constituents.

Step 1: Split SKU into Parts

part_ids = sku.split("-")  # ['Treo', 'N100', '8', '256', '2H', 'W6', '11P']
categories = [Categories.CHASSIS, Categories.BOARD, Categories.RAM, ...]
parts = list(zip(categories, part_ids))
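
A minimal sketch of this step, assuming the seven categories listed above and using plain strings in place of the Categories enum, whose full definition is not shown here. The length check is an added safeguard for illustration, not something the original code is known to perform.

```python
# Hypothetical stand-in for the Categories enum; order follows the SKU layout above.
CATEGORY_ORDER = ["CHASSIS", "BOARD", "RAM", "FLASH", "ADAPTER", "WIFI", "OS"]


def split_sku(sku):
    """Pair each hyphen-separated part ID with its category, by position."""
    part_ids = sku.split("-")
    if len(part_ids) != len(CATEGORY_ORDER):
        raise ValueError(
            f"expected {len(CATEGORY_ORDER)} parts, got {len(part_ids)}: {sku!r}"
        )
    return list(zip(CATEGORY_ORDER, part_ids))


parts = split_sku("Treo-N100-8-256-2H-W6-11P")
```

Pairing by position keeps the SKU format compact, but it also means a part ID containing a hyphen would shift every later category, which is why the sketch rejects unexpected part counts.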

Step 2: Find Features for Each Part

For each part, we look up its data in productdb.json:

def find_features(category, partid):
    part_data = productdb.get(category, {}).get(partid, {})
    features = {}
# ... (implementation details omitted)

Step 3: Aggregate All Features

all_features = {}
for category, part_id in parts:
    all_features.update(find_features(category, part_id))

Step 4: Format and Group by Heading

Features are grouped by heading using feature_sequence:

feature_sequence = {
    "Processor Model": {"Unit": "", "Heading": "Processing"},
    "Cores": {"Unit": "", "Heading": "Processing"},
# ... (implementation details omitted)

For each feature in feature_sequence that also appears in all_features, we:

Look up its heading in feature_sequence
Format the value with its unit (skipped in raw mode)
Group it under that heading

features = {}
for feature, details in feature_sequence.items():
    if feature in all_features:
# ... (implementation details omitted)
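
The loop above is elided, so the following is only a sketch of how formatting and grouping might work. The function name group_features, the raw parameter on it, and the small feature_sequence subset are assumptions for illustration; the units shown match the formatted output quoted elsewhere in this article.

```python
# Illustrative subset of feature_sequence; the real dictionary has many more entries.
feature_sequence = {
    "Processor Model": {"Unit": "", "Heading": "Processing"},
    "Cores": {"Unit": "", "Heading": "Processing"},
    "Max Frequency": {"Unit": "GHz", "Heading": "Processing"},
    "Main Memory": {"Unit": "GB", "Heading": "Processing"},
}


def group_features(all_features, raw=False):
    """Group aggregated features by heading, in feature_sequence order."""
    grouped = {}
    for feature, details in feature_sequence.items():
        if feature not in all_features:
            continue
        value = str(all_features[feature])
        # Append the unit only in formatted mode and only when one is defined.
        if not raw and details["Unit"]:
            value = f"{value} {details['Unit']}"
        grouped.setdefault(details["Heading"], {})[feature] = value
    return grouped


all_features = {"Cores": 4, "Max Frequency": 3.4, "Main Memory": 8}
formatted = group_features(all_features)
raw_output = group_features(all_features, raw=True)
```

Iterating over feature_sequence rather than all_features means the display order is controlled by the sequence definition, and features without a definition are silently left out of the grouped result.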

Step 5: Add Computed Features

Some features are computed rather than provided by parts:

Weight: net and gross weight, read from the SKU's weight and dimension data and divided by 1000 for display in kilograms

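weight_dims = get_from_sku_weight_dimensions(sku)
# Only add the entry when both weights are present and non-zero.
if weight_dims["net_weight"] and weight_dims["gross_weight"]:
    # Stored weights are divided by 1000 to express them in kg.
    net_kg = weight_dims["net_weight"] / 1000
    gross_kg = weight_dims["gross_weight"] / 1000
    # setdefault avoids a KeyError if no other "Physical" feature exists yet.
    features.setdefault("Physical", {})["Net and Gross Weight"] = f"{net_kg:.2f}kg, {gross_kg:.2f}kg"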

Feature Sequence Structure

The feature_sequence dictionary defines:

Feature name: Key in productdb "provides" field
Unit: Appended to value (e.g., "GHz", "GB", "Watts")
Heading: Group name for display

Headings

Features are grouped into 10 headings:

Processing:

Generation, Series, Processor Brand, Processor Model
Cores, Max Frequency, Cache
Main Memory, SSD Storage

Display:

Screen Size, Resolution, Brightness, Viewing Angle, Camera
HDMI, HDMI 2.0, DisplayPort, VGA

Audio:

Speakers, Speaker Out, Mic In
Front Speaker Out, Front Mic In

Connectivity:

USB 3.2 Gen 2, USB 3.2 Gen 1, USB 3.0, USB 2.0, USB C
microSD Slot, Serial Port, Parallel Port, RS485 Port

Networking:

Ethernet, Wireless Networking

Power:

DC Voltage, DC Current, Power Input, Power Consumption, Cable Length

Environmental:

Operating Temperature, Operating Humidity, Certifications

Physical:

Form Factor, Thermal Design, Stand
Dimensions, Packing Dimensions, Weight
Housing Material, Housing Finish, Housing Colour
Kensington Lock

Accessories:

Keyboard and Mouse, VESA Mount

Operating System:

Operating System, OS Features

Recursive Aggregation Example

Consider a board with constituents:

{
  "N100": {
    "provides": {
# ... (implementation details omitted)

The RAM part provides:

{
  "4": {
    "provides": {
      "Main Memory": 4
    }
  }
}

Aggregation:

Start with board N100
Add board's features: {"Processor Model": "Intel N100", "Cores": 4, "Max Frequency": 3.4}
Recursively find RAM features: {"Main Memory": 4}
Multiply by quantity (1): {"Main Memory": 4}
Merge: {"Processor Model": "Intel N100", "Cores": 4, "Max Frequency": 3.4, "Main Memory": 4}
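
The walk-through above can be sketched as a recursive function. The miniature productdb below is hypothetical, and placing "constituents" alongside "provides" is an assumption inferred from the fragments in this article. On a key collision this sketch simply overwrites the earlier value; the real implementation may treat numeric collisions differently.

```python
from numbers import Number

# Hypothetical miniature productdb; the layout is inferred, not confirmed.
productdb = {
    "BOARD": {
        "N100": {
            "provides": {
                "Processor Model": "Intel N100",
                "Cores": 4,
                "Max Frequency": 3.4,
            },
            "constituents": [{"Category": "RAM", "PartID": "4", "qty": 1}],
        }
    },
    "RAM": {"4": {"provides": {"Main Memory": 4}}},
}


def find_features(category, partid):
    """Collect a part's own features, then merge in those of its constituents."""
    part_data = productdb.get(category, {}).get(partid, {})
    features = dict(part_data.get("provides", {}))
    for item in part_data.get("constituents", []):
        qty = item.get("qty", 1)
        child = find_features(item["Category"], item["PartID"])
        for name, value in child.items():
            # Scale numeric values by quantity; bool is excluded because
            # it is a subclass of int in Python.
            if isinstance(value, Number) and not isinstance(value, bool):
                value = value * qty
            features[name] = value
    return features


board_features = find_features("BOARD", "N100")
```

Unknown categories or part IDs fall through to an empty dictionary, so the recursion ends naturally at parts that have no constituents.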

Numeric Feature Multiplication

When a constituent has a quantity greater than 1, its numeric features are multiplied by that quantity:

Example: Board with 2 RAM modules

{
  "constituents": [
    {
      "Category": "RAM",
      "PartID": "4",
      "qty": 2
    }
  ]
}

Result: Main Memory = 4 × 2 = 8 GB

Non-Numeric Feature Handling

For non-numeric features (strings), the last value wins:

Example: Multiple adapters

{
  "constituents": [
    {"Category": "ADAPTER", "PartID": "2H", "qty": 1},
    {"Category": "ADAPTER", "PartID": "1D", "qty": 1}
  ]
}

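The 2H adapter provides {"HDMI": "2"} and the 1D adapter provides {"DisplayPort": "1"}. Because the keys differ, the merged result contains both. If two parts provided the same string-valued feature, the value from the later part would replace the earlier one.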
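
A small sketch of the merge behaviour, assuming the two adapters contribute one feature each and that merging uses dict.update as in Step 3. The third, clashing part is hypothetical and only there to show the overwrite.

```python
# Features as assumed for the two adapters in the example above.
adapter_2h = {"HDMI": "2"}
adapter_1d = {"DisplayPort": "1"}

merged = {}
for provided in (adapter_2h, adapter_1d):
    # update() keeps distinct keys and replaces values for repeated keys.
    merged.update(provided)

# A hypothetical later part that repeats a key overwrites the earlier string.
merged_with_clash = dict(merged)
merged_with_clash.update({"HDMI": "1"})
```

Because string values are replaced rather than combined, the order of constituents matters whenever two of them provide the same non-numeric feature.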

Raw vs Formatted Output

The product_features function supports two modes:

Formatted (default):

features = product_features("Treo-N100-8-256-2H-W6-11P")
# {"Processing": {"Cores": "4", "Max Frequency": "3.4 GHz", "Main Memory": "8 GB"}}

Raw (for embeddings):

features = product_features("Treo-N100-8-256-2H-W6-11P", raw=True)
# {"Processing": {"Cores": "4", "Max Frequency": "3.4", "Main Memory": "8"}}

Raw mode omits units for cleaner text in embeddings.

Filtering Facets

Not all features are used for filtering. We exclude:

OS Features: Too specific
Dimensions: Not useful for filtering
Weight: Not useful for filtering
Cache: Too technical

EXCLUDE_FACETS = {"OS Features", "Dimensions", "Weight", "Cache"}

facets_by_heading = {}
# ... (implementation details omitted)

This generates FACETS_FOR_FILTERING used in filter extraction.
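
Since the construction is elided above, here is one way the facet map could be built. The build_facets helper and the heading-to-list shape of FACETS_FOR_FILTERING are assumptions; only EXCLUDE_FACETS and the heading names come from this article.

```python
EXCLUDE_FACETS = {"OS Features", "Dimensions", "Weight", "Cache"}

# Illustrative subset of feature_sequence. Units are left blank because
# they play no part in facet selection.
feature_sequence = {
    "Cores": {"Unit": "", "Heading": "Processing"},
    "Cache": {"Unit": "", "Heading": "Processing"},
    "HDMI": {"Unit": "", "Heading": "Display"},
    "Dimensions": {"Unit": "", "Heading": "Physical"},
}


def build_facets(sequence, exclude):
    """Map each heading to the feature names that remain after exclusion."""
    facets_by_heading = {}
    for feature, details in sequence.items():
        if feature in exclude:
            continue
        facets_by_heading.setdefault(details["Heading"], []).append(feature)
    return facets_by_heading


FACETS_FOR_FILTERING = build_facets(feature_sequence, EXCLUDE_FACETS)
```

A heading whose features are all excluded, such as Physical in this subset, does not appear in the result at all.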

Debugging Unused Features

The function logs features present in productdb but not in feature_sequence:

unused_features = [
    (feature, str(value))
    for feature, value in all_features.items()
    if feature not in feature_sequence
]
currentapp.log.debug(unused_features)

This helps identify missing feature definitions.
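
A self-contained variant of the same check, using Python's standard logging module in place of the application logger. The logger name, the find_unused helper, and the "Fan Speed" feature are hypothetical and exist only to make the example runnable.

```python
import logging

logger = logging.getLogger("features")  # hypothetical logger name


def find_unused(all_features, feature_sequence):
    """Return (feature, value) pairs that have no entry in feature_sequence."""
    return [
        (feature, str(value))
        for feature, value in all_features.items()
        if feature not in feature_sequence
    ]


unused = find_unused(
    {"Cores": 4, "Fan Speed": 3000},  # "Fan Speed" is a made-up feature
    {"Cores": {"Unit": "", "Heading": "Processing"}},
)
if unused:
    logger.debug("features missing from feature_sequence: %s", unused)
```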

Integration with Pipeline

Feature extraction is used throughout the system:

Product Pages

Display features grouped by heading:

features = product_features(sku)
# Render in template with headings

Embeddings

Extract features for source data embedding:

features = product_features(sku, raw=True)
# Include in product description for embedding

Filtering

Generate filter options from features:

facets = FACETS_FOR_FILTERING
# Use in filter extraction and query pages

Datasheets

Display features in PDF datasheets:

features = product_features(sku)
# Render in datasheet template

Summary

Feature extraction aggregates component properties into product specifications:

Process:

Split SKU into parts (7 components)
Recursively find features for each part
Aggregate features (multiply numeric by quantity)
Format with units (GHz, GB, Watts)
Group by heading (Processing, Display, Connectivity, etc.)
Add computed features (weight from dimensions)

Output:

Formatted: "4 Cores, 3.4 GHz, 8 GB RAM"
Raw: "4, 3.4, 8" (for embeddings)
Grouped: 10 headings with features

Benefits:

Automatic aggregation (no manual entry)
Consistent formatting (units from feature_sequence)
Recursive composition (parts can have constituents)
Flexible output (formatted or raw)

This enables automatic product specification generation from SKU structure alone.