Feature Extraction: Aggregating Component Properties

This article explains how we extract features from each part of a SKU and aggregate them into a complete product specification.

The Problem: Computing Product Features

A product SKU like Treo-N100-8-256-2H-W6-11P is composed of 7 parts:

  • Chassis: Treo

  • Board: N100 (Intel N100 processor)

  • RAM: 8GB

  • Flash: 256GB

  • Adapter: 2H (2 HDMI ports)

  • WiFi: W6 (WiFi 6)

  • OS: 11P (Windows 11 Pro)

Each part provides features. We need to aggregate all features to show complete product specifications.

The Solution: Recursive Feature Aggregation

We recursively traverse the SKU structure, collecting features from each part and its constituents.

Step 1: Split SKU into Parts

part_ids = sku.split("-")  # ['Treo', 'N100', '8', '256', '2H', 'W6', '11P']
categories = [Categories.CHASSIS, Categories.BOARD, Categories.RAM, ...]
parts = list(zip(categories, part_ids))
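
A minimal, self-contained sketch of this step follows. Only CHASSIS, BOARD and RAM appear in the snippet above; the remaining Categories members and the split_sku helper are assumptions made for illustration, named after the part list at the top of this article.

```python
from enum import Enum


class Categories(Enum):
    # Hypothetical members: only CHASSIS, BOARD and RAM are shown in the article.
    CHASSIS = "Chassis"
    BOARD = "Board"
    RAM = "RAM"
    FLASH = "Flash"
    ADAPTER = "Adapter"
    WIFI = "WiFi"
    OS = "OS"


def split_sku(sku):
    """Pair each hyphen-separated part ID with its category, by position."""
    part_ids = sku.split("-")
    categories = list(Categories)  # Enum iteration follows definition order
    if len(part_ids) != len(categories):
        raise ValueError(f"expected {len(categories)} parts, got {len(part_ids)}")
    return list(zip(categories, part_ids))


parts = split_sku("Treo-N100-8-256-2H-W6-11P")
```

The length check is a design choice in this sketch: zip silently truncates to the shorter input, so a malformed SKU would otherwise lose parts without any error.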

Step 2: Find Features for Each Part

For each part, we look up its data in productdb.json:

def find_features(category, partid):
    part_data = productdb.get(category, {}).get(partid, {})
    features = {}
# ... (implementation details omitted)

Step 3: Aggregate All Features

all_features = {}
for category, part_id in parts:
    all_features.update(find_features(category, part_id))

Step 4: Format and Group by Heading

Features are grouped by heading using feature_sequence:

feature_sequence = {
    "Processor Model": {"Unit": "", "Heading": "Processing"},
    "Cores": {"Unit": "", "Heading": "Processing"},
# ... (implementation details omitted)

For each feature in all_features, we:

  1. Look up its heading in feature_sequence
  2. Format value with unit (if not raw mode)
  3. Group under heading

features = {}
for feature, details in feature_sequence.items():
    if feature in all_features:
# ... (implementation details omitted)
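
The three steps above can be sketched as a single function. This is an illustration under assumptions, not the omitted implementation: the group_features name, the space between value and unit, and the "Max Frequency" and "Main Memory" entries of feature_sequence are all assumed.

```python
def group_features(all_features, feature_sequence, raw=False):
    """Group flat features under their headings, in feature_sequence order."""
    grouped = {}
    for feature, details in feature_sequence.items():
        if feature not in all_features:
            continue
        value = str(all_features[feature])
        unit = details.get("Unit", "")
        if unit and not raw:
            value = f"{value} {unit}"  # assumed formatting: value, space, unit
        grouped.setdefault(details["Heading"], {})[feature] = value
    return grouped


feature_sequence = {
    "Processor Model": {"Unit": "", "Heading": "Processing"},
    "Cores": {"Unit": "", "Heading": "Processing"},
    "Max Frequency": {"Unit": "GHz", "Heading": "Processing"},  # assumed entry
    "Main Memory": {"Unit": "GB", "Heading": "Processing"},  # assumed entry
}
all_features = {"Cores": 4, "Max Frequency": 3.4, "Main Memory": 8}

formatted = group_features(all_features, feature_sequence)
unformatted = group_features(all_features, feature_sequence, raw=True)
```

Iterating over feature_sequence rather than all_features means the display order is controlled by feature_sequence, and features without a definition are simply left out.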

Step 5: Add Computed Features

Some features are computed, not provided by parts:

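Weight: net and gross weight, looked up for the SKU and converted to kilograms

weight_dims = get_from_sku_weight_dimensions(sku)
# Only add the entry when both weights are present and non-zero
if weight_dims["net_weight"] and weight_dims["gross_weight"]:
    # Dividing by 1000 suggests the stored weights are in grams
    net_kg = weight_dims["net_weight"] / 1000
    gross_kg = weight_dims["gross_weight"] / 1000
    # setdefault avoids a KeyError when no other "Physical" feature exists yet
    features.setdefault("Physical", {})["Net and Gross Weight"] = f"{net_kg:.2f}kg, {gross_kg:.2f}kg"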

Feature Sequence Structure

The feature_sequence dictionary defines:

  • Feature name: Key in productdb "provides" field

  • Unit: Appended to value (e.g., "GHz", "GB", "Watts")

  • Heading: Group name for display

Headings

Features are grouped into 10 headings:

Processing:

  • Generation, Series, Processor Brand, Processor Model

  • Cores, Max Frequency, Cache

  • Main Memory, SSD Storage

Display:

  • Screen Size, Resolution, Brightness, Viewing Angle, Camera

  • HDMI, HDMI 2.0, DisplayPort, VGA

Audio:

  • Speakers, Speaker Out, Mic In

  • Front Speaker Out, Front Mic In

Connectivity:

  • USB 3.2 Gen 2, USB 3.2 Gen 1, USB 3.0, USB 2.0, USB C

  • microSD Slot, Serial Port, Parallel Port, RS485 Port

Networking:

  • Ethernet, Wireless Networking

Power:

  • DC Voltage, DC Current, Power Input, Power Consumption, Cable Length

Environmental:

  • Operating Temperature, Operating Humidity, Certifications

Physical:

  • Form Factor, Thermal Design, Stand

  • Dimensions, Packing Dimensions, Weight

  • Housing Material, Housing Finish, Housing Colour

  • Kensington Lock

Accessories:

  • Keyboard and Mouse, VESA Mount

Operating System:

  • Operating System, OS Features

Recursive Aggregation Example

Consider a board with constituents:

{
  "N100": {
    "provides": {
# ... (implementation details omitted)

The RAM part provides:

{
  "4": {
    "provides": {
      "Main Memory": 4
    }
  }
}

Aggregation:

  1. Start with board N100
  2. Add board's features: {"Processor Model": "Intel N100", "Cores": 4, "Max Frequency": 3.4}
  3. Recursively find RAM features: {"Main Memory": 4}
  4. Multiply by quantity (1): {"Main Memory": 4}
  5. Merge: {"Processor Model": "Intel N100", "Cores": 4, "Max Frequency": 3.4, "Main Memory": 4}
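
The walk-through above can be sketched as a recursive function. This is a hedged illustration, not the omitted find_features body: the in-memory productdb below is sample data reconstructed from the aggregation steps, and the "BOARD" category key is an assumption.

```python
# Sample data only; the real productdb.json is not shown in full in this article.
productdb = {
    "BOARD": {
        "N100": {
            "provides": {"Processor Model": "Intel N100", "Cores": 4, "Max Frequency": 3.4},
            "constituents": [{"Category": "RAM", "PartID": "4", "qty": 1}],
        }
    },
    "RAM": {"4": {"provides": {"Main Memory": 4}}},
}


def find_features(category, part_id):
    """Collect a part's own features, then those of its constituents."""
    part_data = productdb.get(category, {}).get(part_id, {})
    features = dict(part_data.get("provides", {}))
    for child in part_data.get("constituents", []):
        child_features = find_features(child["Category"], child["PartID"])
        qty = child.get("qty", 1)
        for name, value in child_features.items():
            # bool is a subclass of int, so exclude it from numeric scaling
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                value = value * qty
            features[name] = value  # a later value replaces an earlier one
    return features


board_features = find_features("BOARD", "N100")
```

An unknown category or part ID yields an empty dictionary here rather than an error, which mirrors the chained .get calls in Step 2.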

Numeric Feature Multiplication

When a constituent has quantity > 1, numeric features are multiplied:

Example: Board with 2 RAM modules

{
  "constituents": [
    {
      "Category": "RAM",
      "PartID": "4",
      "qty": 2
    }
  ]
}

Result: Main Memory = 4 × 2 = 8 GB
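
The multiplication needs a type check, because in Python multiplying a string by an integer repeats it ("2" * 2 is "22"). The sketch below shows one way to guard against that; the scale_features helper and the "Memory Type" feature are hypothetical and used only for illustration.

```python
def scale_features(features, qty):
    """Multiply numeric values by qty; leave every other value unchanged."""
    scaled = {}
    for name, value in features.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        scaled[name] = value * qty if is_number else value
    return scaled


# "Memory Type" is an illustrative string feature, not taken from productdb.
ram_module = {"Main Memory": 4, "Memory Type": "DDR4"}
total = scale_features(ram_module, qty=2)
```

A consequence of this check is that counts stored as strings, such as {"HDMI": "2"}, are not multiplied.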

Non-Numeric Feature Handling

For non-numeric features (strings), the last value wins:

Example: Multiple adapters

{
  "constituents": [
    {"Category": "ADAPTER", "PartID": "2H", "qty": 1},
    {"Category": "ADAPTER", "PartID": "1D", "qty": 1}
  ]
}

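These adapters provide {"HDMI": "2"} and {"DisplayPort": "1"} respectively. Because the keys differ, the merged result contains both. If two constituents provided the same non-numeric feature, the value from the later constituent would replace the earlier one.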
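
A short sketch of both cases, assuming the merge behaves like dict.update (which Step 3 uses at the top level). The second pair of dictionaries is made-up data chosen to force a key collision.

```python
adapter_2h = {"HDMI": "2"}
adapter_1d = {"DisplayPort": "1"}

# Different keys: both features survive the merge.
merged = {}
for provided in (adapter_2h, adapter_1d):
    merged.update(provided)

# Same key (illustrative data): the later value replaces the earlier one.
collided = {}
for provided in ({"HDMI": "2"}, {"HDMI": "1"}):
    collided.update(provided)
```

With last-value-wins semantics the order of the constituents list matters whenever two parts provide the same string feature.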

Raw vs Formatted Output

The product_features function supports two modes:

Formatted (default):

features = product_features("Treo-N100-8-256-2H-W6-11P")
# {"Processing": {"Cores": "4", "Max Frequency": "3.4 GHz", "Main Memory": "8 GB"}}

Raw (for embeddings):

features = product_features("Treo-N100-8-256-2H-W6-11P", raw=True)
# {"Processing": {"Cores": "4", "Max Frequency": "3.4", "Main Memory": "8"}}

Raw mode omits units for cleaner text in embeddings.

Filtering Facets

Not all features are used for filtering. We exclude:

  • OS Features: Too specific

  • Dimensions: Not useful for filtering

  • Weight: Not useful for filtering

  • Cache: Too technical

EXCLUDE_FACETS = {"OS Features", "Dimensions", "Weight", "Cache"}

facets_by_heading = {}
# ... (implementation details omitted)

This generates FACETS_FOR_FILTERING, which is used in filter extraction.
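
One way the omitted loop could look is sketched below. The shape of FACETS_FOR_FILTERING (heading mapped to a list of feature names) and the units in the sample feature_sequence are assumptions.

```python
EXCLUDE_FACETS = {"OS Features", "Dimensions", "Weight", "Cache"}

# Small sample; units for "Cache" and "Dimensions" are assumed for illustration.
feature_sequence = {
    "Cores": {"Unit": "", "Heading": "Processing"},
    "Cache": {"Unit": "MB", "Heading": "Processing"},
    "Dimensions": {"Unit": "mm", "Heading": "Physical"},
    "Operating System": {"Unit": "", "Heading": "Operating System"},
}

facets_by_heading = {}
for feature, details in feature_sequence.items():
    if feature in EXCLUDE_FACETS:
        continue  # excluded features never become filter options
    facets_by_heading.setdefault(details["Heading"], []).append(feature)

FACETS_FOR_FILTERING = facets_by_heading
```

In this sketch a heading whose features are all excluded (here "Physical") does not appear in the result at all.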

Debugging Unused Features

The function logs features present in productdb but not in feature_sequence:

unused_features = [
    (feature, str(value))
    for feature, value in all_features.items()
    if feature not in feature_sequence
]
currentapp.log.debug(unused_features)

This helps identify missing feature definitions.
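
A self-contained version of this check, using the standard library logging module as a stand-in for the application logger. The log_unused_features helper, the logger name and the "TPM" feature are assumptions for illustration.

```python
import logging

# Stand-in for the application's own logger, which is not shown in this article.
logger = logging.getLogger("feature_extraction")


def log_unused_features(all_features, feature_sequence):
    """Return and log features that have no entry in feature_sequence."""
    unused = [
        (feature, str(value))
        for feature, value in all_features.items()
        if feature not in feature_sequence
    ]
    if unused:
        logger.debug("Features missing from feature_sequence: %s", unused)
    return unused


# "TPM" is a made-up feature that has no definition in the sample sequence.
unused = log_unused_features({"Cores": 4, "TPM": "2.0"}, {"Cores": {"Unit": "", "Heading": "Processing"}})
```

Returning the list as well as logging it makes the check easy to assert on in tests.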

Integration with Pipeline

Feature extraction is used throughout the system:

Product Pages

Display features grouped by heading:

features = product_features(sku)
# Render in template with headings

Embeddings

Extract features for source data embedding:

features = product_features(sku, raw=True)
# Include in product description for embedding

Filtering

Generate filter options from features:

facets = FACETS_FOR_FILTERING
# Use in filter extraction and query pages

Datasheets

Display features in PDF datasheets:

features = product_features(sku)
# Render in datasheet template

Summary

Feature extraction aggregates component properties into product specifications:

Process:

  • Split SKU into parts (7 components)

  • Recursively find features for each part

  • Aggregate features (multiply numeric by quantity)

  • Format with units (GHz, GB, Watts)

  • Group by heading (Processing, Display, Connectivity, etc.)

  • Add computed features (weight from dimensions)

Output:

  • Formatted: "4 Cores, 3.4 GHz, 8 GB RAM"

  • Raw: "4, 3.4, 8" (for embeddings)

  • Grouped: 10 headings with features

Benefits:

  • Automatic aggregation (no manual entry)

  • Consistent formatting (units from feature_sequence)

  • Recursive composition (parts can have constituents)

  • Flexible output (formatted or raw)

This enables automatic product specification generation from SKU structure alone.

