Datasheet Generation: On-Demand and Batch PDF Creation

This article explains how we generate product datasheets as PDFs, both on-demand (when users request them) and in batch (pre-generating for popular products).

The Problem: Creating Product Datasheets

Each product needs a professional PDF datasheet with:

  • Product name and images

  • Technical specifications grouped by category

  • Company branding and contact information

  • Optimized layout (single page, balanced columns)

Generating a PDF on every request is slow (about 2 seconds per PDF), while pre-generating PDFs for all 65,000 SKUs wastes storage and time.

The Solution: Hybrid Approach

We use two generation methods:

On-demand: Generate PDF when user requests it (first time)

Batch: Pre-generate PDFs for popular products (nightly)

On-Demand Generation

URL Structure

/ds/<sku>.pdf
/ds/<sku>

Both URLs generate the same PDF.

Request Flow

@web.route("/ds/<sku>.pdf")
@web.route("/ds/<sku>")  # both URLs are served by the same view
def datasheet(sku: str):
    # Validate SKU
# ... (implementation details omitted)

Generation Process

Step 1: Get Product Data

# Product name
name = expand_sku(sku)

# ... (implementation details omitted)

Step 2: Balance Columns

Features are distributed across 3 columns to balance line counts:

# Calculate lines per feature group
all_groups = []
for heading, f_items in features.items():
# ... (implementation details omitted)
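
The balancing loop itself is omitted above. As a rough illustration only, one common way to balance line counts is a greedy pass that always drops the next-largest group into the currently shortest column. The helper name `balance_columns` and the `(heading, line_count)` input shape are assumptions made for this sketch, not the production code, and unlike a real layout this version does not preserve the original order of the groups.

```python
from typing import List, Tuple

COLUMN_COUNT = 3  # mirrors DATASHEET_COLUMN_COUNT


def balance_columns(
    groups: List[Tuple[str, int]], column_count: int = COLUMN_COUNT
) -> List[List[str]]:
    """Assign feature groups to columns so line totals stay close.

    `groups` holds (heading, estimated_line_count) pairs.
    Hypothetical helper: greedy "largest group to shortest column".
    """
    columns: List[List[str]] = [[] for _ in range(column_count)]
    totals = [0] * column_count
    # Place the tallest groups first; smaller ones then fill the gaps.
    for heading, lines in sorted(groups, key=lambda g: g[1], reverse=True):
        target = totals.index(min(totals))  # shortest column so far
        columns[target].append(heading)
        totals[target] += lines
    return columns


# Made-up groups and line counts, for illustration only.
groups = [("Processor", 6), ("Memory", 4), ("Storage", 5),
          ("Ports", 9), ("Power", 3), ("Dimensions", 3)]
columns = balance_columns(groups)
```

With these sample numbers the three columns end up with 12, 9, and 9 lines, which is about as even as six indivisible groups allow.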

Step 3: Calculate Image Heights

# Main image height (proportional to aspect ratio)
with Image.open(main_image_path) as img:
    main_width, main_height = img.size
# ... (implementation details omitted)
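
The rest of this calculation is omitted, but the underlying arithmetic is simple: an image rendered at a fixed width keeps its aspect ratio, so its height is the width multiplied by height/width in pixels. The sketch below assumes the 9.5 cm and 4.5 cm widths used in the template; `scaled_height_cm` is a hypothetical helper name.

```python
def scaled_height_cm(width_px: int, height_px: int, target_width_cm: float) -> float:
    """Height in cm of an image rendered target_width_cm wide, keeping aspect ratio."""
    if width_px <= 0 or height_px <= 0:
        raise ValueError("image dimensions must be positive")
    return target_width_cm * height_px / width_px


# An 800x600 image at the main-image width and at the thumbnail width.
main_height_cm = scaled_height_cm(800, 600, 9.5)   # 7.125
thumb_height_cm = scaled_height_cm(800, 600, 4.5)  # 3.375
```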

Step 4: Calculate Spacer

A spacer fills whatever vertical space the content leaves unused, which pushes the footer to the bottom of the page:

available_height = 25.0  # cm
content_height = total_image_height + text_height_cm
spacer_height = max(0, available_height - content_height)

Step 5: Render HTML

html = render_template(
    "datasheet.html",
    sku=sku,
# ... (implementation details omitted)

Step 6: Generate PDF

options = {
    "page-size": "A4",
    "enable-local-file-access": None,
    "load-error-handling": "ignore",
    "load-media-error-handling": "ignore",
    "no-stop-slow-scripts": None
}

# Passing False as the output path makes pdfkit return the PDF as bytes
pdf_data = pdfkit.from_string(html, False, options=options)

Step 7: Keep Only First Page

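from io import BytesIO

from pypdf import PdfReader, PdfWriter  # assumption: PyPDF2 exposes the same names

# Keep only the first page if the content overflowed onto a second one
reader = PdfReader(BytesIO(pdf_data))
if len(reader.pages) > 1:
    writer = PdfWriter()
    writer.add_page(reader.pages[0])
    output = BytesIO()
    writer.write(output)
    pdf_data = output.getvalue()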

Performance

Generation time: ~2 seconds per PDF

Caching: No caching (always fresh)

Benefit: Always up-to-date with latest product data

Batch Generation

Purpose

Pre-generate PDFs for popular products to reduce on-demand load.

Script Location

scripts/utils/generate_datasheets.py

Tracking Changes

We track changes to productdb.json sections relevant to each SKU:

def get_sku_productdb_hash(sku: str) -> str:
    """Get hash of productdb sections relevant to this SKU."""
    with open(PRODUCTDB_PATH, "r") as f:
# ... (implementation details omitted)
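
The body of the function is omitted, and the layout of productdb.json is not shown here. As a hedged sketch of the general idea, a stable hash can be computed by serializing only the relevant sections with sorted keys. The section names (`names`, `features`, `prices`) and the helper `hash_sections` are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, Iterable


def hash_sections(productdb: Dict[str, Any], section_keys: Iterable[str]) -> str:
    """Return a SHA-256 hex digest over the given top-level sections."""
    relevant = {key: productdb.get(key) for key in sorted(section_keys)}
    # sort_keys makes the serialization independent of dict ordering.
    payload = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical database content, for illustration only.
db = {"names": {"A1": "Alpha"}, "features": {"A1": {"RAM": "4 GB"}}, "prices": {"A1": 100}}
before = hash_sections(db, ["names", "features"])

db["prices"]["A1"] = 120  # unrelated section: hash stays the same
after_unrelated = hash_sections(db, ["names", "features"])

db["features"]["A1"]["RAM"] = "8 GB"  # relevant section: hash changes
after_relevant = hash_sections(db, ["names", "features"])
```

Hashing only the sections that feed the datasheet means an edit elsewhere in the database does not trigger a needless regeneration.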

Prioritization

We prioritize SKUs by request frequency:

def load_popular_skus() -> List[str]:
    """Load popular SKUs from access logs."""
    try:
# ... (implementation details omitted)
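
The log format and the body of `load_popular_skus` are not shown. A minimal sketch of frequency-based ranking, assuming access-log lines in the common "GET /ds/<sku>.pdf HTTP/1.1" shape and SKUs made of letters, digits, hyphens, and underscores, could look like this. The helper name `popular_skus` and the sample SKUs are made up.

```python
import re
from collections import Counter
from typing import Iterable, List

# Matches a datasheet request path, with or without the ".pdf" suffix.
DS_PATH = re.compile(r'"GET /ds/([A-Za-z0-9_-]+)(?:\.pdf)? HTTP')


def popular_skus(log_lines: Iterable[str], limit: int = 100) -> List[str]:
    """Return up to `limit` SKUs, most frequently requested first."""
    counts: Counter = Counter()
    for line in log_lines:
        match = DS_PATH.search(line)
        if match:
            counts[match.group(1)] += 1
    return [sku for sku, _ in counts.most_common(limit)]


sample = [
    '203.0.113.7 - - [01/Jan/2025:00:00:01 +0000] "GET /ds/N100-8.pdf HTTP/1.1" 200 51234',
    '203.0.113.7 - - [01/Jan/2025:00:00:02 +0000] "GET /ds/R5-16 HTTP/1.1" 200 49000',
    '203.0.113.7 - - [01/Jan/2025:00:00:03 +0000] "GET /ds/N100-8 HTTP/1.1" 200 51234',
    '203.0.113.7 - - [01/Jan/2025:00:00:04 +0000] "GET /index.html HTTP/1.1" 200 1000',
]
top = popular_skus(sample, limit=2)
```

Both URL forms count toward the same SKU, which matches the fact that they produce the same PDF.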

Generation Strategy

def batch_generate():
    # Load hash cache
    hash_cache = load_hash_cache()
# ... (implementation details omitted)
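
The remainder of `batch_generate` is omitted. Since it starts by loading a hash cache, a plausible reading is that it skips SKUs whose product data has not changed since the last run. The sketch below illustrates that pattern under this assumption; `select_stale_skus`, `run_batch`, and the injected `current_hash` and `generate` callables are hypothetical names, not the script's actual functions.

```python
from typing import Callable, Dict, Iterable, List


def select_stale_skus(
    skus: Iterable[str],
    hash_cache: Dict[str, str],
    current_hash: Callable[[str], str],
) -> List[str]:
    """Return SKUs whose data hash differs from the cached one (or is missing)."""
    return [sku for sku in skus if hash_cache.get(sku) != current_hash(sku)]


def run_batch(
    skus: Iterable[str],
    hash_cache: Dict[str, str],
    current_hash: Callable[[str], str],
    generate: Callable[[str], bool],
) -> Dict[str, str]:
    """Regenerate stale SKUs and return the updated hash cache."""
    updated = dict(hash_cache)
    for sku in select_stale_skus(skus, hash_cache, current_hash):
        # Record the new hash only after a successful render, so a
        # failed SKU is retried on the next run.
        if generate(sku):
            updated[sku] = current_hash(sku)
    return updated


# Toy data: "A" is unchanged, "B" changed, "C" has never been generated.
hashes = {"A": "h1", "B": "h2-new", "C": "h3"}
cache = {"A": "h1", "B": "h2-old"}
generated: List[str] = []
new_cache = run_batch(
    ["A", "B", "C"], cache, hashes.__getitem__,
    lambda sku: generated.append(sku) or True,
)
```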

Scheduling

Batch generation runs nightly via cron:

0 2 * * * cd /home/ubuntu/manage && python scripts/utils/generate_datasheets.py

Storage

Local: /home/ubuntu/web-static/ds/<sku>.pdf

S3: s3://thinvent-web-static/ds/<sku>.pdf

CDN: Served via CloudFront

Template Structure

The datasheet template uses a 3-column layout:

<div class="container">
  <!-- Header with logo and product name -->
  <div class="header">
    <img src="logo.png">
    <h1>{{ name }}</h1>
  </div>

  <!-- Main image -->
  <img src="{{ images[0][1] }}" style="width: 9.5cm">

  <!-- Thumbnails (up to 4) -->
  <div class="thumbnails">
    {% for name, url in images[1:5] %}
      <img src="{{ url }}" style="width: 4.5cm">
    {% endfor %}
  </div>

  <!-- Features in 3 columns -->
  <div class="features">
    <div class="column">
      {% for heading, items in features1 %}
        <h3>{{ heading }}</h3>
        {% for name, value in items.items() %}
          <div><strong>{{ name }}:</strong> {{ value }}</div>
        {% endfor %}
      {% endfor %}
    </div>

    <div class="column">
      <!-- features2 -->
    </div>

    <div class="column">
      <!-- features3 -->
    </div>
  </div>

  <!-- Spacer to push footer down -->
  <div style="height: {{ spacer_height }}cm"></div>

  <!-- Footer with contact info -->
  <div class="footer">
    <p>www.thinvent.in | sales@thinvent.in | +91-124-4343177</p>
  </div>
</div>

Configuration

Constants control layout:

DATASHEET_COLUMN_COUNT = 3
DATASHEET_HEADING_LINE_WEIGHT = 2  # Lines per heading
DATASHEET_FEATURE_NAME_MAX_LEN = 30  # Chars before wrapping
DATASHEET_FEATURE_VALUE_CHARS_PER_LINE = 40  # Chars per line
PDF_PAGE_SIZE = "A4"
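
To show how these constants might feed the column balancing, here is a hedged sketch that estimates the rendered line count of one feature group. The wrapping rule (a feature name longer than the maximum adds one extra line) is an assumption made for illustration, and `estimate_group_lines` is a hypothetical helper.

```python
import math
from typing import Dict

DATASHEET_HEADING_LINE_WEIGHT = 2
DATASHEET_FEATURE_NAME_MAX_LEN = 30
DATASHEET_FEATURE_VALUE_CHARS_PER_LINE = 40


def estimate_group_lines(items: Dict[str, str]) -> int:
    """Estimate rendered lines for one feature group (heading plus items)."""
    lines = DATASHEET_HEADING_LINE_WEIGHT
    for name, value in items.items():
        # Each value takes at least one line, more if it wraps.
        item_lines = max(
            1, math.ceil(len(str(value)) / DATASHEET_FEATURE_VALUE_CHARS_PER_LINE)
        )
        # Assumption: an over-long feature name costs one extra line.
        if len(name) > DATASHEET_FEATURE_NAME_MAX_LEN:
            item_lines += 1
        lines += item_lines
    return lines


# Heading (2) + short value (1) + 85-character value (3 lines at 40 chars each)
sample_lines = estimate_group_lines({"CPU": "Intel N100", "Ports": "x" * 85})
```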

Integration Points

Product Pages

Link to datasheet:

<a href="/ds/{{ sku }}.pdf" target="_blank">Download Datasheet</a>

Google Shopping

Datasheets are linked in the product feed:

<g:product_detail>
  <g:section_name>Datasheet</g:section_name>
  <g:attribute_name>PDF</g:attribute_name>
  <g:attribute_value>https://www.thinvent.in/ds/{{ sku }}.pdf</g:attribute_value>
</g:product_detail>

Email Campaigns

Datasheets are attached to quote emails.

Error Handling

Invalid SKU

if not check_sku(sku):
    abort(422, description="Product not found.")

Generation Failure

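try:
    pdf_data = generate_datasheet_pdf(sku)
except Exception:
    # Log the full traceback, then return a generic error to the client
    logger.exception("PDF generation failed for %s", sku)
    abort(500, description="PDF generation failed.")

# Checked outside the try block so the HTTPException raised by abort()
# is not caught and logged a second time by the handler above
if pdf_data is None:
    abort(500, description="PDF generation failed.")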

Missing Images

try:
    with Image.open(image_path) as img:
        width, height = img.size
except Exception:
    # Use default dimensions
    width, height = 800, 600

Summary

Datasheet generation uses a hybrid approach:

On-Demand:

  • ✅ Generate when user requests

  • ✅ Always up-to-date

  • ✅ No storage waste

  • ✅ ~2 seconds per PDF

Batch:

  • ✅ Pre-generate popular products

  • ✅ Track productdb changes

  • ✅ Prioritize by request frequency

  • ✅ Nightly cron job

Process:

  • ✅ Get product data (name, images, features, description)

  • ✅ Balance features across 3 columns

  • ✅ Calculate image heights

  • ✅ Render HTML template

  • ✅ Generate PDF with pdfkit

  • ✅ Keep only first page

Storage:

  • ✅ Local: /home/ubuntu/web-static/ds/

  • ✅ S3: s3://thinvent-web-static/ds/

  • ✅ CDN: CloudFront distribution

This hybrid approach balances freshness (on-demand) with performance (batch pre-generation).

