Datasheet Generation: On-Demand and Batch PDF Creation
This article explains how we generate product datasheets as PDFs, both on-demand (when users request them) and in batch (pre-generating for popular products).
The Problem: Creating Product Datasheets
Each product needs a professional PDF datasheet with:
- Product name and images
- Technical specifications grouped by category
- Company branding and contact information
- Optimized layout (single page, balanced columns)
Generating a PDF on every request is slow (~2 seconds per PDF). Pre-generating all 65,000 SKUs wastes storage and time: at ~2 seconds each, a sequential full run would take roughly 36 hours.
The Solution: Hybrid Approach
We use two generation methods:
On-demand: Generate PDF when user requests it (first time)
Batch: Pre-generate PDFs for popular products (nightly)
On-Demand Generation
URL Structure
/ds/<sku>.pdf
/ds/<sku>
Both URLs generate the same PDF.
Request Flow
@web.route("/ds/<sku>.pdf")
def datasheet(sku: str):
    # Validate SKU
    # ... (implementation details omitted)
Generation Process
Step 1: Get Product Data
# Product name
name = expand_sku(sku)
# ... (implementation details omitted)
Step 2: Balance Columns
Features are distributed across 3 columns to balance line counts:
# Calculate lines per feature group
all_groups = []
for heading, f_items in features.items():
    # ... (implementation details omitted)
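A minimal sketch of one way to balance the columns, assuming each feature group has already been reduced to a (heading, line count) pair. It uses a greedy rule (largest group first, into the currently shortest column). The omitted implementation may work differently, and `balance_columns` is a hypothetical name.

```python
from typing import List, Tuple

Group = Tuple[str, int]  # (heading, estimated line count)


def balance_columns(groups: List[Group], column_count: int = 3) -> List[List[Group]]:
    """Greedily assign groups to columns so line totals stay close."""
    columns: List[List[Group]] = [[] for _ in range(column_count)]
    totals = [0] * column_count
    # Place the tallest groups first; each goes into the shortest column so far.
    for group in sorted(groups, key=lambda g: g[1], reverse=True):
        target = totals.index(min(totals))
        columns[target].append(group)
        totals[target] += group[1]
    return columns


# Made-up line counts, for illustration only.
groups = [("Processor", 6), ("Memory", 4), ("Storage", 5), ("Ports", 9), ("Power", 3)]
columns = balance_columns(groups)
line_totals = [sum(lines for _, lines in col) for col in columns]
```

With these sample counts every column ends up with 9 lines. Note that the greedy rule reorders groups; if the original group order must be preserved on the page, a different partitioning strategy would be needed.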
Step 3: Calculate Image Heights
# Main image height (proportional to aspect ratio)
with Image.open(main_image_path) as img:
    main_width, main_height = img.size
# ... (implementation details omitted)
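The proportional-height calculation can be sketched as follows, assuming the main image is rendered 9.5 cm wide and thumbnails 4.5 cm wide, as in the template. `rendered_height_cm` is a hypothetical helper, not a function from the codebase.

```python
def rendered_height_cm(width_px: int, height_px: int, rendered_width_cm: float) -> float:
    """Height on the page when the image is scaled to rendered_width_cm."""
    if width_px <= 0:
        raise ValueError("image width must be positive")
    # Scaling preserves the aspect ratio: height / width stays constant.
    return rendered_width_cm * height_px / width_px


# An 800x600 image (the fallback size used for unreadable images).
main_height_cm = rendered_height_cm(800, 600, 9.5)   # main image, 9.5 cm wide
thumb_height_cm = rendered_height_cm(800, 600, 4.5)  # thumbnail, 4.5 cm wide
```

For a 4:3 image this gives 7.125 cm for the main image and 3.375 cm per thumbnail.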
Step 4: Calculate Spacer
Push footer to bottom of page:
available_height = 25.0 # cm
content_height = total_image_height + text_height_cm
spacer_height = max(0, available_height - content_height)
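A short worked example of the spacer arithmetic, wrapped in a hypothetical function so both cases can be compared. The input heights are made-up values.

```python
def spacer_height_cm(total_image_height: float, text_height_cm: float,
                     available_height: float = 25.0) -> float:
    """Blank space (in cm) needed to push the footer to the page bottom."""
    content_height = total_image_height + text_height_cm
    # Never return a negative height when the content is taller than the page.
    return max(0.0, available_height - content_height)


fits = spacer_height_cm(10.5, 12.0)       # content is 22.5 cm, spacer is 2.5 cm
overflows = spacer_height_cm(10.5, 16.0)  # content is 26.5 cm, so no spacer
```

In the overflow case the spacer collapses to zero and the content may spill onto a second page, which is why Step 7 keeps only the first page.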
Step 5: Render HTML
html = render_template(
    "datasheet.html",
    sku=sku,
    # ... (implementation details omitted)
Step 6: Generate PDF
options = {
    "page-size": "A4",
    "enable-local-file-access": None,
    "load-error-handling": "ignore",
    "load-media-error-handling": "ignore",
    "no-stop-slow-scripts": None,
}
pdf_data = pdfkit.from_string(html, options=options)
Step 7: Keep Only First Page
reader = PdfReader(BytesIO(pdf_data))
if len(reader.pages) > 1:
    writer = PdfWriter()
    writer.add_page(reader.pages[0])
    output = BytesIO()
    writer.write(output)
    pdf_data = output.getvalue()
Performance
Generation time: ~2 seconds per PDF
Caching: No caching (always fresh)
Benefit: Always up-to-date with latest product data
Batch Generation
Purpose
Pre-generate PDFs for popular products to reduce on-demand load.
Script Location
scripts/utils/generate_datasheets.py
Tracking Changes
We track changes to productdb.json sections relevant to each SKU:
def get_sku_productdb_hash(sku: str) -> str:
    """Get hash of productdb sections relevant to this SKU."""
    with open(PRODUCTDB_PATH, "r") as f:
        # ... (implementation details omitted)
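A minimal sketch of the hashing idea, assuming productdb.json loads into a dict and that the relevant sections can be named by top-level key. How the real code decides which sections matter for a SKU is omitted above, so `hash_sections` and the sample data are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, Iterable


def hash_sections(productdb: Dict[str, Any], section_keys: Iterable[str]) -> str:
    """Return a stable SHA-256 hex digest of the selected sections."""
    relevant = {key: productdb.get(key) for key in sorted(section_keys)}
    # sort_keys makes the serialization independent of dict ordering.
    payload = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Made-up sections, for illustration only.
db = {"cpu": {"n100": "Intel N100"}, "ram": {"8g": "8 GB"}, "os": {"w11": "Windows 11"}}
before = hash_sections(db, ["cpu", "ram"])
db["os"]["w11"] = "Windows 11 Pro"   # unrelated section: hash stays the same
unrelated = hash_sections(db, ["cpu", "ram"])
db["ram"]["8g"] = "8 GB DDR4"        # relevant section: hash changes
after = hash_sections(db, ["cpu", "ram"])
```

Hashing only the relevant sections means an edit elsewhere in productdb.json does not force every datasheet to be regenerated.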
Prioritization
We prioritize SKUs by request frequency:
def load_popular_skus() -> List[str]:
    """Load popular SKUs from access logs."""
    try:
        # ... (implementation details omitted)
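One way to rank SKUs by request frequency is sketched below. It assumes each access-log line contains the request path (`/ds/<sku>.pdf` or `/ds/<sku>`) and that SKUs consist of letters, digits, hyphens, and underscores. The log format, the SKU character set, and the name `popular_skus` are all assumptions.

```python
import re
from collections import Counter
from typing import Iterable, List

# Matches "/ds/<sku>" with an optional ".pdf" suffix inside a log line.
DS_PATH = re.compile(r'/ds/([A-Za-z0-9_-]+?)(?:\.pdf)?(?=[\s?"]|$)')


def popular_skus(log_lines: Iterable[str], limit: int = 100) -> List[str]:
    """Return the most frequently requested SKUs, most popular first."""
    counts: Counter = Counter()
    for line in log_lines:
        match = DS_PATH.search(line)
        if match:
            counts[match.group(1)] += 1
    return [sku for sku, _ in counts.most_common(limit)]


# Made-up log lines, for illustration only.
sample = [
    '1.2.3.4 - - "GET /ds/A100.pdf HTTP/1.1" 200',
    '1.2.3.4 - - "GET /ds/B200 HTTP/1.1" 200',
    '1.2.3.4 - - "GET /ds/A100 HTTP/1.1" 200',
    '1.2.3.4 - - "GET /index.html HTTP/1.1" 200',
]
top = popular_skus(sample, limit=2)
```

Both URL forms count toward the same SKU, so `A100` ranks ahead of `B200` in this sample.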
Generation Strategy
def batch_generate():
    # Load hash cache
    hash_cache = load_hash_cache()
    # ... (implementation details omitted)
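The core of such a strategy is to skip SKUs whose data has not changed. The sketch below assumes the hash cache is a dict mapping SKU to its last-seen hash, and it takes the hashing and generation steps as parameters so it does not depend on the omitted helpers. `regenerate_changed` is a hypothetical name.

```python
from typing import Callable, Dict, Iterable, List


def regenerate_changed(
    skus: Iterable[str],
    hash_cache: Dict[str, str],
    current_hash: Callable[[str], str],
    generate: Callable[[str], None],
) -> List[str]:
    """Regenerate only SKUs whose productdb hash differs from the cached one."""
    regenerated = []
    for sku in skus:  # expected to arrive in priority order
        new_hash = current_hash(sku)
        if hash_cache.get(sku) == new_hash:
            continue  # data unchanged since the last run; keep the existing PDF
        generate(sku)
        hash_cache[sku] = new_hash  # record only after a successful generation
        regenerated.append(sku)
    return regenerated


# Made-up hashes: A100 is unchanged, B200 changed, C300 is new.
cache = {"A100": "h1", "B200": "h2"}
hashes = {"A100": "h1", "B200": "h2-new", "C300": "h3"}
done = regenerate_changed(["A100", "B200", "C300"], cache, hashes.__getitem__, lambda sku: None)
```

Updating the cache only after `generate` returns means a failed generation is retried on the next run.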
Scheduling
Batch generation runs nightly via cron:
0 2 * * * cd /home/ubuntu/manage && python scripts/utils/generate_datasheets.py
Storage
Local: /home/ubuntu/web-static/ds/<sku>.pdf
S3: s3://thinvent-web-static/ds/<sku>.pdf
CDN: Served via CloudFront
Template Structure
The datasheet template uses a 3-column layout:
<div class="container">
  <!-- Header with logo and product name -->
  <div class="header">
    <img src="logo.png">
    <h1>{{ name }}</h1>
  </div>
  <!-- Main image -->
  <img src="{{ images[0][1] }}" style="width: 9.5cm">
  <!-- Thumbnails (up to 4) -->
  <div class="thumbnails">
    {% for name, url in images[1:5] %}
    <img src="{{ url }}" style="width: 4.5cm">
    {% endfor %}
  </div>
  <!-- Features in 3 columns -->
  <div class="features">
    <div class="column">
      {% for heading, items in features1 %}
      <h3>{{ heading }}</h3>
      {% for name, value in items.items() %}
      <div><strong>{{ name }}:</strong> {{ value }}</div>
      {% endfor %}
      {% endfor %}
    </div>
    <div class="column">
      <!-- features2 -->
    </div>
    <div class="column">
      <!-- features3 -->
    </div>
  </div>
  <!-- Spacer to push footer down -->
  <div style="height: {{ spacer_height }}cm"></div>
  <!-- Footer with contact info -->
  <div class="footer">
    <p>www.thinvent.in | sales@thinvent.in | +91-124-4343177</p>
  </div>
</div>
Configuration
Constants control layout:
DATASHEET_COLUMN_COUNT = 3
DATASHEET_HEADING_LINE_WEIGHT = 2 # Lines per heading
DATASHEET_FEATURE_NAME_MAX_LEN = 30 # Chars before wrapping
DATASHEET_FEATURE_VALUE_CHARS_PER_LINE = 40 # Chars per line
PDF_PAGE_SIZE = "A4"
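To illustrate how these constants could feed the column balancing, here is a hedged sketch of a line estimate for one feature group. The exact formula is not shown in this article, so the rules below (a heading costs the heading weight, values wrap every 40 characters, an over-long name costs one extra line) are assumptions, as is the name `estimate_group_lines`.

```python
import math
from typing import Dict

DATASHEET_HEADING_LINE_WEIGHT = 2
DATASHEET_FEATURE_NAME_MAX_LEN = 30
DATASHEET_FEATURE_VALUE_CHARS_PER_LINE = 40


def estimate_group_lines(items: Dict[str, str]) -> int:
    """Estimate rendered lines for one feature group (heading plus items)."""
    lines = DATASHEET_HEADING_LINE_WEIGHT
    for name, value in items.items():
        # Every item takes at least one line, even with an empty value.
        value_lines = max(1, math.ceil(len(value) / DATASHEET_FEATURE_VALUE_CHARS_PER_LINE))
        # Assumption: an over-long name wraps and costs one extra line.
        name_extra = 1 if len(name) > DATASHEET_FEATURE_NAME_MAX_LEN else 0
        lines += value_lines + name_extra
    return lines


# Made-up group: one short value and one 85-character value.
ports = {"USB": "4 x USB 3.0", "Video": "x" * 85}
estimate = estimate_group_lines(ports)
```

Here the heading contributes 2 lines, the short value 1 line, and the 85-character value 3 lines, for an estimate of 6.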
Integration Points
Product Pages
Link to datasheet:
<a href="/ds/{{ sku }}.pdf" target="_blank">Download Datasheet</a>
Google Shopping
Datasheets linked in product feed:
<g:product_detail>
<g:section_name>Datasheet</g:section_name>
<g:attribute_name>PDF</g:attribute_name>
<g:attribute_value>https://www.thinvent.in/ds/{{ sku }}.pdf</g:attribute_value>
</g:product_detail>
Email Campaigns
Datasheets attached to quote emails.
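A minimal sketch of attaching a generated datasheet to an email with Python's standard library. The addresses, subject line, and the name `build_quote_email` are placeholders, and the actual quote-email code may use a different mail library.

```python
from email.message import EmailMessage


def build_quote_email(sender: str, recipient: str, sku: str, pdf_data: bytes) -> EmailMessage:
    """Build a message with the datasheet PDF attached as <sku>.pdf."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"Quote for {sku}"
    msg.set_content(f"Please find the datasheet for {sku} attached.")
    msg.add_attachment(pdf_data, maintype="application", subtype="pdf", filename=f"{sku}.pdf")
    return msg


# Placeholder bytes stand in for real PDF data.
message = build_quote_email("sales@example.com", "buyer@example.com", "A100", b"%PDF-1.4 demo")
attachment = next(message.iter_attachments())
```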
Error Handling
Invalid SKU
if not check_sku(sku):
    abort(422, description="Product not found.")
Generation Failure
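try:
    pdf_data = generate_datasheet_pdf(sku)
except Exception as e:
    logger.error(f"PDF generation failed for {sku}: {e}")
    abort(500, description="PDF generation failed.")
# Check for None outside the try block: abort() works by raising an exception,
# which the broad "except Exception" above would otherwise catch and log again.
if pdf_data is None:
    abort(500, description="PDF generation failed.")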
Missing Images
try:
    with Image.open(image_path) as img:
        width, height = img.size
except Exception:
    # Use default dimensions
    width, height = 800, 600
References
Related Articles
- Feature Extraction - Getting product specifications
- Content AI Generation - Product descriptions
- SKU Structure - Product identifiers
Summary
Datasheet generation uses a hybrid approach:
On-Demand:
- ✅ Generate when user requests
- ✅ Always up-to-date
- ✅ No storage waste
- ✅ ~2 seconds per PDF
Batch:
- ✅ Pre-generate popular products
- ✅ Track productdb changes
- ✅ Prioritize by request frequency
- ✅ Nightly cron job
Process:
- ✅ Get product data (name, images, features, description)
- ✅ Balance features across 3 columns
- ✅ Calculate image heights
- ✅ Render HTML template
- ✅ Generate PDF with pdfkit
- ✅ Keep only first page
Storage:
- ✅ Local: /home/ubuntu/web-static/ds/
- ✅ S3: s3://thinvent-web-static/ds/
- ✅ CDN: CloudFront distribution
This hybrid approach balances freshness (on-demand) with performance (batch pre-generation).