Analytics Tracking: Privacy-First Event Collection
This article explains how we track user behavior while respecting privacy and filtering out bot traffic.
The Problem: Understanding User Behavior
We need to know:
- Which pages users visit
- Which products they view
- Where traffic comes from (Google Ads, organic, social)
- Which campaigns drive conversions
But we must avoid:
- Tracking bots and crawlers
- Storing personally identifiable information (PII)
- Violating privacy regulations
The Solution: Client-Side + Server-Side Tracking
Client-Side: JavaScript Tracking
Visitor ID: Random ID stored in cookie (365 days)
Session ID: Random ID stored in sessionStorage (until the tab is closed)
Campaign params: Extracted from URL and stored in sessionStorage
Tracked parameters:
- gclid: Google Click ID (Search ads)
- gbraid: Google Ads click ID (iOS app-to-web measurement)
- wbraid: Google Ads click ID (iOS web-to-app measurement)
- fbclid: Facebook click ID
- srsltid: Google organic search result ID
- utm_source, utm_medium, utm_campaign, utm_term, utm_content
Storage: Parameters are also stored in cookies (30 minutes) for WhatsApp/phone click attribution
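The extraction and cookie steps above could be sketched as follows. This is a minimal sketch rather than the production tracker: the parameter names come from the list above, while `extractCampaignParams`, `attributionCookie`, and the cookie name `tv_campaign` are hypothetical names chosen for illustration.

```typescript
// Parameter names are taken from the list above.
const TRACKED_PARAMS = [
  "gclid", "gbraid", "wbraid", "fbclid", "srsltid",
  "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
] as const;

type TrackedParam = (typeof TRACKED_PARAMS)[number];
type CampaignParams = Partial<Record<TrackedParam, string>>;

// Return only the tracked parameters that are present and non-empty.
function extractCampaignParams(url: string): CampaignParams {
  const query = new URL(url).searchParams;
  const params: CampaignParams = {};
  for (const name of TRACKED_PARAMS) {
    const value = query.get(name);
    if (value) params[name] = value;
  }
  return params;
}

// Build a cookie string that expires after 30 minutes (Max-Age is in seconds).
// "tv_campaign" is a hypothetical cookie name.
function attributionCookie(params: CampaignParams): string {
  const value = encodeURIComponent(JSON.stringify(params));
  return `tv_campaign=${value}; Max-Age=${30 * 60}; Path=/; SameSite=Lax`;
}
```

In a browser, the result of `attributionCookie` would be assigned to `document.cookie`, and the same object could be written to sessionStorage with `sessionStorage.setItem`.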
Server-Side: Enrichment
The server enriches events with:
GeoIP data: Country, region, city from IP address
User-Agent parsing: Browser, OS, device type
Timestamp: Server time (UTC)
Bot detection: Filters known bot user-agents
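One way to picture the enrichment step is a pure function that receives the raw event plus the request's IP and User-Agent. The sketch below is an assumption about shape only: `enrichEvent`, `EnrichmentDeps`, and the field names are hypothetical, and the GeoIP and User-Agent lookups are injected so that any database or parser can be plugged in. Bot filtering is left out here.

```typescript
interface GeoInfo { country: string; region: string; city: string }
interface DeviceInfo { browser: string; os: string; deviceType: string }

// Hypothetical dependencies: real implementations would wrap a GeoIP
// database and a User-Agent parsing library.
interface EnrichmentDeps {
  lookupGeo(ip: string): GeoInfo | null;
  parseUserAgent(userAgent: string): DeviceInfo;
  now(): Date;
}

interface RawEvent { type: string; visitorId: string; sessionId: string }

interface EnrichedEvent extends RawEvent {
  geo: GeoInfo | null;
  device: DeviceInfo;
  timestamp: string; // ISO 8601, always UTC
}

function enrichEvent(
  event: RawEvent,
  ip: string,
  userAgent: string,
  deps: EnrichmentDeps,
): EnrichedEvent {
  return {
    ...event,
    geo: deps.lookupGeo(ip),
    device: deps.parseUserAgent(userAgent),
    timestamp: deps.now().toISOString(), // server time, not client time
  };
}
```

Injecting `now` keeps the function deterministic in tests while the server clock is used in production.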
Event Types
Page view: User visits a page
Product view: User views a product page
Add to cart: User adds a product to the cart
Checkout: User initiates checkout
Purchase: User completes a purchase
WhatsApp click: User clicks the WhatsApp button
Phone click: User clicks a phone number
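The event types above map naturally onto a string union that the API can validate against. The identifiers below (`page_view`, `parseEvent`, and the field names) are illustrative assumptions, not the actual wire format.

```typescript
const EVENT_TYPES = [
  "page_view", "product_view", "add_to_cart", "checkout",
  "purchase", "whatsapp_click", "phone_click",
] as const;

type EventType = (typeof EVENT_TYPES)[number];

interface TrackedEvent {
  type: EventType;
  visitorId: string;
  sessionId: string;
  url: string;
  sku?: string; // only meaningful for product-related events
}

function isEventType(value: unknown): value is EventType {
  return typeof value === "string" &&
    (EVENT_TYPES as readonly string[]).indexOf(value) !== -1;
}

// Validate an untrusted request body; returns null when it is malformed.
function parseEvent(body: unknown): TrackedEvent | null {
  if (typeof body !== "object" || body === null) return null;
  const { type, visitorId, sessionId, url, sku } = body as Record<string, unknown>;
  if (!isEventType(type)) return null;
  if (typeof visitorId !== "string" || typeof sessionId !== "string") return null;
  if (typeof url !== "string") return null;
  const event: TrackedEvent = { type, visitorId, sessionId, url };
  if (typeof sku === "string") event.sku = sku;
  return event;
}
```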
Data Flow
sequenceDiagram
    participant User
    participant JS as JavaScript
    participant API as /api/analytics
    participant Firehose as Kinesis Firehose
    participant S3
    User->>JS: Visit page
    JS->>JS: Extract URL params (gclid, utm_*, etc.)
    JS->>JS: Store in sessionStorage
    JS->>API: POST event + params
    API->>API: Enrich with GeoIP
    API->>API: Parse User-Agent
    API->>API: Filter bots
    API->>Firehose: Send enriched event
    Firehose->>S3: Store in analytics bucket

Bot Detection
We filter bot traffic using multiple signals:
User-Agent patterns: Known bot strings (Googlebot, Bingbot, etc.)
Behavior patterns: Requests that arrive too quickly or in too large a volume
Missing JavaScript: Bots often don't execute JS, so they never send client-side events
Exclusion cookie: tv_exclude=true stops all tracking
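The User-Agent and exclusion-cookie checks could look roughly like this. It is a sketch under assumptions: the regular expression is an illustrative pattern rather than the actual bot list, `shouldTrack` is a hypothetical name, and the behavior-based signals are not covered.

```typescript
// Illustrative pattern only; a real deployment would maintain a longer list.
const BOT_UA_PATTERN = /bot|crawler|spider|headless/i;

// True when the Cookie header contains tv_exclude=true.
function hasExclusionCookie(cookieHeader: string): boolean {
  return cookieHeader
    .split(";")
    .some((part) => part.trim() === "tv_exclude=true");
}

function shouldTrack(
  userAgent: string | undefined,
  cookieHeader: string | undefined,
): boolean {
  if (!userAgent) return false; // assumption: a missing User-Agent is treated as a bot
  if (BOT_UA_PATTERN.test(userAgent)) return false;
  if (cookieHeader && hasExclusionCookie(cookieHeader)) return false;
  return true;
}
```

A broad pattern such as `bot` catches Googlebot and Bingbot, but it can also match legitimate user agents that happen to contain the substring, so the list is worth reviewing against real traffic.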
Privacy Protection
No PII: We never store names, emails, phone numbers
Anonymized IPs: Last octet removed before storage
No cross-site tracking: Cookies are first-party only
Opt-out: Users can set exclusion cookie
Data retention: Events deleted after 90 days
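As an illustration of the IP anonymization rule, the sketch below replaces the last octet of an IPv4 address with zero, which is one common reading of "last octet removed". The function name `anonymizeIpv4` is hypothetical, and IPv6 handling is not described in this article, so the sketch returns `null` for anything that is not a dotted-quad IPv4 address.

```typescript
// Zero the last octet of an IPv4 address; return null for any other input.
function anonymizeIpv4(ip: string): string | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  const valid = parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255);
  if (!valid) return null;
  parts[3] = "0";
  return parts.join(".");
}
```

With this rule, `203.0.113.42` is stored as `203.0.113.0`.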
Conditional Pixel Loading
We only load tracking pixels when relevant:
Google Ads pixel: Only if gclid, gbraid, or wbraid present
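LinkedIn pixel: Only if msclkid present (note that msclkid is the Microsoft Advertising click ID; LinkedIn's own click parameter is li_fat_id)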
Facebook pixel: Only if fbclid present
Benefit: Faster page loads, less tracking overhead
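The loading rules above reduce to a lookup from URL parameters to pixels. A minimal sketch, assuming the decision is made from the page's query string; `pixelsToLoad` and the pixel identifiers are hypothetical names, and the actual script injection is omitted.

```typescript
type Pixel = "google_ads" | "linkedin" | "facebook";

// Each pixel is paired with the URL parameters that trigger it (from the list above).
const PIXEL_TRIGGERS: ReadonlyArray<[Pixel, string[]]> = [
  ["google_ads", ["gclid", "gbraid", "wbraid"]],
  ["linkedin", ["msclkid"]],
  ["facebook", ["fbclid"]],
];

// Return the pixels whose trigger parameters appear in the query string.
function pixelsToLoad(search: string): Pixel[] {
  const query = new URLSearchParams(search);
  return PIXEL_TRIGGERS
    .filter(([, triggers]) => triggers.some((name) => query.has(name)))
    .map(([pixel]) => pixel);
}
```

In the browser this would be called as `pixelsToLoad(window.location.search)`; a visit with no click IDs returns an empty array, so no third-party script is requested.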
Traffic Source Detection
We detect traffic source from URL parameters:
Google Ads: gclid, gbraid, wbraid → utm_source=google_ads
Google Organic: srsltid → utm_source=google_search
Facebook: fbclid → utm_source=facebook
LinkedIn: msclkid → utm_source=linkedin
Direct: No parameters → utm_source=direct
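The mapping above can be expressed as an ordered rule table. In this sketch the rules are checked in the order listed, which is an assumption, as is the fallback that keeps an explicit `utm_source` when no click ID is present; `detectUtmSource` is a hypothetical name.

```typescript
// Click-ID parameters paired with the utm_source they imply, in the order listed above.
const SOURCE_RULES: ReadonlyArray<[string[], string]> = [
  [["gclid", "gbraid", "wbraid"], "google_ads"],
  [["srsltid"], "google_search"],
  [["fbclid"], "facebook"],
  [["msclkid"], "linkedin"],
];

function detectUtmSource(search: string): string {
  const query = new URLSearchParams(search);
  for (const [params, source] of SOURCE_RULES) {
    if (params.some((name) => query.has(name))) return source;
  }
  // Assumption: an explicit utm_source is kept when no click ID matches.
  return query.get("utm_source") || "direct";
}
```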
Conversion Tracking
We track conversions through the funnel:
Product view → Add to cart → Checkout → Purchase
Each step includes:
- Visitor ID (for attribution)
- Session ID (for session analysis)
- Campaign params (for ROI calculation)
- Product SKU (for product analysis)
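To make the funnel concrete, the sketch below counts how many distinct sessions reached each step. It is an in-memory illustration with hypothetical names (`funnelCounts`, `FunnelEvent`); it does not describe how the production reports are computed.

```typescript
const FUNNEL_STEPS = ["product_view", "add_to_cart", "checkout", "purchase"] as const;
type FunnelStep = (typeof FUNNEL_STEPS)[number];

interface FunnelEvent {
  sessionId: string;
  type: FunnelStep;
}

// Count the distinct sessions seen at each funnel step.
function funnelCounts(events: FunnelEvent[]): Record<FunnelStep, number> {
  const sessions: Record<FunnelStep, Set<string>> = {
    product_view: new Set<string>(),
    add_to_cart: new Set<string>(),
    checkout: new Set<string>(),
    purchase: new Set<string>(),
  };
  for (const event of events) {
    sessions[event.type].add(event.sessionId);
  }
  return {
    product_view: sessions.product_view.size,
    add_to_cart: sessions.add_to_cart.size,
    checkout: sessions.checkout.size,
    purchase: sessions.purchase.size,
  };
}
```

Dividing the `purchase` count by the `product_view` count gives a simple session-level conversion rate; grouping the same events by campaign parameter or SKU gives the per-campaign and per-product views.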
Lead Touch Tracking
When users contact us (WhatsApp, phone, email), we capture:
Contact method: WhatsApp, phone, email
Campaign params: From cookies (30-minute window)
Product context: Which product page they were on
Benefit: Attribute offline conversions to online campaigns
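A lead-touch record can be assembled at click time from the attribution cookie. The sketch assumes the campaign parameters were stored as URL-encoded JSON in a cookie with the hypothetical name `tv_campaign`; `buildLeadTouch` and the record shape are also illustrative. Because the cookie expires after 30 minutes, a missing cookie simply yields an empty campaign object.

```typescript
type ContactMethod = "whatsapp" | "phone" | "email";

interface LeadTouch {
  method: ContactMethod;
  productUrl: string;
  campaign: Record<string, string>;
}

// Read one raw cookie value from a document.cookie style string.
function readCookie(cookies: string, name: string): string | null {
  for (const part of cookies.split(";")) {
    const [key, ...rest] = part.trim().split("=");
    if (key === name) return rest.join("=");
  }
  return null;
}

function buildLeadTouch(
  method: ContactMethod,
  productUrl: string,
  cookies: string,
): LeadTouch {
  let campaign: Record<string, string> = {};
  const raw = readCookie(cookies, "tv_campaign"); // hypothetical cookie name
  if (raw !== null) {
    try {
      campaign = JSON.parse(decodeURIComponent(raw));
    } catch {
      campaign = {}; // malformed cookie: fall back to no attribution
    }
  }
  return { method, productUrl, campaign };
}
```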
Rate Limiting
The analytics endpoint is rate-limited:
Limit: 100 requests per 10 minutes per IP
Benefit: Prevents abuse and bot floods
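The limit above can be illustrated with a fixed-window counter keyed by IP. Whether the real endpoint uses a fixed window, a sliding window, or a token bucket is not stated here, so treat this as one possible sketch; `FixedWindowLimiter` is a hypothetical name and the state is kept in process memory.

```typescript
class FixedWindowLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  // Defaults mirror the stated limit: 100 requests per 10 minutes.
  constructor(limit = 100, windowMs = 10 * 60 * 1000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true when the request is allowed, false when the IP is over the limit.
  allow(ip: string, now: number = Date.now()): boolean {
    const current = this.windows.get(ip);
    if (!current || now - current.start >= this.windowMs) {
      // First request from this IP, or the previous window has expired.
      this.windows.set(ip, { start: now, count: 1 });
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}
```

A handler would call `limiter.allow(clientIp)` and respond with HTTP 429 when it returns `false`. This sketch never evicts idle entries, so a long-running process would need periodic cleanup or a shared store.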
Storage
Events are stored in S3 via Kinesis Firehose:
Format: JSON lines (one event per line)
Partitioning: By date (year/month/day/hour)
Compression: Gzip
Retention: 90 days
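For reference, the storage layout can be mimicked with two small helpers. Firehose produces the prefix and the batching itself, so this code is only an illustration of the resulting layout, assuming Firehose's default UTC `YYYY/MM/dd/HH/` prefix; `partitionPrefix` and `toJsonLines` are hypothetical names, and Gzip compression is not shown.

```typescript
// Build the year/month/day/hour prefix (UTC) for a given timestamp.
function partitionPrefix(timestamp: Date): string {
  const pad = (n: number): string => ("0" + n).slice(-2);
  return [
    timestamp.getUTCFullYear(),
    pad(timestamp.getUTCMonth() + 1), // getUTCMonth is zero-based
    pad(timestamp.getUTCDate()),
    pad(timestamp.getUTCHours()),
  ].join("/") + "/";
}

// Serialize events as JSON lines: one JSON document per line.
function toJsonLines(events: object[]): string {
  return events.map((event) => JSON.stringify(event) + "\n").join("");
}
```

An event received at 07:30 UTC on 5 January 2024 would land under the prefix `2024/01/05/07/`.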
Querying
Events are queried via AWS Athena:
Schema: Defined in Glue Data Catalog
Queries: SQL on S3 data
Use cases: Campaign ROI, product popularity, traffic sources
Summary
Our analytics system tracks user behavior while respecting privacy:
Client-side:
- ✅ Extract campaign params from URL
- ✅ Store in sessionStorage (session-scoped)
- ✅ Store in cookies (30 min for attribution)
- ✅ Send events to API
Server-side:
- ✅ Enrich with GeoIP and User-Agent
- ✅ Filter bot traffic
- ✅ Send to Kinesis Firehose
- ✅ Store in S3 (partitioned by date)
Privacy:
- ✅ No PII stored
- ✅ Anonymized IPs
- ✅ First-party cookies only
- ✅ Opt-out available
- ✅ 90-day retention
Conditional loading:
- ✅ Google Ads pixel only if gclid, gbraid, or wbraid present
- ✅ LinkedIn pixel only if msclkid present
- ✅ Facebook pixel only if fbclid present
This approach balances insights with privacy and performance.