Predictive Analytics in E-commerce: How Early Machine

#machine learning#recommendations#collaborative filtering#e-commerce#2013#analytics

📋 Table of Contents ▼

Amazon introduced "customers who bought this also bought" in 1998. It's now the template for every e-commerce recommendation system. But in 2013, building something similar for a regional retailer without Amazon's infrastructure - without Spark, without real-time ML pipelines, without GPU compute - required a different approach.

We built it with SQL, Python, and a nightly batch job. It worked.

The Data: What We Had

The retailer had 18 months of order history. Schema:

-- Orders table
CREATE TABLE orders (
    id          INT PRIMARY KEY,
    user_id     INT NOT NULL,
    created_at  TIMESTAMP,
    status      ENUM('completed', 'cancelled', 'refunded')
);

-- Order items
CREATE TABLE order_items (
    id          INT PRIMARY KEY,
    order_id    INT REFERENCES orders(id),
    product_id  INT REFERENCES products(id),
    quantity    INT,
    price       DECIMAL(10,2)
);

-- Products
CREATE TABLE products (
    id          INT PRIMARY KEY,
    name        VARCHAR(255),
    category    VARCHAR(100),
    price       DECIMAL(10,2)
);

180,000 completed orders. 42,000 unique customers. 3,800 products. Not "big data" - this fit comfortably in MySQL.

Item-Based Collaborative Filtering

The core algorithm: find products that are frequently purchased together. If customers who bought Product A also frequently bought Product B, then showing Product B on Product A's page should increase conversions.

Step 1: Build a co-occurrence matrix

-- Find all pairs of products bought in the same order
-- This creates a "co-occurrence" count for each pair

CREATE TABLE product_cooccurrence AS
SELECT 
    a.product_id AS product_a,
    b.product_id AS product_b,
    COUNT(DISTINCT a.order_id) AS cooccurrence_count
FROM order_items a
JOIN order_items b 
    ON a.order_id = b.order_id 
    AND a.product_id < b.product_id  -- Avoid counting both (A,B) and (B,A)
JOIN orders o ON a.order_id = o.id
WHERE o.status = 'completed'
GROUP BY a.product_id, b.product_id
HAVING COUNT(DISTINCT a.order_id) >= 3  -- Minimum 3 co-occurrences
ORDER BY cooccurrence_count DESC;

This query ran for 4 minutes on the production database. First thing we moved to a read replica.

Step 2: Calculate similarity scores

Raw co-occurrence counts favor popular products - Product A might co-occur with Product B 500 times simply because Product A is in 40% of all orders. We needed a normalized score.

We used the Jaccard similarity coefficient: |A ∩ B| / |A ∪ B| - the intersection divided by the union.

import mysql.connector
from collections import defaultdict

def calculate_jaccard_similarities(db):
    """
    Jaccard similarity: cooccurrence(A,B) / (orders_with_A + orders_with_B - cooccurrence(A,B))
    """
    cursor = db.cursor(dictionary=True)
    
    # Get total order count per product
    cursor.execute("""
        SELECT product_id, COUNT(DISTINCT order_id) as order_count
        FROM order_items
        JOIN orders USING (order_id)
        WHERE orders.status = 'completed'
        GROUP BY product_id
    """)
    order_counts = {row['product_id']: row['order_count'] for row in cursor.fetchall()}
    
    # Get co-occurrences
    cursor.execute("SELECT product_a, product_b, cooccurrence_count FROM product_cooccurrence")
    
    similarities = []
    for row in cursor.fetchall():
        a = row['product_a']
        b = row['product_b']
        co = row['cooccurrence_count']
        
        count_a = order_counts.get(a, 0)
        count_b = order_counts.get(b, 0)
        
        # Jaccard: intersection / union
        union = count_a + count_b - co
        if union > 0:
            jaccard = co / union
        else:
            jaccard = 0
        
        similarities.append((a, b, jaccard, co))
        similarities.append((b, a, jaccard, co))  # Symmetric
    
    return similarities

def save_recommendations(db, similarities):
    """Save top 10 similar products per product."""
    # Group by product_a
    by_product = defaultdict(list)
    for a, b, score, co in similarities:
        by_product[a].append((b, score, co))
    
    cursor = db.cursor()
    cursor.execute("TRUNCATE TABLE product_recommendations")
    
    for product_id, related in by_product.items():
        # Sort by Jaccard score, take top 10
        top_10 = sorted(related, key=lambda x: x[1], reverse=True)[:10]
        
        for rank, (related_id, score, co) in enumerate(top_10):
            cursor.execute("""
                INSERT INTO product_recommendations 
                (product_id, recommended_product_id, similarity_score, cooccurrence_count, rank)
                VALUES (%s, %s, %s, %s, %s)
            """, (product_id, related_id, score, co, rank + 1))
    
    db.commit()

User-Based Recommendations

Item-based similarity showed "customers who bought this also bought." For logged-in users, we could also show "products similar users liked" - user-based collaborative filtering.

def get_user_recommendations(user_id, db, limit=10):
    """
    Find products purchased by similar users that this user hasn't bought.
    """
    cursor = db.cursor(dictionary=True)
    
    # Get products this user has purchased
    cursor.execute("""
        SELECT DISTINCT product_id 
        FROM order_items 
        JOIN orders USING (order_id)
        WHERE orders.user_id = %s AND orders.status = 'completed'
    """, (user_id,))
    user_products = {row['product_id'] for row in cursor.fetchall()}
    
    if not user_products:
        return []  # No purchase history - fall back to popular items
    
    # Find users with similar purchase history (Jaccard similarity)
    product_list = ','.join(str(p) for p in user_products)
    
    cursor.execute(f"""
        SELECT 
            o.user_id,
            COUNT(DISTINCT oi.product_id) as overlap,  -- Products in common
            (
                SELECT COUNT(DISTINCT product_id) 
                FROM order_items JOIN orders USING (order_id)
                WHERE orders.user_id = o.user_id AND orders.status = 'completed'
            ) as their_total
        FROM orders o
        JOIN order_items oi ON o.id = oi.order_id
        WHERE o.status = 'completed'
          AND o.user_id != {user_id}
          AND oi.product_id IN ({product_list})
        GROUP BY o.user_id
        HAVING overlap >= 2
    """)
    
    similar_users = cursor.fetchall()
    
    # Calculate Jaccard for each similar user
    scored_users = []
    for row in similar_users:
        overlap = row['overlap']
        their_total = row['their_total']
        our_total = len(user_products)
        union = our_total + their_total - overlap
        jaccard = overlap / union if union > 0 else 0
        scored_users.append((row['user_id'], jaccard))
    
    # Take top 20 most similar users
    top_users = sorted(scored_users, key=lambda x: x[1], reverse=True)[:20]
    if not top_users:
        return []
    
    user_ids = ','.join(str(u[0]) for u in top_users)
    
    # Get products those users bought that our user hasn't
    cursor.execute(f"""
        SELECT 
            oi.product_id,
            COUNT(DISTINCT o.user_id) as buyer_count,
            p.name,
            p.price
        FROM orders o
        JOIN order_items oi ON o.id = oi.order_id
        JOIN products p ON oi.product_id = p.id
        WHERE o.user_id IN ({user_ids})
          AND o.status = 'completed'
          AND oi.product_id NOT IN ({product_list})
        GROUP BY oi.product_id, p.name, p.price
        ORDER BY buyer_count DESC
        LIMIT {limit}
    """)
    
    return cursor.fetchall()

The Nightly Batch Job

We ran the full recommendation refresh every night at 2am:

#!/bin/bash
# /etc/cron.d/recommendations
# 0 2 * * * recommender /usr/bin/python3 /app/scripts/rebuild_recommendations.py

# rebuild_recommendations.py
import time
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
log = logging.getLogger('recommendations')

def main():
    log.info("Starting recommendation rebuild")
    start = time.time()
    
    db = connect_to_replica()  # Use read replica, not primary
    
    log.info("Building co-occurrence matrix...")
    build_cooccurrence_table(db)
    
    log.info("Calculating Jaccard similarities...")
    similarities = calculate_jaccard_similarities(db)
    log.info(f"Found {len(similarities)} product pairs")
    
    # Write results to primary
    write_db = connect_to_primary()
    save_recommendations(write_db, similarities)
    
    elapsed = time.time() - start
    log.info(f"Recommendation rebuild complete in {elapsed:.1f}s")

if __name__ == '__main__':
    main()

Runtime: 8 minutes on the read replica. The write to the primary took 2 minutes. Total: 10 minutes, 2am-2:10am, zero user impact.

The A/B Test Results

We ran a 30-day A/B test: 50% of product pages showed recommendations, 50% didn't.

Metric	Control (no recs)	Treatment (with recs)
Average order value	$43.20	$51.80
Items per order	1.8	2.3
Recommendation click rate	-	8.4%
Recommendation conversion rate	-	14.2%

Average order value increased 20%. This was the metric the client cared about. The recommendation system paid for itself in week 2 of the A/B test.

The 8.4% click rate on recommendations was significantly higher than industry benchmarks we'd read (2-4% for email recommendations). The difference: contextual product-page recommendations vs. email blasts. Relevance in context converts far better than relevance without context.

Limits and What Came After

Our system had a cold start problem: new products had no purchase history, so no co-occurrences, so no recommendations. We handled this by using category similarity as a fallback: new products in "Electronics" were recommended alongside top-selling electronics.

New users had no purchase history for user-based recommendations. Fallback: popularity-based recommendations (products most ordered in the last 30 days), filtered by browsing history from the session.

By 2015-2016, Apache Spark and MLlib made building recommendation systems on larger datasets much easier. By 2020, cloud ML services (AWS Personalize, Google Recommendations AI) made it a configuration exercise. Our 2013 batch job was replaced by real-time inference serving personalized recommendations per page view.

But the algorithm - collaborative filtering via item co-occurrence and Jaccard similarity - is still the foundation. The infrastructure changed; the math didn't.

Aunimeda builds AI-powered solutions - chatbots, AI agents, voice assistants, and automation systems for businesses.

Predictive Analytics in E-commerce: How Early Machine Learning Powered 'Products You May Like'

The Data: What We Had

Item-Based Collaborative Filtering

User-Based Recommendations

The Nightly Batch Job

The A/B Test Results

Limits and What Came After

Aunimeda

Need IT development for your business?

Predictive Analytics in E-commerce: How Early Machine Learning Powered 'Products You May Like'

The Data: What We Had

Item-Based Collaborative Filtering

User-Based Recommendations

The Nightly Batch Job

The A/B Test Results

Limits and What Came After

Aunimeda

Read Also

Early NLP: Building Basic Chatbots Before the LLM Era

Computer Vision Basics: First Attempts at OCR in Mobile Apps

How to Build an AI Chatbot for Your Business in 2026

Need IT development for your business?