AboutBlogContact
AI & Machine LearningJune 18, 2013 7 min read 126Updated: June 22, 2026

Predictive Analytics in E-commerce: How Early Machine Learning Powered 'Products You May Like'

AunimedaAunimeda
📋 Table of Contents

Amazon introduced "customers who bought this also bought" in 1998. It's now the template for every e-commerce recommendation system. But in 2013, building something similar for a regional retailer without Amazon's infrastructure - without Spark, without real-time ML pipelines, without GPU compute - required a different approach.

We built it with SQL, Python, and a nightly batch job. It worked.


The Data: What We Had

The retailer had 18 months of order history. Schema:

-- Orders table
CREATE TABLE orders (
    id          INT PRIMARY KEY,
    user_id     INT NOT NULL,
    created_at  TIMESTAMP,
    status      ENUM('completed', 'cancelled', 'refunded')
);

-- Order items
CREATE TABLE order_items (
    id          INT PRIMARY KEY,
    order_id    INT REFERENCES orders(id),
    product_id  INT REFERENCES products(id),
    quantity    INT,
    price       DECIMAL(10,2)
);

-- Products
CREATE TABLE products (
    id          INT PRIMARY KEY,
    name        VARCHAR(255),
    category    VARCHAR(100),
    price       DECIMAL(10,2)
);

180,000 completed orders. 42,000 unique customers. 3,800 products. Not "big data" - this fit comfortably in MySQL.


Item-Based Collaborative Filtering

The core algorithm: find products that are frequently purchased together. If customers who bought Product A also frequently bought Product B, then showing Product B on Product A's page should increase conversions.

Step 1: Build a co-occurrence matrix

-- Find all pairs of products bought in the same order
-- This creates a "co-occurrence" count for each pair

CREATE TABLE product_cooccurrence AS
SELECT 
    a.product_id AS product_a,
    b.product_id AS product_b,
    COUNT(DISTINCT a.order_id) AS cooccurrence_count
FROM order_items a
JOIN order_items b 
    ON a.order_id = b.order_id 
    AND a.product_id < b.product_id  -- Avoid counting both (A,B) and (B,A)
JOIN orders o ON a.order_id = o.id
WHERE o.status = 'completed'
GROUP BY a.product_id, b.product_id
HAVING COUNT(DISTINCT a.order_id) >= 3  -- Minimum 3 co-occurrences
ORDER BY cooccurrence_count DESC;

This query ran for 4 minutes on the production database. First thing we moved to a read replica.

Step 2: Calculate similarity scores

Raw co-occurrence counts favor popular products - Product A might co-occur with Product B 500 times simply because Product A is in 40% of all orders. We needed a normalized score.

We used the Jaccard similarity coefficient: |A ∩ B| / |A ∪ B| - the intersection divided by the union.

import mysql.connector
from collections import defaultdict

def calculate_jaccard_similarities(db):
    """
    Jaccard similarity: cooccurrence(A,B) / (orders_with_A + orders_with_B - cooccurrence(A,B))
    """
    cursor = db.cursor(dictionary=True)
    
    # Get total order count per product
    cursor.execute("""
        SELECT product_id, COUNT(DISTINCT order_id) as order_count
        FROM order_items
        JOIN orders USING (order_id)
        WHERE orders.status = 'completed'
        GROUP BY product_id
    """)
    order_counts = {row['product_id']: row['order_count'] for row in cursor.fetchall()}
    
    # Get co-occurrences
    cursor.execute("SELECT product_a, product_b, cooccurrence_count FROM product_cooccurrence")
    
    similarities = []
    for row in cursor.fetchall():
        a = row['product_a']
        b = row['product_b']
        co = row['cooccurrence_count']
        
        count_a = order_counts.get(a, 0)
        count_b = order_counts.get(b, 0)
        
        # Jaccard: intersection / union
        union = count_a + count_b - co
        if union > 0:
            jaccard = co / union
        else:
            jaccard = 0
        
        similarities.append((a, b, jaccard, co))
        similarities.append((b, a, jaccard, co))  # Symmetric
    
    return similarities

def save_recommendations(db, similarities):
    """Save top 10 similar products per product."""
    # Group by product_a
    by_product = defaultdict(list)
    for a, b, score, co in similarities:
        by_product[a].append((b, score, co))
    
    cursor = db.cursor()
    cursor.execute("TRUNCATE TABLE product_recommendations")
    
    for product_id, related in by_product.items():
        # Sort by Jaccard score, take top 10
        top_10 = sorted(related, key=lambda x: x[1], reverse=True)[:10]
        
        for rank, (related_id, score, co) in enumerate(top_10):
            cursor.execute("""
                INSERT INTO product_recommendations 
                (product_id, recommended_product_id, similarity_score, cooccurrence_count, rank)
                VALUES (%s, %s, %s, %s, %s)
            """, (product_id, related_id, score, co, rank + 1))
    
    db.commit()

User-Based Recommendations

Item-based similarity showed "customers who bought this also bought." For logged-in users, we could also show "products similar users liked" - user-based collaborative filtering.

def get_user_recommendations(user_id, db, limit=10):
    """
    Find products purchased by similar users that this user hasn't bought.
    """
    cursor = db.cursor(dictionary=True)
    
    # Get products this user has purchased
    cursor.execute("""
        SELECT DISTINCT product_id 
        FROM order_items 
        JOIN orders USING (order_id)
        WHERE orders.user_id = %s AND orders.status = 'completed'
    """, (user_id,))
    user_products = {row['product_id'] for row in cursor.fetchall()}
    
    if not user_products:
        return []  # No purchase history - fall back to popular items
    
    # Find users with similar purchase history (Jaccard similarity)
    product_list = ','.join(str(p) for p in user_products)
    
    cursor.execute(f"""
        SELECT 
            o.user_id,
            COUNT(DISTINCT oi.product_id) as overlap,  -- Products in common
            (
                SELECT COUNT(DISTINCT product_id) 
                FROM order_items JOIN orders USING (order_id)
                WHERE orders.user_id = o.user_id AND orders.status = 'completed'
            ) as their_total
        FROM orders o
        JOIN order_items oi ON o.id = oi.order_id
        WHERE o.status = 'completed'
          AND o.user_id != {user_id}
          AND oi.product_id IN ({product_list})
        GROUP BY o.user_id
        HAVING overlap >= 2
    """)
    
    similar_users = cursor.fetchall()
    
    # Calculate Jaccard for each similar user
    scored_users = []
    for row in similar_users:
        overlap = row['overlap']
        their_total = row['their_total']
        our_total = len(user_products)
        union = our_total + their_total - overlap
        jaccard = overlap / union if union > 0 else 0
        scored_users.append((row['user_id'], jaccard))
    
    # Take top 20 most similar users
    top_users = sorted(scored_users, key=lambda x: x[1], reverse=True)[:20]
    if not top_users:
        return []
    
    user_ids = ','.join(str(u[0]) for u in top_users)
    
    # Get products those users bought that our user hasn't
    cursor.execute(f"""
        SELECT 
            oi.product_id,
            COUNT(DISTINCT o.user_id) as buyer_count,
            p.name,
            p.price
        FROM orders o
        JOIN order_items oi ON o.id = oi.order_id
        JOIN products p ON oi.product_id = p.id
        WHERE o.user_id IN ({user_ids})
          AND o.status = 'completed'
          AND oi.product_id NOT IN ({product_list})
        GROUP BY oi.product_id, p.name, p.price
        ORDER BY buyer_count DESC
        LIMIT {limit}
    """)
    
    return cursor.fetchall()

The Nightly Batch Job

We ran the full recommendation refresh every night at 2am:

#!/bin/bash
# /etc/cron.d/recommendations
# 0 2 * * * recommender /usr/bin/python3 /app/scripts/rebuild_recommendations.py

# rebuild_recommendations.py
import time
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
log = logging.getLogger('recommendations')

def main():
    log.info("Starting recommendation rebuild")
    start = time.time()
    
    db = connect_to_replica()  # Use read replica, not primary
    
    log.info("Building co-occurrence matrix...")
    build_cooccurrence_table(db)
    
    log.info("Calculating Jaccard similarities...")
    similarities = calculate_jaccard_similarities(db)
    log.info(f"Found {len(similarities)} product pairs")
    
    # Write results to primary
    write_db = connect_to_primary()
    save_recommendations(write_db, similarities)
    
    elapsed = time.time() - start
    log.info(f"Recommendation rebuild complete in {elapsed:.1f}s")

if __name__ == '__main__':
    main()

Runtime: 8 minutes on the read replica. The write to the primary took 2 minutes. Total: 10 minutes, 2am-2:10am, zero user impact.


The A/B Test Results

We ran a 30-day A/B test: 50% of product pages showed recommendations, 50% didn't.

Metric Control (no recs) Treatment (with recs)
Average order value $43.20 $51.80
Items per order 1.8 2.3
Recommendation click rate - 8.4%
Recommendation conversion rate - 14.2%

Average order value increased 20%. This was the metric the client cared about. The recommendation system paid for itself in week 2 of the A/B test.

The 8.4% click rate on recommendations was significantly higher than industry benchmarks we'd read (2-4% for email recommendations). The difference: contextual product-page recommendations vs. email blasts. Relevance in context converts far better than relevance without context.


Limits and What Came After

Our system had a cold start problem: new products had no purchase history, so no co-occurrences, so no recommendations. We handled this by using category similarity as a fallback: new products in "Electronics" were recommended alongside top-selling electronics.

New users had no purchase history for user-based recommendations. Fallback: popularity-based recommendations (products most ordered in the last 30 days), filtered by browsing history from the session.

By 2015-2016, Apache Spark and MLlib made building recommendation systems on larger datasets much easier. By 2020, cloud ML services (AWS Personalize, Google Recommendations AI) made it a configuration exercise. Our 2013 batch job was replaced by real-time inference serving personalized recommendations per page view.

But the algorithm - collaborative filtering via item co-occurrence and Jaccard similarity - is still the foundation. The infrastructure changed; the math didn't.


Aunimeda builds AI-powered solutions - chatbots, AI agents, voice assistants, and automation systems for businesses.

Contact us to discuss AI integration for your business. See also: AI Solutions, AI Agents, Chatbot Development

Read Also

Early NLP: Building Basic Chatbots Before the LLM Eraaunimeda
AI & Machine Learning

Early NLP: Building Basic Chatbots Before the LLM Era

In 2013 we built a customer support chatbot using regex patterns, decision trees, and a Naive Bayes classifier. No neural networks, no embeddings, no GPT. Here's what rule-based NLP actually looked like.

Computer Vision Basics: First Attempts at OCR in Mobile Appsaunimeda
AI & Machine Learning

Computer Vision Basics: First Attempts at OCR in Mobile Apps

In 2013 we built an expense reporting app that read receipts using Tesseract OCR on iOS. The accuracy was 68%. Here's what worked, what didn't, and what image preprocessing actually involved.

How to Build an AI Chatbot for Your Business in 2026aunimeda
AI & Machine Learning

How to Build an AI Chatbot for Your Business in 2026

AI chatbots in 2026 are not the rule-based bots of 2020. They understand context, handle complex questions, and integrate with your actual business data. Here's how to build one that works.

Need IT development for your business?

We build websites, mobile apps and AI solutions. Free consultation.

AI Solutions

Get Consultation All articles