Amazon introduced "customers who bought this also bought" in 1998. It's now the template for every e-commerce recommendation system. But in 2013, building something similar for a regional retailer without Amazon's infrastructure - without Spark, without real-time ML pipelines, without GPU compute - required a different approach.
We built it with SQL, Python, and a nightly batch job. It worked.
The Data: What We Had
The retailer had 18 months of order history. Schema:
-- Orders table
CREATE TABLE orders (
id INT PRIMARY KEY,
user_id INT NOT NULL,
created_at TIMESTAMP,
status ENUM('completed', 'cancelled', 'refunded')
);
-- Order items
CREATE TABLE order_items (
id INT PRIMARY KEY,
order_id INT REFERENCES orders(id),
product_id INT REFERENCES products(id),
quantity INT,
price DECIMAL(10,2)
);
-- Products
CREATE TABLE products (
id INT PRIMARY KEY,
name VARCHAR(255),
category VARCHAR(100),
price DECIMAL(10,2)
);
180,000 completed orders. 42,000 unique customers. 3,800 products. Not "big data" - this fit comfortably in MySQL.
Item-Based Collaborative Filtering
The core algorithm: find products that are frequently purchased together. If customers who bought Product A also frequently bought Product B, then showing Product B on Product A's page should increase conversions.
Step 1: Build a co-occurrence matrix
-- Find all pairs of products bought in the same order
-- This creates a "co-occurrence" count for each pair
CREATE TABLE product_cooccurrence AS
SELECT
a.product_id AS product_a,
b.product_id AS product_b,
COUNT(DISTINCT a.order_id) AS cooccurrence_count
FROM order_items a
JOIN order_items b
ON a.order_id = b.order_id
AND a.product_id < b.product_id -- Avoid counting both (A,B) and (B,A)
JOIN orders o ON a.order_id = o.id
WHERE o.status = 'completed'
GROUP BY a.product_id, b.product_id
HAVING COUNT(DISTINCT a.order_id) >= 3 -- Minimum 3 co-occurrences
ORDER BY cooccurrence_count DESC;
This query ran for 4 minutes on the production database. First thing we moved to a read replica.
Step 2: Calculate similarity scores
Raw co-occurrence counts favor popular products - Product A might co-occur with Product B 500 times simply because Product A is in 40% of all orders. We needed a normalized score.
We used the Jaccard similarity coefficient: |A ∩ B| / |A ∪ B| - the intersection divided by the union.
import mysql.connector
from collections import defaultdict
def calculate_jaccard_similarities(db):
"""
Jaccard similarity: cooccurrence(A,B) / (orders_with_A + orders_with_B - cooccurrence(A,B))
"""
cursor = db.cursor(dictionary=True)
# Get total order count per product
cursor.execute("""
SELECT product_id, COUNT(DISTINCT order_id) as order_count
FROM order_items
JOIN orders USING (order_id)
WHERE orders.status = 'completed'
GROUP BY product_id
""")
order_counts = {row['product_id']: row['order_count'] for row in cursor.fetchall()}
# Get co-occurrences
cursor.execute("SELECT product_a, product_b, cooccurrence_count FROM product_cooccurrence")
similarities = []
for row in cursor.fetchall():
a = row['product_a']
b = row['product_b']
co = row['cooccurrence_count']
count_a = order_counts.get(a, 0)
count_b = order_counts.get(b, 0)
# Jaccard: intersection / union
union = count_a + count_b - co
if union > 0:
jaccard = co / union
else:
jaccard = 0
similarities.append((a, b, jaccard, co))
similarities.append((b, a, jaccard, co)) # Symmetric
return similarities
def save_recommendations(db, similarities):
"""Save top 10 similar products per product."""
# Group by product_a
by_product = defaultdict(list)
for a, b, score, co in similarities:
by_product[a].append((b, score, co))
cursor = db.cursor()
cursor.execute("TRUNCATE TABLE product_recommendations")
for product_id, related in by_product.items():
# Sort by Jaccard score, take top 10
top_10 = sorted(related, key=lambda x: x[1], reverse=True)[:10]
for rank, (related_id, score, co) in enumerate(top_10):
cursor.execute("""
INSERT INTO product_recommendations
(product_id, recommended_product_id, similarity_score, cooccurrence_count, rank)
VALUES (%s, %s, %s, %s, %s)
""", (product_id, related_id, score, co, rank + 1))
db.commit()
User-Based Recommendations
Item-based similarity showed "customers who bought this also bought." For logged-in users, we could also show "products similar users liked" - user-based collaborative filtering.
def get_user_recommendations(user_id, db, limit=10):
"""
Find products purchased by similar users that this user hasn't bought.
"""
cursor = db.cursor(dictionary=True)
# Get products this user has purchased
cursor.execute("""
SELECT DISTINCT product_id
FROM order_items
JOIN orders USING (order_id)
WHERE orders.user_id = %s AND orders.status = 'completed'
""", (user_id,))
user_products = {row['product_id'] for row in cursor.fetchall()}
if not user_products:
return [] # No purchase history - fall back to popular items
# Find users with similar purchase history (Jaccard similarity)
product_list = ','.join(str(p) for p in user_products)
cursor.execute(f"""
SELECT
o.user_id,
COUNT(DISTINCT oi.product_id) as overlap, -- Products in common
(
SELECT COUNT(DISTINCT product_id)
FROM order_items JOIN orders USING (order_id)
WHERE orders.user_id = o.user_id AND orders.status = 'completed'
) as their_total
FROM orders o
JOIN order_items oi ON o.id = oi.order_id
WHERE o.status = 'completed'
AND o.user_id != {user_id}
AND oi.product_id IN ({product_list})
GROUP BY o.user_id
HAVING overlap >= 2
""")
similar_users = cursor.fetchall()
# Calculate Jaccard for each similar user
scored_users = []
for row in similar_users:
overlap = row['overlap']
their_total = row['their_total']
our_total = len(user_products)
union = our_total + their_total - overlap
jaccard = overlap / union if union > 0 else 0
scored_users.append((row['user_id'], jaccard))
# Take top 20 most similar users
top_users = sorted(scored_users, key=lambda x: x[1], reverse=True)[:20]
if not top_users:
return []
user_ids = ','.join(str(u[0]) for u in top_users)
# Get products those users bought that our user hasn't
cursor.execute(f"""
SELECT
oi.product_id,
COUNT(DISTINCT o.user_id) as buyer_count,
p.name,
p.price
FROM orders o
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
WHERE o.user_id IN ({user_ids})
AND o.status = 'completed'
AND oi.product_id NOT IN ({product_list})
GROUP BY oi.product_id, p.name, p.price
ORDER BY buyer_count DESC
LIMIT {limit}
""")
return cursor.fetchall()
The Nightly Batch Job
We ran the full recommendation refresh every night at 2am:
#!/bin/bash
# /etc/cron.d/recommendations
# 0 2 * * * recommender /usr/bin/python3 /app/scripts/rebuild_recommendations.py
# rebuild_recommendations.py
import time
import logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
log = logging.getLogger('recommendations')
def main():
log.info("Starting recommendation rebuild")
start = time.time()
db = connect_to_replica() # Use read replica, not primary
log.info("Building co-occurrence matrix...")
build_cooccurrence_table(db)
log.info("Calculating Jaccard similarities...")
similarities = calculate_jaccard_similarities(db)
log.info(f"Found {len(similarities)} product pairs")
# Write results to primary
write_db = connect_to_primary()
save_recommendations(write_db, similarities)
elapsed = time.time() - start
log.info(f"Recommendation rebuild complete in {elapsed:.1f}s")
if __name__ == '__main__':
main()
Runtime: 8 minutes on the read replica. The write to the primary took 2 minutes. Total: 10 minutes, 2am-2:10am, zero user impact.
The A/B Test Results
We ran a 30-day A/B test: 50% of product pages showed recommendations, 50% didn't.
| Metric | Control (no recs) | Treatment (with recs) |
|---|---|---|
| Average order value | $43.20 | $51.80 |
| Items per order | 1.8 | 2.3 |
| Recommendation click rate | - | 8.4% |
| Recommendation conversion rate | - | 14.2% |
Average order value increased 20%. This was the metric the client cared about. The recommendation system paid for itself in week 2 of the A/B test.
The 8.4% click rate on recommendations was significantly higher than industry benchmarks we'd read (2-4% for email recommendations). The difference: contextual product-page recommendations vs. email blasts. Relevance in context converts far better than relevance without context.
Limits and What Came After
Our system had a cold start problem: new products had no purchase history, so no co-occurrences, so no recommendations. We handled this by using category similarity as a fallback: new products in "Electronics" were recommended alongside top-selling electronics.
New users had no purchase history for user-based recommendations. Fallback: popularity-based recommendations (products most ordered in the last 30 days), filtered by browsing history from the session.
By 2015-2016, Apache Spark and MLlib made building recommendation systems on larger datasets much easier. By 2020, cloud ML services (AWS Personalize, Google Recommendations AI) made it a configuration exercise. Our 2013 batch job was replaced by real-time inference serving personalized recommendations per page view.
But the algorithm - collaborative filtering via item co-occurrence and Jaccard similarity - is still the foundation. The infrastructure changed; the math didn't.
Aunimeda builds AI-powered solutions - chatbots, AI agents, voice assistants, and automation systems for businesses.
Contact us to discuss AI integration for your business. See also: AI Solutions, AI Agents, Chatbot Development