AboutBlogContact
DatabasesDecember 5, 2012 7 min read 132Updated: June 10, 2026

The Big Data Hype (2012): How Hadoop and MongoDB Started the Data Revolution

AunimedaAunimeda
📋 Table of Contents

The Big Data Hype (2012): How Hadoop and MongoDB Started the Data Revolution

"Big Data" appeared in Gartner's Hype Cycle in 2012 at the Peak of Inflated Expectations. Every conference had at least two keynotes about it. Every consultancy was rebranding as a "Big Data" company. Every enterprise CTO was asking their teams if they needed a Hadoop cluster.

Most of them didn't. But the underlying problems that Big Data addressed were real, and the technologies that emerged from the hype - particularly MongoDB and Hadoop MapReduce - genuinely changed how data infrastructure was built.

We ran both in production. Here's what was real and what was hype.


The Problem That Hadoop Actually Solved

Before Hadoop, processing large datasets meant either processing them in a single large machine (expensive, doesn't scale infinitely) or writing custom distributed processing code that was extraordinarily complex to build and maintain.

Google published the MapReduce paper in 2004. Doug Cutting and Mike Cafarella at Yahoo! implemented it as open-source Apache Hadoop in 2006. The idea: split large datasets across many commodity machines (cheap), run processing in parallel on each node (fast), aggregate results (complete).

The canonical MapReduce example - word count - was the hello world of distributed computing:

// Hadoop MapReduce - Word Count
// Map phase: split text into (word, 1) pairs
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final IntWritable ONE = new IntWritable(1);
    private Text word = new Text();
    
    @Override
    public void map(LongWritable key, Text value, Context context) 
            throws IOException, InterruptedException {
        // 'value' is one line of the input file
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken().toLowerCase());
            context.write(word, ONE);  // Emit (word, 1) for each word
        }
    }
}

// Reduce phase: sum all counts for each word
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

// Running the job
public class WordCount extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }
}

This code counted words across a 100GB text file distributed across 20 machines. Each machine processed its portion; the reduce phase merged the counts. SQL couldn't do this - not in 2012, before distributed SQL existed. A single MySQL server would either run out of memory or take hours.

Where Hadoop was genuinely useful:

  • Log processing: analyzing 500GB of web server logs to find patterns
  • Batch ETL: transforming raw data into aggregated reports overnight
  • Search index building: Google's original use case
  • Machine learning on large datasets: training on data that didn't fit in RAM

Where Hadoop was overkill:

  • Any dataset under 10GB (a single MySQL server handles this fine)
  • Any workload that needed low-latency responses (MapReduce jobs took minutes to hours)
  • Most "Big Data" projects at companies that weren't actually large

The cluster setup alone took weeks and required dedicated ops expertise. HDFS (Hadoop Distributed File System), YARN job scheduling, MapReduce framework tuning - each was a distinct discipline. Most companies using Hadoop in 2012-2014 had 500MB of data and would have been better served by PostgreSQL and a competent DBA.


MongoDB: The Document Database

MongoDB 1.0 shipped in 2009. By 2012 it was the most popular NoSQL database, widely positioned as a MySQL replacement.

The appeal:

// MongoDB - No schema, no migrations, document model
db.products.insertOne({
  name: "Wireless Headphones",
  price: 89.99,
  variants: [
    { color: "black", sku: "WH-001-BLK", stock: 42 },
    { color: "white", sku: "WH-001-WHT", stock: 17 }
  ],
  specifications: {
    batteryLife: "20 hours",
    bluetooth: "5.0",
    noiseCancelling: true
  },
  tags: ["electronics", "audio", "bluetooth"]
});

// Query - no joins required for nested data
db.products.find({
  "variants.color": "black",
  "specifications.noiseCancelling": true
});

// Aggregation pipeline
db.products.aggregate([
  { $match: { price: { $gt: 50 } } },
  { $group: { _id: null, avgPrice: { $avg: "$price" }, count: { $sum: 1 } } }
]);

Compare to MySQL for the same product catalog:

-- MySQL: normalized schema requires multiple tables and joins
CREATE TABLE products (
    id INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(255),
    price DECIMAL(10,2)
);

CREATE TABLE product_variants (
    id INT PRIMARY KEY AUTO_INCREMENT,
    product_id INT REFERENCES products(id),
    color VARCHAR(50),
    sku VARCHAR(100),
    stock INT
);

CREATE TABLE product_specifications (
    id INT PRIMARY KEY AUTO_INCREMENT,
    product_id INT REFERENCES products(id),
    spec_key VARCHAR(100),
    spec_value VARCHAR(255)
);

-- Retrieving the same data requires a JOIN
SELECT p.*, v.color, v.sku, v.stock, s.spec_key, s.spec_value
FROM products p
LEFT JOIN product_variants v ON v.product_id = p.id
LEFT JOIN product_specifications s ON s.product_id = p.id
WHERE p.id = 1;

For product catalogs, content management, and user profiles - hierarchical data with variable schemas - MongoDB was genuinely better than MySQL. Schema changes that required ALTER TABLE in MySQL (potentially locking a large table for minutes) were non-events in MongoDB.


Where MongoDB Failed in 2012

MongoDB 2.x had serious limitations that the marketing didn't emphasize:

No transactions. Multi-document atomic operations didn't exist until MongoDB 4.0 (2018). In 2012, an order that updated inventory across multiple documents could partially succeed. If the server crashed between the order insert and the inventory decrement, you had a phantom order with full inventory. We handled this with application-level two-phase commits - complex, error-prone, and exactly what the database should have handled.

// The painful 2012 MongoDB "transaction" workaround
function createOrder(userId, productId, quantity, callback) {
    // Step 1: Reserve inventory (atomic)
    db.products.findAndModify({
        query: { _id: productId, stock: { $gte: quantity } },
        update: { $inc: { stock: -quantity } },
        new: true
    }, function(err, product) {
        if (!product) return callback(new Error('Out of stock'));
        
        // Step 2: Create order
        db.orders.insertOne({
            userId: userId,
            productId: productId,
            quantity: quantity,
            status: 'pending'
        }, function(err, order) {
            if (err) {
                // PROBLEM: Need to undo the inventory change
                // What if THIS rollback fails?
                db.products.updateOne(
                    { _id: productId },
                    { $inc: { stock: quantity } },
                    function() { callback(err); }
                );
                return;
            }
            callback(null, order);
        });
    });
}

Writes were unsafe by default. MongoDB's default write concern in 2.x was "fire and forget" - writes were acknowledged before being flushed to disk. A server crash could lose the last second of writes. Production deployments required explicitly setting {w: 1, j: true} (write acknowledged AND journaled) on every write operation, which the documentation didn't make obvious.

Memory usage. MongoDB mapped its data files into memory and relied on the OS to manage paging. On a machine with 8GB RAM and a 50GB database, MongoDB would try to use all available RAM for its memory map and regularly evict pages. Performance was inconsistent in ways that were hard to tune.


The Real Lesson: Use the Right Tool

By 2014-2015 we'd arrived at a pragmatic position:

  • PostgreSQL for transactional data: orders, payments, user accounts, anything where consistency mattered
  • MongoDB for catalog/content data: products, articles, user profiles with variable schemas
  • Hadoop/Hive for batch analytics on genuinely large datasets (>50GB)
  • Redis for caching and session storage

The hype of 2012 pushed many teams to replace MySQL with MongoDB entirely. Most of them partially migrated back by 2016. The document model was genuinely better for some data; relational was genuinely better for other data. The answer was never "NoSQL replaces SQL." It was "different data models for different problem shapes."

The Big Data hype was mostly inflated. But it forced the industry to seriously think about data infrastructure beyond the single-relational-database model for the first time. That thinking produced better tools, better architectures, and eventually the data engineering discipline as a profession.


Aunimeda builds backend systems with optimized database architectures - PostgreSQL, Redis, ClickHouse, and more.

Contact us for backend and database engineering. See also: Custom Software Development

Read Also

How to Scale MySQL with Read Replicas When Your App Slows Down (2015)aunimeda
Databases

How to Scale MySQL with Read Replicas When Your App Slows Down (2015)

When a single MySQL server handles both reads and writes, reads win and writes stall. Adding a read replica splits the load: writes go to master, reads go to replica. This is the exact setup - my.cnf, replication config, PHP connection routing - we used when our app hit 50k daily users.

MySQL Optimization: How We Handled 100,000 Daily Queries on PHP 5.3aunimeda
Databases

MySQL Optimization: How We Handled 100,000 Daily Queries on PHP 5.3

Case study of migrating from MyISAM to InnoDB and introducing Memcached for heavy SQL query caching in a high-load portal. Before/after performance benchmarks.

PostgreSQL EXPLAIN ANALYZE: Reading Query Plans Like a Senior DBAaunimeda
Databases

PostgreSQL EXPLAIN ANALYZE: Reading Query Plans Like a Senior DBA

Stop guessing why your queries are slow. Learn to read PostgreSQL query plans at a level where you can actually fix problems - seq scans, join strategies, row estimate disasters, and the N+1 you didn't know was hiding in your ORM output.

Need IT development for your business?

We build websites, mobile apps and AI solutions. Free consultation.

Get Consultation All articles