HBase: Anatomy of a Region Split
In HBase, data is partitioned into "Regions." As you pour gigabytes of data into a table, a single region will eventually become too large for one server to handle. This is when the magic of the Region Split happens.
The Split Trigger
By default, a split is triggered when a StoreFile in a region grows beyond hbase.hregion.max.filesize (256 MB, that is 268435456 bytes, in these early versions).
<!-- hbase-site.xml -->
<property>
<name>hbase.hregion.max.filesize</name>
<value>268435456</value>
</property>
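The trigger itself is a simple size comparison, which can be sketched in plain Java. This is a minimal illustration under stated assumptions, not HBase source code: the class `SplitTrigger`, the method `shouldSplit`, and the sample file sizes are hypothetical. Only the threshold value comes from the configuration above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the split trigger; not taken from HBase itself.
final class SplitTrigger {
    // Same value as hbase.hregion.max.filesize in the config above (256 MB).
    static final long MAX_FILE_SIZE = 268_435_456L;

    // Returns true when at least one store file is strictly larger than the threshold.
    static boolean shouldSplit(List<Long> storeFileSizes, long maxFileSize) {
        for (long size : storeFileSizes) {
            if (size > maxFileSize) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Long> small = Arrays.asList(64L << 20, 128L << 20); // 64 MB and 128 MB
        List<Long> large = Arrays.asList(64L << 20, 300L << 20); // 64 MB and 300 MB
        System.out.println(shouldSplit(small, MAX_FILE_SIZE)); // false
        System.out.println(shouldSplit(large, MAX_FILE_SIZE)); // true
    }
}
```

A file that is exactly at the threshold does not trigger a split in this sketch, since the text says the file must exceed the limit.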
The Split Process
- Transaction Start: The RegionServer creates a split znode in ZooKeeper to notify the Master.
- Offline: The parent region is taken offline. It stops accepting new requests.
- Daughter Creation: Two new daughter regions are created. Instead of copying all the data (which would be slow), HBase creates "Reference files."
- Reference Files: These are tiny files that point to the top or bottom half of the original parent HFiles.
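// Conceptual logic for a reference file check
if (isReference(path)) {
    // Load the small reference file that points at the parent HFile
    Reference r = Reference.read(fs, path);
    // The split point is a row key (a byte array), not a numeric offset
    byte[] splitKey = r.getSplitKey();
    // Only read the half of the parent HFile that lies on this daughter's side of splitKey
}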
- Online: The daughter regions are opened and registered with the .META. table.
- Compaction: Eventually, a "Major Compaction" will run, which actually rewrites the data into new HFiles for the daughter regions, deleting the old parent file.
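The way a daughter region reads "half" of a parent file through a reference can be modeled in a few lines of Java. This is a simplified in-memory sketch, not the HBase `Reference` class: the names `HalfFileView`, `Half`, and `scan` are hypothetical, and row keys are plain ASCII strings here, whereas HBase compares raw bytes. It assumes the split key belongs to the upper daughter, since a region's start key is inclusive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a reference file: a pointer to a parent file plus a split key.
final class HalfFileView {
    // BOTTOM serves rows below the split key, TOP serves rows at or above it.
    enum Half { BOTTOM, TOP }

    private final List<String> parentRows; // sorted row keys stored in the parent HFile
    private final String splitKey;
    private final Half half;

    HalfFileView(List<String> parentRows, String splitKey, Half half) {
        this.parentRows = parentRows;
        this.splitKey = splitKey;
        this.half = half;
    }

    // Returns only the parent rows on this daughter's side of the split key.
    // No data is copied into a new file; the parent rows are filtered at read time.
    List<String> scan() {
        List<String> visible = new ArrayList<>();
        for (String row : parentRows) {
            boolean inTop = row.compareTo(splitKey) >= 0;
            if ((half == Half.TOP) == inTop) {
                visible.add(row);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<String> parent = Arrays.asList("apple", "banana", "mango", "peach");
        System.out.println(new HalfFileView(parent, "mango", Half.BOTTOM).scan()); // [apple, banana]
        System.out.println(new HalfFileView(parent, "mango", Half.TOP).scan());    // [mango, peach]
    }
}
```

Both views share the same parent data, which is why the parent file has to stay in place until compaction has rewritten each daughter's half into its own HFiles.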
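This "constant-time" split is a large part of why HBase can scale to petabytes. The split itself takes seconds, largely regardless of how much data is in the region, because it only writes small reference files and updates metadata. The data itself is rewritten later, during compaction.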
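A rough back-of-the-envelope comparison shows why the region's size barely matters. The sketch below is illustrative only: the class `SplitCost`, the figure of 100 bytes per reference file, and the region sizes are assumptions chosen for the example, not measured HBase numbers.

```java
// Hypothetical cost model comparing a reference-based split with a copying split.
final class SplitCost {
    // A reference-based split writes two small files per parent HFile (one per daughter).
    static long referenceBytes(int parentHFiles, long bytesPerReference) {
        return 2L * parentHFiles * bytesPerReference;
    }

    // A copying split would rewrite every byte of the parent region once.
    static long copyBytes(long regionBytes) {
        return regionBytes;
    }

    public static void main(String[] args) {
        long oneGb = 1L << 30;
        long tenGb = 10L * oneGb;
        // Assumed: 4 parent HFiles and about 100 bytes per reference file.
        System.out.println(referenceBytes(4, 100)); // 800, whether the region is 1 GB or 10 GB
        System.out.println(copyBytes(oneGb));       // 1073741824
        System.out.println(copyBytes(tenGb));       // 10737418240
    }
}
```

In this model the work done at split time depends on the number of parent HFiles, not on how many bytes they hold.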