A short write-up on HBase and its features. For a longer version, see the open-source tutorial.
Like BigTable and many relational databases, HBase follows a three-level hierarchy.
HBase Data Hierarchy
|Root region || Covers an entire installation
|Meta region || Covers an entire database
|User region || Covers a partition of a table
- Each catalog (Root and Meta) is stored in the same format as the regular tables.
- The common data structure is called a REGION.
- This makes it easy to store and access the metadata the same way.
- Each region is assigned to a single Region Server, which handles both the scan and the update load for that region.
- Of course, a secondary Region Server will take over when the primary Region Server goes down.
- Because each region is served by exactly one machine, there is no data contention across machines.
Finding the Region
To find the Region Server at which a particular key may be present, the client follows these steps
- From Zookeeper, find the Server that contains the Root region
- From the Root Server, find out the Server that contains the Meta region
- From the Meta Server, find the Server that contains the Key Region
- From the Key Server, retrieve the value of the key
Note that the Root Region Server, the Meta Region Server, and the Key Region Server may all be cached at the client, so in many cases the client contacts the Key Region Server directly.
The two points below make searching within a Key Region efficient
- The records in the file are kept sorted, and hence support both sub-range queries and random queries efficiently.
- The indexes are flat and keyed only on the row key. This means at most two disk blocks are accessed before the data is read.
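The lookup over sorted, flat-indexed regions can be sketched with a binary search over region start keys. All names and boundaries below are hypothetical; each region covers the key range from its start key up to the next region's start key.

```python
import bisect

# Sorted start keys of 4 hypothetical regions, and the server hosting each.
region_starts = ["", "f", "m", "t"]
region_servers = ["rs1", "rs2", "rs3", "rs4"]

def find_region_server(key):
    # bisect_right gives the insertion point; the region to the left of
    # that point is the one whose range may contain the key.
    i = bisect.bisect_right(region_starts, key) - 1
    return region_servers[i]
```

Because the boundaries are sorted, the same structure serves both random lookups and sub-range scans (start at the region containing the range's low key and walk forward).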
Finding the Value
Once a Region is located, the following steps help search for the value
- The store file contains an index at the end of the file, which records exactly which block stores the data.
- Within a block, binary search can be used to check whether the data exists.
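A minimal sketch of this two-step lookup, with a toy in-memory block layout standing in for the store file and its trailing index:

```python
import bisect

# Hypothetical store file: sorted blocks of keys, plus a flat index
# holding the first key of each block (as a trailing index would).
blocks = [
    ["ant", "bee", "cat"],
    ["dog", "eel", "fox"],
    ["gnu", "hen"],
]
index = [b[0] for b in blocks]

def lookup(key):
    # Step 1: the index tells us exactly which block could hold the key.
    i = bisect.bisect_right(index, key) - 1
    if i < 0:
        return None
    block = blocks[i]                   # one block read in a real store file
    # Step 2: binary search within the block verifies existence.
    j = bisect.bisect_left(block, key)
    return key if j < len(block) and block[j] == key else None
```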
Immediate Full Consistency
- The database follows “Immediate Consistency” on all of its writes.
- It also provides ACID semantics, though only at the level of a single row
- This means after a Write is completed, any subsequent Read will read the latest value.
- For each region, there are Transaction Logs maintained on the underlying file system.
- This ensures recovery in case the rest of the write process fails.
- The database is designed, not as a B-Tree, but as a "Log-Structured Merge Tree" (LSM tree)
- A "Log-Structured Merge Tree" works as follows
Log-Structured Merge Technique
- Write the transaction log, so that the “D” portion of ACID is satisfied
- Next change the memory structures. At this point, the transaction is considered done.
- Periodically, we run a Merge pass, which
- Does a sequential scan of the entire sorted Region on disk,
- Merges in the in-memory changes,
- Writes back a new version of the Region to disk,
- Updates the Region Server to use this new Region file instead of the old one.
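The merge technique above can be sketched with a toy in-memory model; `log`, `memtable`, and `disk_run` are illustrative stand-ins for the real on-disk structures:

```python
log = []        # stands in for the on-disk transaction log (the "D" in ACID)
memtable = {}   # in-memory changes since the last merge pass
disk_run = []   # sorted (key, value) pairs, stands in for the on-disk region

def put(key, value):
    log.append((key, value))   # 1. write the transaction log first (durability)
    memtable[key] = value      # 2. change the memory structure; write is now "done"

def merge_pass():
    global disk_run, memtable, log
    merged = dict(disk_run)            # 3. sequential scan of the old sorted region
    merged.update(memtable)            # 4. merge in the memory changes
    disk_run = sorted(merged.items())  # 5. write back a new version of the region
    memtable, log = {}, []             # 6. switch to the new region, drop the old

put("row-b", 1)
put("row-a", 2)
put("row-b", 3)   # overwrites the earlier value for row-b
merge_pass()
```

Note that the merge is one long sequential pass, which is exactly why this layout beats scattered B-Tree block updates under heavy write load.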
Why this technique works better
The scenario we have is one of "High Workloads" and a "Large dataset". In this situation, sequential scans and writes perform better than B-Tree writes
- B-Trees modify different sparse blocks, and the Disk seek time is wasted. At higher workloads, the disk arm may not be able to keep up with the pace of “Disk requests”.
- Even if we batch the Writes, because the “Data set” is vast, the disk blocks written are still sparse.
Since the Cell values are sparse, the cells are stored as tuples
|<row-key> <col-key> <value>
Enhancing this to include column families, column names, and timestamps, the key-value is therefore represented as
|<row-key> <column-family>:<column-qualifier> <timestamp> <value>
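The sparse-cell representation can be sketched as a map keyed by (row, column, timestamp) tuples; only cells that exist are stored, and reads pick the highest timestamp. All names here are illustrative:

```python
# Sparse cell store: one entry per stored cell, nothing for empty cells.
cells = {}

def put_cell(row, column, ts, value):
    cells[(row, column, ts)] = value

def get_latest(row, column):
    # Collect all timestamped versions of this row/column and return the
    # value carrying the highest timestamp.
    versions = [(ts, v) for (r, c, ts), v in cells.items() if r == row and c == column]
    return max(versions)[1] if versions else None
```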
Efficient Representation of Data
As a result of the above representation, the keys can get very long
- One needs to be conscious of the size of the key, and keep it short
- If there is only one column, part of the data may be used as part of the key.
HBase requires good key design if the load needs to be distributed across multiple machines
- If the keys are timestamps or sequences, consecutive writes all land in the same region, which becomes a hot-spot
- To avoid this, the keys may be hashed.
- Alternatively, the timestamp should not be the leading part of the key.
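One common way to hash such keys is salting: prefixing each key with a short hash of itself, which spreads consecutive timestamps across regions. A sketch (the bucket count of 16 is an arbitrary choice):

```python
import hashlib

def salted_key(key, buckets=16):
    # Derive a stable salt from the key itself, so reads can reconstruct
    # the salted key, then prepend it to break up sequential runs.
    salt = int(hashlib.md5(key.encode()).hexdigest(), 16) % buckets
    return "%02d-%s" % (salt, key)
```

The trade-off is that range scans over the original key order now require one scan per salt bucket.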
What is Zookeeper
Zookeeper is a “Name Server” in the Hadoop suite of products, with the following characteristics.
- Names form a hierarchical name-space
- It has functionality to create, read, update, and delete names
- It has functionality to send updates to registered listeners on different machines in the same order in which it received them.
The last two features enable it to be used for co-ordination and synchronization.
Simple Use Case
External Monitoring Service
Suppose we wanted to use an external monitoring service like Munin
- The external program will register a “Zookeeper Watch” to be informed whenever there is a change in the tree location.
- The existing services, such as Apache, may register a node.
For example, a node such as /services/www/www32.mydomain.com:80 represents an Apache server at www32.mydomain.com on port 80.
- The Munin system can periodically get a dump of all services under /services/www and load them into its configuration file, munin.conf
- Once this is done, the particular WebServer can be monitored by the Monitoring System.
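The registration-and-watch pattern can be sketched with a toy in-memory name server; this is not the real ZooKeeper client API, just an illustration of the idea:

```python
tree = {}      # path -> value, stands in for the znode tree
watches = {}   # path prefix -> list of callbacks registered on that subtree
events = []    # notifications delivered to watchers, in arrival order

def watch(prefix, callback):
    # Register interest in any change under the given subtree.
    watches.setdefault(prefix, []).append(callback)

def register(path, value):
    # Create/update a node, then notify every watcher whose prefix matches.
    tree[path] = value
    for prefix, callbacks in watches.items():
        if path.startswith(prefix):
            for cb in callbacks:
                cb(path, value)

# The monitoring system watches /services/www; a web server registers itself.
watch("/services/www", lambda path, value: events.append((path, value)))
register("/services/www/www32.mydomain.com:80", "apache")
```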
Why not use a Database
Zookeeper is superior to a database for this purpose, because of the ordering guarantees it makes
- The watch is ordered with respect to other events, other watches, and other asynchronous replies. The events are all propagated by the client library in the right order.
- The client will see the node creation event before it sees the value for the node
- The order in which events are seen by the client is the same as the order in which these are being seen by the Zookeeper service.
Usage in Hadoop
Managing Configuration Changes
- When there are hundreds or thousands of nodes in a cluster, it becomes difficult to push configuration changes to the machines.
- Zookeeper enables the configuration changes to be pushed.
Implementing Reliable Messaging
- With Zookeeper, we can implement reliable producer-consumer queues
- even if a few consumers and some Zookeeper servers fail.
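One classic queue recipe uses sequentially numbered nodes under a queue path: producers append, and each consumer claims and deletes the lowest-numbered node. A sketch over an in-memory stand-in for the znode store (not the real ZooKeeper API):

```python
import itertools

queue = {}               # path -> payload, stands in for sequential znodes
seq = itertools.count()  # stands in for ZooKeeper's sequence numbering

def produce(payload):
    # Create a node with a monotonically increasing, zero-padded suffix.
    queue["/queue/msg-%010d" % next(seq)] = payload

def consume():
    if not queue:
        return None
    path = min(queue)        # lowest sequence number = oldest message
    return queue.pop(path)   # in real ZooKeeper the delete succeeds for
                             # exactly one consumer, giving at-most-once claim
```

Because the messages live in the (replicated) store rather than in any consumer's memory, they survive consumer failures; ZooKeeper's own replication covers server failures.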
Implement Redundant Services
- Several identical nodes may provide a service.
- One of these may elect itself as the leader (using a leader election algorithm), and may start providing the service.
Synchronize Process Execution
- Multiple nodes can coordinate the start and end of a process or calculation.
- This ensures that any follow-up processing is done only after all nodes have finished their calculations.
Usage in a Data-Center
Complex Ad-Serving environment
Zookeeper is also useful in a complex Data-Center environment
- Let us consider the case of a Complex Ad-Serving system. It consists of several components
- Database for Campaign data and Fiscal transactions
- Ad-serving engines for serving the best Advertisements for the customers
- Campaign planners for advertisers to run campaigns and simulations
- Log collection engines for Data Warehousing, and data planning.
- Data analytics and modeling systems
- Fraud detection systems
- Beacons and fault management systems
- Failover servers
One of the most important uses of Zookeeper in these cases is as a “Bootstrap Server”.
- It records how to contact the "Services", even when not all of the services are running
- It can store the primary and secondary configurations.
Distributed Service Locator
The Distributed Service Locator provides a way for services to locate and access other services
- Services may come up, and use “Leader Election” to decide the configurations.
- They can store their status, which can then be queried.
Distributed System State
Zookeeper is usually used to maintain top-level system states, so that
- An up-to-date directory of which machine is running which service may be maintained
- This directory may be used by the Monitoring Software to decide which machines should be monitored and how.
To make configuration changes, and push them to the servers that use them,
- Zookeeper can pro-actively push the configuration, if a new configuration is created
- It can also be used to push software onto each of the clusters.
Partition – Consistent Hashing
- Partitioning is done through a mechanism called Consistent Hashing
- For the initial partition, if there are M machines, and N is a large number – say 1000
- We generate M * N static partitions on the entire key-space
- This can be done by using a mod M*N mechanism on the key-space
Large number of partitions
|Let k1, k2, … , kn denote the sorted numbers
|The partitions are 0…k1, k1…k2, … , kn-1…kn
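The static partitioning step can be sketched as follows; the values of M and N and the choice of hash are illustrative:

```python
import hashlib

# M machines, N partitions per machine: M*N fixed partitions over the
# whole key-space, assigned by hashing the key mod M*N.
M, N = 4, 1000
num_partitions = M * N

def partition_of(key):
    # Hash rather than raw key value, so the partitions stay balanced
    # even when keys are skewed.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_partitions
```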
Mapping of partitions to machines
- The mapping of partitions to machines follows a novel scheme called “Consistent Hashing”.
- Earlier hash tables decided placement with a simple formula, such as a mod or div of the key
- With a simple mod formula, whenever a machine was added or removed, most keys had to be moved
- Even with a "div" formula for the hash function, on average half the keys needed to be moved
- “Consistent Hash’ function avoids this by
- Using a dynamic table of M*N entries to look up the machine on which partition i is placed.
- Whenever a machine goes away, all of its partitions are evenly distributed to the other machines.
- Whenever a machine gets added, the partitions that get assigned to the machine may be uniformly borrowed from the other machines.
- So, the key concept here is
|Machine(k) → Table-Lookup(k), where k is a key
|Adjust the table and move keys when machines are added or removed
- Note that whenever the “Base Configuration” of a ring is changed, data has to be moved.
- However, the amount of data that is moved is kept to a minimum
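A sketch of the table-lookup scheme, showing that removing a machine moves only that machine's share of the partitions rather than reshuffling the whole key-space:

```python
def build_table(machines, num_partitions):
    # Initial even assignment of partitions to machines.
    return {p: machines[p % len(machines)] for p in range(num_partitions)}

def remove_machine(table, machine, remaining):
    # Hand the departed machine's partitions evenly to the survivors;
    # every other partition keeps its owner, so no other data moves.
    moved = 0
    for p, owner in table.items():
        if owner == machine:
            table[p] = remaining[moved % len(remaining)]
            moved += 1
    return moved
```

With 4 machines and 4000 partitions, removing one machine moves exactly its 1000 partitions; a mod-based scheme would have remapped nearly everything.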
- Partitioning of the key-store is meant for keeping the load on the system distributed.
- Replication is meant for dealing with short-term failures of hardware and software
- Each item that is stored on machine Mi is also stored on N of its ring successors.
- The nodes are called standby nodes.
- N is called the Replication Factor
- In order to control write-latency,
- Data is written to the home node – Mi
- The system responds to the user, that it has stored the item.
- Replication is done offline, in a queued manner
- If a data node goes down after writing to its Transaction Log, but before replicating
- We assume that data recovery procedures will take care of restoring the data
- In the context of “Eventual Consistency”, this can be considered a “Delayed Update”.
- Under normal operations (under normal assumptions of serializability),
- Replication Factor is N – The number of nodes each item needs to write to
- Read Quorum is R – The minimum number of replicas from which a Read must be served
- Write Quorum is W – The minimum number of replicas to which data must be written
- For every Read to see at least one copy with the latest Write,
- R + W > N
- Usual numbers are R = W = 2, and N = 3
- Otherwise transient communication failures may cause data inconsistencies
- If the write cannot be made to W of the original replicas,
- We write the remaining copies to other nodes chosen at random, along with a hint ("hinted handoff").
- These copies act as the actual quorum members until the original nodes come back.
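The quorum condition is simple pigeonhole arithmetic: a write touches W of the N replicas and a read touches R, so they must share at least R + W - N replicas, which is why R + W > N guarantees the read sees the latest write:

```python
def min_overlap(n, r, w):
    # Minimum number of replicas common to any read set of size r and
    # any write set of size w drawn from n replicas (pigeonhole bound).
    return max(0, r + w - n)
```

With the usual N = 3, R = W = 2, every read set overlaps every write set in at least one replica; with R = 1, W = 2 the overlap can be zero, so a read may miss the latest write.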
Vector Clocks – Multi-Version and Resolution
Because of arbitrarily delayed writes, versions have to be reconciled. Reconciling versions is done using Vector Clocks.
- Every virtual node maintains its own timestamp (a per-node counter). Every copy of an item carries the history of (node, timestamp) pairs it passed through
|Node X writes item: D1([Sx,1])
|Node X modifies item: D2([Sx,2])
|Node X down; Node Y modifies item: D3([Sx,2],[Sy,1])
|Nodes X and Y down; Node Z modifies item: D4([Sx,2],[Sz,1])
- Reconciliation is done during a “Read” when R copies are being read.
|Node X comes back up, reads the conflicting versions, and writes the reconciled version: D5([Sx,3],[Sz,1],[Sy,1])
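The example above can be reproduced with a small vector-clock sketch, where each clock is a map from server id to counter:

```python
def descends(a, b):
    # Clock a includes all the history of clock b iff it is at least as
    # large on every entry of b.
    return all(a.get(server, 0) >= count for server, count in b.items())

def reconcile(a, b):
    # Merge two conflicting clocks entry-wise.
    return {s: max(a.get(s, 0), b.get(s, 0)) for s in set(a) | set(b)}

d3 = {"Sx": 2, "Sy": 1}   # D3([Sx,2],[Sy,1])
d4 = {"Sx": 2, "Sz": 1}   # D4([Sx,2],[Sz,1])

# Neither version descends from the other: a true conflict the read must resolve.
conflict = not descends(d3, d4) and not descends(d4, d3)

# Node X reconciles and writes the merged version, bumping its own counter.
d5 = reconcile(d3, d4)
d5["Sx"] += 1             # D5([Sx,3],[Sy,1],[Sz,1])
```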
In addition to reconciling on Reads, reconciliation may also happen behind the scenes – to make sure the replicas do not go too far out of sync.
- A Merkle tree is a tree where
- Each leaf stores the hash of a range of values held by the node.
- Each intermediate node stores the hash of its children's hashes.
- Comparing the Merkle trees of two nodes from the root downwards identifies exactly which ranges differ between the two versions, and hence is a fast way to pro-actively exchange values and become consistent.
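A sketch of this comparison, using a two-level tree (leaf hashes over value ranges, a root hash over the leaves); the ranges and keys are illustrative:

```python
import hashlib

def h(data):
    return hashlib.sha256(data.encode()).hexdigest()

def merkle(ranges):
    # Leaves hash a range of values each; the root hashes the leaf hashes.
    leaves = [h("".join(r)) for r in ranges]
    return h("".join(leaves)), leaves

def diff_ranges(a, b):
    # Compare from the root down: equal roots mean no data exchange at all;
    # otherwise only the differing leaf ranges need to be exchanged.
    root_a, leaves_a = merkle(a)
    root_b, leaves_b = merkle(b)
    if root_a == root_b:
        return []
    return [i for i, (x, y) in enumerate(zip(leaves_a, leaves_b)) if x != y]
```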
Scalable Web-2.0 Architectures
These are pre-social-network architectures
Web Architecture Components
- Scalable External Cache
- Scalable WebServer
- Scalable Internal Cache
- Scalable Application Servers
- Scalable Database Servers
- Scalable Miscellaneous Servers
Scalable External Cache
- Caches outbound HTTP objects
- Defense against Denial of Service
- A lot of traffic does not reach the WebServer.
- Since the path to the cache from the WebServer is a high bandwidth path,
- The proxy quickly gets the file, and frees the WebServer connection,
- even though client may be slow
Typical External Caches
- Apache mod_proxy
- A commercial HTTP Accelerator
- Fast Network
- Lots of memory
- Moderate disk space
- Low CPU
Scalable WebServer
- Terminates HTTP or SSL connections
- Serves static content from disk
- Serves dynamic content
- php, mod_perl, python, etc
- Dispatches request to Application Server
Typical Web Servers
- Apache ( The “A” of LAMP )
- thttpd ( tiny/turbo/throttling HTTP server – very fast )
- Tux WebServer, Oracle WebServer
- Lots and lots of memory
- Memory is the primary bottleneck in WebServers
- Memory determines maximum number of users
- Fast Network
- CPU depends on Storage
- Dynamic Content Serving requires high CPU
- Static Content Serving requires only a slow CPU
- Slow disk is adequate
- Used to serve static content and code
- If SSL terminated, requires SSL accelerator card.
Scalable Application Servers
- Maintains the application view of the data
- Runs internal services
- Shopping Cart
- Credit Card Processing
- Receives RPC (SOAP or REST) calls from the Web layer
- Communicates with Database layer and other services, and returns a response.
Typical App Servers
- BEA Weblogic
- IBM WebSphere
- Extremely memory hungry
- Lots of CPU – Multi-CPU systems required
Scalable Database Servers
- Data Storage and Retrieval
- Data Aggregation and Computing
- Transaction (ACID properties)
- Data Archival, Data Warehousing
- Depends on the application. Likely to be expensive
- Lots of memory
- Lots of disks – Spindles galore
- CPU important for Data Computation
- Redundancy – Dual Power Supply, Dual Network
Internal Cache Tier
- Object Cache
- Caches of Applications and Web layers
- Tuned for applications
- Scales horizontally
- LRU tables
- Lots of memory
- Some hard-disk
- Low CPU
Scalable Miscellaneous Servers
- Time Synchronization
- System Monitoring
- Fault Notification
Routers and Switches
- Crucial piece of the system
- Use GigE for performance
- Use jumbo Ethernet frames (up to 9000 bytes instead of the standard 1500)
- Use Vlans to handle complexity
- Use Trunking, Bonding and Aggregation (LACP) for failover
Load Balancers
- Balances the load on
- HTTP Caches
- HTTP Servers
- App Servers
- DB Servers
- Need not be used on tiers that auto-balance within themselves
Message Buses – Examples
Message Buses – Functionality
- Robustness between different parts of the Distributed System
- Guaranteed Delivery
- Handles Unicast and Broadcast
Most products assume a relational database under the hood. Other models have not been as popular, because the database is used as a shared "Integration Database" across applications.
With the scaling issues of relational databases, the "Application Database" model is becoming popular. With this model
- HTTP/ REST / WebServices is used as the primary vehicle of Integration
- Each application is free to choose the Database that best serves its application needs.
Simpler Database Schema
By breaking the iron-tight requirement of "Database Integration", it becomes easier to split the database into smaller functional components, and the data model becomes simpler to maintain
Easier Schema and Product Evolution
Each of the individual components can evolve better through “Evolutionary Schema Design and Refactoring”.