Crisp Reading Notes on Latest Technology Trends and Basics

Archive for the ‘Scaling and BigData’ Category

HBase Summary

A really short write-up on HBase and its features. For a longer version, check the open source tutorial.



Like BigTable and many relational databases, HBase also follows a three-level hierarchy

HBase Data Hierarchy

Region Name Relational Term Functionality
ROOT Catalog Covers an entire installation
.meta Database Covers an entire database
.table Tables / Partitions Covers a partition of the table


  • Each directory is also stored in the same format as the regular tables.
  • The common data structure is called a REGION
  • This makes it easy to store and access the meta-data the same way.
  • Each region is assigned to a unique Region Server, that handles both the scan and the update loads on the server.
  • Of course, secondary Region Servers will take over when the Primary Region Server goes down.
  • This approach ensures there is no data contention across machines

Region Servers


Finding the Region

To find the Region Server at which a particular key may be present, the following are the steps

  1. From Zookeeper, find the Server that contains the Root region
  2. From the Root Server, find out the Server that contains the Meta region
  3. From the Meta Server, find the Server that contains the Key Region
  4. From the Key Server, retrieve the value of the key

Note that the Root Region Server, the Meta Region Server and Key Region Server may be cached at the client. So, it may directly contact the Key Region Server directly in many cases.

File Representation

The two points below help search effectively within a Key Region

  1. The records in the file are kept sorted, and hence can provide both sub-range queries and random queries efficiently.
  2. The Indexes are flat, and only on the key. This means at-most two disk blocks are accessed, before the data is read.

Finding the Value

Once a Region is located, the following steps help search for the value

  1. The Store-File contains the Index at the end of the file, which contains information on exactly which block stores the data.
  2. Within a block, binary search can be used to verify if the data exists or not.


Immediate Full Consistency

  • The database follows “Immediate Consistency” on all of its writes.
  • It also follows full ACID semantics
  • This means after a Write is completed, any subsequent Read will read the latest value.
  • For each region, there are Transaction Logs maintained on the underlying file system.
  • This ensures the recovery in case the rest of the Write process fails.
  • The database is designed, not as a B-Tree, but as a “Log-Store Managed Tree”
  • A “Log Store Managed Tree” works as follows

Write-Ahead Logging

Log Store Managed Technique

  1. Write the transaction log, so that the “D” portion of ACID is satisfied
  2. Next change the memory structures. At this point, the transaction is considered done.
  3. Periodically, we run a Merge pass, that
    1. Do an entire sequential scan of the entire sorted Region from the disk,
    2. Merge the memory changes
    3. Write back a new version of the Region to disk
    4. Update the Region server to use this new Region server, instead of the old one.

Why this technique works better

The scenario we have is one of “High Workloads” and a “Large dataset”. Under this situation, a sequential scan performs better than B-Tree writes

  1. B-Trees modify different sparse blocks, and the Disk seek time is wasted. At higher workloads, the disk arm may not be able to keep up with the pace of “Disk requests”.
  2. Even if we batch the Writes, because the “Data set” is vast, the disk blocks written are still sparse.


Cell Representations

Since the Cell values are sparse, the cells are stored as tuples

<row-key> <col-key> <value>

Enhancing this to include column names and timestamps, the representation is

<row-key>:<col-family-name>:<col-name>:<timestamp> <value>

The key-value is therefore represented as


Efficient Representation of Data

As a result of the above representation, the keys can get very long

  • One needs to be conscious of the size of the key, and keep it short
  • If there is only one column, part of the data may be used as part of the key.


HBase requires a good key design, if the load needs to be distributed across multiple machines

  • If the keys are timestamps or sequences, then, they get assigned to the same region, and this becomes a hot-spot
  • To avoid this, the keys may be hashed.
  • Alternately, the leading dimension should not be timestamped.

Zookeeper Usage in Data-Centers


What is Zookeeper

Zookeeper is a “Name Server” in the Hadoop suite of products, with the following characteristics.

  1. Names form a hierarchical name-space
  2. It has functionality create, read, update and delete names
  3. It has functionality to send updates to registered listeners on different machines in the same order in which it received them.

The last two features enable it to be used for co-ordination and synchronization.


Simple Use Case

External Monitoring Service

Suppose we wanted to use an external monitoring service like Munin

  1. The external program will register a “Zookeeper Watch” to be informed whenever there is a change in the tree location.
  2. The existing services, such as apache may register a node.

For example, the node below represents an Apache server at at port 80.

  1. The munin system can periodically get a dump of all services under /services/www .  and load them into its special file – munin.conf
  2. Once this is done, the particular WebServer can be monitored by the Monitoring System.

Why not use a Database

Zookeeper is a superior interface to the database, because of the guarantees made

  1. The watch is ordered with respect to other events, other watches, and other asynchronous replies. The events are all propagated by the client library in the right order.
  2. The client will see the node creation event before it sees the value for the node
  3. The order in which events are seen by the client is the same as the order in which these are being seen by the Zookeeper service.

Usage in Hadoop

Managing Configuration Changes

  • When there are hundreds and thousands of nodes in a cluster, it becomes difficult to push configuration changes to the machines.
  • Zookeeper enables the configuration changes to be pushed.

Implementing Reliable Messaging

  • With Zookeeper, we can implement reliable producer-consumer queues
  • even if a few consumers and some Zookeeper servers fail.

Implement Redundant Services

  • Several identical nodes may provide a service.
  • One of these may elect itself as the leader (using a leader election algorithm), and may start providing the service.

Synchronize Process Execution

  • Multiple nodes can coordinate the start and end of a process or calculation.
  • This ensures that any follow-up processing is done only after all nodes have finished their calculations.

Usage in a Data-Center

Complex Ad-Serving environment

Zookeeper is also useful in a complex Data-Center environment

  • Let us consider the case of a Complex Ad-Serving system. It consists of several components
    • Database for Campaign data and Fiscal transactions
    • Ad-serving engines for serving the best Advertisements for the customers
    • Campaign planners for advertisers to run campaigns and simulations
    • Log collection engines for Data Warehousing, and data planning.
    • Data analytics and modeling systems
    • Fraud detection systems
    • Beacons and fault management systems
    • Failover servers

Bootstrap Configuration

One of the most important uses of Zookeeper in these cases is as a “Bootstrap Server”.

  • It contains the way to contact the “Services”, when all of the services are not running
  • It can store the primary and secondary configurations.

Distributed Service Locator

The Distributed Service Locator allows a way for services to access other services

  • Services may come up, and use “Leader Election” to decide the configurations.
  • They can store their status, which can then be queried.

Distributed System State

Zookeeper is usually used to maintain top-level system states, so that

  • An upto-date directory of which machine is running which service may be maintained
  • This directory may be used by the Monitoring Software to decide which machines should be monitored and how.

Configuration Push

To make configuration changes, and push them to the servers that use them,

  • Zookeeper can pro-actively push the configuration, if a new configuration is created
  • It can also be used to push software onto each of the clusters.

Amazon Dynamo – Notes from paper


Partition – Consistent Hashing

  • Partition is done through a mechanism called Consistent Hashing
  • For the initial partition, if there are M machines, and N is a large number – say 1000
    • We generate M * N static partitions on the entire key-space
    • This can be  done by using a mod M*N mechanism on the key-space

Large number of partitions

Let k1, k2, … , kn denote the sorted numbersThe partitions are 0…k1, k1…k2, … , kn-1…kn

Mapping of partitions to machines

  • The mapping of partitions to machines follows a novel scheme called “Consistent Hashing”.
  • Previous versions of the Hash tables decided the formula based on a simple mod or div of the key
    • Using a simple mod formula. This meant that whenever a machine was added or removed, lots of keys had to be moved
    • By using a “div” formula for the Hash function, on an average, half the keys needed to be moved.
  • “Consistent Hash’ function avoids this by
    • Using a dynamic table of M*N to lookup the machine on which Partition “I” is placed.
    • Whenever a machine goes away, all of its partitions are evenly distributed to the other machines.
    • Whenever a machine gets added, the partitions that get assigned to the machine may be uniformly borrowed from the other machines.
  • So, the key concept here is
Assign Machine(k) à Table-Lookup(k), where k is a keyAdjust the table and move keys, when you add or remove keys
  • Note that whenever the “Base Configuration” of a ring is changed, data has to be moved.
  • However, the amount of data that is moved is kept to a minimum


  • Partitioning of the key-store is meant for keeping the load on the system distributed.
  • Replication is meant for dealing with short-term failures of hardware and software
  • Each item that is stored in Miis also stored with N of its ring successors.
    • The nodes are called standby nodes.
    • N is called the Replication Factor

Replication Quorums

  • In order to control write-latency,
    • Data is written to the home node – Mi
    • The system responds to the user, that it has stored the item.
    • Replication is done offline, in a queued manner
  • If a data-node goes down between when it wrote to the Transaction Logs, and before replicating
    • We assume that data recovery procedures will take care of restoring the data
    • In the context of “Eventual Consistency”, this can be considered a “Delayed Update”.
  • Under normal operations (under normal assumptions of serializability),
    • Replication Factor is N – The number of nodes each item needs to write to
    • Read Quorum is R – The number of nodes from the Standby, that the Read must happen
    • Write Quorum is W – The minimum number of nodes from the StandBy to which data must be written
  • For all the writes to have been made to at-least one copy,
    • R + W > N
    • Usual numbers are R = W = 2, and N = 3
    • Otherwise transient communication failures may cause data inconsistencies

Hinted Handoff

  • If the write cannot be made to W of the original node copies,
    • We write the data to W – w other copies chosen at random, with a hint.
    • These copies become the actual representatives in the quorum, till the original nodes come back ??

Vector Clocks – Multi-Version and Resolution

Because of arbitrarily delayed writes, versions have to be reconciled. Reconciling versions is done using Vector Clocks.

  • Every Virtual Node maintains its own version of its Timestamp.  Every copy of an item has a history of timestamps it passed through
    • For example
Node X writes item                  D1([Sx,1])Node X is modifies item             D2([Sx,2])Node X down, Node Y modifies item   D3([Sx,2],[Sy,1])

Node X,Y down, Node Z modifies item D4([Sx,2],[Sz,1])

  • Reconciliation is done during a “Read” when R copies are being read.
    • For the above example
Node X comes back up. Gets versions. Writes reconciled versionD5([Sx,3],[Sz,1],[Sy,1])

Merkle Trees

In addition to reconciling on Reads, reconciliation may also happen behind the scenes – to make sure the replicas do not go too far out of sync.

  • A Merkle tree is a tree where
    • Each leaf stores the range of values associated with the node.
    • The intermediate nodes contain the hash value of the values of its children
  • Comparing Merkle trees for two nodes from the root downwards allows you to identify nodes that are different in both the versions, and hence is a fast way to pro-actively exchange values, and become consistent.

Scalable Web-2.0 Architectures

Scalable Web-2.0 Architectures
These are pre-social Architectures

Web Architecture Components

  • Scalable External Cache
  • Scalable WebServer
  • Scalable Internal Cache
  • Scalable Application Servers
  • Scalable Database Servers
  • Scalable Miscellaneous Servers

Scalable External Cache


  • Caches outbound HTTP objects
    • Images, CSS, XML, etc
  • Defense against Denial of Service

How useful

  • A lot of traffic does not reach the WebServer.
  • Since the path to the cache from the WebServer is a high bandwidth path,
    • The proxy quickly gets the file, and frees the WebServer connection,
    • even though client may be slow

Typical Caches

  • Squid
  • Apache mod_proxy
  • A commercial HTTP Accelerator

System Configuration

  • Fast Network
  • Lots of memory
  • Moderate disk space
  • Low CPU

Scalable Web-Server


  1. Terminates HTTP or SSL connections
  2. Serves static content from disk
  3. Serves dynamic content
    • php, mod_perl, python, etc
  4. Dispatches request to Application Server

Typical Web Servers

  • Apache ( The “A” of LAMP )
  • thttp ( Trivial http – Very fast HTTP server )
  • IIS
  • Tux WebServer, Oracle WebServer

Hardware Requirement

  • Lots and lots of memory
    • Memory is the primary bottleneck in WebServers
    • Memory determines maximum number of users
  • Fast Network
  • CPU depends on Storage
    • Dynamic Content Serving requires high CPU
    • Static Content Serving requires only a slow CPU
  • Slow disk is adequate
    • Used to serve static content and code

Application Servers


  • If SSL terminated, requires SSL accelerator card.
  • Maintains Application view of data
  • Runs internal services
    • Search
    • Shopping Cart
    • Credit Card Processing


  • Receives RPC (SOAP or REST) calls from the Web layer
  • Communicates with Database layer and other services, and returns a response.

Typical App Servers

  • BEA Weblogic
  • IBM WebSphere

Hardware Requirements

  • Extremely memory hungry
    • 64-bit systems required
  • Lots of CPU – Multi-CPU systems required

Database Layer


  • Data Storage and Retrieval
  • Data Aggregation and Computing
  • Sorting
  • Filtering
  • Transaction (ACID properties)
  • Data Archival, Data Warehousing

Hardware Requirements

  • Depends on the application. Likely to be expensive
  • Lots of memory
  • Lots of disks – Spindles galore
    • RAID
  • CPU important for Data Computation
  • Redundancy – Dual Power Supply, Dual Network

Typical Databases

  • Oracle
  • MS-SQL
  • IBM-DB2
  • Postgres
  • MySQL

Internal Cache Tier


  • Object Cache
  • Caches of Applications and Web layers
  • Tuned for applications
  • Scales horizontally

Typical Examples

  • memcache
  • redis
  • LRU tables

Hardware Configuration

  • Lots of memory
  • Some hard-disk
  • Low CPU

Miscellaneous Services


  1. Mail
  2. DNS
  3. Time Synchronization
  4. System Monitoring
  5. Fault Notification

The Glue


  1. Load-Balancers
  2. Switches
  3. Firewalls
  4. Routers

Routers and Switches

  • Expensive
  • Complex
  • Crucial piece of the system

Performance Hints

  • Use GigE for performance
    • Use Jumbo Ethernet frames (1500 – 9000 bytes)
  • Use Vlans to handle complexity
  • Use Trunking, Bonding and Aggregation (LACP) for failover

Load Balancers

  • Balances the load on
    • HTTP Caches
    • HTTP Servers
    • App Servers
    • DB Servers
  • Need not use on items that auto-balance within themselves
    • DNS
    • LDAP
    • Memcache

Message Buses – Examples

  • JMS
  • MQ-Series
  • Tibco-Rendevous

Message Buses – Functionality

  • Robustness between different parts of the Distributed System
  • Guaranteed Delivery
  • Handles Unicast and Broadcast

An example of Hadoop usage

Application Databases – More Agile than Integration Databases

Application Databases

Integration Databases

Most products assume a Relational Database under the hood. The other models have not been as popular. The reason is that it is being used as an “Integration Database Model”.

Application Databases

With the scaling issues of Relational Databases, the “Application Database Model” is becoming popular. With this model

  • HTTP/ REST / WebServices is used as the primary vehicle of Integration
  • Each application is free to choose the Database that best serves its application needs.


Simpler Database Schema

With breaking the iron-tight requirement of “Database Integration”, it becomes easier to split the database into smaller functional components, the data model becomes simpler to maintain

Easier Schema and Product Evolution

Each of the individual components can evolve better through “Evolutionary Schema Design and Refactoring”.

Polyglot Persistence

  • Polyglot Persistence is a “Design Pattern” that allows different functional components of an Integrated System to be present in different kinds of data-stores, each with
    • Different CAP (Consistency, Availability and Partitioning) characteristics
    • Different Scalability requirements
    • Different Read and Write Performance characteristics
    • Different Administrative characteristics


  • The following is a sample of how different databases may be used for different applications in an E-retailers application

User Session Redis In memory database

Used for moderate amounts of fast changing data

Financial Data RDBMS Reliability, Transactions, Interoperability, Explainable Procedures
Shopping Cart Riak Emphasis on availability. Even seconds of downtime count.

Must support distributed access

Recommendations Neo4J A Graph database
Product Catalog MongoDB Very good for Indexing

Can tolerate moderate amount of writes as well

Reporting RDBMS Infrastructure is very good, Sophisticated data model,

Query Optimization

Cassandra User Activity Logs Allows a large amount of writes at the same time.

HBase is a good alternative

Cassandra Web Analytics Must read huge amounts of data one-by-one.

Hive is a good alternative


Tag Cloud