Crisp Reading Notes on Latest Technology Trends and Basics


Partition – Consistent Hashing

  • Partition is done through a mechanism called Consistent Hashing
  • For the initial partition, if there are M machines, and N is a large number – say 1000
    • We generate M * N static partitions on the entire key-space
    • This can be  done by using a mod M*N mechanism on the key-space

Large number of partitions

Let k1, k2, … , kn denote the sorted numbersThe partitions are 0…k1, k1…k2, … , kn-1…kn

Mapping of partitions to machines

  • The mapping of partitions to machines follows a novel scheme called “Consistent Hashing”.
  • Previous versions of the Hash tables decided the formula based on a simple mod or div of the key
    • Using a simple mod formula. This meant that whenever a machine was added or removed, lots of keys had to be moved
    • By using a “div” formula for the Hash function, on an average, half the keys needed to be moved.
  • “Consistent Hash’ function avoids this by
    • Using a dynamic table of M*N to lookup the machine on which Partition “I” is placed.
    • Whenever a machine goes away, all of its partitions are evenly distributed to the other machines.
    • Whenever a machine gets added, the partitions that get assigned to the machine may be uniformly borrowed from the other machines.
  • So, the key concept here is
Assign Machine(k) à Table-Lookup(k), where k is a keyAdjust the table and move keys, when you add or remove keys
  • Note that whenever the “Base Configuration” of a ring is changed, data has to be moved.
  • However, the amount of data that is moved is kept to a minimum


  • Partitioning of the key-store is meant for keeping the load on the system distributed.
  • Replication is meant for dealing with short-term failures of hardware and software
  • Each item that is stored in Miis also stored with N of its ring successors.
    • The nodes are called standby nodes.
    • N is called the Replication Factor

Replication Quorums

  • In order to control write-latency,
    • Data is written to the home node – Mi
    • The system responds to the user, that it has stored the item.
    • Replication is done offline, in a queued manner
  • If a data-node goes down between when it wrote to the Transaction Logs, and before replicating
    • We assume that data recovery procedures will take care of restoring the data
    • In the context of “Eventual Consistency”, this can be considered a “Delayed Update”.
  • Under normal operations (under normal assumptions of serializability),
    • Replication Factor is N – The number of nodes each item needs to write to
    • Read Quorum is R – The number of nodes from the Standby, that the Read must happen
    • Write Quorum is W – The minimum number of nodes from the StandBy to which data must be written
  • For all the writes to have been made to at-least one copy,
    • R + W > N
    • Usual numbers are R = W = 2, and N = 3
    • Otherwise transient communication failures may cause data inconsistencies

Hinted Handoff

  • If the write cannot be made to W of the original node copies,
    • We write the data to W – w other copies chosen at random, with a hint.
    • These copies become the actual representatives in the quorum, till the original nodes come back ??

Vector Clocks – Multi-Version and Resolution

Because of arbitrarily delayed writes, versions have to be reconciled. Reconciling versions is done using Vector Clocks.

  • Every Virtual Node maintains its own version of its Timestamp.  Every copy of an item has a history of timestamps it passed through
    • For example
Node X writes item                  D1([Sx,1])Node X is modifies item             D2([Sx,2])Node X down, Node Y modifies item   D3([Sx,2],[Sy,1])

Node X,Y down, Node Z modifies item D4([Sx,2],[Sz,1])

  • Reconciliation is done during a “Read” when R copies are being read.
    • For the above example
Node X comes back up. Gets versions. Writes reconciled versionD5([Sx,3],[Sz,1],[Sy,1])

Merkle Trees

In addition to reconciling on Reads, reconciliation may also happen behind the scenes – to make sure the replicas do not go too far out of sync.

  • A Merkle tree is a tree where
    • Each leaf stores the range of values associated with the node.
    • The intermediate nodes contain the hash value of the values of its children
  • Comparing Merkle trees for two nodes from the root downwards allows you to identify nodes that are different in both the versions, and hence is a fast way to pro-actively exchange values, and become consistent.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Tag Cloud

%d bloggers like this: