Partition – Consistent Hashing
- Partitioning is done through a mechanism called Consistent Hashing
- For the initial partitioning, suppose there are M machines, and let N be a large number – say 1000
- We generate M * N static partitions over the entire key-space
- This can be done by hashing each key mod M*N (sketched below)
Large number of partitions
- Let k1, k2, …, kn denote the sorted boundary values
- The partitions are 0…k1, k1…k2, …, kn-1…kn
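A minimal Python sketch of the static partitioning step, assuming MD5 as the hash function; the name `partition_of` is illustrative, not from the original:

```python
import hashlib

M = 4           # machines
N = 1000        # a large number of partitions per machine
NUM_PARTITIONS = M * N

def partition_of(key: str) -> int:
    """Hash the key and take it mod M*N to pick a static partition."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest % NUM_PARTITIONS
```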
Mapping of partitions to machines
- The mapping of partitions to machines follows a novel scheme called “Consistent Hashing”.
- Earlier hash-table schemes chose the machine with a simple mod or div formula on the key
- With a simple mod formula, whenever a machine was added or removed, most of the keys had to be moved
- With a “div” formula for the hash function, on average half the keys needed to be moved
- The “Consistent Hashing” scheme avoids this by
- Using a dynamic table of M*N entries to look up the machine on which partition i is placed.
- Whenever a machine goes away, all of its partitions are evenly distributed to the other machines.
- Whenever a machine gets added, the partitions that get assigned to the machine may be uniformly borrowed from the other machines.
- So, the key concept here is
- Assign: Machine(k) → Table-Lookup(k), where k is a key
- Adjust the table and move keys when machines are added or removed (see the sketch below)
- Note that whenever the “Base Configuration” of a ring is changed, data has to be moved.
- However, the amount of data that is moved is kept to a minimum
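Continuing the sketch above (reusing M, NUM_PARTITIONS, and `partition_of`), one possible shape for the dynamic lookup table and the rebalancing steps; the round-robin initial assignment and the function names are assumptions:

```python
import random

# table[i] = machine on which partition i is placed (M*N entries)
table = [i % M for i in range(NUM_PARTITIONS)]

def lookup(key: str) -> int:
    """Machine(k) -> Table-Lookup(k)."""
    return table[partition_of(key)]

def remove_machine(dead: int) -> None:
    """Distribute the departed machine's partitions evenly over the rest."""
    survivors = sorted(set(table) - {dead})
    orphaned = [p for p in range(NUM_PARTITIONS) if table[p] == dead]
    for i, p in enumerate(orphaned):
        table[p] = survivors[i % len(survivors)]

def add_machine(new: int) -> None:
    """Borrow a fair share of partitions from the existing machines.

    Sampling partitions uniformly at random borrows roughly evenly
    from every existing machine.
    """
    count = len(set(table)) + 1
    borrowed = random.sample(range(NUM_PARTITIONS), NUM_PARTITIONS // count)
    for p in borrowed:
        table[p] = new
```

Only the orphaned or borrowed partitions move, which is the minimal data movement the notes describe.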
- Partitioning of the key-store is meant for keeping the load on the system distributed.
- Replication is meant for dealing with short-term failures of hardware and software
- Each item that is stored on Mi is also stored on N of its ring successors (see the sketch below).
- These nodes are called standby nodes.
- N is called the Replication Factor
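A small sketch of picking the standby nodes, assuming the ring is simply an ordered list of machines; `successors` is an illustrative name:

```python
def successors(ring: list, home: int, n: int) -> list:
    """The n ring successors of ring[home], wrapping around the ring."""
    return [ring[(home + k) % len(ring)] for k in range(1, n + 1)]

ring = ["M0", "M1", "M2", "M3", "M4"]
# With Replication Factor N = 3, an item homed on M0 is also stored
# on its 3 ring successors, the standby nodes:
print(successors(ring, 0, 3))   # ['M1', 'M2', 'M3']
```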
- In order to control write-latency,
- Data is written to the home node – Mi
- The system responds to the user that it has stored the item.
- Replication is done offline, in a queued manner (see the sketch below)
- If a data-node goes down after it has written to the Transaction Logs, but before replicating
- We assume that data recovery procedures will take care of restoring the data
- In the context of “Eventual Consistency”, this can be considered a “Delayed Update”.
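A sketch of this low-latency write path, where an in-process queue and a background thread stand in for the offline replication machinery; the class and method names are assumptions:

```python
import queue
import threading

class DataNode:
    def __init__(self, name, standbys):
        self.name = name
        self.standbys = standbys           # the standby replicas
        self.store = {}
        self.txn_log = []                  # stand-in for the Transaction Logs
        self.repl_queue = queue.Queue()
        threading.Thread(target=self._replicate, daemon=True).start()

    def put(self, key, value):
        """Write to the home node and ack the user before replicating."""
        self.txn_log.append((key, value))  # logged first, for recovery
        self.store[key] = value
        self.repl_queue.put((key, value))  # replication happens offline
        return "stored"                    # the user sees success here

    def _replicate(self):
        # Drains the queue in the background; a crash between the log
        # write and this step is the "Delayed Update" case in the notes.
        while True:
            key, value = self.repl_queue.get()
            for standby in self.standbys:
                standby.store[key] = value

standbys = [DataNode(f"S{i}", []) for i in range(3)]
home = DataNode("M1", standbys)
home.put("k", "v")   # returns "stored" before the standbys have the item
```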
- Under normal operation (with the usual serializability assumptions),
- Replication Factor is N – the number of nodes to which each item is written
- Read Quorum is R – the minimum number of standby nodes from which a Read must be served
- Write Quorum is W – the minimum number of standby nodes to which data must be written
- For every Read to overlap the latest Write in at least one copy (the read and write quorums must intersect),
- R + W > N
- Usual numbers are R = W = 2, and N = 3
- Otherwise transient communication failures may cause data inconsistencies
- If the write cannot be made to W of the original node copies (say only w of them succeed),
- We write the data to W – w other copies chosen at random, each tagged with a hint naming the intended node.
- These copies act as the quorum representatives until the original nodes come back (see the sketch below).
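A sketch of the quorum write with the hinted fallback, assuming failed nodes simply refuse writes and that substitutes are drawn at random; all names here are illustrative:

```python
import random

N, R, W = 3, 2, 2   # usual numbers: R = W = 2, N = 3, so R + W > N

class Copy:
    def __init__(self, name, alive=True):
        self.name, self.alive = name, alive
        self.store, self.hints = {}, []

    def write(self, key, value, hint=None):
        if not self.alive:
            return False
        if hint:                       # remember whom the write was meant for,
            self.hints.append(hint)    # so it can be handed back later
        self.store[key] = value
        return True

def quorum_write(key, value, originals, others):
    """Write to the original copies; hand any shortfall off with hints."""
    w = sum(c.write(key, value) for c in originals)  # w successful writes
    shortfall = max(0, W - w)                        # the "W - w" in the notes
    for sub in random.sample(others, shortfall):
        sub.write(key, value, hint="intended-original-node")
    return w + shortfall >= W

# Two originals are down, so one write is handed off with a hint:
originals = [Copy("M1"), Copy("M2", alive=False), Copy("M3", alive=False)]
quorum_write("k", "v", originals, others=[Copy("M4"), Copy("M5")])
```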
Vector Clocks – Multi-Version and Resolution
Because of arbitrarily delayed writes, versions have to be reconciled. Reconciling versions is done using Vector Clocks.
- Every Virtual Node maintains its own version counter. Every copy of an item carries the history of (node, counter) pairs it has passed through
- Node X writes item: D1([Sx,1])
- Node X modifies item: D2([Sx,2])
- Node X down; Node Y modifies item: D3([Sx,2],[Sy,1])
- Nodes X, Y down; Node Z modifies item: D4([Sx,2],[Sz,1])
- Reconciliation is done during a “Read” when R copies are being read.
- Node X comes back up, reads the conflicting versions, and writes the reconciled version: D5([Sx,3],[Sz,1],[Sy,1])
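A sketch of the vector-clock bookkeeping, replaying the D1…D5 sequence above; representing a clock as a dict of per-node counters is an assumption:

```python
def bump(clock: dict, node: str) -> dict:
    """Copy the clock with the writing node's counter incremented."""
    out = dict(clock)
    out[node] = out.get(node, 0) + 1
    return out

def descends(a: dict, b: dict) -> bool:
    """True if version a has seen everything version b has."""
    return all(a.get(n, 0) >= c for n, c in b.items())

def merge(*clocks: dict) -> dict:
    """Element-wise maximum: the clock of the reconciled version."""
    out = {}
    for c in clocks:
        for node, count in c.items():
            out[node] = max(out.get(node, 0), count)
    return out

d1 = bump({}, "Sx")             # D1([Sx,1])
d2 = bump(d1, "Sx")             # D2([Sx,2])
d3 = bump(d2, "Sy")             # D3([Sx,2],[Sy,1])  -- X down
d4 = bump(d2, "Sz")             # D4([Sx,2],[Sz,1])  -- X, Y down
assert not descends(d3, d4) and not descends(d4, d3)  # a true conflict
d5 = bump(merge(d3, d4), "Sx")  # D5([Sx,3],[Sz,1],[Sy,1]) on reconciliation
```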
In addition to reconciling on Reads, reconciliation may also happen behind the scenes – to make sure the replicas do not go too far out of sync.
- A Merkle tree is a tree where
- Each leaf covers a range of keys and stores a hash of the values in that range.
- Each intermediate node stores the hash of its children's hashes.
- Comparing the Merkle trees of two replicas from the root downwards identifies the subtrees that differ between the two versions, and hence is a fast way to pro-actively exchange only the divergent values and become consistent.
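A sketch of building and comparing such trees, assuming SHA-1 hashes and a power-of-two number of leaf ranges; `build` and `diff` are illustrative names:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def build(range_values: list) -> list:
    """Levels bottom-up: leaf hashes first, then hashes of child pairs."""
    levels = [[h(v) for v in range_values]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1])
                       for i in range(0, len(prev), 2)])
    return levels

def diff(a, b, level=None, idx=0):
    """Compare from the root down; return indices of differing leaves."""
    if level is None:
        level = len(a) - 1
    if a[level][idx] == b[level][idx]:
        return []                 # equal hash => the whole subtree matches
    if level == 0:
        return [idx]              # this key range must be exchanged
    return (diff(a, b, level - 1, 2 * idx) +
            diff(a, b, level - 1, 2 * idx + 1))

replica1 = build([b"v0", b"v1", b"v2", b"v3"])
replica2 = build([b"v0", b"v1", b"vX", b"v3"])
print(diff(replica1, replica2))   # [2] -- only that range is exchanged
```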