Crisp Reading Notes on Latest Technology Trends and Basics

Archive for May, 2012

Optimistic Timestamp – (on MVCC)

Timestamping with Multi-Version Concurrency Control


  • Multi-version Concurrency Control allows the designer to take snapshots, and verify their data at commit/abort time.
  • Every transaction goes through three phases
    • Read / Local Modify
    • Validation
    • Write
  • The Timestamp of each transaction Ti is created at Validation time
    • Note that it is not necessary to have timestamps while tracking Read/Write.


ST(Ti) Starting time of Transaction Ti
TS(Ti) Validation time of Transaction Ti
RS(Ti) Read-set of Transaction Ti
WS(Ti) Write-set of Transaction Ti
  • Note that this is a Snapshot abstraction. Changes of Ti are not visible until Ti decides to commit.
  • While validating a Transaction,
    • Consider transactions that were active when this transaction started, and have since validated

Backward Validation of Transaction

B(T) = { Tj | Tj was active when T started, and TS(Tj) < TS(T), that is, Tj precedes T in timestamp ordering }
  • For each of these transactions Tj,
    • Transaction T should not have read from Tj
      • There should be no RS(T) → WS(Tj) dependency
      • This is because T would have picked up its snapshot before Tj wrote the values back
  • So, the algorithm is
sub validate(T) {
    valid ← true
    for all Tj in B(T) {
        if RS(T) ∩ WS(Tj) ≠ Φ
            valid ← false
    }
    return valid
}

Forward Validation of Transactions

  • In addition to the above technique of looking back at the transaction, we could look forward at the Transactions that are currently active.
  • If anything in the Write-Set of the current transaction has been read by the Transactions that have not yet got the Validation Timestamp, then
    • These transactions will most likely get aborted when they try to Commit
    • So, we could abort these new transactions right at this stage.
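The two checks can be sketched in a few lines of Python — a minimal illustration, assuming each transaction carries explicit read/write sets (the `Txn` class and function names are mine, not from any particular system):

```python
from dataclasses import dataclass, field

@dataclass
class Txn:
    name: str
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)

def backward_validate(T, overlapped):
    """overlapped: the transactions in B(T), validated after T started.
    T fails if it read an item one of them wrote, because T's snapshot
    was taken before that write."""
    return all(not (T.read_set & Tj.write_set) for Tj in overlapped)

def forward_validate(T, active):
    """Return the active transactions that already read something in
    WS(T); they would fail backward validation later, so they can be
    aborted right away."""
    return [Ta for Ta in active if Ta.read_set & T.write_set]
```

`backward_validate` runs for the committing transaction against B(T); `forward_validate` returns the active transactions that could be aborted early.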

Backward Validation is required even if we use Forward Validations

  • Note that during “Forward Validation”, we do not know the full Read-Set of the forward (still active) transactions
    • So, some of the Forward transactions may actually abort when they try to commit.
    • This check can be made only when the Forward transaction tries to Commit, and the check is a “Backward Validation”.
    • For this reason, “Backward Validation” is required, even if we use “Forward Validation”.

Why Forward Validations

  • Forward Validation aborts Active Transactions; Backward Validation aborts the Validating Transaction.
  • It is less expensive to abort a transaction while it is still Active, as opposed to after it has done all its work and reached Validation.

Distributed Optimistic Control

  • The “Validation Timestamps” have to be strictly ordered.
    • On multiple nodes, there needs to be a mechanism to generate strict ordering

Locking on a Single Node

  • Also, when we are validating a Transaction,
    • If there is only “Backward Validation”, no locking is needed
    •  For “Forward Validation”, locks are needed for the Read-Set and Write-Set of all items on the list

Conservative Timestamping

Conservative Timestamp Ordering


  • With Timestamp Ordering schemes, there is no Locking
  • We assign timestamps to Transactions and to data-items
  • We describe below, one of the techniques of Timestamping, which is conservative, and to a large extent, Single Versioned
  • I will add notes for Multi-Versioned Timestamping at a later point in time


TS(Ti) Starting time of Transaction – Ti

The starting time forms the Id of the transaction

RTS(Oj) Read timestamp of item

The most recent transaction that has read the item

WTS(Oj) Write timestamp of item

The most recent transaction that has written the item

  • The reads and writes to the database are executed without locking
  • Instead, the following two conditions should be satisfied with each read/write

Read rule

  • If Ti reads Oj,
    • If WTS(Oj) > TS(Ti), a later transaction has written the data. So, abort the current transaction
    • Else RTS(Oj) ← max( RTS(Oj), TS(Ti) ), and
      • add a dependency: DEP(Ti).add( WTS(Oj) → TS(Ti) )

Write Rule

  • If Ti writes Oj,
    • If RTS(Oj) > TS(Ti), a later transaction has read data. So, abort transaction if Isolation level is Read Committed
    • If WTS(Oj) > TS(Ti), a later transaction has written the data. So, skip the write
    • Else
      • OLD(Ti).add(Oj,WTS(Oj))
      • update Oj

Commit Rule

  • If Ti wants to commit,
    • If there is an aborted transaction in DEP(Ti) – a transaction from which Ti has read data,
      • then abort the transaction
    • If there is a running transaction in DEP(Ti) – a transaction from which Ti has read data,
      • then either abort the transaction, or wait till the running transaction completes

Abort Rule

  • If Ti wants to abort,
    • For every entry <Oj, oldWTS> in OLD(Ti), if WTS(Oj) = TS(Ti),
      • restore the old value of Oj, and WTS(Oj) ← oldWTS
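The Read, Write and Abort rules above fit in a short Python sketch — illustrative only; `Item`, the integer timestamps, and the bookkeeping structures are assumptions of the sketch:

```python
class Aborted(Exception):
    pass

class Item:
    def __init__(self, value):
        self.value, self.rts, self.wts = value, 0, 0   # RTS(Oj), WTS(Oj)

def read(item, ts, dep):
    if item.wts > ts:                 # a later transaction wrote: abort
        raise Aborted
    item.rts = max(item.rts, ts)
    dep.add(item.wts)                 # DEP(Ti): whom we read from
    return item.value

def write(item, ts, value, old):
    if item.rts > ts:                 # a later transaction read: abort
        raise Aborted
    if item.wts > ts:                 # a later write exists: skip it
        return
    old.append((item, item.value, item.wts))  # OLD(Ti), for the abort rule
    item.value, item.wts = value, ts

def abort(ts, old):
    for item, old_value, old_wts in reversed(old):
        if item.wts == ts:            # undo only our own writes
            item.value, item.wts = old_value, old_wts
```

Note that no locks appear anywhere: each rule is a pure timestamp comparison against RTS/WTS of the item.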


Operation rules, based on the transaction that last touched the item:

               Read by         Written by      Read by         Written by
               Older Txn       Older Txn       Newer Txn       Newer Txn
Read Object    OK              OK              OK              Abort –
                                                               cannot serialize
Write Object   OK              OK              Abort –         Skip –
                                               cannot          no overwriting
                                               serialize

Commit/Abort rules, based on the state of an older transaction read from:

          Older Txn Aborted      Older Txn Committed    Older Txn Pending
Commit    OK only if no read     OK                     Wait; on its completion,
          dependency from the                           check the read dependency,
          older to the newer                            then commit or abort
Abort     OK                     OK                     OK

Notes on wait state during commit

  • An observant reader will notice that there is a “Wait State” during the “Commit” of a transaction
  • This state could be easily engineered as an Abort action. In that case, the later transaction will also be forced to Abort, causing “Cascading Aborts”. This is something we seek to avoid

Amazon Dynamo – Notes from paper


Partition – Consistent Hashing

  • Partition is done through a mechanism called Consistent Hashing
  • For the initial partition, if there are M machines, and N is a large number – say 1000
    • We generate M * N static partitions on the entire key-space
    • This can be done by using a mod M*N mechanism on the key-space

Large number of partitions

Let k1, k2, … , kn denote the sorted numbers.
The partitions are 0…k1, k1…k2, … , kn-1…kn

Mapping of partitions to machines

  • The mapping of partitions to machines follows a novel scheme called “Consistent Hashing”.
  • Previous versions of the Hash tables decided the formula based on a simple mod or div of the key
    • Using a simple mod formula. This meant that whenever a machine was added or removed, lots of keys had to be moved
    • By using a “div” formula for the Hash function, on an average, half the keys needed to be moved.
  • The “Consistent Hash” function avoids this by
    • Using a dynamic table of M*N entries to look up the machine on which Partition “i” is placed.
    • Whenever a machine goes away, all of its partitions are evenly distributed to the other machines.
    • Whenever a machine gets added, the partitions that get assigned to the machine may be uniformly borrowed from the other machines.
  • So, the key concept here is
Assign Machine(k) → Table-Lookup(k), where k is a key
Adjust the table and move keys, when you add or remove machines
  • Note that whenever the “Base Configuration” of a ring is changed, data has to be moved.
  • However, the amount of data that is moved is kept to a minimum
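A toy version of the table-lookup scheme might look like this (the use of md5, the round-robin initial placement, and all the names are assumptions of the sketch):

```python
import hashlib

def build_table(machines, partitions):
    # round-robin initial assignment of static partitions to machines
    return {p: machines[p % len(machines)] for p in range(partitions)}

def machine_for(key, table):
    # hash the key into one of the static partitions, then look it up
    p = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(table)
    return table[p]

def remove_machine(table, dead):
    # hand the dead machine's partitions to the survivors, evenly
    survivors = sorted(set(table.values()) - {dead})
    owned = sorted(p for p, m in table.items() if m == dead)
    for i, p in enumerate(owned):
        table[p] = survivors[i % len(survivors)]
```

Removing a machine touches only that machine's partitions; every other key keeps its old home, which is the whole point of the scheme.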


  • Partitioning of the key-store is meant for keeping the load on the system distributed.
  • Replication is meant for dealing with short-term failures of hardware and software
  • Each item that is stored in Mi is also stored with N of its ring successors.
    • The nodes are called standby nodes.
    • N is called the Replication Factor

Replication Quorums

  • In order to control write-latency,
    • Data is written to the home node – Mi
    • The system responds to the user, that it has stored the item.
    • Replication is done offline, in a queued manner
  • If a data-node goes down after it has written to the Transaction Logs, but before replicating
    • We assume that data recovery procedures will take care of restoring the data
    • In the context of “Eventual Consistency”, this can be considered a “Delayed Update”.
  • Under normal operations (under normal assumptions of serializability),
    • Replication Factor is N – The number of nodes each item needs to write to
    • Read Quorum is R – The minimum number of replica nodes from which a Read must be served
    • Write Quorum is W – The minimum number of replica nodes to which data must be written
  • For every Read to overlap at least one copy of the latest Write,
    • R + W > N
    • Usual numbers are R = W = 2, and N = 3
    • Otherwise transient communication failures may cause data inconsistencies
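The quorum condition is a one-liner worth checking against the numbers above:

```python
def quorums_overlap(n, r, w):
    # every read quorum shares at least one node with every write quorum
    return r + w > n

assert quorums_overlap(3, 2, 2)       # the usual R = W = 2, N = 3
assert not quorums_overlap(3, 1, 1)   # a read may miss the latest write
```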

Hinted Handoff

  • If the write can be made to only w < W of the original replica nodes,
    • We write the remaining W – w copies to other nodes chosen at random, with a hint naming the intended node.
    • These copies act as the actual representatives in the quorum, till the original nodes come back.

Vector Clocks – Multi-Version and Resolution

Because of arbitrarily delayed writes, versions have to be reconciled. Reconciling versions is done using Vector Clocks.

  • Every Virtual Node maintains its own version of its Timestamp.  Every copy of an item has a history of timestamps it passed through
    • For example
Node X writes item                  D1([Sx,1])
Node X modifies item                D2([Sx,2])
Node X down, Node Y modifies item   D3([Sx,2],[Sy,1])

Node X,Y down, Node Z modifies item D4([Sx,2],[Sz,1])

  • Reconciliation is done during a “Read” when R copies are being read.
    • For the above example
Node X comes back up. Gets versions. Writes reconciled version
D5([Sx,3],[Sy,1],[Sz,1])
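The example can be replayed with a dictionary-based vector clock (a minimal sketch; the function names are illustrative):

```python
def bump(clock, node):
    """A write on `node` advances its entry in the clock."""
    c = dict(clock)
    c[node] = c.get(node, 0) + 1
    return c

def descends(a, b):
    """True if clock a has seen everything clock b has."""
    return all(a.get(n, 0) >= k for n, k in b.items())

def merge(*clocks):
    """Entry-wise max: the reconciled clock dominates all inputs."""
    out = {}
    for c in clocks:
        for n, k in c.items():
            out[n] = max(out.get(n, 0), k)
    return out

d2 = {"Sx": 2}                  # D2([Sx,2])
d3 = bump(d2, "Sy")             # D3([Sx,2],[Sy,1])
d4 = bump(d2, "Sz")             # D4([Sx,2],[Sz,1])
conflict = not descends(d3, d4) and not descends(d4, d3)
d5 = bump(merge(d3, d4), "Sx")  # D5: reconciled write by X
```

Neither D3 nor D4 descends from the other, so a read sees a conflict and writes back the merged D5.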

Merkle Trees

In addition to reconciling on Reads, reconciliation may also happen behind the scenes – to make sure the replicas do not go too far out of sync.

  • A Merkle tree is a tree where
    • Each leaf stores the hash of the values in a range of keys
    • Each intermediate node stores the hash of the hashes of its children
  • Comparing the Merkle trees of two nodes from the root downwards identifies exactly the key ranges on which the replicas differ, and hence is a fast way to pro-actively exchange values, and become consistent.
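A toy comparison over key ranges shows the idea (sha256, the tuple-based tree, and all the names are assumptions of the sketch):

```python
import hashlib

def h(data):
    return hashlib.sha256(data.encode()).hexdigest()

def build(items, lo, hi):
    """items: sorted list of (key, value). Returns (hash, children),
    where children is a leaf's item list or a (left, right) pair."""
    if hi - lo <= 1:
        leaf = items[lo:hi]
        return (h(repr(leaf)), leaf)
    mid = (lo + hi) // 2
    left, right = build(items, lo, mid), build(items, mid, hi)
    return (h(left[0] + right[0]), (left, right))

def diff(a, b, out):
    if a[0] == b[0]:
        return                       # identical subtree: prune here
    if isinstance(a[1], list):       # differing leaves: report them
        out.append((a[1], b[1]))
    else:
        diff(a[1][0], b[1][0], out)
        diff(a[1][1], b[1][1], out)
```

Identical subtrees are pruned at the first matching hash, so only the differing ranges are ever exchanged.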

Internationalization in an Internet Environment


Document Encodings

HTTP Responses

  • HTTP responses usually contain a “Content-Type: xxxx; charset=yyy”
    • Based on the charset, the encoding can be inferred
  • This requires the WebServer to either
    • know the encoding up-front,
    • or to understand it by reading a bit of the document
  • An XML document always begins with <?xml version="1.0" encoding="…"?>
  • An HTML document carries this as part of its meta-tag itself

In the Document itself

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Browser Inferring the Encoding

  • For many encodings, the browser tries to infer the encoding from the distribution of characters
  • This applies for variants of the Code-page encodings
    • Each language gets its own set of mappings, that have their own distributions in typical documents
    • If the browser did not get it right, we just change the encoding manually on the browser, and read the document.

URL Encodings

  • The URL encodings are relevant in two places
    • As the URL in the HTTP request ( Both GET and POST )
    • For posting the contents of the form
  • The encoding for URLs is
    • Convert to UTF-8 first
    • Then, replace all reserved characters with their %-escaped sequences
    • Other sequences may also be %-escaped.
  • During form-submit, the payload could be www-form-url encoded.
    • This also follows the URL encoding rules for the most part.
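Python's standard library follows exactly this two-step rule (UTF-8 first, then %-escapes), which makes for a quick check:

```python
from urllib.parse import quote, urlencode

# 'é' becomes the UTF-8 bytes C3 A9, each then %-escaped
assert quote("café") == "caf%C3%A9"

# form-submit payload: www-form-urlencoded (spaces become '+')
assert urlencode({"q": "a b"}) == "q=a+b"
```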

UTF Encodings

UTF encodings have the following interesting features that make them very good encodings

  1. The beginning of a character has a zero in the first bit (single-byte) or 11 in the first two bits (lead byte of a multi-byte sequence); continuation bytes begin with 10
    • This makes it easy to synchronize on character boundaries
    • The number of bytes occupied is given by the number of contiguous 1s in the lead byte
  2. This makes it easy to skip over a character and move to the next
    • Also, the first bytes of a stream reveal which UTF encoding is being used (JSON parsers have used this for encoding detection)
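The lead-byte property can be demonstrated directly (a small sketch; `utf8_len` is an illustrative name):

```python
def utf8_len(lead):
    """Sequence length from the lead byte's high bits."""
    if lead < 0x80:
        return 1                        # 0xxxxxxx: ASCII
    if lead >> 5 == 0b110:
        return 2                        # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3                        # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4                        # 11110xxx
    raise ValueError("continuation byte, not a lead byte")

data = "€".encode("utf-8")              # three bytes: E2 82 AC
assert utf8_len(data[0]) == 3
assert all(b >> 6 == 0b10 for b in data[1:])   # continuation bytes
```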

Generalized Linear Regression

Generalized Linear Regression

Document in MS-Word format glm_regression

This is a long post on an Intermediate level topic of interest to people working in Big-Data.

These are my notes, as I struggled to understand the topic from the available references. Unfortunately, the references all contained the same wording and the same areas of focus. They were deficient in some crucial areas – missing links

  1. To explain the Link function in terms of the “Maximum Likelihood” of the original distribution
  2. To explain how Maximum Likelihood applies to regression – that the objective is to fit a probability distribution function that maximizes the probability that the observed independent variables would give as output the observed dependent variables
    • That only a single pdf is being fit, even though the observations are in N dimensions


Scalable Web-2.0 Architectures

Scalable Web-2.0 Architectures
These are pre-social Architectures

Web Architecture Components

  • Scalable External Cache
  • Scalable WebServer
  • Scalable Internal Cache
  • Scalable Application Servers
  • Scalable Database Servers
  • Scalable Miscellaneous Servers

Scalable External Cache


  • Caches outbound HTTP objects
    • Images, CSS, XML, etc
  • Defense against Denial of Service

How useful

  • A lot of traffic does not reach the WebServer.
  • Since the path to the cache from the WebServer is a high bandwidth path,
    • The proxy quickly gets the file, and frees the WebServer connection,
    • even though client may be slow

Typical Caches

  • Squid
  • Apache mod_proxy
  • A commercial HTTP Accelerator

System Configuration

  • Fast Network
  • Lots of memory
  • Moderate disk space
  • Low CPU

Scalable Web-Server


  1. Terminates HTTP or SSL connections
  2. Serves static content from disk
  3. Serves dynamic content
    • php, mod_perl, python, etc
  4. Dispatches request to Application Server

Typical Web Servers

  • Apache ( The “A” of LAMP )
  • thttpd ( tiny/turbo/throttling httpd – a very fast HTTP server )
  • IIS
  • Tux WebServer, Oracle WebServer

Hardware Requirement

  • Lots and lots of memory
    • Memory is the primary bottleneck in WebServers
    • Memory determines maximum number of users
  • Fast Network
  • CPU depends on Storage
    • Dynamic Content Serving requires high CPU
    • Static Content Serving requires only a slow CPU
  • Slow disk is adequate
    • Used to serve static content and code

Application Servers


  • If SSL terminated, requires SSL accelerator card.
  • Maintains Application view of data
  • Runs internal services
    • Search
    • Shopping Cart
    • Credit Card Processing


  • Receives RPC (SOAP or REST) calls from the Web layer
  • Communicates with Database layer and other services, and returns a response.

Typical App Servers

  • BEA Weblogic
  • IBM WebSphere

Hardware Requirements

  • Extremely memory hungry
    • 64-bit systems required
  • Lots of CPU – Multi-CPU systems required

Database Layer


  • Data Storage and Retrieval
  • Data Aggregation and Computing
  • Sorting
  • Filtering
  • Transaction (ACID properties)
  • Data Archival, Data Warehousing

Hardware Requirements

  • Depends on the application. Likely to be expensive
  • Lots of memory
  • Lots of disks – Spindles galore
    • RAID
  • CPU important for Data Computation
  • Redundancy – Dual Power Supply, Dual Network

Typical Databases

  • Oracle
  • MS-SQL
  • IBM-DB2
  • Postgres
  • MySQL

Internal Cache Tier


  • Object Cache
  • Caches of Applications and Web layers
  • Tuned for applications
  • Scales horizontally

Typical Examples

  • memcache
  • redis
  • LRU tables

Hardware Configuration

  • Lots of memory
  • Some hard-disk
  • Low CPU

Miscellaneous Services


  1. Mail
  2. DNS
  3. Time Synchronization
  4. System Monitoring
  5. Fault Notification

The Glue


  1. Load-Balancers
  2. Switches
  3. Firewalls
  4. Routers

Routers and Switches

  • Expensive
  • Complex
  • Crucial piece of the system

Performance Hints

  • Use GigE for performance
    • Use Jumbo Ethernet frames (up to 9000 bytes, instead of the standard 1500)
  • Use Vlans to handle complexity
  • Use Trunking, Bonding and Aggregation (LACP) for failover

Load Balancers

  • Balances the load on
    • HTTP Caches
    • HTTP Servers
    • App Servers
    • DB Servers
  • Need not use on items that auto-balance within themselves
    • DNS
    • LDAP
    • Memcache

Message Buses – Examples

  • JMS
  • MQ-Series
  • TIBCO Rendezvous

Message Buses – Functionality

  • Robustness between different parts of the Distributed System
  • Guaranteed Delivery
  • Handles Unicast and Broadcast

Database Normal Forms

First Normal Form

  • A Relational Database table holds only Atomic Values in each field
  • Every record contains the same number of fields
  • Lists must be flattened out
  • The First Normal Form is responsible for causing various Redundancies, that other Normal Forms try to remove
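For instance, flattening a list-valued record into 1NF rows repeats the key — the redundancy the later normal forms then remove (a toy Python illustration):

```python
order = {"order_id": 7, "items": ["pen", "book"]}

# one row per list element; the order_id is duplicated across rows
rows = [(order["order_id"], item) for item in order["items"]]
assert rows == [(7, "pen"), (7, "book")]
```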

Second Normal Form

  • Every non Primary-key column must fully depend on all the Primary-Key columns
  • If a Value column Vj depends on only a subset of the keys, <K1, K2, …, Kk>,
    • then a separate table < K1, K2, …, Kk, Vj > must be created

Third Normal Form

  • There must be no dependencies among non Primary-Key fields
  • If a Value column Vi depends on column Vj, then
    • A separate table < Vj, Vi > must be created

Boyce-Codd Normal Form

  • There must be no dependencies from non-key columns to any portion of the keys
  • It is not always possible to decompose a 3NF table into BCNF while preserving dependencies

Fourth Normal Form

  • Two independent multi-valued attributes must not be present in the same table
  • If keys <K1, K2, …, Kn > have two multi-valued columns Vi and Vj
    • Then separate tables – < K1, K2, …, Kn, Vi >  and < K1, K2, …, Kn, Vj > must be created

Fifth Normal Form

  • This applies to decomposition of a table with only keys
  • If there are multivalued relationships, among the candidate keys, then they may be decomposed into Fifth Normal Form.
