What is Zookeeper
Zookeeper is a “Name Server” in the Hadoop suite of products, with the following characteristics.
- Names form a hierarchical name-space
- It provides functionality to create, read, update and delete names
- It has functionality to send updates to registered listeners on different machines in the same order in which it received them.
The last two features enable it to be used for co-ordination and synchronization.
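To make these three characteristics concrete, here is a minimal in-memory sketch. It is a toy model written for illustration only: the class ToyNamespace and its method names are assumptions of this sketch and are not the Zookeeper client API. It shows a hierarchical name space, the four basic operations, and listeners that receive events in the order the changes were applied.

```python
from collections import defaultdict


class ToyNamespace:
    """In-memory model of a hierarchical name space (not the Zookeeper API)."""

    def __init__(self):
        self.nodes = {"/": b""}               # path -> data
        self.listeners = defaultdict(list)    # path -> callbacks

    def listen(self, path, callback):
        self.listeners[path].append(callback)

    def _notify(self, event, path):
        # Events are delivered one at a time, in the order the changes
        # were applied, so every listener observes the same sequence.
        for callback in self.listeners[path]:
            callback(event, path)

    def create(self, path, data=b""):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes or path in self.nodes:
            raise KeyError(path)
        self.nodes[path] = data
        self._notify("created", path)

    def read(self, path):
        return self.nodes[path]

    def update(self, path, data):
        if path not in self.nodes:
            raise KeyError(path)
        self.nodes[path] = data
        self._notify("updated", path)

    def delete(self, path):
        if any(p.startswith(path + "/") for p in self.nodes):
            raise KeyError(f"{path} has children")
        del self.nodes[path]
        self._notify("deleted", path)


ns = ToyNamespace()
events = []
ns.listen("/services", lambda event, path: events.append(event))
ns.create("/services")
ns.update("/services", b"v2")
ns.delete("/services")
print(events)  # ['created', 'updated', 'deleted']
```

In the real service the listeners live on other machines, but the idea is the same: a change to a name produces an event, and events arrive in the order of the changes.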
Simple Use Case
External Monitoring Service
Suppose we want to use an external monitoring service such as Munin.
- The external program will register a “Zookeeper Watch” to be informed whenever there is a change at a given location in the tree.
- The existing services, such as Apache, may each register a node.
For example, a node under /services/www can represent an Apache server at www32.mydomain.com on port 80.
- The Munin system can periodically get a dump of all services under /services/www and load them into its configuration file, munin.conf.
- Once this is done, the particular web server can be monitored by the monitoring system.
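The step that turns the registered services into munin.conf entries can be sketched as follows. This is only an illustration: the function render_munin_conf is hypothetical, and it assumes each child under /services/www is named "host:port", which is a naming convention invented for this sketch. Fetching the child names from Zookeeper is left out.

```python
def render_munin_conf(service_names):
    """Render munin.conf host sections from the child names of /services/www.

    Each name is assumed to look like "host:port" (a hypothetical convention).
    """
    # Several services on one host need only one Munin host section.
    hosts = sorted({name.split(":", 1)[0] for name in service_names})
    sections = []
    for host in hosts:
        sections.append(
            f"[{host}]\n"
            f"    address {host}\n"
            f"    use_node_name yes\n"
        )
    return "\n".join(sections)


print(render_munin_conf(["www32.mydomain.com:80"]))
```

Regenerating the whole file from the current list of children keeps the logic simple: a service that has gone away simply no longer appears in the next dump.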
Why not use a Database?
Zookeeper is a better fit than a database for this kind of use, because of the ordering guarantees it makes:
- The watch is ordered with respect to other events, other watches, and other asynchronous replies. The events are all propagated by the client library in the right order.
- The client will see the watch event for a node before it sees the new value of that node.
- The order in which events are seen by the client is the same as the order in which the updates are seen by the Zookeeper service.
Usage in Hadoop
Managing Configuration Changes
- When there are hundreds or thousands of nodes in a cluster, it becomes difficult to push configuration changes to the machines.
- Zookeeper enables configuration changes to be pushed to every machine that watches the configuration.
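A minimal sketch of how such a push can work is shown below. It models a watch that fires only once, so each server re-reads the configuration and registers a fresh watch when it is notified. ConfigNode and Server are hypothetical names used for this toy model; they are not part of the Zookeeper client library.

```python
class ConfigNode:
    """Toy configuration node with one-shot watches (not the Zookeeper API)."""

    def __init__(self, data):
        self.data = data
        self.version = 0
        self._watches = []

    def get(self, watch=None):
        # A read may leave a watch that fires on the next change only.
        if watch is not None:
            self._watches.append(watch)
        return self.data, self.version

    def set(self, data):
        self.data = data
        self.version += 1
        # One-shot semantics: clear the watch list before firing.
        fired, self._watches = self._watches, []
        for watch in fired:
            watch()


class Server:
    """A machine that keeps a local copy of the shared configuration."""

    def __init__(self, name, node):
        self.name = name
        self.node = node
        self.reload()

    def reload(self):
        # Re-read the configuration and re-register the watch in one step.
        self.config, self.version = self.node.get(watch=self.reload)


node = ConfigNode("log_level=info")
servers = [Server(f"server-{i}", node) for i in range(3)]
node.set("log_level=debug")
print([s.config for s in servers])
```

The operator changes one node, and every watching server picks up the new value without being contacted individually.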
Implementing Reliable Messaging
- With Zookeeper, we can implement reliable producer-consumer queues that keep working even if a few consumers and some Zookeeper servers fail.
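One way to picture such a queue is as a set of entries whose names carry an increasing sequence number: producers add entries, and consumers always work on the lowest-numbered one. The sketch below is an in-memory toy under that assumption; ToyQueue, the "item-" prefix, and the choice to remove an entry only after it has been processed are all assumptions of this sketch, not the Zookeeper API.

```python
class ToyQueue:
    """Toy FIFO queue of sequence-numbered entries (not the Zookeeper API)."""

    def __init__(self):
        self._next_seq = 0
        self._items = {}          # entry name -> payload

    def put(self, payload):
        # The name carries a zero-padded counter, so sorting names
        # lexicographically gives the order in which entries were added.
        name = f"item-{self._next_seq:010d}"
        self._next_seq += 1
        self._items[name] = payload
        return name

    def head(self):
        """Return (name, payload) of the oldest entry, or None if empty."""
        if not self._items:
            return None
        name = min(self._items)
        return name, self._items[name]

    def remove(self, name):
        """Delete an entry; False means another consumer already removed it."""
        return self._items.pop(name, None) is not None


queue = ToyQueue()
queue.put("job-a")
queue.put("job-b")

name, payload = queue.head()      # a consumer looks at the oldest entry
# If the consumer crashed here, the entry would still be at the head.
queue.remove(name)                # processed successfully, so remove it
print(queue.head())
```

Because an entry stays in the queue until it is explicitly removed, a consumer that fails midway does not lose the message; the next consumer finds it at the head.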
Implementing Redundant Services
- Several identical nodes may provide a service.
- One of these may be elected as the leader (using a leader election algorithm), and may start providing the service.
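The text does not say which leader election algorithm is used. One common approach, sketched here as an assumption, is to give every candidate a sequence number when it joins and treat the lowest live number as the leader. ToyElection is a hypothetical in-memory model, not the Zookeeper API.

```python
class ToyElection:
    """Toy leader election: the lowest sequence number leads."""

    def __init__(self):
        self._next_seq = 0
        self._candidates = {}     # sequence number -> service name

    def join(self, name):
        seq = self._next_seq
        self._next_seq += 1
        self._candidates[seq] = name
        return seq

    def leave(self, seq):
        # Models a candidate failing: its entry disappears.
        self._candidates.pop(seq, None)

    def leader(self):
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]


election = ToyElection()
first = election.join("node-1")
election.join("node-2")
election.join("node-3")
print(election.leader())          # node-1
election.leave(first)             # the leader fails
print(election.leader())          # node-2
```

A node that fails and later rejoins gets a new, higher number, so it waits its turn instead of taking the leadership back.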
Synchronizing Process Execution
- Multiple nodes can coordinate the start and end of a process or calculation.
- This ensures that any follow-up processing is done only after all nodes have finished their calculations.
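The idea is that of a barrier. The sketch below shows it on a single machine with threads and Python's threading.Barrier, purely as a local analogue: in a cluster the workers would be separate nodes and the waiting would be coordinated through Zookeeper rather than shared memory. The function run_with_barrier is a hypothetical name for this illustration.

```python
import threading


def run_with_barrier(worker_count):
    """Run workers that all finish a calculation before any follow-up starts."""
    barrier = threading.Barrier(worker_count)
    lock = threading.Lock()
    finished = []                 # workers that completed the calculation
    seen_at_followup = []         # how many had finished when follow-up ran

    def worker(worker_id):
        with lock:
            finished.append(worker_id)      # the "calculation" is done
        barrier.wait()                      # block until every worker arrives
        with lock:
            seen_at_followup.append(len(finished))

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return seen_at_followup


print(run_with_barrier(4))        # [4, 4, 4, 4]
```

Every worker reports that all four calculations were complete when its follow-up step ran, which is exactly the property the barrier is meant to give.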
Usage in a Data-Center
Complex Ad-Serving environment
Zookeeper is also useful in a complex Data-Center environment.
Let us consider the case of a Complex Ad-Serving system. It consists of several components:
- Database for Campaign data and Fiscal transactions
- Ad-serving engines for serving the best Advertisements for the customers
- Campaign planners for advertisers to run campaigns and simulations
- Log collection engines for Data Warehousing, and data planning.
- Data analytics and modeling systems
- Fraud detection systems
- Beacons and fault management systems
- Failover servers
One of the most important uses of Zookeeper in these cases is as a “Bootstrap Server”.
- It holds the information needed to contact the “Services”, even when not all of the services are running
- It can store the primary and secondary configurations.
Distributed Service Locator
The Distributed Service Locator gives services a way to find and access other services
- Services may come up, and use “Leader Election” to decide the configurations.
- They can store their status, which can then be queried.
Distributed System State
Zookeeper is usually used to maintain top-level system states, so that
- An up-to-date directory of which machine is running which service may be maintained
- This directory may be used by the Monitoring Software to decide which machines should be monitored and how.
To make configuration changes, and push them to the servers that use them,
- Zookeeper can proactively push the configuration when a new configuration is created
- It can also be used to push software onto each of the clusters.