Getting Your Data into HDFS and HBase in Real Time Using Apache Flume
Hari Shreedharan
Software Engineer
Cloudera

Thursday, August 22, 2013
03:00 PM - 03:45 PM

Level:  Technical - Intermediate


Getting real-time production data into storage systems like HDFS and HBase can be challenging. Because these systems are designed to scale out, the system that feeds data into them must also scale out, handle intermittent network or system downtime, tolerate load spikes, and run on readily available hardware. Flume was designed from the ground up with these requirements in mind: it scales out and tolerates downtime without losing data or blocking the systems that generate it. Flume is easy to deploy and maintain, and it is highly pluggable, so users can deploy their own plugins to change the behavior of Flume agents and the way Flume writes data out to HDFS and HBase. Flume configuration is declarative and uses a simple properties file format; a sample agent configuration is sketched after the topic list below. In this talk, we will discuss:
  • the basics of Flume
  • the design and implementation of the framework and the various built-in components
  • how Flume guarantees event delivery and the durability guarantees of its data stores
  • writing custom components and plugins for Flume and deploying them to Flume agents (a custom interceptor is sketched below)
  • capacity planning, configuring, deploying, and monitoring Flume clusters
  • sample deployments and use cases
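As an illustration of the properties-file configuration mentioned above, here is a minimal sketch of a single-agent setup. The agent name (agent1), the component names, and the HDFS path are placeholders chosen for this example rather than values from the session.

    # A single agent named "agent1" with one source, one channel, and one sink.
    agent1.sources  = src1
    agent1.channels = ch1
    agent1.sinks    = sink1

    # Source: accept newline-delimited events on a local TCP port.
    agent1.sources.src1.type     = netcat
    agent1.sources.src1.bind     = localhost
    agent1.sources.src1.port     = 44444
    agent1.sources.src1.channels = ch1

    # Channel: buffer events in memory between the source and the sink.
    agent1.channels.ch1.type                = memory
    agent1.channels.ch1.capacity            = 10000
    agent1.channels.ch1.transactionCapacity = 1000

    # Sink: write events to HDFS, bucketed by date.
    agent1.sinks.sink1.type          = hdfs
    agent1.sinks.sink1.hdfs.path     = hdfs://namenode:8020/flume/events/%Y-%m-%d
    agent1.sinks.sink1.hdfs.fileType = DataStream
    agent1.sinks.sink1.channel       = ch1

An agent reading this file would typically be started with the standard launcher, for example: flume-ng agent --conf conf --conf-file flume.conf --name agent1.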

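For the custom components and plugins mentioned in the list above, the following is a minimal sketch of a custom interceptor, assuming the Apache Flume 1.x interceptor API. The class name and the header it adds are hypothetical examples, not components covered in the session.

    import java.util.List;
    import java.util.Map;

    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.interceptor.Interceptor;

    // Hypothetical example: tags every event with an "ingestTime" header
    // before the event is committed to the channel.
    public class TimestampTagInterceptor implements Interceptor {

      @Override
      public void initialize() {
        // No setup needed for this simple example.
      }

      @Override
      public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();
        headers.put("ingestTime", Long.toString(System.currentTimeMillis()));
        return event;
      }

      @Override
      public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
          intercept(event);
        }
        return events;
      }

      @Override
      public void close() {
        // Nothing to clean up.
      }

      // Flume instantiates interceptors through a Builder named in the
      // agent's properties file.
      public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
          return new TimestampTagInterceptor();
        }

        @Override
        public void configure(Context context) {
          // No configuration parameters for this example.
        }
      }
    }

A compiled plugin like this is placed on the agent's classpath (for example under the plugins.d directory) and enabled by naming its Builder class in the source's interceptor configuration.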

Hari is a Software Engineer at Cloudera, working on Apache Flume. He is also a committer and PMC member on the Apache Flume project. Previously, Hari worked on the Yahoo! Mail Metadata Indexing team. Hari holds a Master's in Computer Science from Cornell University and a Bachelor's in Information Technology from the National Institute of Technology, Jaipur, India.


   