Getting Your Data into HDFS and HBase in Real Time Using Apache Flume
Hari Shreedharan
Software Engineer
Cloudera

Thursday, August 22, 2013
03:00 PM - 03:45 PM

Level:  Technical - Intermediate


Getting real-time production data into storage systems like HDFS and HBase can be challenging. Because these systems are designed to scale out, the system that feeds data into them must also scale out, handle intermittent network or system downtime, tolerate load spikes, and run on readily available hardware. Flume was designed from the ground up with these requirements in mind: it scales out and tolerates downtime without losing data or blocking the systems that generate it. Flume is easy to deploy and maintain, and it is highly pluggable, so users can deploy their own plugins to change the behavior of Flume agents and the way Flume writes data out to HDFS and HBase. Flume configuration is declarative and uses a simple properties file format; a sample agent configuration is sketched after the topic list below. In this talk, we will discuss:
  • the basics of Flume
  • the design and implementation of the framework and the various built-in components
  • how Flume guarantees event delivery and the durability guarantees of its data stores
  • writing custom components and plugins for Flume and deploying them to Flume agents (a custom interceptor is sketched below)
  • capacity planning, configuring, deploying, and monitoring Flume clusters
  • sample deployments and use cases
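As an illustration of the properties-file configuration mentioned above, here is a minimal sketch of a single-agent setup. The agent name (agent1), the component names, and the HDFS path are placeholders chosen for this example rather than values from the session.

    # A single agent named "agent1" with one source, one channel, and one sink.
    agent1.sources  = src1
    agent1.channels = ch1
    agent1.sinks    = sink1

    # Source: accept newline-delimited events on a local TCP port.
    agent1.sources.src1.type     = netcat
    agent1.sources.src1.bind     = localhost
    agent1.sources.src1.port     = 44444
    agent1.sources.src1.channels = ch1

    # Channel: buffer events in memory between the source and the sink.
    agent1.channels.ch1.type                = memory
    agent1.channels.ch1.capacity            = 10000
    agent1.channels.ch1.transactionCapacity = 1000

    # Sink: write events to HDFS, bucketed by date.
    agent1.sinks.sink1.type          = hdfs
    agent1.sinks.sink1.hdfs.path     = hdfs://namenode:8020/flume/events/%Y-%m-%d
    agent1.sinks.sink1.hdfs.fileType = DataStream
    agent1.sinks.sink1.channel       = ch1

An agent reading this file would typically be started with the standard launcher, for example: flume-ng agent --conf conf --conf-file flume.conf --name agent1.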

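For the custom components and plugins mentioned in the list above, the following is a minimal sketch of a custom interceptor, assuming the Apache Flume 1.x interceptor API. The class name and the header it adds are hypothetical examples, not components covered in the session.

    import java.util.List;
    import java.util.Map;

    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.interceptor.Interceptor;

    // Hypothetical example: tags every event with an "ingestTime" header
    // before the event is committed to the channel.
    public class TimestampTagInterceptor implements Interceptor {

      @Override
      public void initialize() {
        // No setup needed for this simple example.
      }

      @Override
      public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();
        headers.put("ingestTime", Long.toString(System.currentTimeMillis()));
        return event;
      }

      @Override
      public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
          intercept(event);
        }
        return events;
      }

      @Override
      public void close() {
        // Nothing to clean up.
      }

      // Flume instantiates interceptors through a Builder named in the
      // agent's properties file.
      public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
          return new TimestampTagInterceptor();
        }

        @Override
        public void configure(Context context) {
          // No configuration parameters for this example.
        }
      }
    }

A compiled plugin like this is placed on the agent's classpath (for example under the plugins.d directory) and enabled by naming its Builder class in the source's interceptor configuration.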

Hari is a Software Engineer at Cloudera, working on Apache Flume. He is also a committer and PMC member on the Apache Flume project. Previously, Hari worked on the Yahoo! Mail Metadata Indexing team. Hari holds a Master's in Computer Science from Cornell University and a Bachelor's in Information Technology from the National Institute of Technology, Jaipur, India.


   