Distributed Apache Flume Setup With an HDFS Sink
I have 3 kinds of servers all running Ubuntu 10.04 locally:
hadoop-agent-1: This is the agent which is producing all the logs
hadoop-collector-1: This is the collector which is aggregating all the logs (from hadoop-agent-1, agent-2, agent-3, etc)
hadoop-master-1: This is the flume master node which is sending out all the commands
To add the CDH3 repository:
Create a new file /etc/apt/sources.list.d/cloudera.list with the following contents:
deb http://archive.cloudera.com/debian <RELEASE>-cdh3 contrib
deb-src http://archive.cloudera.com/debian <RELEASE>-cdh3 contrib
where <RELEASE> is the name of your distribution, which you can find by running lsb_release -c. For example, to install CDH3 for Ubuntu Lucid, use lucid-cdh3 in the lines above.
(To install a different version of CDH on a Debian system, specify the version number you want in the -cdh3 section of the deb command. For example, to install CDH3 Update 0 for Ubuntu Maverick, use maverick-cdh3u0 in the command above.)
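For example, since the servers here run Ubuntu 10.04, lsb_release -c reports lucid, so the finished /etc/apt/sources.list.d/cloudera.list would contain:
deb http://archive.cloudera.com/debian lucid-cdh3 contrib
deb-src http://archive.cloudera.com/debian lucid-cdh3 contrib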
(Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing the following command:
$ curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
This key enables you to verify that you are downloading genuine packages.
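If you want to confirm the key was imported, apt-key can list it (the grep pattern is just a guess at how the key entry is labeled):
apt-key list | grep -i cloudera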
Initial Setup
On both hadoop-agent-1 and hadoop-collector-1, you’ll have to install flume-node (flume-node contains the files necessary to run the agent or the collector).
sudo apt-get update
sudo apt-get install flume-node
On hadoop-master-1:
sudo apt-get update
sudo apt-get install flume-master
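The CDH3 packages ship init scripts for both daemons. The master may already be running after installation; if not, it can be started the same way flume-node is started later in this post (the flume-master script name is an assumption mirroring the flume-node one):
sudo /etc/init.d/flume-master start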
First let’s jump onto the agent and set that up. Change your /etc/flume/conf/flume-site.xml to look like the following, tuning the hadoop-master-1 and hadoop-collector-1 values to match your own hosts:
<configuration>
  <property>
    <name>flume.master.servers</name>
    <value>hadoop-master-1</value>
    <description>This is the address for the config servers status server (http)</description>
  </property>
  <property>
    <name>flume.collector.event.host</name>
    <value>hadoop-collector-1</value>
    <description>This is the host name of the default 'remote' collector.</description>
  </property>
  <property>
    <name>flume.collector.port</name>
    <value>35853</value>
    <description>This default tcp port that the collector listens to in order to receive events it is collecting.</description>
  </property>
  <property>
    <name>flume.agent.logdir</name>
    <value>/tmp/flume-${user.name}/agent</value>
    <description>This is the directory that write-ahead logging data or disk-failover data is collected from applications gets written to. The agent watches this directory.</description>
  </property>
</configuration>
Now on to the collector. Same file, different config.
<configuration>
  <property>
    <name>flume.master.servers</name>
    <value>hadoop-master-1</value>
    <description>This is the address for the config servers status server (http)</description>
  </property>
  <property>
    <name>flume.collector.event.host</name>
    <value>hadoop-collector-1</value>
    <description>This is the host name of the default 'remote' collector.</description>
  </property>
  <property>
    <name>flume.collector.port</name>
    <value>35853</value>
    <description>This default tcp port that the collector listens to in order to receive events it is collecting.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-master-1:8020</value>
  </property>
  <property>
    <name>flume.agent.logdir</name>
    <value>/tmp/flume-${user.name}/agent</value>
    <description>This is the directory that write-ahead logging data or disk-failover data is collected from applications gets written to. The agent watches this directory.</description>
  </property>
  <property>
    <name>flume.collector.dfs.dir</name>
    <value>file:///tmp/flume-${user.name}/collected</value>
    <description>This is a dfs directory that is the final resting place for logs to be stored in. This defaults to a local dir in /tmp but can be a hadoop URI path such as hdfs://namenode/path/</description>
  </property>
  <property>
    <name>flume.collector.dfs.compress.gzip</name>
    <value>true</value>
    <description>Writes compressed output in gzip format to dfs. value is boolean type, i.e. true/false</description>
  </property>
  <property>
    <name>flume.collector.roll.millis</name>
    <value>60000</value>
    <description>The time (in milliseconds) between when hdfs files are closed and a new file is opened (rolled).</description>
  </property>
</configuration>
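One thing worth checking before moving on: the collector has to be able to write to the target path in HDFS. Depending on how permissions are set up on your cluster, you may need to pre-create the directory and hand it over to the user the collector runs as; the flume user name and the /flume/logs path below are assumptions based on this setup, not steps from the original walkthrough:
# run as the HDFS superuser if your cluster enforces permissions
hadoop fs -mkdir /flume/logs
hadoop fs -chown -R flume /flume/logs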
Web Based Setup
I chose to do the individual machine setup via the master web interface. You can get to it by pointing your web browser at http://hadoop-master-1:35871/ (replace hadoop-master-1 with the public or private DNS name or IP of your flume master, or set up /etc/hosts with a hostname). Ensure that the port is accessible from the outside through your security settings. At this point, it was easiest for me to ensure all hosts running flume could talk to all ports on all other hosts running flume. You can certainly lock this down to the individual ports for security once everything is up and running.
At this point, you should go to hadoop-agent-1 and hadoop-collector-1 and run /etc/init.d/flume-node start. If everything goes well, the master (whose address is specified in their configs) should be notified of their existence. Now you can configure them from the web. Click on the config link and then fill in the fields with the values shown below:
Agent Node: hadoop-agent-1
Source: tailDir("/var/logs/apache2/",".*.log")
Sink: agentBESink("hadoop-collector-1",35853)
Note: I chose to use tailDir since I will control rotating the logs on my own. I am also using agentBESink because I am ok with losing log lines if the case arises.
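As an aside, the same mapping can also be pushed from the command line with the flume shell that ships with Flume 0.9.x, rather than through the web form. This is only a sketch of that alternative (it is not part of the original walkthrough), using the escaped-regex form recommended in the notes further down:
flume shell -c hadoop-master-1
# then, at the flume> prompt (node name, paths and regex are the ones from this post):
exec config hadoop-agent-1 'tailDir("/var/logs/apache2/",".*\\.log")' 'agentBESink("hadoop-collector-1",35853)'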
Now click Submit Query and go back to the config page to set up the collector:
Agent Node: hadoop-collector-1
Source: collectorSource(35853)
Sink: collectorSink("hdfs://hadoop-master-1:8020/flume/logs/%Y/%m/%d/%H00","server")
This is going to tell the collector that we are sinking to HDFS with an initial folder of 'flume'. It will then log to sub-folders of the form "flume/logs/YYYY/MM/DD/HH00" (e.g. 2011/02/03/1300/server-.log). Now click Submit Query and go to the 'master' page and you should see 2 commands listed as "SUCCEEDED" in the command history. If they have not succeeded, ensure a few things have been done (there are probably more, but this is a handy start):
Always use double quotes (") since single quotes (') aren't interpreted correctly. UPDATE: single quotes are interpreted correctly, they are just intentionally not accepted (thanks jmhsieh).
In your regex, use something like ".*\\.log" since the '.' before "log" is a regex metacharacter and needs to be escaped.
In your regex, ensure that your backslashes are properly escaped: "foo\\bar" is the correct way to match "foo\bar".
Additionally, there are also tables of Node Status and Node Configuration. These should match up with what you think you configured.
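Once both mappings have taken effect and a few log lines have been tailed, you can also spot-check HDFS directly from any machine with the hadoop client configured; the date in this path is just an illustration matching the example above:
hadoop fs -ls /flume/logs/2011/02/03/1300/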
At this point everything should work. Admittedly I had a lot of trouble getting to this point. But with the help of the Cloudera folks and the users on irc.freenode.net in #flume, I was able to get things going. The logs sadly aren't too helpful here in most cases (but look anyway because they might provide you with more info than they provided for me). If I missed anything in this post or there is something else I am unaware of, then let me know.
Reference: Distributed Apache Flume Setup With an HDFS Sink from our JCG partner Evan Conkle at Evan Conkle's blog.