How Often Does Flume Upload to S3

Flume is a distributed system for aggregating log files into the Hadoop Distributed File System (HDFS). It has a simple design of Events, Sources, Sinks, and Channels which can be connected into a complex multi-hop architecture.

While Flume is designed to be resilient "with tunable reliability mechanisms for fail-over and recovery", in this blog post we'll also look at the reliable forwarding of rsyslog, which we are going to use to store postfix logs in Amazon S3.

A basic design for aggregating postfix log files could look like the following diagram:

(Figure: Flume Basic Design)

In this scenario our Mail (Web) Server emits events to our Flume Agent (Source), which writes these events to a Channel. The Channel stacks these events and the Sink then picks them up and finally writes them to S3 (HDFS). So for this scenario we only need to define a Flume Source, Sink, and Channel to collect our postfix logs, with the Sink writing to S3.
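In Flume terms this wiring is expressed in a properties file, where an agent names its sources, sinks, and channels and then binds them together. A minimal sketch of that pattern with generic names (not our final configuration, which follows below):

# generic shape of a Flume agent definition (illustrative names only)
agent.sources = src
agent.sinks = snk
agent.channels = ch

agent.sources.src.channels = ch   # the source writes into the channel
agent.sinks.snk.channel = ch      # the sink reads from the channel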

Setting Up Flume

Setting up Flume to use Simple Storage Service (S3) is quite simple, as we can use the HDFS Sink of Flume and Hadoop's capability to "natively" write to S3. Prior to this, Hadoop needs to be installed and set up on your system. For setting up Hadoop please read here or here. Please be aware that we don't need any of the daemons running, we just need the libraries and the configuration!
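Flume's HDFS Sink picks up the Hadoop classes and configuration from the machine it runs on, typically via the hadoop command on the PATH or via FLUME_CLASSPATH in flume-env.sh. A quick sanity check along these lines can save some debugging later (this assumes a standard installation):

# make sure the hadoop command is available to the user running Flume
which hadoop
hadoop classpath    # prints the jars and config dirs the HDFS Sink will be able to see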

To configure S3 as the default file system for Hadoop we can change the configuration in mapred-site.xml appropriately:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>s3n://AWS_KEY_ID:AWS_SECRET_ACCESS_KEY@BUCKET_NAME</value>
  </property>
</configuration>
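Depending on your Hadoop version, fs.default.name may belong in core-site.xml rather than mapred-site.xml. If you prefer not to embed the credentials in the URI, the s3n filesystem also accepts them as separate properties; a sketch:

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>AWS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>AWS_SECRET_ACCESS_KEY</value>
</property>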

This should be enough to make Hadoop use S3 as its underlying file storage. If set up correctly you should also be able to use hadoop fs to scan your S3 bucket:

hadoop fs -ls

You can also always override the default file system at the command prompt using the -fs flag to try out different settings. If you don't succeed in configuring Hadoop to use S3, try to make it work this way first and then change the configuration:

hadoop fs -fs s3n://AWS_KEY_ID:AWS_SECRET_ACCESS_KEY@BUCKET_NAME -ls /

 S3 Postfix Flume Agent

Next we are going to configure a Flume agent that we can use to collect postfix logs. Our agent will consist of a Source collecting syslog events listening on a certain port, a Channel the Source can write to, and a Sink picking up logs from the channel and writing them into S3 (HDFS).

We'll run a single agent, and the channel is going to be a memory channel; neither is a very reliable design. As it is important to build a system to be robust from the ground up, in our scenario we'll heavily rely on a reliable setup of rsyslog. Skip ahead if you want to read more about that.
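If you wanted more durability on the Flume side as well, the memory channel could be swapped for a file channel, which persists events to disk. A sketch of such a channel (not used in this post, paths are illustrative):

# durable alternative to the memory channel (not used in this setup)
postfix.channels.file_channel.type = file
postfix.channels.file_channel.checkpointDir = /var/lib/flume/checkpoint
postfix.channels.file_channel.dataDirs = /var/lib/flume/data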

Our source is going to listen for syslog events on port 5140:

postfix.sources.syslog_tcp.type = multiport_syslogtcp   # tcp for a reliable connection
postfix.sources.syslog_tcp.host = 0.0.0.0
postfix.sources.syslog_tcp.ports = 5140

Our sink is going to use HDFS and write into the /postfix folder. We'll store all the logs in a folder hierarchy containing the hostname (IP) and day (y-m-d). All files will also have a mail_log prefix. To actually use the hostname and day of the event, Flume Interceptors are used, which populate the event header at the source. Our sink:

postfix.sinks.s3.type = hdfs
postfix.sinks.s3.hdfs.path = /postfix/%{hostname}/%y-%m-%d
postfix.sinks.s3.hdfs.filePrefix = mail_log
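How often files actually get closed and become visible in S3 is governed by the HDFS Sink's roll settings; by default it rolls small files quite frequently. The values below are purely illustrative and not part of the original sink definition:

# illustrative roll settings (adjust to taste, not part of the original config)
postfix.sinks.s3.hdfs.rollInterval = 300    # close a file every 5 minutes ...
postfix.sinks.s3.hdfs.rollSize = 67108864   # ... or once it reaches 64 MB
postfix.sinks.s3.hdfs.rollCount = 0         # don't roll based on event count
postfix.sinks.s3.hdfs.batchSize = 100       # events written per flush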

Flume Interceptors to populate the event headers with host and day:

postfix.sources.syslog_tcp.interceptors = i1 i2
postfix.sources.syslog_tcp.interceptors.i1.type = timestamp
postfix.sources.syslog_tcp.interceptors.i2.type = host
postfix.sources.syslog_tcp.interceptors.i2.hostHeader = hostname
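By default the host interceptor stores the agent's IP address in that header; if you would rather see a hostname in the S3 path, it can be told to resolve it. This is an optional tweak, not part of the original configuration:

# optional: put the FQDN instead of the IP into the "hostname" header
postfix.sources.syslog_tcp.interceptors.i2.useIP = false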

The interceptor is part of our source and helps to write the event headers we can use when storing the event to the file system.

As already mentioned, the channel is going to be a simple memory channel, which is designed for high throughput and little resilience. All in all our Flume agent looks like this:

postfix.sources = syslog_tcp
postfix.sinks = s3
postfix.channels = mem_channel

postfix.sources.syslog_tcp.type = multiport_syslogtcp
postfix.sources.syslog_tcp.host = 0.0.0.0
postfix.sources.syslog_tcp.ports = 5140
#postfix.sources.syslog_tcp.interceptors = i1 i2
#postfix.sources.syslog_tcp.interceptors.i1.type = timestamp
#postfix.sources.syslog_tcp.interceptors.i2.type = host
#postfix.sources.syslog_tcp.interceptors.i2.hostHeader = hostname

postfix.sinks.s3.type = hdfs
postfix.sinks.s3.hdfs.path = /postfix/%{hostname}/%y-%m-%d
postfix.sinks.s3.hdfs.filePrefix = mail_log

postfix.channels.mem_channel.type = memory
postfix.channels.mem_channel.capacity = 1000
postfix.channels.mem_channel.transactionCapacity = 100

postfix.sources.syslog_tcp.channels = mem_channel
postfix.sinks.s3.channel = mem_channel

We can now run this agent and test it by connecting to it with telnet:

flume-ng agent -c flume_example -f flume_example/flume_postfix.conf --name postfix -Dflume.root.logger=INFO,console &
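With the agent running you can verify that the source is actually bound to port 5140, for example (assuming a Linux host):

# the multiport syslog source should be listening on all interfaces
netstat -tlnp | grep 5140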

To test if everything is working correctly we can connect to the running Flume agent using telnet and write "events" to it. If everything works, files containing syslog events serialized as Sequence Files should appear in the configured bucket.

echo "Flume Example Postfix" | telnet localhost 5140

Setting Up rsyslog

We now have our Flume agent running, collecting syslog events that are directed to port 5140. Configuring rsyslog to send postfix messages to that port is a single line in the configuration. But we also want rsyslog to do this in a reliable way. If the Flume agent dies we want rsyslog to spill the events to a temporary output and, as soon as the agent is running again, to send these queued messages. This can be accomplished by using a so-called ActionQueue. This is going to be our cornerstone of reliability. For an in-depth reading of rsyslog ActionQueues please read here.

$ActionQueueType LinkedList         # use asynchronous processing
$ActionQueueFileName flume_postfix  # set file name, also enables disk mode
$ActionResumeRetryCount -1          # infinite retries on insert failure
$ActionQueueSaveOnShutdown on       # save in-memory data if rsyslog shuts down
mail.*       @@localhost:5140       # filter messages, only send mail/postfix messages
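The on-disk part of the queue can additionally be bounded so that a long Flume outage doesn't fill the disk; for example (the value is illustrative):

$ActionQueueMaxDiskSpace 1g         # cap the on-disk queue at roughly 1 GB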

To configure rsyslog you create a *.conf file under /etc/rsyslog.d. Please check that $IncludeConfig /etc/rsyslog.d/*.conf is present in /etc/rsyslog.conf to be sure your config file is going to be read. In my case I simply used /etc/rsyslog.d/50-default.conf where the filters for mail.* were already present.

You can see in the configuration I use that, through rsyslog ActionQueues, messages are queued if they cannot be sent to port 5140 on localhost. You should note that lines starting with the word "Action" modify the next action and should be specified in front of it (/usr/share/doc/rsyslog-doc/html/rsyslog_conf_global.html). In our case we change the mail.* @@localhost:5140 action to use a LinkedList.

rsyslog will notice if the Flume agent is down, as we are using TCP. This would not be possible if using UDP.
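After restarting rsyslog, an end-to-end test can be done with logger, which writes a message to the mail facility just like postfix would (the commands assume Ubuntu-style service management):

sudo service rsyslog restart                     # pick up the new ActionQueue configuration
logger -p mail.info "postfix flume s3 test"      # emit a test message to the mail facility
hadoop fs -ls /postfix                           # after the sink rolls, the event should appear here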

Further Readings

  • Flume User Guide
  • Reliable Forwarding of syslog Messages with Rsyslog
  • Hadoop: The Definitive Guide (Amazon)
  • Hadoop Wiki: AmazonS3
  • S3 sink on flumeNG (Jira Ticket)
  • How To Refine and Visualize Sentiment Data (Hortonworks)

Source: https://henning.kropponline.de/2014/06/01/reliable-postfix-logs-in-s3-w-flume-rsyslog/
