Log4j, Stat4j, SMTPAppender Integration – Aggregating Error Logs to Send Email When Too Many
In this post I want to present a simple solution we have implemented using a custom Log4J Appender based on Stats4j and an SMTPAppender (which is more difficult to configure and troubleshoot than you might expect).
The Challenge
We faced the following challenges with the logs:
- It’s unfortunately normal to have certain number of exceptions (customers select search criteria yielding no results, temporary, unimportant outages of external services etc.) and we certainly don’t want to be spammed because of that. So the solution must have a configurable threshold and only send an alert when it is exceeded.
- The failure rate should be computed for a configurable period (long enough not to trigger an alert because of few-minutes outages yet short enough for the team to be informed ASAP when something serious happens).
- Once an alert is send, no further alerts should be send again for some time (ideally until the original problem is fixed), we don’t want to be spammed because of a problem we already know about.
The Solution
We’ve based our solution on Lara D’Abreo’s Stat4J, which provides a custom Log4J appender that uses the logs to compute configurable measures and triggers alerts when they exceed their warning or critical thresholds. It is couple of years old, alpha-quality (regarding its generality and flexibility) open-source library, which is fortunately simple enough to be modified easily for one’s needs.
So we have tweaked Stat4J to produce alerts when the number of alerts exceeds thresholds and keep quiet thereafter and combined that with a Log4J SMTPAppender that listens for the alerts and sends them via e-mail to the team.
Stat4J Tweaking
The key components of Stat4J are the Stat4jAppender for Log4J itself, calculators (measures) that aggregate the individual logs (e.g. by counting them or extracting some number form them), statistics that define which logs to consider via regular expressions and how to process them by referencing a calculator, and finally alerts that log a warning when the value of a statistics exceeds its limits. You can learn more in an article that introduces Stat4J.
We have implemented a custom measure calculator, RunningRate (to count the number of failures in the last N minutes) and modified Stat4J as follows:
- We’ve enhanced Alert to support a new attribute, quietperiod, so that once triggered, subsequent alerts will be ignored for that duration (unless the previous alert was just a warning while the new one is a critical one)
- We’ve modified the appender to include the log’s Throwable together with the log message, which is then passed to the individual statistics calculators, so that we could filter more precisely what we want to count
- Finally we’ve modified Alert to log alerts as errors instead of warnings so that the SMTPAppender wouldn’t ignore them
Get our modified Stat4j from GitHub (sources or a compiled jar). Disclaimer: It is one day’s hack and I’m not proud of the code.
Stat4J Configuration
Take the example stat4j.properties and put it on the classpath. It is already configured with the correct calculator, statistics, and alert. See this part:
### JAKUB HOLY - MY CONFIG calculator.minuteRate.classname=net.sourceforge.stat4j.calculators.RunningRate # Period is in [ms] 1000 * 60 * 10 = 10 min: calculator.minuteRate.period=600000 statistic.RunningErrorRate.description=Errors per 10 minutes statistic.RunningErrorRate.calculator=minuteRate # Regular expression to match "<throwable.toString> <- <original log message>" statistic.RunningErrorRate.first.match=.*Exception.* # Error Rate alert.TooManyErrorsRecently.description=Too many errors in the log alert.TooManyErrorsRecently.statistic=RunningErrorRate alert.TooManyErrorsRecently.warn= >=3 alert.TooManyErrorsRecently.critical= >=10 alert.TooManyErrorsRecently.category=alerts # Ignore following warnings (or criticals, after the first critical) for the given amount of time: # 1000 * 60 * 100 = 100 min alert.TooManyErrorsRecently.quietperiod=6000000
The important config params are
- calculator.minuteRate.period (in ms) – count errors over this period, reset the count at its end; a reasonable value may be 10 minutes
- alert.TooManyErrorsRecently.warn and alert.TooManyErrorsRecently.critical – trigger the alert when so many errors in the period has been encountered; reasonable values depend on your application’s normal error rate
- alert.TooManyErrorsRecently.quietperiod (in ms) – don’t send further alerts for this period not to spam in a persistent failure situation; the reasonable value depends on how quickly you usually fix problems, 1 hour would seem OK to me
Log4J Configuration
Now we need to tell Log4J to use the Stat4j appender to count error occurences and to send alerts via email:
log4j.rootCategory=DEBUG, Console, FileAppender, Stat4jAppender ... ### Stat4jAppender & EmailAlertsAppender ### # Collects statistics about logs and sends alerts when there # were too many failures in cooperation with the EmailAlertsAppender ## Stat4jAppender log4j.appender.Stat4jAppender=net.sourceforge.stat4j.log4j.Stat4jAppender log4j.appender.Stat4jAppender.Threshold=ERROR # For configuration see stat4j.properties ## EmailAlertsAppender # BEWARE: SMTPAppender ignores its Thresholds and only evers sends ERROR or higher messages log4j.category.alerts=ERROR, EmailAlertsAppender log4j.appender.EmailAlertsAppender=org.apache.log4j.net.SMTPAppender log4j.appender.EmailAlertsAppender.To=dummy@example.com # BEWARE: The address below must have a valid domain or some receivers will reject it (e.g. GMail) log4j.appender.EmailAlertsAppender.From=noreply-stat4j@google.no log4j.appender.EmailAlertsAppender.SMTPHost=172.20.20.70 log4j.appender.EmailAlertsAppender.BufferSize=1 log4j.appender.EmailAlertsAppender.Subject=[Stat4j] Too many exceptions in log log4j.appender.EmailAlertsAppender.layout=org.apache.log4j.PatternLayout log4j.appender.EmailAlertsAppender.layout.ConversionPattern=%d{ISO8601} %-5p %X{clientIdentifier} %c %x - %m%n
Comments
- #8 Specify the Stat4J appender
- #9 Only send ERRORs to Stat4J, we are not interested in less serious exceptions
- #14 “alerts” is the log category used by Stat4jAppender to log alerts (the same you would create via Logger.getLogger(“alerts”)); as mentioned, SMTPAppender will without respect to the configuration only process ERRORs and higher
Issues with the SMTPAppender
It is quite tricky to get the SMTPAppender working. Some pitfalls:
- SMTPAppender ignores all logs that are not ERROR or higher without respect to how you set its threshold
- If you specify a non-existing From domain then some recipient’s mail servers can just delete the email as spam (e.g. GMail)
- To send emails, you of course need mail.jar (and for older JVMs also activation.jar), here are instructions for Tomcat
And one $100 tip: to debug it, run your application in the debug mode and set a method breakpoint on javax.mail.Transport#send (you don’t need the source code) and when there, set this.session.debug to true to get a very detailed log of the following SMTP communication in the server log.
Sidenote
The fact that this article is based on Log4J doesn’t mean I’d personally choose it, it just came with the project. I’d at least consider using the newer and shiny Logback instead :-) .
Conclusion
Stat4j + SMTPAppender are a very good base for a rather flexible do-it-yourself alerting system based on logs and e-mail. You can achieve the same thing out-out-the-box with Hyperic HQ plus.
Reference: Aggregating Error Logs to Send a Warning Email When Too Many of Them – Log4j, Stat4j, SMTPAppender from our JCG partner Jakub Holý at “The Holy Java” Blog.