Enterprise Java

Log4j, Stat4j, SMTPAppender Integration – Aggregating Error Logs to Send Email When Too Many

Our development team wanted to get notified as soon as something goes wrong in our production system, a critical Java web application serving thousands of customers daily. The idea was to let it send us an email when there are too many errors, indicating usually a problem with a database, an external web service, or something really bad with the application itself.

In this post I want to present a simple solution we have implemented using a custom Log4J Appender based on Stats4j and an SMTPAppender (which is more difficult to configure and troubleshoot than you might expect).

The Challenge

We faced the following challenges with the logs:

  • It’s unfortunately normal to have certain number of exceptions (customers select search criteria yielding no results, temporary, unimportant outages of external services etc.) and we certainly don’t want to be spammed because of that. So the solution must have a configurable threshold and only send an alert when it is exceeded.
  • The failure rate should be computed for a configurable period (long enough not to trigger an alert because of few-minutes outages yet short enough for the team to be informed ASAP when something serious happens).
  • Once an alert is send, no further alerts should be send again for some time (ideally until the original problem is fixed), we don’t want to be spammed because of a problem we already know about.

The Solution

We’ve based our solution on Lara D’Abreo’s Stat4J, which provides a custom Log4J appender that uses the logs to compute configurable measures and triggers alerts when they exceed their warning or critical thresholds. It is couple of years old, alpha-quality (regarding its generality and flexibility) open-source library, which is fortunately simple enough to be modified easily for one’s needs.

So we have tweaked Stat4J to produce alerts when the number of alerts exceeds thresholds and keep quiet thereafter and combined that with a Log4J SMTPAppender that listens for the alerts and sends them via e-mail to the team.

Stat4J Tweaking

The key components of Stat4J are the Stat4jAppender for Log4J itself, calculators (measures) that aggregate the individual logs (e.g. by counting them or extracting some number form them), statistics that define which logs to consider via regular expressions and how to process them by referencing a calculator, and finally alerts that log a warning when the value of a statistics exceeds its limits. You can learn more in an article that introduces Stat4J.

We have implemented a custom measure calculator, RunningRate (to count the number of failures in the last N minutes) and modified Stat4J as follows:

  • We’ve enhanced Alert to support a new attribute, quietperiod, so that once triggered, subsequent alerts will be ignored for that duration (unless the previous alert was just a warning while the new one is a critical one)
  • We’ve modified the appender to include the log’s Throwable together with the log message, which is then passed to the individual statistics calculators, so that we could filter more precisely what we want to count
  • Finally we’ve modified Alert to log alerts as errors instead of warnings so that the SMTPAppender wouldn’t ignore them

Get our modified Stat4j from GitHub (sources or a compiled jar). Disclaimer: It is one day’s hack and I’m not proud of the code.

Stat4J Configuration

Take the example stat4j.properties and put it on the classpath. It is already configured with the correct calculator, statistics, and alert. See this part:

### JAKUB HOLY - MY CONFIG
calculator.minuteRate.classname=net.sourceforge.stat4j.calculators.RunningRate
# Period is in [ms] 1000 * 60 * 10 = 10 min:
calculator.minuteRate.period=600000

statistic.RunningErrorRate.description=Errors per 10 minutes
statistic.RunningErrorRate.calculator=minuteRate
# Regular expression to match "<throwable.toString> <- <original log message>"
statistic.RunningErrorRate.first.match=.*Exception.*

# Error Rate
alert.TooManyErrorsRecently.description=Too many errors in the log
alert.TooManyErrorsRecently.statistic=RunningErrorRate
alert.TooManyErrorsRecently.warn= >=3
alert.TooManyErrorsRecently.critical= >=10
alert.TooManyErrorsRecently.category=alerts
# Ignore following warnings (or criticals, after the first critical) for the given amount of time:
# 1000 * 60 * 100 = 100 min
alert.TooManyErrorsRecently.quietperiod=6000000

The important config params are

  • calculator.minuteRate.period (in ms) – count errors over this period, reset the count at its end; a reasonable value may be 10 minutes
  • alert.TooManyErrorsRecently.warn and alert.TooManyErrorsRecently.critical – trigger the alert when so many errors in the period has been encountered; reasonable values depend on your application’s normal error rate
  • alert.TooManyErrorsRecently.quietperiod (in ms) – don’t send further alerts for this period not to spam in a persistent failure situation; the reasonable value depends on how quickly you usually fix problems, 1 hour would seem OK to me

Log4J Configuration

Now we need to tell Log4J to use the Stat4j appender to count error occurences and to send alerts via email:

log4j.rootCategory=DEBUG, Console, FileAppender, Stat4jAppender
...
### Stat4jAppender & EmailAlertsAppender ###
# Collects statistics about logs and sends alerts when there
# were too many failures in cooperation with the EmailAlertsAppender

## Stat4jAppender
log4j.appender.Stat4jAppender=net.sourceforge.stat4j.log4j.Stat4jAppender
log4j.appender.Stat4jAppender.Threshold=ERROR
# For configuration see stat4j.properties

## EmailAlertsAppender
# BEWARE: SMTPAppender ignores its Thresholds and only evers sends ERROR or higher messages
log4j.category.alerts=ERROR, EmailAlertsAppender
log4j.appender.EmailAlertsAppender=org.apache.log4j.net.SMTPAppender
log4j.appender.EmailAlertsAppender.To=dummy@example.com
# BEWARE: The address below must have a valid domain or some receivers will reject it (e.g. GMail)
log4j.appender.EmailAlertsAppender.From=noreply-stat4j@google.no
log4j.appender.EmailAlertsAppender.SMTPHost=172.20.20.70
log4j.appender.EmailAlertsAppender.BufferSize=1
log4j.appender.EmailAlertsAppender.Subject=[Stat4j] Too many exceptions in log
log4j.appender.EmailAlertsAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.EmailAlertsAppender.layout.ConversionPattern=%d{ISO8601} %-5p %X{clientIdentifier} %c %x - %m%n

Comments

  • #8 Specify the Stat4J appender
  • #9 Only send ERRORs to Stat4J, we are not interested in less serious exceptions
  • #14 “alerts” is the log category used by Stat4jAppender to log alerts (the same you would create via Logger.getLogger(“alerts”)); as mentioned, SMTPAppender will without respect to the configuration only process ERRORs and higher

Issues with the SMTPAppender

It is quite tricky to get the SMTPAppender working. Some pitfalls:

  • SMTPAppender ignores all logs that are not ERROR or higher without respect to how you set its threshold
  • If you specify a non-existing From domain then some recipient’s mail servers can just delete the email as spam (e.g. GMail)
  • To send emails, you of course need mail.jar (and for older JVMs also activation.jar), here are instructions for Tomcat

And one $100 tip: to debug it, run your application in the debug mode and set a method breakpoint on javax.mail.Transport#send (you don’t need the source code) and when there, set this.session.debug to true to get a very detailed log of the following SMTP communication in the server log.

Sidenote

The fact that this article is based on Log4J doesn’t mean I’d personally choose it, it just came with the project. I’d at least consider using the newer and shiny Logback instead :-) .

Conclusion

Stat4j + SMTPAppender are a very good base for a rather flexible do-it-yourself alerting system based on logs and e-mail. You can achieve the same thing out-out-the-box with Hyperic HQ plus.

Reference: Aggregating Error Logs to Send a Warning Email When Too Many of Them – Log4j, Stat4j, SMTPAppender from our JCG partner Jakub Holý at “The Holy Java” Blog.

Related Articles :

Jakub Holy

Jakub is an experienced Java[EE] developer working for a lean & agile consultancy in Norway. He is interested in code quality, developer productivity, testing, and in how to make projects succeed.
Subscribe
Notify of
guest

This site uses Akismet to reduce spam. Learn how your comment data is processed.

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Back to top button