Software Development

Development Horror Story – Mail Bomb

Based on my session idea at JavaOne about things that went terrible wrong in our development careers, I thought about writing a few of these stories. I’ll start with one of my favourites ones: Crashing a customer’s mail server after generating more than 23 Million emails! Yes, that’s right, 23 Millions!

 
 
 
 
 
 
 

email-bomb

History

A few years ago, I’ve joined a project that was being developed for several months, but had no production release yet. Actually, the project was scheduled to replace an existing application in the upcoming weeks. My first task in the project was to figure out what was needed to deploy the application in a production environment and replace the old application.

This application had a considerable amount of users (around 50 k), but not all of them were active. The new application had a new feature to exclude the users that didn’t log into the application for the last few months. This was implemented as a timer (executed daily) and a email notification was sent to that user warning him that he was excluded from the application.

The Problem

The release was installed on a Friday (yes, Friday!), and everyone went for a rest. Monday morning, all hell broke loose! The customer mail server was down, and nobody had any idea why.

The first reports indicated that the mail server was out of disk space, because it had around 2 Million emails pending delivery and a lot more incoming. What the hell happened?

The Cause

Even with the server down, support was able to show us a copy of an email stuck in the server. It was consistent with the email sent when a user was excluded. It didn’t make any sense, because we counted the number of users to be excluded and they were around 28 k, so only 28 k emails should have been sent. Even if all users were excluded the number could not be higher than 50 k (the total number of users).

Invalid Email

Looking into the code, we found out a bug that would cause the user to not be excluded if he had an invalid email. As a consequence these users were caught every time that the timer executed. From the total 28 k users to be excluded, around 26 k had invalid emails. From Friday to Monday, we count 3 executions * 26 k users, so 78k k emails. Ok, so now we have an email increase, but not close enough to the reported numbers.

Timer Bug

Actually the timer also had a bug. It was not scheduled to be executed daily, but every 8 hours. Let’s adjust the numbers: 3 days * 3 executions a day * 26 k users, brings the total to 234 k emails. A considerable increase but still far from a big number.

Additional Node

The operations installed the application in a second node, and the timer was executed in both. So a double increase. Let’s update: 2 * 234 k emails, brings the total to 468 k emails.

No-reply Address

Since the emails were automated, you usually set up a no-reply email as the email sender. Now the problem was that the domain for the no-reply address was invalid. Combining this with the users invalid emails, the mail server entered in a loop state. Each invalid user email generated an error email sent to the no-reply address, which was invalid as well and this caused a returned email again to the server. The loop end when the Maximum hop count is exceeded. In this case it was 50. Now everything starts to make sense! Let’s update the numbers:

26 k users * 3 days * 3 executions * 2 servers * 50 hops for a grand total of 23.4 Million emails!

Aftermath

The customer lost all their email from Friday to Monday, but it was possible to recover the mail server. The problems were fixed and it never happened again. I remember those days, to be very stressful, but today all of us involved, laugh about it!

Remember: always check the no-reply address!

Reference: Development Horror Story – Mail Bomb from our JCG partner Roberto Cortez at the Roberto Cortez Java Blog blog.

Roberto Cortez

My name is Roberto Cortez and I was born in Venezuela, but I have spent most of my life in Coimbra – Portugal, where I currently live. I am a professional Java Developer working in the software development industry, with more than 8 years of experience in business areas like Finance, Insurance and Government. I work with many Java based technologies like JavaEE, Spring, Hibernate, GWT, JBoss AS and Maven just to name a few, always relying on my favorite IDE: IntelliJ IDEA.Most recently, I became a Freelancer / Independent Contractor. My new position is making me travel around the world (an old dream) to customers, but also to attend Java conferences. The direct contact with the Java community made me want to become an active member in the community itself. For that reason, I have created the Coimbra Java User Group, started to contribute to Open Source on Github and launched my own blog (www.radcortez.com), so I can share some of the knowledge that I gained over the years.
Subscribe
Notify of
guest

This site uses Akismet to reduce spam. Learn how your comment data is processed.

2 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
hh
hh
10 years ago

so what ??

this is just an untested code going to prod ?? so what ?? is it really worth for a blog post and my time reading it ??

Roberto Cortez
10 years ago
Reply to  hh

Thank you for you feedback.

Back to top button