Blood, Sweat, and Writing Automated Integration Tests for Failure Scenarios
Last winter, I wrote and released a service for a client I am still working with. Overall the service has met business needs and performance requirements; however, one of the teams that consumes the service told me they were periodically running into an issue where the service would return 500 errors and not return to normal until it was restarted. I asked when this was occurring and put on my detective’s hat.
In this blog, I will walk through the process I went through to diagnose the bug and to fix it the right way. Doing so meant creating a test that accurately reproduced the scenario my service was experiencing in PROD, writing a fix that took that test from failing to passing, and, finally, increasing confidence in the correctness of the code for all future releases, which is only possible through automated testing.
Diagnosing the Bug
I read through my service’s log files around the time the 500 errors started happening. They quickly showed a pretty serious problem: a little before midnight on a Saturday, my service would start throwing errors. At first there was a variety of errors, all SQLExceptions, but eventually the root cause settled into the same thing:
org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:80)
This went on for several hours until early the following morning, when the service was restarted and went back to normal.
Checking with the cave trolls, er, DBAs, I found that the database I was connecting to had gone down for maintenance. The exact details escape me, but I believe the database was down for a roughly 30-minute window. So, clearly, my service had an issue re-connecting to a database once that database recovered from an outage.
Fixing the Bug the Wrong Way
The most straightforward way of fixing this bug (and one I have often resorted to in the past) would have been to Google “recovering from database outage,” which would likely have led me to a Stack Overflow thread answering my question. I would then have copied and pasted the provided answer and pushed the code out to be tested.
If production was being severely affected by a bug, this approach might be necessary in the short-term. That said, time should be set aside in the immediate future to cover the change with an automated test.
Fixing the Bug the Right Way
As is often the case, doing things the “right way” means a significant front-loaded time investment, and that is definitely true here.
The return on investment, however, is less time spent fixing bugs later and increased confidence in the correctness of the code. In addition, tests can be an important form of documentation of how the code should behave in a given scenario.
While this specific test case is a bit esoteric, that documentation value is worth keeping in mind when designing and writing tests, be they unit or integration: give tests good names, make sure test code is readable, and so on.
Solution 1: Mock Everything
My first crack at writing a test for this issue was to try to “mock everything.” While Mockito and other mocking frameworks are quite powerful and are getting ever easier to use, after mulling over this solution, I quickly came to the conclusion that I wouldn’t be testing anything beyond the mocks I had written.
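To make that concern concrete, here is a rough sketch of what such a fully mocked test might look like. The class name and the scripted behavior are hypothetical, not code from the actual service; the point is that the test goes green regardless of how a real connection pool behaves after an outage.

import static org.junit.Assert.assertSame;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import java.sql.Connection;
import java.sql.SQLException;

import javax.sql.DataSource;

import org.junit.Test;

public class MockedDatabaseRecoveryTest {

    @Test
    public void recoversAfterOutage() throws SQLException {
        DataSource dataSource = mock(DataSource.class);
        Connection recoveredConnection = mock(Connection.class);

        // Script the mock: the first call simulates the outage, the second the recovery.
        when(dataSource.getConnection())
                .thenThrow(new SQLException("The Network Adapter could not establish the connection"))
                .thenReturn(recoveredConnection);

        try {
            dataSource.getConnection();
        } catch (SQLException expected) {
            // the scripted "outage"
        }

        // This only proves the mock returned what it was told to return; it says
        // nothing about how the real connection pool handles a recovered database.
        assertSame(recoveredConnection, dataSource.getConnection());
    }
}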
Getting a “green” result would not increase my confidence in the correctness of my code, which is the whole point of writing automated tests in the first place! On to another approach.
Solution 2: Use An In-Memory Database
Using an in-memory database was my next attempt at writing this test. I’m a pretty big proponent of H2; I’ve used it extensively in the past and was hoping it might address my needs here once again. I probably spent more time here than I should have.
While ultimately this approach didn’t pan out, the time spent wasn’t entirely wasted: I did learn a decent bit more about H2. One of the advantages of doing things the “right way” (though often painful in the moment) is that you learn a lot. The knowledge gained might not be useful at the time, but it could prove valuable later.
The Advantages of Using an In-Memory Database
Like I said, I probably spent more time here than I should have, but I did have my reasons for wanting this solution to work. H2, and other in-memory databases, have a couple of very desirable traits:
- Speed: Starting and stopping H2 is quite fast, sub-second. So while a little slower than using mocks, my tests would still be plenty fast.
- Portability: H2 can run entirely from an imported jar, so other developers can just pull down my code and run all the tests without performing any additional steps.
Additionally, my eventual solution has a couple of non-trivial disadvantages, which I will cover as part of that solution below.
Writing the Test
A small but meaningful point: up to now I still hadn’t written a single line of production code. A central principle of TDD is to write the test first and the production code later. This methodology, along with ensuring a high level of test coverage, also encourages the developer to make only the changes that are necessary, which goes back to the goal of increasing confidence in the correctness of your code.
Below is the initial test case I built to test my PROD issue:
@RunWith(SpringRunner.class)
@SpringBootTest(classes = DataSourceConfig.class, properties = {
        "datasource.driver=org.h2.Driver",
        "datasource.url=jdbc:h2:mem:;MODE=ORACLE",
        "datasource.user=test",
        "datasource.password=test" })
public class ITDatabaseFailureAndRecovery {

    @Autowired
    private DataSource dataSource;

    @Test
    public void test() throws SQLException {
        Connection conn = DataSourceUtils.getConnection(dataSource);
        conn.createStatement().executeQuery("SELECT 1 FROM dual");
        ResultSet rs = conn.createStatement().executeQuery("SELECT 1 FROM dual");
        assertTrue(rs.next());
        assertEquals(1, rs.getLong(1));

        conn.createStatement().execute("SHUTDOWN");
        DataSourceUtils.releaseConnection(conn, dataSource);

        conn = DataSourceUtils.getConnection(dataSource);
        rs = conn.createStatement().executeQuery("SELECT 1 FROM dual");
        assertTrue(rs.next());
        assertEquals(1, rs.getLong(1));
    }
}
Initially, I felt I was on the right path with this solution. There was still the question of how to start the H2 server back up (one problem at a time!), but when I ran the test, it failed with an error analogous to what my service was experiencing in PROD:
org.h2.jdbc.JdbcSQLException: Database is already closed (to disable automatic closing at VM shutdown, add ";DB_CLOSE_ON_EXIT=FALSE" to the db URL) [90121-192]
However, if I modify my test case and simply attempt a second connection to the database:
conn = DataSourceUtils.getConnection(dataSource);
The exception goes away and my test passes without me making any changes to my production code. Something isn’t right here…
Why This Solution Didn’t Work
So using H2 won’t work. I actually spent quite a bit more time trying to get H2 to work than the above would suggest. Troubleshooting attempts included connecting to a file-based H2 instance instead of a purely in-memory one, and connecting to a remote H2 server; I even stumbled upon the H2 Server class, which would have addressed the server shutdown/startup issue from earlier.
Obviously, none of those attempts worked. The fundamental problem with H2, at least for this test case, is that attempting to connect to a database causes that database to start up if it isn’t currently running. There is a bit of a delay, as my initial test case shows, but this still poses a fundamental problem: in PROD, when my service attempts to connect to a database, that does not cause the database to start up (no matter how many times I attempt to connect to it). My service’s logs can certainly attest to this fact. So, on to another approach.
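As a quick aside, the behavior is easy to reproduce outside of Spring. The following stand-alone sketch (the URL, credentials, and table name are purely illustrative, not from the actual project) shows H2 happily handing out a connection to a brand-new, empty database immediately after a SHUTDOWN, where a real database server would refuse the connection:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Assumes the H2 jar is on the classpath.
public class H2AutoStartDemo {

    private static final String URL = "jdbc:h2:mem:demo;DB_CLOSE_DELAY=-1";

    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(URL, "sa", "")) {
            conn.createStatement().execute("CREATE TABLE outage_demo (id INT)");
            // Simulate the outage by shutting the in-memory database down.
            conn.createStatement().execute("SHUTDOWN");
        }

        // A real database server would refuse this connection while it is down.
        // H2 instead starts a brand-new, empty in-memory database and hands back a connection.
        try (Connection conn = DriverManager.getConnection(URL, "sa", "")) {
            conn.createStatement().executeQuery("SELECT COUNT(*) FROM outage_demo");
        } catch (SQLException e) {
            // This fails only because the table is gone; the "restarted" database is a fresh copy.
            System.out.println("Reconnected without error, but to an empty database: " + e.getMessage());
        }
    }
}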
Solution 3: Connect to a Local Database
Mocking everything won’t work. Using an in-memory database didn’t pan out either. It looked like the only way I would be able to properly reproduce the scenario my service was experiencing in PROD was by connecting to a more formal database implementation. Bringing down a shared development database is out of the question, so this database implementation needs to run locally.
The Problems With This Solution
Everything before this point should give a pretty good indication that I really wanted to avoid going down this path. There were some good reasons for my reluctance:
- Decreased portability: If another developer wanted to run this test, she would need to download and install a database on her local machine. She would also need to make sure her configuration details matched what the test expects. This is a time-consuming task and leads to at least some amount of “out of band” knowledge.
- Slower: Overall my test still isn’t too slow, but it does take several seconds to start up, shut down, and start up again, even against a local database. While a few seconds doesn’t sound like much, time adds up with enough tests. This is less of a concern for integration tests, which are allowed to take longer (more on that later), but the faster the integration tests run, the more often they can be run.
- Organizational wrangling: Running this test on the build server means I would now need to work with my already-overburdened DevOps team to set up a database on the build box. Even if the ops team weren’t overburdened, I like to avoid this if possible, as it’s just one more step.
- Licensing: In my code example, I am using MySQL as my test database implementation. However, for my client, I was connecting to an Oracle database. Oracle does offer Oracle Express Edition (XE) for free; however, it comes with stipulations, one of which is that two instances of Oracle XE cannot be running at the same time. The specific case of Oracle XE aside, licensing can become an issue when connecting to specific product offerings; it’s something to keep in mind.
Success! … Finally
Originally this article was a good bit longer, which also gave a better impression of all the blood, sweat, and tears, er, work that went into getting to this point. Ultimately such information isn’t particularly useful to readers, even if it was cathartic for the author to write. So, without further ado, here is a test that accurately reproduces the scenario my service was experiencing in PROD:
@Test
public void testServiceRecoveryFromDatabaseOutage() throws SQLException, InterruptedException, IOException {
    Connection conn = null;
    conn = DataSourceUtils.getConnection(datasource);
    assertTrue(conn.createStatement().execute("SELECT 1"));
    DataSourceUtils.releaseConnection(conn, datasource);

    LOGGER.debug("STOPPING DB");
    Runtime.getRuntime().exec("/usr/local/mysql/support-files/mysql.server stop").waitFor();
    LOGGER.debug("DB STOPPED");

    try {
        conn = DataSourceUtils.getConnection(datasource);
        conn.createStatement().execute("SELECT 1");
        fail("Database is down at this point, call should fail");
    } catch (Exception e) {
        LOGGER.debug("EXPECTED CONNECTION FAILURE");
    }

    LOGGER.debug("STARTING DB");
    Runtime.getRuntime().exec("/usr/local/mysql/support-files/mysql.server start").waitFor();
    LOGGER.debug("DB STARTED");

    conn = DataSourceUtils.getConnection(datasource);
    assertTrue(conn.createStatement().execute("SELECT 1"));
    DataSourceUtils.releaseConnection(conn, datasource);
}
Full code here: https://github.com/wkorando/integration-test-example/blob/master/src/test/java/com/integration/test/example/ITDatabaseFailureAndRecovery.java
The Fix
So I have my test case. Now it’s time to write the production code that gets my test showing green. Ultimately I got the answer from a friend, but I likely would have stumbled upon it with enough Googling.
Initially the DataSource I set up in my service’s configuration effectively looked like this:
@Bean
public DataSource dataSource() {
    org.apache.tomcat.jdbc.pool.DataSource dataSource = new org.apache.tomcat.jdbc.pool.DataSource();
    dataSource.setDriverClassName(env.getRequiredProperty("datasource.driver"));
    dataSource.setUrl(env.getRequiredProperty("datasource.url"));
    dataSource.setUsername(env.getRequiredProperty("datasource.user"));
    dataSource.setPassword(env.getRequiredProperty("datasource.password"));
    return dataSource;
}
The underlying problem my service was experiencing was that when a connection from the DataSource’s connection pool failed to connect to the database, it became “bad.” The next problem was that my DataSource implementation would not drop these “bad” connections from the connection pool; it just kept trying to use them over and over.
The fix for this is luckily pretty simple. I needed to instruct my DataSource to test a connection when the DataSource retrieved it from the connection pool. If this test failed, the connection would be dropped from the pool and a new one attempted. I also needed to provide the DataSource with a query it could use to test a connection.
Finally (not strictly necessary, but useful for testing), by default my DataSource implementation would only test a connection every 30 seconds, and it would be nice for my test to run in less than 30 seconds. Ultimately the length of this period isn’t really meaningful, so I made the validation interval configurable from a property file.
Here is what my updated DataSource looks like:
@Bean
public DataSource dataSource() {
    org.apache.tomcat.jdbc.pool.DataSource dataSource = new org.apache.tomcat.jdbc.pool.DataSource();
    dataSource.setDriverClassName(env.getRequiredProperty("datasource.driver"));
    dataSource.setUrl(env.getRequiredProperty("datasource.url"));
    dataSource.setUsername(env.getRequiredProperty("datasource.user"));
    dataSource.setPassword(env.getRequiredProperty("datasource.password"));
    dataSource.setValidationQuery("SELECT 1");
    dataSource.setTestOnBorrow(true);
    dataSource.setValidationInterval(env.getRequiredProperty("datasource.validation.interval", Long.class));
    return dataSource;
}
One final note on writing integration tests: initially, I created a test configuration file and used it to configure the DataSource in my test. However, this is incorrect.
The problem is that if someone were to remove my fix from the production configuration file but leave it in the test configuration file, my test would still pass, yet my actual production code would once again be vulnerable to the problem I spent all this time fixing! This is a mistake that is easy to imagine happening. So be sure to use your actual production configuration files when writing integration tests.
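Concretely, that is why the test class shown earlier bootstraps the production DataSourceConfig class directly and overrides only the connection properties. A skeleton looks something like the following (the driver, URL, and other property values are illustrative; the test methods are the ones shown above):

import javax.sql.DataSource;

import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.junit4.SpringRunner;

// Bootstraps the production DataSourceConfig; only the connection properties
// are overridden to point at the local test database (values are illustrative).
@RunWith(SpringRunner.class)
@SpringBootTest(classes = DataSourceConfig.class, properties = {
        "datasource.driver=com.mysql.jdbc.Driver",
        "datasource.url=jdbc:mysql://localhost:3306/test",
        "datasource.user=test",
        "datasource.password=test",
        "datasource.validation.interval=1000" })
public class ITDatabaseFailureAndRecovery {

    @Autowired
    private DataSource dataSource;

    // ... the failure-and-recovery test shown above ...
}

If the fix is ever removed from DataSourceConfig itself, this test fails, which is exactly the safety net we want.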
Automating the Test
So the end is almost in sight. I have a test case that accurately reproduces the scenario my service was experiencing in PROD. I have a fix that takes my test from failing to passing. However, the point of all this work wasn’t just to have confidence that my fix works for the next release, but for all future releases.
If you are a Maven user, hopefully you are already familiar with the Surefire plugin, or at least your DevOps team already has your parent pom set up so that when a project is built on your build server, all those unit tests you took the time to write are run with every commit.
This article, however, isn’t about writing unit tests, but about writing integration tests. An integration test suite will typically take much longer to run (sometimes hours) than a unit test suite (which should take no more than 5-10 minutes). Integration tests are also typically more subject to volatility. While the integration test I wrote in this article should be stable (if it breaks, it should be cause for concern), when connecting to a development database you can’t always be 100% confident the database will be available, or that your test data will be correct or even present. So a failed integration test doesn’t necessarily mean the code is incorrect.
Luckily, the folks behind Maven have already addressed this with the Failsafe plugin. Whereas the Surefire plugin, by default, looks for classes whose names start or end with Test, the Failsafe plugin looks for classes whose names start or end with IT (Integration Test). Like all Maven plugins, you can configure the build phases in which the plugin’s goals execute. This gives you the flexibility to have your unit tests run with every code commit, but your integration tests run only during a nightly build. This can also prevent a scenario in which a hot-fix needs to be deployed but a resource that an integration test depends upon isn’t present.
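For reference, binding the Failsafe plugin into a Maven build might look something like the snippet below. The plugin version and the goal bindings shown here are illustrative; check what your parent pom already provides before adding it yourself.

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-failsafe-plugin</artifactId>
  <version>2.22.2</version>
  <executions>
    <execution>
      <goals>
        <!-- integration-test runs the IT* classes; verify fails the build afterwards if any of them failed -->
        <goal>integration-test</goal>
        <goal>verify</goal>
      </goals>
    </execution>
  </executions>
</plugin>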
Final Thoughts
Writing integration tests is time-consuming and difficult. It requires extensive thought about how your service will interact with other resources. This process is even more difficult and time-consuming when you are specifically testing for failure scenarios, which often requires more in-depth control of the resource your test is connecting to, as well as drawing on past experience and knowledge.
Despite the high cost in time and effort, this investment will pay itself back many times over. Increasing confidence in the correctness of code, which is only possible through automated testing, is central to shortening the development feedback cycle.
The code that I used in this article can be found here: https://github.com/wkorando/integration-test-example.
Reference: Blood, Sweat, and Writing Automated Integration Tests for Failure Scenarios from our JCG partner Billy Korando at the Keyhole Software blog.