How to Measure the Reliability of Your Software Throughout the CI/CD Workflow
Overcome the challenge of maintaining code quality in a CI/CD workflow with Continuous Reliability
CI/CD practices encourage frequent code integration during development, speed up preparations for new releases and automate deployments. With this tooling, these parts of the software development lifecycle have all improved and accelerated. At the same time, the data we use to evaluate the overall quality of a new release, and of our application as a whole, hasn’t changed much at all.
With all of the new tools on the market, most teams are accepting a trade-off in exchange for accelerated delivery: by shipping new features more quickly, we let more errors slip through to production and increase the risk of a degraded user experience.
It doesn’t have to be this way, though. Let’s talk about how we can face and overcome the challenge of maintaining high quality code within a CI/CD workflow.
Implementing a CI/CD Workflow
The implementation of a CI/CD workflow has become a pillar of modern software development. The basic concept is to introduce automation across the software release cycle in order to release software more quickly and more consistently. Not everybody implements automation at every step of the release cycle, though, and some may choose to focus on automating one step more than another.
Development teams that focus on Continuous Integration (CI) frequently merge changes and new code into the main branch. This practice requires advanced automated testing protocols to prevent new code from breaking the release version.
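As a hypothetical illustration, here is the kind of JUnit 5 test that acts as one of those gates: it runs on every merge to the main branch, and a failure blocks the change from progressing toward a release. The PriceCalculator class is invented for the example.

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

// Hypothetical business logic plus the automated test that guards it.
class PriceCalculator {
    double applyDiscount(double price, double discount) {
        // Floor at zero so an aggressive discount can never yield a negative price.
        return Math.max(0.0, price * (1.0 - discount));
    }
}

class PriceCalculatorTest {
    @Test
    void appliesDiscountWithoutGoingNegative() {
        PriceCalculator calculator = new PriceCalculator();
        assertEquals(90.0, calculator.applyDiscount(100.0, 0.10), 0.001);
        // Edge case: a full discount must floor at zero, not go negative.
        assertEquals(0.0, calculator.applyDiscount(100.0, 1.0), 0.001);
    }
}
```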
Continuous Delivery (CD) builds on CI to include the remaining steps of the process up to the point of deployment. Every code revision goes through a standardized, automated test process beyond unit testing, which may include load testing, integration testing, or tests for UI or API reliability. Every code change is automatically built, tested and sent to the staging environment, ready to be deployed. In Continuous Delivery, deployment itself isn’t automated and requires a manual “flick of the switch”.

In Continuous Deployment (also CD), the final act of deploying code to production is automated as well, meaning there are no points in the process that require human oversight before code is pushed to users.
The more steps of the process we automate, the faster we can get code from our local environment into the hands of our users.
The Quality Challenge
For the CI/CD process to be successful, code needs to move into production more quickly without its quality being diminished. With our automated testing frameworks acting as the only gates blocking code from being promoted through the development stages, the quality of the tests we write essentially determines the quality of the code that is deployed.
Successful use of CI/CD is founded on our ability to apply stringent testing to applications as part of this automated process. In reality, we all know that our code is never perfect and that issues are likely to sneak into the production environment in spite of our best efforts.
Organizations that employ this level of automation are typically very concerned about quality, but the speed at which code passes through the development stages can also affect the overall functional quality of an application as a whole. As the level of automation increases, the quality of our testing must scale proportionately.
No matter how much automation we put into our testing suites, though, they’ll always be somewhat manual in that they are written and defined by people in the first place. And however smart our engineers and testers are, it’s fair to say that we can’t consider every functional edge case or permutation of data that complete coverage would require. At the same time, we can’t simply accept that our code might be introducing new errors that will negatively impact our users and our business.
Instead, we often try to evaluate the overall quality of an application in each deployment environment by analyzing log files. This gives us limited insight into some raw numbers, but not enough detail to know which function is failing, let alone why it’s failing. It also provides no insight into new or reintroduced errors, making it impossible to understand their frequency or failure rate. CI/CD automation is predicated on delivering quality software, and log files alone can’t tell us whether we are.
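To see why, consider a typical catch-and-log block. This sketch assumes SLF4J for logging; the PaymentService and its failing downstream call are hypothetical. The aggregator ends up with one log line and a stack trace, while the state that explains the failure is gone.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PaymentService {
    private static final Logger log = LoggerFactory.getLogger(PaymentService.class);

    public void charge(String orderId, double amount) {
        try {
            processPayment(orderId, amount); // hypothetical downstream call
        } catch (Exception e) {
            // This line tells us *that* the payment failed; the variable state
            // that explains *why* (amount, customer, retry count) is not captured.
            log.error("Payment failed for order {}", orderId, e);
        }
    }

    private void processPayment(String orderId, double amount) {
        throw new IllegalStateException("gateway timeout"); // simulated failure
    }
}
```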
Practicing Continuous Reliability with Better Data
This is a complicated problem with a not-so-complicated solution, in theory. We need to get access to more detailed data about our applications’ performance and reliability. We need to see what the JVM sees. With access to this data, we can use it to set up more advanced quality gates to block problematic code from passing to the next stage and feedback loops to inform more comprehensive testing scenarios.
Quality Gates
The goal of Continuous Integration specifically is to enable early detection of issues in the code. Rather than waiting for the day of the release, changes and new code are integrated as frequently as possible so that if (when) a bug comes up in testing, troubleshooting it should be easier. Then, until it’s fixed, the code is blocked from passing through to the release stage.
The problem is that this process only accelerates the flow of code from development to testing without improving code quality itself. The goal of CI is to reduce overhead and reveal code errors early in the development cycle, but with the right data feeding effective quality gates, the CI workflow has far greater potential.
So, we’ve implemented our CI/CD workflow and gained the accelerated development cycle that we wanted. Now it’s time to get our hands on the data we need to evaluate the overall quality of a release before deciding whether or not it is “safe” to promote the code.
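To sketch what such a gate could look like, the class below compares the set of error “fingerprints” produced by a release candidate against the previous release, and fails the CI stage if anything new shows up. The fetchErrorFingerprints() method is a placeholder for querying whatever error-analysis store is in place; the release labels and error names are invented.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class QualityGate {

    public static void main(String[] args) {
        Set<String> baseline = fetchErrorFingerprints("v1.4.2");      // last good release
        Set<String> candidate = fetchErrorFingerprints("v1.5.0-rc1"); // release candidate

        candidate.removeAll(baseline); // whatever remains was never seen before

        if (!candidate.isEmpty()) {
            System.err.println("Blocking promotion, new errors detected: " + candidate);
            System.exit(1); // a non-zero exit code fails the CI stage
        }
        System.out.println("Quality gate passed.");
    }

    // Hypothetical stand-in for querying an error-tracking store.
    private static Set<String> fetchErrorFingerprints(String release) {
        Map<String, Set<String>> store = Map.of(
                "v1.4.2", Set.of("NullPointerException@CartService.total"),
                "v1.5.0-rc1", Set.of("NullPointerException@CartService.total",
                                     "IndexOutOfBoundsException@OrderMapper.map"));
        return new HashSet<>(store.get(release)); // mutable copy for set arithmetic
    }
}
```

Wired into the pipeline between the test and staging stages, a non-zero exit from a check like this is what turns raw error data into an actual gate.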
Feedback Loops
Beyond stopping unqualified code from entering the release stage, this data enables us to create a feedback loop between the different development stages and the production environment. Besides aiding troubleshooting, this helps us write better automated test scripts by using real production data as the basis for testing in lower environments.
This greatly increases developer productivity and can easily become part of our regular workflow and testing in CI, as the sketch below illustrates.
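A minimal, hypothetical version of that loop is a JUnit 5 parameterized test whose input rows stand in for values recorded from production traffic, replayed against the code in a lower environment. The shipping-quote logic and the sample values here are invented for the example.

```java
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.CsvSource;
import static org.junit.jupiter.api.Assertions.assertTrue;

class ShippingQuoteTest {

    @ParameterizedTest
    @CsvSource({
            // weightKg, distanceKm -- hypothetical values captured from production
            "0.0, 12.5",     // zero-weight parcel that once caused a failure
            "23.9, 0.0",     // zero-distance local pickup
            "1200.0, 8432.0" // oversized international shipment
    })
    void quoteIsNeverNegative(double weightKg, double distanceKm) {
        assertTrue(quote(weightKg, distanceKm) >= 0.0, "quote must never be negative");
    }

    // Hypothetical pricing logic under test.
    private double quote(double weightKg, double distanceKm) {
        return 2.0 + 0.5 * weightKg + 0.01 * distanceKm;
    }
}
```

The point is the direction of the data flow: production tells the test suite which inputs actually occur, rather than developers guessing at them.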
Essentially, what most teams are missing is visibility into the state of the JVM at the moment an error is logged or an exception is thrown. Log aggregators and performance monitoring tools provide useful information, but they don’t give us what we need for a granular understanding of our application’s quality, or for regulating and deciding when a release is safe to promote.
Instead, or rather in addition, we should capture the source code, variable state and stack trace at the time of an error. This data can then be aggregated across the entire application, a library, a class, a deployment or any other boundary for insight into the overall functional quality of the code. This set of data allows us to identify both known and unknown errors and to classify events: whether they are new or reintroduced, and with what frequency and failure rate.
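As an illustration of what such an event could contain, here is a small sketch; the ErrorEvent record and capture() helper are invented for the example, not any particular tool’s API. It snapshots the stack trace and a map of variable state, and derives a fingerprint that later aggregation can use to separate new errors from reoccurring ones.

```java
import java.time.Instant;
import java.util.Map;

public class ErrorCapture {

    // One captured error event: when it happened, a fingerprint for
    // aggregation, the variable state at the time, and the exception itself.
    record ErrorEvent(Instant when, String fingerprint,
                      Map<String, Object> variableState, Throwable cause) {}

    static ErrorEvent capture(Throwable t, Map<String, Object> state) {
        // Fingerprint by exception type and topmost stack frame so that
        // aggregation can tell new errors apart from known ones.
        StackTraceElement top = t.getStackTrace()[0];
        String fingerprint = t.getClass().getSimpleName()
                + "@" + top.getClassName() + "." + top.getMethodName();
        return new ErrorEvent(Instant.now(), fingerprint, state, t);
    }

    public static void main(String[] args) {
        try {
            Object order = null;
            order.toString(); // simulated failure
        } catch (NullPointerException e) {
            ErrorEvent event = capture(e, Map.of("orderId", "A-1042", "retries", 3));
            System.out.println(event.fingerprint() + " state=" + event.variableState());
        }
    }
}
```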
The net result is that we eliminate the risk of promoting code faster because we better understand its overall quality, we write more comprehensive tests in dev and pre-prod environments, and together these practices drive the development of higher-quality code.
Final Thoughts
For most teams, implementing a CI/CD workflow means making certain trade-offs. In exchange for more frequent code integrations and releases, code quality stays stagnant or drops. Focus can even shift, ironically, from writing code to troubleshooting code. To mitigate these challenges, teams should practice Continuous Reliability in the context of a CI/CD workflow to help govern the release and to provide a valuable feedback loop.
Published on Java Code Geeks with permission by Tali Soroker, partner at our JCG program. See the original article here: How to Measure the Reliability of Your Software Throughout the CI/CD Workflow. Opinions expressed by Java Code Geeks contributors are their own.