The reuse dilemma
The first commandment that any young programmer learns is “Thou Shall Not Duplicate”. Thus instructed, whenever we see something that looks like it may be repeated code, we refactor. We create libraries and frameworks. But removing duplication doesn’t come for free.
If I refactor some code so that instead of duplicating some logic in Class A and Class B, these classes share the logic in Class R (for reuse!). Now Classes A and B are indirectly coupled. This is not necessarily a bad thing, but it comes with some consequences, which are often overlooked.
If Class A require the shared functionality to change, we have a choice: We make the change or we stymie Class A. Making the change comes at a cost of breaking Class B. If these classes are in the same Java package (or .NET Namespace), chances are that we will be able to verify that the change didn’t break anything. If the reused functionality is in a library that is used by another library that is used by a Class B, verifying that the change was good is harder.
This is where Continuous Integration comes in. We check in our change to class R (for reuse, remember). A build is triggered on Jenkins, Bamboo, TFS or what have you. The build causes other projects to get built and eventually, we get a failing build where a unit test for Class B breaks. If we did our job right.
Even with this failing test, we’re not out of the woods. A build system for a large enterprise may take up to an hour or even more to run. This means that by trying to improve Class R, the developers of Class A have made a mess for the developers of Class B at least for a good while.
When organizations see repeated build breaks due to changes in dependencies, the reaction is usually the same: Lets limit changes to the shared code. Perhaps we’ll even version it, so that any change that you need will be released in the next version and dependencies can upgrade at their own pace.
Now we’ve introduced another problem: We are no longer able to change the code. Developers of Class A will have to take Class R as it is, or at the very least go through substantial work to change it. If the reused code is very mature, this is fine. After all, this is what we have to deal with when it comes to language standard libraries and open source projects.
Most reused code inside an organization, however, isn’t so mature. The developers of Class A will often be frustrated by the limitations of Class R. If the reused class is something very domain specific, the limitation is even worse.
So the developers do what any reasonable developer would do: They make a copy of Class R (or they fork the repository where it lives), creating a new duplication of the code.
And so it goes: Removing duplication leads to risk of adverse interactions between reusers. Organizations enforce change control on the reused code to avoid changes made in one context from breaking other parts of the system, resulting in paralysis for the reusers who need the reused code to change. The reusers eventually duplicate the reused code to their own branch where they can evolve it, leading to duplication. Until someone gets the urge to remove the duplication.
The dilemma happens in the small and in the large, but the trade-offs are different.
What we should do less is to reuse immature code with little novel value. The cost of the extra coupling often far outweighs the benefit of reuse.
Reference: | The reuse dilemma from our JCG partner Johannes Brodwall at the Thinking Inside a Bigger Box blog. |
You bring up some good points about the undervalued cost of code deduplication. Thanks for posting.