Making Devops work outside of Webops
I’ve spent the last 3 years or so learning more about devops. I went to Velocity and Devopsdays and a bunch of other conferences that included devops stuff (like the last couple of OWASP USA conferences and this year’s Agile conference). I’ve been following the devops forums and news and reading devops books and trying out devops tools and Continuous Delivery, talking to smart people who work at successful devops shops, encouraging people in my organization to adopt some of these ideas and tools where they make sense. Looking for practices and patterns and technology we can take and apply to the work that we do, which is in an enterprise, B2B environment.
A problem for us is that devops today is still mostly where it started, rooted in Web Ops with a focus on building and scaling online shops and communities and Cloud services – except maybe where some enterprise technology vendors have jumped on the devops bandwagon to re-brand their tools.
Is there really that much that a well-run highly-regulated enterprise IT organization hooked into hundreds or thousands of other enterprises can learn from a technology startup trying to launch a new online social community or a multi-player online game, or even from larger, more mature devops shops like Etsy or Netflix? Do the same rules and ideas apply?
The answer is: yes sometimes, but no, not always.
There are some important factors that separate enterprises from most devops shops today.
Platform heterogeneity and the need to support legacy systems and all of the operational inter-dependencies between systems is one – you can’t pretend that you can take care of your configuration management problems by using Puppet or Chef in an enterprise that has been built up over many years through mergers and acquisitions and that has to support thousands of different applications on dozens of different technology platforms. Many of those apps are third party apps that you don’t have control over. Some of those platforms are legacy systems that aren’t supported any more. Some of those configs are one-off snow flakes because that’s the only way that somebody could get things to work.
Governance and regulatory compliance (and all the paperwork and hassle that goes with this) is another. Even devops shops don’t handle their highly-regulated core business functions the same as they do the rest of their code (a good example is how Etsy meets PCI compliance).
There are two other important factors that separate many enterprises such as large financial institutions from the way that a devops shop works: the need for speed in change, and the cost of failure.
The Need for Speed
If “How can I change things faster?” is the question, devops looks like the answer.
Devops enables – and emphasizes – rapid, continuous change, through relentless automation, breaking down walls between developers and operations, and through practices like Continuous Deployment, where developers push out code changes to production several times a day.
Being able to move this quickly is important in early stages of iterative design and development, for online startups that need to build a critical mass of customers before they run out of money, and other organizations experiencing hyper growth. Every organization has some systems that need to be changed often, and can be changed with minimal impact: CRM systems, analytics and internal management reporting for example. And as James Urqhuart explains, optimizing for change makes sense when you need to change technology often.
But there are other systems that you don’t need to or you can’t change every day or every week or every month: ERP and accounting systems, payment handling, B2B transactional systems, industrial control. Where you don’t need to and don’t want to run experiments on your customers to try out new features or constantly refine the details of their user experience because the system already works and lots of people are depending on it to work a certain way and what’s really important is keeping the system working properly and keeping operational costs down. Or where change is risky and expensive because of regulatory and compliance requirements and operational interdependencies with other systems and other organizations. And where the cost of failure is high.
Change, even when you optimize for it, always comes with the risk of failure. The 2013 State of Devops Report found that high performing devops shops deploy code 30x more frequently, with “double the change success rate”. By themselves these figures are impressive. Taken together – they aren’t good enough. Changing more often still means failing more often than an organization which moves more slowly and more cautiously, and not every organization can afford to fail more often.
The Cost of Failure
Most online businesses exist in a simpler, more innocent world where change is straightforward – it’s your code and your Cloud so you can make a change and push it out without worrying about dependencies on other systems and the people who use them or how to coordinate a roll-out globally across different companies and business lines – and where the consequences of failure are really not that high.
If you’re not charging anything (Facebook, Twitter) or next to nothing (Netflix) for customers to use your service, and if the cost of failure to customers is not that much (they have to wait a little bit to tell people that their kitty just sneezed or to post a picture of their gold fish or watch a movie) then nobody has the right to expect too much when something goes wrong.
It’s a completely different world for financial services companies, where failure is not always an option – and “fail fast, fail often” works in early stage development, but not once customers are using your technology.
I’ve been told by a tech exec at a bank that Etsy (or any web company) wasn’t a “serious” endeavor, that his bank works with “serious money” which means that they can’t “screw around” like web companies do. I’ve also seen web companies poo-poo the enterprise because they’re “spoiled” with their small user base and non-24×7 working environments.Until there is a shared understanding between those groups, the healthy and mature swapping of ideas and concepts is going to be slow.
John Allspaw, interview, Is the Entrerprise Ready for DevOps?
That bank exec, though he wasn’t diplomatic about it, is right.
The cost and risk involved in a failure is several orders of magnitude different between a bank and an online consumer web business, even something as large as Etsy. In a presentation at the end of 2012, Etsy’s CTO boasted that they are now handling “real money”, as much as “$1k per minute” at that time. While that’s real money to real customers at Etsy (and I am sure that this number is higher by now), it’s negligible compared to the value of transactions that any major financial institution handles.
There aren’t any mistakes that a company like Etsy or even Facebook could make that could compare with the impact of a big system failure at a major bank, or a credit card processor or a major stock exchange or financial clearing house or brokerage, or some other large financial institution.
This is not just because of the high value of transactions that are moving through these systems. It is also because of the chain reaction that such failures have on other institutions in the national and international system-of-systems that these organizations operate in – the impact on partner and customers’ systems, and on their customers and partners and so on. The costs of these failures can run into the millions or hundreds of millions of dollars, much more if you include lost opportunity costs and the downstream costs of responding to the failure (including IT costs for upgrading or replacing the entire system, which is a common response to a major failure), never mind the follow-on costs of increased regulatory oversight that is often demanded across an entire industry after a high-profile system failure.
Problems of this scale just don’t happen when an online e-business or social network fails, even if it fails spectacularly. It’s an inconvenience to customers, and it is a real cost to the online business, but failures don’t cascade down to other companies and industries, unless maybe Amazon’s AWS infrastructure fails big time and the online companies that depend on it are left hanging, which seems to happen a couple of times each year.
It’s not enough for many enterprises and even smaller B2B platforms to optimize for MTTR and try harder next time or to accept that roll-back is a myth and that “real men only roll forward” – and from the continuing stories of high-profile failures at online organizations this isn’t enough for devops organizations once they reach a certain size either.
But you can still learn a lot from Devops
It’s not that devops won’t work in the enterprise. Just not devops as it is mostly described until now. Devops overplays the “everything needs to change faster and more often” card, and oversimplifies some of the other problems that many organizations face if they don’t or can’t run everything in the Cloud. But there is still a lot that to learn from devops leaders, even if their stories and their priorities and constraints and their business situations don’t match up.
We can certainly learn from them about how to run a scalable Web presence and about how to move stuff to the Cloud.
We can learn how to take more of the risk and cost out of change management by simplifying and standardizing configurations where possible and simplifying and automating testing and deployment steps as much as possible – even if we aren’t going to change things every day.
But for now, probably the most valuable thing that devops brings to those of us who don’t work in online Web shops isn’t tools or practices. It’s that devops is creating new reasons and more opportunities for dev and Ops and management to engage with each other, as they try to figure out what this devops thing is and whether and how it makes sense in our organizations.
To get developers, managers and Ops together talking about configuration management and how to improve release and deployment and run-time alerting and application monitoring and run-time health checks and how developers and Ops can learn from failures together. Getting developers and Ops thinking about these problems and trying to solve them together, in collaborative and constructive ways that ITIL certainly never did. Spending more time on problem solving and less time on expensive bureaucracy and buck passing. That’s gotta be a good thing.