Resiliency is the through line in DORA’s 5th annual State of DevOps report. The 2018 trend data shows that elite performers are better prepared for failure: they expect things to fail, and they have put systems in place to get back up and running within tight windows, which lets them deploy faster with greater reliability.
One of the big surprises is that industry leaders, defined as “elite performers” in this assessment, don’t see throughput vs. stability as a trade-off. This group is consistently able to deliver faster, more robust software, while “low performers” are both more cautious and end up with much less stable results. Read more about what separates elite DevOps teams from their peers.
What Makes a DevOps Team “Elite”?
Elite performers come from organizations of all sizes. Half of the respondents came from North America, but the report includes input from every corner of the globe. The four measures of software delivery performance fell into two categories.
- Throughput was measured by deployment frequency and lead time for change.
- Stability was measured by time to restore service and change fail rates.
New this year was a measure of availability. Combined with the four measures above, it produced a total score for software delivery and operational performance (SDO performance).
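The report gathers these measures through survey responses, but a team that keeps its own deployment records could approximate the four of them directly. The following is a minimal sketch, not the report's method: the `Deployment` record, its field names, and the `delivery_metrics` function are all hypothetical, and the use of medians is an assumption made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it impair service?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def delivery_metrics(deployments, period_days):
    """Approximate the four measures over a window of `period_days` days."""
    if not deployments:
        raise ValueError("need at least one deployment")

    # Throughput: deployment frequency and lead time for changes.
    lead_times = [d.deployed_at - d.committed_at for d in deployments]

    # Stability: change fail rate and time to restore service.
    failures = [d for d in deployments if d.failed]
    restore_times = [
        d.restored_at - d.deployed_at
        for d in failures
        if d.restored_at is not None
    ]

    return {
        "deploys_per_day": len(deployments) / period_days,
        "median_lead_time": median(lead_times),
        "change_fail_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }
```

Medians are used here so that a single unusually slow change or long outage does not dominate the result; a team could just as reasonably track averages or percentiles.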
At the bottom of the scale, low performers tended to need one to six months in lead time for changes, about half of which resulted in service impairments. These represented about 15 percent of those surveyed. At the other end, 48 percent of companies qualified as high performers, able to deploy code once a day or more and 3.5 times more likely to achieve availability goals at the same time. Within that group, seven percent achieved elite classification, which involved factors like deployment on demand and a record of restoring service in less than an hour.
What Do Elite Teams Do Differently?
There’s a great deal to be learned from how elite DevOps teams operate. Compared to the low performer group, they were able to achieve:
- Code deployments that are both stable and going out 46X more often
- A hyper-compressed lead time in going from code commit to production 2,555X faster
- Change failure rates, including hotfixes, rollbacks and rolling patches, that were 7X lower
- A recovery time 2,604X shorter, which reduces the impact on the customer experience
In comparison, the “misguided performers” tended to fall into the familiar trap of working in reactive mode: they frantically patch errors, which degrades performance further and compounds the problems. Unplanned work is a big part of the breakdown, because it prevents these companies from meeting their goals, and the pressure to maintain uptime encourages them to cut corners, with predictably poor results.
Planning for Resilience
That observation may be the most important takeaway from this year’s report. Planning for failure in advance leads to heightened performance. The authors concluded, “Developing software in increasingly complex systems is difficult and failure is inevitable. Making large-batch and infrequent changes introduces risk to the deployment process.”
Cascading failures drive teams to burnout, which impairs response time to the point where unplanned work dominates the team’s priority list. While the performance metrics for high and elite teams remained fairly constant, performance metrics for low performers have dipped. This suggests that top teams expect things to go wrong, and they are pulling ahead by building more resilient systems - those that can be quickly restored and enhanced with minimal downtime.