DevOps and SRE, Chapter 1: When Innovation Becomes Mainstream – DZone DevOps
Cloud-native applications are complex systems that depend on the continuous effort of software professionals who combine the best of their expertise to keep them running. In other words, their reliability isn’t self-sustaining; it is the result of the interactions among all the different actors engaged in their design, build, and operation.
Over the years, those interactions have evolved together with the systems they were designed to maintain, which have themselves become increasingly sophisticated and complex. The IT service management model, once designed to maintain control and stability, is now fading and giving way to a model designed to improve velocity while maintaining stability. Although combining those two goals might seem contradictory at first, this series of articles tries to reveal why the collection of practices we know today as DevOps and SRE (Site Reliability Engineering) is becoming the norm for modern systems.
Chapter 1: When Innovation Becomes Mainstream
When Björn “Beorn” Rabenstein from Grafana Labs said this during his talk “SRE in the Third Age” at SREcon19 in Dublin, he made me think about how the engineering practices developed around DevOps and Site Reliability Engineering relate to each other. He also made me think about how they are becoming the standard practices for managing complex cloud systems, to the point that, soon enough, we might stop seeing them as innovation and start perceiving them as the normal way of doing things.
From this point forward, it is important to mention that I won’t make any distinction between SRE and DevOps professionals. To me, the engineers and architects specialized in these practices share the same set of fundamental objectives, so the two titles are merely different names for the same job role. My personal definition for this type of role is:
I must admit that my opinions may be a little biased, as they come from my research, observations, and experiences, which are more on the operations side than on software development. When I first read Google’s SRE book, my impression was that everything made sense. Still, I couldn’t see any out-of-the-box innovation beyond establishing a framework and developing a shared vocabulary and a shared understanding of how to improve IT operations for complex distributed environments. I did not immediately grasp the impact of such a framework. As time went by, as I dug into the theme and connected with experts far more advanced than myself in adopting the practices, I realized that it was not only essential but also the basis for the new normal in modern IT operations.
Although this new way of managing IT systems was perceived as disruptive innovation by innovators and early adopters, it was still seen with skepticism by pragmatists, as I think all new things usually are (more on that in Chapter 4). Something is only perceived as innovation when it disrupts an individual’s belief or a group’s shared understanding of how things work. Streaming technology, for example, was perceived as innovation by my generation (to be honest, VCRs and DVD players were really cool innovations to me, too), but it is considered ordinary by my daughter’s generation. She has had access to those things once called innovation (YouTube, Netflix, Amazon Video, etc.) since she was born, so to her, they have always been the standard way of watching movies and were never perceived as a “modern” way of doing things.
With that in mind, I started observing some signals that corroborate the assertion that in the future you won’t need Site Reliability Engineers, you will need Site Reliability Engineering, and that this set of engineering practices will be the standard, effective way of managing complex distributed systems. SRE and DevOps will retain their status until a better model once again disrupts them, but this new model is yet to surface, and perhaps it will be based on the adoption of artificial intelligence at scale. Chances are that the industry will converge around the concept of AIOps, but that’s a topic for another article.
To understand the roots of my observations and beliefs, it is also important to understand the scientific foundations I based my research on. It is not my intent to explain each one of them in detail; the upcoming chapters will only provide a glimpse of each one so that the reader understands my thought process and how I modeled my understanding. After all, my model is, like all models, imprecise; it only describes the world from my very limited set of perspectives. Hopefully, it will be imperfect but useful.
Key Takeaway
The advent of cloud computing, together with the rapid expansion of the technologies and services enabled by its wide adoption, has been a very disruptive force in several different industries. The combination of those elements enabled companies of all sizes, from the garage-sized start-up to the Fortune 500 enterprise, to continuously reinvent themselves, transform their product portfolios, and engage with their customers in different shapes and forms, preferably via the so-called digital channels.
The collection of those capabilities and strategies became what we started to describe as digital transformation. This radical shift in the traditional way of working also required a much faster way of developing and delivering digital products. At first, we tried to push the incumbent IT management model to control this fast and ever-changing landscape but, as we learned from the process, we understood that a new way focused on velocity was more suitable for such a reality. This once-innovative approach to managing cloud systems is rapidly becoming the standard practice, and this is happening so fast that we might not even notice the transition.