Book: The DevOps Handbook

“The DevOps Handbook, Second Edition” by Gene Kim

  • Analogous to financial debt, technical debt describes how decisions we make lead to problems that get increasingly more difficult to fix over time, continually reducing our available options in the future—even when taken on judiciously, we still incur interest.
  • Competing goals
    • Respond to the rapidly changing landscape.
    • Provide stable, reliable, and secure service to the customer.
  • Instead of starting deployments at midnight on Friday and spending all weekend working to complete them, deployments occur throughout the business day when everyone is already in the office and without our customers even noticing—except when they see new features and bug fixes that delight them. And, by deploying code in the middle of the workday, IT Operations is working during normal business hours like everyone else for the first time in decades.
  • By creating fast feedback loops at every step of the process, everyone can immediately see the effects of their actions. Whenever changes are committed into version control, fast automated tests are run in production-like environments, giving continual assurance that the code and environments operate as designed and are always in a secure and deployable state.
  • Even high-profile product and feature releases become routine by using dark launch techniques. Long before the launch date, we put all the required code for the feature into production, invisible to everyone except internal employees and small cohorts of real users, allowing us to test and evolve the feature until it achieves the desired business goal.
  • And, instead of firefighting for days or weeks to make the new functionality work, we merely change a feature toggle or configuration setting. This small change makes the new feature visible to ever-larger segments of customers, automatically rolling back if something goes wrong.
  • Instead of project teams where developers are reassigned and shuffled around after each release, never receiving feedback on their work, we keep teams intact so they can keep iterating and improving, using those learnings to better achieve their goals. This is equally true for the product teams who are solving problems for our external customers, as well as our internal platform teams who are helping other teams be more productive, safe, and secure.
  • Instead of a culture of fear, we have a high-trust, collaborative culture, where people are rewarded for taking risks. They are able to fearlessly talk about problems as opposed to hiding them or putting them on the back burner—after all, we must see problems in order to solve them.
  • Metrics
    • Throughput:
      • code and change deployments
      • code and change deployments lead time
    • Reliability
      • successful production deployments
      • mean time to restore service
  • Later, at the 2009 Velocity conference, John Allspaw and Paul Hammond gave the seminal “10 Deploys per Day: Dev and Ops Cooperation at Flickr” presentation, where they described how they created shared goals between Dev and Ops and used continuous integration practices to make deployment part of everyone’s daily work.
  • Lead Time: includes the time from when the ticket is created until the work is completed
  • Process Time: includes the time from when work is started until the work is completed
  • Achieving deployment lead times of minutes
    • We achieve this by continually checking small code changes into our version control repository, performing automated and exploratory testing against it and deploying it into production.
    • This is most easily achieved when we have architecture that is modular, well encapsulated, and loosely coupled so that small teams are able to work with high degrees of autonomy, with failures being small and contained, and without causing global disruptions.
  • In addition to lead times and process times, the third key metric in the technology value stream is percent complete and accurate (%C/A). This metric reflects the quality of the output of each step in our value stream, i.e., the percentage of time downstream consumers can use the work as received without having to correct or clarify it.
  • When measuring the end-to-end value of any value stream it is important to stay away from proxy metrics (counting the number of lines of code committed or solely the frequency of deployments). While these metrics can reveal local optimizations, they don’t directly link to business outcomes such as revenue.
  • Flow metrics
    • Flow velocity: the number of flow items (e.g., work items) completed in a set time period. Helps answer whether value delivery is accelerating.
    • Flow efficiency: the proportion of time flow items are actively worked on out of the total time that has elapsed. Identifies inefficiencies like long wait times and helps teams see whether upstream work is sitting in a wait state (see the sketch after this list).
    • Flow time: the time it takes for a unit of business value (a flow item, i.e., features, defects, risks, and debts) to be pulled by a stakeholder through a product’s value stream. Helps teams see if time to value is getting shorter.
    • Flow load: the number of active or waiting flow items in a value stream. This is similar to a measure of work in progress (WIP) based on flow items. High flow load leads to inefficiencies and to reduced flow velocity or increased flow time. Helps teams see if demand is outweighing capacity.
    • Flow distribution: the proportion of each flow item type in a value stream. Each value stream can track and adjust these depending on their needs in order to maximize the business value being delivered.
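A minimal sketch of how these flow metrics could be computed from a team’s work items. The FlowItem fields, the example data, and the averaging choices are illustrative assumptions, not definitions from the book.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class FlowItem:
    kind: str                      # feature, defect, risk, or debt
    created: datetime              # entered the value stream
    completed: Optional[datetime]  # None means still active or waiting
    active_days: float = 0.0       # days actually worked on, not waiting

def flow_metrics(items: List[FlowItem]) -> dict:
    done = [i for i in items if i.completed]
    flow_times = [(i.completed - i.created).days or 1 for i in done]
    total_flow = sum(flow_times)
    return {
        "flow_velocity": len(done),                              # items completed in the period
        "avg_flow_time_days": total_flow / len(done) if done else 0.0,
        "flow_efficiency": sum(i.active_days for i in done) / total_flow if total_flow else 0.0,
        "flow_load": sum(1 for i in items if not i.completed),   # active or waiting (WIP)
    }

items = [
    FlowItem("feature", datetime(2024, 3, 1), datetime(2024, 3, 11), active_days=3),
    FlowItem("defect",  datetime(2024, 3, 5), datetime(2024, 3, 8),  active_days=1),
    FlowItem("debt",    datetime(2024, 3, 7), None),
]
print(flow_metrics(items))
```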
  • In order to maximize flow, we need to make work visible, reduce our batch sizes and intervals of work, build in quality by preventing defects from being passed to downstream work centers, and constantly optimize for global goals.
  • The Second Way enables the fast and constant flow of feedback from right to left at all stages of our value stream. It requires that we amplify feedback to prevent problems from happening again, or that we enable faster detection and recovery.
  • The Third Way enables the creation of a generative, high-trust culture that supports a dynamic, disciplined, and scientific approach to experimentation and risk-taking, facilitating the creation of organizational learning, both from our successes and failures.
  • American Airlines used the following metrics:
    • deployment frequency
    • deployment cycle time
    • change failure rate
    • development cycle time
    • number of incidents
    • mean time to recover (MTTR)
  • To help us see where work is flowing well and where work is queued or stalled, we need to make our work as visible as possible. One of the best methods of doing this is using visual work boards, such as kanban boards or sprint planning boards, where work can be represented on physical or electronic cards.
  • Example kanban board columns:
    • Ready
    • Investigate
    • Development: Doing, Done
    • Ops: Doing, Done
    • UAT
    • Delivered
  • We can limit multitasking when we use a kanban board to manage our work, such as by codifying and enforcing WIP (work in process) limits for each column or work center, which puts an upper limit on the number of cards that can be in a column.
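A small sketch of codifying WIP limits per column, as described above; the column names, limits, and board structure are hypothetical.

```python
# Hypothetical per-column WIP limits for a board like the example above.
WIP_LIMITS = {"Ready": 5, "Investigate": 3, "Development": 4, "Ops": 4, "UAT": 3}

def can_pull(board, column, limits=WIP_LIMITS):
    """Return True only if pulling a new card would not exceed the column's WIP limit."""
    return len(board.get(column, [])) < limits[column]

board = {"Development": ["card-101", "card-102", "card-103", "card-104"]}
assert not can_pull(board, "Development")   # column is full; finish work before starting more
assert can_pull(board, "UAT")               # downstream capacity is available
```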
  • Reduce the number of handoffs
    • Each of these steps is a potential queue where work will wait when we rely on resources that are shared between different value streams (e.g., centralized operations). The lead times for these requests are often so long that there is constant escalation to have work performed within the needed timelines.
    • To mitigate these types of problems, we strive to reduce the number of handoffs, either by automating significant portions of the work, or by building platforms and reorganizing teams so they can self-service builds, testing, and deployments to deliver value to the customer themselves instead of having to be constantly dependent on others.
  • We cannot achieve deployments on demand if we always have to wait weeks or months for production or test environments. The countermeasure is to create environments that are on-demand and completely self-serviced, so that they are always available when we need them.
  • We cannot achieve deployments on demand if each of our production code deployments takes weeks or months to perform (e.g., each deployment requires 1,300 manual, error-prone steps, involving up to three hundred engineers). The countermeasure is to automate our deployments as much as possible, with the goal of being completely automated so deployments can be done self-service by any developer.
  • We cannot achieve deployments on demand if every code deployment requires two weeks to set up our test environments and data sets and another four weeks to manually execute all our regression tests. The countermeasure is to automate our tests so we can execute deployments safely and to parallelize them so the test rate can keep up with our code development rate.
  • We cannot achieve deployments on demand if overly tight architecture means that every time we want to make a code change we have to send our engineers to scores of committee meetings in order to get permission to make our changes. Our countermeasure is to create more loosely coupled architecture so that changes can be made safely and with more autonomy, increasing developer productivity.
  • Categories of waste and hardship:
    • Partially done work: Partially done work becomes obsolete and loses value as time progresses.
    • Extra processes: Any additional work being performed in a process that does not add value to the customer.
    • Extra features: Features built into the service that are not needed by the organization or the customer
    • Task switching: When people are assigned to multiple projects and value streams, requiring them to context switch and manage dependencies between work, adding additional effort and time into the value stream.
    • Waiting: Any delays between work requiring resources to wait until they can complete the current work.
    • Motion: The amount of effort to move information or materials from one work center to another. Motion waste can be created when people who need to communicate frequently are not colocated. Handoffs also create motion waste and often require additional communication to resolve ambiguities.
    • Defects: Incorrect, missing, or unclear information, materials, or products create waste, as effort is needed to resolve these issues. The longer the time between defect creation and defect detection, the more difficult it is to resolve the defect.
    • Nonstandard or manual work: Reliance on nonstandard or manual work from others, such as using non-rebuildable servers, test environments, and configurations.
    • Heroics: In order for an organization to achieve goals, individuals and teams are put in a position where they must perform unreasonable acts, which may even become a part of their daily work
  • Make sure that what you’re measuring is commensurate with what your overall goals are. Make sure people are rewarded appropriately, and they’re not unfairly penalized for improving flow through the system. You need to think about the system, not about the department.
  • Therefore, because failure is inherent and inevitable in complex systems, we must design a safe system of work, whether in manufacturing or technology, where we can perform work without fear, confident that most errors will be detected quickly, long before they cause catastrophic outcomes, such as worker injury, product defects, or negative customer impact.
  • Types of feedback:
    • Dev Tests: As a programmer, did I write the code I intended to write?
    • Continuous Integration (CI) and Testing: As a programmer, did I write the code I intended to write without violating any existing expectations in the code?
    • Exploratory Testing: Did we introduce any unintended consequences?
    • Acceptance Testing: Did I get the feature I asked for?
    • Stakeholder Feedback: As a team, are we headed in the right direction?
    • User Feedback: Are we producing something our customers/users love?
  • Swarming is necessary for the following reasons:
    • It prevents the problem from progressing downstream, where the cost and effort to repair it increases exponentially and technical debt is allowed to accumulate.
    • It prevents the work center from starting new work, which will likely introduce new errors into the system.
    • If the problem is not addressed, the work center could potentially have the same problem in the next operation (e.g., fifty-five seconds later), requiring more fixes and work.
  • When conditions trigger an Andon cord pull, we swarm to solve the problem and prevent the introduction of new work until the issue has been resolved. This provides fast feedback for everyone in the value stream (especially the person who caused the system to fail), enables us to quickly isolate and diagnose the problem, and prevents further complicating factors that can obscure cause and effect.
  • This case of the “almost dones” was happening too frequently. The team decided this was an area they wanted to improve. They noticed that teammates were only bringing up issues at specific times, like during standups. They wanted the team to shift their practice to collaborating as soon as the issue was identified, not waiting until the next day’s standup or meeting.
  • Instead, we need everyone in our value stream to find and fix problems in their area of control as part of their daily work. By doing this, we push quality and safety responsibilities and decision-making to where the work is performed, instead of relying on approvals from distant executives.
  • According to Lean, our most important customer is our next step downstream. Optimizing our work for them requires that we have empathy for their problems in order to better identify the design problems that prevent fast and smooth flow.
  • Types of culture
    • Pathological organizations are characterized by large amounts of fear and threat. People often hoard information, withhold it for political reasons, or distort it to make themselves look better. Failure is often hidden.
    • Bureaucratic organizations are characterized by rules and processes, often to help individual departments maintain their “turf.” Failure is processed through a system of judgment, resulting in either punishment or justice and mercy.
    • Generative organizations are characterized by actively seeking and sharing information to better enable the organization to achieve its mission. Responsibilities are shared throughout the value stream, and failure results in reflection and genuine inquiry.
  • For instance, we may conduct a blameless post-mortem (also known as a retrospective) after every incident to gain the best understanding of how the accident occurred and agree upon what the best countermeasures are to improve the system, ideally preventing the problem from occurring again and enabling faster detection and recovery.
  • By removing blame, you remove fear; by removing fear, you enable honesty; and honesty enables prevention.
  • We improve daily work by explicitly reserving time to pay down technical debt, fix defects, and refactor and improve problematic areas of our code and environments. We do this by reserving cycles in each development interval, or by scheduling kaizen blitzes, which are periods when engineers self-organize into teams to work on fixing any problem they want.
  • Similarly, in the technology value stream, as we make our system of work safer, we find and fix problems from ever-weaker failure signals. For example, we may initially perform blameless post-mortems only for customer-impacting incidents. Over time, we may perform them for lesser team-impacting incidents and near misses as well.
  • When new learnings are discovered locally, there must also be some mechanism to enable the rest of the organization to use and benefit from that knowledge. In other words, when teams or individuals have experiences that create expertise, our goal is to convert that tacit knowledge (i.e., knowledge that is difficult to transfer to another person by means of writing it down or verbalizing) into explicit, codified knowledge, which becomes someone else’s expertise through practice.
  • In the technology value stream, we can introduce the same type of tension into our systems by seeking to always reduce deployment lead times, increase test coverage, decrease test execution times, and even by rearchitecting if necessary to increase developer productivity or increase reliability.
  • We may also perform game day exercises, where we rehearse large scale failures, such as turning off entire data centers. Or we may inject ever-larger scale faults into the production environment (such as the famous Netflix Chaos Monkey, which randomly kills processes and compute servers in production) to ensure that we’re as resilient as we want to be.
  • However, there is significant evidence that shows greatness is not achieved by leaders making all the right decisions—instead, the leader’s role is to create the conditions so their team can discover greatness in their daily work. In other words, creating greatness requires both leaders and workers, each of whom are mutually dependent upon each other.
  • Green field development is when we build on undeveloped land.
  • Brown field development is when we build on land that was previously used for industrial purposes, potentially contaminated with hazardous waste or pollution.
  • In technology, a green field project is a new software project or initiative, likely in the early stages of planning or implementation, where we build our applications and infrastructure from scratch, with few constraints.
  • On the other end of the spectrum are brown field DevOps projects. These are existing products or services that are already serving customers and have potentially been in operation for years or even decades. Brown field projects often come with significant amounts of technical debt, such as having no test automation or running on unsupported platforms.
  • When transforming brown field projects, we may face significant impediments and problems, especially when no automated testing exists or when there is a tightly coupled architecture that prevents small teams from developing, testing, and deploying code independently.
  • Within bimodal IT there are systems of record, the ERP-like systems that run our business (e.g., MRP, HR, financial reporting systems), where the correctness of the transactions and data are paramount; and systems of engagement, which are customer-facing or employee-facing systems, such as e-commerce systems and productivity applications.
  • Systems of record typically have a slower pace of change and often have regulatory and compliance requirements (e.g., SOX).
  • Systems of engagement typically have a much higher pace of change to support rapid feedback loops that enable them to experiment to discover how to best meet customer needs.
  • Find innovators and early adopters: In the beginning, we focus our efforts on teams who actually want to help—these are our kindred spirits and fellow travelers who are the first to volunteer to start the DevOps journey. In the ideal, these are also people who are respected and have a high degree of influence over the rest of the organization, giving our initiative more credibility.
  • Build critical mass and silent majority: In the next phase, we seek to expand DevOps practices to more teams and value streams with the goal of creating a stable base of support. By working with teams who are receptive to our ideas, even if they are not the most visible or influential groups, we expand our coalition who are generating more successes, creating a “bandwagon effect” that further increases our influence. We specifically bypass dangerous political battles that could jeopardize our initiative.
  • Identify the holdouts: The “holdouts” are the high-profile, influential detractors who are most likely to resist (and maybe even sabotage) our efforts. In general, we tackle this group only after we have achieved a silent majority and have established enough successes to successfully protect our initiative.
  • To scale the culture, they focused on three key attributes:
    • Passion: teams focus on delighting customers, being the best at getting better, and embracing failure to get stronger because of it.
    • Selflessness: collaborate and share knowledge and code across the organization (innersourcing), making space for others’ voices and helping others win.
    • Accountability: own the outcomes even when they’re hard; how you do something is as important as what you do.
  • American Airlines focused on the following values:
    • action and doing over analysis
    • collaboration over silos
    • clarity of mission over trying to do everything
    • empowerment over personal stamp on every effort (set the goals and empower teams to get there)
    • getting something out (MVP) versus getting something perfect
    • “We can do this” versus hierarchy (collaboration across org boundaries)
    • finishing versus starting (limiting WIP and focusing on top priorities)
  • With over two thousand people making changes, potentially several times a day, things could get very messy. To avoid this, MDTP uses the concept of an opinionated platform (also known as a paved road platform or guardrails). For instance, if you build a microservice, it must be written in Scala and use the Play framework. If your service needs persistence, it must use Mongo. If a user needs to perform a common action like uploading a file, then the team must use a common platform service to enable that when there is one. Essentially, a little bit of governance is baked into the platform itself.
  • As a result, after we select a candidate application or service for our DevOps initiative, we must identify all the members of the value stream who are responsible for working together to create value for the customers being served.
    • Product owner, development, QA, IT operations/SRE, Infosec, Release managers, Technology executive or value stream manager
  • After we identify our value stream members, our next step is to gain a concrete understanding of how work is performed, documented in the form of a value stream map.
  • Our goal is not to document every step and associated minutiae, but to sufficiently understand the areas in our value stream that are jeopardizing our goals of fast flow, short lead times, and reliable customer outcomes.
  • Each process block should include the lead time and process time for a work item to be completed, as well as the %C/A (percent complete and accurate) as measured by the downstream consumers of the output.
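A sketch of how these per-block measurements might be captured and rolled up for the whole value stream; the block names, numbers, and the multiplicative rollup of %C/A are illustrative assumptions, not figures from the book.

```python
from dataclasses import dataclass

@dataclass
class ProcessBlock:
    name: str
    lead_time_hrs: float      # ticket created -> work completed (includes waiting)
    process_time_hrs: float   # work started -> work completed
    pct_ca: float             # %C/A as judged by the downstream consumer (0-1)

value_stream = [
    ProcessBlock("Design & analysis", lead_time_hrs=80, process_time_hrs=16, pct_ca=0.8),
    ProcessBlock("Development",       lead_time_hrs=120, process_time_hrs=40, pct_ca=0.7),
    ProcessBlock("Test & deploy",     lead_time_hrs=60, process_time_hrs=10, pct_ca=0.9),
]

total_lead = sum(b.lead_time_hrs for b in value_stream)
total_process = sum(b.process_time_hrs for b in value_stream)

rolled_ca = 1.0
for b in value_stream:
    rolled_ca *= b.pct_ca   # chance work flows through every step without rework

print(f"lead time {total_lead}h, process time {total_process}h, "
      f"activity ratio {total_process / total_lead:.0%}, rolled %C/A {rolled_ca:.0%}")
```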
  • Create a dedicated transformation team
    • Assign members of the dedicated team to be solely allocated to the DevOps transformation efforts (as opposed to “maintain all your current responsibilities but spend 20% of your time on this new DevOps thing”).
    • Select team members who are generalists, who have skills across a wide variety of domains.
    • Select team members who have longstanding and mutually respectful relationships with key areas of the organization.
    • Create a separate physical space or virtual space (such as dedicated chat channels) for the dedicated team, if possible, to maximize communication flow within the team and to create some isolation from the rest of the organization.
  • If possible, we will free the transformation team from many of the rules and policies that restrict the rest of the organization, as National Instruments did (described in the previous chapter).
  • One of the most important parts of any improvement initiative is to define a measurable goal with a clearly defined deadline, between six months and two years in the future. It should require considerable effort but still be achievable.
  • Examples of improvement goals might include:
    • Reduce the percentage of the budget spent on product support and unplanned work by 50%.
    • Ensure lead time from code check-in to production release is one week or less for 95% of changes.
    • Ensure releases can always be performed during normal business hours with zero downtime.
    • Integrate all the required information security controls into the deployment pipeline to pass all required compliance requirements.
  • A typical iteration will be in the range of two to four weeks. For each iteration, the teams should agree on a small set of goals that generates value and makes some progress toward the long-term goal. At the end of each iteration, teams should review their progress and set new goals for the next iteration.
  • We will actively manage this technical debt by ensuring that we invest at least 20% of all Development and Operations capacity on refactoring and investing in automation work and architecture and non-functional requirements (NFRs).
  • If you’re in really bad shape today, you might need to make this 30% or even more of the resources. However, I get nervous when I find teams that think they can get away with much less than 20%.
  • In traditional IT Operations organizations, we often use functional orientation to organize our teams by their specialties. We put the database administrators in one group, the network administrators in another, the server administrators in a third, and so forth. One of the most visible consequences of this is long lead times, especially for complex activities like large deployments where we must open up tickets with multiple groups and coordinate work handoffs, resulting in our work waiting in long queues at every step.
  • The problem is exacerbated when each Operations functional area has to serve multiple value streams (i.e., multiple Development teams) who all compete for their scarce cycles. In order for Development teams to get their work done in a timely manner, we often have to escalate issues to a manager or director, and eventually to someone (usually an executive) who can finally prioritize the work against the global organizational goals instead of the functional silo goals.
  • In addition to long queues and long lead times, this situation results in poor handoffs, large amounts of rework, quality issues, bottlenecks, and delays. This gridlock impedes the achievement of important organizational goals, which often far outweigh the desire to reduce costs.
  • Broadly speaking, to achieve DevOps outcomes, we need to reduce the effects of functional orientation (“optimizing for cost”) and enable market orientation (“optimizing for speed”) so we can have many small teams working safely and independently, quickly delivering value to the customer.
  • These teams are designed to be cross-functional and independent—able to design and run user experiments, build and deliver new features, deploy and run their service in production, and fix any defects without manual dependencies on other teams, thus enabling them to move faster.
  • Instead, we will embed the functional engineers and skills (e.g., Ops, QA, Infosec) into each service team, or create a platform organization that provides an automated technology platform for service teams to self-serve everything they need to test, deploy, monitor, and manage their services in testing and production environments.
  • We can also achieve our desired DevOps outcomes through functional orientation, as long as everyone in the value stream views customer and organizational outcomes as a shared goal, regardless of where they reside in the organization.
  • What these organizations have in common is a high-trust culture that enables all departments to work together effectively, where all work is transparently prioritized and there is sufficient slack in the system to allow high-priority work to be completed quickly. This is, in part, enabled by automated self-service platforms that build quality into the products everyone is building.
  • One countermeasure is to enable and encourage every team member to be a generalist. We do this by providing opportunities for engineers to learn all the skills necessary to build and run the systems they are responsible for, and regularly rotating people through different roles. The term full-stack engineer is now commonly used (sometimes as a rich source of parody) to describe generalists who are familiar—at least have a general level of understanding—with the entire application stack (e.g., application code, databases, operating systems, networking, cloud).
  • By cross-training and growing engineering skills, generalists can do orders of magnitude more work than their specialist counterparts, and it also improves our overall flow of work by removing queues and wait time.
  • Another way to enable high-performing outcomes is to create stable service teams with ongoing funding to execute their own strategy and road map of initiatives.
  • Contrast this to the more traditional model where Development and Test teams are assigned to a “project” and then reassigned to another project as soon as the project is completed and funding runs out. This leads to all sorts of undesired outcomes, including developers being unable to see the long-term consequences of decisions they make (a form of feedback) and a funding model that only values and pays for the earliest stages of the software life cycle—which, tragically, is also the least expensive part for successful products or services.
  • Having architecture that is loosely coupled means that services can update in production independently, without having to update other services. Services must be decoupled from other services and, just as important, from shared databases (although they can share a database service, provided they don’t have any common schemas).
  • Bounded contexts ensure that services are compartmentalized and have well-defined interfaces, which also enable easier testing.
  • Best practices for teams
    • Trust and communication take time. They suggest it takes at least three months for team members to reach high performance, and suggest keeping teams together at least a year to benefit from their work together.
    • Just-right sizing. They suggest eight is an ideal number, which is similar to the two-pizza team used by Amazon, and note that 150 is an upper limit (citing Dunbar’s number).
    • Communication (can be) expensive. Skelton and Pais wisely point out that while within-team communication is good, any time teams have demands or constraints on other teams, it leads to queues, context switching, and overhead.
  • Types of teams
    • Stream-aligned teams: An end-to-end team that owns the full value stream. This is similar to the market orientation described here.
    • Platform teams: Platform teams create and support reusable technology often used by stream-aligned teams, such as infrastructure or content management. This team may often be a third party.
    • Enabling teams: This team contains experts who help other teams improve, such as a Center of Excellence.
    • Complicated-subsystem teams: Teams that own development and maintenance of a subcomponent of the system that is so complicated it requires specialist knowledge.
    • Other: The authors also touch on other team types, such as SRE (site reliability engineer) and service experience.
  • Farrall defined two types of Ops liaisons: the business relationship manager and the dedicated release engineer. The business relationship managers worked with product management, line-of-business owners, project management, Dev management, and developers. They became intimately familiar with product group business drivers and product road maps, acted as advocates for product owners inside of Operations, and helped their product teams navigate the Operations landscape to prioritize and streamline work requests.
  • Similarly, the dedicated release engineer became intimately familiar with the product’s Development and QA issues and helped product management get what they needed from the Ops organization to achieve their goals.
  • DevOps transformation strategies
    • Create self-service capabilities to enable developers in the service teams to be productive.
    • Embed Ops engineers into the service teams.
    • Assign Ops liaisons to the service teams when embedding Ops is not possible.
  • In almost all cases, we will not mandate that internal teams use these platforms and services—these platform teams will have to win over and satisfy their internal customers, sometimes even competing with external vendors. By creating this effective internal marketplace of capabilities, we help ensure that the platforms and services we create are the easiest and most appealing choices available (the path of least resistance).
  • Dianne Marsh, Director of Engineering Tools at Netflix, states that her team’s charter is to “support our engineering teams’ innovation and velocity. We don’t build, bake, or deploy anything for these teams, nor do we manage their configurations. Instead, we build tools to enable self-service. It’s okay for people to be dependent on our tools, but it’s important that they don’t become dependent on us.”
  • Because Operations is part of the product value stream, we should put the Operations work that is relevant to product delivery on the shared kanban board. This enables us to more clearly see all the work required to move our code into production, as well as keep track of all Operations work required to support the product. Furthermore, it enables us to see where Ops work is blocked and where work needs escalation, highlighting areas where we may need improvement.
  • In this step, we want developers to run production-like environments on their own workstations, created on demand and self-serviced. By doing this, developers can run and test their code in production-like environments as part of their daily work, providing early and constant feedback on the quality of their work.
  • Because we’ve carefully defined all aspects of the environment ahead of time, we are not only able to create new environments quickly but also ensure that these environments will be stable, reliable, consistent, and secure.
  • However, because delivering value to the customer requires both our code and the environments they run in, we need our environments in version control as well.
  • By putting all production artifacts into version control, our version control repository enables us to repeatedly and reliably reproduce all components of our working software system—this includes our applications and production environment, as well as all of our pre-production environments.
  • we must check in the following assets to our shared version control repository:
    • all application code and dependencies (e.g., libraries, static content, etc.)
    • any script used to create database schemas, application reference data, etc.
    • all the environment creation tools and artifacts described in the previous step (e.g., VMware or AMI images, Puppet, Chef, or Ansible scripts.)
    • any file used to create containers (e.g., Docker, Rocket, or Kubernetes definitions or composition files)
    • all supporting automated tests and any manual test scripts
    • any script that supports code packaging, deployment, database migration, and environment provisioning
    • all project documentation, notes, etc.
    • all cloud configuration files (e.g., AWS CloudFormation templates, Microsoft Azure Stack DSC files, OpenStack HEAT)
    • any other script or configuration information required to create infrastructure that supports multiple services (e.g., enterprise service buses, database management systems, DNS zone files, configuration rules for firewalls, and other networking devices)
  • For instance, we may store large virtual machine images, ISO files, compiled binaries, and so forth in artifact repositories (e.g., Nexus, Artifactory). Alternatively, we may put them in blob stores (e.g., Amazon S3 buckets) or put Docker images into Docker registries, and so forth.
  • It is not sufficient to merely be able to re-create any previous state of the production environment; we must also be able to re-create the entire pre-production and build processes as well. Consequently, we need to put into version control everything relied upon by our build processes, including our tools (e.g., compilers, testing tools) and the environments they depend upon.
  • The latter pattern is what has become known as immutable infrastructure, where manual changes to the production environment are no longer allowed—the only way production changes can be made is to put the changes into version control and re-create the code and environments from scratch.
  • Containers satisfy three key things: they abstract infrastructure (the dial-tone principle: you pick up the phone and it works without needing to know how it works), specialization (Operations could create containers that developers could use over and over and over again), and automation (containers can be built over and over again and everything will just work).
  • In other words, we will only accept development work as done when it can be successfully built, deployed, and confirmed that it runs as expected in a production-like environment, instead of merely when a developer believes it to be done. Ideally, it runs under a production-like load with a production-like dataset, long before the end of a sprint.
  • To prevent this scenario, we need fast automated tests that run within our build and test environments whenever a new change is introduced into version control.
  • Test types
    • Unit tests: These typically test a single method, class, or function in isolation, providing assurance to the developer that their code operates as designed.
    • Acceptance tests: These typically test the application as a whole to provide assurance that a higher level of functionality operates as designed (e.g., the business acceptance criteria for a user story, the correctness of an API), and that regression errors have not been introduced (i.e., we broke functionality that was previously operating correctly).
    • Integration tests: Integration tests are where we ensure that our application correctly interacts with other production applications and services, as opposed to calling stubbed out interfaces.
  • Integration tests are performed on builds that have passed our unit and acceptance tests. Because integration tests are often brittle, we want to minimize the number of integration tests and find as many of our defects as possible during unit and acceptance testing.
  • To detect this, we may choose to measure and make visible our test coverage (as a function of number of classes, lines of code, permutations, etc.), maybe even failing our validation test suite when it drops below a certain level (e.g., when less than 80% of our classes have unit tests).
  • Therefore, whenever we find an error with an acceptance or integration test, we should create a unit test that could find the error faster, earlier, and cheaper.
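As an illustration, if an acceptance test revealed that order totals ignored a discount, a developer might pin the defect with a fast unit test at the function level. The function and test names below are hypothetical; the tests can be run standalone or with a runner such as pytest.

```python
# Hypothetical pricing function that an acceptance test showed to be wrong.
def order_total(prices, discount_rate):
    subtotal = sum(prices)
    return round(subtotal * (1 - discount_rate), 2)

def test_discount_is_applied():
    # A unit test catches the same defect in milliseconds, long before the
    # slower end-to-end acceptance suite runs.
    assert order_total([10.00, 5.00], discount_rate=0.10) == 13.50

def test_no_discount_leaves_total_unchanged():
    assert order_total([10.00, 5.00], discount_rate=0.0) == 15.00

if __name__ == "__main__":
    test_discount_is_applied()
    test_no_discount_leaves_total_unchanged()
    print("unit tests passed")
```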
  • If we find that unit or acceptance tests are too difficult and expensive to write and maintain, it’s likely that we have an architecture that is too tightly coupled, where strong separation between our module boundaries no longer exists (or maybe never existed). In this case, we will need to create a more loosely coupled system so modules can be independently tested without integration environments.
  • Because we want our tests to run quickly, we need to design our tests to run in parallel, potentially across many different servers. We may also want to run different categories of tests in parallel. For example, when a build passes our acceptance tests, we may run our performance testing in parallel with our security testing, as shown in Figure 10.3.
  • To mitigate this, a small number of reliable, automated tests are almost always preferable over a large number of manual or unreliable automated tests. Therefore, we focus on automating only the tests that genuinely validate the business goals we are trying to achieve.
  • However, trunk-based development required them to build more effective automated testing. Gruver observed, “Without automated testing, continuous integration is the fastest way to get a big pile of junk that never compiles or runs correctly.”
  • Furthermore, they created a culture that halted all work anytime a developer broke the deployment pipeline, ensuring that developers quickly brought the system back into a green state.
  • This is how Ward Cunningham, developer of the first wiki, originally described technical debt: “when we do not aggressively refactor our codebase, it becomes more difficult to make changes and to maintain over time, slowing down the rate at which we can add new features.”
  • Solving this problem was one of the primary reasons behind the creation of continuous integration and trunk-based development practices, to optimize for team productivity over individual productivity.
  • Our countermeasure to large batch size merges is to institute continuous integration and trunk-based development practices, where all developers check their code into trunk at least once per day. Checking in code this frequently reduces our batch size to the work performed by our entire developer team in a single day.
  • We may even configure our deployment pipeline to reject any commits (e.g., code or environment changes) that take us out of a deployable state. This method is called gated commits, where the deployment pipeline first confirms that the submitted change will successfully merge, build as expected, and pass all the automated tests before actually being merged into trunk.
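A rough sketch of what a gated-commit check could do before allowing a change onto trunk. The build and test commands (make build, make test-fast) and the git workflow are assumptions, not a specific CI product’s behavior.

```python
import subprocess
import sys

def run(cmd):
    """Run a build/test command; a non-zero exit code fails the gate."""
    print(f"$ {cmd}")
    return subprocess.run(cmd, shell=True).returncode == 0

def gated_commit(candidate_branch):
    steps = [
        f"git merge --no-commit --no-ff {candidate_branch}",  # does it merge cleanly?
        "make build",                                         # does it build as expected?
        "make test-fast",                                      # do the fast automated tests pass?
    ]
    for step in steps:
        if not run(step):
            run("git merge --abort")
            print("Change rejected: trunk stays in a deployable state.")
            return False
    run("git commit -m 'Merge validated change into trunk'")
    return True

if __name__ == "__main__":
    sys.exit(0 if gated_commit(sys.argv[1]) else 1)
```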
  • Requirements for deployment pipeline:
    • By using the same deployment mechanism for every environment (e.g., development, test, and production), our production deployments are likely to be far more successful, since we know that the deployment has been successfully performed many times already earlier in the pipeline.
    • During the deployment process, we should test that we can connect to any supporting systems (e.g., databases, message buses, external services) and run a single test transaction through the system to ensure that our system is performing as designed. If any of these tests fail, we should fail the deployment (a minimal smoke-test sketch follows this list).
    • We must continually ensure that these environments remain synchronized.
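A minimal smoke-test sketch along the lines of the second requirement above: verify connectivity to supporting systems and run one test transaction, failing the deployment otherwise. The endpoints are placeholders.

```python
import sys
import urllib.request

# Hypothetical health and smoke-test endpoints checked right after a deployment.
CHECKS = {
    "application":      "https://example.internal/healthz",
    "database":         "https://example.internal/healthz/db",
    "message bus":      "https://example.internal/healthz/queue",
    "test transaction": "https://example.internal/smoke/checkout?dry_run=true",
}

def smoke_test():
    for name, url in CHECKS.items():
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                ok = resp.status == 200
        except OSError:
            ok = False
        print(f"{name:18} {'OK' if ok else 'FAILED'}")
        if not ok:
            return False
    return True

if __name__ == "__main__":
    # Fail the deployment if any supporting system or the test transaction fails.
    sys.exit(0 if smoke_test() else 1)
```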
  • We created a development and deployment process that removed the need for handoffs to DBAs by cross-training developers, automating schema changes, and executing them daily. We created realistic load testing against sanitized customer data, ideally running migrations every day. By doing this, we run our service hundreds of times with realistic scenarios before seeing actual production traffic.
  • The goal at Etsy has been to make it easy and safe to deploy into production with the fewest number of steps and the least amount of ceremony.
  • If all the tests were run sequentially, Sussman states that “the 7,000 trunk tests would take about half an hour to execute. So we split these tests up into subsets, and distribute those onto the 10 machines in our Jenkins [CI] cluster… Splitting up our test suite and running many tests in parallel, gives us the desired 11 minute runtime."
  • Deployment is the installation of a specified version of software to a given environment (e.g., deploying code into an integration test environment or deploying code into production). Specifically, a deployment may or may not be associated with a release of a feature to customers.
  • Release is when we make a feature (or set of features) available to all our customers or a segment of customers (e.g., we enable the feature to be used by 5% of our customer base). Our code and environments should be architected in such a way that the release of functionality does not require changing our application code.
  • Environment-based release patterns: This is where we have two or more environments that we deploy into, but only one environment is receiving live customer traffic (e.g., by configuring our load balancers). New code is deployed into a non-live environment, and the release is performed moving traffic to this environment. These are extremely powerful patterns because they typically require little or no change to our applications. These patterns include blue-green deployments, canary releases, and cluster immune systems, all of which will be discussed shortly.
  • Application-based release patterns: This is where we modify our application so that we can selectively release and expose specific application functionality by small configuration changes. For instance, we can implement feature flags that progressively expose new functionality in production to the development team, all internal employees, 1% of our customers, or, when we are confident that the release will operate as designed, our entire customer base. As discussed earlier, this enables a technique called dark launching, where we stage all the functionality to be launched in production and test it with production traffic before our release.
  • Dealing with database changes for blue-green deployment
    • Create two databases: During the release, we put the blue database into read-only mode, perform a backup of it, restore onto the green database, and finally switch traffic to the green environment. The problem with this pattern is that if we need to roll back to the blue version, we can potentially lose transactions if we don’t manually migrate them from the green version first.
    • Instead of supporting two databases, we decouple the release of database changes from the release of application changes by doing two things: First, we make only additive changes to our database; we never mutate existing database objects. Second, we make no assumptions in our application about which database version will be in production.
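A sketch of the additive-only approach using SQLite from the Python standard library: the migration only adds a nullable column, and the application tolerates either schema version. Table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

# Additive-only change: add a nullable column; never drop or rename existing
# objects, so the previous application version keeps working against the same database.
conn.execute("ALTER TABLE customers ADD COLUMN email TEXT")

def insert_customer(conn, name, email=None):
    # The application makes no assumption about which schema version is live:
    # it writes the new column only if that column actually exists.
    cols = {row[1] for row in conn.execute("PRAGMA table_info(customers)")}
    if "email" in cols and email is not None:
        conn.execute("INSERT INTO customers (name, email) VALUES (?, ?)", (name, email))
    else:
        conn.execute("INSERT INTO customers (name) VALUES (?)", (name,))

insert_customer(conn, "Ada", "ada@example.com")
print(conn.execute("SELECT name, email FROM customers").fetchall())
```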
  • The Canary release pattern
    • Facebook created groups of environments
      • A1 group: production servers that only serve internal employees.
      • A2 group: production servers that only serve a small percentage of customers and are deployed when certain acceptance criteria have been met (either automated or manual).
      • A3 group: the rest of the production servers, which are deployed after the software running in the A2 cluster meets certain acceptance criteria.
  • The cluster immune system expands upon the canary release pattern by linking our production monitoring system with our release process and by automating the rollback of code when the user-facing performance of the production system deviates outside of a predefined expected range, such as when the conversion rates for new users drops below our historical norms of 15%–20%.
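A sketch of the core loop of such a cluster immune system: compare a user-facing metric for the newly released code against the expected range and roll back automatically when it deviates. The telemetry source and release hooks are stand-ins, not a real monitoring API.

```python
EXPECTED_CONVERSION = (0.15, 0.20)   # historical norm cited above: 15%-20%

def check_canary(get_conversion_rate, rollback, promote):
    """Automate the release decision from production telemetry."""
    rate = get_conversion_rate()      # e.g., read from the monitoring system
    low, _high = EXPECTED_CONVERSION
    if rate < low:
        # User-facing performance deviated below the expected range: roll back.
        rollback(reason=f"conversion {rate:.1%} below expected {low:.0%}")
        return "rolled back"
    promote()                         # otherwise widen the release to more traffic
    return "promoted"

# Example wiring with stubbed telemetry and release hooks:
print(check_canary(lambda: 0.12,
                   rollback=lambda reason: print("rollback:", reason),
                   promote=lambda: print("promote")))
```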
  • Feature flags enable the following (a minimal toggle sketch follows this list):
    • Roll back easily
    • Gracefully degrade performance: When our service experiences extremely high loads that would normally require us to increase capacity or, worse, risk having our service fail in production, we can use feature toggles to reduce the quality of service. In other words, we can increase the number of users we serve by reducing the level of functionality delivered (e.g., reduce the number of customers who can access a certain feature, disable CPU-intensive features such as recommendations, etc.).
    • Increase our resilience through a service-oriented architecture: If we have a feature that relies on another service that isn’t complete yet, we can still deploy our feature into production but hide it behind a feature toggle. When that service finally becomes available, we can toggle the feature on. Similarly, when a service we rely upon fails, we can turn off the feature to prevent calls to the downstream service while keeping the rest of the application running.
    • Perform A/B testing: Modern feature toggle frameworks, such as LaunchDarkly, Split, and Optimizely, also enable product teams to run experiments to test new features and see their impact on business metrics. In this way, we can demonstrate a causal relationship between new features and the outcomes we care about. This is an incredibly powerful tool that enables a scientific, hypothesis-driven approach to product development.
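A minimal in-process feature-toggle sketch illustrating several of the patterns above (instant off switch, dark launch to employees, percentage rollout). Production systems typically use a feature-flag framework such as those named above; the flag names and configuration shape here are hypothetical.

```python
import hashlib

# Hypothetical flag configuration; in practice this lives in a flag service or config store.
FLAGS = {
    "new_checkout":    {"enabled": True, "percent": 5,   "employees_only": False},
    "recommendations": {"enabled": True, "percent": 100, "employees_only": False},
}

def is_on(flag, user_id, is_employee=False):
    cfg = FLAGS.get(flag)
    if not cfg or not cfg["enabled"]:
        return False                  # toggled off: an instant, deployment-free rollback
    if cfg["employees_only"] and not is_employee:
        return False                  # dark launch: visible to internal users only
    # Stable hash so each user consistently falls in or out of the rollout cohort.
    bucket = int(hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < cfg["percent"]

def handle_request(user_id):
    # Graceful degradation: under extreme load, setting enabled=False for
    # "recommendations" sheds the CPU-intensive feature without a deployment.
    show_recs = is_on("recommendations", user_id)
    checkout = "new checkout" if is_on("new_checkout", user_id) else "old checkout"
    return checkout, show_recs

print(handle_request("user-42"))
```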
  • To address these issues, the CSG team took a multi-pronged approach. First, they created a bias for action and culture change by applying the learning from John Shook’s Model of Change: “Change Behavior to Change Culture.”
  • In the case of eBay, when they needed to re-architect, they would first do a small pilot project to prove to themselves that they understood the problem well enough to even undertake the effort.
  • What Shoup’s team did at eBay is a textbook example of evolutionary design, using a technique called the strangler fig application pattern—instead of “ripping out and replacing” old services with architectures that no longer support our organizational goals, we put the existing functionality behind an API and avoid making further changes to it. All new functionality is then implemented in the new services that use the new desired architecture, making calls to the old system when necessary.
  • The consequences of overly tight architectures are easy to spot: every time we attempt to commit code into trunk or release code into production, we risk creating global failures (e.g., we break everyone else’s tests and functionality or the entire site goes down). To avoid this, every small change requires enormous amounts of communication and coordination over days or weeks, as well as approvals from any group that could potentially be affected.
  • a loosely coupled architecture with well-defined interfaces that enforce how modules connect with each other promotes productivity and safety. It enables small, productive teams that are able to make small changes that can be safely and independently deployed. And because each service also has a well-defined API, it enables easier testing of services and the creation of contracts and SLAs between teams.
  • Monitoring
    • Data collection at the business logic, application, and environments layer: In each of these layers, we are creating telemetry in the form of events, logs, and metrics. Logs may be stored in application-specific files on each server (e.g., /var/log/httpd-error.log), but preferably we want all our logs sent to a common service that enables easy centralization, rotation, and deletion.
    • An event router responsible for storing our events and metrics: This capability potentially enables visualization, trending, alerting, anomaly detection, and so forth. By collecting, storing, and aggregating all our telemetry, we better enable further analysis and health checks. This is also where we store configurations related to our services (and their supporting applications and environments) and is likely where we do threshold-based alerting and health checks. Examples of tools in this space include Prometheus, Honeycomb, DataDog, and Sensu. (A minimal telemetry-emission sketch follows this list.)
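A small sketch of the first layer: application code emitting structured events as JSON log lines that a shipper could forward to a central telemetry service. The event and field names are illustrative.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("telemetry")

def emit(event, **fields):
    """Emit one structured telemetry event; a log shipper forwards stdout to the central service."""
    log.info(json.dumps({"event": event, "ts": time.time(), **fields}))

start = time.monotonic()
# ... business logic for a checkout request would run here ...
emit("checkout_completed",
     duration_ms=round((time.monotonic() - start) * 1000, 2),  # application-level timing
     order_value=42.50,                                         # business-level telemetry
     feature_flag="new_checkout")                               # context for later analysis
```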
  • This is often referred to as an information radiator, defined by the Agile Alliance as “the generic term for any of a number of handwritten, drawn, printed, or electronic displays which a team places in a highly visible location, so that all team members as well as passers-by can see the latest information at a glance: count of automated tests, velocity, incident reports, continuous integration status, and so on.” This idea originated as part of the Toyota Production System.
  • Example telemetry:
    • Business level: Examples include the number of sales transactions, revenue of sales transactions, user sign-ups, churn rate, A/B testing results, etc.
    • Application level: Examples include transaction times, user response times, application faults, etc.
    • Infrastructure level (e.g., database, operating system, networking, storage): Examples include web server traffic, CPU load, disk usage, etc.
    • Client software level (e.g., JavaScript on the client browser, mobile application): Examples include application errors and crashes, user-measured transaction times, etc.
    • Deployment pipeline level: Examples include build pipeline status (e.g., red or green for our various automated test suites), change deployment lead times, deployment frequencies, test environment promotions, and environment status.
  • Ideally, anyone viewing our information radiators will be able to make sense of the information we are showing in the context of desired organizational outcomes, such as goals around revenue, user attainment, conversion rates, etc. We should define and link each metric to a business outcome metric at the earliest stages of feature definition and development, and measure the outcomes after we deploy them in production.
  • As Jody Mulkey, CTO of Ticketmaster/LiveNation, observes, “Instead of measuring Operations against the amount of downtime, I find it’s much better to measure both Dev and Ops against the real business consequences of downtime: how much revenue should we have attained, but didn’t.”
  • Outlier detection
    • Defined by Victoria Hodge and Jim Austin of the University of York as detecting “abnormal running conditions from which significant performance degradation may well result, such as an aircraft engine rotation defect or a flow problem in a pipeline.”
    • We can automatically flag misbehaving nodes without having to actually define what the ‘proper’ behavior is in any way. And since we’re engineered to run resiliently in the cloud, we don’t tell anyone in Operations to do something—instead, we just kill the sick or misbehaving compute node, and then log it or notify the engineers in whatever form they want.
  • A common use of standard deviations is to periodically inspect the data set for a metric and alert if it has significantly varied from the mean. For instance, we may set an alert for when the number of unauthorized login attempts per day is three standard deviations greater than the mean. Provided that this data set has a Gaussian distribution, we would expect that only 0.3% of the data points would trigger the alert.
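A sketch of that alerting rule using Python’s standard library; the daily counts are made up.

```python
import statistics

failed_logins_per_day = [22, 19, 25, 21, 18, 24, 20, 23, 19, 71]  # last value is today

history, today = failed_logins_per_day[:-1], failed_logins_per_day[-1]
mean = statistics.mean(history)
stdev = statistics.stdev(history)

# Alert when today's count is more than three standard deviations above the mean.
threshold = mean + 3 * stdev
if today > threshold:
    print(f"ALERT: {today} unauthorized login attempts "
          f"(mean {mean:.1f}, threshold {threshold:.1f})")
```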
  • Scryer uses outlier detection to throw out spurious data points and then uses techniques such as fast Fourier transform (FFT) and linear regression to smooth the data while preserving the legitimate traffic spikes that recur in the data. The result is that Netflix can forecast traffic demand with surprising accuracy.
  • Smoothing often involves using moving averages (or rolling averages), which transform our data by averaging each point with all the other data within our sliding window. This has the effect of smoothing out short-term fluctuations and highlighting longer-term trends or cycles.
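A minimal moving-average smoother over a sliding window; the window size and data points are illustrative.

```python
from collections import deque

def moving_average(points, window=3):
    """Smooth a series by averaging each point with the others in a sliding window."""
    buf, smoothed = deque(maxlen=window), []
    for p in points:
        buf.append(p)
        smoothed.append(sum(buf) / len(buf))
    return smoothed

raw = [100, 102, 180, 104, 101, 99, 250, 103]   # spiky request counts
print(moving_average(raw))  # short-term fluctuations are damped; the longer-term trend remains
```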
  • Although fixing forward can often be dangerous, it can be extremely safe when we have automated testing, fast deployment processes, and sufficient telemetry that allows us to quickly confirm whether everything is functioning correctly in production.
  • This is an example of how upstream work centers can locally optimize for themselves but actually degrade performance for the entire value stream. To prevent this from happening, we will have everyone in the value stream share the downstream responsibilities of handling operational incidents. We can do this by putting developers, development managers, and architects on pager rotation, just as Pedro Canahuati, Facebook Director of Production Engineering, did in 2009. This ensures everyone in the value stream gets visceral feedback on any upstream architectural and coding decisions they make.
  • when developers get feedback on how their applications perform in production, which includes fixing it when it breaks, they become closer to the customer. This creates a buy-in that everyone in the value stream benefits from.
  • Launch readiness review / handoff readiness review: The LRR must be performed and approved before any new Google service is made publicly available to customers and receives live production traffic, while the HRR is performed when the service is transitioned to an Ops-managed state, usually months after the LRR. The LRR and HRR checklists are similar, but the HRR is far more stringent and has higher acceptance standards, while the LRR is self-reported by the product team.
  • Any product team going through an LRR or HRR has an SRE assigned to them to help them understand the requirements and to help them achieve those requirements.
  • Scott Cook, the founder of Intuit, has long advocated building a culture of innovation, encouraging teams to take an experimental approach to product development, and exhorting leadership to support them. As he said, “Instead of focusing on the boss’s vote … the emphasis is on getting real people to really behave in real experiments and basing your decisions on that.”
  • One of the core beliefs in the Toyota Production System is that “people closest to a problem typically know the most about it.” This becomes more pronounced as the work being performed and the system the work occurs in become more complex and dynamic, as is typical in DevOps value streams. In these cases, creating approval steps from people who are located further and further away from the work may actually reduce the likelihood of success. As has been proven time and again, the further the distance between the person doing the work (i.e., the change implementer) and the person deciding to do the work (i.e., the change authorizer), the worse the outcome.
  • They developed a dashboard that looked at every release from three different angles: system level (how my product is doing), value stream (upstream/downstream dependencies, etc.), and environment (platform, events, etc.). Checking across these three angles, the dashboard gives a clear go/no-go for release.
  • Guidelines for code reviews include:
    • Everyone must have someone to review their changes (e.g., to the code, environment, etc.) before committing to trunk.
    • Everyone should monitor the commit stream of their fellow team members so that potential conflicts can be identified and reviewed.
    • Define which changes qualify as high risk and may require review from a designated subject matter expert (e.g., database changes, security-sensitive modules such as authentication, etc.).
    • If someone submits a change that is too large to reason about easily—in other words, you can’t understand its impact after reading through it a couple of times, or you need to ask the submitter for clarification—it should be split up into multiple, smaller changes that can be understood at a glance.
  • For an organization where developers are not getting the needed attention for code reviews: To fix the problem and eliminate all of these delays, they ended up dismantling the entire Gerrit code review process, instead requiring pair programming to implement code changes into the system. By doing this, they reduced the amount of time required to get code reviewed from weeks to hours. Hendrickson is quick to note that code review works fine in many organizations, but it requires a culture that values reviewing code as highly as it values writing the code in the first place.
  • On the other hand, when asked to describe a great pull request that indicates an effective review process, Tomayko quickly listed off the essential elements: there must be sufficient detail on why the change is being made, how the change was made, as well as any identified risks and resulting countermeasures.
  • As Adrian Cockcroft observed, “A great metric to publish widely is how many meetings and work tickets are mandatory to perform a release—the goal is to relentlessly reduce the effort required for engineers to perform work and deliver it to the customer.”
  • Consider a story that John Allspaw told about a newly hired junior engineer: The engineer asked if it was okay to deploy a small HTML change, and Allspaw responded, “I don’t know, is it?” He then asked “Did you have someone review your change? Do you know who the best person to ask is for changes of this type? Did you do everything you absolutely could to assure yourself that this change operates in production as designed? If you did, then don’t ask me—just make the change!”
  • By responding this way, Allspaw reminded the engineer that she was solely responsible for the quality of her change—if she did everything she felt she could to give herself confidence that the change would work, then she didn’t need to ask anyone for approval, she should make the change.
  • If accidents are not caused by “bad apples” but rather are due to inevitable design problems in the complex system that we created, then instead of “naming, blaming, and shaming” the person who caused the failure, our goal should always be to maximize opportunities for organizational learning, continually reinforcing that we value actions that expose and share more widely the problems in our daily work.
  • As John Allspaw, CTO of Etsy, states, “Our goal at Etsy is to view mistakes, errors, slips, lapses, and so forth with a perspective of learning.”
  • Types of people that need to be at a retrospective meeting:
    • the people involved in decisions that may have contributed to the problem
    • the people who identified the problem
    • the people who responded to the problem
    • the people who diagnosed the problem
    • the people who were affected by the problem
    • anyone else who is interested in attending the meeting
  • They described how organizations are typically structured in one of two models: a standardized model, where routine and systems govern everything, including strict compliance with timelines and budgets, or an experimental model, where every day, every exercise, and every piece of new information is evaluated and debated in a culture that resembles a research and development (R&D) laboratory.
  • To reinforce our culture of learning and calculated risk-taking, we need leaders to continually reinforce that everyone should feel both comfortable with and responsible for surfacing and learning from failures.
  • Robbins defines resilience engineering as “an exercise designed to increase resilience through large-scale fault injection across critical systems.”
  • Their standard incident analysis was a structured process to help them understand what happened and identify opportunities for improvement. They did this by understanding the timeline of the incident; asking, “What happened? How can we detect it sooner? How can we recover sooner? What went well?”; understanding system behavior; and maintaining a blameless culture that avoids finger-pointing.
  • If we are not able to build everything off a single source tree, then we must find another means to maintain known good versions of the libraries and their dependencies. For instance, we may have an organization-wide repository such as Nexus, Artifactory, or a Debian or RPM repository, which we must then update when there are known vulnerabilities, both in these repositories and in production systems.
  • Ideally, each library will have a single owner or a single team supporting it, representing where knowledge and expertise for the library resides. Furthermore, we should (ideally) only allow one version to be used in production, ensuring that whatever is in production leverages the best collective knowledge of the organization.
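A minimal sketch of enforcing that single-approved-version policy, assuming an application can report its installed dependency versions; the approved-versions mapping is illustrative, not a real organizational feed.

```python
# Minimal sketch: compare the dependency versions an application reports
# against an organization-wide list of approved ("known good") versions.
APPROVED = {"requests": "2.31.0", "cryptography": "42.0.5"}   # illustrative only

def audit_dependencies(installed: dict[str, str]) -> list[str]:
    findings = []
    for name, version in installed.items():
        approved = APPROVED.get(name)
        if approved is None:
            findings.append(f"{name}: not an approved library")
        elif version != approved:
            findings.append(f"{name}: {version} installed, {approved} approved")
    return findings

print(audit_dependencies({"requests": "2.19.0", "left-pad": "1.3.0"}))
```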
  • Our goal is to identify the technologies that:
    • impede or slow down the flow of work
    • disproportionately create high levels of unplanned work
    • disproportionately create large numbers of support requests
    • are most inconsistent with our desired architectural outcomes (e.g., throughput, stability, security, reliability, business continuity)
  • By removing these problematic infrastructures and platforms from the technologies supported by Ops, we enable everyone to focus on infrastructure that best helps achieve the global goals of the organization.
  • They found that the key to providing guidance versus governance was that it needed to be accessible (everyone can contribute), transparent (everyone should be able to see), flexible (easy to change), and cultural (community-driven) in the simplest way possible. Ultimately, guidance should be there to empower engineers, not constrain them.
  • As changes are localized and easy for teams to adjust, reversibility is easy and flexibility is high. For example, switching from Python to Golang for a given product’s API is considered highly flexible and easy to reverse. Changing a cloud provider or decommissioning a data center, on the other hand, is rigid and has an extremely large blast radius.
  • For changes with a “steep cost,” the CIO is brought into the process. Any engineer can then pitch their idea directly to the CIO and a group of senior leaders.
  • Tools such as Gauntlt have been designed to integrate into deployment pipelines, running automated security tests against our applications, our application dependencies, our environment, etc.
  • All developers should have their own PGP key, perhaps created and managed in a system such as keybase.io. All commits to version control should be signed—that is straightforward to configure using the open-source tools gpg and git. Furthermore, all packages created by the CI process should be signed, and their hash recorded in the centralized logging service for audit purposes.
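A hedged sketch of two of those controls: checking a commit’s signature with `git verify-commit` (a standard git subcommand) and recording an artifact’s SHA-256 hash for audit. The audit-log function is a hypothetical placeholder for whatever centralized logging service is in use.

```python
# Minimal sketch: verify a commit's GPG signature and record a build
# artifact's SHA-256 for later audit. send_to_audit_log() is a hypothetical
# stand-in for the centralized logging service.
import hashlib
import subprocess

def commit_is_signed(sha: str) -> bool:
    # `git verify-commit` exits non-zero if the signature is missing or invalid.
    result = subprocess.run(["git", "verify-commit", sha], capture_output=True)
    return result.returncode == 0

def send_to_audit_log(event: dict) -> None:
    print("AUDIT", event)   # hypothetical: ship to the logging service instead

def record_artifact_hash(path: str) -> str:
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    send_to_audit_log({"artifact": path, "sha256": digest})
    return digest
```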
  • Security-related telemetry:
    • Abnormal production program terminations (e.g., segmentation faults, core dumps, etc.): “Of particular concern was why certain processes kept dumping core across our entire production environment, triggered from traffic coming from the one IP address, over and over again. Of equal concern were those HTTP ‘500 Internal Server Errors.’ These are indicators that a vulnerability was being exploited to gain unauthorized access to our systems, and that a patch needs to be urgently applied."
    • Database syntax errors: “We were always looking for database syntax errors in our code–these either enabled SQL injection attacks or were actual attacks in progress.”
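A minimal sketch of turning that telemetry into an alert, assuming log events have already been parsed into records with a source IP and an event type; the field names and threshold are assumptions.

```python
# Minimal sketch: count core dumps, HTTP 500s, and SQL syntax errors per
# source IP from already-parsed log events and flag repeat offenders.
from collections import Counter

SUSPICIOUS = {"segfault", "core_dump", "http_500", "sql_syntax_error"}

def flag_suspicious_ips(events: list[dict], threshold: int = 10) -> list[str]:
    hits = Counter(e["source_ip"] for e in events if e["type"] in SUSPICIOUS)
    return [ip for ip, count in hits.items() if count >= threshold]

events = [{"source_ip": "203.0.113.7", "type": "http_500"}] * 12
print(flag_suspicious_ips(events))   # ['203.0.113.7']
```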
  • To protect continuous build, integration, or deployment pipelines:
    • hardening continuous build and integration servers and ensuring we can reproduce them in an automated manner, just as we would for infrastructure that supports customer-facing production services, to prevent our continuous build and integration servers from being compromised
    • reviewing all changes introduced into version control, either through pair programming at commit time or by a code review process between commit and merge into trunk, to prevent continuous integration servers from running uncontrolled code
    • instrumenting our repository to detect when test code containing suspicious API calls (e.g., unit tests accessing the file system or network) is checked in, perhaps quarantining it and triggering an immediate code review (a minimal detection sketch follows this list)
    • ensuring every CI process runs on its own isolated container or VM, and ensuring this is recreated from a known, good, verified base image at the start of every build
    • ensuring the version control credentials used by the CI system are read-only
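A minimal sketch of the suspicious-API-call check referenced above, assuming tests live under a tests/ directory and follow the test_*.py naming convention; the list of “suspicious” modules is an assumption and would need tuning per organization.

```python
# Minimal sketch: scan test files for imports that unit tests normally should
# not need (raw sockets, subprocess execution, ctypes) and fail the build so a
# human reviews the change before it merges.
import ast
import pathlib
import sys

SUSPICIOUS_IMPORTS = {"socket", "subprocess", "ctypes"}   # assumption: tune per org

def suspicious_imports(test_file: pathlib.Path) -> set[str]:
    tree = ast.parse(test_file.read_text())
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found & SUSPICIOUS_IMPORTS

if __name__ == "__main__":
    flagged = {}
    for path in pathlib.Path("tests").rglob("test_*.py"):   # assumed layout
        found = suspicious_imports(path)
        if found:
            flagged[str(path)] = sorted(found)
    if flagged:
        print("Quarantine and review:", flagged)
        sys.exit(1)   # non-zero exit fails the CI step
```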
  • In the old way, with Dev handing off production-ready code to Security for testing, we had a major bottleneck in the throughput of the Security team. For large organizations that operate at scale, it can be really hard to find enough Security talent to continually test everything that is developed. Building the security tests into the development pipeline unlocked a lot more productivity for us and reduced our dependence on Security personnel for standard testing and routine deployments.
  • One way to support an assertion that our changes are low risk is to show a history of changes over a significant time period (e.g., months or quarters) and provide a complete list of production issues during that same period. If we can show high change success rates and low MTTR, we can assert that we have a control environment that is effectively preventing deployment errors, as well as prove that we can effectively and quickly detect and correct any resulting problems.
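A small sketch of producing that evidence from change and incident records; the record shapes are illustrative and not tied to any particular tooling.

```python
# Minimal sketch: compute change success rate and MTTR over a period from
# change and incident records with illustrative field names.
from datetime import datetime, timedelta

def change_success_rate(changes: list[dict]) -> float:
    return sum(c["succeeded"] for c in changes) / len(changes)

def mean_time_to_restore(incidents: list[dict]) -> timedelta:
    durations = [i["resolved_at"] - i["detected_at"] for i in incidents]
    return sum(durations, timedelta()) / len(durations)

changes = [{"succeeded": True}] * 97 + [{"succeeded": False}] * 3
incidents = [{"detected_at": datetime(2021, 3, 1, 10, 0),
              "resolved_at": datetime(2021, 3, 1, 10, 25)}]
print(change_success_rate(changes), mean_time_to_restore(incidents))
```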
  • We can almost certainly automate the creation of complete and accurate RFCs, populating the ticket with details of exactly what is to be changed. For instance, we could automatically create a ServiceNow change ticket with a link to the JIRA user story, along with the build manifests and test output from our deployment pipeline tool and links to the scripts that will be run and the dry run output of these commands.
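A hedged sketch of that automation against ServiceNow’s REST Table API (a POST to the change_request table); the instance URL, credentials, and field names are placeholders that would need to match the actual instance schema.

```python
# Minimal sketch: create a ServiceNow change ticket from pipeline metadata.
# Instance URL, credentials, and fields are placeholders, not real values.
import requests

def create_change_ticket(jira_url: str, build_manifest_url: str, dry_run_log_url: str) -> str:
    payload = {
        "short_description": "Automated deployment change request",
        "description": (
            f"User story: {jira_url}\n"
            f"Build manifest and test output: {build_manifest_url}\n"
            f"Dry-run output of deployment scripts: {dry_run_log_url}"
        ),
    }
    response = requests.post(
        "https://example.service-now.com/api/now/table/change_request",  # placeholder instance
        json=payload,
        auth=("pipeline-bot", "<api-token>"),   # placeholder credentials
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["result"]["sys_id"]   # ticket identifier for the audit trail
```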
  • One of the main themes of the Salesforce transformation was to make quality engineering everyone’s job, regardless of whether they were part of Development, Operations, or Infosec. To do this, they integrated automated testing into all stages of the application and environment creation, as well as into the continuous integration and deployment process, and created the open-source tool Rouster to conduct functional testing of their Puppet modules.
  • Consequently, wherever possible, we should avoid requiring separation of duties as a control. Instead, we should choose controls such as pair programming, continuous inspection of code check-ins, and code review. These controls can give us the necessary reassurance about the quality of our work. Furthermore, by putting these controls in place, if separation of duties is required, we can show that we achieve equivalent outcomes with the controls we have created.
  • First, they worked backwards from the customer’s needs. Second, they were determined to deliver value iteratively to maximize learning and minimize risk. And third, they wanted to avoid anchoring bias.
  • productivity measurements
    • It includes five dimensions: satisfaction and well-being, performance, activity, communication and collaboration, and efficiency and flow. By including measures across at least three dimensions in the framework, teams and organizations can measure developer productivity in ways that more accurately capture productivity, will have a better understanding of how individuals and teams work, and will have superior information to enable better decisions.
  • We may choose to measure our incidents by the following metrics (a small computation sketch follows the list):
    • Event severity: How severe was this issue? This directly relates to the impact on the service and our customers.
    • Total downtime: How long were customers unable to use the service to any degree?
    • Time to detect: How long did it take for us or our systems to know there was a problem?
    • Time to resolve: How long after we knew there was a problem did it take for us to restore service?
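A small sketch computing those four metrics from a single incident record; the field names are illustrative.

```python
# Minimal sketch: derive total downtime, time to detect, and time to resolve
# from an incident record; severity is carried through as-is.
from datetime import datetime

incident = {
    "severity": "SEV-2",
    "started_at":  datetime(2021, 5, 4, 9, 0),    # customer impact begins
    "detected_at": datetime(2021, 5, 4, 9, 12),   # we (or our systems) notice
    "resolved_at": datetime(2021, 5, 4, 9, 47),   # service restored
}

total_downtime = incident["resolved_at"] - incident["started_at"]
time_to_detect = incident["detected_at"] - incident["started_at"]
time_to_resolve = incident["resolved_at"] - incident["detected_at"]
print(incident["severity"], total_downtime, time_to_detect, time_to_resolve)
```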
