I’ve now completed the full draft of the book. We’re getting technical reviews, and have started having the diagrams professionally designed. It’s awesome to see my crude attempts at diagrams turned into clean, slick images!
We’ve published an update with several new chapters, and revisions to previous chapters based on feedback. I posted some details on the update, and how I’m finding the process so far, on the Infrastructure as Code book blog.
I’ve worked with many infrastructure teams who’ve adopted virtualization, cloud, and automated configuration tools, but struggle to keep everything patched, consistent, and generally as well managed as they’d like. I’ve also worked with many teams who’ve found very effective ways to work with these technologies, and have met and spoken with many others besides. I’ve decided to share what I’ve learned from these teams in book form, and O’Reilly have agreed to publish it!
We should have an early release available in the next few weeks, with a draft of the first three chapters of the book. We’ll then put out updates and new chapters as I go. Hopefully I’ll get good feedback so the end product will be a more useful resource than it would be if I banged the whole thing out in a dark corner.
Here are a few links:
Please sign up for the mailing list, and/or subscribe to the feed on the book website, if you’re interested in finding out more about the book.
But large clients who are being eaten alive by newer companies like Amazon (especially Amazon) also raise the concern that they are too big, too important, and work with too much money and sensitive customer data to risk working like a wild new startup. There’s an implication that the new generation of companies built on the Internet, and those who have successfully shifted into the new way of working, don’t have the heavy responsibilities of established companies, and so aren’t as rigorous in how they operate.
But I’ve spent too much time inside those established companies to take this objection seriously. The way large companies, even global financial institutions, run their IT is sausage-factory stuff, wrapped up in complicated hierarchies, exhaustive process documentation, and expensive brand-name software.
This enterprise complexity destroys quality and effectiveness, but is useful for ass-covering. It divides and obscures responsibility, and gives cover when things go wrong. “Sorry we shut down the bank for 3 days - we were following industry best practices, though! We’ll add more stages to the process to make sure it doesn’t happen again.”
Which brings us to the recently published book, Lean Enterprise: How High Performance Organizations Innovate at Scale. Written by three of my colleagues (one now former), it goes beyond agile software development to look at larger organizational challenges like risk management, marketing, culture, governance, and financial management. All the boring, serious stuff that is necessary to make a company work.
Agile software development is about stripping out the unnecessary, and doubling down on the things that really work, like testing. Historically, successful businesses have done this from top to bottom, end to end. All of the garbage that has come to be seen as the way serious businesses run is a sign of a stagnating organization that doesn’t understand how to perform well, so is settling for mediocrity. Joanne, Barry, and Jez have written a book that will hopefully inspire people to question business as usual, and think about business as more.
Virtualization and cloud (IaaS, Infrastructure as a Service, in particular) have forced the need for automation of some kind. In the old days, the “Iron Age” of IT, infrastructure growth was limited by the hardware purchasing cycle. Since it would take weeks for a new server to arrive, there was little pressure to rapidly install and configure an operating system on it. We would slot in a CD and follow our checklist, and a few days later it would be ready.
But the ability to spin up new virtual machines in minutes required us to get a lot better at automating this process. Server image templates and cloning helped get us over the hump. But now we had a new problem. Because we could, assuming enough overall capacity, spin up new VMs at the drop of a hat, we found ourselves with an ever-growing portfolio of servers. The need to keep a constantly growing and changing number of servers up to date, while avoiding Configuration Drift, spawned new tools.
CFEngine, Puppet, and Chef established a new category of infrastructure automation tool, quickly taken up by the early adopters: those nimble organisations who were taking full advantage of IaaS cloud as it emerged. These organisations, whose IT was typically built around Agile and Lean mindsets, evolved “Infrastructure as Code” practices for managing their automated infrastructure.
The essence of Infrastructure as Code is to treat the configuration of systems the same way that software source code is treated. Source code management systems, Test Driven Development (TDD), Continuous Integration (CI), refactoring, and other XP practices are especially useful for making sure that changes to infrastructure are thoroughly tested, repeatable, and transparent.
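To make this concrete, here’s a minimal sketch of the kind of automated check a team might run from CI against a freshly built test server whenever an infrastructure change is committed. It’s written in plain Python rather than any particular infrastructure testing framework, and the package name and port are hypothetical examples.

```python
#!/usr/bin/env python3
"""Illustrative infrastructure test, run by CI against a freshly built test server.

The package name, service, and port below are hypothetical examples.
"""
import socket
import subprocess
import sys


def package_installed(name: str) -> bool:
    # Query dpkg on a Debian/Ubuntu host; other platforms would need a different check.
    result = subprocess.run(["dpkg", "-s", name], capture_output=True)
    return result.returncode == 0


def port_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    # A service that should be running ought to accept TCP connections.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def main() -> int:
    failures = []
    if not package_installed("nginx"):
        failures.append("package 'nginx' is not installed")
    if not port_listening("localhost", 80):
        failures.append("nothing is listening on port 80")

    for failure in failures:
        print("FAIL:", failure)
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main())
```

Because a check like this lives in version control next to the configuration it validates, every change to the infrastructure definition can be tested automatically before it goes anywhere near production.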
As more traditional organisations have adopted virtualization - generally on in-house infrastructure rather than public clouds - they’ve felt the same need for automation to manage their systems. But although some have explored the toolsets used by the early adopters, many turn to traditional vendors of so-called enterprise management toolsets, who have moved to adapt and rebrand their software to catch the latest waves in the industry (“Now with DevOps!”).
The problem is that few of these toolsets are designed to support Infrastructure as Code. Yes, they do automate things. Once you point and click your way through their GUI to create a server template, you can create identical instances to your heart’s content. But when you go back and make tweaks to your template, you don’t have a traceable, easily understood record of the change. You can’t automatically trigger testing of each change, using validation tools from multiple vendors, open source projects, and in-house groups.
In short, rather than using intensive, automatically enforced extreme change management you’re stuck with old-school, manual, “we’d do it more thoroughly if we had time” change management.
Infrastructure automation makes it possible to carry out actions repeatedly, across a large number of nodes. Infrastructure as code uses techniques, practices, and tools from software development to ensure those actions are thoroughly tested before being applied to business critical systems.
Here are some guidelines for choosing configuration management tools that support Infrastructure as Code:
Without the ability to ensure that every change is quickly and easily tested as a matter of course, we’re forced to rely on people to take the time to manually set up and run tests, even when they’re under pressure. Without visibility and openness of configuration changes, we end up locked into the limited toolset of a single vendor, and deprive ourselves of a huge ecosystem of tools for managing software changes.
The defining characteristic of our move beyond the “Iron Age” into the “Cloud Era” is that infrastructure can now be treated like software. Ensuring we’re able to bring the most effective software development practices to bear is the key to getting the most value out of this shift.
The VMs were delivered six weeks later.
They were configured differently from one another.
One of them was broken - the application server failed to start due to permission errors.
Sad fact: this is the norm in our industry
Why did it take so long to deliver virtual machines, and why could they not deliver consistent, working machines? The IT department had the most expensive virtualisation platform money could buy, and used two different big-brand, top-dollar enterprise automation products for provisioning and configuring systems and software.
The organisation also had the most comprehensive, SOX-certified, ITIL-compliant processes you could have for change management and IT service delivery. That’s why it took six weeks; there were at least five different departments involved in creating the VM and configuring the operating system, user accounts, application server, and networking. The process involved CAB meetings, detailed service request documents, handover documents, and security audits.
This is not an unusual story. I’ve worked in and spoken with dozens of enterprise IT organisations over the past few years, and this kind of thing is painfully common. In fact, it’s the norm. People in large organisations take this for granted. “Of course things take a while, and there are communication issues. We’re big and complex!”
When I suggest things don’t have to be this way, that there are some (admittedly a minority of) large organisations which handle this stuff effectively by taking a different approach, they recoil. They say:
“That kind of stuff might work for websites and startups, but we’re too big and complex.”
This reaction is puzzling at first glance. Do people really think that more rigorous change management practices are not relevant to larger, older organisations?
I suspect the real root of the rejection of agile practices in large organisations is a belief that traditional change management practices work. Or at least, that they would work if properly applied. It certainly sounds like it should work. Spend more time planning and checking changes, get more people to review and approve them, document everything more thoroughly, and evaluate each deviation from plan even more thoroughly, and then, surely, the outcome will be of higher quality.
Right?
But in practice, things almost never work out this way. Each handover to another person or group is an opportunity to introduce errors and deviation. More misunderstanding, more loss of context, and less ownership of the end result. Very few organisations using these practices actually achieve high levels of quality (hint: your company is not NASA); instead, things get done through heroics and workarounds to the process, with plenty of mistakes and disasters. (No, that’s not the organisation from my story.)
An effective IT organisation should be able to deliver a virtual machine in a matter of hours or minutes, rather than weeks. And it should be able to deliver this more reliably and consistently than a traditional process does, ensuring that the VM is fully compliant with corporate standards for security, quality, and conformity.
How could the organisation with the six week delivery time for VMs achieve this?
There is more to it than this, of course, including how to ensure that changes to the standard template are applied to existing VMs created from earlier templates. But this is a start.
Both of these are wrong.
There is no tradeoff between rapid delivery and reliable operations. New technologies such as cloud and infrastructure automation, plus agile approaches like DevOps and Continuous Delivery, allow changes to be made even more reliably, with far more rigorous control, than traditional approaches. There’s a useful parallel to Extreme Programming (XP), which takes quality assurance for software development to “the extreme”, with automated tests written in conjunction with the code they test and run repeatedly and continuously throughout the development process.
The same is true with modern IT infrastructure approaches. Automation and intensive collaboration allow the business to make small changes safely, thoroughly validate the safety of each change before rolling it out, immediately evaluate its impact once live, and rapidly decide and implement new changes in response to this information. The goal is to maximize both business value and operational quality.
The key is very tight loops involving very small changesets. When making a very small change, it’s easy to understand how to measure its impact on operational stability. The team can add monitoring metrics and alerts as appropriate for the change, deploy it to accurate testing environments for automated regression testing, and carry out whatever human testing and auditing can’t be automated. When this is done frequently, the work for each change is small, and the team becomes very efficient at carrying out the change process.
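For example, part of the automated validation of a small change might be no more than a quick smoke check run against the test environment (and again after rollout), alongside whatever monitoring alerts the team adds for the change. This is a hedged sketch; the health endpoint and the error-rate threshold are made-up examples.

```python
#!/usr/bin/env python3
"""Illustrative post-change smoke check; the endpoint and threshold are made up."""
import json
import sys
import urllib.request

HEALTH_URL = "http://app.test.internal/health"   # hypothetical health endpoint
MAX_ERROR_RATE = 0.01                            # hypothetical acceptable error rate


def main() -> int:
    with urllib.request.urlopen(HEALTH_URL, timeout=5) as response:
        health = json.load(response)

    # Assume the endpoint reports overall status and a recent error rate.
    if health.get("status") != "ok":
        print("FAIL: application reports status", health.get("status"))
        return 1
    if health.get("error_rate", 0.0) > MAX_ERROR_RATE:
        print("FAIL: error rate too high:", health["error_rate"])
        return 1

    print("smoke check passed")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```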
It’s good to validate changes before and after applying them to ensure they won’t cause operational problems. So, it must be even better to do this validation continuously, as each change is being worked out, rather than periodically (monthly or weekly).
It’s good to test that disaster recovery works correctly. So it must be even better to use disaster recovery techniques routinely as a part of normal processes for deploying changes, using Phoenix Servers or even Immutable Servers.
If it’s good to have a documented process that people should follow for making changes to systems, it must be even better to have the change process captured in a script (a sketch of such a script follows below). Unlike documentation, a script won’t fall out of date with the actual procedure, won’t skip steps, mistype, or leave out key steps that certain people “just know”.
If it’s good to be able to audit changes that are made to a system, it must be even better to know that each change is automatically logged and traceable.
If it’s useful to have handovers so that the people responsible for operations and support can review changes and make sure they understand them, it must be even better to have continuous collaboration. This ensures those people not only fully understand the changes, but have shaped the changes to meet their requirements.
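To make the scripting point above concrete, here is a minimal sketch of a change captured as a script rather than a checklist, with every step explicit. The config file, setting, and service name are hypothetical; the point is that the script, kept in version control, becomes the documented process.

```python
#!/usr/bin/env python3
"""Sketch of a change captured as a script rather than a checklist.

Every step a person might 'just know' is explicit: back up, change,
validate, restart, verify. The paths and service name are hypothetical.
"""
import shutil
import subprocess
import sys

CONFIG = "/etc/example-app/app.conf"   # hypothetical config file
BACKUP = CONFIG + ".bak"
SERVICE = "example-app"                # hypothetical service name


def run(*cmd: str) -> None:
    # Fail loudly on any error instead of silently skipping a step.
    subprocess.run(cmd, check=True)


def main() -> int:
    shutil.copy2(CONFIG, BACKUP)                        # step 1: back up current config
    with open(CONFIG, "a") as f:                        # step 2: apply the change
        f.write("\nmax_connections = 200\n")
    run("example-app", "--check-config", CONFIG)        # step 3: validate before restarting (hypothetical validator)
    run("systemctl", "restart", SERVICE)                # step 4: roll the change out
    run("systemctl", "is-active", "--quiet", SERVICE)   # step 5: verify the service came back
    print("change applied and verified")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```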
The main advantage of this approach is that, by avoiding changes to a running system’s configuration, you reduce the risks that changes bring. You make changes to a base image, and can then run it through a battery of tests to make sure it’s OK before using it to create running servers. This applies the principles behind Deployment Pipelines to infrastructure.
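Sketched very roughly, the flow is a pipeline over server images rather than over running servers. The commands below (build-image, boot-test-instance, tag-image) are placeholders for whatever image-building, testing, and image-registry tooling a team actually uses - they are not real CLIs.

```python
#!/usr/bin/env python3
"""Rough sketch of a deployment pipeline for server images rather than code.

The commands invoked here stand in for real tooling (image builders, test
frameworks, image registries); none of them are real CLIs.
"""
import subprocess


def bake_image(base_image: str, recipe: str) -> str:
    # Build a new image from a base plus configuration; returns the image id.
    out = subprocess.run(["build-image", "--base", base_image, "--apply", recipe],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()


def test_image(image_id: str) -> bool:
    # Boot a throwaway instance from the image and run the automated test suite.
    result = subprocess.run(["boot-test-instance", image_id, "--run-tests"])
    return result.returncode == 0


def promote_image(image_id: str, stage: str) -> None:
    # Mark the image as approved for the next pipeline stage; only promoted
    # images are ever used to create running servers.
    subprocess.run(["tag-image", image_id, "--stage", stage], check=True)


if __name__ == "__main__":
    image = bake_image("ubuntu-22.04-base", "webserver-recipe")
    if test_image(image):
        promote_image(image, "production-ready")
```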
Ben and Peter Gillard-Moss have been evangelizing this approach within ThoughtWorks with their use of it on the Mingle SaaS project. Netflix are arguably the pioneers of this approach, and have released some open source tools to help manage AMI images on AWS for this purpose.
I’m running into increasing numbers of folks in the DevOps community who see infrastructures managed through heavily automated, continuous synchronization as too complicated and fragile.
If the chef-server, puppet-master approach to configuration management is Cloud Computing 2.0, immutable servers are the next thing. Interestingly, at least one commentator has confused this next generation of infrastructure management with pre-cloud practices. My (now-former - sniff) colleague Nic Ferrier responded to this based on a conversation with (still!) colleague Jim Gumbley.
These are truly interesting times in the world of IT infrastructure. The way we do things now is quite different from the way we did them ten years ago (albeit probably not for the majority - as with much technology, the future is not evenly distributed), and certainly different from the way we’ll do things in ten more years. It’s a blast to be involved in the shift!
In order to get the team moving quickly, we’ve kicked this all off using what we’ve called a “tracer bullet” (or “trail marker”, for a less violent image). The idea is to get the simplest implementation of a pipeline in place, prioritizing a fully working skeleton that stretches across the full path to production over fully featured, final-design functionality for each stage of the pipeline.
Our goal is to get a “Hello World” application using our initial technology stack into a source code repository, and be able to push changes to it through the core stages of a pipeline into a placeholder production environment. This sets the stage for the design and implementation of the pipeline, infrastructure, and application itself to evolve in conjunction.
This tracer bullet approach is clearly useful in our situation, where the application and infrastructure are both new. But it’s also very useful when starting a new application with an existing IT organization and infrastructure, since it forces everyone to come together at the start of the project to work out the process and tooling for the path to production, rather than leaving it until the end.
The tracer bullet is more difficult when creating a pipeline from scratch for an existing application and infrastructure. In these situations, both application and infrastructure may need considerable work in order to automate deployment, configuration, and testing. Even here, though, it’s probably best to take each change made and apply it to the full length of the path to production, rather than wait until the end-all be-all system has been completely implemented.
When planning and implementing the tracer bullet, we tried to keep three goals in mind as the priority for the exercise.
Things can and should be made simple to start out with. Throughout the software development project changes are continuously pushed into production, multiple times every week, proving the process and identifying what needs to be added and improved. By the time the software is feature complete, there is little or no work needed to go live, other than DNS changes and publicizing the new software.
Don’t implement things that aren’t needed to get the simple, end to end pipeline in place. If you find yourself bogged down implementing some part of the tracer bullet pipeline, stop and ask yourself whether there’s something simpler you can do, coming back to that harder part once things are running. On my current project we may need a clever unattended provisioning system to frequently rebuild environments according to the PhoenixServer pattern. However, there are a number of issues around managing private keys, IP addresses, and DNS entries which make this a potential yak shave, so for our tracer bullet we’re just using the Chef knife-rackspace plugin.
The flip side of starting simply is not to take shortcuts which will cost you later. Each time you make a tradeoff in order to get the tracer bullet pipeline in place quickly, make sure it’s a positive tradeoff. Keep track of those tasks you’re leaving for later.
Examples of false tradeoffs are leaving out testing, basic security (e.g. leaving default vendor passwords in place), and repeatability of configuration and deployment. Often these are things which actually make your work quicker and more assured - without automated testing, every change you make may introduce problems that will cost you days to track down later on.
It’s also often the case that things which feel like they may be a lot of work are actually quite simple for a new project. For my current project, we could have manually created our pipeline environments, but decided to make sure every server can be torn down and rebuilt from scratch using Chef cookbooks. Since our environments are very simple - stock Ubuntu and a JDK install and we’re good to go - this was actually far easier than it would have been later on, once we’d got a more complicated platform in place.
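For illustration (the real project used Chef cookbooks, and the package names here are assumptions), the entire environment build at that stage amounted to little more than this:

```python
#!/usr/bin/env python3
"""Illustrative bootstrap for a deliberately simple environment: stock Ubuntu plus a JDK.

The real project used Chef cookbooks; this plain-script version just shows how
little there was to automate at this early stage. Package names are assumptions.
"""
import subprocess


def apt(*args: str) -> None:
    # Run apt-get non-interactively and fail loudly if anything goes wrong.
    subprocess.run(["apt-get", "-y", *args], check=True)


if __name__ == "__main__":
    apt("update")
    apt("install", "default-jdk")   # stock Ubuntu JDK metapackage
    apt("install", "git", "unzip")  # assumed build-time odds and ends
    print("base environment ready")
```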
Many organizations are in the habit of turning the selection of tools and technologies into complicated projects in their own right. This comes from a belief that once a tool is chosen, switching to something else will be very expensive. This is pretty clearly a self-fulfilling prophecy. Choose a reasonable set of tools to start with, ones that don’t create major barriers to getting the pipeline in place, and be ready to switch them out as you learn about how they work in the context of your project.
Put your tracer bullet in place fully expecting that the choices you make for its architecture, technology, design, and workflow will all change. This doesn’t just apply to the pipeline, but to the infrastructure and application as well. Whatever decisions you make up front will need to be evaluated once you’ve got working software that you can test and use. Taking the attitude that these early choices will change later lowers the stakes of making those decisions, which in turn makes changing them less fraught. It’s a virtuous circle that encourages learning and adaptation.
It’s tempting to make it easy to get pre-live releases into the production environment, waiting until launch is close to impose the tighter restrictions required for “real” use. This is a bad idea. The sooner the real-world constraints are in place, the quicker the issues those constraints cause will become visible. Once these issues are visible, you can implement the systems, processes, and tooling to deal with them, ensuring that you can routinely and easily release software that is secure, compliant, and stable.
Another thing often left until the end is bringing in the people who will be involved in releasing and supporting the software. This is a mistake. Even in siloed organizations, where development and support are done by separate groups, the support people have deep insight into the requirements for making the operation and use of the software reliable and cost effective.
Involving them from the start and throughout the development process is the most effective way to build supportability into the software. When release time comes, handover becomes trivial because the support team have been supporting the application through its development.
Bringing release and support teams in just before release means their requirements are introduced when the project is nearly finished, which forces a choice between delaying the release in order to fix the issues, and releasing software which is difficult and/or expensive to support.
The question of what to include in the tracer bullet and what to build in once the project is up and running depends on the needs of the project and the knowledge of the team. On my current project, we found it easy to get a repeatable server build in place with Chef configuration. But we did this with a number of shortcuts.
Starting out with a tracer bullet approach to our pipeline has paid off. A week after starting development we have been able to demonstrate working code to our stakeholders. This in turn has made it easier to consider user testing, and perhaps even a beta release, far sooner than had originally been considered feasible.
As Martin points out, people have different understandings of what quality means, but the definition that counts from a delivery point of view is the attributes that make the software easier to maintain and extend. Developers can work more quickly on code that is easy to understand and free from bugs.
So in practice, teams that prioritize speed over quality tend to achieve neither, while teams that prioritize quality, in many cases, deliver code very quickly.
However, this isn’t always the case. Some teams focus on quality, but end up taking forever to deliver simple things. What’s missing from the speed vs. quality tradeoff is a second axis, completeness versus simplicity.
Another word for completeness on this chart would be complexity, but this quadrant represents the aspirations of a team - what the team is trying to achieve - and no team aspires to complexity. Instead, teams try to design and implement software and systems which are complete.
A team that prioritizes completeness wants a system that can cope with anything. It can meet completely new requirements through configuration rather than code, easily scale to handle any load, and tolerate any conceivable or inconceivable failure.
The problem with this is partly described by the YAGNI principle. Most of what the team builds isn’t actually going to be needed. A large proportion of the stuff that will be needed in the future is stuff that the team didn’t anticipate. But the real killer is that adding all this stuff adds more moving parts. It’s more stuff to implement, more stuff to break, and then it’s more stuff to wade through when working on the codebase.
So the team sets out to build the perfect, well-engineered system, but over time the schedule comes under pressure, and the team realizes it needs to step up the pace. Elements of the design are dropped, leaving parts of the system that were already implemented unused, but still taking up space (and adding complexity) in the codebase.
There is a nearly inevitable slide into cutting corners in order to get things done, and before you know it, you’re trading off quality (“we’ll go back and clean it up later”) for speed. As we’ve seen, this leads to a quagmire of poor code quality which slows work down, made even worse because of an overcomplicated design and large amounts of unnecessary code.
What seems to unite high performing development teams is an obsessive focus on both quality and simplicity. Implement the simplest thing that will satisfy the actual, proven need, and implement it well. Make sure it’s clean, easy to understand, and correct. If something is wrong with it, fix it immediately.
There’s a line to tread here. I’ve seen some teams interpret this too strictly, and deliver software that works correctly and is simple, but is crappy in terms of user experience. The definition of quality software must include doing an excellent job of satisfying the user’s needs, while being ruthless about limiting the needs it tries to satisfy.
Teams that get this focus right are able to reliably deliver high quality software remarkably fast.
As Pat says, if you were to pick only one agile practice to adopt, retrospectives are it. They’re the engine a team uses to identify and address ways to improve performance, so regular retrospectives become the forum for working out which other practices would be helpful, how to adjust the way they’re being used, and which ones are getting in the way or are just unnecessary.
If you’ve tried retrospectives but not gotten as much out of them as the above bold claim suggests, Pat’s book could be for you. Everything in it is refreshingly practical and actionable for such a potentially hand-wavy, touchy-feely subject. It ranges from high level topics and techniques, through to dealing with common problems such as lack of action afterwards, to nuts and bolts details about the materials to use.
If you want a more detailed review of the book, check out our other colleague Mark Needham’s review. Then get the book itself!
And, yeah, check out the stuff our other colleagues have written as well. I may be too lazy to write them all up, but they’re quality stuff.
You can either sign up (if it’s before the day), or view the recorded webinar (if you’re reading this from the future), on the ThoughtWorks website.
A usefully simplistic view of the evolution of ideas about making software ready for release is this:
Going from traditional Agile development to Continuous Delivery is not about adopting a shorter cycle for making the software ready for release. Making releasable builds every night is still not Continuous Delivery. CD is about moving away from making the software ready as a separate activity, and instead developing in a way that means the software is always ready for release.
A common misunderstanding is that Continuous Delivery means releasing into production very frequently. This confusion is made worse by the use of organizations that release software multiple times every day as poster children for CD. Continuous Delivery doesn’t require frequent releases, it only requires ensuring software could be released with very little effort at any point during development. (See Jez Humble’s article on Continuous Delivery vs. Continuous Deployment.) Although developing this capability opens opportunities which may encourage the organization to release more often, many teams find more than enough benefit from CD practices to justify using them even when releases are fairly infrequent.
As I mentioned, there are sometimes conflicts between Continuous Delivery and practices that development teams take for granted as being “proper” Agile.
One of these points of friction is the requirement that the codebase not include incomplete stories or bugfixes at the end of the iteration. I explored this in my previous post on iterations. This requirement comes from the idea that the end of the iteration is the point where the team stops and does the extra work needed to prepare the software for release. But when a team adopts Continuous Delivery, there is no additional work needed to make the software releasable.
More to the point, the CD team ensures that their code could be released to production even when they have work in progress, using techniques such as feature toggles. This in turn means that the team can meet the requirement that they be ready for release at the end of the iteration even with unfinished stories.
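A feature toggle can be as simple as a conditional around unfinished functionality, driven by configuration rather than by code changes, so a build containing work in progress can still be released with that work switched off. This is a generic sketch, not tied to any particular toggle library; the toggle name and the checkout functions are made up.

```python
#!/usr/bin/env python3
"""Minimal feature toggle sketch: unfinished work ships dark, controlled by config."""
import json
import os


def load_toggles(path: str = "feature_toggles.json") -> dict:
    # Toggles live in configuration (a file here; often an environment or config
    # service), so the same build can be released with unfinished features off.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}


TOGGLES = load_toggles()


def checkout(cart):
    if TOGGLES.get("new_checkout_flow", False):
        return new_checkout_flow(cart)    # work in progress, off in production
    return existing_checkout_flow(cart)   # current, fully tested behaviour


def new_checkout_flow(cart):
    raise NotImplementedError("still being built behind the toggle")


def existing_checkout_flow(cart):
    return sum(item["price"] for item in cart)


if __name__ == "__main__":
    print(checkout([{"price": 10}, {"price": 5}]))
```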
Releasing with unfinished stories in the codebase can be a bit difficult for people to swallow. The team can certainly still require all work to be complete at the iteration boundary, but this starts to feel like an arbitrary constraint that breaks the team’s flow. Continuous Delivery doesn’t require non-timeboxed iterations, but the two practices are complementary.
Many development teams divide software builds into two types, “snapshot” builds and “release” builds. This is not specific to Agile, but has become strongly embedded in the Java world due to the rise of Maven, which puts the snapshot/release concept at the core of its design. This approach divides the development cycle into two phases, with snapshots being used while software is in development, and a release build being created only when the software is deemed ready for release.
This division of the release cycle clearly conflicts with the Continuous Delivery philosophy that software should always be ready for release. The way CD is typically implemented involves only creating a build once, and then promoting it through multiple stages of a pipeline for testing and validation activities, which doesn’t work if software is built in two different ways as with Maven.
It’s entirely possible to use Maven with Continuous Delivery, for example by creating a release build for every build in the pipeline. However, this leads to friction with Maven tools and infrastructure that assume release builds are infrequent and intended for production deployment. For example, artefact repositories such as Nexus and Artifactory have housekeeping features to delete old snapshot builds, but don’t allow release builds to be deleted. So an active CD team, which may produce dozens of builds a day, can easily chew through gigabytes or even terabytes of disk space on the repository.
A standard practice with Continuous Delivery is automatically deploying every build that passes basic Continuous Integration to an environment that emulates production as closely as possible, using the same deployment process and tooling. This is essential to proving whether the code is ready for release on every commit, but this is more rigorous than many development teams are used to having in their CI.
For example, pre-CD Continuous Integration might run automated functional tests against the application by deploying it to an embedded application server using a build tool like Ant or Maven. This is easier for developers to use and maintain, but is probably not how the application will be deployed in production.
So a CD team will typically add an automated deployment to an environment which more fully replicates production, including separated web/app/data tiers, and deployment tooling that will be used in production. However, this more production-like deployment stage is more likely to fail due to its added complexity, and may be more difficult for developers to maintain and fix, since it uses tooling more familiar to system administrators than to developers.
This can be an opportunity to work more closely with the operations team to create a more reliable, easily supported deployment process. But it is likely to be a steep curve to implement and stabilize this process, which may impact development productivity.
Given these friction points, what makes moving from traditional Agile to Continuous Delivery worthwhile, especially for a team that is unlikely to actually release into production more often than every iteration?
The friction points I’ve described seem to come up fairly often when Continuous Delivery is being introduced. My hope is that understanding the source of this friction will be helpful in discussing it when it comes up, and working through the issues. If developers who are initially uncomfortable breaking with the “proper” way of doing things, or who find a CD pipeline overly complex or difficult, understand the aims and value of these practices, hopefully they will be more open to giving them a chance. Once these practices become embedded and mature in an organization, team members often find it’s difficult to go back to the old ways of doing things.
Edit: I’ve rephrased the definition of the “traditional agile” approach to making software ready for release. This definition is not meant to apply to all agile practices, but rather applies to what seems to me to be a fairly mainstream belief that agile means stopping work to make the software releasable.
The orthodox approach to the iteration is to treat it as a timebox for delivering a batch of stories, which is the approach most Scrum teams take with sprints (the Scrum term for an iteration). In recent years many teams have scrapped this approach, either using iterations more as a checkpoint, as many ThoughtWorks teams do, or scrapping them entirely with Kanban and Lean software development.
For the purpose of this post, I will refer to these two general approaches to running iterations as the “orthodox” or “timeboxed-batch” iteration model on the one hand, and the “continuous development” model on the other hand. Although orthodox iterations have value, certainly over more old-school waterfall project management approaches, continuous development approaches which do away with timeboxing and avoid managing stories in batches allow teams to more effectively deliver higher quality software.
Orthodox iteration model (or “timeboxed-batch” model): each iteration works on a fixed batch of stories, all of which must be started and finished within a single iteration.
Continuous development model: stories are developed in a continuous flow, avoiding the need to stop development in order to consolidate a build containing only fully complete stories.
In the classically run iteration or sprint, the Product Owner (PO) and team choose a set of stories that they commit to deliver at the end of the iteration. All of the stories in this batch should be sufficiently prepared before the iteration begins. The level and type of preparation varies between teams, but usually includes some analysis, including the definition of acceptance criteria. This analysis should have been reviewed by the PO and developers to ensure there is a common understanding of the story. The PO should understand what they can expect to have when the story is implemented, and the technical team should have enough of an understanding of the story to estimate it and identify potential risks.
The iteration begins with an iteration kickoff meeting (IKO) where the team reviews the stories and confirms their confidence that they can deliver the stories within the iteration. The developers then choose stories to work on, discussing each story with the PO, Business Analyst (BA), and/or Quality Analyst (QA) as appropriate, then breaking it down into implementation tasks. Implementation takes place with continual reviews, or even pairing with these other non-developers, helping to keep implementation on track, and minimizing the amount of rework needed when the story is provisionally finished and goes into testing.
The QA and BA/PO then test and review each story as its implementation is completed. This is in addition to the automated testing which has been written and run repeatedly following TDD and CI practices. Only once the story is signed off do the developers move on to another of the stories in the iteration’s committed batch.
As the end of the iteration approaches, developers and QAs should be wrapping up the last stories and preparing a releasable build for the showcase, which is typically held on the final day of the iteration. In the showcase, the team demonstrates the functionality of the completed stories to the PO and other stakeholders, and the stories are signed off. The team holds a retrospective to consider how they can work better, then on the next working day they hold the IKO to start the following iteration.
When the iteration ends the team has a complete, fully tested and releasable build of the application, regardless of whether the software actually will go into production at this point.
The start and end dates of the iteration are firmly fixed. If there are stories (or defect fixes) which aren’t quite ready at the end of the iteration, the iteration end date is never slipped. Instead, the story is not counted as completed, so must be carried over to the next iteration.
This style of iteration offers many benefits over traditional waterfall methodologies. A short, rigid cycle for producing completely tested and releasable code forces discipline on the team, keeping the code in a near-releasable state throughout the project, and avoiding the temptation to leave work (e.g. testing) for “later”, building up unmanageable burdens of work, stress, and defects to be dealt with under the pressure of the final release phase.
The timeboxed iteration also forces the team to learn how to define stories of a manageable size. If stories are routinely too big to complete in one iteration, this is a clear sign that the team needs to improve the way it defines and prepares stories.
This demonstrates another benefit of the iteration, which is frequent feedback. By getting feedback quickly, the team is able to evaluate not only the quality of their code and its relevance to the business, but also how effectively they are working, and to try out ideas for improving continually throughout the project.
The timeboxed-batch approach to iterations has value, particularly for teams inexperienced with agile. However, it has fundamental problems. At its core, this approach is waterfall writ small, with many of the same flaws, albeit with a small enough cycle that issues can be dealt with more quickly than with a full waterfall project.
To understand why this is so, let’s flesh out the idealized anatomy of the iteration from above with some of the things which often happen in practice.
At the end of the day, the orthodox iteration suffers from two problems which are inherent in its very definition: it organizes work into batches, and it enforces a timebox.
Batching work is the antithesis of flow. The Lean approach to working aims to maximize the flow of work for the members of a team, which in software development translates to getting stories flowing easily through creation, analysis, implementation, validation, and release. When a developer finishes one story and it is signed off, she should have another story ready for her to pick up and start on. This shouldn’t need to wait on an arbitrary ceremony, and certainly shouldn’t have to wait for everyone else on the team to finish their stories and get them all signed off.
The batching focus of orthodox iterations doesn’t only cause developers to block, it also turns BAs, QAs, and the PO into bottlenecks. As described above, the start and end of the iteration each put a full working set of stories in the same state, all needing the same activity carried out on them at once.
Imagine an assembly line which starts up to assemble twenty cars, then stops while they are all inspected at once. Only once all of the cars are inspected and their defects fixed does the line start up again to begin assembling another twenty cars.
Timeboxing is also a source of problems for iterations. The main problem is that the arbitrary deadline creates pressure to get stories “over the line” so they can count towards the velocity for the iteration. Unless management is enlightened (or uninterested) enough to avoid focusing on fluctuations of velocity from iteration to iteration (and even the most enlightened managers I’ve worked with do get worked up over velocity), this leads to the temptation to rush and cut corners, or to play games with stories.
Rushing obviously endangers the quality of the code, which almost certainly leads to delays down the line when the defects surface. Playing games, such as closing unfinished stories and opening defects to complete the work, or counting some points towards an unfinished story, undermines the team’s ability to measure and predict its work honestly. These bad habits will catch up one way or another.
Expecting code to be complete at the end of the iteration, fully tested, fixed, and ready for deployment, is unrealistic unless the iteration is structured with significant padding at the end. This padding must come after all reviews, including the stakeholder showcase, to allow time to make corrections, unless those reviews are mere rubber stamp sessions, with no genuine feedback permitted. This then means the team will be underutilized during the padding time. Otherwise, if there is so much rework done during this period that the entire team is fully engaged, then the risk of introducing new defects is too high to be confident in stable code by the end.
The alternative is to break the strict timeboxed-batched iteration model by interleaving work on the next iteration with the cleanup work from the previous iteration. This turns out to not be such a bad idea, and leads to evolving away from the timeboxed-batch iteration model towards the continuous development model.
The continuous development model may be purely iteration-less, e.g. Kanban, or it may still retain the iteration as a period for measuring progress and for scheduling activities such as showcases. Once development is underway stories are prepared, developed, and tested using a “pull” approach, being worked on as team members become available, so that stories are constantly flowing, and everyone is constantly working on the highest value work available at the moment. This requires some different approaches to managing work flow than are used with other approaches. For more information, look into Kanban and Lean software development.
Since joining a year or so ago I’ve found that although no two ThoughtWorks projects run in exactly the same way, there is a strong tendency to work in a way which looks a lot like Kanban, while retaining a one or two week iteration. Iterations are used to report progress (including velocity), and to schedule showcases and other regular meetings, but stories are not moved through the process in batches. Teams don’t start and stop work as a whole, other than at the start and end of a release. If the showcase is two days away, nothing stops a developer pair from starting on a new story knowing full well it will be incomplete when the codebase is demoed to the stakeholders, and possibly even deployed to later stage environments.
Although we do make projections and aim to have certain stories done by the next showcase, the team doesn’t promise to deliver a specific batch of stories. If it makes sense, stories can be dropped, added, or swapped as needed. This gives the business more flexibility to adapt their vision of the software as it is developed. It also reduces the pressure to mark a given story as “done” by a hard deadline, since there is no disruption from letting work carry on over the end of an iteration.
I’ve seen a Scrum team become ornery and rebellious when a PO made a habit of asking to swap stories after a sprint had started, even though work hadn’t been started on the particular stories involved. This was made worse because bugfixes were scheduled into sprints alongside stories, meaning that any serious defect found in production completely disrupted the team. Another factor that aggravated the situation was that the stories for each sprint were agreed before the end of the previous iteration. So if the showcase raised ideas for improvements to the functionality completed in iteration N, new stories could only be started in iteration N + 2 at the soonest. This hardly created a situation where the PO or the business felt the development team was responsive to business needs.
Also see Oren Teich’s post Go slow to go fast, which points out the problems with deadlines, and that iterations are simply a shorter deadline.
There are certainly challenges in moving to continuous development over the timeboxed-batch model. There is more risk of stories dragging on across multiple iterations. This can be mitigated by monitoring cycle time and keeping things visible, so that the team can discuss the issue and make changes to their processes if it becomes a problem.
For teams which are new to agile and still struggle to create appropriately sized stories, the timeboxed model may be more helpful to build the discipline and experience needed before being able to move to a continuous model. However, for experienced teams, timeboxing and batching stories simply has too many negative effects.
Continuous development, with a looser approach to iterations, maximizes the productivity of the team, avoids pitfalls that put quality at risk, and offers the business and the team more flexibility.
An automated server provisioning process, as I’ve advocated, helps ensure machines are consistent when they are created, but during a given machine’s lifetime it will drift from the baseline, and from the other machines.
There are two main methods to combat configuration drift. One is to use automated configuration tools such as Puppet or Chef, and run them frequently and repeatedly to keep machines in line. The other is to rebuild machine instances frequently, so that they don’t have much time to drift from the baseline.
The challenge with automated configuration tools is that they only manage a subset of a machine’s state. Writing and maintaining manifests/recipes/scripts is time consuming, so most teams tend to focus their efforts on automating the most important areas of the system, leaving fairly large gaps.
There are diminishing returns for trying to close these gaps, where you end up spending inordinate amounts of effort to nail down parts of the system that don’t change very often, and don’t matter very much day to day.
On the other hand, if you rebuild machines frequently enough, you don’t need to worry about running configuration updates after provisioning happens. However, this may increase the burden of fairly trivial changes, such as tweaking a web server configuration.
In practice, most infrastructures are probably best off using a combination of these methods. Use automated configuration, continuously updated, for the areas of machine configuration where it gives the most benefit, and also ensure that machines are rebuilt frequently.
The frequency of rebuilds will vary depending on the nature of the services provided and the infrastructure implementation, and may even vary for different types of machines. For example, machines that provide network services such as DNS may be rebuilt weekly, while those which handle batch processing tasks may be rebuilt on demand.
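To illustrate what drift from the baseline means in practice, here’s a crude sketch of a drift check, comparing a few managed files on a running machine against the version-controlled copies. The baseline path and file list are hypothetical examples; real configuration tools detect (and correct) drift far more thoroughly.

```python
#!/usr/bin/env python3
"""Crude configuration drift check: compare managed files against a version-controlled baseline.

The baseline directory and the list of managed files are hypothetical examples.
"""
import filecmp
import os

BASELINE_DIR = "/srv/config-baseline"   # checkout of the version-controlled config
MANAGED_FILES = [
    "/etc/ssh/sshd_config",
    "/etc/ntp.conf",
]


def main() -> None:
    drifted = []
    for path in MANAGED_FILES:
        baseline = os.path.join(BASELINE_DIR, path.lstrip("/"))
        if not os.path.exists(path) or not os.path.exists(baseline):
            drifted.append(path)            # missing on either side counts as drift
        elif not filecmp.cmp(path, baseline, shallow=False):
            drifted.append(path)            # contents differ from the baseline

    if drifted:
        print("configuration drift detected in:")
        for path in drifted:
            print("  ", path)
    else:
        print("no drift detected in managed files")


if __name__ == "__main__":
    main()
```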
The first step in achieving this is making sure server instances are built using an automated process. This ensures every server is built the same way, that improvements can be easily folded into the server build process, and that it is a simple matter to spin up new instances and to scrap and replace broken ones. Automating this process also means your team of highly skilled, well-paid professionals don’t need to spend large amounts of their time on the brainless rote-work of menu-clicking through OS installation work.
I first used automated installation by PXE-booting physical rack servers in 2002, following the advice I found on the then-current infrastructures.org site, and in later years applied the same concepts with virtualized servers and then IaaS cloud instances.
I think of this as the machine lifecycle (which I tend to call the ‘server lifecycle’ because that’s what I normally work with, although it’s just as applicable to desktops). This involves a number of activities required to set up and manage a single machine instance, such as partitioning storage, installing software, and applying configuration.
These activities are applied during one or more phases of the machine lifecycle. There are three phases: “Package Image”, “Provision Instance”, and “Update Instance”. There are a number of different strategies for deciding which activities to do in each phase.
The various activities may be applied during one or more phases, depending on the strategy used to manage the machine’s lifecycle. Some strategies carry out more activities during the packaging phase, for instance, while other approaches might have a simpler packaging phase but do more in the provisioning and/or updating phase.
In the image packaging phase, some or all elements of a machine instance are pre-packaged into a machine image in a way that can be reused to create multiple running instances.
This could be as simple as using a bare OS installation CD or ISO from the vendor. Alternatively, it could be a snapshot of a fully installed, fully configured runnable system, such as a Symantec Ghost image, VMware template, or EC2 AMI. Either way, these images are maintained in a Machine Image Library for use in the instance provisioning phase.
Different machine lifecycle strategies use different approaches to image packaging. ManualInstanceBuilding and ScriptedInstanceBuilding both tend to use stock OS images, which involves less up-front work and maintenance of the Machine Image Library, since the images come straight from the vendor. However, work is still needed to create, test, and maintain the checklists or scripts used to configure instances when provisioning.
On the other hand, CloningExistingMachineInstances and TemplatedMachineInstances both create pre-configured server images, which need only minor changes (e.g. hostnames and IP addresses) to provision new instances. This is appealing because less work is done to provision a new instance, but the drawback is that creating and updating images takes more work. Admins tend to make updates and fixes to running instances which may not make it into the templates, which contributes to ConfigurationDrift, especially if changes are made ad-hoc.
CloningExistingMachineInstances, which usually takes the shape of copying an existing server to create new ones as needed, tends to make ConfigurationDrift worse, as new servers inherit the runtime cruft and debris (log files, temporary files, etc.) of their parents, and it is difficult to bring various servers into line with a single, consistent configuration. TemplatedMachineInstances are a better way to keep an infrastructure consistent and easily managed.
The tradeoff between scripted installs and packaged images depends partly on the tools used for scripting and/or packaging, which in turn often depend on the hosting platform. Amazon AWS requires the use of templates (AMIs), for example. In either case, exploiting automation more fully in the provisioning phase favours keeping the packaging phase as lightweight as possible.
In the provisioning phase, a machine instance is created from an image and prepared for operational use.
Examples of activities in this phase include instantiating a VM or cloud instance, preparing storage (partitioning disks, etc.), installing the OS, installing relevant software packages and system updates, and configuring the system and applications for use.
There are two main strategies for deciding which activities belong in the packaging versus the provisioning phases. One is RoleSpecificTemplates, and the other is GenericTemplate.
With RoleSpecificTemplates, the machine image library includes images that have been pre-packaged for specific roles, such as web server, application server, mail server, etc. These have the necessary software and configuration created in the packaging phase, so that provisioning is a simple matter of booting a new instance and tweaking a few configuration options. There are two drawbacks of this approach. Firstly, you will have more images to maintain, which creates more work. When the OS used for multiple roles is updated, for example, the images for all of those roles must be repackaged. Secondly, this pattern gives you less flexibility, since you can’t easily provision an instance that combines multiple roles, unless you create - and then maintain - images for every combination of roles that you might need.
With the GenericTemplate pattern, each image is kept generic, including only the software and configuration that is common to all roles. The role for each machine instance is assigned during the provisioning phase, and software and configuration are applied accordingly at that point. The goal is to minimise the number of images in the machine image library, to reduce the work needed to maintain them. Typically, a separate template is needed for each hardware and OS combination that can’t be supported from a single OS install. The JeOS (Just Enough Operating System) concept takes this to the extreme, making the base template as small as possible.
The GenericTemplate pattern does require more robust automated configuration during provisioning, and may mean provisioning an instance takes longer than using more fully-built images, since more packages will need to be installed at provisioning time.
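As a sketch of the GenericTemplate idea, the role-specific work shifts into the provisioning step: one stock image, with the role applied when the instance is created. The role names and package lists below are made-up examples, and in practice a configuration management tool would do this work rather than a hand-rolled script.

```python
#!/usr/bin/env python3
"""Sketch of the GenericTemplate idea: one stock image, role applied at provision time.

Role names and package lists are made-up examples; in practice a configuration
management tool would apply the role during provisioning.
"""
import subprocess
import sys

# What distinguishes each role from the generic base image.
ROLES = {
    "webserver": ["nginx"],
    "appserver": ["default-jdk"],
    "mailserver": ["postfix"],
}


def provision(role: str) -> None:
    packages = ROLES[role]
    # The generic image already has the common baseline; only role-specific
    # packages and configuration are applied when the instance is created.
    subprocess.run(["apt-get", "-y", "install", *packages], check=True)
    print(f"instance provisioned as {role} with {packages}")


if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else "webserver"
    if role not in ROLES:
        sys.exit(f"unknown role: {role}")
    provision(role)
```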
Once a machine instance is running and in use, it is continuously updated. This includes activities such as applying system and software updates, new configuration settings, user accounts, etc.
Many teams carry out these updates manually; however, it requires a high level of discipline and organization to maintain systems this way, especially as the number of systems grows. The number of machines a team can manage is closely tied to the size of the team, so the ratio of servers to sysadmins stays low. In practice, teams using manual updates tend to be reactive, updating machines opportunistically when carrying out other tasks, or in order to fix problems that crop up. This leads to ConfigurationDrift, with machines becoming increasingly inconsistent with one another, creating various problems including unreliable operation (software that works on one machine but not another), and extra work to troubleshoot and maintain.
I’ve met many smart and skilled systems administrators in this situation. These folks know automation can make their life easier, but they can’t afford to take time away from turning cranks, greasing wheels, and unjamming the gears to keep their infrastructure puffing along in order to focus on improving their situation.
I’m convinced this is largely due to habit. Even though these teams understand that automation would be useful to them, when the pressure is on (and the pressure is always on), they roll up their sleeves, ssh into the servers and knock them into shape, because that’s the fastest way to get stuff done. Manual infrastructure management is what they’re used to. I find that most of these teams haven’t had personal experience of well-automated infrastructures, and don’t tend to believe it’s something they can realistically implement for their own operations.
Sysadmins who have worked in teams with mature, comprehensive automation, on the other hand, can’t go back. Sure, they might log into a box to diagnose and fix something that needs fixing right now, but they can’t relax until they’ve baked the fix into their automated configuration, and made sure that their monitoring will alert them ahead of time if the problem happens again.
Breaking out of manual infrastructure management and setting up an effective automation regime is difficult. Although there are loads of tools out there to make it work, it helps to understand good strategies for implementing them. I recommend looking over the material on the infrastructures.org site. It hasn’t been updated in a few years, so doesn’t take much of the advances since then into account, including virtualization, cloud, and newer tools like Chef and Puppet, but there is still rich material there.
Another must-read, which is more up to date, is Web Operations by John Allspaw, Jesse Robbins, and a bunch of other smart peeps.
I’m also planning to share a few of the practices I’ve seen and used for automation in upcoming posts.
]]>If you believe that SLA’s ‘formalise waste’ this way how would you approach my situation where communications are beyond poor (atrocious) and the org structure is silo’d and no one is accountable for their work?
Kenfin’s example illustrates my point quite well - the organization’s structure is an obstacle to effective delivery. Since he’s not in a position to fix this problem, he’s turned to SLAs as a way to manage it. They won’t make the issues go away, but they may give him a handle on them, and importantly, make them more predictable.
But it’s a fair question: what can someone in Kenfin’s shoes do in the face of an IT organization which is inherently not aligned to providing the services he needs to deliver software to his users effectively?
A common strategy, and one that I’ve helped teams inside these kinds of organizations pursue, is to completely bypass the existing IT organization. The goal is to put control of everything that the product team needs in order to deliver into its hands, rather than leaving it at the mercy of a group (or multiple groups) who have other priorities.
One way to do this is outsourcing: finding another company that specializes in the functions that the IT group would provide, whether this means development, integration, hosting, or something else. This works best if the project is not seen as core to the business, since that avoids fears about entrusting sensitive data or business-critical functions to outsiders. It also helps if the project needs skills that can’t be found in-house.
My friends at Cognifide have built their business on this, building technically complex content-focused websites for corporate clients, delivering far more quickly, and with greater expertise, than most corporate IT organizations can manage. This is also the premise that Software as a Service (SaaS) is based on. By choosing SalesForce for CRM, a company completely bypasses the massive IT project that would be required to implement an off-the-shelf, self-hosted CRM package (integration with other applications aside).
There are pitfalls to outsourcing to bypass IT. Many outsourcers are no more responsive than an in-house IT department, using SLAs and change control processes to make their workload, risks, and profitability more manageable.
The strategy I’ve most often been involved with myself (although I didn’t really think of it this way at the time) is product departments building their own IT capabilities. Again, this is about having control of the services and resources the group needs in order to deliver to its own customers.
The typical pattern is an “online” (or often, “digital”) department of a company where online was originally on the fringes of the main business, but has in recent years grown into a major channel for sales, customer service, or even delivery of products (for example in publishing).
The online team leverages their growing importance, as well as the specialized needs they have compared with typical corporate IT customers, to get approval from top management to create their own “digital operations team” or similar. This team may outsource elements of infrastructure, such as hosting (with IaaS cloud providers as an increasingly appealing option), but they are able to respond immediately to the needs of the online product group, because a) they don’t have to juggle requests from other departments and teams, and b) they report directly to the manager of that group.
Those strategies are not feasible for every team. I’ve certainly had to support projects where we had no alternative but to struggle along with unresponsive IT. In these cases, SLAs may well have to do, even though they represent waste and inefficiency.
There are a few other things you might at least try in these cases. Your goal is still to have the resources you need to get things done at your disposal, as far as possible. So identify those services which are especially critical, particularly those which are likely to change frequently, and see if you can get some dedicated resource assigned to your project. You want someone who will sit with your team, be incentivized by the success of your project, and who has the skills, authority, and system privileges to carry out the tasks you need.
If full time secondment of people to your team is not quite feasible due to budget, lack of available resource, etc., see if you can at least get commitments of time from the right people. Can someone come to daily standups? Weekly meetings? Regular release management meetings? Ask for as much as possible to start with, then see what you can get.
Also, maybe you can hire someone into your own team with qualifications and background that will help them effectively liaise with difficult IT teams. Your own DBA, security consultant, etc. can engage with the IT groups using their own language, couching things in terms that address their concerns. They may be able to take certain tasks off the IT group’s plates, which ends up giving you the ability to get things done more quickly, while at the same time making IT grateful that their workload is lighter.
These are all ways to work around the core problems. The best solution is of course for the organization to restructure itself in a way that aligns its resources with its goals. Most companies, especially large ones, insist on organizing themselves in ways that are self-defeating. It’s a shame that many people who work in large companies accept this as normal, often even as desirable.
Grouping everyone with a given function into a single group forces them to focus on juggling the competing needs of many stakeholders and managing their own risks (especially the risk of getting blamed when projects fail). They will inevitably favor the abstract principles of their own technical practices over what is most effective in making the business succeed. Much better to group people into units that have complete ownership of delivering business value, and find ways to connect staff of a given function with each other so they can develop their skills and working practices.
Unfortunately most of us are rarely in a position to influence this, so I hope that my suggestions will be helpful to some people in making things a little less painful.
]]>Download Tomcat 4.1 or 5.5, and unzip it into an appropriate directory. I usually put it in /usr/local, so it ends up in a directory called /usr/local/apache-tomcat-5.5.17 (5.5.17 being the current version as of this writing), and make a symlink named /usr/local/tomcat to that directory. When later versions come out, I can unzip them and relink, leaving the older version in case things don’t work out (which rarely if ever happens, but I’m paranoid).
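For reference, that install-and-symlink routine looks roughly like this; the archive name will vary with the version you actually download, and use unzip instead of tar if you grabbed the zip distribution.

cd /usr/local
# Unpack the Tomcat distribution (filename varies with the version downloaded)
tar xzf /path/to/apache-tomcat-5.5.17.tar.gz
# Point a stable symlink at the current version; re-point it when upgrading
ln -s apache-tomcat-5.5.17 tomcat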
For each instance of Tomcat you’re going to run, you’ll need a directory that will be CATALINA_BASE. For example, you might make them /var/tomcat/serverA and /var/tomcat/serverB.
In each of these directories you need the following subdirectories: conf, logs, temp, webapps, and work.
Put a server.xml and web.xml file in the conf directory. You can copy these from the conf directory of the Tomcat installation, although of course you should tighten up your server.xml a bit.
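Putting these last two steps together, setting up one instance’s CATALINA_BASE might look like this, using the example paths above:

# Create the per-instance directory tree for CATALINA_BASE
mkdir -p /var/tomcat/serverA/conf /var/tomcat/serverA/logs /var/tomcat/serverA/temp \
         /var/tomcat/serverA/webapps /var/tomcat/serverA/work
# Seed the instance's conf directory from the installation's defaults, then
# edit server.xml (ports and tightened-up settings) for this instance
cp /usr/local/tomcat/conf/server.xml /usr/local/tomcat/conf/web.xml /var/tomcat/serverA/conf/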
The webapps directory is where you’ll put the web applications you want to run on the particular instance of Tomcat.
I like to have the Tomcat manager webapp installed on each instance, so I can play with the webapps, and see how many active sessions there are. See my instructions for configuring the Tomcat manager webapp.
Tomcat listens to at least two network ports, one for the shutdown command, and one or more for accepting requests. Two instances of Tomcat can’t listen to the same port number on the same IP address, so you will need to edit your server.xml files to change the ports they listen to.
The first port to look at is the shutdown port. This is used by the command line shutdown script (actually, by the Java code it runs) to tell the Tomcat instance to shut itself down. This port is defined at the top of the server.xml file for the instance.
<Server port="8001" shutdown="_SHUTDOWN_COMMAND_" debug="0">
Make sure each instance uses a different port value. The port value will normally need to be higher than 1024, and shouldn’t conflict with any other network service running on the same system. The shutdown string is the value that is sent to shut the server down. Note that Tomcat won’t accept shutdown commands that come from other machines.
Unlike the other ports Tomcat listens to, the shutdown port can’t be configured to listen on a different IP address. It always listens on 127.0.0.1.
The other ports Tomcat listens to are configured with the <Connector> elements, for instance the HTTP or JK listeners. The port attribute configures which port to listen to. Setting this to a different value on the different Tomcat instances on a machine will avoid conflict.
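For example, the HTTP Connector in each instance’s server.xml could be given its own port. These are trimmed-down examples; your real Connector elements will have other attributes set as well, and 8080 and 8081 are just example values.

<!-- serverA's conf/server.xml -->
<Connector port="8080" />
<!-- serverB's conf/server.xml -->
<Connector port="8081" />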
Of course, you’ll need to configure whatever connects to that Connector to use the different port. If a web server is used as the front end using mod_jk, mod_proxy, or the like, then this is simple enough - change your web server’s configuration.
In some cases you may not want to do this; for instance, you may not want to use a port other than 8080 for HTTP connectors. If you want all of your Tomcat instances to use the same port number, you’ll need to use different IP addresses. The server system must be configured with multiple IP addresses, and the address attribute of the <Connector> element for each Tomcat instance will be set to the appropriate IP address.
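In that case the Connector elements might look something like this, with the IP addresses obviously being placeholders for addresses configured on your own server:

<!-- serverA's conf/server.xml: port 8080, bound to its own IP -->
<Connector port="8080" address="192.168.1.10" />
<!-- serverB's conf/server.xml: same port, different IP -->
<Connector port="8080" address="192.168.1.11" />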
Startup scripts are a whole other topic, but here’s the brief rundown. The main difference from running a single Tomcat instance is that you need to set CATALINA_BASE to the directory you set up for the particular instance you want to start (or stop). Here’s a typical startup routine:
JAVA_HOME=/usr/java
JAVA_OPTS="-Xmx800m -Xms800m"
CATALINA_HOME=/usr/local/tomcat
CATALINA_BASE=/var/tomcat/serverA
export JAVA_HOME JAVA_OPTS CATALINA_HOME CATALINA_BASE
$CATALINA_HOME/bin/catalina.sh start
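Stopping an instance is the same routine with catalina.sh stop at the end. To avoid retyping all of this, you might wrap it in a small per-instance script along these lines; the script name and location are just examples.

#!/bin/sh
# Example wrapper, e.g. /usr/local/bin/tomcat-serverA
# Usage: tomcat-serverA start|stop
JAVA_HOME=/usr/java
JAVA_OPTS="-Xmx800m -Xms800m"
CATALINA_HOME=/usr/local/tomcat
CATALINA_BASE=/var/tomcat/serverA
export JAVA_HOME JAVA_OPTS CATALINA_HOME CATALINA_BASE
exec $CATALINA_HOME/bin/catalina.sh "$@"
]]>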