
infrastructure

Configuration Drift

In my previous article on the server lifecycle I mentioned ConfigurationDrift, a term that I've either coined, or I've forgotten where I originally heard, although most likely I got it from the Puppet Labs folks.

Configuration Drift is the phenomenon where running servers in an infrastructure become more and more different as time goes on, due to manual ad-hoc changes and updates, and general entropy.

A nice automated server provisioning process as I've advocated helps ensure machines are consistent when they are created, but during a given machine's lifetime it will drift from the baseline, and from the other machines.

There are two main methods to combat configuration drift. One is to use automated configuration tools such as Puppet or Chef, and run them frequently and repeatedly to keep machines in line. The other is to rebuild machine instances frequently, so that they don't have much time to drift from the baseline.
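To make the idea of drift concrete, here is a minimal sketch in Python that compares one machine's reported settings against a baseline. The setting names, the host data, and the `find_drift` function are all invented for illustration; a real check would pull this data from whatever inventory or configuration tool is in use.

```python
from typing import Dict, Optional, Tuple


def find_drift(
    baseline: Dict[str, str], actual: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return every setting whose value differs from the baseline.

    Each entry maps a setting name to (expected, found).
    A value of None means the setting is absent on that side.
    """
    drift = {}
    for key in sorted(set(baseline) | set(actual)):
        expected = baseline.get(key)
        found = actual.get(key)
        if expected != found:
            drift[key] = (expected, found)
    return drift


# Hypothetical data: one ad-hoc change and one extra package on web01.
baseline = {"ntp_server": "ntp.example.com", "sshd_permit_root": "no"}
web01 = {
    "ntp_server": "ntp.example.com",
    "sshd_permit_root": "yes",
    "extra_pkg": "strace",
}

for name, (expected, found) in find_drift(baseline, web01).items():
    print(f"{name}: expected {expected!r}, found {found!r}")
```

Note that a comparison like this only sees the settings it is told about, which is the same limitation that automated configuration tools have.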

The challenge with automated configuration tools is that they only manage a subset of a machine's state. Writing and maintaining manifests/recipes/scripts is time consuming, so most teams tend to focus their efforts on automating the most important areas of the system, leaving fairly large gaps.

There are diminishing returns for trying to close these gaps, where you end up spending inordinate amounts of effort to nail down parts of the system that don't change very often, and don't matter very much day to day.

On the other hand, if you rebuild machines frequently enough, you don't need to worry about running configuration updates after provisioning happens. However, this may increase the burden of fairly trivial changes, such as tweaking a web server configuration.

In practice, most infrastructures are probably best off using a combination of these methods. Use automated configuration, continuously updated, for the areas of machine configuration where it gives the most benefit, and also ensure that machines are rebuilt frequently.

The frequency of rebuilds will vary depending on the nature of the services provided and the infrastructure implementation, and may even vary for different types of machines. For example, machines that provide network services such as DNS may be rebuilt weekly, while those which handle batch processing tasks may be rebuilt on demand.
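A rebuild policy like this can be written down as data. The sketch below is only an illustration: the `MAX_AGE` table and the `rebuild_due` helper are hypothetical names, the weekly figure for DNS comes from the example above, and `None` stands in for "rebuilt on demand".

```python
from datetime import datetime, timedelta
from typing import Dict, Optional

# Hypothetical policy: maximum instance age per role.
# None means the role is only rebuilt on demand.
MAX_AGE: Dict[str, Optional[timedelta]] = {
    "dns": timedelta(days=7),
    "batch": None,
}


def rebuild_due(role: str, built_at: datetime, now: datetime) -> bool:
    """Return True if an instance of this role has reached its maximum age.

    Roles missing from MAX_AGE are treated like on-demand roles.
    """
    max_age = MAX_AGE.get(role)
    if max_age is None:
        return False
    return now - built_at >= max_age


now = datetime(2024, 1, 9)
print(rebuild_due("dns", datetime(2024, 1, 1), now))
print(rebuild_due("batch", datetime(2023, 6, 1), now))
```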

Automated server management lifecycle

One of the cornerstones of a well-automated infrastructure is a system for provisioning individual servers. A system that lets us reliably, quickly, and repeatably create new server instances that are consistent across our infrastructure means we spend less time fiddling with individual servers. Instead, servers become disposable components that are easily swapped, replaced, and expanded as we focus our attention on the bigger picture of the services we're providing.

The first step in achieving this is making sure server instances are built using an automated process. This ensures every server is built the same way, that improvements can be easily folded into the server build process, and that it is a simple matter to spin up new instances and to scrap and replace broken ones. Automating this process also means your team of highly skilled, well-paid professionals doesn't need to spend large amounts of its time on the brainless rote-work of menu-clicking through OS installations.

I first used automated installation by PXE-booting physical rack servers in 2002, following the advice I found on the then-current infrastructures.org site, and in later years applied the same concepts with virtualized servers and then IaaS cloud instances.

The machine lifecycle

I think of this as the machine lifecycle (which I tend to call the 'server lifecycle' because that's what I normally work with, although it's just as applicable to desktops). This involves a number of activities required to set up and manage a single machine instance, such as partitioning storage, installing software, and applying configuration.

Basic Server Lifecycle phases

These activities are applied during one or more phases of the machine lifecycle. There are three phases: "Package Image", "Provision Instance", and "Update Instance". There are a number of different strategies for deciding which activities to do in each phase.

The various activities may be applied during one or more phases, depending on the strategy used to manage the machine's lifecycle. Some strategies carry out more activities during the packaging phase, for instance, while other approaches might have a simpler packaging phase but do more in the provisioning and/or updating phase.
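One way to picture this is as a table of activities against phases, with a different table for each strategy. The sketch below uses the three phase names from above; the two strategy names and the way activities are assigned to phases are my own illustrative assumptions, not a definitive catalogue.

```python
from enum import Enum
from typing import Dict, List, Set


class Phase(Enum):
    PACKAGE_IMAGE = "Package Image"
    PROVISION_INSTANCE = "Provision Instance"
    UPDATE_INSTANCE = "Update Instance"


# Illustrative only: which phases each activity happens in, per strategy.
STRATEGIES: Dict[str, Dict[str, Set[Phase]]] = {
    # Simple packaging phase, most work done when provisioning.
    "light_packaging": {
        "partition storage": {Phase.PROVISION_INSTANCE},
        "install software": {Phase.PROVISION_INSTANCE, Phase.UPDATE_INSTANCE},
        "apply configuration": {Phase.PROVISION_INSTANCE, Phase.UPDATE_INSTANCE},
    },
    # Most work done up front when the image is packaged.
    "heavy_packaging": {
        "partition storage": {Phase.PACKAGE_IMAGE},
        "install software": {Phase.PACKAGE_IMAGE, Phase.UPDATE_INSTANCE},
        "apply configuration": {
            Phase.PACKAGE_IMAGE,
            Phase.PROVISION_INSTANCE,
            Phase.UPDATE_INSTANCE,
        },
    },
}


def activities_in(strategy: str, phase: Phase) -> List[str]:
    """List the activities a strategy carries out in the given phase."""
    return sorted(
        activity
        for activity, phases in STRATEGIES[strategy].items()
        if phase in phases
    )


for name in STRATEGIES:
    print(name, activities_in(name, Phase.PACKAGE_IMAGE))
```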

Machine lifecycle phases

Image packaging phase

In the image packaging phase, some or all elements of a machine instance are pre-packaged into a machine image in a way that can be reused to create multiple running instances.

This could be as simple as using a bare OS installation CD or ISO from the vendor. Alternatively, it could be a snapshot of a fully installed, fully configured runnable system, such as a Symantec Ghost image, VMware template, or EC2 AMI. Either way, these images are maintained in a Machine Image Library for use in the instance provisioning phase.

With the ManualInstanceBuilding pattern, everything happens during provisioning

Different machine lifecycle strategies use different approaches to image packaging. ManualInstanceBuilding and ScriptedInstanceBuilding both tend to use stock OS images, which involves less up-front work and maintenance of the Machine Image Library, since the images are taken straight from the vendor. However, work is still needed to create, test, and maintain the checklists or scripts used to configure instances when provisioning.

On the other hand, CloningExistingMachineInstances and TemplatedMachineInstances both create pre-configured server images, which need only minor changes (e.g. hostnames and IP addresses) to provision new instances. This is appealing because less work is done to provision a new instance, but the drawback is that creating and updating images takes more work. Admins tend to make updates and fixes to running instances which may not make it into the templates, which contributes to ConfigurationDrift, especially if changes are made ad-hoc.

What happens in each phase with the TemplatedMachineInstances pattern

CloningExistingMachineInstances, which usually takes the shape of copying an existing server to create new ones as needed, tends to make ConfigurationDrift worse, as new servers inherit the runtime cruft and debris (log files, temporary files, etc.) of their parents, and it is difficult to bring various servers into line with a single, consistent configuration. TemplatedMachineInstances are a better way to keep an infrastructure consistent and easily managed.

The tradeoff between scripted installs and packaged images depends partly on the tools used for scripting and/or packaging, which in turn often depend on the hosting platform. Amazon AWS requires the use of templates (AMIs), for example. In either case, exploiting automation more fully in the provisioning phase favours the case for keeping the packaging phase as lightweight as possible.

Instance Provisioning Phase

In the provisioning phase, a machine instance is created from an image and prepared for operational use.

Examples of activities in this phase include instantiating a VM or cloud instance, preparing storage (partitioning disks, etc.), installing the OS, installing relevant software packages and system updates, and configuring the system and applications for use.

There are two main strategies for deciding which activities belong in the packaging versus the provisioning phases. One is RoleSpecificTemplates, and the other is GenericTemplate.

With RoleSpecificTemplates, the machine image library includes images that have been pre-packaged for specific roles, such as web server, application server, mail server, etc. These have the necessary software and configuration created in the packaging phase, so that provisioning is a simple matter of booting a new instance and tweaking a few configuration options. There are two drawbacks of this approach. Firstly, you will have more images to maintain, which creates more work. When the OS used for multiple roles is updated, for example, the images for all of those roles must be repackaged. Secondly, this pattern gives you less flexibility, since you can't easily provision an instance that combines multiple roles, unless you create - and then maintain - images for every combination of roles that you might need.
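The maintenance burden is easy to put numbers on. The sketch below is back-of-the-envelope arithmetic under my own assumptions: every role needs one image per platform (hardware and OS combination), and supporting every possible combination of roles means one image per non-empty subset of roles. The function names are hypothetical.

```python
def role_specific_images(
    roles: int, platforms: int, all_combinations: bool = False
) -> int:
    """Count images needed when each image is pre-packaged for its role(s).

    With all_combinations=True, every non-empty subset of roles gets its
    own image, which is 2**roles - 1 images per platform.
    """
    per_platform = (2 ** roles - 1) if all_combinations else roles
    return per_platform * platforms


def generic_images(platforms: int) -> int:
    """Count images needed when one generic image per platform is kept."""
    return platforms


# Example: 4 roles (say web, app, mail, DNS) on 2 platforms.
print(role_specific_images(4, 2))                          # 8
print(role_specific_images(4, 2, all_combinations=True))   # 30
print(generic_images(2))                                   # 2
```

Each of those images has to be repackaged whenever the underlying OS is updated, so the count is a rough proxy for the ongoing work.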

What happens in each phase with the GenericTemplate pattern

With the GenericTemplate pattern, each image is kept generic, including only the software and configuration that is common to all roles. The role for each machine instance is assigned during the provisioning phase, and software and configuration are applied accordingly then. The goal is to minimise the number of images in the machine image library, to reduce the work needed to maintain them. Typically, a separate template is needed for each hardware and OS combination that can't be supported from a single OS install. The JeOS (Just Enough Operating System) concept takes this to the extreme, making the base template as small as possible.

The GenericTemplate pattern does require more robust automated configuration during provisioning, and may mean provisioning an instance takes longer than using more fully-built images, since more packages will need to be installed at provisioning time.
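As a rough sketch of what assigning roles at provisioning time might look like, the code below works out which packages still need installing on top of a generic image. The package names, the `BASE_PACKAGES` and `ROLE_PACKAGES` tables, and the `provisioning_plan` function are all assumptions made for illustration; in practice this logic would live in a tool such as Puppet or Chef.

```python
from typing import Dict, List

# Hypothetical: packages already baked into the generic image.
BASE_PACKAGES: List[str] = ["openssh-server", "ntp"]

# Hypothetical: extra packages each role needs.
ROLE_PACKAGES: Dict[str, List[str]] = {
    "web": ["nginx", "openssl"],
    "app": ["openjdk"],
    "mail": ["postfix", "openssl"],
}


def provisioning_plan(roles: List[str]) -> List[str]:
    """Return the packages to install at provisioning time, in order.

    Packages shared between roles are listed once, and packages already
    present in the generic image are skipped.
    """
    plan: List[str] = []
    for role in roles:
        for package in ROLE_PACKAGES[role]:
            if package not in plan and package not in BASE_PACKAGES:
                plan.append(package)
    return plan


# A single instance combining two roles, with no extra image required.
print(provisioning_plan(["web", "mail"]))
```

The longer this list gets, the longer provisioning takes, which is the cost of keeping the image generic.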

Instance Updating Phase

Once a machine instance is running and in use, it is continuously updated. This includes activities such as applying system and software updates, new configuration settings, user accounts, etc.

Many teams carry out these updates manually, but it requires a high level of discipline and organization to maintain systems this way, especially as the number of systems grows. The number of machines that a team can manage is closely dependent on the size of the team, so the ratio of servers to sysadmins is low. In practice, teams using manual updates tend to be reactive, updating machines opportunistically when carrying out other tasks, or in order to fix problems that crop up. This leads to ConfigurationDrift, with machines becoming increasingly inconsistent with one another, creating various problems including unreliable operation (software that works on one machine but not another), and extra work to troubleshoot and maintain.

Breaking into automated infrastructure management

Automated management of infrastructure is vital for delivering highly effective IT services. But although there are plenty of tools available to help implement automation, it's still common to see operations teams manually installing and managing their servers. The result is a high-maintenance infrastructure that soaks up the team's time on firefighting and other reactive tasks.

Doing it by hand

I've met many smart and skilled systems administrators in this situation. These folks know automation can make their life easier, but they can't afford to take time away from turning cranks, greasing wheels, and unjamming the gears to keep their infrastructure puffing along in order to focus on improving their situation.

I'm convinced this is largely due to habit. Even though these teams understand that automation would be useful to them, when the pressure is on (and the pressure is always on), they roll up their sleeves, ssh into the servers and knock them into shape, because that's the fastest way to get stuff done. Manual infrastructure management is what they're used to. I find that most of these teams haven't had personal experience of well-automated infrastructures, and don't tend to believe it's something they can realistically implement for their own operations.

Sysadmins who have worked in teams with mature, comprehensive automation, on the other hand, can't go back. Sure, they might log into a box to diagnose and fix something that needs fixing right now, but they can't relax until they've baked the fix into their automated configuration, and made sure that their monitoring will alert them ahead of time if the problem happens again.

Breaking out of manual infrastructure management and setting up an effective automation regime is difficult. Although there are loads of tools out there to make it work, it helps to understand good strategies for implementing them. I recommend looking over the material on the infrastructures.org site. It hasn't been updated in a few years, so it doesn't take into account many of the advances since then, including virtualization, cloud, and newer tools like Chef and Puppet, but there is still rich material there.

Another must-read which is more up to date is Web Operations by John Allspaw, Jesse Robbins, and a bunch of other smart peeps.

I'm also planning to share a few of the practices I've seen and used for automation in upcoming posts.

How to waste money on virtualization

I've been disappointed to see otherwise switched-on technical groups, and even high-priced 'managed hosting' companies, fail to take advantage of virtualization, even as they spend (and charge) loads of money to migrate physical servers onto virtualized infrastructures such as VMware.

Moving from an OS running directly on hardware to an OS running on software that pretends to be hardware opens amazing possibilities, akin to the shift from paper to online data management. You're replacing immutable, physical servers with data, which means you can treat your infrastructure the same way you treat any other data - you can automate it, copy it, transfer it, and basically put it at the mercy of anything you can script.

While it's disappointing to see people use virtualization to replicate the experience of hosting on immutable, physical hardware, it's truly appalling to see hosting vendors offer this as a service, and add a premium price on top. It's only more annoying when they call it 'Cloud'.

I've recently seen a tender response from a big name, international hosting provider which basically offers to provide you with a couple of dedicated hardware boxes with ESX installed on them. Reasonable, if not ideal; obviously it's more powerful to have access to a pool of hardware resources that you share with other customers, so you can flex capacity when you need to, but there are reasons why a customer might not want to go this way.

What blew my mind was the way virtual machines on these boxes would be provided. For each VM, you pay a setup fee, and a monthly management fee. These fees are in pretty much the same ballpark as what you would pay for dedicated hardware for each of these VMs. But this company also charges a very hefty fee for each physical ESX box, so you're actually paying quite a bit more for N virtual machines than you would pay for N physical servers.
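The arithmetic is simple enough to sketch. Every figure below is a made-up placeholder, not a number from the tender; the point is only that when the per-VM fees match the per-server fees, the charge for the ESX boxes is pure extra cost. The function names are hypothetical.

```python
def virtual_cost(
    n_vms: int,
    vm_setup: float,
    vm_monthly: float,
    n_hosts: int,
    host_monthly: float,
    months: int,
) -> float:
    """Total cost of N VMs: per-VM fees plus a fee for each ESX host."""
    per_vm = vm_setup + vm_monthly * months
    return n_vms * per_vm + n_hosts * host_monthly * months


def physical_cost(
    n_servers: int, setup: float, monthly: float, months: int
) -> float:
    """Total cost of N dedicated physical servers."""
    return n_servers * (setup + monthly * months)


# Placeholder figures: 10 machines for 12 months, with identical per-machine
# fees either way, plus 2 ESX hosts in the virtualized case.
virtual = virtual_cost(10, 500, 300, n_hosts=2, host_monthly=1500, months=12)
physical = physical_cost(10, 500, 300, months=12)
print(virtual, physical, virtual - physical)
```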

On top of this, you lose the flexibility that virtualization allows. Need to spin up a new image to prototype some changes to your application configuration? Request a price, get an invoice, raise a PO, pay a setup fee, and commit to three months of paying for the new image. Need to clone a running image to debug a production bug? Similar story. Expand capacity for a few weeks to support a marketing campaign? Implement an upgrade strategy that involves cloning, upgrading, testing, then swapping the clone into production? Even after you've jumped through the budgeting and purchasing hoops, you'll need to send your 'managed' hosting vendor a change request and wait a few days for their engineers to use the virtualization management console for you to carry out each step.

I'm not a Capability Maturity Model kind of guy, but I could see the benefit of having one for virtualization, to help enterprise CIO types understand what they should be demanding from vendors. The lowest level would involve using virtualization to replicate physical hardware, and the next would introduce flexibility in managing instances, supporting the types of use cases I described above. Higher levels of maturity would be more cloud-like, particularly around self service in provisioning images, flexible capacity management, and dynamic provisioning. I see higher levels of capability moving away from Infrastructure as a Service and towards providing a development and deployment platform that abstracts the details of servers, i.e. Platform as a Service.

A virtualization CMM would be grossly abused by marketers, but something like it might provide a few clues, and stomp out the practice of hosting providers offering virtualization without the benefits.

SAN Virtualization

Virtualization in the server farm relies heavily on storage, something I understand only to a certain level. This review by InfoStor discusses virtualization of the SAN itself to get the flexibility you really need to get the best value out of server virtualization. It's somewhat over my head at the moment - I haven't gotten to this point yet - so I'm putting this link here for future reference.

Found via virtualization.info.

The hidden pitfalls of server virtualization

Where there's buzz, there's bullshit. Slashing hardware costs and the time I spend herding servers makes virtualization sound like a silver bullet, but I have no doubt it's not as easy to pull off as the salespeople tell me. I've spent some time researching and thinking it through, and have come up with a few things that I need to keep in mind when planning to get into virtualization properly.

Will I really spend less on hardware?

What's cool about virtualization?

How would you like to cut your annual server farm budget to a fraction of its current cost, reaping the glory and gold due to a corporate IT champion? Lop your data center to a third of its current size, no, a fifth, maybe less! At the same time, you can improve resiliency, flexibility, and potency!

The Virtual Buzz

Virtualization is all the buzz these days, especially for server farms. As my own collection of server hardware heads towards 20 boxes and is still hard pressed to handle all the tasks I need it to do, I'm finding the lure of the herd hard to ignore.

Whipping up a solid LDAP infrastructure


I've been much too quiet lately. I'm still hard at work putting together what I hope will be a very strong infrastructure for my company's application hosting operations, with about 15 servers for production, content management, and staging and testing.

One of the core components of this infrastructure is an OpenLDAP server, which I've been working on over the past week. Up until now it's been enough to have a couple of accounts which are created locally on all of the servers by puppet. I've got a chunk of disk space on a SAN which is shared across the machines, which is handy for having a common home area for key accounts I use to login and administer the machines, as well as the puppet templates and manifests.

The cool kids talk about operations

Tim O'Reilly, the boss of O'Reilly publishing and a key booster of the Web 2.0 meme, recently posted an article about operations.

One of the big ideas I have about Web 2.0 [is] that once we move to software as a service, everything we thought we knew about competitive advantage has to be rethought. Operations becomes the elephant in the room.

O'Reilly laments that most of the tools for deploying systems and applications on open source platforms (i.e. Linux) are not themselves open source. Luke Kanies and others have commented on the article with examples of open source deployment and operations management tools, including Puppet, and others I've mentioned for system configuration and network monitoring.
