Platform Engineering Beyond CFEngine

Notes on Sustainable Platform Engineering

Camille Fournier
8 min read · 2 days ago

Readers of a certain age and heritage will remember CFEngine, one of the early popular configuration management systems (still in use today!). It provides a powerful way for administrators to manage system configuration across heterogeneous clusters of machines, and it ushered in the age of automated configuration management, a key component of the modern technology ecosystem. I’ve worked with and managed teams responsible for CFEngine infrastructure, and have seen first-hand the value and challenge of managing and operating these systems. But I have also seen the limitations of approaching platform engineering from this mindset.

Configuration management, infrastructure provisioning, and the orchestration thereof are a devilish problem for organizations. Significant innovation has been poured into this area, from advances in “Infrastructure as Code” offerings like terraform, to orchestration platforms like kubernetes. The problem of maintaining heterogeneous environments, especially as these environments grow in dimensions of complexity (e.g., all of the pieces that need to exist for you to deploy an application onto the cloud using a multitude of cloud services), is never-ending. While we may have moved beyond CFEngine, many modern platform engineering teams focus on this part of the stack, making it easy for application teams to select their appropriate archetype or blueprint to provision the cloud resources their application needs to run.

I happen to also think that this practice, particularly when it is the main focus of your platform team, is misguided. It comes from walking down a path that seems logical:

  1. Terraform sucks. Especially when you’re in a company that is moving to the cloud from an old datacenter model, and your initial approach is to make every app team write its own terraform to provision things, you quickly see that every team is spending a ton of time figuring this out. And a lot of the time they’re figuring out the same things! So why not save them some effort, centralize that terraform writing, and create some patterns for everyone to adopt? That makes sense, right? This gets teams to “day 0” (initial deployment) quickly.
  2. Then you start to think, well, there are a lot of other things we could configure too! We could configure observability so they’re set up for appropriate logging and alerts. We could allow them to specify the level of availability they need and provision them into multiple zones or regions. All of this is great, and useful; it gives application teams better deployments. So why not keep going down this path?
  3. (sometimes) Even though terraform sucks, and we think we should centralize it, we don’t want to hide it from the app teams completely either, because that is restrictive. So we don’t completely hide it away from them: the code our blueprints generate is checked in nicely to a repo that application teams can access and tweak as needed. We want to be helpful but not limiting.
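
To make the blueprint pattern from the steps above concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than a real tool: the archetype names, the `render_blueprint` and `write_generated_config` functions, and the shape of the output are all hypothetical, and a real implementation would likely emit actual terraform rather than a plain dictionary.

```python
import json

# Hypothetical archetype defaults that the platform team maintains centrally.
BLUEPRINTS = {
    "web-service": {"compute": "container", "load_balancer": True},
    "batch-job": {"compute": "container", "load_balancer": False},
}


def render_blueprint(app_name, archetype, zones=("zone-a",), alerts=True):
    """Expand a few high-level choices into a fuller provisioning config."""
    if archetype not in BLUEPRINTS:
        raise ValueError(f"unknown archetype: {archetype}")
    base = BLUEPRINTS[archetype]
    config = {
        "app": app_name,
        # Step 2: availability is expressed as one compute entry per zone.
        "compute": [{"type": base["compute"], "zone": z} for z in zones],
        # Step 2: observability is switched on by default.
        "observability": {"logging": True, "alerts": alerts},
    }
    if base["load_balancer"]:
        config["load_balancer"] = {"targets": list(zones)}
    return config


def write_generated_config(config):
    """Serialize the config so it can be checked in to a repo (step 3)."""
    return json.dumps(config, indent=2, sort_keys=True)


if __name__ == "__main__":
    cfg = render_blueprint("billing", "web-service", zones=("zone-a", "zone-b"))
    print(write_generated_config(cfg))
```

Note that the output is a generated artifact the application team can read and tweak, which is the escape hatch that step 3 describes.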

You can spend a good long time building this way; indeed, keeping up with the changes in cloud offerings and application team needs means you can probably do it forever. So, why not take this approach? It isn’t a stupid thing to do; many teams have used it with reasonable success. But it also relies on a few assumptions, sometimes spoken, sometimes not.

  • Assumption #1: While there are enough similarities in the infrastructure needed to provision that we can create these “blueprints” and have them be useful for many applications, there aren’t any better abstractions that we could create for these similar groups of applications. We really don’t want to have much if any production software that we (the platform team) are operating that application teams rely on beyond deployment stages, because that puts us on a critical path that we’re not sure we can handle.
  • Assumption #2: The cloud provider operates all of the underlying systems in such a way that they rarely perturb the application teams, and the app teams can rely on themselves or their SREs to manage it all when they do. We can be there to help debug but we don’t need to be part of critical production support.
  • Assumption #3: The biggest bottleneck that we can solve for application teams is getting the stack that their application needs provisioned and running in the cloud. This seems totally obvious at the beginning of a cloud journey where the whole company is trying to figure out how to do this stuff in the first place, and very few people are operating meaningful applications in the cloud.

Now that I have extrapolated these assumptions, you may see where I am going. If you agree with all of these, I wish you the best; you may very well be correct for your organization that this is the right approach! But there are some major limitations that fall out of these assumptions.

Limitation #1: We don’t want to be on the hook for operations, so we don’t want to build software we have to operate, especially on the production critical path.

Avoiding operations is, generally speaking, a bad idea for platform teams. The value of a platform team is often commensurate with how much operational burden they are able to remove from application teams, and you undermine your value by shying away from it. When there is a budget crunch, teams that are doing “enablement” are cut long before those that are responsible for running critical systems; one is viewed as optional, but the other is known to be essential.

Understandably, sometimes application teams don’t want you in the middle of their operations, especially if you aren’t very good at them in the first place. But it’s very hard for you to be half-in and half-out of ownership when you’re the experts in all of the configuration and infrastructure provisioning. It’s hard to do that work without knowing the underlying infrastructure well from an operational perspective, which means you’re likely to be needed during incidents to help debug what is happening. Or the app team knows enough that they don’t need you, and then… what are they really getting from using your blueprints at that point? Maybe it helped them bootstrap initially, but now it’s probably just a pain for their experts to translate your definitions to their own mental models and feel comfortably in control of what is happening.

Application teams that want to own everything themselves should; the ones that need it will do it well, and the ones that don’t do it well will have a clearer idea of the cost/benefit tradeoff of owning everything themselves.

For everyone else, I would argue that if you can find enough similar blueprints, you should consider whether you can build a platform that they deploy their application on top of, rather than just maintaining configuration for them. To be clear, this is tricky. It’s tempting to jump to “trying to build Heroku for the company” and many of us have seen that fail. At the same time, if there is meaningful work to be done to combine a bunch of parts together so that a common group of applications can be deployed into the cloud, I think it’s likely that there is something you could be operating. It might be a thin shim of software that takes hosted kubernetes with some company-specific controllers and integration with company-specific entitlements and finops support, and allows your application teams to specify a much simpler subset of configuration parameters and deploy their applications into that platform. Ask yourself, what could you build and operate yourself that would meaningfully free application teams from having to understand and maintain so much of the underlying infrastructure stack?
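
As a rough illustration of what such a thin shim might look like from the application team’s side, here is a Python sketch. It is a toy under stated assumptions, not a real kubernetes integration: `AppSpec`, `Platform`, and the defaults it fills in (namespace naming, network policy, log shipping, the replica cap) are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppSpec:
    """The small set of parameters an application team has to provide."""
    name: str
    image: str
    replicas: int = 2
    team: str = "unknown"


class Platform:
    """Hypothetical platform API that owns everything the spec leaves out."""

    MAX_REPLICAS = 20  # assumed guardrail chosen by the platform team

    def __init__(self):
        self._deployments = {}

    def deploy(self, spec):
        if not 1 <= spec.replicas <= self.MAX_REPLICAS:
            raise ValueError(f"replicas must be between 1 and {self.MAX_REPLICAS}")
        # These settings are decided and operated by the platform team,
        # so application teams never have to write or maintain them.
        self._deployments[spec.name] = {
            "image": spec.image,
            "replicas": spec.replicas,
            "namespace": f"team-{spec.team}",
            "network_policy": "default-deny",
            "log_shipping": True,
        }
        return f"{spec.name} deployed to team-{spec.team}"

    def status(self, name):
        return self._deployments[name]


if __name__ == "__main__":
    platform = Platform()
    spec = AppSpec(name="checkout", image="registry.example/checkout:1.4", team="payments")
    print(platform.deploy(spec))
    print(platform.status("checkout"))
```

The difference from a blueprint is that nothing is generated and handed over: the application team sees only the spec, and the platform team keeps operating whatever sits behind `deploy`.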

It’s important to note that this may not apply to a small company that can use a few select cloud provider platforms directly to do what they need (the aforementioned Heroku, Vercel, AppEngine, etc). I would in fact not advise trying to spend a lot of developer time building your own platforms if out-of-the-box vendor offerings will do! But for larger companies or certain types of complexity, writing some (not too much) custom software to make something that does what your company needs and hides the bare cloud complexity from the application teams can be worthwhile.

Limitation #2: We don’t actually want to write our own software.

Our bespoke kubernetes platform is an example of a real type of platform that I’ve seen built and operated by a platform engineering team. It works because kubernetes allows you to write your own code to modify its behavior; and while this is the kind of tricky work you may not want to do too often, it is very useful for taking a powerful general-purpose system and focusing it on specific use cases that you can then offer as a service.

You may be thinking, surely if this is useful someone else will build it, or the cloud provider will offer it, but there’s no reason that I should really need to write software myself. But there is a wide gulf between building a platform that is so generally useful that it can be sold to many types of companies with diverse use cases (very very hard), and building something useful within a very specific environment. In particular, the integration of the generally useful components with the quirks of your company (often relating to identity/entitlements, corporate hierarchy/department codes for billing, and other such company-specific details) is a fruitful place to look to balance writing just enough code to create a valuable platform.
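
To give a sense of how small this company-specific code can be, here is a hedged Python sketch of an admission-style check that enforces entitlements and attaches a department code for billing. The lookup tables, the `admit` function, the `NotEntitled` exception, and the label names are all assumptions made up for illustration; in practice the data would come from your own identity and finance systems.

```python
# Hypothetical stand-ins for company systems of record.
ENTITLEMENTS = {"alice": {"payments"}, "bob": {"search"}}
DEPARTMENT_CODES = {"payments": "D-1042", "search": "D-2210"}


class NotEntitled(Exception):
    """Raised when a user tries to deploy on behalf of a team they are not in."""


def admit(user, team, resource):
    """Reject requests from users outside the team; tag the rest for billing."""
    if team not in ENTITLEMENTS.get(user, set()):
        raise NotEntitled(f"{user} may not deploy for team {team}")
    # Copy the labels so the caller's resource is left unmodified.
    labels = dict(resource.get("labels", {}))
    labels["department-code"] = DEPARTMENT_CODES[team]
    labels["owner"] = user
    return {**resource, "labels": labels}


if __name__ == "__main__":
    print(admit("alice", "payments", {"name": "api"}))
```

A check like this is uninteresting as a product for sale, but inside one company it is the sort of glue that every application team would otherwise have to rediscover on its own.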

Limitation #3: We don’t want to become a bottleneck.

The appeal of the blueprint idea is that it has an escape hatch: anything that could be done in terraform is at least theoretically fair game, the application teams can work to enable more cloud things themselves, and you haven’t created any real restrictions, just nice paved paths that make certain patterns easier. And I am very sympathetic to giving teams a way out of your choices so that you don’t limit the company’s innovation through the capabilities of your platforms.

Unfortunately, leaving the wide world of possibilities open retains all of the downsides of the wide world of possibilities. Choice is always appealing in the beginning and often limiting over time as every bespoke choice must be supported indefinitely. If you allow every team to independently choose what they want all the time, you get what we call the over-general swamp. Each different choice requires its own glue (one-off code for integration, automation, configuration, etc), and while glue is quick to create, it is painful to change.

Taking a product approach to platforms doesn’t mean making everyone happy all the time and building exactly what they ask you to build. It means taking the time to understand what is actually making your application teams less efficient, and doing hard work yourself so they don’t have to. Sometimes that hard work is making curatorial choices about what is supported and what is not, and making some niche group unhappy. As any product manager will tell you, it’s stressful bearing the burden of setting strategy and making bets that don’t always pay off.

Products, not Scripts

To wrap this up, Platform Engineering done well is more than configuration management and infrastructure provisioning. It is doing the hard work to identify, build, and operate abstractions that allow application teams to spend less time on the underlying technology and more on solving problems for their business. Instead of thinking about how you can support the scripting and automation of infinite variations, think about the products you can build to cover the most common archetypes in a way that goes beyond the act of provisioning and deployment.

This is the first in a series of articles where I attempt to work through some thoughts on how to differentiate platform engineering teams and create meaningful value through platform engineering. In that sense, these notes are seeking audience resonance, and I am curious to hear your questions and reactions! Thanks to James Turnbull and Pete Miron for their early comments.

Enjoy this post? You might like my books: The Manager’s Path, available on Amazon and Safari Online, and Platform Engineering: A Guide for Technical, Product, and People Leaders, coming out fall 2024 and available in early release on O’Reilly Learning!
