Preventing and Rescuing a Platform from Fire-Fighting Mode

Few things are certain when building a new platform, but one pattern I’ve noticed in an overwhelming number of platforms is that there comes a time when the platform team is stuck in fire-fighting mode. Here are a few personal opinions on why this happens and a few options for getting out.

On a path into the fire

It is normal for a platform to start in experimentation mode, perhaps with a trial involving one or two teams, or by testing different approaches to different use cases. But once more teams start depending on the platform, experimentation time is over (for “old” features) and the foundation of the platform has to be rock-solid. Otherwise, the platform might end up buried by its own success.

There is a moment in the evolution of a platform when it stops being an experiment and becomes a critical component of the product or infrastructure. This moment isn’t always obvious: it usually isn’t advertised or signaled in any special way (or even celebrated), and the platform and its team tend to slip past it unnoticed. It might come when more teams decide to use the platform in production, or when usage increases dramatically. Whatever the trigger, it is a critical point and should be recognized as such. By then, the team should already be behaving as if their product is no longer experimental and is ready to be widely released.

When is a platform ready?

This is not a precise definition, but I would look at the following aspects before launching a new platform:

The platform is treated as a product. This means having a TPM role or position, having the means to know its customers or users and how the product is used, and having a clear mission and vision.
The platform has a community around it, even if only at an incipient stage. An active community can go a long way toward growing the platform.
The platform has proper boundaries. It is very clear where the platform ends and where the responsibility of the users starts.
The platform has a solid automation foundation. As the platform evolves, more and more things can be automated, so having a solid automation foundation from the beginning helps a lot. One example is a self-service catalog, even if most of its operations only create a ticket in the backlog.
The platform team has a sustainable way to add new features. Every new feature is a potential source of instability. For example, in the case of infrastructure platforms, a new technology should not be released before the team has become expert in it.
The platform is used in a sustainable way. Most (all?) platforms are shared, which puts them in danger of falling into the tragedy of the commons. Giving users feedback on their usage helps reduce this risk.

The interesting thing about these aspects is that if they are not in place before the platform is widely released, it is going to be very hard to put them in place afterwards.
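To make the self-service catalog idea a bit more concrete, here is a minimal sketch of a catalog where every operation has the same entry point, and anything that is not automated yet simply lands as a ticket in the backlog. Everything in it (the Catalog class, the operation names, the ticket format) is invented for illustration and is not taken from any real tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Catalog:
    """Hypothetical self-service catalog (illustrative names only)."""

    # Tickets waiting for a human on the platform team.
    backlog: List[str] = field(default_factory=list)
    # Operations that are already automated, keyed by operation name.
    automations: Dict[str, Callable[[str], str]] = field(default_factory=dict)

    def request(self, operation: str, team: str) -> str:
        """Run the automation if one exists, otherwise create a ticket."""
        handler = self.automations.get(operation)
        if handler is not None:
            return handler(team)
        ticket = f"{operation} for {team}"
        self.backlog.append(ticket)
        return f"ticket created: {ticket}"


catalog = Catalog()
# Only one operation is automated so far; the rest fall back to tickets.
catalog.automations["create-namespace"] = lambda team: f"namespace created for {team}"

print(catalog.request("create-namespace", "payments"))
print(catalog.request("increase-quota", "payments"))
print(catalog.backlog)
```

The point of this shape is that users always go through the same door. As the team automates more operations, they move from the backlog path to the automated path without the users having to change how they ask.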

Into the fire already?

If by any chance the platform team is already stuck in fire-fighting mode, here are a few ideas to try to get out:

Reduce the scope of the team. If the platform spans multiple domains that, in practice, only some team members handle even though everyone is supposed to own them, consider splitting the team along these domains to reduce friction and increase focus.
Evaluate the process of handling issues. Can it be done in a more effective way that leaves less of a mark on the team?
Categorize and count the issues. Is effort being spent on the most important ones? Can they be prevented through fixes or new features?
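Categorizing and counting does not need special tooling to get started. As a rough sketch, assuming each issue has been tagged with a category and an estimate of the hours spent on it, a few lines of Python are enough to see where the time goes. The categories and numbers below are made up for illustration.

```python
from collections import Counter

# Hypothetical issue log: (category, hours_spent) pairs.
# The categories and numbers are invented for illustration only.
issues = [
    ("access-request", 0.5),
    ("failed-deploy", 3.0),
    ("access-request", 0.5),
    ("quota-exceeded", 1.0),
    ("failed-deploy", 4.0),
    ("access-request", 0.5),
]


def summarize(issues):
    """Return per-category issue counts and total hours spent."""
    counts = Counter()
    hours = Counter()
    for category, spent in issues:
        counts[category] += 1
        hours[category] += spent
    return counts, hours


counts, hours = summarize(issues)

# Rank by total hours so the most expensive category comes first.
for category, total in hours.most_common():
    print(f"{category}: {counts[category]} issues, {total:.1f} hours")
```

In this made-up data the most frequent category is not the most expensive one, which is exactly the kind of mismatch worth looking for: frequent, cheap issues are often candidates for automation, while rare, expensive ones may call for a fix at the source.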