Metric-driven CI Stability
When I joined our Developer Experience team, we had very little visibility into how well our CI tools were serving other engineers, and no hard evidence to back up any claims. We started by identifying what we could and should measure. Then we established SLIs and SLOs to formalize those measurements and ran an experiment to improve Buildkite stability. In the end we reached our stability goal, going from ~95% to 99.5%+ of builds that didn't fail because of something we controlled. Now that we have hard data about the job we are doing, we no longer have to debate when to focus on CI and when to work on other tasks. If the SLO isn't being met, we work on CI. If the SLO is being met, we can work on improving our other, slightly less critical tools. Attendees should walk away with an understanding of how to identify what an SLO can be applied to, how to gather data for the SLI, and why this is good for their team’s productivity.
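To make the stability number concrete, here is a minimal sketch of how such an SLI and the resulting prioritization rule could be computed. It is an illustration only, not our actual tooling: the `Build` record, its `infra_failure` flag, and the function names are hypothetical, and how a failure gets attributed to CI infrastructure rather than to the change under test is assumed to be decided elsewhere.

```python
from dataclasses import dataclass
from typing import Sequence

# Target from the abstract: 99.5% of builds must not fail for reasons we control.
SLO_TARGET = 0.995


@dataclass(frozen=True)
class Build:
    """Hypothetical build record; field names are assumptions for this sketch."""
    passed: bool
    # True when the failure was caused by something the CI team controls
    # (for example an agent crash), not by the change being tested.
    infra_failure: bool = False


def stability_sli(builds: Sequence[Build]) -> float:
    """Fraction of builds that did not fail because of CI infrastructure."""
    if not builds:
        # Assumption: with no builds, nothing failed, so treat the SLI as 1.0.
        return 1.0
    bad = sum(1 for b in builds if not b.passed and b.infra_failure)
    return (len(builds) - bad) / len(builds)


def next_focus(builds: Sequence[Build], target: float = SLO_TARGET) -> str:
    """Decision rule: work on CI while the SLO is missed, other tools otherwise."""
    return "ci-stability" if stability_sli(builds) < target else "other-tools"
```

Note that a build failing because of a genuine test failure still counts as a good event here, since the SLI only tracks failures the team had control over.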