Eavan's Blog

Please Implement This Simple SLO

Published 2025-11-01

Hello, Newguy! I'm glad to see that you've completed your onboarding here at Corp. Générique. I have your first ticket ready. It's a simple task. Just 2 story points. But it should give you the opportunity to learn a little about how our services work together. Simply implement an availability SLO for our cherished Foo service. You know how to implement an SLO, right?

I think we should aim for four nines. I'm sure you're aware of all the industry best practices, so I won't bore you with too many suggestions. We have a physical copy of the Google SRE Book downstairs, if you need it. It should be quite quick, so could you present your SLO at this coming Friday's demo?
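For context, four nines is a demanding target. A quick back-of-the-envelope sketch (the window lengths are my own assumptions, not anything from the ticket):

```python
# Back-of-the-envelope error budget for a 99.99% ("four nines") availability SLO.
# The windows below are illustrative; use whatever window your org actually picks.

def downtime_budget_seconds(slo: float, window_days: float) -> float:
    """Seconds of total unavailability allowed per window at the given SLO."""
    return (1.0 - slo) * window_days * 24 * 3600

for days in (30, 7, 1):
    print(f"{days:>2}d window: {downtime_budget_seconds(0.9999, days):7.1f} s")
# A 30-day window allows about 259.2 s, i.e. roughly 4.3 minutes of downtime.
```

In other words, a single unnoticed outage of a few minutes eats the whole monthly budget, which is worth keeping in mind for everything that follows.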

An ordinary and uneventful Friday arrives

Thanks for presenting, Newguy. Great work! I'm glad you've become familiar with the Prometheus metrics emitted by our Foo service. But I have a few questions. Your SLO shows that we've already exhausted our error budget. But we haven't had any incidents recently. Could there be something wrong with the calculation? ... Oh, hold on... I see the issue! We're counting 4xx response codes as bad events for our SLO. That's ridiculous! Our Foo service is working exactly as intended. If users are making invalid requests or looking for resources that don't exist, that's not our problem, and it doesn't indicate an issue with Foo. So if you could just exclude those 4xx requests, we can add alerting to your SLO and close this ticket. That's all for this meeting. I'll see you all again for retro in about five minutes.
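For illustration, the requested change amounts to classifying only 5xx responses as bad events. A toy sketch (hypothetical code, not the team's actual SLI; in Prometheus you would typically filter on a status-code label instead):

```python
# Toy availability SLI where only server-side failures (5xx) count as bad events.
# 4xx responses are treated as the client's problem, per the meeting above.

def availability(status_codes: list[int]) -> float:
    """Fraction of requests that did not fail server-side (non-5xx)."""
    if not status_codes:
        raise ValueError("no requests observed in this window")
    bad = sum(1 for code in status_codes if 500 <= code <= 599)
    return 1.0 - bad / len(status_codes)

print(availability([200, 404, 400, 500]))  # only the 500 counts as bad -> 0.75
```

Note that a window full of nothing but 404s would now score a perfect 100% availability, which is exactly the trade-off being made here.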

Tuesday morning incident post-mortem

I see... John thought that setting the rate limit to 0 would effectively disable the feature, but in fact setting it to any non-positive integer makes feature usage unlimited. I am assigning John an action item to update our config documentation.

But what I really want to know is why we only found out about this after I got pinged by the Bar team. Newguy, why didn't our SLO catch this? The Foo service was down for four hours! I did the math myself, and with a 100% error rate, the SLO should have triggered after 5 seconds.

... Right, I see. So since the Foo service was down, it wasn't emitting any metrics. Our SLO then saw no requests and no failures, hence no alert. A small, but sadly impactful, oversight, Newguy. You should have used the metrics emitted by the load balancer. I'll reopen the SLO ticket with a reference to the minutes of this meeting. Please get this fixed.
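The oversight is worth spelling out: a ratio-of-counters SLI goes silent when there are no counters. A toy illustration of the trap (hypothetical code, not the team's):

```python
# When a service is completely down it emits no metrics at all, so both the
# error count and the request count are zero. A naive SLI that guards against
# division by zero by reporting "no errors" will show 100% availability
# during a total outage.

def naive_sli(errors: int, total: int) -> float:
    if total == 0:
        return 1.0  # BUG: absence of data is treated as absence of failure
    return 1.0 - errors / total

print(naive_sli(0, 0))    # total outage, Foo emits nothing -> 1.0 ("all good!")
print(naive_sli(5, 10))   # -> 0.5
```

Measuring at the load balancer, which still sees (and fails) the incoming requests, sidesteps this blind spot; so does alerting on the absence of the metric itself.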

Thirty minutes before heading home on Thursday

Hey, Newguy! I just had a chat with the PM of the Bar team. Apparently, they've been having issues with creating objects via our web UI. Something about the frontend sending a bare string where the backend expects an object containing a string, so the backend returns 400. I've already spoken to our frontend developer about fixing our UI, but I was wondering, shouldn't we have caught this in our SLO? I know I said we should exclude 4xx response codes from the error count. But if it comes from our UI, then that's different. Because our UI is our responsibility, and really it is a part of the Foo service. Can you just tweak the SLO to count 4xx response codes as errors if they come from our web UI? You can use the header information to determine whether a request comes from our UI, since we always set a few non-standard headers.

... Mmm, that's quite the pickle. So now that you're using the load balancer metrics to calculate the SLO, you don't have an easy way to create new metrics based on header values. Am I understanding you correctly? Regardless, I'm sure this is a common problem. After all, we're not doing anything unusual with the Foo service. Any other company would run into similar issues, so there's probably plenty of literature online about how to solve this. Once you've implemented a solution, can you present it in tomorrow's demo?

Another dull and lifeless Friday arrives

Thanks again, Newguy, for the insightful demo. Who could have guessed that we'd still be talking about this SLO a whole week later? Ha ha ha! I liked the solution of combining the total request count from the load balancer with your new custom metric from the Foo service. Very innovative. I'll present this to the leadership team later today.
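For the curious, the demoed approach amounts to a ratio whose numerator and denominator come from different sources. A minimal sketch with made-up names and numbers (the real thing would be a Prometheus query, not Python):

```python
# Mixed-source SLI sketch: total request count comes from the load balancer,
# while "bad events" is a custom counter emitted by the Foo service itself
# (5xx anywhere, plus 4xx responses to requests carrying the web UI's
# non-standard headers). All names and numbers are invented.

def mixed_sli(lb_total: int, foo_bad: int) -> float:
    if lb_total == 0:
        raise ValueError("no traffic observed at the load balancer")
    return 1.0 - foo_bad / lb_total

print(mixed_sli(lb_total=1000, foo_bad=125))  # -> 0.875
```

One design consequence: the two counters are scraped from different targets, so they are sampled at slightly different instants.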

A rainy and foreboding Monday morning

Hey, Newguy. Look, if this task was too challenging for you, you should have asked for help. I presented your SLO to the leadership team on Friday and they noticed some issues with it. Here, look at my screen! I'm looking at our SLI from Friday afternoon, just after 13:00. The SLI takes a small dip for some reason. And then it goes to 102%! Is the leadership team supposed to believe that we in the Foo team create magical applications with higher than 100% availability? What would that even mean!? I'm no expert in SLOs, but I know that availability cannot go over 100%.

... No. I don't want to hear any of your excuses about different scrape times. If the result is sometimes wrong, how can we trust it at all? Even a child knows that 102% availability is wrong.
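For the record, the 102% is measurement skew rather than magic. One plausible mechanism, with invented numbers: when the SLI's numerator and denominator are derived from counters sampled at different instants, the fresher counter can "run ahead" of the staler one.

```python
# Two monotonically increasing request counters that should describe the same
# traffic, but were sampled at different moments. The "good events" figure
# comes from a fresher evaluation than the "total events" figure, so during
# a traffic burst the ratio briefly exceeds 1. Numbers are made up.

good_events_newer = 5100   # derived from a scrape ~10s later, after ~100 more requests
total_events_older = 5000  # derived from the older scrape

sli = good_events_newer / total_events_older
print(f"{sli:.0%}")  # -> 102%
```

Aligning both series to the same evaluation timestamp (or clamping the ratio at 1) removes the artifact without changing what the SLI measures.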

Look, we clearly have a problem here. This task was voted a two-pointer, and after two weeks we're still here talking about it. This shows me that there's a skills mismatch between you and what we expect from someone in this role. I spoke to HR this morning. You have a meeting with them now, in room 101. You can leave your laptop here. Farewell, Newguy.