I’m just so exhausted these days. We have formal SLAs, but it’s not like they’re ever followed. After all, Customer X needs to be notified within 5 minutes of any anomalous event in their cluster, and Customer Y is our biggest customer, so we give them the white-glove treatment.
Yadda yadda, blah blah, so on and so forth: almost every customer has some exception to or deviation from the standard SLAs.
I was hired on as an SRE, but I’m just a professional dashboard starer at this point. The number of times I’ve been alerted in the middle of the night because CPU ran high for 5 minutes is too damn high. Just so I can apologize to Mr. Customer for the teensy slowdown they maybe had during that window.
If I try to get us back to fundamentals and suggest we should only alert on actual impact, not short-lived anomalies, there’s some surface-level agreement, but everyone seems to land on “well, we might miss something, so we need to keep it.”
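For what it’s worth, the rule I keep proposing looks roughly like this. A minimal Python sketch, not our actual tooling; the metric names, thresholds, and window are all invented for illustration:

```python
# Sketch: page only on sustained, customer-visible impact, not short blips.
from dataclasses import dataclass

@dataclass
class Sample:
    timestamp: float       # epoch seconds
    error_rate: float      # fraction of failed requests, 0.0-1.0
    p99_latency_ms: float  # client-observed latency

def should_page(samples: list[Sample], window_s: float = 900,
                error_budget: float = 0.01, latency_slo_ms: float = 500) -> bool:
    """Return True only if impact persists across most of the window."""
    if not samples:
        return False
    latest = max(s.timestamp for s in samples)
    recent = [s for s in samples if latest - s.timestamp <= window_s]
    breaching = [s for s in recent
                 if s.error_rate > error_budget or s.p99_latency_ms > latency_slo_ms]
    # A 5-minute CPU spike that never touches error rate or latency never pages anyone.
    return len(breaching) >= 0.8 * len(recent)
```

The point being: the paging decision keys off what the customer actually experiences, not off a CPU graph.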
It’s like we’re trying to prevent outages by monitoring for every potential issue rather than actually making the system more robust and automatable.
How do I convince these people that this isn’t sustainable? That trying to “catch” every incident before it happens is a fool’s errand? It’s like that chart about the “war on drugs” showing exponential cost growth as you try to prevent ALL drug usage (which is impossible). Yet this tech company seems to think we should be trying to prevent all outages with excessive monitoring.
And that doesn’t even get into the bonkers agreements we make with customers, like committing to deep-dive research into why two different environments have response times that differ by 1 ms.
Or the agreements that force us to complete customer-provided training without anyone checking how much training we’ve already committed to. It’s entirely normal for us to do 3-4 rounds of HIPAA/PCI/compliance training when everyone else in the org only has to do one.
I’m at a point where I’m considering moving on. This job just isn’t sustainable and there’s no interest in the org to make it sustainable.
But perhaps one of y’all has managed to fix something similar in your org with a few key conversations and some effort? What else could I try as a sort of final “Hail Mary” before looking to greener pastures?
Sure, if it were a normal service and not a distributed database that takes days to scale. Days. It’s not “add one node and we’re good.” It’s add node, migrate data, add node, migrate data… And in many cases we have explicit instructions NOT to scale the customer, because they can’t afford the larger cluster.
Also, would you auto-scale for a 5-minute blip that resolves on its own and doesn’t consistently recur? I certainly wouldn’t. The customer might not be able to pay for the size we’d put them on.
Our customers can simultaneously demand that we respond to every alert AND that we never scale their cluster. Whose fuckin’ idea this was, I’ve no clue.
No. That’s reading far more into my statement than I intended. The reliability is indeed there: it’s VERY unlikely our managed database goes down due to a technology issue within our control. When it does go down, it’s usually operator error. If alerting were limited to operator errors and things actually impacting end users, my job would be a dream!
Automation is somewhat there, but there are a few stakeholders who insist on human-validated steps. So while I have an Ansible playbook for most issues, running that playbook end to end takes hours.
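To make “takes hours” concrete, the effective shape is something like this. A toy Python sketch standing in for the Ansible flow; the step names are invented, and in reality the wait is a Slack ping or change ticket, not input():

```python
# Illustration only: every "human-validated" step stops the automation and waits.
import time

STEPS = [
    "snapshot cluster state",
    "drain traffic from node",
    "apply remediation",
    "re-enable traffic",
    "verify replication lag",
]

def run_step(name: str) -> None:
    print(f"[auto] {name} ... done")  # the automated part takes seconds

def wait_for_human(name: str) -> None:
    # In practice the wait here is measured in tens of minutes per step.
    input(f"[human] confirm '{name}' looks good, then press Enter: ")

def run_playbook() -> None:
    start = time.time()
    for step in STEPS:
        run_step(step)
        wait_for_human(step)          # this is where the hours go
    print(f"total wall-clock time: {time.time() - start:.0f}s")

if __name__ == "__main__":
    run_playbook()
```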
Sure, if it were technical. But this is largely not a technical problem, contrary to what you assumed. The issue is that someone with power gets to say we must follow unreasonable customer requests to the letter, even when those requests run counter to our sustainability.