Why So Many Outages?
For some customers, JIRA and some other Atlassian products have been down for an entire week. Some report that Atlassian has said it could be another two weeks until the products are back up and running. That would make it worse than Roblox's three-day outage back in October 2021. Why so many outages?
We don't know the full story behind Atlassian's outage yet, but both outages seem to be run-of-the-mill engineering issues. No nefarious hacks or exploits, no third-party or cloud provider downtime.
Roblox doesn't use the public cloud, while Atlassian's outage affects only cloud customers (on-prem deployments are functioning correctly). I believe that companies like Roblox will have trouble keeping up in a cloud services world where the bar is always being raised, but these outages aren't always a cloud issue.
The length of the Meta outage in October 2021 was due in small part to remote work: after a network configuration change made the company's DNS servers unreachable, engineers couldn't access the internal tools and networks used to debug and remediate the problem. Maybe there's an opportunity to rethink infrastructure in a world where much of site reliability work is done completely remotely, which brings new failure modes of its own.
One thing companies are learning from Atlassian's radio silence on the outage: communication matters. Many customers have been left in the dark, and we'll see whether they use this as an opportunity to move some workflows off the product.