Post-mortem of today's downtime
Our data collection was down today for approximately 3.5 hours, from 5:20 AM PST to 9:00 AM PST. The rest of our services stayed up, so you could still log in and see all your historical data.
The root cause was Azure rebooting many of our instances without any notification whatsoever. This took a while to resolve because, for some time, we had no idea why our services were down. The Azure status page indicated everything was up. It also seems that, due to a bug in the Azure portal, our support plan was shown as downgraded, so we were unable to get a response directly from the Azure team in time to help us solve the issue.
We have had tons of problems with Azure in the past. While most of our infrastructure has already moved off Azure, some boxes still remained there. We are now in the process of moving off Azure completely.
Uptime is highly valuable to us. We know our customers rely on us as a system of record to monitor their boxes. Even a 3.5-hour downtime is a huge deal for us. We will continue doing our very best to ensure this never happens again.