Post-mortem of yesterday's outage
Yesterday we were hit by multiple issues at once, which resulted in an outage.
1. Rackspace had an outage in their Northern Virginia region.
2. We were being DDoS'd.
3. The hypervisor Rackspace deployed our cloud server on was running into issues and kept killing our Java process.
We were able to diagnose issues 2 and 3 only after Rackspace recovered from their long load balancer outage. The fact that all three happened at the same time did not help either.
Why didn’t you connect your servers to a load balancer in a different region?
Rackspace only allows connecting servers to a load balancer in the same region.
Why not quickly create a giant box and re-route DNS to point to it instead of the load balancer?
All our users and agents connect to DripStat via HTTPS. For security reasons, the setup scripts for creating application boxes do not put our SSL certificate on those machines. The certificate lives directly on the load balancer, which terminates SSL. So even if we had provisioned a new box, clients would not have been able to connect to it.
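To illustrate the architecture described above (this is a generic sketch, not Rackspace's actual configuration): in an SSL-terminating setup, only the front-end proxy holds the certificate and private key, and it forwards decrypted traffic to backends over plain HTTP. The hostnames and paths below are hypothetical.

```
# Hypothetical nginx config sketch: SSL terminates at the load balancer.
# Backends never see the certificate, so pointing DNS straight at a
# backend box breaks HTTPS for every client.
upstream app_backends {
    server 10.0.0.11:8080;   # application box, plain HTTP only
    server 10.0.0.12:8080;
}

server {
    listen 443 ssl;
    server_name example-collector.dripstat.com;   # hypothetical hostname

    ssl_certificate     /etc/ssl/dripstat.crt;    # present only on the LB
    ssl_certificate_key /etc/ssl/dripstat.key;

    location / {
        proxy_pass http://app_backends;           # decrypted traffic onward
    }
}
```

The trade-off is exactly the one the post describes: keeping the key material off application boxes reduces exposure, but it also means a backend cannot stand in for the load balancer on short notice.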
Since Rackspace kept telling us the issue would be resolved in 'just a few minutes', we judged that writing new provisioning scripts would take longer than simply waiting for them to fix it.
Preventing further incidents like this
We have now ensured that another occurrence of the above does not bring down our service. All our data collector nodes will be deployed across multiple regions. We have also put connection throttling in place to mitigate any further DDoS attacks.
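The post does not describe how the throttling works, but a common approach is a token bucket: each source is granted a fixed budget of connections per interval, and attempts beyond that budget are rejected rather than queued. A minimal sketch in Java (the class name and parameters are hypothetical, not DripStat's actual implementation):

```java
// Hypothetical token-bucket connection throttler: allows up to
// `capacity` connections per refill interval; excess attempts fail fast.
public class ConnectionThrottler {
    private final long capacity;
    private final long refillIntervalMillis;
    private double tokens;        // fractional tokens accumulate between calls
    private long lastRefill;

    public ConnectionThrottler(long capacity, long refillIntervalMillis) {
        this.capacity = capacity;
        this.refillIntervalMillis = refillIntervalMillis;
        this.tokens = capacity;   // start with a full bucket
        this.lastRefill = System.currentTimeMillis();
    }

    // Returns true if the connection is allowed, false if it should be dropped.
    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        // Refill proportionally to elapsed time, capped at bucket capacity.
        double refilled = (now - lastRefill) * ((double) capacity / refillIntervalMillis);
        tokens = Math.min(capacity, tokens + refilled);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Allow 5 connections per second; a burst of 100 immediate attempts
        // only drains the initial bucket.
        ConnectionThrottler throttler = new ConnectionThrottler(5, 1000);
        int accepted = 0;
        for (int i = 0; i < 100; i++) {
            if (throttler.tryAcquire()) accepted++;
        }
        System.out.println("accepted=" + accepted);
    }
}
```

In practice one bucket would be kept per source IP (for example in a concurrent map), so a single flooding source exhausts only its own budget without starving legitimate agents.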
We understand that as a monitoring service, we ourselves simply cannot go down. DripStat is a critical piece of many corporations' infrastructure, and they rely on us to ensure smooth functioning of their businesses. Downtime is not something we take lightly. We have put measures in place to ensure incidents like the above do not affect our service in the future, and we will continue to invest in reliability.