Postmortem of Dec 27 incident

Impact: Low

Time of Incident: 27 Dec 2015 at 10:15 AM PST


At 10:15 AM PST on Dec 27, we removed one of our sets of streaming analytics nodes for maintenance and replaced it with a new set. What we didn't realize at the time was that the nodes being removed had been running into firewall rule issues and had not been persisting their data to deep storage.

As a result, although these nodes could show data as it came in, that data was not being persisted anywhere, so it was lost once we terminated them. The affected data covers Dec 24 at 4:15 AM through Dec 27 at 10:15 AM.
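
For reference, the length of the affected window can be worked out from the two timestamps above. This is only a small arithmetic sketch; it assumes both timestamps are in the same time zone (PST), which is stated for the Dec 27 time but is an assumption for the Dec 24 time. The variable names are illustrative.

```python
from datetime import datetime

# Window boundaries as stated in this postmortem. Both are treated as
# PST wall-clock times (an assumption for the Dec 24 timestamp).
loss_start = datetime(2015, 12, 24, 4, 15)
loss_end = datetime(2015, 12, 27, 10, 15)

# Subtracting two datetimes yields a timedelta (days + leftover seconds).
window = loss_end - loss_start
hours = window.total_seconds() / 3600

print(f"{window.days} days, {window.seconds // 3600} hours ({hours:.0f} hours total)")
# prints "3 days, 6 hours (78 hours total)"
```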

We are classifying this incident as low impact because the data was available at the time it was coming in. If you logged in on, say, Dec 25, you would have seen all your data. What was lost is the 'historical' data for that window of roughly 3 days.

We have since automated some of our processes so that the firewall rules are always set for these types of nodes, and an issue like this cannot happen again. We truly apologize for any inconvenience this has caused. We know performance data is important to you, and ensuring that no data point is ever lost is always our top priority.
