JVM Stats hiccup postmortem
Time of incident: 22:00 to 24:00 Oct 8 2015
Impact: Very Low
Last night, one of our cleanup scripts accidentally deleted the hostname-to-id mappings of some live JVMs from our database. The mappings were restored from a backup taken earlier that day at 03:45. Our system also immediately generated new ids for the JVMs that were connected, so no metrics were lost during the incident.
The impact was limited to the 'JVM Stats' tab. All other metrics were unaffected.
During the incident
If you chose a period longer than 1 hour, you were unable to see JVM Stats for that period. All other metrics remained available.
After the incident
The impact here is mostly cosmetic.
Since the host-to-id mappings were regenerated during the restore, if you view JVM stats for an older time range that also includes a period after 22:00 on Oct 8, you may see the same hostname listed twice in the JVM list view.
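As a rough illustration of why a hostname can appear twice, the Python sketch below assumes the list view shows one row per JVM id that has data in the selected time range. The ids, the hostname, and the hours are made up for the example, and the logic is a guess at the general mechanism, not DripStat's actual implementation.

```python
# Hypothetical mapping: two ids point at the same hostname.
# 101 stands for the id restored from backup, 202 for the id
# generated while the original mapping was missing.
HOSTNAMES = {101: "app-server-1", 202: "app-server-1"}

# Hypothetical metrics: (jvm_id, hour of Oct 8) pairs that have data.
METRICS = [(101, 20), (101, 21), (202, 22), (202, 23)]


def jvm_list_view(start_hour, end_hour):
    """Return one hostname per id that has data in [start_hour, end_hour)."""
    ids = sorted({jvm_id for jvm_id, hour in METRICS
                  if start_hour <= hour < end_hour})
    return [HOSTNAMES[jvm_id] for jvm_id in ids]


print(jvm_list_view(20, 22))  # ['app-server-1']
print(jvm_list_view(20, 24))  # ['app-server-1', 'app-server-1']
```

A range that ends before 22:00 touches only one id, so the hostname appears once. A range that spans 22:00 touches both ids, so the same hostname is listed twice.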
If you connected a brand-new host, one that had never been connected to DripStat before, between 03:45 (the time of the backup) and 22:00 on Oct 8, you may not see JVM metrics for that host for the few hours before 24:00. All other metrics will be fully visible and correct.
We have switched from daily to hourly backups and automated our restore procedure, which reduces both recovery time and the number of JVMs affected. We will also review and double-check our scripts that delete old data more carefully.