JVM Stats hiccup postmortem

Time of incident: 22:00 to 24:00 Oct 8 2015

Impact: Very Low

Details

Last night one of our cleanup scripts accidentally deleted the hostname-to-id mappings of some live JVMs from our database. The mappings were restored from a backup taken earlier that day at 03:45. Our system also immediately generated new ids for the JVMs that were connected, so no metrics were lost during the incident.

Impact

All impact was limited to the 'JVM Stats' tab. All other metrics were unaffected.

During the incident

If you chose a period longer than 1 hour, you were unable to see JVM Stats for that period. However, all other metrics were available.

After the incident

The impact here is mostly cosmetic.

Since the hostname-to-id mappings were regenerated while we were doing the restore, if you view JVM stats for an older time range that also includes a period after Oct 8 22:00, you may see the same hostname listed twice in the JVM list view.
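To illustrate why a hostname can show up twice, here is a minimal sketch. It is not our actual code: the data model, the `MAPPINGS` rows, the hostname `app-server-1`, the ids, and the `jvm_list` function are all hypothetical and exist only to show the effect of one hostname having two ids.

```python
from datetime import datetime

# Hypothetical mapping rows: (hostname, jvm_id). Illustration only.
MAPPINGS = [
    ("app-server-1", 101),  # original id, restored from the backup
    ("app-server-1", 202),  # new id generated during the incident
]

# Hypothetical stats samples: (jvm_id, timestamp)
STATS = [
    (101, datetime(2015, 10, 8, 21, 0)),  # reported before the incident
    (202, datetime(2015, 10, 8, 23, 0)),  # reported during the incident
]


def jvm_list(stats, start, end):
    """Return one (hostname, jvm_id) entry per id with stats in [start, end)."""
    ids_in_range = {jvm_id for jvm_id, ts in stats if start <= ts < end}
    return sorted(
        (host, jvm_id) for host, jvm_id in MAPPINGS if jvm_id in ids_in_range
    )


# A range that ends before the incident lists the hostname once.
before = jvm_list(STATS, datetime(2015, 10, 8, 20, 0), datetime(2015, 10, 8, 22, 0))

# A range that spans the incident lists the hostname twice, once per id.
spanning = jvm_list(STATS, datetime(2015, 10, 8, 20, 0), datetime(2015, 10, 9, 0, 0))
```

In this sketch the list view has one entry per id, so a time range that covers both ids repeats the hostname even though it is the same machine.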

If you connected a brand new host, one that you had never connected to DripStat before, between 03:45 (time of backup) and 22:00 on Oct 8, then you may not see JVM metrics for the few hours before 24:00. However, all other metrics will be fully visible and correct.

Moving forward

Instead of daily backups, we have now switched to hourly backups and automated our restore procedure, to reduce both recovery time and the number of impacted JVMs. We will also be more careful with our scripts that delete old data and double-check them.
