The Highly Variable Network Performance of Amazon EC2

We recently moved a bunch of our app server nodes from Rackspace to Amazon EC2. Since then, we have noticed highly variable network performance. We monitor DripStat using DripStat itself, so we can show you exactly what this looks like.

Architecture Overview

To give an idea of how things work on our backend:

  1. Every JVM that is connected to DripStat pings our 'Data Collector' servers every minute.
  2. The data collectors verify the data and perform some light processing on it.
  3. That data is then put on a Kafka queue to be indexed and queried (a rough sketch of this send path follows the list).
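
To make steps 2 and 3 concrete, here is a minimal sketch of the hand-off using the standard Kafka Java producer. The class name, the 'metrics' topic, and the plain String payloads are illustrative placeholders, not our actual collector code.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class CollectorPipeline {

        private final Producer<String, String> producer;

        public CollectorPipeline(String kafkaBootstrapServers) {
            Properties props = new Properties();
            props.put("bootstrap.servers", kafkaBootstrapServers);
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            this.producer = new KafkaProducer<>(props);
        }

        // Called once per agent ping: verify the payload, do some light
        // processing, then hand it off to Kafka for indexing downstream.
        public void handleAgentPing(String agentId, String payload) {
            if (payload == null || payload.isEmpty()) {
                return; // step 2: drop obviously bad payloads
            }
            String processed = payload.trim(); // stand-in for the light processing
            // step 3: asynchronous send onto the Kafka queue
            producer.send(new ProducerRecord<>("metrics", agentId, processed));
        }
    }

The detail that matters for the rest of this post is that the time spent in send() shows up directly in the collector's response time, which is what the graphs below measure.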

Our Kafka nodes are located on SoftLayer in the Washington region. The EC2 nodes are located in the US East region.

Our data collectors are deployed on m3.medium instances, which Amazon lists as having 'Moderate' network performance. Since we aren't sending out huge amounts of data, this seemed fine.

Day 1

This is the response time graph of the data collectors for the first day. 

As you can see, doing a send() to Kafka barely takes any time.
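
The 'Kafka' portion of that response time is essentially the time spent inside producer.send(). If you wanted to reproduce the measurement without an APM tool, a rough sketch would look like this (the broker address, topic, and payload are placeholders):

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class TimedSend {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka.example.com:9092"); // placeholder broker
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                long start = System.nanoTime();
                // send() is asynchronous, but it can still block, for example
                // while fetching cluster metadata or when the client-side
                // buffer fills up, which is when network trouble becomes visible.
                producer.send(new ProducerRecord<>("metrics", "agent-1", "payload"));
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println("send() took " + elapsedMs + " ms");
            }
        }
    }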

Day 2

However, on day 2, things start going wrong.

Notice the huge latency to Kafka. It's taking almost a full minute! The Kafka spikes seem to come and go, indicating this could be a noisy-neighbor issue.

We check the Kafka boxes and they seem to be absolutely fine.

Finally, we bite the bullet and upgrade to an m3.xlarge instance, which is listed as having 'High' network performance. Now our Kafka timing is back to normal.

However, our boxes are now way more powerful than we need!

Conclusion

While previously we could get away with an m3.medium box, we are now forced to use an m3.xlarge, which is 4x the cost, just because of EC2's highly variable network performance.
