Visualizing AWS Storage with Real-Time Latency Spectrograms

btown · on Jan 27, 2015

> Every few seconds one of the writes takes forever [~5s]. You can notice the long periods of inactivity, and after that a green dot at the right of the chart: that’s our slow call. What is likely happening is: the local cache saturates and when that happens the application has to wait until the local data is pushed to the remote volume. Boy, you sure don’t want one of your critical code paths to hit one of these slow calls.

I'm surprised that there's no asynchronous way for the FS cache to flush itself, e.g. when it reaches 50% capacity, and to rate-limit incoming requests if it's too full. The idea that an FS cache is so dumb that it can't do anything while it's flushing its entire self is a bit scary - I'd expect that circular buffers and granular locking mechanisms could be used to great effect here. Is this kernel code? Userspace code? Is there research into this? Fundamental tradeoffs that I'm missing?
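To make the idea concrete, here is a minimal Python sketch of a write-back cache that starts flushing in the background at a watermark and only blocks a writer when the cache is completely full. It is a toy model, not how any kernel or EBS client actually works: the class name `WriteBackCache`, the 50% threshold, and the `flush_delay` standing in for a remote write are all assumptions made for illustration.

```python
import threading
import time
from collections import deque


class WriteBackCache:
    """Toy write-back cache with an early background flush (illustrative only)."""

    def __init__(self, capacity=8, flush_at=0.5, flush_delay=0.001):
        self.capacity = capacity
        # Start flushing once this many blocks are dirty (at least one).
        self.flush_threshold = max(1, int(capacity * flush_at))
        self.flush_delay = flush_delay   # pretend cost of one remote write
        self.dirty = deque()             # blocks not yet on the "remote volume"
        self.flushed = []                # stands in for the remote volume
        self.closed = False
        self.cond = threading.Condition()
        self.flusher = threading.Thread(target=self._flush_loop, daemon=True)
        self.flusher.start()

    def write(self, block):
        with self.cond:
            # Back-pressure: wait for a single free slot, not for a full flush.
            while len(self.dirty) >= self.capacity:
                self.cond.wait()
            self.dirty.append(block)
            if len(self.dirty) >= self.flush_threshold:
                self.cond.notify_all()   # wake the flusher before the cache fills

    def _flush_loop(self):
        while True:
            with self.cond:
                while len(self.dirty) < self.flush_threshold and not self.closed:
                    self.cond.wait()
                if not self.dirty:
                    return               # only reachable once closed and drained
                block = self.dirty[0]    # keep the slot occupied until written
            time.sleep(self.flush_delay) # slow I/O happens outside the lock
            with self.cond:
                self.dirty.popleft()
                self.flushed.append(block)
                self.cond.notify_all()   # a slot is free; unblock one writer

    def close(self):
        with self.cond:
            self.closed = True
            self.cond.notify_all()
        self.flusher.join()              # drains whatever is still dirty
```

In this model a writer that hits a full cache waits for roughly one `flush_delay`, because the flusher frees slots one block at a time instead of draining everything in a single pass.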

mrjones · on Jan 27, 2015

It would be interesting to see the client/benchmarking program. It almost sounds like it could be single-threaded ... which would mean the delay is an artifact of the benchmark only having one op outstanding, rather than something inherent in the storage layer.
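A small simulation can show why the number of outstanding operations matters. This is a hedged sketch, not the benchmark from the post: the service times are made up, and `completion_times` and `longest_gap` are hypothetical helpers that model a fixed number of in-flight slots.

```python
import heapq


def completion_times(service_times, concurrency):
    """Finish time of each op when at most `concurrency` ops are in flight
    and every op starts as soon as a slot becomes free."""
    free_at = [0.0] * concurrency        # time at which each slot goes idle
    heapq.heapify(free_at)
    done = []
    for cost in service_times:
        start = heapq.heappop(free_at)   # earliest available slot
        finish = start + cost
        done.append(finish)
        heapq.heappush(free_at, finish)
    return sorted(done)


def longest_gap(times):
    """Longest stretch during which no op completes."""
    return max(b - a for a, b in zip([0.0] + times, times))


# Assumed workload: fast 10 ms writes with a single 5 s stall in the middle.
workload = [0.01] * 10 + [5.0] + [0.01] * 1000
single = longest_gap(completion_times(workload, concurrency=1))
double = longest_gap(completion_times(workload, concurrency=2))
```

With one slot, the stalled write produces a silent period of about five seconds; with two slots, the other slot keeps completing fast writes during the stall, so the visible gap stays near 10 ms.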

btown · on Jan 28, 2015

Even with one client thread, though, shouldn't there be a background OS thread maintaining the FS cache and flushing parts of it? I don't think it should block the client just because it decided it was too full.

huhtenberg · on Jan 27, 2015

That's clever and well executed. Wrong palette though :P

Red implies problems, green implies "normality", but here this association is misplaced. Perhaps a typical "fire" palette would be better - from dark brown to red to orange to yellow and, ultimately, to white for the extremes.
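As a rough illustration of such a palette, the sketch below maps a value onto six xterm 256-color codes running from brown through red, orange and yellow to white. The specific codes and the helper names `fire_color` and `cell` are assumptions chosen for this example; it is plain Python, not the code of the sysdig script.

```python
# Assumed "fire" ramp using xterm 256-color indices:
# brown, dark red, red, orange, yellow, white.
FIRE_PALETTE = [94, 124, 196, 208, 226, 231]


def fire_color(value, max_value):
    """Pick a palette entry for `value` on a linear scale from 0 to `max_value`."""
    if max_value <= 0 or value <= 0:
        return FIRE_PALETTE[0]
    frac = min(value / max_value, 1.0)   # clamp anything above the maximum
    return FIRE_PALETTE[round(frac * (len(FIRE_PALETTE) - 1))]


def cell(value, max_value):
    """Two spaces drawn with the chosen background color, then a reset."""
    return f"\x1b[48;5;{fire_color(value, max_value)}m  \x1b[0m"
```

Printing `"".join(cell(v, 100) for v in range(0, 101, 10))` in a 256-color terminal shows the whole ramp.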

degio · on Jan 27, 2015

OP here. Unfortunately the ansi palette is pretty limited so I didn't have a lot of flexibility in the color choice. That said, this can definitely be improved. I can work on it if people find it useful.

In the meantime, it's very easy to tune the colors on your own: just modify this line https://github.com/draios/sysdig/blob/master/userspace/sysdi... in your local version of the script, using this as a reference http://misc.flogisoft.com/_media/bash/colors_format/256_colo....

chrisan · on Jan 27, 2015

> Unfortunately the ansi palette is pretty limited so I didn't have a lot of flexibility in the color choice.

I believe the issue raised isn't the palette range itself, but rather that it is the reverse of what is typically expected. The current red area "should" be green, indicating there are many calls in the fast region, while the current trailing green blocks "should" be red, indicating problems.

This convention of green=good and red=bad, I believe, stems from triage tags: http://en.wikipedia.org/wiki/Triage_tag

Sometimes white is used below green as 'dismiss/not an issue'

morpher · on Jan 28, 2015

Here, "good" is on the left and "bad" is on the right. The color is orthogonal (it gives the number of operations with latencies in a given bucket). For example, a red square on the right side of the output would have definitely been "bad".
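To spell out what one row of the chart encodes, here is a minimal sketch that sorts latencies into logarithmic buckets and counts them. The horizontal position is the bucket and the count is what the color would represent. One bucket per power of ten in nanoseconds is an assumption for illustration, and `latency_row` is a hypothetical helper, not the tool's actual code.

```python
import math
from collections import Counter


def latency_row(latencies_ns):
    """Count operations per latency bucket for one time slice.

    Bucket k holds latencies in [10**k, 10**(k+1)) nanoseconds, so larger k
    means further to the right (slower) on the chart.
    """
    counts = Counter()
    for ns in latencies_ns:
        bucket = int(math.log10(max(ns, 1)))   # clamp to 1 ns to avoid log(0)
        counts[bucket] += 1
    return counts


# Two microsecond-range calls and one call that took about five seconds.
row = latency_row([1_500, 2_500, 5_000_000_000])
```

Here `row[3]` is the count for the microsecond bucket and `row[9]` is the count for the seconds bucket; a high count in bucket 9 would be the "red square on the right side".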

bcantrill · on Jan 28, 2015

Neat! This is definitely a step forward -- and thanks for the shout-out to our (that is, Sun's and Joyent's) prior work here. Tempted to also incorporate this into agghist and aggpack, the new DTrace actions I added for this kind of functionality.[1] Anyway, good stuff -- it's always good to see new visualizations of system behavior!

[1] http://dtrace.org/blogs/bmc/2013/11/10/agghist-aggzoom-and-a...

andrewguenther · on Jan 28, 2015

It would be interesting to run these tests on different instance sizes, specifically for data on the instance store. The larger the instance, the fewer neighbors you have to worry about spending those precious IOPS.

As for SSD vs Magnetic EBS, I can't say that I'm surprised. I'd assume that EBS implements some sort of cache in between you and your actual disk on the other side of the network so that the writes can return even faster. Try doing this again with reads and I'd bet you'd get some interesting results.

Edit: Also, did you pre-warm your EBS volumes? http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-prewa...

degio · on Jan 28, 2015

Yes, I did pre-warm the volumes before using them.

And yes, there are several interesting workloads that I didn't test, including read-only and read+write. It's potential material for another blog post.

robszumski · on Jan 27, 2015

Nice job on the graphics for the post. Thanks for taking the time to animate and annotate well.

amulyasharma · on Jan 28, 2015

In a world of provisioned IOPS, with applications demanding faster and faster storage, this tool is handy for a devops guy who wants to find out how many IOPS are actually being used and how the storage is performing, and to decide whether it needs an upgrade.

outputlogic · on Jan 28, 2015

Calling this visualization a heatmap would be more appropriate than a spectrogram.

digikata · on Jan 27, 2015

I really want to lop off the 'ns' and '10 sec' divisions of all the charts and expand the resolution...

simonebrunozzi · on Jan 27, 2015

Well done!

armandomonaco · on Jan 27, 2015

cool project