By Lukas Biewald, July 20, 2008

Amazon’s S3 Web Service: Our #1 cause of failure

FaceStat uses Amazon’s S3 service to store and serve most images. Today S3 went down for 7 hours. During this time, FaceStat was completely broken: users couldn’t upload or view images, which is the whole point of the site.

Amazon first mentioned “elevated error rates” at 9:05am, but our own logs indicate S3 went down 20 minutes before that. We guess “elevated error rates” is the new euphemism for “we are completely f’d and taking you down with us.”

Using Amazon’s S3 has about the same cost and complexity as hosting the images ourselves, but we had assumed that Amazon would be significantly more reliable. That now seems wrong. Here’s a chart of this month’s FaceStat downtimes by cause:

It wasn’t just us: dozens or hundreds of useful websites were down today. SmugMug, the photo-sharing site, was broken. Avatars didn’t show up in Twitter. The excellent service we use to send and receive data from clients was completely down. Scribd had no document data, rendering the site basically worthless. We decided to make the best of it and replace the homepage with an embedded Flash game. Since every feature of our site is broken, why not give our users something else to do?

How many sites will start moving off of S3 after this? SmugMug says they’re still happy with Amazon, but we’re not sure that is warranted. According to Amazon’s SLA, even if the July uptime rate is below 99%, we only get a 25% discount off future AWS costs. This comes nowhere close to compensating us for the headaches they’ve caused us. It’s astonishing that serving content off our own boxes can be more reliable than serving content off of Amazon.
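To put the 99% figure in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes uptime is measured as plain wall-clock hours over the calendar month, which is our simplification; Amazon’s SLA may define uptime differently (for example, by error rate), so treat the numbers as an estimate. The function names are ours, made up for illustration.

```python
import calendar


def monthly_uptime_percent(downtime_hours: float, year: int, month: int) -> float:
    """Wall-clock uptime percentage for one calendar month (our assumption,
    not necessarily how the SLA measures it)."""
    total_hours = calendar.monthrange(year, month)[1] * 24
    return 100.0 * (total_hours - downtime_hours) / total_hours


def downtime_budget_hours(threshold_percent: float, year: int, month: int) -> float:
    """Hours of downtime at which uptime falls to the given threshold."""
    total_hours = calendar.monthrange(year, month)[1] * 24
    return total_hours * (100.0 - threshold_percent) / 100.0


# July 2008 has 31 days, i.e. 744 hours.
uptime = monthly_uptime_percent(7, 2008, 7)      # today's 7-hour outage only
budget = downtime_budget_hours(99.0, 2008, 7)    # downtime allowed at 99%

print(f"Uptime after a 7-hour outage: {uptime:.2f}%")
print(f"Downtime that brings July to 99%: {budget:.2f} hours")
```

Under these assumptions, a single 7-hour outage uses up nearly all of the roughly 7.4 hours of downtime it takes to push July below 99%.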

Google App Engine, an Amazon Web Service competitor, appears to be just as bad if not worse.

(For other folks trying to figure out how to get off of Amazon, Park Place is interesting — a server that clones the S3 API, but hosts the content on your own machine.)

The decision whether to use a cloud service like S3 is more complicated than the hype makes it out to be. We have 150,000 images for about that many users, but they still fit on a single hard drive, so it’s not too hard to set up our own image system. If you’re growing and have to start scaling, you start needing something like S3. On the other hand, S3 is supposed to be for small folks like us who don’t want to spend time worrying about administering such a system. For it to pay off, we need reasonably high guarantees of reliability. Amazon has failed to deliver.

Lukas and Brendan