Amazon SimpleDB thoughts

Amazon recently announced their latest web service, SimpleDB, to a roar of buzz and hype. I finally got a chance to sit down and read through the docs and blog posts, and as with the Facebook Data Store API, I’ve written up my thoughts.

Amazon’s docs do a pretty good job of describing SimpleDB, so I won’t try to reproduce them. Instead, I’ll focus on observations, and I’ll emphasize a few important points that are buried deep in the docs.

The executive summary is: I like it. It’s solid, straightforward, and eminently useful. Sure, it’s limited. It includes design decisions that clearly simplified the implementation at the cost of functionality and usability. Still, as a result of those decisions, SimpleDB has the potential to be very robust, scalable, and performant.

With SimpleDB alongside S3 and EC2, Amazon’s web services look more and more like the Unix philosophy: small, simple tools that do one job, do it well, and fit together in ways that complement each other. Very, very cool.

Then again, I’m a sucker for anything based on tuplespaces…

Contents

Introduction
Tuplespaces!
Queries
Attributes and ordering
Scaling and Dynamo
Usage-based pricing

Introduction

SimpleDB is a simple, schemaless structured storage engine. It stores items, which are bags of key/value pairs. Keys and values are always strings; primitive data types like integers and floats are not natively supported. Developers choose a unique string name for each item at creation time.

The primary operations are Put, Get, and Delete, which are self-explanatory, and Query, which accepts attribute predicates and boolean operators in a custom string query format and returns all matching items.

Items are partitioned into domains. An item’s name must only be unique within its domain. Similarly, queries only return items from a single domain.
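To make the data model concrete, here is a minimal in-memory sketch in Python. It is a toy, not Amazon's API: the class name `ToyDomain` and its method signatures are made up for illustration, and it assumes an attribute name can hold more than one value, as "bag" suggests.

```python
class ToyDomain:
    """In-memory stand-in for a single domain (illustration only)."""

    def __init__(self):
        # item name -> attribute name -> set of string values
        self.items = {}

    def put(self, item_name, pairs):
        """Add (name, value) pairs to an item, creating the item if needed."""
        for name, value in pairs:
            if not isinstance(name, str) or not isinstance(value, str):
                raise TypeError("attribute names and values must be strings")
        item = self.items.setdefault(item_name, {})
        for name, value in pairs:
            item.setdefault(name, set()).add(value)

    def get(self, item_name):
        """Return a copy of the item's attributes, or {} if it is missing."""
        item = self.items.get(item_name, {})
        return {name: set(values) for name, values in item.items()}

    def delete(self, item_name):
        """Remove the whole item; deleting a missing item is a no-op."""
        self.items.pop(item_name, None)


# Item names only need to be unique within a domain, so the same name
# can appear in two domains without conflict.
store = {"clothes": ToyDomain(), "paint": ToyDomain()}
store["clothes"].put("item1", [("Color", "Blue"), ("Price", "14.99")])
store["paint"].put("item1", [("Color", "Red")])
```

Numbers have to be passed as strings here, just as the real service requires; `put("item1", [("Price", 14.99)])` raises a `TypeError` in this sketch.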

Tuplespaces!

One of the coolest things about SimpleDB is that its interface is pure tuplespaces, also known as Linda. (Thanks to Nelson, who was one of the first people to point out this huge piece of SimpleDB’s provenance.)

The tuplespaces concept never spread too far beyond research, but I’ve always loved it, and I’ve had “build tuplespaces on top of a DHT” on my list of project ideas for years. As long as I get to play with it, I don’t care if Amazon beat me to the punch. After all, they can afford a few more servers and sysadmins than I can.

There are at least a couple noticeable differences between SimpleDB and standard tuplespaces interfaces. First, most tuplespace implementations only support equals and wildcard query operators. SimpleDB offers inequality, prefix, and boolean operators, which it probably supports with extra secondary indices.
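For readers who haven't seen tuplespaces, the equals-and-wildcard matching mentioned above can be sketched in a few lines of Python. This is a rough illustration of the general idea, not any particular Linda implementation; the names `WILDCARD`, `matches`, and `read` are my own.

```python
WILDCARD = None  # stands for "any value" in a template field


def matches(template, tup):
    """A template matches a tuple of the same length, field by field."""
    if len(template) != len(tup):
        return False
    return all(t is WILDCARD or t == v for t, v in zip(template, tup))


def read(space, template):
    """Return every tuple in the space that matches the template."""
    return [tup for tup in space if matches(template, tup)]


space = [
    ("shirt", "Blue", "14.99"),
    ("hat", "Blue", "09.99"),
    ("scarf", "Red", "05.00"),
]

blue_things = read(space, (WILDCARD, "Blue", WILDCARD))
```

A template like this can only say "equal" or "don't care" per field. There's no way to ask for a price below some bound or a name with a given prefix, which is exactly what SimpleDB's extra operators add.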

Second, most tuplespaces implementations offer at least limited support for transactions, in the form of an atomic “update” operation that can remove existing tuples and add new ones. SimpleDB has no such operation, nor any other support for transactions.

Queries

SimpleDB uses a minimal, string-based query language. It’s best described by example. Here’s one from the docs that will return all blue items that cost less than 14.99:

"['Price' < '14.99'] intersection ['Color' = 'Blue']"

It’s interesting that Amazon went with a custom, proprietary query language, as opposed to a subset of SQL like Facebook’s FQL. Again, it almost certainly made it easier for them to develop, but it raises the learning curve for developers, not to mention contributing to lock-in somewhat.
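To show what the example query means, here is a small Python sketch that evaluates it by hand as a set intersection. It does not parse the query string, and it says nothing about how Amazon executes queries; the `select` helper, the sample items, and the single value per attribute are all simplifications of mine.

```python
import operator

# Comparison operators applied to plain strings.
OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


def select(items, attribute, op, operand):
    """Return the names of items whose attribute satisfies the predicate."""
    compare = OPS[op]
    return {
        name
        for name, attrs in items.items()
        if attribute in attrs and compare(attrs[attribute], operand)
    }


items = {
    "shirt": {"Price": "09.99", "Color": "Blue"},
    "hat": {"Price": "14.99", "Color": "Blue"},
    "scarf": {"Price": "05.00", "Color": "Red"},
}

# ['Price' < '14.99'] intersection ['Color' = 'Blue']
result = select(items, "Price", "<", "14.99") & select(items, "Color", "=", "Blue")
```

Each bracketed predicate yields a set of item names, and `intersection` combines them. Note that the prices are compared as strings, which only works in this sketch because they share the same zero-padded width.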

Luckily, since all attribute values are strings, they avoid the issue of serializing non-string values and operands. I’ve used a decent number of ORMs and database libraries, and this always tends to be a wart. It can definitely be done safely, and somewhat cleanly, but it’s always awkward.

Apart from the query language, there’s no support for joins, full text search, or sorting query results. I doubt I’d miss joins, but I’d definitely miss full text search and sorting. I expect that sorting alone will be one of the largest pain points for developers who try to use SimpleDB as a replacement for a standard RDBMS.

Finally, separate from the utilization-based pricing, SimpleDB imposes a hard deadline on query execution time. If a query takes longer than 5 seconds, it’s cut off. Tough love, but reasonable.

Attributes and ordering

Like in tuplespaces, SimpleDB attribute names and values are untyped strings, so comparison is always lexicographic. That simplicity is endearing and attractive at first glance, and it almost certainly made SimpleDB easier for Amazon to develop. Unfortunately, it causes problems for numbers, dates, and composite types like points, which aren’t compared lexicographically.

To their credit, Amazon does explain how to zero-pad numbers and offset negative numbers, and their libraries include code that handles these operations. Still, no matter how you look at it, jumping through those kinds of hoops is ugly and awkward, for both data access and presentation. Worse, developers will need to write custom code to map to/from lexicographic ordering for any non-numeric types, such as points and dates. It doesn’t help that the SimpleDB docs themselves have lots of examples of numeric comparisons that aren’t offset or zero-padded.
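Here is roughly what that hoop-jumping looks like in Python. This is my own sketch of the zero-padding and offset idea, not code from Amazon's libraries; the function names, the offset of 10**9, and the width of 10 digits are arbitrary choices for illustration.

```python
from datetime import datetime, timezone

# The underlying problem: strings compare character by character.
naive = "9.99" < "14.99"  # False, because "9" sorts after "1"


def encode_int(value, offset=10**9, width=10):
    """Shift an integer to be non-negative, then zero-pad to a fixed width."""
    shifted = value + offset
    if not 0 <= shifted < 10**width:
        raise ValueError("value out of range for this offset and width")
    return str(shifted).zfill(width)


def decode_int(text, offset=10**9):
    """Reverse encode_int."""
    return int(text) - offset


def encode_date(moment):
    """Fixed-width UTC timestamp whose string order matches time order."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


values = [12, -5, 3]
encoded = sorted(encode_int(v) for v in values)
decoded = [decode_int(text) for text in encoded]  # back in numeric order

earlier = encode_date(datetime(2007, 12, 14, tzinfo=timezone.utc))
later = encode_date(datetime(2008, 1, 2, tzinfo=timezone.utc))
```

Every reader and writer of an attribute has to agree on the offset and width, and the stored strings need decoding before they can be shown to a user, which is where the awkwardness comes from.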

Apart from ordering, attribute values are limited to 1024 characters, which is way too low. I can understand that they want to encourage developers to use S3 for binary data, but articles, comments, and other text data are often much larger than 1024 characters. It would be infeasible for many apps to store and access that data separately from the rest of their data, which could prevent a number of applications from using SimpleDB as their only structured storage engine.

Finally, it’s worth noting that all strings are UTF-8, including domains, item names, attribute names, and values. It’s what you’d expect, but it took me a fairly long time to find that tidbit in the docs.

Scaling and Dynamo

SimpleDB originally seemed to be based on Amazon’s Dynamo, a distributed hash table that’s highly replicated and available in exchange for a relatively low churn tolerance. That link is from Amazon CTO Werner Vogels’ blog, where he said:

Let me emphasize the internal technology part before it gets misunderstood: Dynamo is not directly exposed externally as a web service; however, Dynamo and similar Amazon technologies are used to power parts of our Amazon Web Services, such as S3.

Dynamo’s key characteristic is that it really is just a DHT, so its only operations are put, get, and delete. In particular, it doesn’t provide secondary indices. So, if SimpleDB was based on Dynamo, how would SimpleDB queries be executed? Maybe they’d use a modified full text index…but then you’d expect SimpleDB to offer full text search, which it doesn’t. Hmm.

One useful hint is that SimpleDB only guarantees eventual consistency. (Thanks to Ken for pointing this out.) Evidently, items and indices are replicated, and the replicas are updated asynchronously. That’s a big, big caveat for developers, but it helps us start to reverse engineer the architecture of their storage and indexing engine.
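To see why that caveat matters, here is a toy Python model of asynchronous replication. It is purely illustrative and makes no claim about Amazon's actual architecture; `ToyReplicatedStore`, the three replicas, and the explicit `propagate` step are all assumptions of mine.

```python
class ToyReplicatedStore:
    """Writes land on one replica first and reach the others later."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # (replica index, key, value) not yet applied

    def put(self, key, value):
        # The write is acknowledged as soon as the first replica has it.
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def get(self, key, replica_index):
        # A read may be served by any replica, including a stale one.
        return self.replicas[replica_index].get(key)

    def propagate(self):
        # Stand-in for background replication catching up.
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()


store = ToyReplicatedStore()
store.put("item1", "Blue")
fresh = store.get("item1", replica_index=0)  # sees the new value
stale = store.get("item1", replica_index=2)  # does not see it yet
store.propagate()
caught_up = store.get("item1", replica_index=2)
```

In this model a read right after a successful put can still miss the write, depending on which replica answers. Once replication catches up, every replica agrees.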

Personally, I wonder if SimpleDB’s indexing is based on a conventional full text index that’s augmented to support structured data, similar to Google Base or eBay’s search engine. If so, I’m sure Amazon has its reasons for not (yet) offering full text search over SimpleDB domains.

Usage-based pricing

The pricing model for SimpleDB is very interesting. Similar to S3 and EC2, SimpleDB charges for bandwidth and usage. However, SimpleDB also charges for machine utilization, measured in normalized CPU-hours.

This makes sense from a cost modeling perspective, but it’s surprisingly hard to implement in the storage engine. The particularly impressive part is that SimpleDB includes machine utilization in the response to every API call. Wow. Measuring utilization can be hard in general, but it’s even harder in real time.

13 thoughts on “Amazon SimpleDB thoughts”

  1. Good post. It helped clear up my doubts about SimpleDB.

    I think that the big value of SimpleDB is the concept (a limitlessly scalable DB accessible everywhere by webservices) not the implementation. I believe that soon other big players will come with similar proposals.

  2. I can’t see myself using the service anytime soon. The limitations it imposes feel way too strong to me. The lack of specific datatypes makes me feel cold, like it did going from Java to PHP. No ordering? What’s up with that? If they can do EC2 and S3 they sure could have managed to implement ordering. This would make many situations very frustrating to code. If they don’t want to stick to a SQL subset, I can live with that. But come on. I know they are calling it SimpleDB for a reason, but I can’t see modern web apps being able to utilize this as their primary database.

    Could somebody point out to me some obvious applications of SimpleDB?

  3. JB says:

    Very useful analysis – thanks. One aspect of sdb which I haven’t seen discussed is backup/data portability. What tools, if any, will Amazon provide for this? What happens when a site built on sdb hits it big and decides they can do a better job bringing the data in-house? What about sites that just want backup in case AWS suffers a catastrophic failure?

  4. @Sam: SimpleDB should not be confused with a “regular” database. Firstly it’s in the cloud and thus less responsive (it just takes longer for the signals to travel). Secondly its API/data model is different from the relational data model. So if you’re used to mass data handling with single SQL statements, then SimpleDB is different. It just sports a sort of “select”.

    Now, what can you do with the features left in SimpleDB – or to put it differently: which make SimpleDB shine?

    It’s a very simple API, so put on top of it your favorite higher level API, e.g. retrieve complete items from a query instead of just item names, if you like. Or model higher level data structures (like lists, trees) with the simple items. SimpleDB is good at serving several concurrent requests, e.g. retrieve all children of a root node in parallel.

    This already hints at how naturally you can map object models to items. No inverse references using foreign keys, but “forward pointers” from parents to children. Think “easy objects graph serialization”.

    This hints at object/data caching. SimpleDB is not supposed to replace your local long term storage, but to ease short term storage, to foster communication between collaborating parties.

    If you like (and are a .NET programmer), check out this implementation of the SimpleDB API for local and remote use: NSimpleDB, http://code.google.com/p/nsimpledb/. It might help to clear things up for you.

    -Ralf

  5. SDB Explorer has been made as an industry leading graphical user interface (GUI) to explore Amazon SimpleDB service thoroughly and in a very efficient and user friendly way.

  6. Andrew says:

    What does “eventual consistency” mean in a system without transactional semantics? If your only operations are get, put and delete, how can something ever be inconsistent – inconsistent with what? If the answer is that you may get old versions of data when a get is requested after a put has been successfully completed, then this would seem to be a problem for a huge number of applications. Are there timing guarantees – do I know my put will be “committed” within a certain amount of time? Is there any way to know that the data you receive is not current? That would help, but not much.

    Thoughts?
