In webapp circles, denormalizing vs normalizing is an age old debate. Normalization is the classical approach you learned in Databases 101, or from that Oracle graybeard back in the day. Denormalization is the new kid on the block, child of web scale and NoSQL and 100x read/write ratios.
There’s still plenty of debate, but denormalizing is now a standard technique for scaling a modern webapp, especially when combined with a NoSQL datastore, and for good reason: it works. A new caveat occurred to me the other day, though. Denormalizing isn’t so useful when you’re small or medium sized, of course, since scaling isn’t a problem yet and denormalizing takes work. However, denormalizing may also be a bad idea when you’re extremely large.
At a high level, denormalizing buys faster reads by doing more writes up front. This works up to a point, but beyond that it can get out of hand. This writeup of Raffi Krikorian‘s presentation on scaling Twitter illustrates it well:
These high fanout users are the biggest challenge for Twitter. Replies are being seen all the time before the original tweets for celebrities…If it takes minutes for a tweet from Lady Gaga to fan out, then people are seeing her tweets at different points in time…Causes much user confusion.
[Twitter is] not fanning out the high value users anymore. For people like Taylor Swift, don’t bother with fanout anymore, instead merge in her timeline at read time. Balances read and write paths. Saves 10s of percents of computational resources.
So is there a sweet spot for denormalizing, if not for whole applications or datasets then maybe at the level of individual queries or objects? It seems like it might follow an inverted U-shaped graph, with a sweet spot somewhere in the range of 10 to 1000 writes per operation. It’s still doable outside that range, but the cost benefit tradeoff may start to go downhill. If the old adage goes normalize til it hurts, then denormalize til it works, we may need to add and renormalize when it starts hurting again.
Denormalizing is great! Except if you’re small, of course…and maybe also if you’re really big?
http://t.co/N9v97yIW9Z via Twitter
RT @snarfed_org: Denormalizing is great! Except if you’re small, of course…and maybe also if you’re really big?
http://t.co/N9v97yIW9Z via Twitter
Normalizing and schemifying (?) data is sexy and things and one can impress women with how much indexing space they saved by using proper column types, join tables and the like.. but you sure run into trouble when things start changing rapidly. (So, it would seem that denormalized, schema-y data “should” be the norm..?)
I think the main point is that one’s application code should probably be aware of “edge cases” (that N% of your users that make up 80%+ of your IO). Those sort of usage patterns are always there:
The one user with 1000 followers (80% who don’t even log in) and the 1000 users with 0 followers (who can’t log in since they don’t exist).
Would be neat to keep a list of “hot keynames” so, if an update came in destined for one of these users who always checks your site.. that update gets immediately committed (or immediately sent down whatever conveyor belt you use for parceling out updates).
If the update is for some noname who logged on once to follow Toonces The Cat and accidentally followed YOUGAKUDAN_00 (only ~42,000 followers but 37,342,623 tweets)… maybe we can just lazy load his 100 tweets from the last 60 seconds for that user if they every accidentally log on again.
+1ed this.
via plus.google.com
I encourage everyone who happens to read this publication to read “The Relational Model for Large Shared Data Banks” and understand all the problems that it addresses. This way, one can understand that “denormalizing”” is not a solution, nor is a “standard” let alone a “‘modern’ technique”, but rather it is RETURNING to problems that existed before 1970.
Cheers!