Transactions Across Datacenters
(and other weekend projects)
Special Lecture Series in Computer Science
University of San Francisco
Feb. 12, 2009
Of three properties of distributed data systems - consistency, availability,
partition tolerance - choose two.
- Eric Brewer, CAP theorem, PODC 2000
Scaling is hard.
-Various
What's ahead
- Why transactions?
- Consistency
- Why across datacenters?
- Multihoming
- Techniques
- Tradeoffs
GMail
- Mail delivery: eventual, guaranteed
- Mailbox operations: immediate, guaranteed
- Web clip history: best effort
Why transactions?
- Correctness
- Consistency
- Enforce invariants
- ACID
Clichéd examples
- Transfer money from A to B
- What if something happens in between?
- another transaction on A or B
- machine crashes
- ...
- Gmail: archive an email
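The transfer example can be sketched with sqlite3's transaction support. The schema and amounts are illustrative, not from the talk; the point is that both updates commit or neither does.

```python
import sqlite3

# Toy accounts table (illustrative schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('A', 100), ('B', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` atomically: either both updates apply or neither."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")  # forces rollback
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
        return True
    except ValueError:
        return False

transfer(conn, "A", "B", 30)   # succeeds
transfer(conn, "A", "B", 500)  # fails; the debit is rolled back
```

If "something happens in between", the rollback preserves the invariant: no money is created or destroyed.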
What's ahead
- Why transactions?
- Consistency
- Why across datacenters?
- Multihoming
- Techniques
- Tradeoffs
Weak consistency
- After a write, subsequent reads may or may not see the new data
- Best effort only
- "Message in a bottle"
- IP/UDP, SMS, JPEG
- GMail: Web Clip history
Eventual consistency
- After a write, subsequent reads will eventually see the new data
- Search engine indexing
- DNS, SMTP, snail mail
- Amazon S3, SimpleDB
- GMail: mail delivery
Strong consistency
- After a write, subsequent reads will see the new data
- File systems
- RDBMSes
- App Engine datastore, Azure tables
- GMail: mailbox operations
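The three models can be contrasted with a toy in-memory replica set. This is purely illustrative; no real system is this simple, and "weak" is omitted since it is just eventual with no delivery guarantee.

```python
import threading
import time

class Replicas:
    """Toy model of one value replicated across n stores."""
    def __init__(self, n=3):
        self.stores = [{} for _ in range(n)]

    def write_strong(self, key, value):
        # Strong: apply to every replica before acknowledging, so any
        # subsequent read, from any replica, sees the new data.
        for store in self.stores:
            store[key] = value

    def write_eventual(self, key, value):
        # Eventual: acknowledge after one replica; propagate later.
        # Until propagation finishes, reads elsewhere see stale data.
        self.stores[0][key] = value
        def propagate():
            time.sleep(0.01)  # simulated replication lag
            for store in self.stores[1:]:
                store[key] = value
        t = threading.Thread(target=propagate)
        t.start()
        return t  # a real system gives you no handle to wait on

    def read(self, key, replica=0):
        return self.stores[replica].get(key)
```

With `write_strong`, a read from any replica immediately returns the new value; with `write_eventual`, only the written replica does, and the rest catch up after the lag.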
What's ahead
- Why transactions?
- Consistency
- Why across datacenters?
- Multihoming
- Techniques
- Tradeoffs
Why across datacenters?
- Catastrophic failures
- Expected failures
- Routine maintenance
- Geolocality
Why not across datacenters?
- Within a datacenter
- High bandwidth: 1-100Gbps interconnects
- Low latency: < 1ms within a rack, < 5ms across
- Little to no cost
- Between datacenters
- Low bandwidth: 10Mbps-1Gbps
- High latency: expect 100s of ms
- $$$ for bandwidth or fiber
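A back-of-envelope calculation shows why those latency numbers matter. The figures below are the slide's rough estimates, not measurements of any particular network.

```python
# Cross-datacenter round trips dominate commit latency.
rtt_local_ms = 1       # within a datacenter (order of magnitude)
rtt_cross_dc_ms = 100  # between datacenters (100s of ms expected)

def commit_latency_ms(round_trips, rtt_ms):
    """Latency of a protocol that needs `round_trips` network round trips."""
    return round_trips * rtt_ms

# A two-round-trip protocol (e.g. a vote phase then a commit phase):
local = commit_latency_ms(2, rtt_local_ms)     # ~2 ms inside a DC
cross = commit_latency_ms(2, rtt_cross_dc_ms)  # ~200 ms across DCs

# Serialized writes to a single entity are capped by commit latency:
max_serial_commits_per_sec = 1000 / cross      # only a handful per second
```

The same protocol that is invisible inside a datacenter becomes a hard throughput ceiling across them.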
What's ahead
- Why transactions?
- Consistency
- Why across datacenters?
- Multihoming
- Techniques
- Tradeoffs
Multihoming
- Multihoming (n): operating from multiple datacenters simultaneously
- Hard problem.
- ...consistently? Harder.
- ...with real time writes? Hardest.
Option 1: Don't.
- ...instead, bunkerize.
- Most common
- Oracle, MySQL (even sharded)
- Microsoft, Cisco, telcos, US gov't
- Bad at catastrophic failure
- Not great for serving
Option 2: Primary with hot failover(s)
- Better, but not ideal
- Mediocre at catastrophic failure
- Window of lost data
- Failover data may be inconsistent
- Banks, brokerages, etc.
- Geolocated for reads, not for writes
Option 3: True multihoming
- Simultaneous writes in different DCs, maintaining consistency
- Two way
- N way
- Handles catastrophic failure, geolocality
What's ahead
- Why transactions?
- Consistency
- Why across datacenters?
- Multihoming
- Techniques
- Tradeoffs
Interested in...
- Replication
- Transactions
- distributed?
- decentralized?
Backups
- Make a copy
- Sledgehammer
- Weak consistency
- Replication, not transactions
Locking
- Sledgehammer
- Common in RDBMSes
- Good for consistency, bad for throughput
- Optimizations
- shared locks, read/write locks
- InnoDB (MySQL), Oracle
- Transactions, not replication
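The shared/exclusive ("read/write") lock optimization is small enough to sketch. This is a simplified, writer-unfair toy, not InnoDB's or Oracle's implementation.

```python
import threading

class ReadWriteLock:
    """Many readers may hold the lock at once; writers hold it exclusively."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0

    def acquire_read(self):
        with self._cond:          # blocks while a writer holds the lock
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # wake any waiting writer

    def acquire_write(self):
        self._cond.acquire()      # exclusive: keeps readers out
        while self._readers > 0:
            self._cond.wait()     # wait for in-flight readers to drain

    def release_write(self):
        self._cond.release()
```

Readers proceed concurrently (good for throughput); a writer waits for all readers to finish and then blocks everyone (good for consistency).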
Optimistic concurrency
- Opposite of "pessimistic" locking
- Journal writes, check for collisions before commit
- Useful variant: multi-version concurrency control (MVCC)
- Good for throughput, read-heavy workloads
- Transactions, not replication
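A minimal sketch of the journal-and-check idea, using a per-key version number as the collision check. Illustrative only; real implementations journal full read/write sets.

```python
import threading

class OptimisticStore:
    """Read a version, work without holding locks, then check at commit
    time that nobody else wrote in between."""
    def __init__(self):
        self._lock = threading.Lock()  # guards only the brief commit step
        self._data = {}                # key -> (version, value)

    def read(self, key):
        return self._data.get(key, (0, None))  # (version, value)

    def commit(self, key, expected_version, new_value):
        # Compare-and-swap: succeed only if the version is unchanged.
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                return False  # collision detected: caller must retry
            self._data[key] = (current_version + 1, new_value)
            return True
```

No lock is held while the caller computes, so readers never wait; the cost is that a losing writer must redo its work, which is why this shines on read-heavy workloads.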
Master/slave replication
- Usually asynchronous
- Good for throughput, latency
- Most RDBMSes
- Weak/eventual consistency
- Replication, not transactions
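Asynchronous master/slave replication can be sketched as a master that acknowledges immediately and ships a log to slaves in the background. Toy model: a real slave tails a durable log, and a master crash loses whatever was still queued, which is exactly the weak/eventual window.

```python
import queue
import threading

class Master:
    """Acks writes right away (low latency); replicates asynchronously."""
    def __init__(self, slaves):
        self.data = {}
        self.slaves = slaves           # list of dicts standing in for slaves
        self.log = queue.Queue()
        threading.Thread(target=self._ship, daemon=True).start()

    def write(self, key, value):
        self.data[key] = value         # acknowledged immediately
        self.log.put((key, value))     # shipped to slaves later

    def _ship(self):
        while True:
            key, value = self.log.get()
            for slave in self.slaves:  # slaves lag behind the master
                slave[key] = value
            self.log.task_done()
```

Between `write` returning and `_ship` catching up, slaves serve stale reads: eventual consistency by construction.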
Multi-master replication
- Umbrella term for merging concurrent writes
- Usually asynchronous, eventual consistency
- Usually need serialization protocol
- e.g. timestamp oracle: monotonically increasing timestamps
- Either SPOF with master election...
- ...or distributed consensus protocol
- Replication and transactions
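A timestamp oracle is small enough to sketch directly. This assumes a single process; a real oracle must also survive restarts without ever handing out a timestamp twice or going backwards.

```python
import threading
import time

class TimestampOracle:
    """Hands out strictly increasing timestamps for serializing writes,
    even if the wall clock stalls or repeats."""
    def __init__(self):
        self._lock = threading.Lock()
        self._last = 0

    def next(self):
        with self._lock:
            now = int(time.time() * 1_000_000)       # microseconds
            self._last = max(self._last + 1, now)    # never repeat, never go back
            return self._last
```

Concurrent writes stamped by the oracle have a total order, so every replica that merges them in timestamp order converges to the same result.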
Two Phase Commit
- Centralized consensus protocol
- Single coordinator (SPOF)
- 1: propose, 2: vote, 3: commit/abort
- Heavyweight (high latency) and blocking
- 3PC buys non-blocking at the cost of an extra message round
- Replication and distributed transactions
- Supports true multihoming
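The propose/vote/commit steps can be sketched with in-memory participants. This omits the durable logs and timeouts a real 2PC needs, and shows the blocking problem: a yes-voting participant stays locked until the coordinator decides.

```python
class Participant:
    """One resource manager in a toy two-phase commit."""
    def __init__(self):
        self.prepared = None   # txn this participant has voted yes on
        self.committed = {}

    def prepare(self, txn):    # phase 1: vote
        if self.prepared is not None:
            return False       # busy with another txn: vote no
        self.prepared = txn    # vote yes, and block until told the outcome
        return True

    def commit(self):          # phase 2: commit
        self.committed.update(self.prepared)
        self.prepared = None

    def abort(self):           # phase 2: abort
        self.prepared = None

def two_phase_commit(txn, participants):
    """The single coordinator (the SPOF): commit only if every vote is yes."""
    votes = [p.prepare(txn) for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:     # any no vote aborts everyone who said yes
        if p.prepared is txn:
            p.abort()
    return False
```

If the coordinator dies between the vote and the decision, every yes-voter is stuck holding its locks, which is why 2PC is called blocking.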
Paxos
- Decentralized, distributed consensus protocol
- "Either Paxos, or Paxos with cruft, or broken"
- Majority writes; survives minority failure
- Protocol similar to 2PC/3PC
- Lighter, but still high latency
- Replication and distributed transactions
- Supports true multihoming
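A minimal single-decree sketch of the protocol's two phases, in memory with no durable state or leader election. It shows the two properties the slide claims: a majority is enough, and a later proposer adopts (rather than overwrites) any value already chosen.

```python
class Acceptor:
    """One acceptor in a toy single-decree Paxos."""
    def __init__(self):
        self.promised = 0      # highest ballot promised so far
        self.accepted = None   # (ballot, value) accepted so far, or None

    def prepare(self, ballot):           # phase 1b: promise
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted   # report any prior accepted value
        return False, None

    def accept(self, ballot, value):     # phase 2b: accept
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False

def propose(acceptors, ballot, value):
    """Phase 1 then phase 2; each needs a majority to succeed."""
    majority = len(acceptors) // 2 + 1
    promises = [a.prepare(ballot) for a in acceptors]
    granted = [acc for ok, acc in promises if ok]
    if len(granted) < majority:
        return None
    # Safety rule: adopt the highest-ballot value already accepted, if any.
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]
    acks = sum(a.accept(ballot, value) for a in acceptors)
    return value if acks >= majority else None
```

Because only a majority is required, the protocol keeps making progress when a minority of acceptors (e.g. one datacenter) is unreachable.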
What's ahead
- Why transactions?
- Consistency
- Why across datacenters?
- Multihoming
- Techniques
- Tradeoffs
Tradeoffs (very approximate)
                 Paxos     2PC       MMR             MSR        Backups
  Consistency    Strong    Strong    Eventual        Eventual   Weak
  Latency        High      High      Low             Low        Low
  Throughput     Medium    Low       High            High       High
  Data loss      None      None      Some            Some       Lots
  Failover       N/A       N/A       Minimal impact  Read only  Read only
Tradeoffs (very approximate)
[Repeat of the table above, annotated with where the GMail features
(mail delivery, mailbox operations, Web Clip history) fall]
What's behind (phew!)
- Why transactions?
- Consistency
- Why across datacenters?
- Multihoming
- Techniques
- Tradeoffs