Transactions Across Datacenters
(and other weekend projects)



Great Ideas in Computer Science
Ryan Barrett
Sept. 8, 2009





Of three properties of distributed data systems - consistency, availability, partition tolerance - choose two.

Eric Brewer, CAP theorem, PODC 2000







Scaling is hard.

Various

Context

  • multihoming (n): operating from multiple datacenters simultaneously
    • serving vs. data processing
    • read-only vs. read/write data (the hardest kind!)
  • Case study: App Engine datastore

What's ahead and takeaways

  • Consistency?
  • Why transactions?
  • Why across datacenters?
  • Multihoming
  • Techniques and tradeoffs
  • Conclusions

Consistency

  • Weak
  • Eventual
  • Strong
  • Use case: read after write

Weak consistency

  • After a write, reads may or may not see it
  • Best effort only
  • "Message in a bottle"
  • App Engine: memcache
  • VoIP, live online video
  • Realtime multiplayer games

Eventual consistency

  • After a write, reads will eventually see it
  • App Engine: mail
  • Search engine indexing
  • DNS, SMTP, snail mail
  • Amazon S3, SimpleDB

Strong consistency

  • After a write, reads will see it (sketched below)
  • App Engine: datastore
  • File systems
  • RDBMSes
  • Azure tables
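
To make the read-after-write contrast concrete, here is a toy Python sketch (not App Engine code; every name is invented) of a store with one primary copy and one asynchronously updated replica. Reading the primary gives strong consistency; reading the lagging replica gives eventual consistency.

    import time

    class ToyStore:
        """Invented example: one primary copy plus one async replica."""

        def __init__(self, lag_seconds=0.5):
            self.primary = {}
            self.replica = {}
            self.lag = lag_seconds
            self.pending = []  # (apply_at, key, value) not yet on the replica

        def write(self, key, value):
            self.primary[key] = value
            # Replicate asynchronously: the replica sees it only after `lag`.
            self.pending.append((time.time() + self.lag, key, value))

        def read_strong(self, key):
            # Strong: always read the primary, so every prior write is visible.
            return self.primary.get(key)

        def read_eventual(self, key):
            # Eventual: read the replica; recent writes may be missing, but
            # each one shows up once replication catches up.
            now = time.time()
            still_pending = []
            for apply_at, k, value in self.pending:
                if apply_at <= now:
                    self.replica[k] = value
                else:
                    still_pending.append((apply_at, k, value))
            self.pending = still_pending
            return self.replica.get(key)

    store = ToyStore()
    store.write("greeting", "hello")
    print(store.read_strong("greeting"))    # 'hello': read-after-write holds
    print(store.read_eventual("greeting"))  # None: the replica lags behind
    time.sleep(0.6)
    print(store.read_eventual("greeting"))  # 'hello': eventually consistent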

What's ahead

  • Consistency?
  • Why transactions?
  • Why across datacenters?
  • Multihoming
  • Techniques and tradeoffs
  • Conclusions

Why transactions?

  • Clichéd example (sketched below): transfer money from A to B
    • subtract from A
    • add to B
  • What if something happens in between?
    • another transaction on A or B
    • machine crashes
    • ...
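
A minimal sketch of that transfer using sqlite3 from the Python standard library (the table and account names are invented). The with conn: block is one transaction: either both writes commit or, if anything fails in between, both are rolled back.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO accounts VALUES ('A', 100), ('B', 0)")
    conn.commit()

    def transfer(conn, src, dst, amount):
        """Move amount from src to dst: both writes commit, or neither does."""
        try:
            with conn:  # one transaction: commits on success, rolls back on error
                conn.execute(
                    "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                    (amount, src))
                (balance,) = conn.execute(
                    "SELECT balance FROM accounts WHERE name = ?",
                    (src,)).fetchone()
                if balance < 0:
                    raise ValueError("insufficient funds")  # triggers rollback
                conn.execute(
                    "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                    (amount, dst))
        except ValueError:
            pass  # rolled back: neither the subtract nor the add took effect

    transfer(conn, "A", "B", 30)    # commits: A=70, B=30
    transfer(conn, "A", "B", 500)   # aborts: balances unchanged
    print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
    # [('A', 70), ('B', 30)]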

Why transactions?

  • Correctness
  • Consistency
  • Enforce invariants
  • ACID

What's ahead

  • Consistency?
  • Why transactions?
  • Why across datacenters?
  • Multihoming
  • Techniques and tradeoffs
  • Conclusions

Why across datacenters?

  • Catastrophic failures
  • Expected failures
  • Routine maintenance
  • Geolocality
    • CDNs, edge caching

Why not across datacenters?

  • Within a datacenter
    • High bandwidth: 1-100Gbps interconnects
    • Low latency: < 1ms within rack, 1-5ms across
    • Little to no cost
  • Between datacenters
    • Low bandwidth: 10Mbps-1Gbps
    • High latency: 50-150ms
    • $$$ for fiber

What's ahead

  • Consistency?
  • Why transactions?
  • Why across datacenters?
  • Multihoming
  • Techniques and tradeoffs
  • Conclusions

Multihoming

  • Hard problem.
  • ...consistently? Harder.
  • ...with real-time writes? Hardest.
  • What to do?

Option 1: Don't

  • ...instead, bunkerize.
    • Most common
    • Microsoft Azure (tables)
    • Amazon SimpleDB
  • Bad at catastrophic failure
    • Large scale data loss
    • Example: SVColo
  • Not great for serving
    • No geolocation

Option 2: Primary with hot failover(s)

  • Better, but not ideal
    • Mediocre at catastrophic failure
    • Window of lost data
    • Failover data may be inconsistent
  • Amazon Web Services
    • EC2, S3, SQS: choose US or EU
    • EC2: Availability Zones, Elastic Load Balancing
  • Banks, brokerages, etc.
  • Geolocated for reads, not for writes

Option 3: True multihoming

  • Simultaneous writes in different DCs
  • Two way: hard
  • N way: harder
  • Handle catastrophic failure, geolocality
  • ...but pay for it in latency

What's ahead

  • Consistency?
  • Why transactions?
  • Why across datacenters?
  • Multihoming
  • Techniques and tradeoffs
  • Conclusions

Techniques and tradeoffs


               Backups   M/S        MM          2PC         Paxos
Consistency
Transactions
Latency
Throughput
Data loss
Failover

Backups

  • Make a copy
  • Sledgehammer
  • Weak consistency
  • Usually no transactions
  • Datastore: early internal launch

Backups


               Backups   M/S        MM          2PC         Paxos
Consistency    Weak
Transactions   No
Latency        Low
Throughput     High
Data loss      Lots
Failover       Down

Master/slave replication

  • Usually asynchronous (sketched below)
    • Good for throughput, latency
  • Most RDBMSes
    • e.g. MySQL binary logs
  • Weak/eventual consistency
    • Granularity matters!
  • Datastore: current
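
A toy sketch of the idea (invented names, not MySQL's actual binlog protocol): the master appends each write to a log, and a slave replays that log asynchronously, so slave reads can be stale and a dead master's unreplayed tail is lost.

    class Master:
        def __init__(self):
            self.data = {}
            self.log = []                    # append-only write log (a la binlog)

        def write(self, key, value):
            self.data[key] = value           # apply locally, return immediately
            self.log.append((key, value))    # slaves pick this up later

    class Slave:
        def __init__(self):
            self.data = {}
            self.applied = 0                 # how far into the master's log we are

        def pull(self, master, batch=1):
            # Replay the next `batch` log entries; the slave always runs behind.
            for key, value in master.log[self.applied:self.applied + batch]:
                self.data[key] = value
                self.applied += 1

    master, slave = Master(), Slave()
    master.write("x", 1)
    master.write("x", 2)
    slave.pull(master)           # replays only x=1 so far
    print(master.data["x"])      # 2: the master is authoritative
    print(slave.data["x"])       # 1: stale read -- eventual consistency
    # If the master dies now, the unreplayed tail (x=2) is lost, and the
    # slave can only serve reads until a new master is appointed.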

Master/slave replication


               Backups   M/S        MM          2PC         Paxos
Consistency    Weak      Eventual
Transactions   No        Full
Latency        Low       Low
Throughput     High      High
Data loss      Lots      Some
Failover       Down      Read only

Multi-master replication

  • Umbrella term for merging concurrent writes (one merge rule sketched below)
  • Asynchronous, eventual consistency
  • Need serialization protocol
    • e.g. timestamp oracle: monotonically increasing timestamps
    • Either SPOF with master election...
    • ...or distributed consensus protocol
  • No global transactions!
  • Datastore: no strong consistency
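
A toy sketch of one common merge rule, last-writer-wins by timestamp (invented names; real systems also use vector clocks or application-level merges). Note how one of the two concurrent writes is silently discarded: this is why there are no global transactions.

    class MasterNode:
        def __init__(self, name):
            self.name = name
            self.data = {}               # key -> (timestamp, value)

        def write(self, key, value, ts):
            # `ts` must come from a serialization source all masters agree on,
            # e.g. a timestamp oracle handing out monotonic values.
            self.data[key] = (ts, value)

        def merge_from(self, other):
            # Reconcile concurrent writes by timestamp: the later write wins,
            # the earlier one is silently discarded.
            for key, (ts, value) in other.data.items():
                if key not in self.data or ts > self.data[key][0]:
                    self.data[key] = (ts, value)

    us, eu = MasterNode("us"), MasterNode("eu")
    us.write("profile", "written in US", ts=1)
    eu.write("profile", "written in EU", ts=2)   # concurrent, later stamp
    us.merge_from(eu)
    eu.merge_from(us)
    print(us.data["profile"])  # (2, 'written in EU') on both masters
    print(eu.data["profile"])  # the ts=1 write is lost: no global transactions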

Multi-master replication


               Backups   M/S        MM          2PC         Paxos
Consistency    Weak      Eventual   Eventual
Transactions   No        Full       Local
Latency        Low       Low        Low
Throughput     High      High       High
Data loss      Lots      Some       Some
Failover       Down      Read only  Read/write

Two Phase Commit

  • Semi-distributed consensus protocol
    • deterministic coordinator
  • 1: propose, 2: vote, 3: commit/abort (sketched below)
  • Heavyweight, synchronous, high latency
  • 3PC buys async with an extra round trip
  • Datastore: poor throughput
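
A minimal, failure-free sketch of the 2PC shape (all names invented; handling coordinator and participant crashes is exactly what makes real 2PC heavyweight): the coordinator proposes, collects votes, and commits only on a unanimous yes.

    class Participant:
        def __init__(self, can_commit=True):
            self.can_commit = can_commit
            self.state = "init"

        def prepare(self, txn):
            # Phase 1: durably stage the work, then vote.
            self.state = "prepared"
            return self.can_commit          # vote yes/no

        def commit(self, txn):
            self.state = "committed"        # phase 2: make staged work visible

        def abort(self, txn):
            self.state = "aborted"          # phase 2: throw staged work away

    def two_phase_commit(txn, participants):
        # Phase 1: the coordinator proposes, every participant votes.
        votes = [p.prepare(txn) for p in participants]
        # Phase 2: commit only on a unanimous yes; otherwise abort everywhere.
        if all(votes):
            for p in participants:
                p.commit(txn)
            return "committed"
        for p in participants:
            p.abort(txn)
        return "aborted"

    print(two_phase_commit("txn1", [Participant(), Participant()]))       # committed
    print(two_phase_commit("txn2", [Participant(), Participant(False)]))  # aborted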

Two Phase Commit


               Backups   M/S        MM          2PC         Paxos
Consistency    Weak      Eventual   Eventual    Strong
Transactions   No        Full       Local       Full
Latency        Low       Low        Low         High
Throughput     High      High       High        Low
Data loss      Lots      Some       Some        None
Failover       Down      Read only  Read/write  Read/write

Paxos

  • Fully distributed consensus protocol (sketched below)
  • "Either Paxos, or Paxos with cruft, or broken"
    • Mike Burrows
  • Majority writes; survives minority failure
  • Protocol similar to 2PC/3PC
    • Lighter, but still high latency
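
A toy, failure-free sketch of one round of single-decree Paxos (invented names; real deployments add durable state, retries, leader election, and a sequence of instances). A value is chosen once a majority of acceptors accept it, which is why a minority of failures can be survived.

    class Acceptor:
        def __init__(self):
            self.promised = 0            # highest proposal number promised
            self.accepted = (0, None)    # (proposal number, value) accepted so far

        def prepare(self, n):
            # Phase 1b: promise to ignore proposals numbered below n,
            # and report anything already accepted.
            if n > self.promised:
                self.promised = n
                return self.accepted
            return None                  # rejected

        def accept(self, n, value):
            # Phase 2b: accept unless we promised a higher-numbered proposal.
            if n >= self.promised:
                self.promised = n
                self.accepted = (n, value)
                return True
            return False

    def propose(acceptors, n, value):
        # Phase 1a: prepare. Need promises from a majority.
        promises = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
        if len(promises) <= len(acceptors) // 2:
            return None
        # If any acceptor already accepted a value, propose the one with the
        # highest proposal number -- this is what makes Paxos safe.
        highest = max(promises)          # compares (number, value) pairs
        if highest[1] is not None:
            value = highest[1]
        # Phase 2a: accept. The value is chosen once a majority accepts.
        acks = sum(a.accept(n, value) for a in acceptors)
        return value if acks > len(acceptors) // 2 else None

    acceptors = [Acceptor() for _ in range(5)]      # survives 2 (minority) failures
    print(propose(acceptors, n=1, value="commit"))  # 'commit' is chosen
    print(propose(acceptors, n=2, value="abort"))   # still 'commit': Paxos keeps it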

Paxos for the Datastore

  • Close together? no.
  • In the same datacenter? no.
  • Opt-in? maybe...

Paxos for App Engine

  • Coordinate moving state
  • ...usually via lock server
  • Moving apps
  • Memcache
  • Offline processing

Paxos


               Backups   M/S        MM          2PC         Paxos
Consistency    Weak      Eventual   Eventual    Strong      Strong
Transactions   No        Full       Local       Full        Full
Latency        Low       Low        Low         High        High
Throughput     High      High       High        Low         Medium
Data loss      Lots      Some       Some        None        None
Failover       Down      Read only  Read/write  Read/write  Read/write

Conclusions

  • No silver bullet...
  • ...but still worth doing!
  • Embrace tradeoffs...
  • ...and punt them to application developers!

What's behind (phew!)

  • Consistency?
  • Why transactions?
  • Why across datacenters?
  • Multihoming
  • Techniques and tradeoffs
  • Conclusions


Questions? bit.ly/5YhO2

Slides: bit.ly/2Cs7Ec