Let's start with a summary of what eventual consistency is and what its limitations are. Eventually consistent systems in the style of Amazon Dynamo are designed to be fast and highly available during failures, even when parts of the system fail. We're guaranteed that all copies of data will converge to the same state at some future point. However, this puts the burden of handling temporary inconsistencies on the app. Here are some examples of guarantees that eventual consistency can't provide:
- Key uniqueness: If we can't guarantee uniqueness of email addresses registered with accounts, one email address may get attached to two accounts.
- If check-and-set is done by reading and then writing, the order of updates between competing writers may not be the same on all storage nodes. This means that there will be a temporary disagreement.
- All-or-nothing updates: If there's a network partition and the coordinator's write request only succeeds on one out of three storage nodes, the client will receive a response indicating partial success.
- Read-your-write consistency: The writes succeed in the zone that accepts the write. Readers in other zones aren't guaranteed to see the writes.
We're given these limitations of eventual consistency. How can we design a system that does enforce uniqueness constraints and maintain high availability?
To characterize coordination avoidance, we need a system model. We're aiming for a system with these key properties:
- Global validity: Invariants hold over committed states.
- Transactional availability: We guarantee a response.
- All copies of data in a distributed database will converge to the same state at some future point in the absence of any changes.
- Coordination-freedom: Transactions shouldn't have to communicate during transaction execution
We're trying to build a system where data is replicated across many servers, transactions can happen on any server without talking to others, and yet we can guarantee that our data remains consistent and that all servers eventually agree.
Our database as a collection of data items1. Each has multiple versions. App clients submit requests to the database in the form of transactions, or groups of operations2 that should be executed together. Each transaction operates on a logical replica, or set of versions of the items mentioned in the transaction. Transactions operate over "snapshots” of database state. Upon commit3, the replica state is merged into the set of versions on at least one server. We assume this merge operator is commutative, associative, and idempotent. For example, if server
and
then
To determine whether a database state is valid according to application correctness criteria, we use invariants. You need usernames to be unique. Other kinds of constraints are very similar: An account balance never goes negative, a meeting room doesn't have overlapping bookings. A transaction can commit, or abort4 if committing the transaction would violate a declared invariant over the replica state of its set of transactions
If
Figure -1
graph TD
subgraph PRECONDITION
Ds("Ds<br><font color='blue'>Initial state: empty meeting room schedule</font>") --> Di1("Di1")
Ds --> Dj1("Dj1")
Di1 --> Din("Din")
Dj1 --> Djm("Djm")
Din --> Di1
Djm --> Dj1
style Ds fill:#f9f,stroke:#333,stroke-width:2px
style Di1 fill:#ccf,stroke:#333,stroke-width:2px
style Dj1 fill:#ccf,stroke:#333,stroke-width:2px
style Din fill:#f9f,stroke:#333,stroke-width:2px
style Djm fill:#f9f,stroke:#333,stroke-width:2px
Ds -->|"I(Ds) = True"| Di1
Ds --> |"I(Ds) = True"| Dj1
Di1 --> |"I(Di1) = True"| Din
Dj1 --> |"I(Dj1) = True"| Djm
Di1 -.-> |"-ti2<br><font color='blue'>T1: Alice books Room A, 10:00-11:00</font>"| Din
Dj1 -.-> |"-tj2<br><font color='blue'>T2: Bob books Room A, 11:00-12:00</font>"| Djm
Din -.-> |"-tin"| Di1
Djm -.-> |"-tjm"| Dj1
Ds -.-> |"-ti1"| Di1
Ds -.-> |"-tj1"| Dj1
note_local[/"<font color='green'>Transactions complete locally</font>"/]
Di1 & Dj1 --> note_local
end
subgraph IMPLICATION
Din_Djm("Din ⊔ Djm<br><font color='green'>Merged state: both bookings are present, no conflict</font>")
Din_Djm --> |"I(Din ⊔ Djm) = True"| Din
Din_Djm --> |"I(Din ⊔ Djm) = True"| Djm
style Din_Djm fill:#ccf,stroke:#333,stroke-width:2px
Din -.-> Din_Djm
Djm -.-> Din_Djm
end
PRECONDITION -.-> |"valid divergence from initial state"| IMPLICATION
IMPLICATION -.-> |"merge must be valid"| PRECONDITION
Theorem 1: A globally
The theorem establishes
Writes are performed in the same, well-defined order8. The merge procedures9 are deterministic so that servers resolve the same conflicts in the same manner. The Bayou system uses a primary commit scheme. One server designated as the primary takes responsibility for committing updates. The primary is responsible for deciding the final order of committed operations. Truncating the logs guarantees that they can catch up with the latest state. As a multi-tenant database, Manhattan needs to provide high quality of service to each customer without overwhelming the log.
Example 1: Handling a money transfer
sequenceDiagram
participant Client
participant RequestLog as Request Log (partition)
participant DebitStream as Debit Instruction Stream (partition)
participant CreditStream as Credit Instruction stream (partition)
Client->>RequestLog: Submit request (request_id, from, to, amount)
Note over RequestLog: Validate & persist request
RequestLog-->>Client: Request acknowledged (request_id)
RequestLog->>DebitStream: Debit instruction (request_id, from_account, amount)
Note over DebitStream: Process debit
DebitStream--xRequestLog: Debit result (success/failure)
RequestLog->>CreditStream: Credit instruction (request_id, to_account, amount)
Note over CreditStream: Process credit
CreditStream--xRequestLog: Credit result (success/failure)
Note over RequestLog: Aggregate results, update request status
RequestLog-->>Client: Final request status (success/failure/partial)
Apply every request exactly once to both the payer and payee accounts. We can consider more complex invariants, such as foreign key constraints. Insertions are