1. Failures
<aside>
💡 Fail-stop failures: a failure ⇒ the computer stops
</aside>
Replication can’t solve problems like:
- Logic bugs
- Configuration errors
- Malicious errors
And may solve problems like:
- Fail-stop failures
2. Challenge
- Has the primary actually failed?
  - Can’t tell the difference between a network partition and a machine failure
  - A wrong guess may cause a split-brain system (two machines both acting as primary)
- How do we keep the primary and backup in sync?
  - Apply all changes in the right order
  - Deal with non-determinism
- Fail over
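The two sync challenges above (ordering and non-determinism) can be sketched in a few lines of Python. This is only an illustration under assumptions, not the design of any particular system: `LogEntry`, `Primary`, `Backup`, and `put_random` are hypothetical names. The primary tags each operation with a sequence number and records the result of any non-deterministic step, and the backup buffers entries until it can apply them in order.

```python
import random
from dataclasses import dataclass, field


@dataclass
class LogEntry:
    seq: int    # position in the primary's order
    key: str
    value: int  # the value the primary actually wrote, even if it was random


@dataclass
class Primary:
    state: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def _record(self, key, value):
        # Apply locally and append a log entry carrying the concrete result.
        self.state[key] = value
        self.log.append(LogEntry(len(self.log), key, value))

    def put(self, key, value):
        self._record(key, value)

    def put_random(self, key):
        # Non-deterministic operation: the primary decides the outcome once
        # and ships the outcome, not the instruction "pick a random number".
        self._record(key, random.randint(0, 999))


@dataclass
class Backup:
    state: dict = field(default_factory=dict)
    next_seq: int = 0                            # next sequence number to apply
    pending: dict = field(default_factory=dict)  # out-of-order entries by seq

    def receive(self, entry):
        if entry.seq < self.next_seq:
            return  # duplicate of an entry that was already applied
        self.pending[entry.seq] = entry
        # Apply every entry that is now contiguous with what was applied so far.
        while self.next_seq in self.pending:
            e = self.pending.pop(self.next_seq)
            self.state[e.key] = e.value
            self.next_seq += 1
```

Even if the entries reach the backup in a different order than the primary produced them, the backup ends in the same state, because it applies nothing until the gap before it is filled and it never re-runs the random choice itself.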
3. Two Approaches
- State transfer ⇒ Send snapshots to the backup
- Replicated State Machine ⇒ Only send operations to the backup
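As a rough illustration of the difference, the sketch below syncs a toy key-value store both ways. The `KVStore` class and the two helper functions are hypothetical names chosen for this example. It assumes every operation is deterministic, which is exactly the condition the replicated state machine approach depends on.

```python
import copy


class KVStore:
    """Toy deterministic state machine: a dict plus two operations."""

    def __init__(self):
        self.data = {}

    def apply(self, op):
        kind, key, value = op
        if kind == "put":
            self.data[key] = value
        elif kind == "append":
            self.data[key] = self.data.get(key, "") + value
        else:
            raise ValueError(f"unknown operation: {kind}")


def sync_by_state_transfer(primary, backup):
    # State transfer: ship a snapshot of the whole state.
    # Simple, but the cost grows with the size of the state.
    backup.data = copy.deepcopy(primary.data)


def sync_by_replicated_ops(backup, ops):
    # Replicated state machine: ship only the operations, in order.
    # Cost grows with the number of operations, not the size of the state.
    for op in ops:
        backup.apply(op)


ops = [("put", "x", "a"), ("append", "x", "b"), ("put", "y", "c")]

primary = KVStore()
for op in ops:
    primary.apply(op)

backup_snapshot = KVStore()
sync_by_state_transfer(primary, backup_snapshot)

backup_ops = KVStore()
sync_by_replicated_ops(backup_ops, ops)
```

Both backups end up with the same contents as the primary. The trade-off is what crosses the network: the full state in the first case, a usually much smaller stream of operations in the second.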
<aside>
💡 Level of operations to replicate
- Application-level
- Machine-level ⇒ transparent!
  - Then the application doesn’t need to be modified at all
  - Use virtual machines
</aside>