Understand the harsh reality of distributed systems: network partitions, latency, packet loss, and partial failures. Master the CAP theorem and trade-off thinking to design resilient systems.
Lessons learned
I still remember the first time my production system hit a network partition.
2 AM. Monitoring alerts going off like crazy. Some users in the US couldn't log in. Users in the EU logged in just fine. The database was healthy. The application servers were running.
I was completely lost. Checked the logs: no errors. Checked CPU and memory: all normal.
The senior architect came online: "Check network connectivity between the US and EU datacenters."
Bingo. AWS had a network issue. The US datacenter couldn't reach the EU datacenter. The system was in a network partition.
That was when I learned my most expensive lesson: distributed systems always fail. It's not a question of "if", but "when".
Single machine (local code):
# Local function call
result = calculate_total(items)
Assumptions:
The function always returns (or throws an exception)
Execution time is predictable
Data is consistent
Either success or failure (no in-between)
Distributed system (network call):
# Remote API call
result = api.calculate_total(items)
Reality:
The request may be lost (no response)
The response may time out (slow network)
The request may be duplicated (retries)
Partial success/failure is possible
Data may be inconsistent across nodes
Key insight: when you move from a single machine to a distributed system, every assumption you used to rely on becomes wrong.
Peter Deutsch and James Gosling summarized the 8 fallacies that developers keep falling for:
1. The network is reliable
→ Reality: Packets drop, connections fail
2. Latency is zero
→ Reality: Network calls take 10-1000ms
3. Bandwidth is infinite
→ Reality: Networks have limited throughput
4. The network is secure
→ Reality: Data can be intercepted
5. Topology doesn't change
→ Reality: Servers die, networks reconfigure
6. There is one administrator
→ Reality: Multiple teams, multiple configs
7. Transport cost is zero
→ Reality: Serialization, network transfer cost
8. The network is homogeneous
→ Reality: Different protocols, versions
Lesson: designing a distributed system means designing for failure.
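Fallacies 1 and 2 are the ones that bite first. Here is a minimal sketch of taking them seriously in code; the endpoint URL, the item shape, and the local fallback are purely illustrative assumptions:
import requests

API_URL = "https://api.example.com/total"  # hypothetical endpoint

def remote_total(items):
    try:
        # Fallacy 2: latency is not zero, so bound it explicitly
        resp = requests.post(API_URL, json={"items": items}, timeout=2)
        resp.raise_for_status()
        return resp.json()["total"]
    except requests.exceptions.RequestException:
        # Fallacy 1: the network is not reliable; degrade instead of hanging
        return sum(item["price"] for item in items)  # assumes items are dicts with a price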
Murphy's Law: "Anything that can go wrong, will go wrong."
In distributed systems, this law gets amplified:
"Everything that can go wrong, WILL go wrong. Simultaneously. At 3am. On Black Friday."
Failure 1: The Silent Data Corruption
Scenario:
- Database replicates master → slave
- Network packet corruption (1 in 10 million)
- The slave receives corrupted data
- The application reads from the slave
- Wrong data is returned to users (subtle bugs)
Impact: 3 days to detect, 1 week to fix
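One common mitigation for this class of bug (not part of the original incident, just a hedged sketch) is to ship a checksum with each replicated payload and verify it before trusting the replica:
import hashlib
import json

def with_checksum(record):
    payload = json.dumps(record, sort_keys=True).encode()
    return {"payload": payload, "sha256": hashlib.sha256(payload).hexdigest()}

def read_verified(message):
    # Reject replicated data whose checksum no longer matches
    if hashlib.sha256(message["payload"]).hexdigest() != message["sha256"]:
        raise ValueError("Replica payload failed checksum verification")
    return json.loads(message["payload"])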
Failure 2: The Cascading Timeout
Scenario:
- Service A call Service B (timeout 30s)
- Service B is slow (disk issue)
- Service A waits 30s per request
- Requests queue up
- Service A memory exhausted
- Service A crashes
- Services depending on A also fail
Impact: entire system down for 2 hours
Failure 3: The Clock Skew
Scenario:
- Server 1 clock: 10:00:00
- Server 2 clock: 09:59:30 (30s behind)
- Event A happens on Server 1 at 10:00:00
- Event B happens on Server 2 at 09:59:45
- System thinks B happened before A (wrong!)
Impact: Transaction ordering corrupted
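A tiny sketch of the bug, using the timestamps from the scenario (the date is arbitrary): ordering by wall-clock timestamps reverses the real order of events.
from datetime import datetime

# Timestamps as each server recorded them; Server 2's clock runs 30s behind
event_a = {"name": "A", "recorded_at": datetime(2024, 1, 1, 10, 0, 0)}   # Server 1
event_b = {"name": "B", "recorded_at": datetime(2024, 1, 1, 9, 59, 45)}  # Server 2

first = min([event_a, event_b], key=lambda e: e["recorded_at"])
print(first["name"])  # "B" -- but in real time B happened about 15s AFTER A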
Takeaway: no failure scenario is "too weird". All of them happen in production.
Definition: one part of the system cannot communicate with another part, but both parts are still running.
graph TB
subgraph "US Datacenter"
US1[Server US-1]
US2[Server US-2]
end
subgraph "EU Datacenter"
EU1[Server EU-1]
EU2[Server EU-2]
end
US1 <--> US2
EU1 <--> EU2
US1 -.X Network Partition X.-> EU1
US2 -.X Network Partition X.-> EU2
style US1 fill:#51cf66
style US2 fill:#51cf66
style EU1 fill:#51cf66
style EU2 fill:#51cf66
Network partition: the US and EU datacenters cannot reach each other, but the servers within each datacenter keep operating normally
A real-world scenario:
Timeline:
10:00:00 - Network partition occurs
10:00:01 - User A (US) updates profile
10:00:02 - User B (EU) reads profile → Sees old data
10:00:30 - Network heals
10:00:31 - EU datacenter receives update
10:00:32 - User B reads again → Sees new data
Problem: 30 seconds of inconsistency
Why do partitions happen?
Common causes:
- Network equipment failure (switch, router)
- Fiber optic cable cut
- AWS/GCP regional issues
- DDoS attacks
- Misconfiguration
- Too much traffic (congestion)
You cannot prevent them, you can only handle them:
# Assume the network is always available
def update_user(user_id, data):
    primary_db.write(data)
    replica_db.write(data)  # Assumes this always succeeds

# Handle the partition
def update_user(user_id, data):
    try:
        primary_db.write(data)
    except WriteError:
        log_error("Primary DB unavailable")
        queue_for_retry(data)
    try:
        replica_db.write(data)
    except WriteError:
        # Replica unavailable, but primary succeeded
        # System still functional (eventual consistency)
        log_warning("Replica unavailable, will retry")
Definition: a network call is slower than expected, but eventually completes.
Problem: slow ≠ down. The system still works, but the UX is terrible.
Normal latency: 50ms
High latency: 5000ms (100x slower)
User experience:
- Click button
- Wait 5 seconds
- Still loading...
- Click again (duplicate request!)
- Wait more...
- Give up, close tab
Cascading effect:
sequenceDiagram
participant User
participant API
participant Service
participant DB
User->>API: Request (timeout 10s)
API->>Service: Call service (timeout 8s)
Service->>DB: Query (slow, takes 9s)
Note over Service,DB: DB slow due to load
Note over API,Service: Service timeout!
Service--xAPI: Timeout error
API--xUser: 500 error
Note over User: User retries
User->>API: Retry request
Note over API,DB: More load, slower
High latency causes timeouts, timeouts cause retries, retries add more load, more load makes everything slower → death spiral
Solutions:
# Solution 1: timeouts with reasonable values
response = requests.get(url, timeout=2)  # Fail fast

# Solution 2: circuit breaker
class CircuitBreaker:
    def call(self, func):
        if self.failure_rate > self.threshold:
            raise CircuitOpenError("Too many failures")
        try:
            return func()
        except TimeoutError:
            self.record_failure()
            raise

# Solution 3: bulkhead pattern (isolate resources)
thread_pool_for_service_a = ThreadPool(10)
thread_pool_for_service_b = ThreadPool(10)
# A slow Service A does not affect Service B
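The ThreadPool above is pseudocode; a runnable version of the bulkhead idea with the standard library could look like this (the service_a and service_b objects are assumptions):
from concurrent.futures import ThreadPoolExecutor

# Separate pools: a slow Service A can only exhaust its own 10 threads
pool_a = ThreadPoolExecutor(max_workers=10)
pool_b = ThreadPoolExecutor(max_workers=10)

def call_service_a(request):
    return pool_a.submit(service_a.handle, request)  # service_a is assumed

def call_service_b(request):
    return pool_b.submit(service_b.handle, request)  # service_b is assumed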
Definition: network packets never reach their destination.
Send: [Packet 1] [Packet 2] [Packet 3] [Packet 4]
Receive: [Packet 1] [ ] [Packet 3] [Packet 4]
↑ Lost!
Typical packet loss rates:
LAN (local network): 0.01% - 0.1%
WiFi: 1% - 5%
Internet (good): 0.1% - 1%
Internet (poor): 5% - 20%
Impact:
# Request gets lost
client.send(request)
# ... no response ...
# The client cannot tell: was the request lost, or the response?

# TCP retransmits automatically (but slowly)
# UDP does not retry (faster but unreliable)
Application-level handling:
# Idempotent operations
def create_user(user_id, data):
    # Use user_id (client-generated UUID)
    # Retry-safe: create twice = same result
    db.insert_or_update(user_id, data)

# Non-idempotent (dangerous)
def increment_counter():
    counter += 1
    # Retry = wrong result!
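One hedged way to make the counter retry-safe, reusing the client-generated ID idea from create_user: count distinct request IDs instead of blind increments.
processed_ids = set()
counter = 0

def increment_counter(request_id):
    global counter
    if request_id in processed_ids:
        return counter            # duplicate retry: no double count
    processed_ids.add(request_id)
    counter += 1
    return counter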
Definition: the same message is delivered multiple times.
Scenario:
sequenceDiagram
participant Client
participant Server
participant DB
Client->>Server: Create order
Server->>DB: INSERT order
DB-->>Server: Success
Note over Server,Client: Response packet lost!
Server--xClient: (no response received)
Note over Client: Client timeout, retry
Client->>Server: Create order (again)
Server->>DB: INSERT order (duplicate!)
DB-->>Server: Success
Server-->>Client: Success
Note over DB: Now have 2 orders!
Packet loss causes duplicate messages when the client retries
Real-world impact:
Scenario: Payment processing
1. User clicks "Pay $100"
2. Request times out
3. User clicks again
4. Both requests succeed
5. User charged $200!
💸 Company lawsuit, refunds, bad press
Solutions:
# Solution 1: Idempotency key
def process_payment(amount, idempotency_key):
    if cache.exists(idempotency_key):
        return cache.get(idempotency_key)  # Return cached result
    result = charge_card(amount)
    cache.set(idempotency_key, result, ttl=86400)
    return result

# Client
idempotency_key = str(uuid.uuid4())
process_payment(100, idempotency_key)
# Retrying with the same key → No duplicate charge
# Solution 2: Database constraints
CREATE TABLE orders (
order_id UUID PRIMARY KEY, -- Client-generated
user_id INT,
amount DECIMAL
);
-- Duplicate INSERT fails due to PK constraint
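What the application sees when that constraint fires — a self-contained sketch using sqlite3, with the column types trimmed to what sqlite supports:
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, user_id INT, amount REAL)")

def create_order(order_id, user_id, amount):
    try:
        conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (order_id, user_id, amount))
        conn.commit()
        return "created"
    except sqlite3.IntegrityError:
        return "already exists"  # retrying the same order is harmless

order_id = str(uuid.uuid4())              # client-generated ID
print(create_order(order_id, 42, 100.0))  # created
print(create_order(order_id, 42, 100.0))  # already exists (duplicate retry)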
Definition: one part of the system fails while the rest keeps working.
This is the most dangerous type of failure.
Scenario: E-commerce checkout
Step 1: Charge credit card → Success
Step 2: Create order in DB → Success
Step 3: Send confirmation email → Email service down
Step 4: Update inventory → Inventory service slow, timeout
Result:
- User charged
- Order created
- No email (user confused)
- Inventory not updated (oversell risk)
Partial success = Inconsistent state
graph LR
A[Checkout Request] --> B[Payment Service]
B --> C[Order Service]
C --> D[Email Service]
C --> E[Inventory Service]
B -.Success.-> B
C -.Success.-> C
D -.Failed X.-> D
E -.Timeout X.-> E
style B fill:#51cf66
style C fill:#51cf66
style D fill:#ff6b6b
style E fill:#ff6b6b
Partial failure: some services succeed, some fail, leaving an inconsistent state
Solutions:
# Solution 1: Saga pattern (compensating transactions)
def checkout_saga(order):
    try:
        # Step 1: Charge
        payment_id = payment_service.charge(order.amount)
        # Step 2: Create order
        order_id = order_service.create(order)
        # Step 3: Send email
        email_service.send(order.email)
        # Step 4: Update inventory
        inventory_service.decrease(order.items)
        return Success(order_id)
    except EmailServiceError:
        # Email failed, compensate
        order_service.mark_no_email(order_id)
        queue.retry_email(order)  # Retry later
        return PartialSuccess(order_id)
    except InventoryServiceError:
        # Critical failure, roll back
        payment_service.refund(payment_id)
        order_service.cancel(order_id)
        return Failure("Inventory unavailable")

# Solution 2: Event sourcing
# Log all events, replay them to reach a consistent state
events = [
    {"type": "payment_charged", "amount": 100},
    {"type": "order_created", "order_id": 123},
    {"type": "email_failed", "retry_at": "10:00"},
    {"type": "inventory_timeout", "retry_at": "10:01"},
]
# A background job processes the failures
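A minimal sketch of the replay idea behind that comment: the current state is derived from the event log, and whatever is still pending is exactly what the background job retries (the event shapes match the list above):
def replay(events):
    state = {"paid": 0, "order_id": None, "pending": []}
    for event in events:
        if event["type"] == "payment_charged":
            state["paid"] += event["amount"]
        elif event["type"] == "order_created":
            state["order_id"] = event["order_id"]
        elif event["type"] in ("email_failed", "inventory_timeout"):
            state["pending"].append(event)  # background job retries these
    return state

print(replay(events))
# {'paid': 100, 'order_id': 123, 'pending': [<email_failed>, <inventory_timeout>]}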
In a distributed database facing a network partition, you can only have 2 of these 3:
C (Consistency):
Every read receives the latest write
Every node sees the same data at the same time
A (Availability):
Every request receives a (non-error) response
The system keeps operating
P (Partition Tolerance):
The system keeps working when a network partition occurs
Nodes cannot communicate with each other
Reality: network partitions ALWAYS happen (P is mandatory).
→ You must choose between C and A.
Scenario: a banking app with 2 datacenters
graph LR
subgraph "US Datacenter"
US_DB[(Database<br/>Balance: $100)]
end
subgraph "EU Datacenter"
EU_DB[(Database<br/>Balance: $100)]
end
US_DB -.X Network Partition.-> EU_DB
USER_A[User A<br/>US] --> US_DB
USER_B[User B<br/>EU] --> EU_DB
Network partition happens. What to do?
Strategy: sacrifice availability to guarantee consistency.
# CP system behavior
def withdraw(amount):
    if not can_reach_all_replicas():
        raise ServiceUnavailableError("System temporarily unavailable")
    # Only proceed if all replicas can be synced
    for replica in all_replicas:
        replica.withdraw(amount)
    return Success()
Timeline during the partition:
10:00:00 - Network partition
10:00:01 - User A (US) withdraw $50
→ US DB: "Cannot reach EU, refuse request"
→ User sees error (unavailable)
10:00:02 - User B (EU) withdraw $30
→ EU DB: "Cannot reach US, refuse request"
→ User sees error (unavailable)
Result:
System unavailable (bad UX)
No inconsistency (data safe)
Trade-offs:
Data is always consistent
No risk of conflicting updates
Good for: Banking, payments, inventory
System is down during partitions
Bad UX
Lost transactions
Strategy: sacrifice consistency to guarantee availability.
# AP system behavior
def withdraw(amount):
    # Proceed with the local replica, sync later
    local_db.withdraw(amount)
    queue_sync_to_other_replicas(amount)
    return Success()
Timeline during the partition:
10:00:00 - Network partition
10:00:01 - User A (US) withdraw $50
→ US DB: $100 - $50 = $50
→ Success (local write)
10:00:02 - User B (EU) withdraw $30
→ EU DB: $100 - $30 = $70
→ Success (local write)
10:01:00 - Network heals
10:01:01 - Replicas sync
→ Conflict! US says $50, EU says $70
Result:
System available (good UX)
Data inconsistent (problem!)
Conflict resolution strategies:
# Strategy 1: Last-Write-Wins (timestamp)
us_balance = {"value": 50, "timestamp": "10:00:01"}
eu_balance = {"value": 70, "timestamp": "10:00:02"}
# EU timestamp later → Balance = $70
# Lost US transaction!
# Strategy 2: Vector clocks (keep both)
us_balance = {"value": 50, "version": [1, 0]}
eu_balance = {"value": 70, "version": [0, 1]}
# Both are concurrent → Flag for manual resolution
# No data loss, but need human intervention
# Strategy 3: Application-specific merge
# Banking: Merge = apply both withdrawals
# Balance = $100 - $50 - $30 = $20
# Correct final state
# But requires domain knowledge
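A small sketch of how strategy 2 decides that the two versions are concurrent — an element-wise comparison of the vector clocks:
def compare_versions(v1, v2):
    # Returns "before", "after", "equal", or "concurrent"
    if v1 == v2:
        return "equal"
    if all(a <= b for a, b in zip(v1, v2)):
        return "before"   # v1 happened-before v2
    if all(a >= b for a, b in zip(v1, v2)):
        return "after"    # v2 happened-before v1
    return "concurrent"   # neither dominates: flag for conflict resolution

print(compare_versions([1, 0], [0, 1]))  # concurrent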
Trade-offs:
System is always available
Good UX
Good for: Social media, recommendations, analytics
Data can be inconsistent
Need conflict resolution
Eventual consistency (not immediate)
Banking App:
Choice: CP
Why? Data accuracy > Availability
Users can wait; they cannot accept a wrong balance
Social Media Feed:
Choice: AP
Why? Availability > Strict consistency
Users seeing slightly stale like counts is OK
E-commerce Inventory:
Choice: CP
Why? Cannot oversell products
Better to show "unavailable" than to sell stock that's already gone
Ask yourself:
1. Is consistency critical?
- Wrong data = big problem? (Banking: YES)
- Stale data acceptable? (Social feed: YES)
2. Is availability critical?
- Downtime acceptable? (Internal tool: YES)
- Must always work? (Public API: YES)
3. How often do partitions happen?
- Single datacenter: Rare
- Multi-region: Common
CP System:
When: Financial transactions, inventory, bookings
Accept: Downtime during partitions
Gain: Data accuracy
AP System:
When: Social features, analytics, recommendations
Accept: Eventual consistency
Gain: Always available
CA System:
Impossible in distributed systems
(network partitions always happen)
Reality: it's not all-or-nothing. You can mix.
# Different consistency for different data
class UserService:
    def update_profile(self, user_id, data):
        # Profile updates: AP (eventual consistency is OK)
        local_db.write(data)
        async_replicate(data)

    def update_balance(self, user_id, amount):
        # Balance updates: CP (strong consistency required)
        if not all_replicas_reachable():
            raise UnavailableError()
        for replica in all_replicas:
            replica.write(amount)

# Per-feature consistency choice
features = {
    "user_profile": "AP",      # Stale OK
    "account_balance": "CP",   # Must be accurate
    "friend_list": "AP",       # Stale OK
    "purchase_history": "CP",  # Must be accurate
}
Assume everything fails.
# Optimistic (dangerous)
def call_service():
    return service.request()

# Pessimistic (safe)
def call_service():
    try:
        return service.request(timeout=2)
    except TimeoutError:
        log_error("Service timeout")
        return fallback_value()
    except ConnectionError:
        log_error("Service unreachable")
        return cached_value()
    except Exception as e:
        log_error(f"Unexpected error: {e}")
        return default_value()
Don't wait forever. Fail quickly and retry.
# Slow failure (30s timeout)
response = requests.get(url, timeout=30)

# Fast failure (2s timeout)
response = requests.get(url, timeout=2)

# Exponential backoff retry
def get_with_retry(url, max_retries=4):
    for attempt in range(max_retries):
        try:
            return requests.get(url, timeout=2)
        except requests.exceptions.Timeout:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s
            else:
                raise
Operations must be safe to retry.
# Not idempotent
def process_order(order):
    send_email(order.email)    # Retry = duplicate emails!
    charge_card(order.amount)  # Retry = double charge!

# Idempotent
def process_order(order, idempotency_key):
    if already_processed(idempotency_key):
        return get_cached_result(idempotency_key)
    result = {
        "email_sent": send_email_once(order.email, idempotency_key),
        "payment": charge_card_once(order.amount, idempotency_key),
    }
    cache_result(idempotency_key, result)
    return result
Not everything needs immediate consistency.
# Strong consistency (expensive)
def like_post(post_id):
    with db.transaction():
        count = db.get_like_count(post_id)
        db.set_like_count(post_id, count + 1)
# Every like requires a DB write

# Eventual consistency (cheap)
def like_post(post_id):
    cache.increment(f"likes:{post_id}")
    queue.add({"action": "like", "post_id": post_id})
# Batch-update the DB every 10 seconds

# User sees: 1000 likes
# Reality: 1002 likes (2 seconds of lag)
# Acceptable? YES, for social features
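The "batch update every 10 seconds" part is assumed above; a minimal sketch of that background flusher, assuming queue behaves like a standard queue (with empty/get) and the DB helper exists:
import time
from collections import Counter

def flush_likes_forever():
    while True:
        time.sleep(10)
        batch = Counter()
        while not queue.empty():  # drain everything queued since the last flush
            action = queue.get()
            if action["action"] == "like":
                batch[action["post_id"]] += 1
        for post_id, delta in batch.items():
            db.increment_like_count(post_id, delta)  # assumed DB helper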
Distributed systems failures:
1. Network partition: nodes cannot communicate with each other
2. High latency: Slow ≠ Down
3. Packet loss: Messages disappear
4. Message duplication: the same message arrives multiple times
5. Partial failure: Some succeed, some fail
CAP theorem:
Cannot have all 3: Consistency, Availability, Partition Tolerance
Partition tolerance is mandatory (networks always fail)
→ Choose CP (consistency) or AP (availability)
CP: Banking, payments, inventory
AP: Social media, analytics, recommendations
Design principles:
1. Design for failure (assume everything fails)
2. Fail fast (short timeouts, quick retries)
3. Idempotency (safe to retry)
4. Eventual consistency (acceptable for many cases)
Trade-off thinking:
No perfect solution.
Every choice sacrifices something:
- CP sacrifices availability
- AP sacrifices consistency
- Idempotency sacrifices simplicity
- Retries sacrifice latency
Choose based on business requirements, not technical preference.
Murphy's Law is real:
"If it can fail, it will fail."
In distributed systems:
"Everything will fail. Simultaneously. At worst time."
Design accordingly.
Self-check questions:
When designing a distributed system, ask yourself:
If you can't answer them → you're not ready for production.
Distributed systems are hard. But with the right mindset (design for failure) and the right patterns (timeouts, retries, idempotency), you can build resilient, reliable systems.
Fail gracefully > Never fail.