Phase 6 — System Design Mastery

Architect Mindset & Production Thinking - From Engineer To Trusted Architect

Learn to think like an architect: production incident mindset, failure thinking, monitoring strategy, capacity planning, migration approach, and technical leadership - from engineer to trusted architect



After mastering all the technical knowledge - patterns, distributed systems, scalability - one final question remains: "What makes a trusted architect?"

The answer is not more technical skill. It is the mindset shift from "write code" to "own the system".

An engineer writes code to solve a problem. An architect designs the system and takes responsibility when production goes down at 3AM.

This lesson doesn't teach more patterns. It teaches you to think like the person who has to be on-call, explain the outage to the CEO, and plan the migration for 100M users.

Reality Check: You Own The System

Engineer vs Architect Mindset

Engineer thinking:
"Feature works on my laptop ✅"
"Passed all tests ✅"
"Deployed to production ✅"
→ Move to next ticket

Architect thinking:
"Will this work under 10x load?"
"What if database fails?"
"How do we rollback?"
"What's monitoring strategy?"
"Who's on-call when this breaks?"
→ System ownership

Core difference: Architects think beyond the happy path.

The 3AM Test

Scenario: Production alert at 3AM

Engineer response:
"Not my code, not my problem"
"Let's check tomorrow"

Architect response:
"What's user impact?"
"How do we mitigate now?"
"Root cause analysis tomorrow"
"How do we prevent this?"

Architect mindset = Ownership mindset.

1. Production Incident Thinking - Expect Failure

Everything Will Fail

Fundamental truth: All systems fail. Your job is minimizing impact.

flowchart TB
    Normal[Normal Operation] --> Incident[Incident Detected]
    Incident --> Assess[Assess Impact]
    Assess --> Mitigate[Mitigate ASAP]
    Mitigate --> RCA[Root Cause Analysis]
    RCA --> Prevent[Prevention Measures]
    Prevent --> Normal

Incident Response Framework

Phase 1: Detection (Minutes matter)

// Bad: Discover from user complaints
User tweet: "Your site is down!"
→ 30 minutes before team knows

// Good: Proactive monitoring
Alert: "API error rate: 5% (threshold: 1%)"
Slack: "@oncall immediate action required"
→ Team knows within 2 minutes

Phase 2: Assessment (Understand before act)

Critical questions:
1. User impact? (How many users affected?)
2. Business impact? (Revenue loss? SLA breach?)
3. Scope? (One service? Entire system?)
4. Trend? (Getting worse? Stable?)

Example:
- 15% of checkout requests failing
- ~500 users affected
- Payment service only
- Error rate stable last 10 min
→ High severity, contained scope

Phase 3: Mitigation (Fix now, understand later)

Priority: Stop the bleeding

Options (in order):
1. Rollback (safest, fastest)
2. Feature flag off (disable new code)
3. Traffic routing (shift to healthy instances)
4. Graceful degradation (disable non-critical features)
5. Emergency fix (last resort)

Example:
Payment service down?
→ Rollback to previous version (5 min)
Not: Debug root cause in production (hours)

Phase 4: Communication (Keep stakeholders informed)

Incident update template:

Status: INVESTIGATING
Impact: 15% checkout failures
Started: 14:30 UTC
Current action: Rolling back payment service to v2.3.1
Next update: 14:45 UTC

Audience:
- Internal team (Slack #incidents)
- Support team (brief customers)
- Management (business impact)
- Customers if needed (status page)

Phase 5: Root Cause Analysis (Learn from failure)

# Post-Mortem: Payment Service Outage
Date: 2024-02-15
Duration: 14:30 - 14:52 UTC (22 minutes)
Impact: 15% checkout failures, ~500 affected users

## Timeline
14:30 - Deployed v2.3.2 with new retry logic
14:32 - Error rate started climbing
14:35 - Alerts fired
14:37 - Oncall acknowledged
14:40 - Decision to rollback
14:45 - Rollback completed
14:52 - Service recovered

## Root Cause
Retry logic introduced infinite loop on certain errors.
Under load, exhausted connection pool.

## Impact
- 500 users experienced checkout failures
- Estimated revenue loss: $5,000
- No data loss or corruption

## Action Items
[ ] Add circuit breaker to retry logic (Owner: @alice, Due: 2/20)
[ ] Load testing for retry scenarios (Owner: @bob, Due: 2/22)
[ ] Connection pool monitoring (Owner: @carol, Due: 2/18)
[ ] Review deployment checklist (Owner: @dave, Due: 2/25)

## What Went Well
- Fast detection (2 min from deploy to alert)
- Clear rollback procedure
- Good communication

## What To Improve
- Load testing didn't catch this scenario
- Retry logic not reviewed carefully enough
- Need better connection pool visibility

Blameless Culture

Bad post-mortem:
"Bob deployed buggy code"
"Alice didn't review properly"
→ People hide mistakes

Good post-mortem:
"Deployment process didn't catch infinite loop"
"Need better load testing coverage"
"Review checklist should include retry logic patterns"
→ Focus on system improvements

2. Failure Mindset - Design For Failure

Murphy's Law For Architects

"Everything that can go wrong, will go wrong."

Architect's job: Assume failure, design resilience

Common failures:
- Server crash
- Network partition
- Database slow/down
- Disk full
- Memory leak
- Dependency timeout
- Data corruption
- Config error
- Human error (deployment)
- DDoS attack

Failure Mode Analysis

Before designing system, list failure scenarios:

# E-commerce Checkout - Failure Scenarios

## Component Failures
- Payment gateway down → Retry + queue
- Inventory service slow → Circuit breaker + cache
- Database connection pool exhausted → Connection limits
- Redis cache down → Fallback to database

## Network Failures
- Timeout to payment service → Async processing
- Packet loss → Retry with exponential backoff
- Network partition → Eventual consistency

## Data Failures
- Duplicate payment → Idempotency keys
- Race condition on inventory → Optimistic locking
- Data corruption → Checksums + validation

## Operational Failures
- Bad deployment → Gradual rollout + rollback plan
- Config change error → Config validation + dry-run
- Certificate expiry → Automated renewal + alerting
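
One mitigation listed above, idempotency keys for duplicate payments, is worth a closer look. A minimal sketch, assuming hypothetical paymentStore and chargeCard helpers (a real implementation also has to guard against two concurrent requests using the same key):

// Idempotency-key sketch: retries with the same key return the stored result
// instead of charging the card twice
async function processPaymentIdempotent(idempotencyKey, order) {
  const existing = await paymentStore.get(idempotencyKey)
  if (existing) {
    return existing   // already processed: no double charge
  }

  const result = await chargeCard(order)

  // Persist the outcome under the key so later retries become no-ops
  await paymentStore.set(idempotencyKey, result)
  return result
}

// The caller generates one key per logical payment attempt and reuses it on retry:
// await processPaymentIdempotent(`order-${order.id}`, order)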

Design For Degradation

Not all features equally critical:

// Example: E-commerce homepage

Critical (must work):
- Browse products ✅
- Search ✅
- Add to cart ✅
- Checkout ✅

Non-critical (can degrade):
- Recommendations ⚠️ → Show static list
- Reviews ⚠️ → Hide section
- Personalization ⚠️ → Generic experience
- Real-time inventory ⚠️ → Show "In stock" always

// Implementation
async function getHomepage() {
  const products = await getProducts() // Critical
  
  let recommendations = []
  try {
    recommendations = await getRecommendations()
  } catch (error) {
    // Degrade gracefully
    recommendations = DEFAULT_RECOMMENDATIONS
    logger.warn('Recommendations service down, using defaults')
  }
  
  return { products, recommendations }
}
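
The try/catch above only degrades when the recommendations call throws; a call that simply hangs can still hold up the whole page. A minimal timeout wrapper sketch (the 500ms budget is an illustrative assumption):

// Race the real call against a timer so a slow dependency degrades instead of blocking
function withTimeout(promise, ms, fallback) {
  const timer = new Promise((resolve) => setTimeout(() => resolve(fallback), ms))
  // Errors also degrade to the fallback, matching the try/catch above
  return Promise.race([promise.catch(() => fallback), timer])
}

// Usage inside getHomepage():
// recommendations = await withTimeout(getRecommendations(), 500, DEFAULT_RECOMMENDATIONS)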

Blast Radius Thinking

Limit failure impact:

flowchart TB
    subgraph "Small Blast Radius ✅"
    S1[Service A] --> S2[Service B]
    S3[Service C] --> S4[Service D]
    end
    
    subgraph "Large Blast Radius ❌"
    S5[Service E] --> S6[Central DB]
    S7[Service F] --> S6
    S8[Service G] --> S6
    S9[Service H] --> S6
    end
Small blast radius:
- Service B fails → only Service A affected
- Isolated failure domain

Large blast radius:
- Central DB fails → ALL services down
- Single point of failure

Strategies to reduce blast radius:

  • Microservices isolation
  • Database sharding
  • Multi-region deployment
  • Bulkhead pattern
  • Circuit breakers
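
A minimal sketch of the last strategy above, a circuit breaker, assuming a generic async callService function; the threshold and cooldown values are illustrative only:

// Circuit breaker sketch: open after repeated failures, fail fast during a cooldown,
// then let traffic through again once the cooldown expires
class CircuitBreaker {
  constructor(callService, { failureThreshold = 5, cooldownMs = 30_000 } = {}) {
    this.callService = callService
    this.failureThreshold = failureThreshold
    this.cooldownMs = cooldownMs
    this.failures = 0
    this.openedAt = null
  }

  async call(...args) {
    // While open, reject immediately instead of hammering the failing dependency
    if (this.openedAt && Date.now() - this.openedAt < this.cooldownMs) {
      throw new Error('Circuit open: failing fast')
    }

    try {
      const result = await this.callService(...args)
      this.failures = 0     // a success closes the circuit
      this.openedAt = null
      return result
    } catch (error) {
      this.failures += 1
      if (this.failures >= this.failureThreshold) {
        this.openedAt = Date.now()   // trip the breaker
      }
      throw error
    }
  }
}

// Usage: wrap a flaky dependency so its failures stay contained
// const recsBreaker = new CircuitBreaker(getRecommendations)
// const recs = await recsBreaker.call(userId)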

3. Monitoring & Alerting Strategy - Know Before Users Do

Monitoring Pyramid

            Alerts
           /      \
        Dashboards
       /      |      \
   Metrics   Logs   Traces
    
Bottom: Collect everything
Middle: Visualize important
Top: Alert critical only

What To Monitor

1. User-Facing Metrics (Most Important)

// These affect users directly
const userMetrics = {
  availability: '99.9%',        // Can users access?
  latency_p50: '100ms',         // How fast?
  latency_p95: '300ms',         // Slow requests?
  latency_p99: '1000ms',        // Worst case?
  error_rate: '0.1%',           // How many failures?
  success_rate: '99.9%'         // Overall health
}

2. System Health Metrics

const systemMetrics = {
  cpu_usage: '45%',
  memory_usage: '60%',
  disk_usage: '70%',
  network_io: '500 Mbps',
  connection_pool: '80/100 active'
}

3. Business Metrics

const businessMetrics = {
  orders_per_minute: 150,
  revenue_per_hour: '$5000',
  active_users: 2500,
  checkout_conversion: '3.2%'
}

The Golden Signals (Google SRE)

Monitor these 4 metrics for every service:

1. Latency - How long do requests take?
2. Traffic - How much demand is there?
3. Errors - How many requests fail?
4. Saturation - How "full" is the system?

Example: API service
- Latency: p95 = 200ms
- Traffic: 1000 req/s
- Errors: 0.5% error rate
- Saturation: CPU 60%, memory 70%
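
A minimal, dependency-free sketch of recording the first three signals per request; in a real system these would be exported to a metrics backend, and saturation comes from the host or orchestrator rather than from request data:

// In-memory golden-signal recorder (sketch only)
const stats = { latenciesMs: [], requests: 0, errors: 0 }

function recordRequest(latencyMs, isError) {
  stats.latenciesMs.push(latencyMs)
  stats.requests += 1
  if (isError) stats.errors += 1
}

function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b)
  return sorted[Math.ceil((p / 100) * sorted.length) - 1]
}

// Report latency, traffic, and errors once per minute, then reset the window
setInterval(() => {
  if (stats.requests === 0) return
  console.log({
    latency_p95_ms: percentile(stats.latenciesMs, 95),
    traffic_rps: stats.requests / 60,
    error_rate: stats.errors / stats.requests
  })
  stats.latenciesMs = []
  stats.requests = 0
  stats.errors = 0
}, 60_000)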

Alerting Philosophy

Alert on symptoms, not causes:

Bad alerts (causes):
- CPU > 80%
- Memory > 90%
- Disk > 85%

Why bad? Users don't care about CPU.

Good alerts (symptoms):
- Error rate > 1% for 5 minutes
- P95 latency > 1s for 5 minutes
- Availability < 99.9% last hour

Why good? These affect users.

Alert Fatigue Prevention

// Problem: Too many alerts
Alert: CPU > 80%  // Every day, not actionable
Alert: Disk > 70%  // Not urgent
Alert: Memory > 60%  // Normal

Team: Ignores all alerts

// Solution: Alert only on user impact
const alertConfig = {
  error_rate: {
    threshold: '1%',
    duration: '5 minutes',
    severity: 'critical',
    action: 'Page on-call immediately'
  },
  
  latency_p95: {
    threshold: '1000ms',
    duration: '10 minutes',
    severity: 'warning',
    action: 'Slack notification'
  }
}
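
A sketch of how the "for 5 minutes" condition might be evaluated, assuming one error-rate sample per minute; the evaluator pages only when every sample in the window breaches the threshold, so a single noisy spike stays quiet:

// Page only on a sustained breach, not on one bad sample
function shouldPage(samples, { threshold = 0.01, windowSize = 5 } = {}) {
  if (samples.length < windowSize) return false
  return samples.slice(-windowSize).every((errorRate) => errorRate > threshold)
}

shouldPage([0.002, 0.03, 0.004, 0.003, 0.002])  // false - brief spike
shouldPage([0.02, 0.025, 0.03, 0.04, 0.05])     // true  - sustained breach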

Runbooks For Common Issues

# Runbook: High Error Rate Alert

## Alert
Error rate > 1% for 5 minutes

## Immediate Actions
1. Check status page: https://status.company.com
2. Check recent deployments (last 2 hours)
3. Check dependency health (payment, database)

## Diagnosis Steps
1. View error logs: 
   `kubectl logs -l app=api --tail=100 | grep ERROR`
2. Check traces for slow requests
3. Review metrics dashboard

## Common Causes
- Recent deployment → Rollback
- Dependency timeout → Circuit breaker activated?
- Database slow → Check slow query log
- Rate limit hit → Check traffic spike

## Mitigation
- Rollback: `./scripts/rollback.sh`
- Disable feature: `feature-flag payment-v2 off`
- Scale up: `kubectl scale deployment api --replicas=10`

## Escalation
If not resolved in 15 minutes:
- Escalate to @architect
- Update incident channel
- Contact vendors if dependency issue

4. Capacity Planning - Scale Before You Need To

Growth Projection

// Current metrics (Feb 2024)
const current = {
  daily_users: 100_000,
  peak_requests_per_second: 1_000,
  database_size: '500 GB',
  monthly_cost: '$10_000'
}

// Growth rate: 20% per month
const projections = {
  '3_months': {
    users: 172_000,    // 1.2^3 = 1.72x
    rps: 1_720,
    db_size: '860 GB',
    cost: '$17_200'
  },
  
  '6_months': {
    users: 298_000,    // 1.2^6 = 2.98x
    rps: 2_980,
    db_size: '1.5 TB',
    cost: '$29_800'
  },
  
  '12_months': {
    users: 890_000,    // 1.2^12 = 8.9x
    rps: 8_900,
    db_size: '4.5 TB',
    cost: '$89_000'
  }
}
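
The hard-coded projections above follow from simple compounding; a small helper makes the growth assumption explicit and easy to re-run with a different rate:

// Compound monthly growth: value * (1 + rate)^months
function project(current, monthlyGrowthRate, months) {
  const factor = Math.pow(1 + monthlyGrowthRate, months)
  return {
    users: Math.round(current.daily_users * factor),
    rps: Math.round(current.peak_requests_per_second * factor)
  }
}

project({ daily_users: 100_000, peak_requests_per_second: 1_000 }, 0.2, 12)
// → { users: 891610, rps: 8916 }  (the ~8.9x figure above)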

Resource Planning Matrix

# Capacity Plan - Next 6 Months

## Current Capacity
- API servers: 10 instances, 60% CPU avg
- Database: 1 primary + 2 replicas, 70% connections
- Cache: 50GB Redis, 40% memory
- Max capacity: ~1500 RPS before degradation

## 6-Month Projection
- Expected load: 3000 RPS (2x current max)
- Peak traffic: 4500 RPS (holiday season)

## Action Plan
| Timeline | Action | Cost | Owner |
|----------|--------|------|-------|
| Month 1 | Add 10 API instances | +$2k/mo | DevOps |
| Month 2 | Database read replicas (2 → 4) | +$3k/mo | DBA |
| Month 2 | Redis cluster upgrade (50GB → 100GB) | +$1k/mo | DevOps |
| Month 3 | Database sharding preparation | $0 | Backend |
| Month 4 | Implement database sharding | $0 | Backend |
| Month 5 | CDN upgrade for static assets | +$1k/mo | Frontend |
| Month 6 | Load test at 5000 RPS | $0 | QA |

## Risk Assessment
- ⚠️ Database sharding complex, allow 2 months
- ⚠️ Need load testing environment
- API scaling straightforward

Load Testing Strategy

// Gradual load test
const loadTest = {
  baseline: {
    users: 1000,
    duration: '10 minutes',
    expected: 'p95 < 200ms, 0% errors'
  },
  
  target: {
    users: 3000,
    duration: '30 minutes',
    expected: 'p95 < 500ms, < 0.1% errors'
  },
  
  stress: {
    users: 5000,
    duration: '10 minutes',
    expected: 'Find breaking point'
  },
  
  spike: {
    users: '0 → 5000 in 1 minute',
    duration: '15 minutes',
    expected: 'System recovers gracefully'
  }
}

// What to monitor during load test
const metrics = [
  'Response time (p50, p95, p99)',
  'Error rate',
  'CPU, memory, disk usage',
  'Database connections',
  'Queue depth',
  'Cache hit rate'
]
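
A minimal sketch of the baseline and stress stages as a k6 script; the URL and thresholds are placeholders to adapt, and k6 is only one of several load-testing tools:

// k6 sketch: ramp to baseline, hold, push toward the stress level, then ramp down
import http from 'k6/http'
import { check, sleep } from 'k6'

export const options = {
  stages: [
    { duration: '2m', target: 1000 },    // ramp up to baseline
    { duration: '10m', target: 1000 },   // hold baseline
    { duration: '5m', target: 5000 },    // push toward the stress level
    { duration: '5m', target: 0 }        // ramp down, watch recovery
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],    // target: p95 < 500ms
    http_req_failed: ['rate<0.001']      // target: < 0.1% errors
  }
}

export default function () {
  const res = http.get('https://staging.example.com/api/products')  // placeholder URL
  check(res, { 'status is 200': (r) => r.status === 200 })
  sleep(1)
}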

Budget For Growth

Rule of thumb: Infrastructure cost = 5-10% of revenue

Example:
- Monthly revenue: $1M
- Acceptable infra cost: $50k-$100k
- Current cost: $30k
- Room for growth: $20k-$70k

Strategy:
- Optimize before scale (reduce cost 20%)
- Plan gradual increases
- Reserve budget for incidents

5. Migration Strategy - Change Production Safely

Migration Challenges

Why migrations are risky:
- Zero downtime requirement
- Large data volume
- Can't rollback easily
- User impact if failed
- Business pressure (revenue)

Example scenarios:
- Database migration (PostgreSQL → MySQL)
- Cloud migration (AWS → GCP)
- Architecture change (monolith → microservices)
- Data model change (schema evolution)

Strangler Fig Pattern

Gradually replace the old system with the new system:

flowchart LR
    Users --> Router{Router}
    Router -->|Old Features| Old[Old System]
    Router -->|New Features| New[New System]
    Old -.Data Sync.-> New
Phase 1: Build new system alongside old
Phase 2: Route new traffic to new system
Phase 3: Migrate old data gradually
Phase 4: Redirect more traffic to new
Phase 5: Deprecate old system

Timeline: months to years, not a big bang
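
A minimal sketch of the Phase 2 router, assuming the routing decision is made by path prefix; the prefixes and backend names are placeholders:

// Phase 2: only migrated features go to the new system,
// everything else keeps hitting the old one
const MIGRATED_PREFIXES = ['/api/notifications', '/api/search']  // placeholder paths

function pickBackend(path) {
  const useNew = MIGRATED_PREFIXES.some((prefix) => path.startsWith(prefix))
  return useNew ? 'new-system' : 'old-system'
}

pickBackend('/api/search')     // 'new-system'
pickBackend('/api/checkout')   // 'old-system' (not migrated yet)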

Database Migration Example

// Phase 1: Dual Write
async function updateUser(userId, data) {
  // Write to old database
  await oldDB.users.update(userId, data)
  
  // Also write to new database
  try {
    await newDB.users.update(userId, data)
  } catch (error) {
    // Log but don't fail
    logger.error('New DB write failed', error)
  }
}

// Phase 2: Verify Data Consistency
async function verifyData() {
  const oldData = await oldDB.users.find(userId)
  const newData = await newDB.users.find(userId)
  
  if (!isEqual(oldData, newData)) {
    alert('Data inconsistency detected')
  }
}

// Phase 3: Dual Read (verify new DB)
async function getUser(userId) {
  const data = await oldDB.users.find(userId)
  
  // Compare with new DB
  const newData = await newDB.users.find(userId)
  if (!isEqual(data, newData)) {
    logger.warn('Data mismatch', { userId, oldData: data, newData })
  }
  
  return data  // Still return old DB data
}

// Phase 4: Switch Reads to New DB
async function getUser(userId) {
  const data = await newDB.users.find(userId)
  
  // Fallback to old DB if failed
  if (!data) {
    logger.error('New DB read failed, falling back')
    return await oldDB.users.find(userId)
  }
  
  return data
}

// Phase 5: Stop Writing to Old DB
async function updateUser(userId, data) {
  await newDB.users.update(userId, data)
  // Old DB write removed
}

// Phase 6: Deprecate Old DB
// Remove old database entirely

Feature Flag for Migration

const featureFlags = {
  use_new_payment_service: {
    enabled: true,
    rollout: '10%',  // Start with 10% traffic
    rules: [
      { segment: 'internal_users', enabled: true },
      { segment: 'beta_testers', enabled: true }
    ]
  }
}

async function processPayment(order) {
  if (featureFlags.use_new_payment_service.enabled) {
    if (shouldEnableForUser(order.userId)) {
      try {
        return await newPaymentService.process(order)
      } catch (error) {
        // Fallback to old service
        logger.error('New payment failed, fallback', error)
        return await oldPaymentService.process(order)
      }
    }
  }
  
  return await oldPaymentService.process(order)
}
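
The shouldEnableForUser helper above is left undefined; a common approach hashes the user ID so each user gets a stable decision as the rollout percentage grows. A sketch, assuming Node's built-in crypto module:

// Stable percentage rollout: the same user always lands in the same bucket,
// so raising rollout from 10% to 25% only adds users, never flips existing ones
const crypto = require('crypto')

function shouldEnableForUser(userId, rolloutPercent = 10) {
  const hash = crypto.createHash('md5').update(String(userId)).digest('hex')
  const bucket = parseInt(hash.slice(0, 8), 16) % 100   // 0-99
  return bucket < rolloutPercent
}

shouldEnableForUser('user-123', 10)   // same answer on every call for this user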

Migration Checklist

# Migration Checklist

## Pre-Migration
[ ] Load test new system at expected traffic
[ ] Feature flags implemented
[ ] Rollback plan documented
[ ] Data validation scripts ready
[ ] Monitoring dashboards prepared
[ ] On-call team briefed
[ ] Stakeholders notified
[ ] Maintenance window scheduled (if needed)

## During Migration
[ ] Start with 1% traffic
[ ] Monitor error rates closely
[ ] Verify data consistency every hour
[ ] Gradual rollout: 1% → 5% → 10% → 25% → 50% → 100%
[ ] Each step: monitor 24-48 hours before next
[ ] Document any issues encountered

## Post-Migration
[ ] Full monitoring for 1 week
[ ] Data validation report
[ ] Performance comparison (old vs new)
[ ] Post-mortem meeting
[ ] Document lessons learned
[ ] Plan deprecation of old system

6. Technical Leadership - Influence Without Authority

Architect As Leader

Technical leadership ≠ Management

You don't have authority to:
- Assign tasks
- Fire people
- Approve budget

You DO have influence through:
- Technical expertise
- Earned trust
- Clear communication
- Good decisions track record

Build Trust Through Delivery

How to earn trust as architect:

Deliver results:
- Designs that actually work in production
- Solutions that scale when needed
- Smooth migrations without incidents

Be reliable:
- Available during incidents
- Follow through on commitments
- Respond to questions quickly

Admit mistakes:
- "I was wrong about X"
- "Better approach is Y"
- Learn publicly

Give credit:
- "Sarah's idea to use caching saved us"
- Highlight team contributions
- Share success

Technical Decision Making

Framework for architecture decisions:

# Architecture Decision Record (ADR)

## Context
We need to redesign our notification system. 
Current system:
- Sends 1M notifications/day
- Growing 50% per quarter
- Frequent delays and failures

## Decision
Adopt event-driven architecture with Kafka

## Alternatives Considered
1. **RabbitMQ + workers** 
   - Pros: Team familiar, simpler
   - Cons: Scale limitations, no replay
   
2. **AWS SNS/SQS**
   - Pros: Managed, reliable
   - Cons: Vendor lock-in, cost at scale
   
3. **Kafka** (CHOSEN)
   - Pros: High throughput, replay, ecosystem
   - Cons: Operational complexity, learning curve

## Consequences
Positive:
- Handle 10M+ notifications/day
- Message replay for debugging
- Real-time analytics possible

Negative:
- Team needs Kafka training
- More operational overhead
- Migration takes 3 months

## Timeline
- Month 1: Kafka setup + training
- Month 2: Build new notification service
- Month 3: Gradual migration

Influence Through Data

// Instead of: "We need to rewrite this!"
// Use data to make the case:

const currentMetrics = {
  deployment_frequency: '1/week',
  lead_time: '2 weeks',
  mttr: '4 hours',
  change_failure_rate: '15%'
}

const industryBenchmark = {
  elite_performers: {
    deployment_frequency: 'Multiple/day',
    lead_time: '< 1 day',
    mttr: '< 1 hour',
    change_failure_rate: '0-15%'
  }
}

const proposal = `
Problem: Our lead time is 14x slower than elite teams
Impact: Competitors ship features faster
Solution: Move to microservices + CI/CD
Expected improvement: Lead time 2 weeks → 2 days
Cost: 3 months, 4 engineers
ROI: Ship 7x faster, competitive advantage
`

Communication Skills

Explain technical to non-technical:

Technical jargon:
"We need to implement horizontal pod autoscaling 
with Kubernetes HPA based on custom metrics from 
Prometheus to handle traffic spikes"

Business language:
"Currently, traffic spikes crash our site. 
This causes lost revenue.
Solution: Automatically add servers when busy.
Cost: $5k/month
Benefit: No more outages, protect $50k/hour revenue"

Write clear technical docs:

# Bad Documentation
Uses Redis for caching. Configure TTL appropriately.

# Good Documentation
## Caching Strategy

We use Redis to cache API responses, reducing database load by 80%.

### When to Cache
- Cache: User profiles, product lists (rarely change)
- Don't cache: Real-time inventory, prices (change often)

### TTL Guidelines
- User profiles: 1 hour (okay if slightly stale)
- Product lists: 5 minutes (update regularly)
- Search results: 30 seconds (balance fresh vs performance)

### Example
```javascript
await redis.set(`user:${id}`, data, { ttl: 3600 })
```

### Monitoring
- Hit rate: https://dashboard/redis
- Alert if hit rate < 80%

Code Review Excellence

// Bad review
"This is wrong. Use Redis instead."

// Good review
/*
Concern: In-memory cache won't scale across multiple instances.

Issue:
- Each instance has separate cache
- Cache miss rate will be high
- Duplicated memory usage

Suggestion:
Consider shared Redis cache:
- Single source of truth
- Better hit rate
- Scales horizontally

If Redis not possible:
- Document limitations
- Add metrics for cache hit rate
- Plan migration when needed

Happy to discuss trade-offs!
*/

Mentoring Engineers

Junior engineer: "Should I use MongoDB or PostgreSQL?"

Bad answer:
"Use PostgreSQL, it's better."

Good answer:
"Great question! Let's think through trade-offs:

What's your data like?
- Structured, relations? → PostgreSQL
- Unstructured, flexible schema? → MongoDB

What operations?
- Complex queries, joins? → PostgreSQL
- Simple CRUD, fast writes? → MongoDB

What's your team familiar with?
- SQL experience? → PostgreSQL easier

For your use case (user profiles with relations),
I'd recommend PostgreSQL because:
1. You have structured data
2. Need join queries (users + orders)
3. Team knows SQL

But either could work. Want to prototype both?"

7. Architecture Review Process - Prevent Problems Early

Review Triggers

When to require architecture review:

Always review:
- New service/system
- Major refactor
- Database schema change
- New technology adoption
- Security-sensitive changes
- Cross-team dependencies

⚠️ Consider review:
- Performance-critical path
- Complex algorithms
- External API integration

No review needed:
- Bug fixes
- UI changes
- Copy updates
- Small refactors

Review Template

# Architecture Review: Payment Service Redesign

## Requestor
Sarah Chen, Backend Lead

## Overview
Redesign payment service to support international payments

## Current System
- Single payment provider (Stripe)
- USD only
- Synchronous processing
- ~1000 transactions/day

## Proposed System
- Multiple payment providers (Stripe, PayPal, Adyen)
- Multi-currency support
- Async processing via Kafka
- Expected 10k transactions/day

## Architecture Diagram
[Insert diagram]

## Key Design Decisions

### 1. Multiple Payment Providers
Decision: Support 3+ providers
Rationale: Regional preferences, redundancy
Trade-off: Complexity vs flexibility

### 2. Async Processing
Decision: Use Kafka for payment events
Rationale: Decouple, handle spikes, retry
Trade-off: Eventual consistency

### 3. Provider Abstraction
Decision: Abstract provider behind interface
Rationale: Easy to add/remove providers
Trade-off: Extra abstraction layer

## Concerns & Risks
1. **Currency conversion**
   - Risk: Exchange rate changes during processing
   - Mitigation: Lock rate at checkout

2. **Provider failover**
   - Risk: Primary provider down
   - Mitigation: Automatic fallback to secondary

3. **Data consistency**
   - Risk: Event lost in Kafka
   - Mitigation: At-least-once delivery, idempotency

## Scale Considerations
- Current: 1k transactions/day
- 1 year: 50k transactions/day (50x)
- Design handles: 100k transactions/day

## Security
- PCI compliance maintained
- No card data stored
- Provider tokens encrypted
- Audit logging

## Monitoring
- Transaction success rate per provider
- Latency p95, p99
- Currency conversion accuracy
- Failed payment alerts

## Timeline
- Week 1-2: Build provider abstraction
- Week 3-4: Integrate providers
- Week 5-6: Kafka setup + event handlers
- Week 7-8: Migration + testing
- Week 9: Gradual rollout

## Questions for Reviewers
1. Provider abstraction too complex?
2. Kafka overkill for current scale?
3. Missing failure scenarios?

## Reviewers
- @architect-alice (infra)
- @architect-bob (security)
- @devops-lead (operations)

Review Guidelines

For Reviewers:

Focus on:
- Failure scenarios missed?
- Scale considerations?
- Security issues?
- Operational complexity?
- Better alternatives?

Avoid:
- Bikeshedding (naming, formatting)
- "I would do it differently" without reasoning
- Perfectionism ("This isn't optimal")

🎯 Goal:
Help improve design, not block progress

For Design Author:

Be open to:
- Critical feedback
- Alternative approaches
- Simpler solutions

Avoid:
- Defensive reactions
- "Trust me, it will work"
- Dismissing concerns

🎯 Goal:
Learn, improve design, build consensus

Approval Criteria

## Definition of Approved

Design can be approved when:
- [ ] Failure scenarios identified and mitigated
- [ ] Scale considerations addressed
- [ ] Security reviewed
- [ ] Operational impact understood
- [ ] Monitoring strategy defined
- [ ] Rollback plan exists
- [ ] 2+ architects approved
- [ ] No blocking concerns

⚠️ Approved with conditions:
- Minor concerns to address during implementation
- Follow-up reviews scheduled

Not approved:
- Major risks not mitigated
- Scale concerns not addressed
- Blocking security issues

8. Final Mindset - You Own The System

Production Ownership Principles

1. Think Long-Term

Code you write today:
- You'll maintain for years
- Others will extend
- Will run at 10x scale
- Will fail eventually

Write code you're proud to own.

2. Empathy For Oncall

Before merging:
"Will this wake someone at 3AM?"
"Is this debuggable?"
"Can it be rolled back?"
"Are errors actionable?"

If answer is "maybe" → improve design.

3. Documentation As Love Letter

Document for your future self (6 months later):
- Why this design?
- What are trade-offs?
- How to debug?
- How to scale?

You'll thank yourself.

4. Measure Everything

"In God we trust. All others must bring data."
- W. Edwards Deming

No metrics = No visibility = Cannot improve

Add metrics first, optimize later.

5. Automate Toil

Repetitive tasks:
- Deployments
- Rollbacks
- Scaling
- Monitoring setup
- Incident response

Automate once, benefit forever.

Career Growth Path

flowchart TB
    Junior[Junior Engineer<br/>Write code] --> Mid[Mid Engineer<br/>Own features]
    Mid --> Senior[Senior Engineer<br/>Design systems]
    Senior --> Staff[Staff Engineer<br/>Cross-team impact]
    Staff --> Principal[Principal Engineer<br/>Company-wide strategy]

Each level:

Junior → Mid:
- Code quality
- Feature ownership
- Learn system design

Mid → Senior:
- Design systems
- Trade-off thinking
- Mentor others

Senior → Staff:
- Cross-team impact
- Production thinking
- Technical strategy

Staff → Principal:
- Company-wide architecture
- Hire & build teams
- Set technical direction

What Differentiates Great Architects

Average Architect:
- Knows patterns
- Follows best practices
- Designs systems

Great Architect:
- Thinks about failure
- Simplifies complexity
- Enables others
- Owns production
- Communicates clearly
- Earns trust
- Ships results

The 10 Commandments of Production Systems

1. **Monitor everything** - You cannot improve what you cannot measure

2. **Design for failure** - Everything will fail. Plan for it.

3. **Start simple** - Add complexity only when needed

4. **Think in trade-offs** - No perfect solution. Choose wisely.

5. **Document decisions** - Future you will ask "Why?"

6. **Automate toil** - Humans for thinking, machines for repetition

7. **Security by default** - Easier to relax than tighten later

8. **Test at scale** - Load test before users do it for you

9. **Gradual rollouts** - Big bang deployments = big bang failures

10. **Own your code** - You build it, you run it

Key Takeaways

Engineer → Architect Mindset Shifts

1. From Code → System

  • Engineer: "Feature works on my machine"
  • Architect: "System works at scale in production"

2. From Happy Path → Failure Scenarios

  • Engineer: "If everything works..."
  • Architect: "When things fail..."

3. From Solo → Team

  • Engineer: Write code alone
  • Architect: Enable team to succeed

4. From Build → Own

  • Engineer: Ship and move on
  • Architect: Own it in production

5. From Technical → Business

  • Engineer: "This is technically better"
  • Architect: "This provides business value"

Production Thinking Checklist

Before shipping to production:

[ ] What's the blast radius if this fails?
[ ] Can we rollback in under 5 minutes?
[ ] Do we have monitoring for this?
[ ] Are errors actionable?
[ ] Did we load test at 2x expected traffic?
[ ] Is there a runbook for common issues?
[ ] Can the oncall engineer debug this at 3AM?
[ ] What's the impact if this service is down?
[ ] Do we have a gradual rollout plan?
[ ] Have we considered all failure modes?

Technical Leadership Principles

Communication:

  • Explain technical in business terms
  • Write docs others can understand
  • Overcommunicate in incidents

Decision Making:

  • Use data, not opinions
  • Document trade-offs
  • Include team in decisions

Influence:

  • Build trust through delivery
  • Give credit, take blame
  • Mentor generously

Ownership:

  • Available during incidents
  • Follow through on commitments
  • Think long-term

Final Thought

The journey from engineer to architect is not about learning more patterns. It's about taking ownership.

You don't need to know every technology. You need to:

  • Think in trade-offs
  • Design for failure
  • Own production
  • Enable your team
  • Communicate clearly
  • Ship results

Great architects are not built in a day. They're built through:

  • Responding to production incidents
  • Making mistakes and learning
  • Earning trust over time
  • Shipping systems that scale
  • Helping others succeed

You're ready when you realize: The system is yours. Own it.
