Learn to think like an architect: production incident mindset, failure thinking, monitoring strategy, capacity planning, migration approach, and technical leadership - from engineer to trusted architect
After mastering all the technical knowledge - patterns, distributed systems, scalability - one question remains: "What makes a trusted architect?"
The answer is not a technical skill. It is the mindset shift from "write code" to "own the system".
An engineer writes code to solve a problem. An architect designs the system and takes responsibility when production goes down at 3AM.
This lesson doesn't teach more patterns. It teaches you to think like the person who has to be on-call, explain the outage to the CEO, and plan the migration for 100M users.
Engineer thinking:
"Feature works on my laptop ✅"
"Passed all tests ✅"
"Deployed to production ✅"
→ Move to next ticket
Architect thinking:
"Will this work under 10x load?"
"What if database fails?"
"How do we rollback?"
"What's monitoring strategy?"
"Who's on-call when this breaks?"
→ System ownership
Core difference: Architects think beyond the happy path.
Scenario: Production alert at 3AM
Engineer response:
"Not my code, not my problem"
"Let's check tomorrow"
Architect response:
"What's user impact?"
"How do we mitigate now?"
"Root cause analysis tomorrow"
"How do we prevent this?"
Architect mindset = Ownership mindset.
Fundamental truth: All systems fail. Your job is minimizing impact.
flowchart TB
Normal[Normal Operation] --> Incident[Incident Detected]
Incident --> Assess[Assess Impact]
Assess --> Mitigate[Mitigate ASAP]
Mitigate --> RCA[Root Cause Analysis]
RCA --> Prevent[Prevention Measures]
Prevent --> Normal
Phase 1: Detection (Minutes matter)
// Bad: Discover from user complaints
User tweet: "Your site is down!"
→ 30 minutes before team knows
// Good: Proactive monitoring
Alert: "API error rate: 5% (threshold: 1%)"
Slack: "@oncall immediate action required"
→ Team knows within 2 minutes
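A minimal sketch of what proactive detection can look like: a synthetic probe that checks the API itself and pages the on-call channel. The endpoint and the Slack webhook URL are illustrative placeholders.

```javascript
// Synthetic probe: check the API every 30s and page on-call on failure,
// instead of waiting 30 minutes for users to tweet about it.
const HEALTH_URL = 'https://api.example.com/health'        // illustrative
const SLACK_WEBHOOK_URL = process.env.SLACK_WEBHOOK_URL    // illustrative

async function probe() {
  try {
    const res = await fetch(HEALTH_URL, { signal: AbortSignal.timeout(5000) })
    if (!res.ok) throw new Error(`HTTP ${res.status}`)
  } catch (error) {
    // Alert the on-call channel within one probe interval
    await fetch(SLACK_WEBHOOK_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text: `@oncall health check failed: ${error.message}` })
    })
  }
}

setInterval(probe, 30_000)
```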
Phase 2: Assessment (Understand before act)
Critical questions:
1. User impact? (How many users affected?)
2. Business impact? (Revenue loss? SLA breach?)
3. Scope? (One service? Entire system?)
4. Trend? (Getting worse? Stable?)
Example:
- 15% of checkout requests failing
- ~500 users affected
- Payment service only
- Error rate stable last 10 min
→ High severity, contained scope
Phase 3: Mitigation (Fix now, understand later)
Priority: Stop the bleeding
Options (in order):
1. Rollback (safest, fastest)
2. Feature flag off (disable new code)
3. Traffic routing (shift to healthy instances)
4. Graceful degradation (disable non-critical features)
5. Emergency fix (last resort)
Example:
Payment service down?
→ Rollback to previous version (5 min)
Not: Debug root cause in production (hours)
Phase 4: Communication (Keep stakeholders informed)
Incident update template:
Status: INVESTIGATING
Impact: 15% checkout failures
Started: 14:30 UTC
Current action: Rolling back payment service to v2.3.1
Next update: 14:45 UTC
Audience:
- Internal team (Slack #incidents)
- Support team (brief customers)
- Management (business impact)
- Customers if needed (status page)
Phase 5: Root Cause Analysis (Learn from failure)
# Post-Mortem: Payment Service Outage
Date: 2024-02-15
Duration: 14:30 - 14:52 UTC (22 minutes)
Impact: 15% checkout failures, ~500 affected users
## Timeline
14:30 - Deployed v2.3.2 with new retry logic
14:32 - Error rate started climbing
14:35 - Alerts fired
14:37 - Oncall acknowledged
14:40 - Decision to rollback
14:45 - Rollback completed
14:52 - Service recovered
## Root Cause
Retry logic introduced infinite loop on certain errors.
Under load, exhausted connection pool.
## Impact
- 500 users experienced checkout failures
- Estimated revenue loss: $5,000
- No data loss or corruption
## Action Items
[ ] Add circuit breaker to retry logic (Owner: @alice, Due: 2/20)
[ ] Load testing for retry scenarios (Owner: @bob, Due: 2/22)
[ ] Connection pool monitoring (Owner: @carol, Due: 2/18)
[ ] Review deployment checklist (Owner: @dave, Due: 2/25)
## What Went Well
- Fast detection (5 min from deploy to alert, 3 min after errors began)
- Clear rollback procedure
- Good communication
## What To Improve
- Load testing didn't catch this scenario
- Retry logic not reviewed carefully enough
- Need better connection pool visibility
Bad post-mortem:
"Bob deployed buggy code"
"Alice didn't review properly"
→ People hide mistakes
Good post-mortem:
"Deployment process didn't catch infinite loop"
"Need better load testing coverage"
"Review checklist should include retry logic patterns"
→ Focus on system improvements
"Everything that can go wrong, will go wrong."
Architect's job: Assume failure, design resilience
Common failures:
- Server crash
- Network partition
- Database slow/down
- Disk full
- Memory leak
- Dependency timeout
- Data corruption
- Config error
- Human error (deployment)
- DDoS attack
Before designing system, list failure scenarios:
# E-commerce Checkout - Failure Scenarios
## Component Failures
- Payment gateway down → Retry + queue
- Inventory service slow → Circuit breaker + cache
- Database connection pool exhausted → Connection limits
- Redis cache down → Fallback to database
## Network Failures
- Timeout to payment service → Async processing
- Packet loss → Retry with exponential backoff
- Network partition → Eventual consistency
## Data Failures
- Duplicate payment → Idempotency keys (see the sketch after this list)
- Race condition on inventory → Optimistic locking
- Data corruption → Checksums + validation
## Operational Failures
- Bad deployment → Gradual rollout + rollback plan
- Config change error → Config validation + dry-run
- Certificate expiry → Automated renewal + alerting
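For example, the "duplicate payment → idempotency keys" item above could look like the following sketch; `db` and `paymentGateway` are placeholder dependencies, and the payments table is assumed to have a unique index on `idempotencyKey`.

```javascript
// Idempotency keys: the client sends the same key on every retry, so a
// retried request returns the original result instead of charging twice.
async function chargeOnce(idempotencyKey, order) {
  // 1. Already processed? Return the stored result.
  const existing = await db.payments.findOne({ idempotencyKey })
  if (existing && existing.status === 'done') return existing.result

  // 2. Reserve the key (the unique index makes concurrent retries fail fast).
  if (!existing) {
    await db.payments.insertOne({ idempotencyKey, status: 'pending' })
  }

  // 3. Charge once and persist the result for any future retries.
  const result = await paymentGateway.charge(order)
  await db.payments.updateOne({ idempotencyKey }, { $set: { status: 'done', result } })
  return result
}
```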
Not all features equally critical:
// Example: E-commerce homepage
Critical (must work):
- Browse products ✅
- Search ✅
- Add to cart ✅
- Checkout ✅
Non-critical (can degrade):
- Recommendations ⚠️ → Show static list
- Reviews ⚠️ → Hide section
- Personalization ⚠️ → Generic experience
- Real-time inventory ⚠️ → Show "In stock" always
// Implementation
async function getHomepage() {
const products = await getProducts() // Critical
let recommendations = []
try {
recommendations = await getRecommendations()
} catch (error) {
// Degrade gracefully
recommendations = DEFAULT_RECOMMENDATIONS
logger.warn('Recommendations service down, using defaults')
}
return { products, recommendations }
}
Limit failure impact:
flowchart TB
subgraph "Small Blast Radius ✅"
S1[Service A] --> S2[Service B]
S3[Service C] --> S4[Service D]
end
subgraph "Large Blast Radius ❌"
S5[Service E] --> S6[Central DB]
S7[Service F] --> S6
S8[Service G] --> S6
S9[Service H] --> S6
end
Small blast radius:
- Service B fails → only Service A affected
- Isolated failure domain
Large blast radius:
- Central DB fails → ALL services down
- Single point of failure
Strategies to reduce blast radius: isolate failure domains (one database per service rather than a shared central database), bulkhead critical dependencies so one slow downstream cannot drag everything down (sketched below), and roll out changes gradually so a bad deploy only hits a fraction of traffic.
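A minimal bulkhead sketch: cap in-flight calls per dependency so one slow downstream cannot exhaust every worker in the service. Limits and names are illustrative.

```javascript
// Bulkhead: each dependency gets its own small concurrency budget, so a
// hung recommendations service cannot take the whole process down with it.
function createBulkhead(maxConcurrent) {
  let inFlight = 0
  return async function run(task) {
    if (inFlight >= maxConcurrent) {
      // Fail fast instead of queueing forever and piling up memory
      throw new Error('Bulkhead full')
    }
    inFlight++
    try {
      return await task()
    } finally {
      inFlight--
    }
  }
}

// Usage (inside an async function): isolate the recommendations dependency
const recommendationsBulkhead = createBulkhead(20)
// const recs = await recommendationsBulkhead(() => getRecommendations())
```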
The observability pyramid:
- Bottom - Metrics, Logs, Traces: collect everything
- Middle - Dashboards: visualize what's important
- Top - Alerts: alert on critical issues only
1. User-Facing Metrics (Most Important)
// These affect users directly
const userMetrics = {
availability: '99.9%', // Can users access?
latency_p50: '100ms', // How fast?
latency_p95: '300ms', // Slow requests?
latency_p99: '1000ms', // Worst case?
error_rate: '0.1%', // How many failures?
success_rate: '99.9%' // Overall health
}
2. System Health Metrics
const systemMetrics = {
cpu_usage: '45%',
memory_usage: '60%',
disk_usage: '70%',
network_io: '500 Mbps',
connection_pool: '80/100 active'
}
3. Business Metrics
const businessMetrics = {
orders_per_minute: 150,
revenue_per_hour: '$5000',
active_users: 2500,
checkout_conversion: '3.2%'
}
Monitor these 4 metrics for every service:
1. Latency - How long do requests take?
2. Traffic - How much demand?
3. Errors - How many failures?
4. Saturation - How "full" is the system?
Example: API service
- Latency: p95 = 200ms
- Traffic: 1000 req/s
- Errors: 0.5% error rate
- Saturation: CPU 60%, memory 70%
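A sketch of computing the four signals from a window of recorded requests; the record shape (`durationMs`, `failed`) and the CPU input are assumptions for illustration.

```javascript
// Compute the four golden signals over the last window of requests,
// where each record looks like { durationMs, failed }.
function goldenSignals(requests, windowSeconds, cpuPercent) {
  const durations = requests.map(r => r.durationMs).sort((a, b) => a - b)
  const pct = p => durations[Math.min(durations.length - 1, Math.floor(durations.length * p))] ?? 0

  return {
    latency: { p50: pct(0.5), p95: pct(0.95), p99: pct(0.99) },                 // how long
    traffic: requests.length / windowSeconds,                                    // req/s
    errorRate: requests.filter(r => r.failed).length / (requests.length || 1),  // failures
    saturation: cpuPercent                                                       // how "full"
  }
}
```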
Alert on symptoms, not causes:
Bad alerts (causes):
- CPU > 80%
- Memory > 90%
- Disk > 85%
Why bad? Users don't care about CPU.
Good alerts (symptoms):
- Error rate > 1% for 5 minutes
- P95 latency > 1s for 5 minutes
- Availability < 99.9% last hour
Why good? These affect users.
// Problem: Too many alerts
Alert: CPU > 80% // Every day, not actionable
Alert: Disk > 70% // Not urgent
Alert: Memory > 60% // Normal
Team: Ignores all alerts
// Solution: Alert only on user impact
const alertConfig = {
error_rate: {
threshold: '1%',
duration: '5 minutes',
severity: 'critical',
action: 'Page on-call immediately'
},
latency_p95: {
threshold: '1000ms',
duration: '10 minutes',
severity: 'warning',
action: 'Slack notification'
}
}
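A sketch of how the "threshold for N minutes" rule above could be evaluated, so a single noisy spike does not page anyone:

```javascript
// Only fire when the metric stays above threshold for the whole duration -
// a single short spike should not page the on-call engineer.
function createAlertEvaluator({ threshold, durationMs }) {
  let breachedSince = null
  return function evaluate(value, now = Date.now()) {
    if (value <= threshold) {
      breachedSince = null                       // recovered: reset the window
      return false
    }
    breachedSince = breachedSince ?? now          // first breach: start the clock
    return now - breachedSince >= durationMs      // sustained breach: alert
  }
}

// Critical rule above: error rate > 1% for 5 minutes -> page on-call
const errorRateAlert = createAlertEvaluator({ threshold: 0.01, durationMs: 5 * 60_000 })
// Call errorRateAlert(currentErrorRate) once per scrape interval.
```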
# Runbook: High Error Rate Alert
## Alert
Error rate > 1% for 5 minutes
## Immediate Actions
1. Check status page: https://status.company.com
2. Check recent deployments (last 2 hours)
3. Check dependency health (payment, database)
## Diagnosis Steps
1. View error logs:
`kubectl logs -l app=api --tail=100 | grep ERROR`
2. Check traces for slow requests
3. Review metrics dashboard
## Common Causes
- Recent deployment → Rollback
- Dependency timeout → Circuit breaker activated?
- Database slow → Check slow query log
- Rate limit hit → Check traffic spike
## Mitigation
- Rollback: `./scripts/rollback.sh`
- Disable feature: `feature-flag payment-v2 off`
- Scale up: `kubectl scale deployment api --replicas=10`
## Escalation
If not resolved in 15 minutes:
- Escalate to @architect
- Update incident channel
- Contact vendors if dependency issue
// Current metrics (Feb 2024)
const current = {
daily_users: 100_000,
peak_requests_per_second: 1_000,
database_size: '500 GB',
monthly_cost: '$10_000'
}
// Growth rate: 20% per month
const projections = {
'3_months': {
users: 172_000, // 1.2^3 = 1.72x
rps: 1_720,
db_size: '860 GB',
cost: '$17_200'
},
'6_months': {
users: 298_000, // 1.2^6 = 2.98x
rps: 2_980,
db_size: '1.5 TB',
cost: '$29_800'
},
'12_months': {
users: 890_000, // 1.2^12 = 8.9x
rps: 8_900,
db_size: '4.5 TB',
cost: '$89_000'
}
}
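The same numbers fall out of a small compounding-growth helper instead of hand-maintained rows (a sketch; 20% per month is the assumption above):

```javascript
// Compound monthly growth from today's baseline: value * (1 + g)^months
function project(current, monthlyGrowth, months) {
  const factor = Math.pow(1 + monthlyGrowth, months)   // e.g. 1.2^12 ≈ 8.9
  return {
    users: Math.round(current.daily_users * factor),
    rps: Math.round(current.peak_rps * factor),
    db_size_gb: Math.round(current.db_size_gb * factor),
    monthly_cost_usd: Math.round(current.monthly_cost_usd * factor)
  }
}

const baseline = { daily_users: 100_000, peak_rps: 1_000, db_size_gb: 500, monthly_cost_usd: 10_000 }
console.log(project(baseline, 0.2, 12))
// → { users: 891610, rps: 8916, db_size_gb: 4458, monthly_cost_usd: 89161 }
```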
# Capacity Plan - Next 6 Months
## Current Capacity
- API servers: 10 instances, 60% CPU avg
- Database: 1 primary + 2 replicas, 70% connections
- Cache: 50GB Redis, 40% memory
- Max capacity: ~1500 RPS before degradation
## 6-Month Projection
- Expected load: 3000 RPS (2x current max)
- Peak traffic: 4500 RPS (holiday season)
## Action Plan
| Timeline | Action | Cost | Owner |
|----------|--------|------|-------|
| Month 1 | Add 10 API instances | +$2k/mo | DevOps |
| Month 2 | Database read replicas (2 → 4) | +$3k/mo | DBA |
| Month 2 | Redis cluster upgrade (50GB → 100GB) | +$1k/mo | DevOps |
| Month 3 | Database sharding preparation | $0 | Backend |
| Month 4 | Implement database sharding | $0 | Backend |
| Month 5 | CDN upgrade for static assets | +$1k/mo | Frontend |
| Month 6 | Load test at 5000 RPS | $0 | QA |
## Risk Assessment
- ⚠️ Database sharding complex, allow 2 months
- ⚠️ Need load testing environment
- API scaling straightforward
// Gradual load test
const loadTest = {
baseline: {
users: 1000,
duration: '10 minutes',
expected: 'p95 < 200ms, 0% errors'
},
target: {
users: 3000,
duration: '30 minutes',
expected: 'p95 < 500ms, < 0.1% errors'
},
stress: {
users: 5000,
duration: '10 minutes',
expected: 'Find breaking point'
},
spike: {
users: '0 → 5000 in 1 minute',
duration: '15 minutes',
expected: 'System recovers gracefully'
}
}
// What to monitor during load test
const metrics = [
'Response time (p50, p95, p99)',
'Error rate',
'CPU, memory, disk usage',
'Database connections',
'Queue depth',
'Cache hit rate'
]
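The target and spike phases above, written as a k6 script (assuming k6 as the load-testing tool; the URL and thresholds are illustrative):

```javascript
// k6 sketch: ramp to target load, hold, spike, and enforce the pass
// criteria from the plan above as thresholds.
import http from 'k6/http'
import { sleep } from 'k6'

export const options = {
  stages: [
    { duration: '5m', target: 3000 },    // ramp up to target
    { duration: '30m', target: 3000 },   // hold: p95 < 500ms, < 0.1% errors
    { duration: '1m', target: 5000 },    // spike
    { duration: '15m', target: 5000 },   // does the system recover gracefully?
    { duration: '5m', target: 0 }        // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],    // milliseconds
    http_req_failed: ['rate<0.001']      // < 0.1% errors
  }
}

export default function () {
  http.get('https://staging.example.com/api/products')   // illustrative endpoint
  sleep(1)
}
```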
Rule of thumb: Infrastructure cost = 5-10% of revenue
Example:
- Monthly revenue: $1M
- Acceptable infra cost: $50k-$100k
- Current cost: $30k
- Room for growth: $20k-$70k
Strategy:
- Optimize before scale (reduce cost 20%)
- Plan gradual increases
- Reserve budget for incidents
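The rule of thumb as a quick calculation (a sketch of the arithmetic above):

```javascript
// Infrastructure budget headroom: 5-10% of monthly revenue is acceptable.
function infraBudget(monthlyRevenue, currentInfraCost) {
  const min = monthlyRevenue * 0.05
  const max = monthlyRevenue * 0.10
  return {
    acceptableRange: [min, max],
    withinBudget: currentInfraCost <= max,
    roomForGrowth: [min - currentInfraCost, max - currentInfraCost]
  }
}

console.log(infraBudget(1_000_000, 30_000))
// → acceptableRange [50000, 100000], roomForGrowth [20000, 70000]
```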
Why migrations are risky:
- Zero downtime requirement
- Large data volume
- Can't rollback easily
- User impact if failed
- Business pressure (revenue)
Example scenarios:
- Database migration (PostgreSQL → MySQL)
- Cloud migration (AWS → GCP)
- Architecture change (monolith → microservices)
- Data model change (schema evolution)
Gradually replace the old system with the new system:
flowchart LR
Users --> Router{Router}
Router -->|Old Features| Old[Old System]
Router -->|New Features| New[New System]
Old -.Data Sync.-> New
Phase 1: Build new system alongside old
Phase 2: Route new traffic to new system
Phase 3: Migrate old data gradually
Phase 4: Redirect more traffic to new
Phase 5: Deprecate old system
Timeline: Months to years, not a big bang
// Phase 1: Dual Write
async function updateUser(userId, data) {
// Write to old database
await oldDB.users.update(userId, data)
// Also write to new database
try {
await newDB.users.update(userId, data)
} catch (error) {
// Log but don't fail
logger.error('New DB write failed', error)
}
}
// Phase 2: Verify Data Consistency
async function verifyData(userId) {
const oldData = await oldDB.users.find(userId)
const newData = await newDB.users.find(userId)
if (!isEqual(oldData, newData)) {
alert('Data inconsistency detected')
}
}
// Phase 3: Dual Read (verify new DB)
async function getUser(userId) {
const data = await oldDB.users.find(userId)
// Compare with new DB
const newData = await newDB.users.find(userId)
if (!isEqual(data, newData)) {
logger.warn('Data mismatch', { userId, oldData: data, newData })
}
return data // Still return old DB data
}
// Phase 4: Switch Reads to New DB
async function getUser(userId) {
const data = await newDB.users.find(userId)
// Fallback to old DB if failed
if (!data) {
logger.error('New DB read failed, falling back')
return await oldDB.users.find(userId)
}
return data
}
// Phase 5: Stop Writing to Old DB
async function updateUser(userId, data) {
await newDB.users.update(userId, data)
// Old DB write removed
}
// Phase 6: Deprecate Old DB
// Remove old database entirely
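The "migrate old data gradually" step (Phase 3 of the strangler approach) can be sketched as a batched backfill that runs alongside the dual writes; `findAfter` and `upsert` are illustrative helpers following the `oldDB`/`newDB` naming above.

```javascript
// Backfill in small batches: never hammer the old database, and make the
// job safe to stop and resume from the last cursor position.
async function backfillUsers(batchSize = 500) {
  let cursor = 0
  while (true) {
    const batch = await oldDB.users.findAfter(cursor, batchSize)  // next batch by id
    if (batch.length === 0) break

    for (const user of batch) {
      // Upsert so re-running the job (or a racing dual write) stays safe
      await newDB.users.upsert(user.id, user)
    }

    cursor = batch[batch.length - 1].id
    logger.info('Backfill progress', { cursor, copied: batch.length })
    await new Promise(resolve => setTimeout(resolve, 100))        // throttle
  }
}
```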
const featureFlags = {
use_new_payment_service: {
enabled: true,
rollout: '10%', // Start with 10% traffic
rules: [
{ segment: 'internal_users', enabled: true },
{ segment: 'beta_testers', enabled: true }
]
}
}
async function processPayment(order) {
if (featureFlags.use_new_payment_service.enabled) {
if (shouldEnableForUser(order.userId)) {
try {
return await newPaymentService.process(order)
} catch (error) {
// Fallback to old service
logger.error('New payment failed, fallback', error)
return await oldPaymentService.process(order)
}
}
}
return await oldPaymentService.process(order)
}
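`shouldEnableForUser` above isn't shown; here is a sketch of the percentage-rollout check, assuming a deterministic hash so the same user always gets the same decision as the rollout grows:

```javascript
// Deterministic percentage rollout: hash the userId into a stable 0-99
// bucket, so the same user stays in (or out) as rollout goes 1% -> 100%.
const crypto = require('crypto')

function shouldEnableForUser(userId, flag = featureFlags.use_new_payment_service) {
  // Segment rules (internal_users, beta_testers) would be checked first - omitted here.
  const rolloutPercent = parseInt(flag.rollout, 10)                 // '10%' -> 10
  const hash = crypto.createHash('sha256').update(String(userId)).digest()
  const bucket = hash.readUInt32BE(0) % 100                         // stable 0-99
  return bucket < rolloutPercent
}
```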
# Migration Checklist
## Pre-Migration
[ ] Load test new system at expected traffic
[ ] Feature flags implemented
[ ] Rollback plan documented
[ ] Data validation scripts ready
[ ] Monitoring dashboards prepared
[ ] On-call team briefed
[ ] Stakeholders notified
[ ] Maintenance window scheduled (if needed)
## During Migration
[ ] Start with 1% traffic
[ ] Monitor error rates closely
[ ] Verify data consistency every hour
[ ] Gradual rollout: 1% → 5% → 10% → 25% → 50% → 100%
[ ] Each step: monitor 24-48 hours before next
[ ] Document any issues encountered
## Post-Migration
[ ] Full monitoring for 1 week
[ ] Data validation report
[ ] Performance comparison (old vs new)
[ ] Post-mortem meeting
[ ] Document lessons learned
[ ] Plan deprecation of old system
Technical leadership ≠ Management
You don't have authority to:
- Assign tasks
- Fire people
- Approve budget
You DO have influence through:
- Technical expertise
- Earned trust
- Clear communication
- Good decisions track record
How to earn trust as architect:
Deliver results:
- Designs that actually work in production
- Solutions that scale when needed
- Smooth migrations without incidents
Be reliable:
- Available during incidents
- Follow through on commitments
- Respond to questions quickly
Admit mistakes:
- "I was wrong about X"
- "Better approach is Y"
- Learn publicly
Give credit:
- "Sarah's idea to use caching saved us"
- Highlight team contributions
- Share success
Framework for architecture decisions:
# Architecture Decision Record (ADR)
## Context
We need to redesign our notification system.
Current system:
- Sends 1M notifications/day
- Growing 50% per quarter
- Frequent delays and failures
## Decision
Adopt event-driven architecture with Kafka
## Alternatives Considered
1. **RabbitMQ + workers**
- Pros: Team familiar, simpler
- Cons: Scale limitations, no replay
2. **AWS SNS/SQS**
- Pros: Managed, reliable
- Cons: Vendor lock-in, cost at scale
3. **Kafka** (CHOSEN)
- Pros: High throughput, replay, ecosystem
- Cons: Operational complexity, learning curve
## Consequences
Positive:
- Handle 10M+ notifications/day
- Message replay for debugging
- Real-time analytics possible
Negative:
- Team needs Kafka training
- More operational overhead
- Migration takes 3 months
## Timeline
- Month 1: Kafka setup + training
- Month 2: Build new notification service
- Month 3: Gradual migration
// Instead of: "We need to rewrite this!"
// Use data to make the case:
const currentMetrics = {
deployment_frequency: '1/week',
lead_time: '2 weeks',
mttr: '4 hours',
change_failure_rate: '15%'
}
const industryBenchmark = {
elite_performers: {
deployment_frequency: 'Multiple/day',
lead_time: '< 1 day',
mttr: '< 1 hour',
change_failure_rate: '0-15%'
}
}
const proposal = `
Problem: Our lead time is 14x slower than elite teams
Impact: Competitors ship features faster
Solution: Move to microservices + CI/CD
Expected improvement: Lead time 2 weeks → 2 days
Cost: 3 months, 4 engineers
ROI: Ship 7x faster, competitive advantage
`
Explain technical to non-technical:
Technical jargon:
"We need to implement horizontal pod autoscaling
with Kubernetes HPA based on custom metrics from
Prometheus to handle traffic spikes"
Business language:
"Currently, traffic spikes crash our site.
This causes lost revenue.
Solution: Automatically add servers when busy.
Cost: $5k/month
Benefit: No more outages, protect $50k/hour revenue"
Write clear technical docs:
# Bad Documentation
Uses Redis for caching. Configure TTL appropriately.
# Good Documentation
## Caching Strategy
We use Redis to cache API responses, reducing database load by 80%.
### When to Cache
- Cache: User profiles, product lists (rarely change)
- Don't cache: Real-time inventory, prices (change often)
### TTL Guidelines
- User profiles: 1 hour (okay if slightly stale)
- Product lists: 5 minutes (update regularly)
- Search results: 30 seconds (balance fresh vs performance)
### Example
```javascript
await redis.set(`user:${id}`, JSON.stringify(data), { EX: 3600 }) // 1 hour TTL
```
### Monitoring
- Hit rate: https://dashboard/redis
- Alert if hit rate < 80%
// Bad review
"This is wrong. Use Redis instead."
// Good review
/*
Concern: In-memory cache won't scale across multiple instances.
Issue:
- Each instance has separate cache
- Cache miss rate will be high
- Duplicated memory usage
Suggestion:
Consider shared Redis cache:
- Single source of truth
- Better hit rate
- Scales horizontally
If Redis not possible:
- Document limitations
- Add metrics for cache hit rate
- Plan migration when needed
Happy to discuss trade-offs!
*/
Junior engineer: "Should I use MongoDB or PostgreSQL?"
Bad answer:
"Use PostgreSQL, it's better."
Good answer:
"Great question! Let's think through trade-offs:
What's your data like?
- Structured, relations? → PostgreSQL
- Unstructured, flexible schema? → MongoDB
What operations?
- Complex queries, joins? → PostgreSQL
- Simple CRUD, fast writes? → MongoDB
What's your team familiar with?
- SQL experience? → PostgreSQL easier
For your use case (user profiles with relations),
I'd recommend PostgreSQL because:
1. You have structured data
2. Need join queries (users + orders)
3. Team knows SQL
But either could work. Want to prototype both?"
When to require architecture review:
Always review:
- New service/system
- Major refactor
- Database schema change
- New technology adoption
- Security-sensitive changes
- Cross-team dependencies
⚠️ Consider review:
- Performance-critical path
- Complex algorithms
- External API integration
No review needed:
- Bug fixes
- UI changes
- Copy updates
- Small refactors
# Architecture Review: Payment Service Redesign
## Requestor
Sarah Chen, Backend Lead
## Overview
Redesign payment service to support international payments
## Current System
- Single payment provider (Stripe)
- USD only
- Synchronous processing
- ~1000 transactions/day
## Proposed System
- Multiple payment providers (Stripe, PayPal, Adyen)
- Multi-currency support
- Async processing via Kafka
- Expected 10k transactions/day
## Architecture Diagram
[Insert diagram]
## Key Design Decisions
### 1. Multiple Payment Providers
Decision: Support 3+ providers
Rationale: Regional preferences, redundancy
Trade-off: Complexity vs flexibility
### 2. Async Processing
Decision: Use Kafka for payment events
Rationale: Decouple, handle spikes, retry
Trade-off: Eventual consistency
### 3. Provider Abstraction
Decision: Abstract provider behind interface
Rationale: Easy to add/remove providers
Trade-off: Extra abstraction layer
## Concerns & Risks
1. **Currency conversion**
- Risk: Exchange rate changes during processing
- Mitigation: Lock rate at checkout
2. **Provider failover**
- Risk: Primary provider down
- Mitigation: Automatic fallback to secondary
3. **Data consistency**
- Risk: Event lost in Kafka
- Mitigation: At-least-once delivery, idempotency
## Scale Considerations
- Current: 1k transactions/day
- 1 year: 50k transactions/day (50x)
- Design handles: 100k transactions/day
## Security
- PCI compliance maintained
- No card data stored
- Provider tokens encrypted
- Audit logging
## Monitoring
- Transaction success rate per provider
- Latency p95, p99
- Currency conversion accuracy
- Failed payment alerts
## Timeline
- Week 1-2: Build provider abstraction
- Week 3-4: Integrate providers
- Week 5-6: Kafka setup + event handlers
- Week 7-8: Migration + testing
- Week 9: Gradual rollout
## Questions for Reviewers
1. Provider abstraction too complex?
2. Kafka overkill for current scale?
3. Missing failure scenarios?
## Reviewers
- @architect-alice (infra)
- @architect-bob (security)
- @devops-lead (operations)
For Reviewers:
Focus on:
- Failure scenarios missed?
- Scale considerations?
- Security issues?
- Operational complexity?
- Better alternatives?
Avoid:
- Bikeshedding (naming, formatting)
- "I would do it differently" without reasoning
- Perfectionism ("This isn't optimal")
🎯 Goal:
Help improve design, not block progress
For Design Author:
Be open to:
- Critical feedback
- Alternative approaches
- Simpler solutions
Avoid:
- Defensive reactions
- "Trust me, it will work"
- Dismissing concerns
🎯 Goal:
Learn, improve design, build consensus
## Definition of Approved
Design can be approved when:
- [ ] Failure scenarios identified and mitigated
- [ ] Scale considerations addressed
- [ ] Security reviewed
- [ ] Operational impact understood
- [ ] Monitoring strategy defined
- [ ] Rollback plan exists
- [ ] 2+ architects approved
- [ ] No blocking concerns
⚠️ Approved with conditions:
- Minor concerns to address during implementation
- Follow-up reviews scheduled
Not approved:
- Major risks not mitigated
- Scale concerns not addressed
- Blocking security issues
1. Think Long-Term
Code you write today:
- You'll maintain for years
- Others will extend
- Will run at 10x scale
- Will fail eventually
Write code you're proud to own.
2. Empathy For Oncall
Before merging:
"Will this wake someone at 3AM?"
"Is this debuggable?"
"Can it be rolled back?"
"Are errors actionable?"
If answer is "maybe" → improve design.
3. Documentation As Love Letter
Document for your future self (6 months later):
- Why this design?
- What are trade-offs?
- How to debug?
- How to scale?
You'll thank yourself.
4. Measure Everything
"In God we trust. All others must bring data."
- W. Edwards Deming
No metrics = No visibility = Cannot improve
Add metrics first, optimize later.
5. Automate Toil
Repetitive tasks:
- Deployments
- Rollbacks
- Scaling
- Monitoring setup
- Incident response
Automate once, benefit forever.
flowchart TB
Junior[Junior Engineer<br/>Write code] --> Mid[Mid Engineer<br/>Own features]
Mid --> Senior[Senior Engineer<br/>Design systems]
Senior --> Staff[Staff Engineer<br/>Cross-team impact]
Staff --> Principal[Principal Engineer<br/>Company-wide strategy]
Each level:
Junior → Mid:
- Code quality
- Feature ownership
- Learn system design
Mid → Senior:
- Design systems
- Trade-off thinking
- Mentor others
Senior → Staff:
- Cross-team impact
- Production thinking
- Technical strategy
Staff → Principal:
- Company-wide architecture
- Hire & build teams
- Set technical direction
Average Architect:
- Knows patterns
- Follows best practices
- Designs systems
Great Architect:
- Thinks about failure
- Simplifies complexity
- Enables others
- Owns production
- Communicates clearly
- Earns trust
- Ships results
1. **Monitor everything** - You cannot improve what you cannot measure
2. **Design for failure** - Everything will fail. Plan for it.
3. **Start simple** - Add complexity only when needed
4. **Think in trade-offs** - No perfect solution. Choose wisely.
5. **Document decisions** - Future you will ask "Why?"
6. **Automate toil** - Humans for thinking, machines for repetition
7. **Security by default** - Easier to relax than tighten later
8. **Test at scale** - Load test before users do it for you
9. **Gradual rollouts** - Big bang deployments = big bang failures
10. **Own your code** - You build it, you run it
1. From Code → System
2. From Happy Path → Failure Scenarios
3. From Solo → Team
4. From Build → Own
5. From Technical → Business
Before shipping to production:
[ ] What's the blast radius if this fails?
[ ] Can we rollback in under 5 minutes?
[ ] Do we have monitoring for this?
[ ] Are errors actionable?
[ ] Did we load test at 2x expected traffic?
[ ] Is there a runbook for common issues?
[ ] Can the oncall engineer debug this at 3AM?
[ ] What's the impact if this service is down?
[ ] Do we have a gradual rollout plan?
[ ] Have we considered all failure modes?
Communication: explain technical decisions in business language - in docs, reviews, and incidents.
Decision Making: write ADRs, weigh trade-offs, back proposals with data.
Influence: lead through expertise and earned trust, not authority.
Ownership: you build it, you run it - from design to the 3AM page.
The journey from engineer to architect is not about learning more patterns. It's about taking ownership.
You don't need to know every technology. You need to think beyond the happy path, reason in trade-offs, design for failure, and own what you ship in production.
Great architects are not built in a day. They're built through production incidents, hard decisions, honest post-mortems, and years of owning real systems.
You're ready when you realize: The system is yours. Own it.