Principal SRE - Interview Questions
Sharing some of the interview questions for the role of Principal SRE from Apple:
1) Tell me about a production outage you’ve handled that had ambiguous symptoms. How did you narrow it down?
2) How do you decide what belongs in an SLO, and how do you avoid overengineering it?
3) You inherit a platform with strong uptime but high operational toil. What do you change first?
4) How do you handle a disagreement with product leadership when reliability work competes with feature delivery?
5) A service is scaling rapidly and latency is degrading under load. Walk me through your approach.
6) What does “good observability” mean to you in practice?
7) Describe how you would lead a major incident.
8) What’s your philosophy on automation in production operations?
9) How do you evaluate whether an architecture is resilient enough?
10) You’re the on-call SRE for a globally used service. At 2:00 AM, error rates jump from 0.2% to 8%, latency doubles, and one region is still healthy while two are degraded. Walk me through exactly how you would triage and stabilize the incident.
11) You own a globally distributed API with multiple dependencies. One region is healthy, two are seeing elevated p99 latency, and error rates are still under the availability SLO. Product says users are complaining that the app “feels broken.” How do you determine whether this is a reliability issue, a capacity issue, or a dependency issue, and what do you do first?
12) A deployment passed all automated tests, but 15 minutes after release the service started failing intermittently. How would you decide whether this is a rollback, a feature flag disable, or a deeper production investigation?
13) A customer-facing service is within its SLO on availability, but users are complaining that it feels slow and unreliable. How do you decide whether this is actually an SLO problem, and what would you measure first?
14) A critical service has a 99.9% SLO, but over the last quarter it has burned through the error budget twice as fast as expected. How would you investigate the causes, and what actions would you recommend to leadership?
15) What does “reliability” mean to you in the context of an SRE role, and how do you measure it?
16) Explain the relationship between SLOs and error budgets. Why are error budgets critical in SRE?
17) During a major outage, how do you prioritize incidents? What criteria do you use to decide what to fix first?
18) What is the difference between an SLI, SLO, and SLA? Give an example for each.
19) What makes a good SLI? What characteristics should an SLI have to be effective?
20) How do you choose SLIs and SLOs for a system that has both synchronous and asynchronous workloads?
21) How do you calculate error budget burn rate, and why is burn rate more important than total error budget?
22) Your service has a 99.9% SLO on latency. You get an alert: p99 latency > 500ms for 5 minutes. What are the first three things you check? Don’t say "check dashboards". Name specific signals.
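For the burn-rate question above, here is a minimal sketch of the arithmetic, not a definitive answer. It assumes the common definition of burn rate as the observed error rate divided by the error rate the SLO allows, and a 30-day SLO window. The function names and the window length are illustrative assumptions, not something the interview specified.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget is being spent exactly on schedule.
    """
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until the whole budget is gone if the burn rate stays constant.

    window_days is an assumed SLO window, not a value from the questions.
    """
    if rate <= 0:
        return float("inf")
    return window_days / rate


if __name__ == "__main__":
    # Illustrative numbers: an 8% error rate against a 99.9% SLO.
    rate = burn_rate(error_rate=0.08, slo_target=0.999)
    print(f"burn rate: {rate:.1f}x")
    print(f"budget gone in about {days_to_exhaustion(rate) * 24:.1f} hours")
```

Under these assumptions, an 8% error rate against a 99.9% SLO burns the budget at roughly 80 times the sustainable rate. That is why burn rate says more about urgency than the remaining budget does: the same remaining budget can last weeks at a burn rate of 1 or only hours at 80.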