SRE Interview Questions and Answers - Part I


                                             

What is SRE and how is it different from DevOps?

SRE stands for Site Reliability Engineering, a discipline that primarily focuses on managing applications and their infrastructure in PRODUCTION. SREs aim to improve the reliability and resiliency of applications, strengthen monitoring and observability, apply a SHIFT LEFT approach so that issues are addressed at the development stage, and track the promised SLAs, SLOs and SLIs. They approach every problem from a software engineering perspective, identify and eliminate toil, use automation and runbooks to improve the reliability and resiliency of applications and systems, and take part in Root Cause Analysis and postmortem calls after a major incident. DevOps, by comparison, is a broader culture and set of practices for bringing development and operations together across the whole delivery lifecycle; SRE is one concrete way of implementing those practices, with a specific focus on production reliability.

What are SLIs, SLOs, and Error Budgets?

SLO - Service Level Objective is the internal target set for a service so that the SLA (Service Level Agreement) promised by the business to its customers/consumers is met.

SLI - Service Level Indicator is the actual measurement of the service against the SLO, collected through observability and monitoring tools. E.g. if our internal objective (SLO) for availability is 95% and we actually achieved 96%, then the SLI is 96%.

NOTE: SLOs and SLIs are measured over a period of time, such as 1 month (30 days).

Error Budget - Refers to the amount of failure that is acceptable/tolerated over a period of time. Any failure beyond this limit breaches the SLO and puts the SLA at risk.

SLO for Availability is 95%
Error budget is 100 - SLO, which means 5%

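If our SLI for availability is 99%, then we have consumed 1 percentage point of the 5% error budget (20% of the budget).
If our SLI for availability drops to 95%, then we have consumed the entire error budget. Anything below 95% (for example 94%) breaches the SLO, and there are action items in place for when the error budget is consumed.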
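
The error budget arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: the function name `error_budget_report` and the keys of the returned dictionary are made up for this example. It treats the budget as 100 - SLO and the consumed part as 100 - SLI, both in percentage points.

```python
def error_budget_report(slo_percent: float, sli_percent: float) -> dict:
    """Compare a measured SLI against an SLO for one measurement window.

    Hypothetical helper for illustration; all values are percentages.
    """
    if not 0 < slo_percent < 100:
        raise ValueError("SLO must be strictly between 0 and 100")

    budget = 100.0 - slo_percent      # tolerated unreliability
    consumed = 100.0 - sli_percent    # observed unreliability
    return {
        "budget": round(budget, 4),
        "consumed": round(consumed, 4),
        "remaining": round(budget - consumed, 4),
        "consumed_fraction": round(consumed / budget, 4),
        "slo_met": sli_percent >= slo_percent,
    }


if __name__ == "__main__":
    print(error_budget_report(95, 99))
    print(error_budget_report(95, 94))
```

With an SLO of 95%, an SLI of 99% leaves 4 percentage points of budget, while an SLI of 94% gives a negative remaining budget, which means the SLO is breached.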

How do you calculate service availability (e.g., 99.9 vs 99.99)?



Minutes in a month (30 days) = 30 x 24 x 60 = 43200 minutes  (Total Time)

Downtime in a month = 15 minutes 

Availability for 1 month = (43200 - 15)/43200 = 99.965% 

Availability of 99.9% allows about 43.2 minutes of downtime in a 30-day month.
Availability of 99.99% allows about 4.32 minutes of downtime in a 30-day month.
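
The same calculation can be checked with a short Python sketch. The function names and the 30-day month constant are assumptions made for this example; a real report would use the actual length of the measurement window.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43200, assuming a 30-day month


def availability_percent(downtime_minutes: float,
                         total_minutes: float = MINUTES_PER_30_DAY_MONTH) -> float:
    """Availability = (total time - downtime) / total time, as a percentage."""
    return (total_minutes - downtime_minutes) / total_minutes * 100


def allowed_downtime_minutes(target_percent: float,
                             total_minutes: float = MINUTES_PER_30_DAY_MONTH) -> float:
    """Downtime that a given availability target tolerates in the window."""
    return total_minutes * (100 - target_percent) / 100


if __name__ == "__main__":
    print(round(availability_percent(15), 3))         # 15 minutes of downtime
    print(round(allowed_downtime_minutes(99.9), 2))   # "three nines"
    print(round(allowed_downtime_minutes(99.99), 2))  # "four nines"
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why moving from 99.9% to 99.99% is a much bigger commitment than it looks.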



What is toil? Share examples and how you reduced it.

Toil refers to manual, repetitive operational work that can be automated and adds no lasting value to the service.

1) Replaced manual scaling of EC2 instances with an Auto Scaling Group.
2) Used SSM Run Command documents to automate system-level restarts.
3) Automated log collection while triaging issues.
4) Added guardrails to detect configuration drift and mitigate it.
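
As one illustration of the drift guardrail idea, the sketch below compares a desired configuration with the observed one and reports every key that differs. It is a simplified, hypothetical example: `detect_drift` and the sample keys are invented here, and a real guardrail would read the two sides from source control and from the cloud provider's API.

```python
def detect_drift(desired: dict, actual: dict) -> dict:
    """Return {key: (desired_value, actual_value)} for every key that differs.

    A key missing on one side is reported with None on that side.
    """
    drift = {}
    for key in desired.keys() | actual.keys():
        want = desired.get(key)
        have = actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


if __name__ == "__main__":
    desired = {"instance_type": "t3.medium", "min_size": 2}
    actual = {"instance_type": "t3.large", "min_size": 2, "debug": True}
    for key, (want, have) in sorted(detect_drift(desired, actual).items()):
        print(f"{key}: expected {want!r}, found {have!r}")
```

An empty result means no drift; a non-empty result can raise an alert or trigger an automated rollback to the desired state.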

What are MTTR, MTBF, and MTTA?

MTTR - Mean Time To Repair - Average time taken to resolve an issue.
MTBF - Mean Time Between Failures - Average time between consecutive failures. For example, if a CPU fails on a system on one day and the next CPU failure happens on another day, the interval between the two is the time between failures, and MTBF is the average of such intervals.
MTTA - Mean Time To Acknowledge - Average time taken by an individual/team to acknowledge a failure/alert.
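
These three metrics can be computed from incident timestamps. The sketch below is an assumption-laden example: the `Incident` record and its field names are hypothetical, MTTR is measured from the start of the failure to resolution, and MTBF follows the description above (start of one failure to the start of the next). Some teams measure MTBF from recovery to the next failure instead.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    started: datetime       # when the failure began
    acknowledged: datetime  # when someone acknowledged the alert
    resolved: datetime      # when service was restored


def _mean_minutes(deltas) -> float:
    return mean(d.total_seconds() for d in deltas) / 60


def mtta_minutes(incidents) -> float:
    return _mean_minutes(i.acknowledged - i.started for i in incidents)


def mttr_minutes(incidents) -> float:
    return _mean_minutes(i.resolved - i.started for i in incidents)


def mtbf_minutes(incidents) -> float:
    """Mean gap between consecutive failure start times (needs 2+ incidents)."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [b.started - a.started for a, b in zip(ordered, ordered[1:])]
    return _mean_minutes(gaps)
```

Tracking MTTA separately from MTTR helps tell whether slow recovery comes from slow alert response or from slow remediation.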




How do you prioritize incidents during major outages?

During a major outage, prioritizing incidents is essential, and it starts with understanding the customer and business impact. This helps drive the incidents to resolution, reduce the blast radius, and keep the mitigation time within the RTO (Recovery Time Objective).

Prioritization of incidents must be based on:

1) Customer impact.
2) Business impact.
3) Blast radius - impacted AZ or Region.
4) Whether the impact is a 100% outage or a partial degradation.
 
Once the initial prioritization is done, we then prioritize the incidents based on the critical application flow:

1) Critical APIs/applications with direct customer impact, such as Authentication, Billing and Checkout flows.
2) Databases and Storage.
3) Network and Compute layers.
4) Segregate CORE vs Non-CORE services.

Honor SLOs, RTOs and Error Budgets:

1) Services with strict SLOs and RTOs need immediate attention.
2) Services close to exhausting their error budget need more attention and escalation.
3) Low-tier services can wait until core services stabilize.
  • Prioritize issues which are blocking the recovery.
  • Use real time telemetry to re-rank the incident priority.
    • Errors.
    • Latency.
    • Saturation.
    • Throughput.
    • Availability.
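
The ordering rules above can be expressed as a simple sort. This is only a sketch under assumptions: the `OpenIncident` fields and the exact order of the tie-breakers are invented for illustration, and a real incident commander would adjust them to the business context and to live telemetry.

```python
from dataclasses import dataclass


@dataclass
class OpenIncident:
    name: str
    blocks_recovery: bool    # other fixes cannot proceed until this is resolved
    customer_facing: bool    # direct customer impact (auth, billing, checkout)
    core_service: bool       # CORE vs non-CORE tier
    budget_remaining: float  # fraction of error budget left, 0.0 to 1.0
    error_rate: float        # fraction of failing requests from live telemetry


def rank_incidents(incidents):
    """Return incidents ordered from most to least urgent."""
    def sort_key(i: OpenIncident):
        return (
            not i.blocks_recovery,  # recovery blockers first
            not i.customer_facing,  # then customer-facing impact
            not i.core_service,     # then core before non-core
            i.budget_remaining,     # less budget left is more urgent
            -i.error_rate,          # higher error rate is more urgent
        )
    return sorted(incidents, key=sort_key)
```

Re-running the ranking as telemetry changes gives a cheap way to re-rank priorities during the outage instead of fixing them once at the start.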


