SRE Interview Questions and Answers - Part II
What is latency and how do you reduce it?
Latency - Refers to the time taken to respond to a request (the time the application takes to process it).
Latency can be measured for a user request, an application-to-application request, or an application-to-data-store request (e.g., S3, a database, Redshift).
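As a quick illustration, the latency of a single request can be measured by timing the call end to end. Below is a minimal sketch using only the Python standard library; the URL is hypothetical and used only as an example.

import time
import urllib.request

def measure_latency_ms(url: str) -> float:
    # Time a single GET request end to end and return the latency in milliseconds.
    start = time.perf_counter()
    with urllib.request.urlopen(url) as response:
        response.read()  # consume the body so the full response time is counted
    return (time.perf_counter() - start) * 1000

# Hypothetical inventory endpoint used only for illustration.
print(f"latency: {measure_latency_ms('https://example.com/api/inventory'):.1f} ms")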
There could be various reasons for a latency issue.
1) Client-side issues - these are fairly straightforward to identify when our internal telemetry and observability look GREEN.
2) Possible reasons for a client-side issue include the ISP, the user agent (e.g., the browser), or geo-location-based problems.
If it is a server-side issue, it is better to start probing to pinpoint the cause:
Let's say one of the microservices, called "inventory", which displays products on the website, is having a latency issue. Because of this, products are not listed properly or item metadata is not updated properly.
1) Our observability for the "inventory" application shows a spike in latency.
2) Check the start timestamp of the latency.
3) Check whether the latency follows a known pattern. E.g., if the spike started at 8:00 AM PST, check for the same latency pattern over the last 3 days to confirm whether a similar spike was seen at the same time (see the sketch after this list).
4) This will help us identify whether the latency spike at that time is a known issue (false positive) or a REAL ISSUE.
5) Let's assume the latency spike is seen for the first time in the last 3 days. Now, we need to check whether any application deployment or new release happened at that time.
6) If a new version was deployed and is causing the latency spike, then ROLLBACK the change and engage the development team to look into the issue.
7) In our case, no new deployments were made. Then start looking at application metrics like CPU, MEMORY, THROUGHPUT, and LOAD BALANCER metrics.
8) A spike in RAM or CPU that correlates with THROUGHPUT shows that the application started seeing a huge spike in traffic, causing RESOURCE SATURATION. If that is the case, consider SCALING OUT the pods/instances to handle the surge in traffic.
9) If all the metrics are under their thresholds, trace one of the request flows that shows latency and see where exactly the latency is occurring.
10) If the application makes calls to backend services like databases, check whether the latency is happening at the DB layer. This could be due to row-lock contention, SQL queries causing resource exhaustion, or overloaded READ replicas.
11) Depending on the DB-layer finding, we can apply a fix such as implementing a CACHING layer or increasing the SIZE of the READ REPLICAS.
12) Why am I not checking the NETWORK layer? We started triaging the issue with one specific application. If it were a network-layer or CDN-layer issue, the impact would have been much broader and would have affected multiple applications.
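To make the "known pattern" check in step 3 concrete, here is a minimal sketch (the per-day samples are hypothetical and not from any specific monitoring tool): it compares the p99 latency of today's spike window against the same window on the previous 3 days.

import statistics

def p99(samples):
    # Approximate 99th-percentile latency (ms) from a list of samples.
    ordered = sorted(samples)
    return ordered[max(0, int(len(ordered) * 0.99) - 1)]

# Hypothetical latency samples (ms) for the 8:00-8:15 AM PST window on each day.
window_by_day = {
    "today":      [120, 135, 900, 950, 880, 910],
    "yesterday":  [110, 125, 130, 118, 122, 119],
    "2_days_ago": [115, 121, 128, 117, 124, 120],
    "3_days_ago": [112, 119, 126, 121, 123, 118],
}

baseline = statistics.mean(p99(v) for day, v in window_by_day.items() if day != "today")
today = p99(window_by_day["today"])

# A value far above the 3-day baseline points to a REAL ISSUE rather than a known pattern.
print(f"today p99={today} ms vs 3-day baseline p99~{baseline:.0f} ms")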
How do you identify performance bottlenecks?
Identifying performance bottlenecks can be classified into 2 categories:
1) Infrastructure bottlenecks.
2) Application bottlenecks.
Both are identified using Monitoring and Observability.
Monitoring is the act of tracking known metrics and alerting on thresholds, like CPU, MEMORY, THROUGHPUT, ERRORS, LATENCY, and SATURATION.
Observability refers to the ability to understand the internal state of the application, such as how calls are made to downstream services and what response codes they return.
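As a rough illustration of the difference (a minimal sketch, not tied to any specific tooling): monitoring answers "is a known metric over its threshold?", while observability records enough per-request context to explain why something is slow or failing.

import json
import time

CPU_ALERT_THRESHOLD = 0.85  # monitoring: a known metric with a fixed threshold

def check_cpu(cpu_utilization: float) -> None:
    # Monitoring: compare a pre-defined metric against a pre-defined threshold.
    if cpu_utilization > CPU_ALERT_THRESHOLD:
        print(f"ALERT: CPU at {cpu_utilization:.0%}")

def handle_request(order_id: str) -> None:
    # Observability: emit structured, per-request context (downstream call,
    # status code, duration) so that unknown failure modes can be explored later.
    start = time.time()
    downstream_status = 503  # hypothetical response from an inventory service
    print(json.dumps({
        "order_id": order_id,
        "downstream": "inventory-service",
        "status_code": downstream_status,
        "duration_ms": round((time.time() - start) * 1000, 2),
    }))

check_cpu(0.91)
handle_request("order-123")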
What golden signals do you always track?
As per the SRE Book, the Golden Signals are SATURATION, ERRORS, LATENCY, and THROUGHPUT/TRAFFIC.
THROUGHPUT/TRAFFIC - Refers to how much demand is placed on a system.
-> Transactions per second (TPS)
-> Queries per second.
-> Jobs per second.
SATURATION - Refers to how much of the system's resources is utilized, like CPU, MEMORY, and NETWORK utilization.
ERRORS - Refers to failed requests like HTTP 5xx, 4xx errors.
LATENCY - Refers to the time taken to process a request.
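To tie the signals together, here is a minimal sketch (the request records are hypothetical, standard library only) that derives TRAFFIC, ERRORS, and LATENCY from a window of requests; SATURATION would come from host metrics such as CPU, MEMORY, and NETWORK.

import statistics

# Hypothetical request records collected over a 60-second window: (duration ms, HTTP status).
requests = [(120, 200), (95, 200), (310, 500), (88, 200), (450, 503), (101, 404)]
WINDOW_SECONDS = 60

traffic = len(requests) / WINDOW_SECONDS                     # requests per second
errors = sum(1 for _, status in requests if status >= 500)   # failed requests (5xx)
error_rate = errors / len(requests)
latency_p50 = statistics.median(d for d, _ in requests)      # typical processing time
latency_max = max(d for d, _ in requests)                    # worst case in the window

print(f"traffic={traffic:.2f} req/s, error_rate={error_rate:.1%}, "
      f"p50={latency_p50} ms, max={latency_max} ms")
# SATURATION is read from resource metrics (CPU, MEMORY, NETWORK), not from the request stream.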
What is rate limiting and where is it used?
Rate limiting is a widely used technique to prevent a service or system from being overwhelmed and failing.
Every service has a capacity sizing, which means the expected traffic to the service (TPS). In cases where a surge in traffic is expected, the application capacity needs to be adjusted accordingly.
Imagine a scenario where the service is continuously hit from a specific IP, vendor, or geo-location, which is UNEXPECTED. This could lead to service degradation and cause failures for legitimate requests.
RATE limiting can help decide whether a request is legitimate. A rate limiter is made up of rules, and each incoming request is validated against those rules before being forwarded to the application.
For example, when a user keeps hitting a website, their IP may be blocked for a time and redirected to a CAPTCHA.
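One common way to implement such a rule is a token bucket per client IP. The sketch below is a minimal illustration (the limits and client IP are hypothetical), not a description of any specific product.

import time

class TokenBucket:
    # Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second.
    def __init__(self, capacity: int, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: reject, queue, or redirect to a CAPTCHA

# One bucket per client IP: a burst of 5 requests, refilled at 1 request per second.
buckets = {}
def handle(client_ip: str) -> str:
    bucket = buckets.setdefault(client_ip, TokenBucket(capacity=5, rate=1.0))
    return "200 OK" if bucket.allow() else "429 Too Many Requests"

for _ in range(7):
    print(handle("203.0.113.7"))  # hypothetical client IP; the last calls get 429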
What is chaos engineering and why is it useful?
Chaos engineering is one of the SRE principles where we intentionally break a system/application to verify its resiliency posture.
Resiliency refers to the ability of the system/application to perform at the expected threshold during unforeseen situations.
Example: AZ failure → traffic rerouted to a healthy AZ → Auto Scaling replaces the instances → error budget is consumed but the service continues ✔ Resilient behavior
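A minimal sketch of the same idea in code (the fault rate, retry policy, and availability target are hypothetical): inject a fault into a downstream dependency, then check whether the steady-state hypothesis, i.e. availability staying within the error budget, still holds.

import random

FAULT_RATE = 0.05            # inject a failure into 5% of downstream calls
AVAILABILITY_TARGET = 0.99   # steady-state hypothesis: 99% of requests succeed

def downstream_call() -> bool:
    # Chaos injection: randomly simulate the dependency failing.
    if random.random() < FAULT_RATE:
        raise ConnectionError("injected fault")
    return True

def handle_request() -> bool:
    # Resilient behavior under test: retry once before giving up.
    for _ in range(2):
        try:
            return downstream_call()
        except ConnectionError:
            continue
    return False

results = [handle_request() for _ in range(10_000)]
availability = sum(results) / len(results)
verdict = "holds" if availability >= AVAILABILITY_TARGET else "is violated"
print(f"availability={availability:.2%}, steady-state hypothesis {verdict}")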