SRE Interview Questions and Answers - Part II

What is latency and how do you reduce it?

Latency refers to the time taken to respond to a request (i.e., the time the application spends processing the request).

Latency can be measured for a user request, an application-to-application request, or an application-to-data-store request (e.g., S3, a database, Redshift).
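As a baseline, here is a minimal sketch (Python, standard library only) of measuring the end-to-end latency of a single HTTP request. The URL is a placeholder, not a real service of ours:

```python
import time
import urllib.request

def measure_latency(url: str) -> float:
    """Measure end-to-end latency (seconds) of a single HTTP GET."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=5) as resp:
        resp.read()  # include the time to download the body
    return time.perf_counter() - start

# Hypothetical endpoint; replace with the service you are probing.
print(f"latency: {measure_latency('https://example.com/') * 1000:.1f} ms")
```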

There can be various reasons for a latency issue.

1) Client-side issue: this is fairly straightforward to identify when our internal telemetry and observability look GREEN.

2) Possible client-side causes include the ISP, the user agent (e.g., the browser), or geo-location-specific issues.

If it is a server-side issue, it is better to start probing to pinpoint the cause:

Let's say one of the microservices, "inventory", which displays products on the website, is having a latency issue. Because of this, products are not listed properly or item metadata is not updated properly.



1) Our observability tooling for the "inventory" application shows a spike in latency.

2) Check the timestamp at which the latency spike started.

3) Check whether the latency follows a known pattern. E.g., if the spike started at 8:00 AM PST, check for the same latency pattern over the last 3 days to confirm whether a similar spike is seen at the same time (the first sketch after this list shows one way to pull those windows).

4) This helps us identify whether the latency spike at that time is a known issue (false positive) or a REAL ISSUE.

5) Let's assume the latency spike is seen for the first time in the last 3 days. Now we need to check whether any application deployment or new release happened at that time.

6) If a new version was deployed and is causing the latency spike, then ROLLBACK the change and engage the development team to look into the issue.

7) In our case, no new deployments were made. Then start looking at application metrics like CPU, MEMORY, THROUGHPUT, and LOAD BALANCER metrics.

8) If you see a spike in RAM or CPU that correlates with THROUGHPUT, it shows that the application is seeing a huge spike in traffic and is hitting RESOURCE SATURATION. If that is the case, consider SCALING OUT the pods/instances to handle the surge in traffic (see the second sketch after this list).

9) If all the metrics are under their thresholds, trace one of the request flows that shows latency and see exactly where the latency is occurring (the third sketch after this list breaks a single HTTP hop into phases).

10) If the application makes calls to backend services like databases, check whether the latency is happening at the DB layer. This could be due to row-lock contention, SQL queries causing resource exhaustion, or overloaded READ replicas.

11) Depending on what we find at the DB layer, we can apply a fix such as implementing a CACHING layer or increasing the SIZE of the READ REPLICAS (the last sketch after this list shows a simple read-through cache).

12) Why am I not checking the NETWORK layer? At the beginning we started triaging the issue with one specific application. If it were a network-layer or CDN-layer issue, the impact would have been much higher and would have affected multiple applications.
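For step 3, here is a hedged sketch of pulling the same one-hour window for today and the previous 3 days via Prometheus's query_range HTTP API. The Prometheus URL, the service label, and the http_request_duration_seconds_bucket metric name are assumptions about your instrumentation, and the window is expressed in UTC for simplicity:

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timedelta, timezone

PROM = "http://prometheus:9090"  # assumed Prometheus endpoint
# Assumed histogram metric; adjust to match your instrumentation.
QUERY = ('histogram_quantile(0.99, sum(rate('
         'http_request_duration_seconds_bucket{service="inventory"}[5m])) by (le))')

def p99_window(day_offset: int) -> list:
    """Fetch p99 latency samples for the 08:00-09:00 UTC window, day_offset days ago."""
    base = datetime.now(timezone.utc) - timedelta(days=day_offset)
    start = base.replace(hour=8, minute=0, second=0, microsecond=0)
    params = urllib.parse.urlencode({
        "query": QUERY,
        "start": start.timestamp(),
        "end": (start + timedelta(hours=1)).timestamp(),
        "step": "60s",
    })
    with urllib.request.urlopen(f"{PROM}/api/v1/query_range?{params}") as resp:
        result = json.load(resp)["data"]["result"]
    return result[0]["values"] if result else []

# Compare today's window against the same window over the previous 3 days.
for offset in range(4):
    values = [float(v) for _, v in p99_window(offset)]
    peak = max(values) if values else float("nan")
    print(f"{offset} day(s) ago: peak p99 = {peak:.3f}s")
```

If the same spike shows up every day at that hour, it is likely a known pattern (e.g., a scheduled job) rather than a new incident.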
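For step 8, a minimal sketch of the correlation check. The sample values and the 85% saturation threshold are made up for illustration; in practice the series would come from your metrics backend (requires Python 3.10+ for statistics.correlation):

```python
from statistics import correlation  # Python 3.10+

# Hypothetical per-minute samples pulled from the metrics backend.
throughput_rps = [120, 135, 180, 260, 410, 640, 890, 1100]
cpu_percent    = [31,  33,  41,  55,  68,  83,  94,  97]

CPU_SATURATION = 85  # assumed saturation threshold for this service

corr = correlation(throughput_rps, cpu_percent)
saturated = max(cpu_percent) >= CPU_SATURATION

if corr > 0.8 and saturated:
    # Traffic and CPU move together AND CPU is saturated => scale out,
    # e.g., raise HPA maxReplicas or the autoscaling group's desired capacity.
    print(f"corr={corr:.2f}, peak CPU={max(cpu_percent)}% -> SCALE OUT")
else:
    print(f"corr={corr:.2f} -> saturation unlikely; look elsewhere (deps, locks, GC)")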
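For step 9, a sketch that splits a single HTTP hop into connect, time-to-first-byte, and transfer phases, which quickly tells you whether the time is spent in the network handshake, in server processing, or in moving the payload. The host is a placeholder:

```python
import http.client
import time

def time_phases(host: str, path: str = "/") -> dict:
    """Break one HTTPS request into phases to see where the latency lives."""
    phases = {}
    t0 = time.perf_counter()
    conn = http.client.HTTPSConnection(host, timeout=5)
    conn.connect()                              # TCP + TLS handshake
    phases["connect"] = time.perf_counter() - t0

    t1 = time.perf_counter()
    conn.request("GET", path)
    resp = conn.getresponse()                   # time to first byte (server work)
    phases["ttfb"] = time.perf_counter() - t1

    t2 = time.perf_counter()
    resp.read()                                 # body transfer
    phases["transfer"] = time.perf_counter() - t2
    conn.close()
    return phases

# Hypothetical host; point this at the slow service.
for phase, seconds in time_phases("example.com").items():
    print(f"{phase}: {seconds * 1000:.1f} ms")
```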
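For step 11, a toy read-through cache with a TTL, to show the idea of shielding an overloaded read replica. A production setup would typically use a shared cache (e.g., Redis/Memcached) rather than this in-process dictionary, and the fetch_product loader is hypothetical:

```python
import time

class ReadThroughCache:
    """Tiny in-process read-through cache to shield an overloaded read replica."""

    def __init__(self, loader, ttl_seconds: float = 30.0):
        self._loader = loader          # function that actually hits the database
        self._ttl = ttl_seconds
        self._store = {}               # key -> (expires_at, value)

    def get(self, key):
        hit = self._store.get(key)
        if hit and hit[0] > time.monotonic():
            return hit[1]              # fresh cache hit: no DB round trip
        value = self._loader(key)      # miss or stale: fall through to the DB
        self._store[key] = (time.monotonic() + self._ttl, value)
        return value

# Hypothetical loader standing in for a real SELECT against the read replica.
def fetch_product(product_id):
    print(f"DB query for {product_id}")   # would be a real query in production
    return {"id": product_id, "name": "widget"}

cache = ReadThroughCache(fetch_product, ttl_seconds=30)
cache.get(42)   # first call hits the DB
cache.get(42)   # served from cache; read-replica load drops
```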

