Apple Interview QnA - Part I

 

A server’s CPU is pegged at 100% but top shows no process consuming that much. How do you debug?

When the CPU is pegged at 100%, running top alone may not tell the full story, because the top command itself needs CPU and memory to gather its statistics.

We can start by running # sar or # vmstat to look at CPU performance. CPU stats are divided into %user, %system, %idle and %iowait.
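
A quick sketch of what that looks like in practice (sar comes from the sysstat package, vmstat from procps; the sampling intervals are arbitrary):

$ sar -u 1 5      # CPU utilization every second, five samples: %user, %system, %iowait, %idle
$ vmstat 1 5      # us/sy/id/wa columns, plus run queue (r) and context switches (cs)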

If %user is high, it points to an application consuming most of the CPU.

Then, I would start narrowing down which process is consuming the most CPU using # ps -ef <followed_by_cpu_flags>.
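
One common way to do this, assuming GNU ps (procps) and pidstat (sysstat) are available:

$ ps -eo pid,ppid,user,%cpu,%mem,cmd --sort=-%cpu | head -15   # top CPU consumers first
$ pidstat -u 1 5                                               # per-process CPU usage sampled over time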

If the process belongs to a non-root application, we can restart that application. If the process is owned by root, we can dig into the logs to see what is causing the issue. Rebooting the server is the last-resort fix.

If %iowait is high, the CPU is waiting on I/O events to complete, so the next step is to look at disk and network I/O.
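
A quick sketch for confirming and locating the I/O bottleneck, assuming the sysstat tools are installed:

$ iostat -x 1 5      # per-device utilization (%util), await and queue depth
$ sar -d 1 5         # disk activity sampled over time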

Check # mpstat -P ALL 1 to see per-core usage, including %irq and %soft.

It is also worth checking for zombie processes and looking at /proc/interrupts, which shows the number of interrupts each CPU has handled since boot. This is a powerful place to check when CPU usage is high but nothing shows up in top or ps.

[Screenshot: /proc/interrupts output showing eth0 interrupt counts across all CPUs]

Look for lines with very large numbers, especially counters that are increasing rapidly.

In the example above, eth0 dominates across all the CPUs, which may indicate a driver issue or a bad NIC.
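
A simple way to watch the counters grow in real time (eth0 here is just the interface from the example above):

$ watch -n1 "grep -E 'CPU|eth0' /proc/interrupts"   # counters that climb rapidly under little load point at the noisy device
$ watch -n1 cat /proc/softirqs                      # NET_RX/NET_TX growth here lines up with %soft in mpstat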

Users report intermittent latency but monitoring shows “all green.” How do you prove whether it’s DNS, routing, or app-level?

Latency is the time taken to respond to a request, measured in milliseconds or seconds. It is commonly tracked at the P90, P95 and P99 percentiles.

Understanding the full request flow from the client to the server and back plays a vital role in triaging latency issues.

We start with the application layer to check whether any recent change could be causing the issue. If so, roll back to version N-1.

Next is to find out at which layer the latency is observed. We will start at the client layer.

·       Dig into the logs to find whether the clients experiencing latency share common patterns such as location, ISP, CDN cache node, or client device software like browser version.

·       If the issue is isolated to any of the above except client device software, then it needs to be addressed by the application team.

·       The application team must ensure their application works correctly on the affected client device software.

·       If it is a CDN issue, then traffic can be switched to other CDNs to reduce latency.

If our monitoring shows “all green”, the issue most likely lies on the client side or in the external network path (DNS, ISP, or routing).
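
To prove from the client side whether the time goes to DNS, routing, or the application, a minimal sketch with standard tools (example.com stands in for the affected endpoint):

$ dig example.com | grep "Query time"      # DNS resolution time in ms
$ curl -o /dev/null -s -w 'dns=%{time_namelookup} connect=%{time_connect} tls=%{time_appconnect} ttfb=%{time_starttransfer} total=%{time_total}\n' https://example.com/
$ mtr -rwz -c 100 example.com              # per-hop loss and latency to rule routing in or out

A high dns value points at resolution, loss or latency jumps in the mtr report point at routing, and a large gap between connect and ttfb points at the application itself.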

How would you debug a packet loss issue occurring only between two specific services in different VPCs?

This is a complex issue to triage as it occurs at the networking layer.

Start gathering details:

1)       Is the packet loss uni-directional or bi-directional?

2)       Is it intermittent, or happening all the time?

3)       Is it isolated to specific instances or to an entire subnet?

It is also worth understanding the traffic flow between the VPCs. Communication between VPCs commonly happens via VPC peering or a Transit Gateway.

A typical $ ping between the two endpoints will show the packet loss between the networks.
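
A quick loss measurement from one service to the other (the target address below is a placeholder):

$ ping -c 100 -i 0.2 10.1.2.3 | tail -2    # summary line reports "% packet loss" and rtt min/avg/max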

We can enable VPC Flow Logs to get insight into the traffic and look for “REJECT” entries.
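
If the flow logs are delivered to CloudWatch Logs, one way to pull recent rejects from the CLI (the log group name is a placeholder):

$ aws logs filter-log-events --log-group-name /vpc/flow-logs --filter-pattern REJECT --max-items 20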

If the loss is isolated to a specific instance, check the ENI adapter statistics with $ ethtool -S eth0 and look for Rx/Tx errors, dropped packets and interface queue overflows.
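
A sketch for filtering the counters that matter (eth0 stands in for the ENI's interface name):

$ ethtool -S eth0 | grep -Ei 'err|drop|fifo|miss'   # growing non-zero counters show where packets are being lost
$ ip -s link show eth0                              # kernel-level RX/TX errors, drops and overruns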

We can also use $ mtr (e.g. mtr -rwz -c 100 example.com), which combines ping and traceroute, to identify packet loss, latency and routing issues.

[Screenshot: sample mtr report showing per-hop packet loss and latency]

Ensure no NACL or Security group is blocking the traffic.

Ensure NAT GW is not overloaded.

Ensure the compute instances are not running out of resources.

Ensure there is no MTU size mismatch between the compute instances; a quick way to check is shown below.
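
A minimal MTU sanity check using standard Linux tooling (the peer address is a placeholder; 1472 = 1500 minus 28 bytes of ICMP/IP headers):

$ ip link show eth0 | grep -o 'mtu [0-9]*'     # configured MTU on each side
$ ping -M do -s 1472 -c 5 10.1.2.3             # -M do forbids fragmentation; failures mean the path MTU is smaller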

A deployment is stuck during a rolling update. Half the pods are running the old version; half are failing on the new one. What’s your rollback and investigation approach?

Rolling update is one of the deployment strategies in K8s. This method rolls out the new version in batches. During the rolling update, there is a mix of old and new versions taking traffic at the same time until the update is completed.

While performing a rolling update, we can fine-tune a few deployment parameters to balance availability, speed and safety.


maxUnavailable:

Controls how many pods can be unavailable during deployment.

Lower number -> higher availability, but longer rollout time.

maxSurge:

Controls the maximum number of extra pods that can be created temporarily during the rollout.

Helps roll out faster without compromising availability, but incurs additional cost and resource usage.
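
A sketch of where these knobs live (the deployment name myapp is a placeholder):

$ kubectl get deployment myapp -o jsonpath='{.spec.strategy.rollingUpdate}'   # view current settings
$ kubectl patch deployment myapp --type merge -p '{"spec":{"strategy":{"rollingUpdate":{"maxUnavailable":1,"maxSurge":2}}}}'   # at most 1 pod down, up to 2 extra pods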

Now, in case of a partially failed rollout, we can roll back the deployment using $ kubectl rollout undo deployment <deployment_name>
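
A few related commands that help before and after the undo (the deployment name is a placeholder):

$ kubectl rollout status deployment <deployment_name>                 # confirms whether the rollout is stuck
$ kubectl rollout history deployment <deployment_name>                # lists revisions to pick a rollback target
$ kubectl rollout undo deployment <deployment_name> --to-revision=2   # roll back to a specific revision if N-1 is not the one you want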

Investigate the failure by describing the pod using $ kubectl describe pod <pod_name>

Look in the Events section for ImagePullBackOff, CrashLoopBackOff, or readiness probe failures.

Check the pod logs: $ kubectl logs <failing pod>
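
If the container is crash-looping, the current logs may be empty; these variants usually help (pod and container names are placeholders):

$ kubectl logs <failing_pod> --previous            # logs from the last crashed container instance
$ kubectl logs <failing_pod> -c <container_name>   # a specific container in a multi-container pod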

Compare the new deployment manifest with the old one (N-1) to see what change was introduced and fix it. This could be a change to the pod specification, readiness probe, lifecycle hooks or similar.
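
One way to compare revisions without leaving kubectl (the revision numbers are illustrative; diff the two outputs):

$ kubectl rollout history deployment <deployment_name> --revision=2   # pod template of the previous revision
$ kubectl rollout history deployment <deployment_name> --revision=3   # pod template of the failing revision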

If the pod is multi-container, also check the logs of the init and sidecar containers and take appropriate action.

FYI – Tools like Argo CD can detect deployment issues and automatically roll back to the N-1 manifest.
