Apple Interview QnA - Part I
A server’s CPU is pegged at 100% but top shows no process consuming
that much. How do you debug?
If the CPU is pegged at 100% but top shows no single process responsible, re-running top will not help much: the time is likely being spent where top's per-process view does not attribute it cleanly, such as hardware/soft interrupts, I/O wait, hypervisor steal, or very short-lived processes that exit between refreshes.
We can start by running # sar or # vmstat to see where the CPU time is going. The CPU breakdown is split into %user, %system, %iowait, %steal and %idle.
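For example, a quick sketch (the one-second interval and five samples are arbitrary; sar -u ALL needs a reasonably recent sysstat):
$ sar -u 1 5      # per-second CPU breakdown: %user, %system, %iowait, %steal, %idle
$ sar -u ALL 1 5  # adds %irq, %soft and %guest columns
$ vmstat 1 5      # also shows the run queue (r), blocked tasks (b) and interrupts/context switches (in/cs)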
If %user is high, the time is going to an application rather than the kernel. I would then narrow down the offending process using # ps sorted by CPU usage to see which process is consuming the most CPU.
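For example, a minimal sketch (pidstat comes from the sysstat package and may not be installed everywhere):
$ ps -eo pid,user,%cpu,%mem,etime,cmd --sort=-%cpu | head -15   # current top CPU consumers
$ pidstat 1 5                                                   # per-process CPU sampled over time, which also catches short-lived spikes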
If the process belongs to a non-root application, we can restart that application. If it is owned by root, dig into the logs to see what is causing the issue. Rebooting the server is the fallback if nothing else resolves it.
If %iowait is high, the CPU is sitting idle waiting for I/O (usually disk or network) to complete, so the investigation shifts from processes to storage and network devices.
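A rough sketch for confirming an I/O bottleneck (iostat is also part of sysstat):
$ iostat -xz 1 5        # per-device %util, await (latency) and queue size; a saturated disk shows high %util and await
$ vmstat 1 5            # a persistently non-zero "b" column means tasks blocked on I/O
$ dmesg -T | tail -50   # look for disk/controller errors or hung-task warnings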
Check # mpstat -P ALL 1 to see per-core usage, in particular %irq and %soft (time spent servicing hardware and software interrupts).
It is also worth checking for zombie processes and inspecting /proc/interrupts, which shows the number of interrupts each CPU has handled since boot. This is a powerful place to look when nothing shows up in top or ps: look for lines with very large counts, especially ones that are increasing rapidly.
If, for example, the eth0 line dominates across all CPUs, it may indicate a driver issue or a bad NIC generating an interrupt storm.
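A minimal way to watch the counters live (assuming the NIC is named eth0; adjust to your interface):
$ watch -n 1 "cat /proc/interrupts"   # per-CPU hardware interrupt counts since boot
$ watch -n 1 "cat /proc/softirqs"     # softirq counts (NET_RX, NET_TX, TIMER, ...)
$ grep eth0 /proc/interrupts          # isolate the NIC's IRQ lines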
Users report intermittent latency but monitoring shows “all green.”
How do you prove whether it’s DNS, routing, or app-level?
Latency is the time taken to respond to a request, measured in milliseconds or seconds, and is usually monitored at the P90, P95 and P99 percentiles. Understanding the full request flow, from client to server and back, plays a vital role in triaging latency issues.
Start at the application layer and confirm whether any new change was rolled out that could have caused the issue; if so, roll back to version N-1. Next, work out at which layer the latency is being introduced, starting from the client side:
· Diagnose with the logs to find whether the clients experiencing latency share any common pattern: location, ISP, CDN cache, or client device software such as browser version.
· If the issue is isolated to the client device software, it needs to be addressed by the application team, who must ensure their application works on that device software.
· If it is a CDN issue, the CDN provider can switch traffic to other CDNs to reduce latency.
If our own monitoring shows "all green", the problem most likely sits on the client side or in the external path (DNS, ISP, routing) rather than in the application itself, and the checks below help prove which one it is.
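A minimal sketch for separating DNS, routing and app-level latency from an affected client (example.com stands in for the real endpoint):
$ dig example.com | grep "Query time"   # a slow answer here points at DNS
$ mtr -rwz -c 50 example.com            # per-hop loss/latency points at routing or an ISP segment
$ curl -s -o /dev/null -w "dns=%{time_namelookup} connect=%{time_connect} tls=%{time_appconnect} ttfb=%{time_starttransfer} total=%{time_total}\n" https://example.com
# a large gap between connect/tls and ttfb points at the application or backend, not the network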
How would you debug a packet loss issue occurring only between two
specific services in different VPCs?
This is a complex issue to triage because it sits at the networking layer. Start by gathering details:
1) Is the packet loss uni-directional or bi-directional?
2) Is it intermittent or constant?
3) Is it isolated to specific instances or to the entire subnet?
It is also worth understanding the traffic flow between the VPCs; the common patterns for cross-VPC communication are VPC peering or a Transit Gateway.
A basic $ ping between the two services will confirm and quantify the packet loss.
We can enable VPC Flow Logs to get insight into the traffic and look for "REJECT" entries.
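If the flow logs are delivered to CloudWatch Logs, a rough sketch of pulling the rejects (the log group name is a placeholder):
$ aws logs filter-log-events --log-group-name /vpc/flow-logs --filter-pattern "REJECT" --max-items 50
# each matching record shows srcaddr, dstaddr, ports, protocol and the REJECT action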
If the loss is isolated to a specific instance, check the ENI adapter statistics with $ ethtool -S eth0 and look for Rx/Tx errors, dropped packets and interface queue overflows.
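For example (assuming the interface is eth0; on ENA-based instances ethtool also exposes *_allowance_exceeded counters that reveal instance-level bandwidth/PPS/conntrack throttling):
$ ethtool -S eth0 | grep -Ei "err|drop|exceed"   # NIC/driver error, drop and allowance counters
$ ip -s link show eth0                           # kernel-level RX/TX errors and drops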
We can also use $ mtr -rwz -c 100 example.com, which combines ping and traceroute, to identify where along the path the packet loss, latency or routing issue appears.
· Ensure no NACL or security group is blocking the traffic.
· Ensure the NAT gateway is not overloaded.
· Ensure the compute instances are not running out of resources.
· Ensure there is no MTU size mismatch between the compute instances (a quick check is sketched below).
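A quick MTU sanity check between the two instances (the peer IP is a placeholder; 1472 assumes a standard 1500-byte MTU, i.e. 1500 minus 28 bytes of IP/ICMP headers, and 8973 corresponds to the 9001-byte jumbo MTU used inside a VPC):
$ ip link show eth0 | grep -o "mtu [0-9]*"   # configured MTU on each side
$ ping -M do -s 1472 -c 5 10.0.2.15          # -M do sets "don't fragment"; failures indicate a smaller path MTU
$ ping -M do -s 8973 -c 5 10.0.2.15          # same test for jumbo frames, if both sides expect them
$ tracepath 10.0.2.15                        # reports the discovered path MTU hop by hop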
A deployment is stuck during a rolling update. Half the pods are
running the old version; half are failing on the new one. What’s your rollback
and investigation approach?
Rolling update is one of the deployment strategies in K8s: the new version is rolled out in batches, so a mix of old and new versions takes traffic at the same time until the rollout completes.
While performing a rolling update, we can fine-tune a few deployment parameters to balance availability, speed and safety (a sketch follows below):
maxUnavailable: controls how many pods can be unavailable during the rollout. A lower number means higher availability but a longer rollout.
maxSurge: controls how many extra pods can be created temporarily. A higher number speeds up the rollout without reducing availability, at the cost of extra resources during the rollout.
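A hypothetical example of tuning these on a deployment named web (the name and values are placeholders, not a recommendation):
$ kubectl patch deployment web --type merge -p '{"spec":{"strategy":{"type":"RollingUpdate","rollingUpdate":{"maxSurge":1,"maxUnavailable":0}}}}'
# maxUnavailable: 0 keeps full serving capacity during the rollout; maxSurge: 1 adds one extra pod at a time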
Now, in the case of a partially failed rollout, we can roll back the deployment using $ kubectl rollout undo deployment <deployment_name>
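Related commands that help before and after the undo (the deployment name and revision number are placeholders):
$ kubectl rollout status deployment <deployment_name>                  # shows whether the rollout is progressing or stuck
$ kubectl rollout history deployment <deployment_name>                 # lists revisions, so you know what N-1 is
$ kubectl rollout undo deployment <deployment_name> --to-revision=2    # roll back to a specific revision if needed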
Investigate the failure by describing a failing pod with $ kubectl describe pod <pod_name> and look under Events for ImagePullBackOff, CrashLoopBackOff, or "Readiness probe failed".
Check the pod logs: $ kubectl logs <failing pod>
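A few more log and event commands that are useful here (pod and container names are placeholders):
$ kubectl logs <failing_pod> --previous                    # logs from the previous, crashed container instance
$ kubectl logs <failing_pod> -c <container_name>           # a specific container in a multi-container pod
$ kubectl get events --sort-by=.lastTimestamp | tail -20   # recent events, including probe and scheduling failures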
Compare the new deployment manifest with the old (N-1) one to see what change was introduced and fix it. Typical culprits are changes to the pod specification, readiness probes, lifecycle hooks and similar fields.
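Two ways to see exactly what changed (revision number and file name are placeholders):
$ kubectl rollout history deployment <deployment_name> --revision=3   # prints the pod template of that revision
$ kubectl diff -f deployment.yaml                                      # diffs a local manifest against what is live in the cluster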
If the pod is multi-container, also check the logs of the init and sidecar containers and take the appropriate action.
FYI – tools like Argo CD can detect a failed deployment and automatically roll back to the N-1 manifest.