Apple Interview QnA - Part II
A pod in Kubernetes cannot reach an external API, but curl works fine
from the node. What is your debugging flow?
Because the endpoint is reachable from the node where the pod runs, the node's own network path is fine and the problem most likely sits in the pod layer.
A pod has its own network namespace, so it does not automatically share the node's DNS settings, routes, or proxy configuration. I would start with the checks below:
- Check whether the endpoint resolves from inside the pod. This separates a DNS resolution problem from a network connectivity problem.
- If DNS fails, check the CoreDNS pods, which usually run in the kube-system namespace.
- CoreDNS usually runs as a Deployment (backed by a ReplicaSet), so it is worth checking pod health and resource consumption.
- Suppose DNS works fine and the connection to the external API times out.
- This could be due to a network policy (egress) configured in the namespace.
- It could also be a proxy configuration that blocks access to the external endpoint.
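The checks above can be sketched as a small shell function. This is only an outline under a few assumptions: the namespace, pod, and host arguments are placeholders, the `k8s-app=kube-dns` label is the one CoreDNS usually carries in kubeadm-style clusters, and the pod image is assumed to ship `nslookup`, `env`, and `curl`.

```shell
# Sketch of the pod egress debugging flow. All arguments are placeholders.
debug_pod_egress() {
  local ns="$1" pod="$2" host="$3"

  # 1. Does the name resolve from inside the pod?
  kubectl exec -n "$ns" "$pod" -- nslookup "$host"

  # 2. If DNS fails, check CoreDNS health (label assumed: k8s-app=kube-dns).
  kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide

  # 3. If DNS works but the connection times out, look for egress policies.
  kubectl get networkpolicy -n "$ns"

  # 4. Check for proxy variables set inside the pod.
  kubectl exec -n "$ns" "$pod" -- env | grep -i proxy \
    || echo "no proxy variables set"

  # 5. Reproduce the call with a short timeout to see where it stalls.
  kubectl exec -n "$ns" "$pod" -- curl -sv --max-time 5 "https://$host"
}

# Usage:
#   debug_pod_egress <namespace> <pod> <external-host>
```

If the image has none of these tools, the same checks can usually be run from a throwaway debug pod in the same namespace, since network policies are evaluated per namespace and pod labels.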
How to debug a “stuck” zombie process that refuses to die even with
kill -9?
A process is a program in execution. A process is made up of one or more threads, and threads in the same process share resources such as memory. When a process terminates, the kernel terminates all of its threads automatically.
The Linux kernel maintains a "process table" that keeps track of every active process on the system. It holds information such as PID, PPID, UID/GID, process state (running, sleeping, zombie), CPU/memory usage, open file descriptors, scheduling priority, signal handlers, and thread information.
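When a process completes its work and exits, the Linux kernel is responsible for cleaning up its entry in the process table, but only after the parent process has acknowledged the exit.
Reaping -> Once the child process is done, the parent needs to collect its result (the child's exit status) so the system can clean up.
A zombie is a child that has exited but has not yet been reaped. It is already dead, which is why kill -9 has no effect on it; the fix has to go through the parent.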
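1. Identify the Zombie
ps -eo pid,ppid,stat,cmd | awk '$3 ~ /^Z/'
2. Check the Parent Process
ps -p <PPID> -o pid,stat,cmd
3. Send SIGCHLD to Parent
kill -SIGCHLD <PPID>
4. Restart the Parent process
If the parent still does not reap the child, restart (or kill) the parent. The zombie is then re-parented, typically to init (PID 1), which reaps it.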
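As a rough sketch of steps 1 to 3, the helper below matches on the STAT column only, so a command that merely contains the letter "Z" is not reported as a zombie. The function names `filter_zombies` and `list_zombies` are illustrative, not standard tools.

```shell
# Print "PID PPID COMMAND" for every process whose STAT column starts with Z.
# Reads `ps -eo pid,ppid,stat,comm` style output on stdin.
filter_zombies() {
  awk 'NR > 1 && $3 ~ /^Z/ { print $1, $2, $4 }'
}

list_zombies() {
  ps -eo pid,ppid,stat,comm | filter_zombies
}

# Usage (the parent PID is the second column of the output):
#   list_zombies
#   kill -s CHLD <PPID>    # ask the parent to reap its children
```

Whether SIGCHLD helps depends on the parent actually handling the signal and calling wait(); a parent that ignores it has to be fixed or restarted.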
A critical binary suddenly stops working after a package upgrade.
Walk me through the RCA
Start by establishing what kind of failure the binary shows after the package upgrade:
1) Process crashing.
2) Process completely stopped.
3) Process running but erroring out.
Next, find out which package was upgraded. This can be read from /var/log/yum.log or /var/log/dpkg.log (search for "Updated" in yum.log or "upgrade" in dpkg.log).
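One way to pull the upgrade entries out of either log is sketched below. It assumes the usual line formats, where yum.log marks upgrades with "Updated:" and dpkg.log with " upgrade "; the function name `recent_upgrades` is made up for this example.

```shell
# Print package upgrade entries from a yum or dpkg log.
# Assumes yum.log uses "Updated:" and dpkg.log uses " upgrade " for upgrades.
recent_upgrades() {
  local log="$1"
  grep -E 'Updated: | upgrade ' "$log"
}

# Usage:
#   recent_upgrades /var/log/yum.log
#   recent_upgrades /var/log/dpkg.log
```

Comparing the timestamps in the output with the time the binary started failing narrows the list to the packages worth investigating.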
Check whether the impacted binary itself was upgraded or whether its dependencies were part of the upgrade.
# rpm -qR <package_name> shows the package's dependencies. See if any of them were upgraded.
Also, trace the binary with the # strace command to catch failing system calls.
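A minimal sketch of that step: capture a trace to a file, then list the calls that failed with missing-file or permission errors, which is what a moved library or a changed file mode tends to produce after an upgrade. The function name `failed_syscalls` and the trace path are placeholders.

```shell
# List system calls in an strace log that failed with ENOENT, EACCES or EPERM.
failed_syscalls() {
  local trace="$1"
  grep -E ' = -1 (ENOENT|EACCES|EPERM) ' "$trace"
}

# Usage (the binary path is a placeholder):
#   strace -f -o /tmp/trace.log <binary>
#   failed_syscalls /tmp/trace.log
```

ENOENT results are noisy, because the dynamic loader probes several directories for each library; the interesting entries are the files that are never found in any location.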
It is also worth checking whether any environment variable, file permission, or SELinux setting changed after the upgrade in a way that breaks the application.
See if the upgrade can be rolled back to mitigate the issue: # yum downgrade <package_name>
Disk I/O latency spikes randomly on a database host. What tools/commands
do you use to pinpoint the issue?
Disk I/O is divided into read and write operations. Commands like iostat, vmstat, and sar help in understanding how heavily the disk is used for reads and writes.
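Two attributes in the iostat output (extended view, iostat -x) bring us closer to the issue.
await -> Average time an I/O request takes to complete, including time spent in the queue (reported as r_await and w_await in recent sysstat versions).
-> A high value means the disk is slow or overloaded.
-> A low value means the disk is responsive and handling I/O efficiently.
%util -> Percentage of time the disk was busy.
-> A high percentage means the disk is constantly busy and approaching saturation.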
Possible mitigations:
- Reduce parallel writes or batch them.
- Use faster disks for better performance.
- Separate workloads such as logs, backups, and database data onto different volumes.
Disk queue depth refers to the number of I/O operations that can be queued and waiting to be processed by a storage device at any given time. It’s a measure of how much work the disk can handle concurrently.
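The %util check described above can be sketched as a small filter over iostat output. It looks up the %util column by its header name, since the column order differs between sysstat versions. The function name `busy_disks` and the default threshold of 80 are arbitrary choices for illustration.

```shell
# Print devices whose %util exceeds a threshold (default 80, an arbitrary value).
# Reads `iostat -dx` output on stdin and finds the %util column by its header.
busy_disks() {
  awk -v limit="${1:-80}" '
    $1 ~ /^Device/ {                 # header row: locate the %util column
      for (i = 1; i <= NF; i++) if ($i == "%util") col = i
      next
    }
    col && NF >= col && $col + 0 > limit { print $1, $col }
  '
}

# Usage: three 5-second samples, flag anything above 90% busy
#   iostat -dx 5 3 | busy_disks 90
```

Because the spikes are described as random, running the sampling over a longer window (or reading historical data from sar) is more likely to catch one than a single snapshot.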