Apple Interview QnA - Part II

                                                     


A pod in Kubernetes cannot reach an external API, but curl works fine from the node. What is your debugging flow?

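Because curl succeeds from the node where the pod is running, the external API and the node's own network path are healthy, so the problem most likely sits in the pod's network path.

A pod normally runs in its own network namespace, with its own IP address, routes, and DNS configuration (unless it uses hostNetwork), so its traffic does not necessarily behave like the node's. I would start with the checks below: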

  • Check whether the endpoint's hostname resolves from inside the pod. This separates a DNS resolution problem from a network connectivity problem.
  • If DNS fails, check the CoreDNS pods. CoreDNS usually runs as a Deployment (backed by a ReplicaSet) in the kube-system namespace, not as one pod per worker node.
  • It is worth checking the CoreDNS pods' health, logs, and resource consumption.
  • Suppose DNS works fine but the connection to the external API times out.
  • This could be caused by an egress NetworkPolicy configured in the namespace.
  • It could also be related to a proxy configuration that blocks access to the external endpoint.
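The flow above can be sketched as a small shell helper. This is a minimal sketch, not a definitive tool: it assumes the pod image ships curl, that kubectl exec passes curl's exit status through, and that CoreDNS carries the k8s-app=kube-dns label (as it does on kubeadm-style clusters). The function names next_check and probe_from_pod, and the namespace, pod, and URL arguments, are hypothetical.

```shell
# Map a curl exit code (observed inside the pod) to the next area to check.
# Codes are from the curl man page: 5 = could not resolve proxy,
# 6 = could not resolve host, 7 = failed to connect, 28 = timed out.
next_check() {
  case "$1" in
    0)    echo "ok" ;;
    5)    echo "proxy" ;;
    6)    echo "dns" ;;
    7|28) echo "egress" ;;
    *)    echo "other" ;;
  esac
}

# Hypothetical wrapper: run curl inside the pod and classify the result.
# Assumes curl exists in the container image.
probe_from_pod() {
  ns=$1 pod=$2 url=$3
  rc=0
  kubectl exec -n "$ns" "$pod" -- \
    curl -sS -o /dev/null --max-time 10 "$url" || rc=$?
  next_check "$rc"
}

# Follow-up commands for each result (run manually):
#   dns:    kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
#           kubectl -n kube-system logs -l k8s-app=kube-dns
#   egress: kubectl get networkpolicy -n <namespace>
#   proxy:  kubectl exec -n <namespace> <pod> -- env | grep -i _proxy
```

For example, probe_from_pod <namespace> <pod> <url> printing dns points at CoreDNS, while egress points at NetworkPolicy or proxy settings.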

How to debug a “stuck” zombie process that refuses to die even with kill -9?

A process is a program in execution and is made up of one or more threads. Threads in the same process share resources such as memory. When a process terminates, the kernel terminates all of its threads.

The Linux kernel maintains a process table that tracks every process on the system. Each entry holds information such as PID, PPID, UID/GID, process state (running, sleeping, zombie), CPU/memory usage, open file descriptors, scheduling priority, signal handlers, and thread information.

When a process finishes its work and exits, the kernel releases its resources but keeps its process table entry until the parent process has acknowledged the exit. A process in this state is a zombie: it is already dead, which is why kill -9 has no effect on it.

Reaping -> Once the child process is done, the parent needs to collect its result (the child's exit status, via wait() or waitpid()) so that the kernel can remove the entry.
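Reaping can be seen at shell level with the wait builtin, which is the shell's counterpart to wait()/waitpid(). The sketch below is only an illustration; the function name reap_demo and the exit code 3 are arbitrary choices.

```shell
# Start a child that exits with status 3, then collect (reap) its status.
reap_demo() {
  (exit 3) &            # child runs in the background and exits
  child=$!              # PID of the child
  status=0
  wait "$child" || status=$?   # parent collects the child's exit status
  echo "$status"
}
```

A parent that never performs the equivalent of this wait step is what leaves zombies behind.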

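1. Identify the Zombie

ps -eo pid,ppid,stat,cmd | awk 'NR == 1 || $3 ~ /^Z/'

2. Check the Parent Process

ps -p <PPID> -o pid,stat,cmd

3. Send SIGCHLD to Parent

kill -SIGCHLD <PPID>

4. Restart the Parent process

If the parent ignores SIGCHLD and never calls wait(), restarting (or killing) it is the remaining option: the zombies are then re-parented to init (PID 1) or a subreaper, which reaps them.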
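Steps 1 and 2 can be combined into two small filters over ps output. This is a sketch under the assumption that the input comes from ps -eo pid,ppid,stat,cmd (header on the first line, state in the third column); the function names find_zombies and zombie_parents are hypothetical.

```shell
# Print "PID PPID" for every zombie in `ps -eo pid,ppid,stat,cmd` output.
# Matching on the STAT column avoids false hits on commands containing "Z".
find_zombies() {
  awk 'NR > 1 && $3 ~ /^Z/ { print $1, $2 }'
}

# Print "PPID count" for each parent that has unreaped children.
zombie_parents() {
  find_zombies | awk '{ n[$2]++ } END { for (p in n) print p, n[p] }' | sort -n
}
```

Usage: ps -eo pid,ppid,stat,cmd | zombie_parents. A parent with a steadily growing count is the process to fix or restart.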

A critical binary suddenly stops working after a package upgrade. Walk me through the RCA.

Start by establishing how the binary is failing after the package upgrade:

1) Process crashing.

2) Process completely stopped.

3) Process running but erroring out.

Next, find out which packages were upgraded. On RPM-based systems this is recorded in /var/log/yum.log (entries marked Updated:); on Debian-based systems it is recorded in /var/log/dpkg.log (search for the upgrade keyword).
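On a Debian-based host, the upgrade entries can be pulled out of the log with a short filter. This sketch assumes the standard dpkg.log line layout (date, time, action, package:arch, old version, new version); the function name upgraded_packages is hypothetical.

```shell
# Print "package old -> new" for every upgrade entry in dpkg.log (stdin).
upgraded_packages() {
  awk '$3 == "upgrade" { print $4, $5, "->", $6 }'
}
```

Usage: upgraded_packages < /var/log/dpkg.log. On RPM-based hosts, rpm -qa --last lists packages by most recent install time and serves a similar purpose.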

Check whether the affected binary itself was upgraded or whether its dependencies were part of the upgrade.

# rpm -qR <package_name> lists the package's dependencies. Check whether any of them were upgraded.

Also, run the binary under # strace to trace its system calls and catch the ones that fail (for example, a missing library or a permission error).
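A full strace log is noisy, so it helps to filter it down to the failures that usually matter after an upgrade. This is a sketch that assumes strace's default output format; the function name failed_syscalls and the choice of error codes (ENOENT, EACCES, EPERM) are assumptions, not an exhaustive list.

```shell
# Keep only system calls that failed with "file not found" or a
# permission error. Reads strace output on stdin.
failed_syscalls() {
  grep -E ' = -1 (ENOENT|EACCES|EPERM) '
}
```

Usage: strace -f -o /tmp/trace.log <binary>, then failed_syscalls < /tmp/trace.log. Some ENOENT lines are harmless (the loader probes several library paths), so look for the last failure before the process exits.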

It is also worth checking whether environment variables, file permissions, or SELinux contexts changed during the upgrade in a way that breaks the application.

Check whether the upgrade can be rolled back to mitigate the issue: # yum downgrade <package_name>

Disk I/O latency spikes randomly on a database host. What tools/commands do you use to pinpoint the issue?

Disk I/O consists of read and write operations. Commands such as iostat, vmstat, and sar show how heavily the disks are being used for reads and writes.

Two attributes in the iostat output (extended mode, iostat -x) take us closer to the issue.


await -> Average time taken to complete an I/O request, including time spent waiting in the queue. (Newer sysstat versions report it separately as r_await and w_await.)

  -> High value means the disk is slow or overloaded.

  -> Low value means the disk is responsive and handling I/O efficiently.

%util -> Percentage of time the disk was busy.

  -> High % means the disk is constantly busy and approaching saturation.

If the disk is saturated, the usual mitigations are:

  • Reduce parallel writes or batch them.

  • Use faster disks for better performance.

  • Separate workloads, for example by moving logs, backups, and database data to different volumes.

Disk queue depth refers to the number of I/O operations that can be queued and waiting to be processed by a storage device at any given time. It is a measure of how much work the disk can handle concurrently. iostat reports the observed average queue size as aqu-sz (avgqu-sz in older sysstat versions).
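To catch random spikes, it helps to sample iostat at an interval and flag only the busy devices. The sketch below looks up the %util column by name from the header row, because column positions differ between sysstat versions. The function name hot_devices and the default threshold of 80 are assumptions; it also assumes a locale that uses a decimal point.

```shell
# Print "device %util" for devices at or above a %util threshold.
# $1: threshold (default 80). Reads `iostat -dx` output on stdin.
hot_devices() {
  awk -v limit="${1:-80}" '
    # Header row: remember the position of every column by name.
    $1 ~ /^Device/ { for (i = 1; i <= NF; i++) col[$i] = i; next }
    # Data rows: compare the %util field against the threshold.
    ("%util" in col) && NF >= col["%util"] && $(col["%util"]) + 0 >= limit {
      print $1, $(col["%util"])
    }
  '
}
```

Usage: iostat -dx 5 3 | hot_devices 80. The first iostat report shows averages since boot, so the later samples are the ones that reflect the current spike.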
