Posts

Showing posts from 2025

SRE Interview Questions and Answers - Part II

What is latency and how do you reduce it? Latency is the time taken to respond to a request, that is, the time the application needs to process it. It can be measured for a user request, an application-to-application request, or an application-to-data request (for example S3, a database, or Redshift). A latency issue can have various causes. 1) Client-side issues: these are fairly straightforward to identify when our internal telemetry and observability look GREEN. 2) Possible causes of a client-side issue include the ISP, the user agent (such as the browser), and geolocation-based issues. If it is a server-side issue, it is better to start by probing to pinpoint the problem: Let's say one of the microservices called "inventory" ...
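As a rough illustration of measuring latency as defined above, the sketch below times a single call and summarises a set of samples with nearest-rank percentiles. The helper names `measure_latency_ms` and `percentile` are hypothetical and not part of the original post; real services would usually rely on their existing telemetry instead.

```python
import math
import time


def measure_latency_ms(func, *args, **kwargs):
    """Call func once and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latency samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Percentiles such as p95 or p99 tend to be more useful than the average here, because a few slow requests can hide behind a healthy-looking mean.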

SRE Interview Questions and Answers - Part I

What is SRE and how is it different from DevOps? SRE stands for Site Reliability Engineering, which primarily focuses on managing the application and its infrastructure in PRODUCTION. SREs aim to improve the reliability and resiliency of applications, improve their monitoring and observability, apply a SHIFT LEFT approach so that issues are addressed at the development stage of the software, and monitor the promised SLA, SLO and SLI. They approach every problem from a software development perspective, identify and eliminate toil, focus on automation and runbooks to improve the reliability and resiliency of applications and systems, and take part in Root Cause Analysis and postmortem calls after a major incident. What are...
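To make the SLI and SLO terms above concrete, here is a minimal sketch that treats the SLI as the measured percentage of successful requests and the SLO as the target that percentage is compared against. The function names and the sample numbers are assumptions for illustration only.

```python
def availability_sli(good_requests, total_requests):
    """Measured availability (the SLI) as a percentage of successful requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100.0 * good_requests / total_requests


def meets_slo(sli_percent, slo_percent):
    """True when the measured SLI is at or above the SLO target."""
    return sli_percent >= slo_percent
```

For example, 999 successful requests out of 1000 give an SLI of about 99.9%, which would miss a hypothetical 99.95% SLO.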

Agentic AI - Guardrails

Agentic AI refers to AI systems that can autonomously plan, decide, and act, interacting with tools, APIs, and environments without constant human oversight. Guardrails are essential to ensure these agents operate safely, ethically, and within defined boundaries.
🤖 What Is Agentic AI?
Unlike traditional AI that passively generates responses (e.g., chatbots or classifiers), Agentic AI systems are active participants in workflows. They can:
🔍 Search and retrieve internal or external data
⚙️ Trigger workflows or automate multi-step tasks
🧠 Make decisions based on goals and context
🧾 Write or modify code, schedule events, or make purchases
🔗 Interact with APIs, databases, and other systems
⚠️ Why Guardrails Are Critical for Agentic AI
Because agentic systems can act independently, they pose greater systemic risk than traditional AI. Without proper controls, they might:
🕵️‍♂️ Access sensitive data unintentionally
🧨 Trigger unauthorized actions (e.g., deleti...
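One simple kind of guardrail for the risks described above is an allowlist that is checked before an agent's tool call runs. The sketch below is only an illustration under assumed names: the `ToolCall` type, the tool names, and the three-way allow / needs_approval / deny outcome are all hypothetical and not taken from the original post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str
    argument: str


# Hypothetical tool names, chosen only for this example.
ALLOWED_TOOLS = frozenset({"search_docs", "read_ticket"})
NEEDS_APPROVAL = frozenset({"delete_record", "make_purchase"})


def check_tool_call(call, approved_by_human=False):
    """Return 'allow', 'needs_approval' or 'deny' for a proposed tool call."""
    if call.tool in ALLOWED_TOOLS:
        return "allow"
    if call.tool in NEEDS_APPROVAL:
        # Risky actions only run after an explicit human sign-off.
        return "allow" if approved_by_human else "needs_approval"
    # Anything not explicitly listed is denied by default.
    return "deny"
```

Denying by default means a newly added tool stays blocked until someone deliberately classifies it.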

AWS Lambda Integration With EventBridge

In our previous blog, we explored the concept of Lambda versioning. In real-world scenarios, Lambda functions are typically triggered either on a schedule or in response to specific events. In this post, we’ll walk through how to invoke a Lambda function using both scheduled triggers and event-driven mechanisms. Our goal: the Lambda scans for any RUNNING or PENDING instances of type "T3.SMALL". If there are any instances of that type, it triggers an email. Involved services: 1) Lambda - Python code to scan for RUNNING or PENDING instances of type "T3.SMALL". 2) EventBridge - scheduled and event-based triggers. 3) SNS - notification service. Here is the simple...
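A handler for the goal described above could look roughly like the sketch below. It is not the code from the original post: the `SNS_TOPIC_ARN` environment variable, the helper `matching_instance_ids`, and the message text are assumptions. Note that the EC2 API expects the lowercase values `t3.small`, `running` and `pending` in its filters.

```python
import os

# EC2 filter values are lowercase in the API.
INSTANCE_TYPE = "t3.small"
INSTANCE_STATES = ["running", "pending"]


def matching_instance_ids(response):
    """Collect instance IDs from one describe_instances-style response page."""
    return [
        instance["InstanceId"]
        for reservation in response.get("Reservations", [])
        for instance in reservation.get("Instances", [])
    ]


def lambda_handler(event, context):
    import boto3  # bundled with the Lambda Python runtime

    ec2 = boto3.client("ec2")
    filters = [
        {"Name": "instance-type", "Values": [INSTANCE_TYPE]},
        {"Name": "instance-state-name", "Values": INSTANCE_STATES},
    ]
    instance_ids = []
    for page in ec2.get_paginator("describe_instances").paginate(Filters=filters):
        instance_ids.extend(matching_instance_ids(page))

    if instance_ids:
        # SNS_TOPIC_ARN is a hypothetical environment variable for this sketch.
        boto3.client("sns").publish(
            TopicArn=os.environ["SNS_TOPIC_ARN"],
            Subject="t3.small instances found",
            Message="Matching instances: " + ", ".join(instance_ids),
        )
    return {"matched": instance_ids}
```

Keeping the response parsing in its own function makes it possible to test the logic without calling AWS.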

AWS Lambda - Traffic Shift

AWS Lambda is a serverless compute service. In this blog, we will discuss what a Lambda alias is and how it can be used to shift traffic between 2 versions of an AWS Lambda function. What is Lambda versioning? AWS Lambda versions are immutable snapshots of your function’s code and configuration at a specific point in time. An immutable snapshot is a frozen, unchangeable copy of something at a specific point in time. Immutable = cannot be changed. Snapshot = point-in-time copy. Let's start by creating a simple Lambda function with Node.js as the runtime environment. I updated the code as below: Invoking the Lambda function. Let's publish this version of the Lambda function as "Version-1". Now, our function looks like: demo-function -> Version-1 Let's update the demo-function with the below content. Publishing this version as "Version-2". Now, our Lambda "demo-function" has 2 versions. demo-function -> Version-1               ...
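Shifting traffic between two published versions through an alias can be sketched with boto3's `update_alias` call and its `RoutingConfig` parameter, as below. This is a hedged illustration rather than the post's own steps: the helper names and the alias name are assumptions. Lambda identifies published versions by number ("1", "2"), so "Version-1" and "Version-2" above are treated as labels for versions 1 and 2.

```python
def weighted_routing_config(new_version, new_version_weight):
    """Build a RoutingConfig that sends a fraction of traffic to new_version."""
    if not 0.0 <= new_version_weight <= 1.0:
        raise ValueError("new_version_weight must be between 0.0 and 1.0")
    return {"AdditionalVersionWeights": {new_version: new_version_weight}}


def shift_traffic(function_name, alias_name, stable_version, new_version, weight):
    """Point the alias at stable_version and route `weight` of calls to new_version."""
    import boto3  # imported lazily so the helper above has no AWS dependency

    return boto3.client("lambda").update_alias(
        FunctionName=function_name,
        Name=alias_name,
        FunctionVersion=stable_version,
        RoutingConfig=weighted_routing_config(new_version, weight),
    )
```

Under these assumptions, `shift_traffic("demo-function", "live", "1", "2", 0.1)` would keep about 90% of invocations on version 1 and send about 10% to version 2; `live` is a hypothetical alias name.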

Deployment Strategies In AWS ASG - Terminate and Launch

Terminate and Launch: Terminate the existing instances and launch new instances with the updated LT. In this strategy, we set the minimum percentage of instances that must keep running during the deployment. I have 5 EC2 instances in the ASG, all running on launch template version LT1.
Setting minimum 10% -> Ensures 1 machine is actively taking traffic while the others are updated.
Setting minimum 50% -> Ensures 3 machines are active while 2 are updated.
Setting minimum 90% -> Ensures 4 machines are active while 1 is updated.
Increasing the minimum will increase the deployment time. Please see the table below for a better understanding.
Instances   Min healthy %   Minimum in-service instances   Instances patched at a time
10          20%             2                              8
10          30%             3                              7
10          40%             4                              6
10          50%             5                              5
10 ...
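The arithmetic behind these numbers can be sketched as follows. The rounding rule is an assumption chosen to reproduce the figures in this post (round the minimum up, but always leave at least one instance free to be replaced); the exact rounding used by the ASG service may differ, and `refresh_batch` is a hypothetical helper name.

```python
def refresh_batch(total_instances, min_healthy_percent):
    """Return (minimum in-service instances, instances replaced at a time)."""
    if total_instances < 1:
        raise ValueError("total_instances must be at least 1")
    if not 0 <= min_healthy_percent <= 100:
        raise ValueError("min_healthy_percent must be between 0 and 100")
    # Round the minimum healthy count up using integer arithmetic.
    min_in_service = (total_instances * min_healthy_percent + 99) // 100
    # Assumption: at least one instance must be replaceable, or the refresh never progresses.
    min_in_service = min(min_in_service, total_instances - 1)
    return min_in_service, total_instances - min_in_service
```

With 5 instances, `refresh_batch(5, 50)` gives 3 in service and 2 replaced at a time, matching the 50% case above.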

Deployment Strategies In AWS ASG - Launch before Terminating

An AWS ASG (Auto Scaling Group) scales EC2 instances based on CloudWatch metrics. The backbone of an ASG is the Launch Template (LT), which acts like a blueprint for creating EC2 instances. Whenever there is an update to the LT, existing instances must be updated safely, with near-zero downtime. ASG offers several methods of instance refresh: 1) Launch before Terminating. 2) Terminate and Launch. 3) Custom Behavior. Launch before Terminating: As the name says, it creates new instances with the latest LT before terminating existing instances. Launch new instances and wait for them to be ready before terminating others. This allows you to go above your desired capacity by a given percentage and may temporarily increase costs. Let's say the ASG has a desired capacity of 5 EC2 instances. The "Launch before Terminating" strategy ensures that 5 EC2 instances always exist. With "Launch before Terminating": At any given t...
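The capacity behaviour described above can be illustrated with a small simulation. It is a simplified model, not the ASG's actual algorithm: the function name and the fixed `batch_size` parameter are assumptions, and instance warm-up and health checks are ignored.

```python
def simulate_launch_before_terminate(desired_capacity, batch_size):
    """Replace all instances in batches, launching new ones before terminating old ones."""
    if desired_capacity < 1 or batch_size < 1:
        raise ValueError("desired_capacity and batch_size must be at least 1")
    old, new = desired_capacity, 0
    lowest = highest = desired_capacity
    while old > 0:
        step = min(batch_size, old)
        new += step  # launch replacements first, so capacity goes above desired
        highest = max(highest, old + new)
        old -= step  # terminate old instances only after the new ones are ready
        lowest = min(lowest, old + new)
    return {"lowest_in_service": lowest, "highest_in_service": highest, "final": new}
```

With a desired capacity of 5 and one instance replaced at a time, the group never drops below 5 instances and briefly peaks at 6, which is where the temporary extra cost comes from.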