Explain DevOps Project In Interview
Project Overview: -
- Frontend
- Backend
- Jira: ticket system
- Confluence: documentation
- Slack or MS Teams: internal chat, calls, and meetings
- LastPass or 1Password: store credentials and share them securely with other team members
- GitHub: store the project's application code
Frontend:
Tech Stack Details
- ECS Fargate Service: container
- Route 53
- Load Balancer
- SSL Certificate
- ECR Registry
- Custom VPC
Backend: -
Tech Stack Details
- Lambda Functions (Node.js 16)
- API Gateway
- RDS: MySQL database with replication
- S3 Bucket
- Route 53
- SSL Certificate
- Custom VPC
How are alerts triggered?
- AWS SNS topic with email subscriptions
- Alerts are integrated with a Slack channel
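One common way to wire the SNS topic to Slack is a small Lambda function subscribed to the topic. The sketch below assumes the alerts are CloudWatch alarms (whose SNS message body is a JSON document with fields such as `AlarmName` and `NewStateValue`) and that Slack is reached through an incoming webhook. The function names and the webhook URL are illustrative, not part of the original setup.

```python
import json
import urllib.request


def slack_payload_from_sns(event: dict) -> dict:
    """Build a Slack webhook payload from an SNS-delivered CloudWatch alarm."""
    record = event["Records"][0]["Sns"]
    # CloudWatch places the alarm details as a JSON string in the Message field.
    alarm = json.loads(record["Message"])
    text = (
        f"[{alarm['NewStateValue']}] {alarm['AlarmName']}\n"
        f"{alarm['NewStateReason']}"
    )
    return {"text": text}


def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload to a Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

In a Lambda handler this would be called as `post_to_slack(url, slack_payload_from_sns(event))`, with the webhook URL read from configuration rather than hard-coded.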
Infra Network Setup
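- Frontend to Backend: VPC Peering to connect the VPCs in the two different accounts privately (Direct Connect is the AWS service for linking an on-premises network to AWS, so it applies only if one side runs outside AWS)
- RDS: it should be in a private subnet and secured
- The RDS password should be stored in AWS Secrets Manager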
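To show how an application might read the RDS password at runtime instead of keeping it in code, here is a minimal sketch. It assumes the secret is stored as a JSON string and that `secrets_client` behaves like `boto3.client("secretsmanager")`. The secret name and key names in the usage note are hypothetical.

```python
import json


def get_db_credentials(secrets_client, secret_id: str) -> dict:
    """Fetch a JSON secret and return it as a dict.

    secrets_client is assumed to expose get_secret_value(SecretId=...),
    as the boto3 Secrets Manager client does.
    """
    response = secrets_client.get_secret_value(SecretId=secret_id)
    # Text secrets are returned in the SecretString field.
    return json.loads(response["SecretString"])
```

With boto3 installed, usage would look like `get_db_credentials(boto3.client("secretsmanager"), "prod/backend/mysql")`. Passing the client in as an argument keeps the function easy to test with a stub.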
Deployment End to End Process: -
Frontend: -
The CI/CD pipeline stages should be:
- Build
- Test
- Deploy to Test
- Deploy to Prod
Follow a proper Git branching strategy during deployment.
The branching strategy could use dev, hotfix, feature, and release branches.
Always follow the code review process before merging into the master branch.
CI/CD Pipeline Work
- Build: build the Docker image and push it to the ECR registry
- Test: SonarQube should be integrated to check code quality
- Deploy: deploy the latest image from ECR to the ECS container
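The build and deploy stages above can be sketched as the shell commands a pipeline job would run. This is only an illustration: the account ID, region, repository, cluster, and service names are placeholders, and the deploy step assumes the ECS task definition already references the same image tag (for example a mutable `latest` tag). If every build gets a unique tag, a new task definition revision has to be registered instead.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str) -> str:
    """Return the full image URI for a private ECR repository."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def pipeline_commands(account_id, region, repository, tag, cluster, service):
    """Return the build, push, and deploy commands in execution order."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    image = ecr_image_uri(account_id, region, repository, tag)
    return [
        # Authenticate Docker against the ECR registry.
        f"aws ecr get-login-password --region {region} | "
        f"docker login --username AWS --password-stdin {registry}",
        # Build stage: build the image and push it to ECR.
        f"docker build -t {image} .",
        f"docker push {image}",
        # Deploy stage: restart the service so its tasks pull the image again.
        f"aws ecs update-service --cluster {cluster} --service {service} "
        f"--force-new-deployment",
    ]
```

Printing the list for a given commit is a quick way to explain in an interview what each pipeline stage actually does.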
Backend: -
The CI/CD pipeline stages should be:
- Build
- Test
- Deploy to Test
- Deploy to Prod
Follow a proper Git branching strategy during deployment.
The branching strategy could use dev, hotfix, feature, and release branches.
Always follow the code review process before merging into the master branch.
CI/CD Pipeline Work
- Build: install the required packages and bundle the Lambda functions
- Test: SonarQube should be integrated to check code quality
- Deploy: deploy the latest changes to Lambda, API Gateway, and RDS
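As a rough illustration of the bundling step, the sketch below zips a function directory (handler plus installed `node_modules`) into a deployment package. It assumes dependencies were already installed into the directory by an earlier step. The paths and the function name are hypothetical, and real projects often use a dedicated bundler or framework for this.

```python
import zipfile
from pathlib import Path


def bundle_lambda(source_dir: str, zip_path: str) -> list:
    """Zip every file under source_dir and return the archived relative paths.

    zip_path should live outside source_dir so the archive does not
    try to include itself.
    """
    source = Path(source_dir)
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source.rglob("*")):
            if path.is_file():
                # Store paths relative to the function root, as Lambda expects.
                arcname = path.relative_to(source).as_posix()
                archive.write(path, arcname)
                added.append(arcname)
    return added
```

The resulting zip is what the deploy stage would upload as the function code.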
Monitoring:
- Grafana
- CloudWatch
What Is Covered in Monitoring?
- RDS: CPU and memory utilization, DB connections, replica lag
- Lambda Functions: errors, duration, invocations
- ECS Container: CPU and memory utilization
- API Gateway: 5xx errors, hit count, latency
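The items above map onto standard CloudWatch metric names, which helps when an interviewer asks how the alarms are actually defined. The sketch below lists those metrics and builds the keyword arguments for a CloudWatch alarm. The thresholds, alarm naming scheme, period, and SNS topic ARN are assumptions for illustration. The API Gateway names are the ones published for REST APIs.

```python
# CloudWatch namespaces and metric names for the monitored services.
MONITORED_METRICS = {
    "AWS/RDS": ["CPUUtilization", "FreeableMemory", "DatabaseConnections", "ReplicaLag"],
    "AWS/Lambda": ["Errors", "Duration", "Invocations"],
    "AWS/ECS": ["CPUUtilization", "MemoryUtilization"],
    "AWS/ApiGateway": ["5XXError", "Count", "Latency"],
}


def alarm_params(namespace, metric, threshold, sns_topic_arn, dimensions):
    """Build keyword arguments suitable for CloudWatch put_metric_alarm."""
    service = namespace.split("/")[-1].lower()
    return {
        "AlarmName": f"{service}-{metric.lower()}-high",  # hypothetical naming scheme
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Statistic": "Average",
        "Period": 300,  # seconds per datapoint (assumed)
        "EvaluationPeriods": 2,  # consecutive breaching periods before alarming (assumed)
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # notify the SNS topic on ALARM
    }
```

With boto3, the dict could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`. A metric such as `FreeableMemory` would need a `LessThanThreshold` comparison instead.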
Day to Day Activities:
- Monitor infrastructure status using Grafana and CloudWatch
- Check Jira ticket status and work on pending tasks
- Handle production release management, if any
- Set up CI/CD pipelines according to project requirements
- Follow Git branching best practices in CI/CD for deployments
- Write Dockerfiles as the application requires
- Create and manage infrastructure on AWS using Terraform
- Add new users or grant access to users in IAM on request
- Always look for ways to automate tasks and make enhancements wherever I see the opportunity
- Attend daily standups, client meetings, and internal team meetings
- Create infrastructure-related documents in Confluence
Real Time Issues & Troubleshooting
AWS: -
- Increase EBS volume size for EC2 without downtime
- Configure Auto Scaling for a better-optimized setup
- Enable termination protection for EC2 and deletion protection for RDS and the Load Balancer
- Delete older files from the S3 bucket
- Server performance is very slow: increase the instance type
- RDS database server is running slowly
- Server and database are not able to connect
- Lambda function timeouts
- Security group policy
- IAM user or role policy management
- S3 bucket security: don't make the S3 bucket public
- Automate EC2 and database backups
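For the "delete older files from the S3 bucket" item, a minimal sketch of the selection logic is shown here. It assumes the object listing has the shape returned by boto3's `list_objects_v2` (`Key` and a timezone-aware `LastModified` per object). The 30-day cutoff in the usage note is only an example. For routine cleanup, an S3 lifecycle expiration rule is the managed alternative and needs no script.

```python
from datetime import datetime, timedelta, timezone


def keys_older_than(objects, days, now=None):
    """Return the keys of objects last modified more than `days` days ago.

    `objects` is a list of dicts with "Key" and a timezone-aware
    "LastModified" datetime.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return [obj["Key"] for obj in objects if obj["LastModified"] < cutoff]
```

The returned keys could then be passed to a delete call in batches, for example `keys_older_than(page["Contents"], 30)` for each page of a listing.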
Jenkins or GoCD: -
- Pipeline failure because the server cannot be reached
- Plugin upgrade issues
- Pipeline configuration issues such as variables, SSH, etc.
- Agent failure during pipeline execution
- Jenkins master server crash or failure
- Limited number of build executors
- Storing credentials securely
- Security vulnerabilities such as open ports, insecure configurations, etc.
- Outdated Jenkins version
- Master-slave server failure
Kubernetes: -
- Infrastructure capacity issues: nodes lack the capacity to launch new containers
- Networking configuration challenges
- Log management, or exporting logs to CloudWatch, Grafana, etc.
- Cluster setup and connection
- Pod monitoring
- Check pod status after every deployment and make sure the pods are running
- Set up a CI/CD pipeline for new deployments in the k8s cluster
- If there are error logs, send them to the developer team to fix, according to the error
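The post-deployment pod check can be automated. The sketch below assumes its input is the parsed output of `kubectl get pods -o json` and reports every pod that is not running, along with any container waiting reason such as `ImagePullBackOff` or `CrashLoopBackOff`. The function name is illustrative.

```python
def unhealthy_pods(pod_list: dict) -> dict:
    """Map each unhealthy pod name to the reasons it is considered unhealthy."""
    problems = {}
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        reasons = []
        for container in status.get("containerStatuses", []):
            # A container stuck in "waiting" carries a reason such as ImagePullBackOff.
            waiting = container.get("state", {}).get("waiting")
            if waiting:
                reasons.append(waiting.get("reason", "Waiting"))
        if phase not in ("Running", "Succeeded") or reasons:
            # Fall back to the pod phase when no container reason is available.
            problems[name] = reasons or [phase]
    return problems
```

A pipeline step could fail the deployment when the returned dict is non-empty and attach it to the message sent to the developer team.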
Real Time Issues:
Pod Deployment Failures: Issue: Pods fail to deploy, and troubleshooting the root cause can be challenging, whether it is misconfigured resources, image availability, or connectivity issues.
Ingress Configuration Problems: Issue: Ingress rules not working as expected, leading to routing or load balancing issues. Debugging involves checking configuration syntax, backend services, and networking.
Persistent Volume (PV) and Persistent Volume Claim (PVC) Mismatches: Issue: Mismatched PV and PVC configurations can lead to data access problems. Resolving this involves aligning storage classes, access modes, and reclaim policies.
Networking Issues: Issue: Networking challenges like pod-to-pod communication failures, service discovery issues, or external access problems. Diagnosing involves examining network policies, service configurations, and firewall settings.
Resource Constraints: Issue: Pods experiencing resource limitations or excessive resource usage, causing performance degradation. Addressing this requires optimizing resource allocations and scaling strategies.
Scaling Challenges: Issue: Difficulty in scaling applications horizontally or vertically due to misconfigurations, improper auto-scaling settings, or limitations in cluster capacity.
Secrets Management: Issue: Problems with managing and securing sensitive information using Kubernetes secrets, including issues with encryption, distribution, and updates.
Node Failures and Recovery: Issue: Nodes going down unexpectedly, affecting application availability. Handling this involves implementing node health checks, redundancy, and automated recovery mechanisms.
Image Registry Access Issues: Issue: Problems pulling container images from registries during pod initialization, often related to authentication, authorization, or image availability.
Rolling Updates and Rollbacks: Issue: Challenges in orchestrating rolling updates without downtime or rolling back to a previous version when issues arise. This requires careful management of deployment strategies and versioning.
Terraform: -
State File Corruption: Issue: Terraform state file corruption can occur due to unexpected interruptions or conflicts, leading to inconsistencies in infrastructure management.
Resource Dependencies: Issue: Managing dependencies between resources can be challenging, especially when creating resources that depend on outputs from other resources.
Variable Validation: Issue: Ensuring proper validation of input variables can be tricky, leading to misconfigurations or unexpected behavior.
Sensitive Data Handling: Issue: Managing sensitive data like API keys or passwords in Terraform can pose security risks.
Provider Version Compatibility: Issue: Upgrading Terraform versions might lead to compatibility issues with specific providers or modules.
State Locking: Issue: Concurrent Terraform runs can result in state locking issues, causing conflicts and potential data corruption.
Dynamic Resource Creation: Issue: Dynamically creating resources based on variable inputs can be complex and prone to errors.
Module Versioning: Issue: Managing module versions across different environments can lead to inconsistencies.
Rollback Challenges: Issue: Rolling back infrastructure changes can be difficult, especially when dealing with destructive changes.
Provider Rate Limiting: Issue: Some cloud providers impose rate limits, causing Terraform to fail during rapid or large-scale deployments.
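A usual answer to rate limiting is to retry throttled calls with exponential backoff and jitter. The sketch below shows the idea for any wrapper script around cloud API calls. `RateLimitError` is a hypothetical exception standing in for whatever throttling error the real client raises, and the delay values are arbitrary. On the Terraform side, lowering concurrency with `terraform apply -parallelism=N` is a related mitigation.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's throttling error."""


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0,
                       max_delay=30.0, sleep=time.sleep):
    """Call operation(), retrying on RateLimitError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # The delay ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random time up to the ceiling.
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists only so the function can be exercised without real waiting.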