Hi, I'm

Nikhil Pagote

18+ years building & operating production infrastructure at scale

Measurable Outcomes

Real results delivered through SRE practices, cloud automation, and platform engineering.

18+ Years of Experience: Building & operating production infrastructure at scale
20% Downtime Reduction: Through SRE practices, structured incident response & postmortems
50+ AWS Resources via IaC: EC2, EKS, RDS, Lambda & more, all Terraform-managed
15% Team Productivity Gain: Through mentoring, DevOps tooling & CI/CD automation

Who I Am

AWS Cloud Platform Engineer and DevOps specialist with 18+ years of experience designing, automating, and operating production infrastructure at scale. Currently building AWS-native infrastructure using Terraform, Kubernetes (EKS), and Python-driven automation, delivering CI/CD pipelines, containerised workloads, infrastructure-as-code, and full-stack observability for SaaS platforms.

Proven track record of leading cloud migrations, establishing SRE best practices, and mentoring engineering teams. Experienced across a broad range of industries including financial services, networking, IT services, and enterprise SaaS.

Passionate about automation-first thinking, clean infrastructure design, and building systems that are not just functional, but resilient, observable, and easy to operate.

Currently deepening expertise in MLOps and AI infrastructure — with a clear ambition to bridge DevOps and machine learning: orchestrating model training pipelines, managing GPU compute, and operating large-scale distributed systems in HPC environments. Drawn to the engineering challenges that sit at the intersection of infrastructure, performance, and intelligent systems.

18+ Years Experience
8+ Companies
10+ Certifications
50+ AWS Resources Managed

Technical Skills

AWS Cloud Platform

EC2, EKS, Lambda, S3, RDS, ECR, VPC, IAM, CloudWatch, CodePipeline, CloudFormation, CDK, Route 53, Secrets Manager, KMS, AWS Service Catalog, WAF, ELB, ALB, Step Functions, EventBridge, AWS ParallelCluster

Infrastructure as Code

Terraform, OpenTofu, Terragrunt, CloudFormation, AWS CDK, Ansible, Packer, Boto3, Chef, SaltStack, Pulumi

Automation

Python 3, Boto3, FastAPI, Flask, Streamlit, REST APIs, LangGraph, CrewAI, Pandas, NumPy, Bash/Shell, Rust

CI/CD Pipelines

GitHub Actions, Jenkins, Bitbucket Pipelines, AWS CodePipeline, GitLab CI, Git, DORA Metrics, Groovy (Jenkinsfile), GitOps, SLSA Framework

Kubernetes & Containers

Kubernetes (CKA), Amazon EKS, Helm Charts, Docker, ECR, Service Mesh, Istio, ArgoCD, GPU Workloads, Podman, Buildah, Docker Compose

Observability & SRE

Prometheus, Grafana, Dynatrace, CloudWatch, SLO/SLA Tracking, Incident Response, OpenTelemetry, Elasticsearch, Jaeger

Databases

PostgreSQL, MySQL, MongoDB, Amazon RDS, Pinecone, ChromaDB, InfluxDB, Redis, etcd

Systems Foundation

Linux/Unix, IBM AIX, PowerVM, PowerHA, SAN/Storage, Performance Tuning, HPC Clusters, Slurm, PBS, CUDA, DCGM, IBM Spectrum Scale (GPFS), Lustre FS, njmon

MLOps, LLMOps & Generative AI

MLflow, Weights & Biases, Amazon SageMaker, Kubeflow, KFP, DVC, Ray, Apache Airflow, vLLM, Triton Inference Server, LangChain, LangGraph, LangSmith, CrewAI

Delivery Pipelines

Pipelines are not just for apps — the same principles apply across infrastructure, models, and data.

Cloud Platform & App Delivery
Source (Git) → Build & Test → Containerise (ECR) → IaC Deploy → Observe
GitHub Actions, Jenkins, Terraform, EKS, Grafana

ML Training & Inference
Data Prep → Train (GPU, CUDA) → Evaluate (MLflow) → Serve (vLLM) → MLOps
PyTorch, CUDA, vLLM, Triton, Ray

50+ AWS Resources via IaC
5+ Pipeline Platforms
20% Downtime Reduction
15% Team Productivity Gain

Work Experience

Redwood Software

SaaS Cloud Engineer

May 2024 – Present · Burnham, Slough, UK
  • Designed end-to-end CI/CD pipelines using Bitbucket Pipelines and AWS CodePipeline, automating multi-environment deployment workflows and reducing release cycle time.
  • Automated AWS infrastructure provisioning using Terraform (modular, multi-environment) and Python/Boto3, managing 50+ resources including EC2, EKS, S3, RDS, and Lambda with version-controlled IaC.
  • Built full-stack observability with Prometheus and Grafana — dashboards for latency, error rates, and capacity planning; deployed containerised apps on EKS with Helm charts.
Santander UK plc

IT Systems Manager — Digital Transformation

Oct 2022 – Mar 2024 · Milton Keynes, UK
  • Led CI/CD pipeline implementation with DORA metrics using GitHub Actions, reducing feedback time and establishing engineering productivity baselines.
  • Orchestrated migration of on-premises infrastructure to AWS — containerised legacy workloads onto EKS, redesigned pipelines with Terraform, achieving measurable reduction in operational costs.
  • Led and mentored a team of engineers implementing SRE best practices, achieving 20% reduction in service downtime and 15% increase in team productivity.
Cisco

DevOps Architect / SRE

May 2021 – Oct 2022 · Feltham, UK
  • Promoted SRE best practices across the organisation, reducing system downtime by 20% through structured incident response and blameless postmortems.
  • Assessed and modernised legacy technical debt — migrated Perl automation to Python, aligned with Cisco security compliance, introduced DevOps innovation patterns.
Cognizant Technology Solutions

Associate Operations Manager

Feb 2015 – May 2021 · London, UK
  • Managed end-to-end IT service delivery for enterprise clients — data centre operations, AWS cloud migrations, DevOps automation, and presales engagements.
  • Built and managed CI/CD pipelines using Jenkins, Docker, Git, and Puppet with Python/Bash automation scripts; implemented Terraform IaC across multiple client engagements.
Atos

System Analyst — AIX/PowerVM Engineer

Feb 2013 – Feb 2015 · Pune, India
  • Administered IBM Power7/8 systems running AIX with PowerVM LPAR virtualisation; configured GPFS parallel filesystems and managed PowerHA high-availability clustering with SAN fabric.
Infosys Technologies Ltd

Senior System Engineer

Mar 2011 – Feb 2013 · Pune, India
  • Delivered advanced AIX/PowerVM/Linux administration across enterprise Power estates; awarded the RCL Crown Award for Excellence, recognised as most valuable team member.
Mphasis

Senior Infrastructure Engineer

Mar 2009 – Jan 2011 · Pune, India
  • Managed AIX server fleet: installation, configuration, performance tuning, security hardening, and system upgrades across production environments.

Projects

High-impact professional work spanning cloud platforms, SRE, and systems engineering

Redwood Software · 2024 – Present
Cloud Migration & Legacy Infrastructure

Docker Swarm to EKS Migration & Legacy SaaS Platform Ownership

Brought in to lead the migration of RunMyJobs (RMJ/RMF), Redwood's flagship SaaS workload automation product, from a production Docker Swarm platform to EKS. Began by taking full ownership of the legacy infrastructure to understand its application architecture, services, and networking in depth, as the foundation required to design the EKS target state.

  • Owned end-to-end region deployment — from provisioning a new AWS environment via Control Tower and management account, through IAM policy creation, jump server setup, Docker Swarm cluster implementation, Harbor container registry deployment, application deployment, and Traefik reverse proxy configuration
  • Authored and maintained node bootstrap scripts and supporting Python automation to configure cluster nodes from bare provisioning through to a fully operational state
  • Managed and maintained legacy SaaS infrastructure across AWS multi-account, multi-region environments with DR, using Terraform and Terragrunt as the primary IaC toolchain
  • Automated legacy platform gaps not covered by IaC using Python, Boto3, and a Flask API framework — with CRUD endpoints for creating, updating, and deleting Docker Swarm services for production customers
  • Held full ownership of the SaaS infrastructure repository, SaaS services repository, and the Flask API framework (originally written in Python 3.8)
  • Modernised the legacy Python codebase: introduced compatibility layers to support the latest Python version alongside the legacy Docker runtime, and migrated package management to uv and linting to ruff
  • Owned container image creation and orchestration — authored and maintained Dockerfiles and the pipelines that built and published images to the container registry
  • Migrated the image build pipelines from Bitbucket Pipelines to GitHub Actions, contributing to the broader CI/CD modernisation effort
Multi-account AWS with DR
Docker Swarm → EKS migration
Docker Swarm, AWS Control Tower, Terraform, Terragrunt, Harbor, Traefik, Python / Boto3, Flask, Bitbucket Pipelines, GitHub Actions, uv, ruff
Cisco · 2021 – 2022
SRE

SRE Practice & Reliability Transformation

Established SRE culture across the organisation, reducing system downtime by 20% through structured incident response frameworks and modernising legacy automation.

  • Reduced system downtime by 20% through SRE best practices and on-call structure
  • Implemented blameless postmortem culture, runbooks, and incident severity classification
  • Migrated legacy Perl automation to Python, aligned with Cisco security compliance
  • Introduced DevOps innovation patterns and engineering productivity tooling
20% downtime reduction
Perl → Python legacy migration
Python, Kubernetes, Prometheus, PagerDuty, Linux, Bash
Cognizant Technology Solutions · 2015 – 2021
Legacy Infrastructure & Team Leadership

AIX 5.3 Legacy Infrastructure Management — Leading German Telecom Operator, UK Subsidiary

Led a 7-person team managing production-critical AIX 5.3 on IBM Power 5 for a leading German telecom operator's UK subsidiary — with zero vendor support, as the market had moved on to IBM Power 8.

  • Managed and mentored a team of 7, serving as the sole escalation path with no vendor support available for the end-of-life platform
  • Kept mission-critical production workloads running on IBM Power 5 / AIX 5.3 beyond the vendor support lifecycle
  • Identified that no monitoring agent was compatible with AIX 5.3; authored custom shell health-check scripts scheduled via cron to fill the gap
  • Scripts detected early server anomalies proactively, enabling advance identification of P1 and P2 incidents before they impacted production
  • Pursued Python training in 2015 alongside the leadership role — this project marked the beginning of a programming journey that has since grown into a core skill
7 engineers led
Zero vendor support
P1/P2 early detection
AIX 5.3, IBM Power 5, Shell / Bash, Cron, Legacy Infrastructure, Team Leadership, Python
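
The health checks described above were shell scripts run from cron. Purely as an illustration of the threshold idea, here is a minimal Python sketch. The metric names, the limits, and the `evaluate` helper are hypothetical and are not taken from the original scripts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warn: float  # value at which a P2-style warning is raised
    crit: float  # value at which a P1-style alert is raised


# Hypothetical limits; real values would be tuned per server.
THRESHOLDS = {
    "fs_used_pct": Threshold(warn=80.0, crit=95.0),
    "paging_used_pct": Threshold(warn=50.0, crit=80.0),
    "run_queue": Threshold(warn=8.0, crit=16.0),
}


def evaluate(sample: dict[str, float]) -> list[str]:
    """Return one alert line per metric that crosses a threshold."""
    alerts = []
    for name, value in sorted(sample.items()):
        limit = THRESHOLDS.get(name)
        if limit is None:
            continue  # unknown metric: skip rather than guess a limit
        if value >= limit.crit:
            alerts.append(f"P1 {name}={value}")
        elif value >= limit.warn:
            alerts.append(f"P2 {name}={value}")
    return alerts


if __name__ == "__main__":
    # A scheduler such as cron would run this periodically and forward any output.
    print(evaluate({"fs_used_pct": 96.0, "paging_used_pct": 55.0, "run_queue": 2.0}))
```

Keeping the limits in one table means a new check is a one-line change, which matters on a platform where no vendor agent is available.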
CI/CD & Web Delivery

Jenkins CI/CD Pipeline — Shop & Purchase Web App for Leading US Airline

Designed and built a Jenkins-based CI/CD pipeline in Groovy for a leading US airline's shop-and-purchase web application, hosted on IBM WebSphere across a 5-node cluster with DR and Production environments.

  • Authored Jenkinsfile pipelines entirely in Groovy; built reusable Jenkins shared libraries to encapsulate complex deployment logic cleanly
  • Pipeline deployed to DR environment first — promoted to Production only after all five WebSphere cluster nodes reported a healthy application status
  • Implemented node-aware retry logic: after each deployment iteration the pipeline prompted for confirmation, identified failed nodes, and re-deployed exclusively to those nodes — skipping already-healthy ones
  • Eliminated the need for a dedicated CD tool entirely, avoiding its licensing cost by delivering full CI and CD capability within a single, self-sufficient Jenkins pipeline
5-node WebSphere cluster
Zero CD tool licensing cost
Jenkins, Groovy, IBM WebSphere, REST APIs, DR / Production
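
The node-aware retry logic above lived in a Groovy Jenkinsfile. The sketch below restates the idea in Python under stated assumptions: `deploy`, `is_healthy`, and `confirm` are hypothetical callables standing in for the real deployment step, the WebSphere health check, and the pipeline's confirmation prompt, and `max_rounds` is an invented safety limit.

```python
from typing import Callable, Iterable


def deploy_with_retry(
    nodes: Iterable[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    confirm: Callable[[list[str]], bool],
    max_rounds: int = 3,
) -> list[str]:
    """Deploy to every node, then re-deploy only to nodes that are unhealthy.

    Returns the nodes still failing after the final round (empty on success).
    """
    pending = list(nodes)
    for _ in range(max_rounds):
        for node in pending:
            deploy(node)
        # Keep only the failed nodes; healthy ones are never touched again.
        pending = [n for n in pending if not is_healthy(n)]
        if not pending or not confirm(pending):
            break
    return pending


if __name__ == "__main__":
    attempts: dict[str, int] = {}

    def fake_deploy(node: str) -> None:
        attempts[node] = attempts.get(node, 0) + 1

    def fake_health(node: str) -> bool:
        # Simulated: node3 only becomes healthy on its second deployment.
        return node != "node3" or attempts[node] >= 2

    failed = deploy_with_retry(
        [f"node{i}" for i in range(1, 6)], fake_deploy, fake_health, lambda _: True
    )
    print(failed, attempts)
```

In the simulated run only `node3` is deployed twice. A promotion gate would proceed to Production only when the returned list is empty.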
AWS & DevOps — Data Science Platform

Self-Service Data Science Infrastructure — UK Financial Regulatory Authority

Delivered a self-service AWS provisioning platform for a UK financial regulatory authority's data science team — enabling data scientists to request ad hoc EC2 environments for experimentation without infrastructure team involvement.

  • Implemented CI/CD pipelines and IaC in AWS CloudFormation to provision on-demand EC2 instances and ALB based on data scientist requests, approved via the account team
  • Wrote Chef cookbooks executed on provisioned instances to install required Python versions, Conda, Jupyter Notebook, Pandas, NumPy, and supporting libraries
  • Configured Apache HTTP Server as a reverse proxy, enabling data scientists to access Jupyter Notebooks directly from a browser — no SSH or direct instance login required
  • Transformed an ad hoc, manual provisioning process into a repeatable self-service model, freeing the data science team to focus entirely on experimentation and model training
Self-service on-demand provisioning
Zero instance login required
AWS, EC2, CloudFormation, ALB, Chef, Jupyter Notebook, Conda, Apache, Python
Infosys Technologies Ltd · 2011 – 2013
HPC & High-Performance Infrastructure

Greenfield HPC-Class Cluster Build on IBM Power 795

Architected and delivered a greenfield HPC-class infrastructure for a leading UK manufacturer — from bare-metal networking and high-performance SAN storage through to parallel filesystem and clustered database deployment. The same technology stack (GPFS, high-density Power nodes, HA clustering) underpins modern AI and HPC compute clusters.

  • Provisioned IBM Power 795 compute nodes from bare metal — BIOS, HMC cluster configuration, LPAR partitioning, and full OS hardening on AIX
  • Designed and configured high-performance SAN fabric: IBM DS8800 primary storage with full multipath zoning; IBM TS5300 tape library for tiered backup
  • Deployed GPFS (IBM Spectrum Scale) parallel filesystem — the same technology used in national HPC centres and AI supercomputing clusters worldwide
  • Implemented Oracle RAC over GPFS — a production-grade parallel database on a parallel filesystem, requiring deep expertise in concurrent I/O, node fencing, and cluster quorum
  • Configured PowerHA (HACMP) for compute node high availability, delivering the redundancy model directly transferable to modern HPC cluster schedulers (Slurm, PBS)
Greenfield HPC-class build
GPFS parallel filesystem
Full-stack net → storage → OS → FS
IBM Power 795, HPC, PowerVM / VIO, GPFS / Spectrum Scale, AIX, PowerHA / HACMP, HMC
Open Source & Learning · Continuous

Continuous learning and experimentation with modern DevOps and MLOps practices

Governance-first ML automation

End-to-End MLOps Pipeline — Production-Grade Experimentation Framework

Personal POC implementing a complete, industry-standard MLOps lifecycle — built and operated as a production-grade pipeline from the ground up using Databricks and Spark.

  • Data ingestion and consumption from source systems into a governed data layer
  • Data filtering, cleaning, and feature engineering as reproducible pipeline stages
  • Model building, training, and testing with experiment tracking and versioning
  • Model deployment to a serving layer using vLLM and Triton Inference Server for high-performance model hosting, with automated promotion gates between environments
  • Model evaluation and continuous monitoring using MLflow and Weights & Biases — tracking metrics, comparing runs, and detecting drift and performance degradation
  • Model registry integration for versioned model storage, lineage tracking, and controlled promotion between staging and production
  • Automated retraining trigger — closing the feedback loop by initiating a new training run when monitoring detects data drift or performance degradation
  • Entire pipeline treated as production-grade: IaC-provisioned infrastructure, version-controlled configs, and governance-first design throughout
Databricks, Spark, Python, Kubeflow, KFP, MLflow, DVC, Weights & Biases, Apache Airflow, Amazon SageMaker, Ollama, Terraform
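
One way the automated retraining trigger above could decide when to fire is sketched below. It is a minimal illustration, not the POC's actual implementation. It assumes drift is measured with a Population Stability Index (PSI) on a single numeric feature, and the `psi_limit` and `min_accuracy` thresholds are hypothetical defaults.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a baseline and a recent sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def fractions(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def should_retrain(
    baseline: Sequence[float],
    recent: Sequence[float],
    accuracy: float,
    *,
    psi_limit: float = 0.2,      # assumed rule-of-thumb drift threshold
    min_accuracy: float = 0.9,   # assumed minimum acceptable accuracy
) -> bool:
    """Trigger retraining on data drift or on performance degradation."""
    return psi(baseline, recent) > psi_limit or accuracy < min_accuracy


if __name__ == "__main__":
    baseline = [i / 100 for i in range(100)]
    shifted = [0.5 + i / 200 for i in range(100)]
    print(should_retrain(baseline, baseline, accuracy=0.95))
    print(should_retrain(baseline, shifted, accuracy=0.95))
```

In a scheduled monitoring job, a `True` result would start a new training run, for example by triggering the orchestrator's training pipeline.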
Secure deployment automation & RAG

AI-Powered Shopping Cart — RAG App with Production-Grade CI/CD

A Streamlit-based shopping cart app with an embedded AI chatbot — users browse products and converse with a bot in the same interface. Built with a dual-database RAG architecture and wrapped in a production-grade, security-enforced CI/CD pipeline.

  • Supabase (Postgres + RLS) as the transactional source of truth for inventory, cart, and orders; Pinecone as the vector store for semantic product search and personalised recommendations
  • Groq Cloud powering both embedding (llama3-8b) and chat completion (mixtral-8x7b) — enabling natural language product discovery without exact keyword matches
  • Full RAG pipeline: user query → Groq embedding → Pinecone ANN search → Supabase product fetch → Groq response
  • Production-grade CI/CD via GitHub Actions: ruff lint, mypy strict type-check, pytest >80% coverage, detect-secrets scan, Docker multi-stage build, and ArgoCD GitOps deployment to EKS
  • Python MVP planned for Rust rewrite — same pipeline, same CI gates, benchmarked with Criterion
LangChain, LangGraph, CrewAI, Streamlit, Supabase, Pinecone, Groq Cloud, RAG, GitHub Actions, ArgoCD, Docker, Ollama, Python, Rust
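
The retrieval flow above (query → embedding → nearest-neighbour search → product fetch → response) can be sketched with in-memory stand-ins. Everything here is an assumption made for illustration: the `PRODUCTS` dictionary replaces the Supabase tables, `INDEX` replaces the Pinecone index, and `embed` is a toy bag-of-words function rather than a real embedding model. No external service is called.

```python
import math
from collections import Counter

# Stand-in for the transactional store; the product rows are invented.
PRODUCTS = {
    "p1": {"name": "Trail running shoes", "price": 89.0},
    "p2": {"name": "Waterproof hiking jacket", "price": 129.0},
    "p3": {"name": "Insulated water bottle", "price": 19.0},
}


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Stand-in for the vector index, built once from the product names.
INDEX = {pid: embed(p["name"]) for pid, p in PRODUCTS.items()}


def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Return the ids of the products most similar to the query."""
    q = embed(query)
    ranked = sorted(INDEX, key=lambda pid: cosine(q, INDEX[pid]), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str) -> str:
    """Fetch the matched products and assemble grounded context for the chat model."""
    rows = [PRODUCTS[pid] for pid in retrieve(query)]
    context = "\n".join(f"- {r['name']} (GBP {r['price']:.2f})" for r in rows)
    return f"Answer using only these products:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    print(build_prompt("hiking jacket for rain"))
```

Keeping prices and stock in the transactional store and only ids in the vector index means the chat model is always grounded in current product data, which is the point of the dual-database design.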
HPC & Infrastructure-as-Code

HPC Cluster on AWS ParallelCluster with Kubernetes & Slurm

Exploring IaC-driven deployment of HPC infrastructure using AWS ParallelCluster — with Kubernetes and the Slurm operator (SlinkyProject) running on top. Also evaluating Floci as a complementary tool in the HPC orchestration space.

  • Provisioning HPC compute clusters via AWS ParallelCluster using Terraform IaC
  • Deploying Kubernetes on the cluster and integrating the Slurm operator via the SlinkyProject to bridge HPC job scheduling with Kubernetes workload management
  • Planning to test Elastic Fabric Adapter (EFA) for low-latency, high-bandwidth RDMA networking between nodes — AWS's equivalent of InfiniBand
  • Integrating Amazon FSx for Lustre as a high-performance parallel filesystem for the cluster — the standard shared storage layer in HPC environments
  • Extending the node group with GPU-enabled EC2 instances to demonstrate GPU workloads scheduled via the Slurm Workload Manager on Kubernetes
AWS ParallelCluster, Terraform, HCL, Kubernetes, Slurm, SlinkyProject, Floci, EFA / RDMA, FSx for Lustre, GPU (EC2), AWS
Reliability engineering

Full-Stack Kubernetes Observability POC

Personal POC building a local Kubernetes cluster with kind to demonstrate end-to-end observability — covering metrics, logs, traces, GitOps, and service mesh on a single cluster. No GPU workloads; purely focused on cloud-native observability patterns.

  • Provisioning a local Kubernetes cluster using kind as a lightweight, reproducible environment
  • Deploying OpenTelemetry to collect logs, traces, and health checks from the cluster and running applications, with Prometheus as the metrics backend and Grafana for dashboards
  • Implementing ArgoCD for GitOps-driven application deployment and lifecycle management
  • Running a sample Python application to generate realistic logs and traces, exercising the full observability pipeline end-to-end
  • Configuring Istio service mesh to demonstrate traffic management, mutual TLS, and inter-service observability within the cluster
kind, OpenTelemetry, Prometheus, Grafana, ArgoCD, Istio, Python, Kubernetes
Learning & Growth Focus
  • Building a production-grade end-to-end MLOps pipeline on Databricks — from data ingestion to model registry, drift detection, and automated retraining
  • Deploying a full Kubernetes observability stack on a local kind cluster using OpenTelemetry, Prometheus, Grafana, ArgoCD, and Istio service mesh
  • Provisioning HPC infrastructure on AWS ParallelCluster with Kubernetes, Slurm operator, EFA networking, FSx for Lustre, and GPU-enabled node groups
  • Developing an AI-powered RAG shopping cart app with LangChain, Groq, Supabase, and Pinecone — wrapped in a security-enforced GitHub Actions CI/CD pipeline
  • Exploring Rust as a second runtime — rewriting the Python shopping cart backend to benchmark performance and learn systems programming

Certifications

What People Say

I have seen Nikhil as the utmost proactive DevOps Architect and forward thinker. Many times he has been a great technical consultant for me as a reference point. He is really a premium asset for the team.

Nikhel Nakhasi
Chapter Section Lead, Cloud & DevOps · Roche

Nikhil is a Great Guy, wonderful person to work with. He can be a Gem in any Organisation.

Manish Meshram
Solution Architect, Digital Transformation · Crave InfoTech

Tech savvy, self-motivated professional with a strong drive to deliver results.

Rajib Dam Chowdhury
Delivery Manager · ITC Infotech

I had the pleasure of working closely with Nikhil on several projects, and I can confidently say he is an exceptional colleague. His ability to collaborate effectively, communicate clearly, and bring creative solutions to challenges made a significant impact on our team's success. Nikhil consistently demonstrated reliability and dedication, ensuring tasks were completed on time and with high quality. Beyond his professional skills, he brought positivity and encouragement to the workplace, making him a valued team member and a joy to work with.

Rutvizkumar P
Assistant Vice President, Unix · Barclays

I had the pleasure of managing Nikhil at Cognizant, where we worked together on the UK's leading payment gateway project, based at the client's Basildon, UK office. Nikhil is a technically strong engineer with solid hands-on experience in Kubernetes infrastructure and performance tuning. During our time together, we worked extensively on tuning an on-prem Kubernetes cluster setup, and Nikhil consistently showed great ownership, a methodical approach to problem-solving, and a collaborative attitude under pressure. I would strongly recommend Nikhil to any team looking for a skilled and dependable infrastructure engineer.

Ravindra Verma
Senior DevOps Architect · Philips

Let's Connect

Open to new opportunities in Cloud Platform Engineering, DevOps/SRE, and Platform Engineering roles. Whether you have a role in mind or just want to talk infrastructure, I'd love to hear from you.