• Designed and built an SRE-focused Kubernetes platform that reduced Mean Time To Detection (MTTD) by 60% and Mean Time To Resolution (MTTR) by 45% through metrics-driven alerting and automated remediation.
• Defined and operationalized SLIs, SLOs, and error budgets for critical services, integrating Prometheus, Alertmanager, and Grafana for proactive incident detection and response.
• Implemented GitOps-based deployment workflows using ArgoCD and Helm, enabling safe, auditable releases with automated rollback guarantees.
• Built agent-driven remediation runbooks with guardrails, enabling controlled, automated mitigation of common failure scenarios without human intervention.