Site Reliability and DevOps Engineering is a key function within the Rackspace support infrastructure and is expected to provide a high level of technical support to our customers via phone, the ticketing system, and automation. This role owns complex customer issues, which may take several days or weeks to resolve, and keeps customers updated through every step of the process. It also provides a framework for system development, maintenance, and enhancement efforts, and implements the standards and guidelines of SRE support.
You will be a member of an SRE team that operates our private cloud and develops tools and integrations for a portfolio of cloud infrastructure services. You will use your private cloud knowledge to drive improvements in operations and releases through code. You will use common open-source observability tooling such as ELK and Grafana to set up proactive alerts and to measure and maintain Service Level Objectives. You will work with the Tier 1 team on escalations and use those escalations as opportunities to automate.
This is an onsite position based in KSA.
Responsibilities
- Develop and deliver the software required to build and improve the functionality, reliability, availability, and manageability of applications and cloud platforms, using a DevOps model for on-premises environments (OpenStack and Kubernetes)
- Automate the development, testing, and deployment processes through CI/CD pipelines (Git, GitLab, Helm, ArgoCD) to deliver to different architectures
- Work with Tier 1 support on system and customer escalations
- Troubleshoot and resolve issues related to infrastructure, Kubernetes clusters, applications, workloads, and networks
- Collaborate with software engineering teams to optimize application performance and reliability
- Work on the reliability and continuity of Kubernetes and virtual machine (VM) workloads
- Continuously evaluate and improve systems and processes to enhance reliability, performance, and efficiency
- Stay updated with industry trends, best practices, and emerging technologies in SRE and DevOps
Role Requirements
- Solid private cloud (OpenStack, VMware, and Kubernetes) infrastructure background, with operational, troubleshooting, and problem-solving experience
- Software development lifecycle including development, testing, packaging, deployment, upgrade, and support
- Private cloud and Kubernetes resource development and operations experience, including familiarity with major OpenStack components such as Keystone, Nova, Neutron, and Glance, and with Kubernetes components such as CNI (Container Network Interface), CRI (Container Runtime Interface), CSI (Container Storage Interface), and the control plane
- Software development experience in Python
- Ability to write patches for OpenStack in Python or workload manifests in YAML for Kubernetes, and to contribute them to the community
- Experience working with the open-source community on bug fixes and enhancements
- Experience supporting software-defined storage with Ceph or other cloud-based storage, as well as Kubernetes container storage solutions such as Portworx, MicroCeph, and MinIO
- Hypervisor technologies including KVM
- Ubuntu, Red Hat Enterprise Linux, and/or CentOS build, development, and operations
- Experience in building and maintaining code distribution through automated pipelines
- Experience in building cloud-native Kubernetes workloads and the pipelines that deliver them
- Experience with Ansible or Puppet for configuration management
- Software-defined networking technologies including OVS (Open vSwitch), OVN (Open Virtual Network), and NFV (Network Functions Virtualization)
- Infrastructure as Code experience – Terraform, Ansible, Git, GitLab, Helm, ArgoCD, Vault