Senior Site Reliability Engineer

Datum Technologies Group • Atlanta, GA, US • 1m ago

Long term contract

Atlanta, GA

Qualifications:

Manage and optimize data streaming and API components in OpenShift (on-premises) and AWS.
Review and optimize application APIs and processes to enhance response times across various components.
Automate testing processes, including data quality checks, production delivery, and deployment for production environments.
Develop integrations between on-premises applications, AWS, and third-party tools (ServiceNow, VersionOne, Sumo Logic).
Collaborate with teams to define Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
Lead performance monitoring and troubleshooting of platform applications, identifying root causes and documenting solutions.
Evolve cloud infrastructure for the application suite by experimenting with new technologies and completing prototypes to assess benefits.
Design and develop CI/CD pipelines to deploy application artifacts, including APIs and data process jobs.
Configure and implement monitoring and alerting metrics to enable proactive issue detection by support teams.
Maintain data integrity and access control using AWS security tools and services such as HSM and IAM.
Develop and monitor AWS billing tools, generate cost reports, and implement cost optimization strategies.
Work with security architects to design and implement data security tools, encryption, and key management.
Address security vulnerabilities identified by audits and the wider security community, and develop solutions that let support teams regularly scan for and resolve issues.
Monitor and analyze platform capacity and performance, collaborating with architecture teams to design elastic infrastructure for irregular traffic bursts.
Contribute to the design and implementation of backup strategies for service restoration and disaster recovery.
Provide continuous input to architecture, infrastructure, and application teams to improve design, performance, and security.
BS in Computer Science or a related technical field, or equivalent practical experience.

Desired Skillset:

Strong expertise in AWS cloud platforms.
Proficiency in automation, scripting, and monitoring tools, including OpenShift, CloudFormation, Terraform, Ansible, Shell, and Python.
In-depth knowledge of infrastructure layers such as Linux OS, virtualization platforms, software-defined networking, load balancers, firewalls, API tools, monitoring tools, and storage/backup strategies.
Extensive experience with enterprise systems and mission-critical application operations, including issue resolution.
Experience automating and operationalizing Development/QA using CI/CD tools such as GitLab, GitHub, Jenkins, Maven, Gradle, and Nexus.
Working experience in Software Release Management.

Minimum Experience:

3+ years in DevOps or SysOps engineering, focusing on major cloud platforms (preferably AWS).
2+ years of application development, including data streaming and deployment of high-availability critical application components.
1+ year in a Site Reliability Engineering (SRE) role preferred.
Overall 7+ years of professional experience.

Responsibilities

As a Site Reliability Engineer (SRE) with expertise in AWS cloud infrastructure and application monitoring, you will ensure the reliability, scalability, and performance of our cloud-based systems and applications. To succeed in these responsibilities, candidates should bring:

Proven experience as an SRE or in a similar role, focusing on AWS cloud infrastructure.
Deep understanding of AWS services (Lambda, S3, SQS, IAM, Route 53, etc.) and proficiency in Infrastructure as Code (Terraform, CloudFormation).
Hands-on experience with monitoring tools (CloudWatch, Sumo Logic, Dynatrace, Grafana) for performance monitoring and alerting.
Proficiency in scripting and automation (Python, Bash) for building and maintaining deployment pipelines and infrastructure.
Strong analytical and troubleshooting skills to diagnose and resolve complex infrastructure, application, and data issues.
Experience with containerization technologies (Docker, Kubernetes) and serverless architectures (AWS Lambda).
Familiarity with CI/CD pipelines and version control systems (Git) for continuous integration and deployment.

As a Lead Engineer with the Retail Site Reliability Engineering team, you will lead Cloud and Big Data technology efforts. This role positions you as a technical leader and exposes you to industry-leading technologies. You will contribute to a growing ecosystem of services and features, support business-critical applications, and provide escalation for complex issues in on-premises and AWS environments. The ideal candidate is well-versed in DevOps technologies, automation, infrastructure orchestration, configuration management, and continuous integration, and is not bound by conventional approaches.

"All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or status as a protected veteran."