Site Reliability Engineer

Job Locations US-OH-Cincinnati | US-GA-Atlanta
ID
2024-1391
Position Type
Regular Full-Time

Overview

This position sits within our IT Infrastructure division, which handles the combined set of software, hardware, networks, and facilities used to develop, test, deliver, monitor, control, or support IT Services.

The Opportunity 

You will join a small (but growing) team that builds and supports multiple large-scale, distributed, fault-tolerant systems deployed to AWS, GCP, Azure, and numerous on-prem/colocation datacenters. The successful candidate will be a self-starter who demonstrates excellent communication and problem-solving skills and a strong drive for innovation. You will be responsible for providing scalable and secure solutions in a demanding 24x7 SaaS environment, and you will be attentive to the health, observability, and maintainability of our environment as well as the auxiliary components that drive our day-to-day business. You will work to proactively identify potential outages, continually iterating to make improvements. An ideal candidate has a background in both software development and systems: a solid grounding in the fundamentals of computer science as well as the systems skills necessary to understand and operate within both contexts.

Responsibilities

What You’ll Be Doing 

  • Work as part of a team that troubleshoots applications, middleware, infrastructure, networks, tools, and patching.
  • Maintain and provide updates on ongoing operations and project tasks.
  • Build enhancements within an existing software architecture and suggest improvements to the architecture. 
  • Proactively identify potential outages, continually iterating to make improvements. 
  • Strong communication and analytical skills. 
  • Thorough understanding of product development 
  • Collaborate on architectural design reviews and changes.
  • Own, define and improve metrics, KPIs, SLOs and visualizations for systems. 
  • Assist on complex or critical issues that have not yet been documented as Standard Operating Procedures (SOPs).
  • Advocate quality accountability within the organization with well-defined processes, metrics, and goals for process quality. This includes participating in effective postmortems and ensuring actions are followed up.
  • Build and maintain robust, actionable alerting and monitoring systems and workflows.
  • Influence across boundaries and at all levels of the organization.
  • Implement, maintain, and improve CI/CD processes and tools. 
  • Work closely with development teams to improve services, deployments, and releases. 
  • Troubleshoot production issues and continually document runbooks.
  • Part of an on-call rotation to address production issues. 
  • This job description in no way implies that the duties listed here are the only ones that team members can be required to perform.

Qualifications

What You Bring to the Team 

  • Experience in a similar SRE / Infrastructure role for at least 2 years
  • Working knowledge of AWS, GCP, and/or Azure; GCP experience preferred
  • Experience operating in a Linux environment (preferably CentOS) for at least 2 years
  • Experience in deployment automation with tools such as TeamCity, Octopus, Jenkins, Ansible, or Git/Bitbucket for at least 1 year
  • Experience reading and writing Python, Windows PowerShell, and Bash scripts for at least 2 years
  • Familiar with DevOps environments / Containerization (Docker, Kubernetes) 
  • Knowledge of Atlassian tool sets (Jira/Confluence/Bitbucket)
  • Familiar with working in Microsoft Teams 
  • Understanding of SAFe; experience is preferred.
  • Possess a high attention to detail and organization with the passion and ability to create order out of disorder with excellence and efficiency. 
  • A desire to automate everything. Whether that be infrastructure as code or tooling to eliminate toil, automation should be a core focus of your mindset and the elimination of repetitive tasks should be a constant desire in the role. 
  • Natural curiosity. You aren’t simply satisfied with something working; you want to know why it works and how it works.
  • A mindset of total ownership - you aren’t afraid to dig into things you’ve never worked on before, from the browser all the way to the persistence layer. You’ve got a solid foundation in debugging and can jump into any problem you’re asked to help with.
  • You have been exposed to the fundamentals of distributed computing and look for ways to make systems more resilient and self-healing, eliminating the need for human intervention as much as possible.
  • Strong communication and interpersonal skills allowing the candidate to work well in a team environment and deliver excellent customer service. 
  • The ability to convey the importance of site reliability to a wide variety of audiences, ranging from non-technical stakeholders to the most technical of engineers. Drive stakeholder buy-in on key metrics such as SLAs/SLOs for all supported systems.
  • Ability to maintain SLAs through the implementation of proactive issue detection and reporting.
  • Demonstrated experience working in large, complex systems environments. 

Physical Demands and Work Environment: 

  • The physical activities of this position include frequent sitting, telephone communication, and working on a computer for extended periods of time. Visual acuity is required to perform activities close to the eyes.
  • This position is fully remote with only occasional travel to the office for team meetings and events. Team members are expected to have an established workspace.  
  • Ability to work remotely in the United States or Canada. 

E-Verify Statement 


ConstructConnect utilizes the E-Verify program with every potential new hire. This makes it possible for us to make certain that every employee who works for ConstructConnect is eligible to work in the United States. To learn more about E-Verify you can call 1-800-255-7688 or visit their website. E-Verify® is a registered trademark of the United States Department of Homeland Security. 
