Overview:
At SolarWinds, we’re a people-first company. Our purpose is to enrich the lives of the people we serve, including our employees, customers, shareholders, partners, and communities. Join us in our mission to help customers accelerate business transformation with simple, powerful, and secure solutions.
The ideal candidate thrives in an innovative, fast-paced environment and is collaborative, accountable, ready, and empathetic. We’re looking for individuals who believe they can accomplish more as a team and create lasting growth for themselves and others. We hire based on attitude, competency, and commitment. Solarians are ready to advance our world-class solutions in a fast-paced environment and accept the challenge to lead with purpose. If you’re looking to build your career with an exceptional team, you’ve come to the right place. Join SolarWinds and grow with us.
Responsibilities:
Your Role
As part of our SaaS team, you will develop and maintain “DevOps” systems for our suite of cloud-based solutions running on AWS. Keeping our services available, accessible, fast, and working correctly is a top priority. You will help our engineers automate everything possible, and you will keep the feedback loop between production and development tight so that developers are responsible and accountable for how their code runs.
Measurement
Own the processes, techniques, and systems for measuring the reliability and performance of our services.
Create accurate, continuous measurements of availability, latency, performance, failure rate, and capacity as key indicators for the organization.
Create and maintain dashboards that track these key indicators for all internal stakeholders.
Support the extension of these key indicators to external stakeholders as our Service Level Agreements.
Develop an extended suite of metrics that provides DevOps teams with the supporting details needed to operate our services to these key indicators.
Analyze metrics for predictive trends and generate alerts for metrics and trends that exceed thresholds.
Systems and Tools Development
Develop software systems and tools to promote reliable, consistent, correct, and high-performing operation of our services, employing automation wherever practical and beneficial.
Own the Continuous Integration/Continuous Deployment pipeline and the deployment process in general, treating developers as the customer. Balance reliability and agility.
Implement processes and mechanisms to capture and document all changes to the service and infrastructure as a source of truth for operations troubleshooting and compliance.
Initiate routine testing of failure recovery mechanisms (e.g., Chaos Monkey, game days) on a small scale within the first 90 days and expand it to the entire system within the first year.
Service Performance and Reliability
Evangelize Performance and Reliability across product development.
Identify negative habits and trends and work with development managers and teams to eradicate them.
Work inside and alongside development teams in sprints to improve the performance and reliability of services.
Find and fix significant vulnerabilities and inefficiencies, single points of failure, and other weaknesses in our software by any means, but particularly by critical review of everyday pull requests.
Maintain a prioritized backlog of specific issues and improvements that can’t be corrected immediately. Campaign to keep the backlog in check.
Position Specific Standards:
Must collaborate with peers and managers to achieve the goals of the position.
Must maintain an appropriate working environment in which the ability to concentrate and to hold video/audio calls is not hindered.
Experience delivering multiple competing projects with numerous stakeholders
Experience with long-term capacity planning and COGS evaluation/implementation for teams
Qualifications:
5+ years of experience working with Linux in a role with Site Reliability Engineer responsibilities
2+ years of experience working in a leadership role as an engineer
Solid understanding of automation principles and programming experience using languages and tools such as Python and Ansible
Experience working with Kubernetes/EKS
Experience utilizing automation tools like Terraform
Experience with CI/CD tooling
Strong understanding of Security, Monitoring, and Performance aspects of cloud-native platform and application architectures
Proven track record of delivering high-quality, consistent systems and environments for the development team
Ability to multi-task and self-manage work
Strong written, verbal, and presentation skills