Full Job Description
About SRE:
We are looking for a Site Reliability Engineer/Sr. Site Reliability Engineer to help us build and enhance platforms to achieve availability, scalability, and operational effectiveness. The right individual will embrace the opportunity to tackle challenging problems and use their influence to drive continual improvement. You will also work on the cutting edge of technology, leveraging Kong, Repose, Docker, Mesos/Kubernetes, Jenkins, Chef, HAProxy, Nginx, GitLab, MySQL, Scylla, Aerospike, Service Mesh (Istio/Linkerd), Prometheus, etc.
Roles and Responsibilities:
Managing the availability, performance, and capacity of infrastructure and applications.
Building and implementing observability for application health, performance, and capacity.
Optimizing on-call rotations and processes.
Documenting “tribal” knowledge.
Managing Infra-platforms like
Providing help in onboarding new services with the production readiness review process.
Providing reports on service SLOs, error budgets, alerts, and operational overhead.
Working with Dev and Product teams to define SLOs, error budgets, and alerts.
Working with the Dev team to have an in-depth understanding of the application architecture and its bottlenecks.
Identifying observability gaps in product services and infrastructure, and working with stakeholders to fix them.
Managing outages, performing detailed root cause analysis (RCA) with developers, and identifying ways to prevent recurrence.
Managing/Automating upgrades of the infrastructure services.
Automating toil.
Experience & Skills:
6+ years of experience as an SRE/DevOps/Infrastructure Engineer on large-scale microservices and infrastructure.
A collaborative spirit with the ability to work across disciplines to influence, learn, and deliver.
A deep understanding of computer science, software development, and networking principles.
Demonstrated experience with programming languages such as Python, Java, and Go.
Extensive experience with Linux administration and a good understanding of the various Linux kernel subsystems (memory, storage, network, etc.).
Extensive experience in DNS, TCP/IP, UDP, gRPC, routing, and load balancing.
Expertise in GitOps, Infrastructure as Code tools such as Terraform, and configuration management tools such as Chef, Puppet, SaltStack, and Ansible.
Expertise in Amazon Web Services (AWS) and/or other relevant cloud infrastructure platforms such as Microsoft Azure or Google Cloud.
Experience in building CI/CD solutions with tools such as Jenkins, GitLab, Spinnaker, and Argo.
Experience in managing and deploying containerized environments using Docker and Mesos/Kubernetes is a plus.
Experience with multiple datastores is a plus (MySQL, PostgreSQL, Aerospike, Couchbase, Scylla, Cassandra, Elasticsearch).
Experience with data platform tech stacks such as Hadoop, Hive, and Presto is a plus.