Site Reliability Engineer
Below is a brief job description. Please inquire for more detail. This is an amazing company with an outstanding culture!
The SRE is responsible for the full system life cycle, which includes infrastructure provisioning, system configuration, deployments, monitoring, and incident response in production environments. The SRE uses analysis to assess the availability, latency, scalability, and efficiency of a product or infrastructure and builds reliability into systems.
● Design and build out our cloud infrastructure.
● Participate in software and system performance analysis, tuning, and
service capacity planning.
● Manage the availability, scalability, security, and performance of our
platform and applications.
● Diagnose bottlenecks across the full stack and recommend interim
workarounds while long-term solutions are investigated.
● Periodically assess all monitoring requirements and implement
enhancements to meet or exceed changing business needs.
● Proactively review, recommend, and implement changes to the live
infrastructure once they have been properly validated.
● Use data analysis to pick up trends before they become major problems.
● 5+ years of experience operating high-traffic SaaS environments.
● Expertise in the mentality, processes, and tools needed to deliver five nines (99.999% availability).
● Skills to build a fully automated, highly elastic cloud orchestration framework on AWS.
● Strong working knowledge of Linux and its underlying components, system statistics, performance tuning, filesystems, and I/O.
● Solid scripting skills.
● Development experience.
● Experience with continuous integration frameworks.
● Experience with performance diagnostics, performance tuning, capacity planning, and monitoring.
AWS, Rackspace, Ansible, Terraform, MySQL, Nginx, Elasticsearch, Memcached, RabbitMQ, Jenkins, Git, Bash, Python, PHP, Java, Ruby, Nessus, Nagios, Sumologic, NewRelic, PagerDuty