Site Reliability Engineer

Reports to: Team Lead, Systems Engineering and Operations

Work location: USA or Canada, 100% remote

Direct reports: 0

We are hiring an experienced Site Reliability Engineer to help expand our platform and strengthen our operational excellence. We invite you to join our small, fully remote team of developers and operators who are making our platform faster, more secure, and more reliable. You will need to be self-motivated and disciplined to work well with our fully distributed team.

We are looking for someone who is a quick study, who is eager to learn and grow with us, and who has experience in DevOps and Agile cultures. At Crunch, we believe in learning together: we recognize that we don’t have all the answers, and we try to ask each other the right questions. As Crunch employees are completely distributed, it’s crucial that you can work well independently, and keep yourself motivated and focused.

Our stack

We currently run our in-house production Python code against Redis, MongoDB, and Elasticsearch services. We proxy API requests through NGINX, load balance with ELBs, and deploy our React web application to the AWS CloudFront CDN. We use EFS for persistent storage. Our current CI/CD process is built around GitHub, Jenkins, and Blue Ocean, and includes unit, integration, and end-to-end tests as well as automated system deployments. We deploy to Auto Scaling groups using Ansible and cloud-init.

In the future (and to some degree already), all or part of our platform will include Kubernetes, Helm, Flux v2, and Spinnaker.

What you’ll do

  • Monitor and detect emerging customer-facing incidents on the Crunch platform; assist in their proactive resolution, and work to prevent them from occurring
  • Coordinate and participate in a weekly on-call rotation, where you will handle short-term customer incidents (detected through direct monitoring or raised through alerts from our Support Engineers)
  • Diagnose live incidents; differentiate between platform issues and usage issues across the entire stack (hardware, software, application, and network, in both physical datacenter and cloud-based environments); take the first steps toward a fix; and see the problem through to resolution
  • Automate routine monitoring and troubleshooting tasks
  • Provide consistent, high-quality feedback and recommendations to our product managers and development teams regarding product defects or recurring performance issues
  • Drive improvements and advancements to the platform in areas such as container orchestration, service mesh, and request/retry strategies
  • Build frameworks and tools to empower safe, developer-led changes, automate manual steps, and provide insight into our complex system
  • Work directly with the team to enhance the performance, scalability, and observability of resources across multiple applications; ensure that production handoff requirements are met; and escalate issues as needed
  • Embed into SRE projects to stay close to the operational workflows and issues
  • Evangelize the adoption of best practices in relation to performance and reliability across the organization
  • Maintain project and operational workload statistics
  • Promote a healthy and functional work environment
  • Work with the Team Lead and/or external security contractors to conduct periodic penetration testing, and drive resolution of any issues discovered
  • Administer a large portfolio of SaaS tools used throughout the company
  • Execute other projects from the Team Lead as needed

Basic qualifications

  • Experience as an on-call senior DevOps, SRE, or Cloud Operations engineer (at least 5 years)
  • Experience implementing Terraform best practices for infrastructure in AWS (at least 2 years)
  • Proven track record of designing, building, sizing, optimizing, and maintaining cloud infrastructure, especially in AWS
  • Proven experience writing automation glue code and managing production infrastructure in AWS
  • Proven track record of designing, implementing, and maintaining full build/release pipelines in a cloud environment (Jenkins experience preferred)
  • Experience with containers and container orchestration tools (Docker / Kubernetes / Helm production experience required; Spinnaker experience preferred)
  • Experience improving the developer experience through desktop tooling and scripts
  • Expertise with Linux system administration (at least 2 years) and networking technologies (IPv6 nice to have)
  • Knowledge of NoSQL database operations and concepts
  • Experience with MongoDB, Elasticsearch, and Redis (at least 1 year)
  • Capability to write programs/scripts to solve both short-term systems problems and to automate repetitive workflows (Python and Bash preferred)
  • Exceptional English communication and troubleshooting skills
  • Understanding of, and experience implementing, security best practices in AWS / Linux / Kubernetes and the other listed services, as well as pen testing, internal vulnerability analysis, and incident response
  • Experience in monitoring, system performance data collection and analysis, and reporting
  • Willingness to fill in for IT for the Crunch.io organization when necessary
  • A keen interest in learning new things and keeping up to date with best practices and the latest tooling and methods

Advanced (preferred) qualifications

  • AWS / Terraform / Kubernetes certifications or certifications associated with similar products
  • Familiarity with the Agile Manifesto and Scrum / Kanban / Scrumban
  • Software development experience using Python or JavaScript
  • Bachelor’s degree in Statistics, Science, Programming, or an Engineering-related field

About Crunch

Crunch.io is a market-defining company in the analytics SaaS marketplace. We’re a company on the rise. We’ve built a revolutionary platform that transforms our customers’ ability to derive insight from market research and survey data. We offer a complete survey data analysis platform that allows market researchers, analysts, and marketers to collaborate in a secure, cloud-based environment, using a simple, intuitive drag-and-drop interface to prepare, analyze, visualize, and deliver survey data and analysis. Quite simply, Crunch provides the quickest and easiest way for anyone, from CMO to PhD, with zero training, to analyze survey data. Users create tables, charts, graphs, and maps. They filter and slice-and-dice survey data directly in their browser.

Our start-up culture is casual, respectful of each other’s varied backgrounds and lives, and high-energy because of our shared dedication to our product and our mission. We are loyal to each other and our company. We value work/life balance, efficiency, simplicity, and fantastic customer service! Crunch has no offices and fully embraces a 100% remote culture. We have 40 employees spread across 5 continents. Remote work at Crunch is flexible and largely independent, yet highly cooperative.

We are an equal-opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status.

Learn more about our team!