Reports to: Team Lead, Systems Engineering and Operations
Work location: USA or Canada, 100% remote
Direct reports: 0
We are hiring an experienced Site Reliability Engineer to help expand our platform and operational excellence. We invite you to join our small, fully remote team of developers and operators who are making our platform faster, more secure, and more reliable. You will need to be self-motivated and disciplined to work with our fully distributed team.
We are looking for someone who is a quick study, who is eager to learn and grow with us, and who has experience in DevOps and Agile cultures. At Crunch, we believe in learning together: we recognize that we don’t have all the answers, and we try to ask each other the right questions. As Crunch employees are completely distributed, it’s crucial that you can work well independently, and keep yourself motivated and focused.
We currently run our in-house production Python code against Redis, MongoDB, and Elasticsearch services. We proxy API requests through NGINX, load balance with ELBs, and deploy our React web application to the AWS CloudFront CDN. We use EFS for persistent storage. Our current CI/CD process is built around GitHub, Jenkins, and Blue Ocean, and includes unit, integration, and end-to-end tests as well as automated system deployments. We deploy to Auto Scaling groups using Ansible and cloud-init.
In the future (and to some degree already), all or part of our platform will include Kubernetes, Helm, Flux v2, and Spinnaker.
What you’ll do
Monitor and detect emerging customer-facing incidents on the Crunch platform; assist in their proactive resolution, and work to prevent them from occurring
Coordinate and participate in a weekly on-call rotation, where you will handle short-term customer incidents (found through direct surveillance or raised as alerts by our Support Engineers)
Diagnose live incidents and differentiate between platform issues and usage issues across the entire stack (hardware, software, application, and network) in physical datacenter and cloud-based environments; take the first steps towards resolution and see the problem through to the end
Automate routine monitoring and troubleshooting tasks
Provide consistent, high-quality feedback and recommendations to our product managers and development teams regarding product defects or recurring performance issues
Drive improvements and advancements to the platform in areas such as container orchestration, service mesh, and request/retry strategies
Build frameworks and tools that empower safe, developer-led changes, automate manual steps, and provide insight into our complex system
Work directly with the team to enhance the performance, scalability, and observability of resources across multiple applications, ensure that production handoff requirements are met, and escalate issues
Embed into SRE projects to stay close to the operational workflows and issues
Evangelize the adoption of best practices in relation to performance and reliability across the organization
Maintain project and operational workload statistics
Promote a healthy and functional work environment
Work with the Team Lead and/or external security contractors to conduct periodic penetration testing, and drive resolution of any issues discovered
Administer a large portfolio of SaaS tools used throughout the company
Execute other projects assigned by the Team Lead as needed
Required qualifications
Experience as an on-call senior DevOps, SRE, or Cloud Operations engineer (at least 5 years)
Experience implementing Terraform best practices for infrastructure in AWS (at least 2 years)
Proven track record of designing, building, sizing, optimizing, and maintaining cloud infrastructure, especially in AWS
Proven experience writing automation glue code and managing production infrastructure in AWS
Proven track record of designing, implementing, and maintaining full build/release pipelines in a cloud environment (Jenkins experience preferred)
Experience with containers and container orchestration tools (Docker / Kubernetes / Helm production experience required; Spinnaker experience preferred)
Experience improving the developer experience through desktop tooling and scripts
Expertise in Linux system administration (2 years) and networking technologies (IPv6 nice to have)
Knowledge of NoSQL database operations and concepts
Experience with MongoDB, Elasticsearch, and Redis (at least 1 year)
Capability to write programs/scripts to solve both short-term systems problems and to automate repetitive workflows (Python and Bash preferred)
Exceptional English communication and troubleshooting skills
Understanding of and experience implementing security best practices in AWS, Linux, Kubernetes, and the other listed services, as well as penetration testing, internal vulnerability analysis, and incident response
Experience in monitoring, system performance data collection and analysis, and reporting
Willingness to fill in for IT at Crunch.io when necessary
A keen interest in learning new things and keeping up to date with best practices and the latest tooling
Advanced (preferred) qualifications
AWS / Terraform / Kubernetes certifications or certifications associated with similar products
Familiarity with the Agile Manifesto and Scrum / Kanban / Scrumban
Bachelor’s degree in Statistics, Science, Programming, or an Engineering-related field
Crunch.io is a market-defining company in the analytics SaaS marketplace. We’re a company on the rise. We’ve built a revolutionary platform that transforms our customers’ ability to drive insight from market research and survey data. We offer a complete survey data analysis platform that allows market researchers, analysts, and marketers to collaborate in a secure, cloud-based environment, using a simple, intuitive drag-and-drop interface to prepare, analyze, visualize, and deliver survey data and analysis. Quite simply, Crunch provides the quickest and easiest way for anyone, from CMO to PhD, with zero training, to analyze survey data. Users create tables, charts, graphs, and maps. They filter and slice-and-dice survey data directly in their browser.
Our start-up culture is casual, respectful of each other’s varied backgrounds and lives, and high-energy because of our shared dedication to our product and our mission. We are loyal to each other and our company. We value work/life balance, efficiency, simplicity, and fantastic customer service! Crunch has no offices and fully embraces a 100% remote culture. We have 40 employees spread across 5 continents. Remote work at Crunch is flexible and largely independent, yet highly cooperative.
We are an equal-opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status.