Reports to: VP Software Engineering
Work location: Permanently 100% remote, within Western Hemisphere time zones and working hours
Direct reports: 3-5
We are hiring a Team Lead over Systems Engineering and Operations to help expand our platform and operational excellence. We are inviting you to join our small, fully remote team of developers and operators helping make our platform faster, more secure, and more reliable. You will need to be self-motivated and disciplined to work with our fully distributed team.
We are looking for someone who is a quick study, who is eager to learn and grow with us, and who has extensive experience in DevOps, Terraform, CI/CD, and AWS. At Crunch, we believe in learning together: we recognize that we don’t have all the answers, and we try to ask each other the right questions. As Crunch employees are completely distributed, it’s crucial that you can work well independently and keep yourself motivated and focused.
We currently run our in-house production Python code against Redis, MongoDB, and Elasticsearch services. We proxy API requests through NGINX, load balance with ELBs, and deploy our React web application to the AWS CloudFront CDN. Our current CI/CD process is built around GitHub, Jenkins, Blue Ocean, and Kubernetes, and includes unit, integration, and end-to-end tests and automated system deployments. We deploy to Auto Scaling groups using Ansible, cloud-init, and Terraform.
In the future (and to some degree currently), all or part of our platform will include Kubernetes, Helm, FluxV2, and Spinnaker.
What you’ll do
- Manage and lead a team of SREs who are tasked with meeting our uptime guarantees to our customer base.
- Scale the SRE team with the strategic implementation of new processes and tools.
- Hire, onboard, train, and mentor exceptional SREs.
- Assist in scoping, designing and deploying systems that reduce Mean Time to Resolve for customer incidents.
- Inform executive leadership and escalation management personnel of major outages.
- Compile and report KPIs across the full company.
- Work with Sales Engineers to complete pre-sales questionnaires, and to gather customer use metrics.
- Prioritize projects competing for human and computational resources to achieve organizational goals.
- Monitor and detect emerging customer-facing incidents on the Crunch platform; assist in their proactive resolution, and work to prevent them from occurring.
- Coordinate and participate in a weekly on-call rotation, where you will handle short-term customer incidents (from direct surveillance or through alerts via our Support Engineers).
- Diagnose live incidents; differentiate between platform issues and usage issues across the entire stack (hardware, software, application, and network) in physical datacenter and cloud-based environments; take the first steps towards resolution; and see the problem through to resolution.
- Automate routine monitoring and troubleshooting tasks.
- Provide consistent, high-quality feedback and recommendations to our product managers and development teams regarding product defects or recurring performance issues.
- Drive improvements and advancements to the platform in areas such as container orchestration, service mesh, and request/retry strategies.
- Build frameworks and tools that empower safe, developer-led changes, automate manual steps, and provide insight into our complex system.
- Work directly with the team to enhance the performance, scalability, and observability of resources across multiple applications, ensure that production handoff requirements are met, and escalate issues as needed.
- Embed into SRE projects to stay close to the operational workflows and issues.
- Evangelize the adoption of best practices in relation to performance and reliability across the organization.
- Maintain project and operational workload statistics.
- Promote a healthy, functional, uplifting, and fun work environment.
- Work with contractors to do periodic penetration testing, and drive resolution for any issues discovered.
- Administer a large portfolio of SaaS tools used throughout the company.
- Execute other projects assigned by the VP Software Engineering as needed.
Required qualifications
- Experience leading an on-call DevOps, SRE, or Cloud Operations team.
- Experience recruiting, mentoring, and promoting high performing team members.
- Experience as a senior on-call DevOps, SRE, or Cloud Operations engineer.
- Experience implementing Terraform best practices for infrastructure in AWS.
- Proven track record of designing, building, sizing, optimizing, and maintaining cloud infrastructure, especially in AWS.
- Experience with containers and container orchestration tools (Docker / Kubernetes / Helm production experience required; Spinnaker experience preferred).
- Experience improving the developer experience with desktop tooling and scripts.
- Expertise with Linux system administration and networking technologies (IPv6 nice to have).
- Knowledge of NoSQL database operations and concepts.
- Experience with MongoDB, Elasticsearch, and Redis.
- Ability to write programs and scripts that solve short-term systems problems and automate repetitive workflows (Python and Bash preferred).
- Exceptional English communication and troubleshooting skills.
- Understanding of, and experience implementing, good security practices in AWS, Linux, Kubernetes, and the other listed services, as well as pen testing, internal vulnerability analysis, and incident response.
- Experience in monitoring, system performance data collection and analysis, and reporting.
- Willingness to fill in for IT when necessary for the Crunch.io organization.
- A keen interest in learning new things and keeping up to date with best practices and the latest tooling.
Advanced (preferred) qualifications
- Bachelor’s degree in a Statistics, Science, Programming, or Engineering related field
- Experience with MongoDB, Elasticsearch, and Redis
- AWS / Terraform / Kubernetes certifications or certifications associated with similar products
Crunch.io is a market-defining company in the analytics SaaS marketplace. We’re a company on the rise. We’ve built a revolutionary platform that transforms our customers’ ability to derive insight from market research and survey data. We offer a complete survey data analysis platform that allows market researchers, analysts, and marketers to collaborate in a secure, cloud-based environment, using a simple, intuitive drag-and-drop interface to prepare, analyze, visualize, and deliver survey data and analysis. Quite simply, Crunch provides the quickest and easiest way for anyone, from CMO to PhD, with zero training, to analyze survey data. Users create tables, charts, graphs, and maps. They filter and slice-and-dice survey data directly in their browser.
Our start-up culture is casual, respectful of each other’s varied backgrounds and lives, and high-energy because of our shared dedication to our product and our mission. We are loyal to each other and our company. We value work/life balance, efficiency, simplicity, and fantastic customer service! Crunch has no offices and fully embraces a 100% remote culture. We have 40 employees spread across 5 continents. Remote work at Crunch is flexible and largely independent, yet highly cooperative.
We are an equal-opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status.
Learn more about our team!