Reports to: VP Software Engineering
Work location: Anywhere
Direct reports: 2
We are hiring a Team Lead over Systems Engineering and Operations to help expand our platform and operational excellence. We invite you to join our small, fully remote team of developers and operators working to make our platform faster, more secure, and more reliable. You will need to be self-motivated and disciplined to work with our fully distributed team.
We are looking for someone who is a quick study, who is eager to learn and grow with us, and who has experience in DevOps and Agile cultures. At Crunch, we believe in learning together: we recognize that we don’t have all the answers, and we try to ask each other the right questions. As Crunch employees are completely distributed, it’s crucial that you can work well independently, and keep yourself motivated and focused.
We currently run our in-house production Python code against Redis, MongoDB, and Elasticsearch services. We proxy API requests through NGINX, load balance with ELBs, and deploy our React web application to the AWS CloudFront CDN. Our current CI/CD process is built around GitHub, Jenkins, and Blue Ocean, and includes unit, integration, and end-to-end tests as well as automated system deployments. We deploy to Auto Scaling groups using Ansible and cloud-init.
In the future, all or part of our platform may be deployed via DroneCI, Kubernetes, NGINX Ingress, Helm, and Spinnaker.
What you’ll do
As a manager:
- Manage and lead a team of Cloud Operations Engineers who are responsible for meeting our uptime guarantees to our customer base.
- Scale the worldwide Cloud Operations Engineering team with the strategic implementation of new processes and tools.
- Hire and ramp exceptional Cloud Operations Engineers.
- Assist in scoping, designing and deploying systems that reduce Mean Time to Resolve for customer incidents.
- Inform executive leadership and escalation management personnel of major outages.
- Compile and report KPIs across the full company.
- Work with Sales Engineers to complete pre-sales questionnaires, and to gather customer use metrics.
- Prioritize projects competing for human and computational resources to achieve organizational goals.
As an individual contributor:
- Monitor and detect emerging customer-facing incidents on the Crunch platform; assist in their proactive resolution, and work to prevent them from occurring.
- Coordinate and participate in a weekly on-call rotation, where you will handle short-term customer incidents (from direct surveillance or through alerts via our Technical Services Engineers).
- Diagnose live incidents, differentiate between platform issues and usage issues across the entire stack (hardware, software, application, and network) within physical datacenter and cloud-based environments, and take the first steps towards resolution.
- Automate routine monitoring and troubleshooting tasks.
- Cooperate with our product management and engineering organizations by identifying areas for improvement in the management of applications powering the Crunch infrastructure.
- Provide consistent, high-quality feedback and recommendations to our product managers and development teams regarding product defects or recurring performance issues.
- Be the owner of our platform. This includes everything from our cloud provider implementation to how we build, deploy and instrument our systems.
- Drive improvements and advancements to the platform in areas such as container orchestration, service mesh, and request/retry strategies.
- Build frameworks and tools that empower safe, developer-led changes, automate manual steps, and provide insight into our complex system.
- Work directly with software engineering and infrastructure leadership to enhance the performance, scalability, and observability of multiple applications, ensure that production handoff requirements are met, and escalate issues when needed.
- Embed into SRE projects to stay close to the operational workflows and issues.
- Evangelize the adoption of best practices in relation to performance and reliability across the organization.
- Provide a solid operational foundation for building and maintaining successful SRE teams and processes.
- Maintain project and operational workload statistics.
- Promote a healthy and functional work environment.
- Work with Security experts to do periodic penetration testing, and drive resolution for any issues discovered.
- Liaise with IT and Security Team Leads to successfully complete cross-team projects, filling in for these Leads when necessary.
- Administer a large portfolio of SaaS tools used throughout the company.
Basic qualifications
- At least 2 years of experience leading an on-call DevOps, SRE, or Cloud Operations team.
- Experience recruiting, mentoring, and promoting high performing team members.
- At least 2 years of experience as an on-call DevOps, SRE, or Cloud Operations engineer.
- Proven track record of designing, building, sizing, optimizing, and maintaining cloud infrastructure.
- Proven experience developing software, CI/CD pipelines, automation, and managing production infrastructure in AWS.
- Proven track record of designing, implementing, and maintaining full CI/CD pipelines in a cloud environment (Jenkins experience preferred).
- Experience with containers and container orchestration tools (Docker, Kubernetes, Helm, Traefik, NGINX Ingress, and Spinnaker experience preferred).
- Expertise with Linux system administration (5 years) and networking technologies including IPv6.
- Knowledgeable about a wide range of web and internet technologies.
- Knowledge of NoSQL database operations and concepts.
- Experience in monitoring, system performance data collection and analysis, and reporting.
- Ability to write small programs and scripts, both to solve short-term systems problems and to automate repetitive workflows (Python and Bash preferred).
- Exceptional English communication and troubleshooting skills.
- A keen interest in learning new things.
Advanced (preferred) qualifications
- Bachelor’s degree in Statistics, Science, Programming, or an Engineering-related field
- Experience with MongoDB, Elasticsearch, and Redis
- AWS Management
Crunch.io is a market-defining company in the analytics SaaS marketplace. We’re a company on the rise. We’ve built a revolutionary platform that transforms our customers’ ability to drive insight from market research and survey data. We offer a complete survey data analysis platform that allows market researchers, analysts, and marketers to collaborate in a secure, cloud-based environment, using a simple, intuitive drag-and-drop interface to prepare, analyze, visualize and deliver survey data and analysis. Quite simply, Crunch provides the quickest and easiest way for anyone, from CMO to PhD, with zero training, to analyze survey data. Users create tables, charts, graphs and maps. They filter, and slice-and-dice survey data directly in their browser.
Our start-up culture is casual, respectful of each other’s varied backgrounds and lives, and high-energy because of our shared dedication to our product and our mission. We are loyal to each other and our company. We value work/life balance, efficiency, simplicity, and fantastic customer service! Crunch has no offices and fully embraces a 100% remote culture. We have 40 employees spread across 5 continents. Remote work at Crunch is flexible and largely independent, yet highly cooperative.
We are an equal-opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status.
Learn more about our team!