Lead Engineer
OneDeal
What you'll be doing
- Leading
- Driving initiatives to improve reliability, performance and security
- Educating the team so that they have the knowledge and skills to deliver reliable, performant and secure systems
- Ensuring we're improving continuously and learning from our incidents and near-misses
- Communicating the strategy and vision to the team to ensure alignment
- Infrastructure Engineering & Architecture
- Designs, builds and provisions new infrastructure to support new products and initiatives and to solve problems
- Uses software development skills to create software that integrates infrastructure with services and APIs
- Mentoring
- Reviews code quality and coaches engineers directly
- Shares insights and feedback with the Engineering Manager
- Availability
- Works to increase the mean-time-between-failures and decrease the mean-time-to-repair of public-facing systems
- Monitoring
- Contributes to the improvement of the monitoring and measurement systems that support our operational scale and continuous delivery
- Operations
- Helps engineering teams to operate the systems required to deliver their products
- Thinks ahead to solve operational problems before they become critical
- Emergency Response
- Takes part in the roster to support the site after normal office hours
- Troubleshoots live production issues
- Facilitates the response to emergency situations
- Reviews incidents and makes recommendations based on lessons learned
- Performance, Efficiency & Latency
- Contributes to the measurement techniques that assist in the performance tuning of the application stack
- Uses the monitoring systems to help maintain application performance at acceptable levels
- Recommends and implements performance improvements across the stack
- Security & Risk
- Participates in the ongoing process to identify and mitigate risk in Envato systems
- Change Management
- Participates in the development and maintenance of our applications and infrastructure
- Provides support and advice to engineering teams delivering changes to production
- Capacity Planning
- Uses our monitoring to advise on capacity requirements
About you
- Essential
- Has automated failover scenarios
- A commitment to continual learning
- Has provided a positive contribution to both operations-focused and development-focused work
- Contributes positively to a team through both synchronous and asynchronous communication
- Has built and maintained cloud-based applications and infrastructure
- Demonstrable knowledge of caching techniques
- Has worked with CDN providers and/or DDoS mitigation services to improve service performance and reliability
- Has worked in a culture of shared responsibility between software developers and infrastructure specialists
- Linux administration
- Monitoring and logging tools
- Passion for, and experience in, best-practice systems operations tools and techniques
- Public and private cloud-hosted infrastructure
- SQL database management
- Supporting a large web-based application with global traffic
- Has worked with tools and frameworks for automating infrastructure
- Preferred
- Has built self-healing systems
- Has automated testing of failure scenarios
- Has measurably improved the resilience of systems
- Experience with AWS
- Experience with Cloudflare
- Experience with tools including Datadog, Rollbar and Sumo Logic
- Experience in other agile environments
- Application development experience
- MySQL performance tuning and troubleshooting expertise (Aurora experience is a bonus)
- Security and risk identification, assessment and mitigation
- The ability to assess security and risk for a large e-commerce site and recommend actions
- Experience with CloudFormation, Terraform and the AWS CDK