Lead ML Systems Engineer
Twelve Labs
You will:
- Prioritize the team’s work in building and improving our production machine learning systems for video foundation models and video-language models (VFM & VLM), in collaboration with senior engineers and other stakeholders
- Inference Infrastructure: Build a performant, scalable, and reliable inference engine optimized for Twelve Labs’ video foundation and language models.
- ML Deployment & Operations (VFMOps / VLMOps): Lead the effort to serve models efficiently, deploy the pipeline, and automate the process from model training through deployment.
- Data: Oversee the data infrastructure and preparation of high-quality video data for our training runs.
- Design processes (e.g. postmortem review, incident response, on-call rotations) that help the team operate effectively
- Coach and develop your reports: help them decide how they would like to advance in their careers, and support them in getting there
- Run the team’s recruiting efforts through a period of rapid growth
You may be a good fit if you have:
- 10+ years of software development experience, including experience in machine learning engineering
- 5+ years of experience in building end-to-end machine learning systems encompassing infrastructure, MLOps, and data management
- Experience working with engineers at different levels and coaching them in their career development
- 2+ years of experience managing high-output engineering teams
- Proficiency in video processing and data pipelines
- Experience in establishing and maintaining secure software and system development environments
Desired Experience:
- MS or PhD in Computer Science, Math, or equivalent real-world experience
- Fast-paced startup engineering experience
- Experience working with large-scale models
- Experience working in both cloud and on-premises environments
- ML research experience would be helpful, as this role requires moving between research work and software engineering work
- Experience handling large-scale computing systems and a firm understanding of scale-up and scale-out approaches in cloud environments
Relevant Tech Stack:
- Languages: Python, Golang, C++, CUDA
- ML / Platform: PyTorch, Docker, Kubernetes, Terraform
- ML demo pages: Gradio, Streamlit
- MLOps: MLflow, Weights & Biases
- Data: Pachyderm, DVC
- Automation: Airflow, Kubeflow
- Model serving: Triton, FasterTransformer