Staff Test Engineer - AI
Outreach
Software Engineering, Data Science, Quality Assurance
Hyderabad, Telangana, India
As we scale the depth and breadth of our AI platform, quality is not an afterthought — it is foundational. We are looking for a Staff AI Test Engineer who is first and foremost an exceptional quality engineer, and who brings a genuine curiosity and working understanding of how AI and LLM-based systems behave, fail, and improve. If you are passionate about building rigorous test strategies for complex, probabilistic systems at scale, we want to talk to you.
This role requires someone who understands the unique challenges of testing AI systems: outputs are not always deterministic, correctness is often contextual, and traditional pass/fail assertions are insufficient on their own. You will design and implement evaluation frameworks that combine deterministic validation with LLM-based grading, establish quality standards for agent behavior, and partner closely with Data Science, Engineering, and Product teams to make quality a shared discipline.
You will be a senior voice in how we build, ship, and continuously improve AI products at Outreach.
Your Daily Adventures Will Include:
- Own the AI Quality Strategy. Define and lead the end-to-end testing strategy for Outreach’s GenAI platform, including agentic workflows, LLM tool calls, LangGraph orchestration, and supporting ML pipelines.
- Build Evaluation Frameworks. Design and implement evaluation systems that handle both deterministic and non-deterministic outputs — combining rule-based assertions, golden dataset testing, and LLM-as-Judge approaches to grade agent responses at scale.
- Test Agents End-to-End. Own testing across Outreach’s suite of AI agents — Revenue Agent, Research Agent, Meeting Agent, Personalisation Agent, and Ask Outreach — covering functional correctness, tool selection accuracy, context handling, and response quality.
- Partner with DS and Engineering. Work closely with Data Science, MLOps, and platform engineers to ensure testability is designed in from the start, not bolted on afterward.
- Drive CI/CD for AI. Integrate evaluation pipelines into CI/CD workflows so that regressions in agent behavior are caught before they reach production.
- Define Quality Metrics. Establish and track metrics that matter for AI systems: answer quality scores, tool invocation accuracy, hallucination rates, latency, and regression trends across model and prompt changes.
- Champion Best Practices. Define standards for AI testing across the org — including prompt regression testing, retrieval quality evaluation, and agent behavior contracts.
- Mentor and Influence. Raise the quality bar across engineering teams by mentoring engineers, reviewing designs for testability, and advocating for quality-driven development practices.
- Stay Current. Actively track developments in AI evaluation tooling, LLM benchmarking, and testing research — and bring relevant advances into our practice.
Our Vision of You:
- 7–12 years of experience in software development and/or test automation, with demonstrated experience leading quality efforts on complex, distributed systems.
- B.S. in Computer Science or a related technical field.
- Strong programming skills in Python, with experience writing reusable, maintainable test frameworks.
- Proven experience testing large-scale backend or platform systems, including microservices and API layers.
- Deep understanding of test design principles, CI/CD integration, and scalable test automation.
- Experience with test frameworks such as PyTest or equivalent.
- Solid understanding of evaluation methodologies for non-deterministic systems — including statistical assertions, behavioral testing, and regression baselines.
- Hands-on experience with Databricks for building and validating ML pipelines and data workflows.
- Experience with MLflow for experiment tracking, model versioning, and pipeline observability.
- Strong communication and collaboration skills across engineering, data science, and product functions.
- Experience testing GenAI products, LLM-based systems, or agentic AI platforms.
- Experience with prompt engineering and prompt tuning — understanding how prompt changes affect model behavior and building regression suites to catch prompt-driven regressions.
- Hands-on experience with LLM-as-Judge evaluation patterns — using LLMs to grade LLM outputs at scale.
- Familiarity with LangGraph, LangChain, or similar agent orchestration frameworks.
- Experience with ML pipelines, ML workflow tooling (e.g., MLflow, Kubeflow, Metaflow), or model evaluation workflows.
- Understanding of RAG (Retrieval-Augmented Generation) architectures and how to evaluate retrieval quality.
- Experience with cloud platforms (AWS, GCP, or Azure) and containerized environments (Docker, Kubernetes).
- Domain knowledge in sales, sales engagement, or CRM platforms (e.g., Salesforce, HubSpot, or similar) — understanding the workflows, terminology, and data that sales teams operate with.
- Prior experience contributing to AI quality strategies in a product or research environment.
Preferred Qualifications: