Senior Platform & DevOps Engineer (GCP/K8s)

Cooklist

Software Engineering
Austin, TX, USA · Remote
Posted on Oct 17, 2025

About Cooklist

Cooklist is the AI grocery‑intelligence platform powering meal planning and shopping for millions of consumers across our consumer app and white‑label enterprise suite. Our mission is to combine the intelligence of a personal shopper, chef, and nutritionist to help people save time, eat better, and enjoy happier lives.

We’re profitable, process billions of dollars in transactions at the nation’s largest retailers, and our mobile experiences reach millions of people. We’re backed by Techstars, Mercury Fund, and industry leaders including the former CTO of Kroger and the Chief Product Officer of Amazon Fresh.

Role Overview

We’re hiring a Senior Platform & DevOps Engineer to own the infrastructure Cooklist runs on (GCP, GKE, CI/CD, observability). You’ll run our internal platform end-to-end: designing and securing the infrastructure, streamlining CI/CD and automation, and building the internal tooling that keeps Cooklist and retailer apps fast, safe, and reliable at scale.

This is a high‑ownership role spanning architecture, reliability, security, and speed. You’ll collaborate tightly with mobile and AI teams to institute patterns, tests, and tooling that let AI coding assistants safely accelerate development.

You’ll also lead our pen‑test readiness and security hardening program to meet the high bar required by enterprise retailers (SOC 2‑aligned controls, OWASP ASVS/CIS benchmarks), partnering with leadership to produce audit‑ready evidence and close the loop on findings quickly.

Responsibilities

  • GCP/GKE operations: Design clusters, autoscaling, and progressive delivery (blue/green, canary, shadow) via GitHub Actions/Cloud Build + GitOps.

  • Observability & SRE: Define SLOs/SLIs, instrument dashboards across services, lead on‑call and incidents with blameless postmortems.

  • Platform security: Enforce least‑privilege IAM, KMS‑managed secrets, NetworkPolicies/Pod Security, image provenance/scanning, Cloud Armor/WAF, and TLS 1.2+ with HSTS and modern ciphers.

  • Pen‑test readiness: Scope third‑party tests, triage & remediate findings to SLAs, and deliver closure evidence to retailer partners.

  • SOC 2 compliance: Drive readiness through to audit, including gap assessment, controls/policies, evidence automation (Vanta/Drata/Tugboat), governance ops, and auditor coordination.

  • Secure SDLC: Implement SAST/DAST, dependency/container/IaC scanning, SBOMs, and CI/CD release gates; maintain a risk register and minimize MTTR for High vulns.

  • Reliability & cost: Ensure backups/DR (RPO/RTO), safe schema/backfills, feature flags/kill‑switches, and capacity/cost optimization across compute/storage/network.

Qualifications

  • 5+ years building and operating Python backends for high‑traffic production systems; Django experience is a strong plus (not a hard requirement).

  • Strong GCP + Kubernetes (GKE) chops: containerization, autoscaling, rollout strategies, metrics/alerts, and cost management.

  • Proficiency with CI/CD (GitHub Actions/Cloud Build), IaC (Terraform), and Docker.

  • Led SOC 2 programs (Type I and/or Type II) and third‑party penetration tests; practical knowledge of OWASP Top 10/ASVS and CIS Benchmarks (GCP/K8s).

  • Excellent observability instincts (metrics, logs, traces, dashboards) and operational ownership (on‑call, runbooks, postmortems).

  • Builder mindset with founder energy: bias to action, high bar for polish, comfort owning outcomes end‑to‑end.

Success Looks Like

  • 3 months: You’ve shipped a significant API and productionized an infra improvement (e.g., canary rollouts + SLO dashboards). p95 latency down on a key endpoint. External pen test passed with no Critical/High.

  • 6 months: Zero‑downtime schema evolution playbook adopted; Secure SDLC gates and vuln scanning enforced; MTTR for High vulns < 14 days; SOC 2 roadmap approved and evidence tooling live.

  • 12 months: Reliability and developer velocity measurably improved. SOC 2 Type I audit‑ready or report issued by the auditor; risk register tracked and burn‑down trending.

Our Stack

  • Language: Python/Django backend; JavaScript/React Native frontend
  • APIs: GraphQL; real‑time streaming over WebSockets
  • Data: Postgres/AlloyDB; Redis
  • Cloud/Infra: GCP, GKE (Autopilot/Standard), GitHub Actions
  • LLM engineering: internally built prompt/tool libraries, RAG pipelines & eval system

What We Offer

  • Work directly with founders and an elite, tight‑knit team
  • Ship experiences that materially improve the lives of millions
  • A high‑intensity, high‑ownership environment designed for builders
  • Competitive compensation + meaningful equity
  • Texas-based WFH