Principal SRE position
Hackajob
Required Qualifications
- 10+ years of related experience in software development, systems engineering, and/or networking.
- Kubernetes Expertise - Deep hands-on experience managing Kubernetes clusters (AWS EKS or similar) with a focus on networking, scaling, and security. Strong troubleshooting skills across Kubernetes workloads, infrastructure, and networking.
- Infrastructure as Code & Automation - Expertise in Terraform for infrastructure as code. Proven experience with ArgoCD and GitHub Actions for GitOps workflows and CI/CD pipelines.
- Monitoring & Observability - Proficiency in Prometheus, Grafana, and incident management workflows. Experience implementing application-level monitoring and tracing to identify performance bottlenecks.
- Guardrails & System Security - Demonstrated ability to set up guardrails for databases, Kubernetes clusters, and applications to ensure reliable and secure operations.
- Cloud Expertise - Advanced knowledge of AWS services, including EKS, EC2, CloudWatch, Route 53, Aurora, and S3. Familiarity with auto-scaling, load balancing, and cloud cost optimization.
- Programming & Scripting Skills - Strong proficiency in Python, Go, or Bash for scripting and automation tasks.
- Systems Troubleshooting - Proven ability to troubleshoot complex, distributed systems across cloud infrastructure, databases, and networking.
- Experience with other cloud platforms such as GCP or Azure.
- Familiarity with logging and observability tools like ELK, Loki, or Graylog.
- Exposure to chaos engineering and resilience testing.
- Knowledge of HashiCorp Vault, SOPS, and secrets management best practices.
- Expertise in database systems, including setup, scaling, and optimization.
- Strong listening and communication skills.
- Strong coaching and mentorship capabilities.