Senior MLOps Engineer — Open Data Science

Brief description of the vacancy

We are seeking a Senior MLOps Engineer with proven experience in deploying and managing large-scale ML infrastructure for LLMs, TTS, STT, Stable Diffusion, and other GPU-intensive models in production. You will lead the design and operation of cost-efficient, high-availability, and high-performance serving stacks in a Kubernetes-based AWS environment.

About the company

Company: Identity AI Labs

A fast-growing, well-funded AI startup in the UAE. The company's mission is to redefine how humans interact with AI through emotionally intelligent, relationship-focused technology. 🚀

Responsibilities

You will architect, deploy, and maintain scalable ML infrastructure on AWS EKS using Terraform and Helm.
You will own end-to-end model deployment pipelines for LLMs, diffusion models (LDM/Stable Diffusion), and other generative AI models requiring high GPU throughput.
You will design cost-effective, auto-scaling serving systems using tools like Triton Inference Server, vLLM, Ray Serve, or similar model-serving frameworks.
You will build and maintain CI/CD pipelines integrating the ML model lifecycle (training → validation → packaging → deployment).
You will optimize GPU resource utilization and implement job orchestration with frameworks like KServe or Kubeflow, or with custom workloads on EKS.
You will deploy and manage FluxCD (or ArgoCD) for GitOps-based deployment and environment promotion.
You will implement robust monitoring, logging, and alerting for model health and infrastructure performance (Prometheus, Grafana, Loki).
You will collaborate closely with ML Engineers and Software Engineers to ensure smooth integration, observability, and feedback loops.

Requirements

2–3 years of experience with model serving frameworks such as Triton, vLLM, Ray Serve, TorchServe, or similar.
2–3 years of experience deploying and optimizing LLMs and LDMs (e.g., Stable Diffusion) under high load with GPU-aware scaling.
3–4 years of experience with Kubernetes (EKS) and infrastructure-as-code (Terraform, Helm).
4–5 years of hands-on software engineering experience in Python, with production-grade experience in the ML model lifecycle.
Nice to have: familiarity with Go or Rust for backend or performance-critical systems.
Fluent English.

Working conditions

Full-time position in the Dubai office, with official employment and a full relocation package.
