Senior MLOps Engineer

Dubai
from $5,000/month
Office
Full-time

LLM LDM Kubernetes Python Ray Serve TorchServe

Brief description of the vacancy

We are seeking a Senior MLOps Engineer with proven experience in deploying and managing large-scale ML infrastructure for LLMs, TTS, STT, Stable Diffusion, and other GPU-intensive models in production. You will lead the design and operation of cost-efficient, high-availability, and high-performance serving stacks in a Kubernetes-based AWS environment.

About the company

Company: Identity AI Labs

A fast-growing and well-funded AI startup in the UAE. The company's mission is to redefine how humans interact with AI through emotionally intelligent, relationship-focused technology. 🚀

Responsibilities

  • You will architect, deploy, and maintain scalable ML infrastructure on AWS EKS using Terraform and Helm.
  • You will own end-to-end model deployment pipelines for LLMs, diffusion models (LDM/Stable Diffusion), and other generative/AI models requiring high GPU throughput.
  • You will design cost-effective, auto-scaling serving systems using tools like Triton Inference Server, vLLM, Ray Serve, or similar model-serving frameworks.
  • You will build and maintain CI/CD pipelines integrating the ML model lifecycle (training → validation → packaging → deployment).
  • You will optimize GPU resource utilization and implement job orchestration with frameworks like KServe, Kubeflow, or custom workloads on EKS.
  • You will deploy and manage FluxCD (or ArgoCD) for GitOps-based deployment and environment promotion.
  • You will implement robust monitoring, logging, and alerting for model health and infrastructure performance (Prometheus, Grafana, Loki).
  • You will collaborate closely with ML Engineers and Software Engineers to ensure smooth integration, observability, and feedback loops.

Requirements

  • 2–3 years of experience with model serving frameworks such as Triton, vLLM, Ray Serve, TorchServe, or similar.
  • 2–3 years of experience deploying and optimizing LLMs and LDMs (e.g., Stable Diffusion) under high load with GPU-aware scaling.
  • 3–4 years of experience with Kubernetes (EKS) and infrastructure-as-code (Terraform, Helm).
  • 4–5 years of hands-on software engineering experience in Python, with production-grade experience in the ML model lifecycle.
  • Nice to have: familiarity with Go or Rust for backend or performance-critical systems.
  • Fluent English.

Working conditions

Full-time position in the Dubai office, with official employment and a full relocation package.
