Tags: LLM, speculative decoding and generation acceleration, prompt compression and context optimization, generation length prediction, inference and system optimization
We are looking for an ML Engineer to focus on developing and optimizing algorithms that accelerate large language model (LLM) inference. Your work will directly impact the latency, cost efficiency, and scalability of production-grade AI systems. You’ll explore and implement cutting-edge techniques such as speculative decoding, prompt compression, quantization, and generation optimization.
[Tech Stack] PyTorch, Hugging Face Transformers, TensorRT, ONNX Runtime, vLLM, SGLang, DeepSpeed, FlashAttention, xFormers, quantization tools (BitsAndBytes, GPTQ)
Company: Sobolev Research Center
Sobolev Research Center is the local branch of the Lomonosov Research Institute, the R&D ecosystem of an international IT company in Russia and the CIS.
1. Speculative Decoding & Generation Acceleration (Speculative decoding with draft + target models, multi-token prediction, parallel and tree-based decoding, early-exit mechanisms)
Design algorithms that reduce the number of decoding steps and improve generation speed (see the sketch after this list).
What you’ll do:
a. Implement speculative decoding pipelines (draft + target models)
b. Develop multi-token prediction approaches
c. Explore parallel and tree-based decoding strategies
d. Implement early-exit mechanisms
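To make the draft + target idea concrete, here is a minimal sketch of greedy speculative decoding with Hugging Face Transformers. The distilgpt2/gpt2-large pairing, the proposal length k, and the helper name speculative_generate are illustrative assumptions, not a prescribed setup; a production pipeline would also reuse the KV cache across rounds instead of re-encoding the prefix.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DRAFT_ID, TARGET_ID = "distilgpt2", "gpt2-large"  # illustrative draft/target pair

tok = AutoTokenizer.from_pretrained(TARGET_ID)
draft = AutoModelForCausalLM.from_pretrained(DRAFT_ID).eval()
target = AutoModelForCausalLM.from_pretrained(TARGET_ID).eval()

@torch.no_grad()
def speculative_generate(prompt: str, max_new_tokens: int = 64, k: int = 4) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    start = ids.shape[1]
    while ids.shape[1] - start < max_new_tokens:
        cur = ids.shape[1]
        # 1) The cheap draft model proposes k tokens autoregressively.
        draft_ids = ids
        for _ in range(k):
            nxt = draft(draft_ids).logits[:, -1, :].argmax(-1, keepdim=True)
            draft_ids = torch.cat([draft_ids, nxt], dim=-1)
        proposal = draft_ids[:, cur:]                            # (1, k)

        # 2) The target verifies all k proposals in a single forward pass.
        tgt_logits = target(draft_ids).logits                    # (1, cur+k, vocab)
        tgt_choice = tgt_logits[:, cur - 1 : -1, :].argmax(-1)   # (1, k)

        # 3) Accept the longest agreeing prefix, then take one token
        #    from the target itself (guaranteed progress every round).
        agree = (proposal == tgt_choice)[0].long()
        n_accept = int(agree.cumprod(dim=0).sum())
        if n_accept == k:   # all drafts accepted: bonus token from the last position
            bonus = tgt_logits[:, -1, :].argmax(-1, keepdim=True)
        else:               # first disagreement: use the target's own choice there
            bonus = tgt_choice[:, n_accept : n_accept + 1]
        ids = torch.cat([ids, proposal[:, :n_accept], bonus], dim=-1)
        if bonus.item() == tok.eos_token_id:
            break
    ids = ids[:, : start + max_new_tokens]  # trim any overshoot from the last round
    return tok.decode(ids[0], skip_special_tokens=True)

print(speculative_generate("Speculative decoding speeds up inference because"))
```

Because the target verifies all k draft tokens in one forward pass and accepts only the agreeing prefix, the greedy output matches decoding with the target alone, at a lower average cost per token.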
2. Prompt Compression & Context Optimization (Token pruning / attention-based filtering, semantic compression via embeddings, LLM-based summarization (self-compression))
Reduce input context length without degrading output quality (see the sketch after this list).
What you’ll do:
a. Compress long prompts and conversation history
b. Filter irrelevant tokens dynamically
c. Optimize context window usage
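As one concrete instance of semantic compression via embeddings, the sketch below ranks history chunks by cosine similarity to the current query and keeps the most relevant ones within a token budget. The sentence-transformers dependency, the all-MiniLM-L6-v2 embedder, the budget value, and the compress_context helper are all illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # small illustrative embedder
tok = AutoTokenizer.from_pretrained("gpt2")          # tokenizer of the target LLM

def compress_context(history_chunks, query, token_budget=256):
    """Keep the most query-relevant chunks, in original order, within budget."""
    emb = embedder.encode(history_chunks + [query], convert_to_tensor=True)
    scores = util.cos_sim(emb[:-1], emb[-1:]).squeeze(-1)   # relevance per chunk
    kept, used = set(), 0
    for i in scores.argsort(descending=True).tolist():
        n_tokens = len(tok(history_chunks[i]).input_ids)
        if used + n_tokens <= token_budget:
            kept.add(i)
            used += n_tokens
    # Re-emit kept chunks in chronological order so the prompt stays coherent.
    return "\n".join(history_chunks[i] for i in sorted(kept))

history = [
    "User: How can I cut GPU memory use at inference time?",
    "Assistant: Quantization and a smaller KV cache are the usual levers.",
    "User: Unrelatedly, any good lunch spots nearby?",
]
print(compress_context(history, "Explain how quantization saves memory.", 48))
```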
3. Generation Length Prediction (Length prediction models (regression/classification), confidence-based stopping, entropy-based stopping criteria, reinforcement learning for stopping policies)
Optimize compute usage by predicting output sequence length (see the sketch after this list).
What you’ll do:
a. Build models to predict generation length
b. Dynamically allocate compute resources
c. Implement early stopping strategies
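Here is a minimal sketch of one of the listed heuristics, entropy-based stopping: decode greedily and cut generation once the next-token distribution has stayed low-entropy (i.e., high-confidence) for several consecutive steps. The threshold, the patience value, and the function name are illustrative and untuned; a learned length predictor or RL stopping policy would replace this rule in practice.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def generate_with_entropy_stop(prompt, max_new_tokens=128,
                               entropy_threshold=1.0, patience=3):
    ids = tok(prompt, return_tensors="pt").input_ids
    calm_steps = 0
    for _ in range(max_new_tokens):
        logits = model(ids).logits[:, -1, :]
        probs = F.softmax(logits, dim=-1)
        # Shannon entropy (nats) of the next-token distribution:
        # low entropy means the model is confident about what comes next.
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()
        next_id = logits.argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break
        calm_steps = calm_steps + 1 if entropy < entropy_threshold else 0
        if calm_steps >= patience:  # confident for `patience` steps: stop early
            break
    return tok.decode(ids[0], skip_special_tokens=True)

print(generate_with_entropy_stop("The capital of France is"))
```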
4. Inference & Systems Optimization (FlashAttention, PagedAttention)
Improve system-level performance and scalability (see the sketch after this list).
What you’ll do:
a. Implement dynamic batching and efficient serving
b. Optimize KV-cache usage
c. Apply memory-efficient attention mechanisms
d. Improve parallelism and throughput
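The sketch below illustrates the KV-cache optimization in its simplest form: after a single prefill pass over the prompt, each decode step feeds only the newest token together with the cached keys and values, so the prefix is never re-encoded. The model choice and the helper name are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def generate_with_kv_cache(prompt, max_new_tokens=32):
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model(ids, use_cache=True)           # prefill: encode the prompt once
    next_id = out.logits[:, -1, :].argmax(-1, keepdim=True)
    generated = [next_id]
    for _ in range(max_new_tokens - 1):
        # Decode step: only the single newest token runs through the model;
        # keys/values for the entire prefix come from the cache.
        out = model(next_id, past_key_values=out.past_key_values, use_cache=True)
        next_id = out.logits[:, -1, :].argmax(-1, keepdim=True)
        if next_id.item() == tok.eos_token_id:
            break
        generated.append(next_id)
    return tok.decode(torch.cat([ids] + generated, dim=-1)[0],
                      skip_special_tokens=True)

print(generate_with_kv_cache("KV caching speeds up decoding because"))
```

Systems such as vLLM extend this idea with PagedAttention, which stores the cache in fixed-size blocks so memory can be allocated on demand and shared across a dynamic batch.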
Must-have:
● Strong experience with deep learning frameworks (PyTorch or TensorFlow)
● Solid understanding of Transformer architectures and LLMs
● Experience with model inference optimization
● Strong Python skills
● Understanding of GPU/CPU performance and memory bottlenecks
Full-time office work with a flexible schedule and voluntary medical insurance (VMI). Location: Novosibirsk, Academgorodok, Ingenernaya str. 7.