Tags: LLM, speculative decoding and generation acceleration, prompt compression and context optimization, generation length prediction, inference and system optimization
We are looking for an ML Engineer to focus on developing and optimizing algorithms that accelerate large language model (LLM) inference. Your work will directly impact the latency, cost efficiency, and scalability of production-grade AI systems. You’ll explore and implement cutting-edge techniques such as speculative decoding, prompt compression, quantization, and generation optimization.
[Tech Stack] PyTorch, Hugging Face Transformers, TensorRT, ONNX Runtime, vLLM, SGLang, DeepSpeed, FlashAttention, xFormers, quantization tools (BitsAndBytes, GPTQ)
Company: Sobolev Research Center
Sobolev Research Center is the local branch of the Lomonosov Research Institute, the R&D ecosystem of an international IT company in Russia and the CIS.
1. Speculative Decoding & Generation Acceleration (Speculative decoding with draft + target models, multi-token prediction, parallel and tree-based decoding, early-exit mechanisms)
Design algorithms that reduce the number of decoding steps and improve generation speed (see the sketch after this list).
What you’ll do:
a. Implement speculative decoding pipelines (draft + target models)
b. Develop multi-token prediction approaches
c. Explore parallel and tree-based decoding strategies
d. Implement early-exit mechanisms
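To make the draft + target idea concrete, here is a minimal sketch of greedy speculative decoding with Hugging Face Transformers. The distilgpt2/gpt2-large pairing, the proposal length k, and the helper name speculative_generate are illustrative assumptions, not a prescribed setup; a production pipeline would also reuse the KV cache across rounds instead of re-encoding the prefix.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DRAFT_ID, TARGET_ID = "distilgpt2", "gpt2-large"  # illustrative draft/target pair

tok = AutoTokenizer.from_pretrained(TARGET_ID)
draft = AutoModelForCausalLM.from_pretrained(DRAFT_ID).eval()
target = AutoModelForCausalLM.from_pretrained(TARGET_ID).eval()

@torch.no_grad()
def speculative_generate(prompt: str, max_new_tokens: int = 64, k: int = 4) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    start = ids.shape[1]
    while ids.shape[1] - start < max_new_tokens:
        cur = ids.shape[1]
        # 1) The cheap draft model proposes k tokens autoregressively.
        draft_ids = ids
        for _ in range(k):
            nxt = draft(draft_ids).logits[:, -1, :].argmax(-1, keepdim=True)
            draft_ids = torch.cat([draft_ids, nxt], dim=-1)
        proposal = draft_ids[:, cur:]                            # (1, k)

        # 2) The target verifies all k proposals in a single forward pass.
        tgt_logits = target(draft_ids).logits                    # (1, cur+k, vocab)
        tgt_choice = tgt_logits[:, cur - 1 : -1, :].argmax(-1)   # (1, k)

        # 3) Accept the longest agreeing prefix, then take one token
        #    from the target itself (guaranteed progress every round).
        agree = (proposal == tgt_choice)[0].long()
        n_accept = int(agree.cumprod(dim=0).sum())
        if n_accept == k:   # all drafts accepted: bonus token from the last position
            bonus = tgt_logits[:, -1, :].argmax(-1, keepdim=True)
        else:               # first disagreement: use the target's own choice there
            bonus = tgt_choice[:, n_accept : n_accept + 1]
        ids = torch.cat([ids, proposal[:, :n_accept], bonus], dim=-1)
        if bonus.item() == tok.eos_token_id:
            break
    ids = ids[:, : start + max_new_tokens]  # trim any overshoot from the last round
    return tok.decode(ids[0], skip_special_tokens=True)

print(speculative_generate("Speculative decoding speeds up inference because"))
```

Because the target verifies all k draft tokens in one forward pass and accepts only the agreeing prefix, the greedy output matches decoding with the target alone, at a lower average cost per token.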
2. Prompt Compression & Context Optimization (Token pruning / attention-based filtering, semantic compression via embeddings, LLM-based summarization (self-compression))
Reduce input context length without degrading output quality (see the sketch after this list).
What you’ll do:
a. Compress long prompts and conversation history
b. Filter irrelevant tokens dynamically
c. Optimize context window usage
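As one concrete instance of semantic compression via embeddings, the sketch below ranks history chunks by cosine similarity to the current query and keeps the most relevant ones within a token budget. The sentence-transformers dependency, the all-MiniLM-L6-v2 embedder, the budget value, and the compress_context helper are all illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # small illustrative embedder
tok = AutoTokenizer.from_pretrained("gpt2")          # tokenizer of the target LLM

def compress_context(history_chunks, query, token_budget=256):
    """Keep the most query-relevant chunks, in original order, within budget."""
    emb = embedder.encode(history_chunks + [query], convert_to_tensor=True)
    scores = util.cos_sim(emb[:-1], emb[-1:]).squeeze(-1)   # relevance per chunk
    kept, used = set(), 0
    for i in scores.argsort(descending=True).tolist():
        n_tokens = len(tok(history_chunks[i]).input_ids)
        if used + n_tokens <= token_budget:
            kept.add(i)
            used += n_tokens
    # Re-emit kept chunks in chronological order so the prompt stays coherent.
    return "\n".join(history_chunks[i] for i in sorted(kept))

history = [
    "User: How can I cut GPU memory use at inference time?",
    "Assistant: Quantization and a smaller KV cache are the usual levers.",
    "User: Unrelatedly, any good lunch spots nearby?",
]
print(compress_context(history, "Explain how quantization saves memory.", 48))
```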
3. Generation Length Prediction (Length prediction models (regression/classification), confidence-based stopping, entropy-based stopping criteria, reinforcement learning for stopping policies)
Optimize compute usage by predicting output sequence length (see the sketch after this list).
What you’ll do:
a. Build models to predict generation length
b. Dynamically allocate compute resources
c. Implement early stopping strategies
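Here is a minimal sketch of one of the listed heuristics, entropy-based stopping: decode greedily and cut generation once the next-token distribution has stayed low-entropy (i.e., high-confidence) for several consecutive steps. The threshold, the patience value, and the function name are illustrative and untuned; a learned length predictor or RL stopping policy would replace this rule in practice.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def generate_with_entropy_stop(prompt, max_new_tokens=128,
                               entropy_threshold=1.0, patience=3):
    ids = tok(prompt, return_tensors="pt").input_ids
    calm_steps = 0
    for _ in range(max_new_tokens):
        logits = model(ids).logits[:, -1, :]
        probs = F.softmax(logits, dim=-1)
        # Shannon entropy (nats) of the next-token distribution:
        # low entropy means the model is confident about what comes next.
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()
        next_id = logits.argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break
        calm_steps = calm_steps + 1 if entropy < entropy_threshold else 0
        if calm_steps >= patience:  # confident for `patience` steps: stop early
            break
    return tok.decode(ids[0], skip_special_tokens=True)

print(generate_with_entropy_stop("The capital of France is"))
```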
4. Inference & Systems Optimization (FlashAttention, PagedAttention)
Improve system-level performance and scalability (see the sketch after this list).
What you’ll do:
a. Implement dynamic batching and efficient serving
b. Optimize KV-cache usage
c. Apply memory-efficient attention mechanisms
d. Improve parallelism and throughput
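The sketch below illustrates the KV-cache optimization in its simplest form: after a single prefill pass over the prompt, each decode step feeds only the newest token together with the cached keys and values, so the prefix is never re-encoded. The model choice and the helper name are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def generate_with_kv_cache(prompt, max_new_tokens=32):
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model(ids, use_cache=True)           # prefill: encode the prompt once
    next_id = out.logits[:, -1, :].argmax(-1, keepdim=True)
    generated = [next_id]
    for _ in range(max_new_tokens - 1):
        # Decode step: only the single newest token runs through the model;
        # keys/values for the entire prefix come from the cache.
        out = model(next_id, past_key_values=out.past_key_values, use_cache=True)
        next_id = out.logits[:, -1, :].argmax(-1, keepdim=True)
        if next_id.item() == tok.eos_token_id:
            break
        generated.append(next_id)
    return tok.decode(torch.cat([ids] + generated, dim=-1)[0],
                      skip_special_tokens=True)

print(generate_with_kv_cache("KV caching speeds up decoding because"))
```

Systems such as vLLM extend this idea with PagedAttention, which stores the cache in fixed-size blocks so memory can be allocated on demand and shared across a dynamic batch.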
Must-have:
● Strong experience with deep learning frameworks (PyTorch or TensorFlow)
● Solid understanding of Transformer architectures and LLMs
● Experience with model inference optimization
● Strong Python skills
● Understanding of GPU/CPU performance and memory bottlenecks
Full-time office work with a flexible schedule and voluntary medical insurance (VMI). Location: Novosibirsk, Academgorodok, Ingenernaya str. 7.