Overview
We are seeking a Senior Machine Learning Engineer to bridge the gap between advanced Vision-Language Model (VLM) research and high-performance production serving. This position combines two critical competencies: the ability to design innovative VLM architectures, including dataset curation and multilingual alignment, and the expertise to optimize the inference stack through kernel optimization, distillation, and memory management so these models run within specific hardware constraints (NVIDIA H100 and AMD MI300X).
The successful candidate will own the entire vertical slice, from analyzing the latest arXiv papers and enhancing training sets to writing the C++/CUDA kernels needed to deploy the final model in a production environment.
Responsibilities
- VLM Research & Architecture Design
- Continuously evaluate and implement the latest research trends in Vision-Language Models, with a focus on Referring Expression Comprehension (REC), Document Understanding (Pix2Struct), and Visual Question Answering (VQA).
- Design and build massive-scale training and evaluation datasets, ensuring multilingual compatibility and broad visual understanding tailored to European market requirements.
- Lead the model co-design process, creating architectures optimized for the capabilities of accelerators, addressing both compute-bound and memory-bound operations.
- Advanced Inference Optimization & Serving
- Architect high-throughput serving layers using SGLang and vLLM, optimizing for non-standard decoding strategies.
- Conduct rigorous experiments to identify the Pareto-optimal trade-off between serving latency and generation quality.
- Execute Knowledge Distillation (KD), unstructured pruning, and quantization techniques to fit large-scale VLM architectures onto single-node GPU setups without compromising model quality.
- Systems Engineering & Kernel Development
- Write and optimize custom kernels (CUDA/HIP) to reduce serving latency and identify bottlenecks at the operator level.
- Manage the full pre-training and post-training tech stack to ensure seamless integration between model weights and inference engines.
- Take ownership of deploying serving-efficient models in a production environment, ensuring reliability and scalability.
Qualifications
- Mandatory Requirements
- Education: Master’s or PhD in Computer Science, Artificial Intelligence, or High-Performance Computing.
- Experience: At least 4 years in Machine Learning, with a focus on both model architecture and systems optimization.
- VLM Expertise: Proven experience in building and deploying Vision-Language Models (e.g., architectures similar to CLIP, Flamingo, Pix2Struct) and crafting custom evaluation sets for tasks such as Document Understanding.
- Serving Stack Proficiency: Expert knowledge of SGLang and vLLM for optimized serving.
- Hardware Specifics: Demonstrated expertise in optimizing models for NVIDIA (H100) and AMD (MI300X) accelerators.
- Optimization Techniques: Practical experience applying Knowledge Distillation and pruning to fit models within target serving footprints while reducing latency.
- Production Engineering: A proven track record of transitioning complex multi-modal models from research code to deployed, user-facing production products.
Apply online using the form below. Please note that only applications matching the job profile will be considered.