Overview
We are seeking a Senior Machine Learning Engineer to bridge the gap between advanced Vision-Language Model (VLM) research and high-performance production serving. The position requires dual competency: designing novel VLM architectures, including dataset curation and multilingual alignment, and optimizing the inference stack, spanning kernel optimization, distillation, and memory management, to run effectively within specific hardware constraints (NVIDIA H100 and AMD MI300X). The successful candidate will own the entire vertical slice, from reviewing the latest arXiv papers and improving training sets to writing the C++/CUDA kernels that serve the final model in production.
Responsibilities
- VLM Research & Architecture Design:
- Continuously evaluate and implement the latest research in Vision-Language Models, with a specific focus on Referring Expression Comprehension (REC), Document Understanding (e.g., Pix2Struct), and Visual Question Answering (VQA).
- Design and build massive-scale training and evaluation datasets, ensuring multilingual compatibility and broad visual understanding for European market requirements.
- Lead the model co-design process, creating architectures that are natively optimized for accelerator capabilities (compute-bound vs. memory-bound operations).
- Advanced Inference Optimization & Serving:
- Architect high-throughput serving layers using SGLang and vLLM, optimizing for non-standard decoding strategies.
- Design and run experiments to identify the Pareto frontier between serving latency and generation quality.
- Apply Knowledge Distillation (KD), unstructured pruning, and quantization techniques to fit large-scale VLM architectures onto single-node GPU setups (specifically H100 or MI300X) without compromising model quality.
- Systems Engineering & Kernel Development:
- Write and optimize custom kernels (CUDA/HIP) to accelerate serving latency, identifying bottlenecks at the operator level.
- Manage the full pre-training and post-training tech stack, ensuring seamless integration between model weights and inference engines.
- Take ownership of deploying the serving-efficient model in a production environment, ensuring reliability and scalability.
Qualifications
- Mandatory Requirements (Must Have):
- Education: Master’s or Ph.D. in Computer Science, Artificial Intelligence, or High-Performance Computing.
- Experience: 4+ years in Machine Learning, with a split focus on Model Architecture and Systems Optimization.
- VLM Expertise: Proven experience building and shipping Vision-Language Models (e.g., architectures similar to CLIP, Flamingo, or Pix2Struct). Must have created custom evaluation sets for tasks such as Document Understanding.
- Serving Stack Proficiency: Expert-level knowledge of SGLang and vLLM for optimized serving.
- Hardware Specifics: Demonstrable experience optimizing models for both NVIDIA (H100) and AMD (MI300X) accelerators.
- Optimization Techniques: Hands-on experience with Knowledge Distillation and Pruning to reduce model latency for target serving sizes.
- Production Engineering: A track record of taking complex multi-modal models from research code to a deployed, user-facing production product.
Apply online using the form below. Only applications matching the job profile will be considered.