














### **About Us**

We are a **stealth-mode startup** building next-generation infrastructure for the AI industry. Our team has decades of experience in software, systems, and deep tech. We are working on a new kind of AI runtime that pushes the boundaries of performance and flexibility, making advanced models portable, efficient, and customizable for real-world deployment.

If you want to be part of a small, fast-moving team shaping the **future of applied AI systems**, this is your opportunity.

### **Role**

We are looking for a **C++ Engineer** with a strong systems and GPU programming background to help extend and optimize an open-source AI inference runtime. You will work on the low-level internals of large language model serving, focusing on:

* Dynamic adapter integration (e.g., LoRA/QLoRA)
* Incremental model update mechanisms
* Multi-session inference caching and scheduling
* GPU performance improvements (Tensor Cores, CUDA/ROCm)

This is a **hands-on role**: you will be designing, coding, profiling, and iterating on high-performance inference code that runs directly on CPUs and GPUs.

### **Responsibilities**

* Implement support for **runtime adapter loading (LoRA)**, enabling models to be customized on the fly without retraining or model merges.
* Design and implement mechanisms for **incremental model deltas**, allowing models to be extended and updated efficiently.
* Extend the runtime to handle **multi-session execution**, with isolation and caching strategies for concurrent users.
* Optimize core math kernels and memory layouts to improve inference performance on **CPU and GPU backends**.
* Collaborate with backend and infrastructure engineers to integrate your work into APIs and orchestration layers.
* Write benchmarks, unit tests, and profiling tools to ensure correctness and measure performance gains.
* Contribute to system architecture discussions and help define the roadmap for future runtime features.

### **Requirements**

* Strong proficiency in **modern C++ (C++14/17/20)** and systems programming.
* Solid understanding of **low-level performance optimization**: memory management, multithreading, SIMD, cache efficiency.
* Experience with **CUDA** and/or **ROCm/HIP** GPU programming.
* Familiarity with **linear algebra kernels** (matrix multiply, attention) and how they map to hardware acceleration (Tensor Cores, BLAS libraries, etc.).
* Exposure to **machine learning inference frameworks** (e.g., llama.cpp, TensorRT, ONNX Runtime, TVM, PyTorch internals) is a plus.
* Comfortable working in a **Unix/Linux** environment; experience with build systems (CMake, Bazel) and CI pipelines.
* Strong problem-solving and debugging skills; ability to dive deep into both code and performance traces.
* Self-motivated and able to thrive in a **fast-moving startup** environment.

### **Nice to Have**

* Experience implementing **LoRA or adapter-based fine-tuning** in inference runtimes.
* Knowledge of **quantization methods** and deploying quantized models efficiently.
* Background in distributed systems or multi-GPU orchestration.
* Contributions to **open-source ML/AI systems**.

### **Why Join**

* Build core IP at the intersection of **AI and systems engineering**.
* Work with a highly technical founding team on problems that are both intellectually challenging and commercially impactful.
* Opportunity to shape the direction of a new AI platform from the ground up.
* Competitive compensation (contract or full-time), equity potential, and flexible remote work.

Please use this link to apply for this job: https://www.baasi.com/career/apply/3164000


