AI Tinkerers Tel Aviv: Inference Night in the White City
AI Tinkerers Tel Aviv: Inference-focused Meetup

When and Where
- Date: November 17, 2025
- Time: 18:30-21:30
- Location: Da vinci TLV, Leonardo da Vinci St 14, 4th Floor, Tel Aviv
What
The next AI Tinkerers Tel Aviv meetup is all about inference - the unsung engine behind every LLM application. From paged attention to GPU memory bottlenecks, we’re diving deep into how modern inference stacks actually work.
We’ll kick things off with a short meet-and-greet, followed by a lightning talk from our sponsor AutonomyAI on enterprise vibe coding.
After that, we’ll go all in on inference:
🧩 LLM Inference 101: Efficient Serving with vLLM
Asaf Gardin, Senior Software Engineer (Inference Team), AI21 Labs
Serving large language models efficiently is notoriously hard: unpredictable request patterns, tricky batching, and GPU-hungry KV states. Asaf will break down how vLLM tackles these challenges using PagedAttention, continuous batching, prefix caching, and hybrid-model support (ahem, Jamba, ahem). You’ll walk away knowing what makes vLLM the go-to open-source inference engine.
⚙️ LLM Inference 201: From Compute-Bound to Memory-Bound
Tomer Asida, Senior LLM Inference Engineer, NVIDIA
Every inference request lives in two worlds: prefill (compute-bound) and decode (memory-bound). Tomer will show how GPU math throughput and memory bandwidth dictate performance, and how cutting-edge systems squeeze efficiency from both sides of the pipeline.
Come hungry and curious.
Hosted at AI21 Labs HQ, this night is for anyone building, optimizing, or just obsessed with how inference really works.
Sponsors
AI21 Labs - AI21 is pioneering the development of enterprise AI Systems and Foundation Models. Their mission is to build trustworthy artificial intelligence that powers humanity towards superproductivity. They offer privately deployed models with unmatched performance and reliability, along with tailored solutions for every organization.

AutonomyAI - AutonomyAI helps engineering teams ship faster without compromising quality. It integrates directly into your codebase to generate production-grade interfaces that follow your existing design systems, frameworks, and conventions. No black box, no throwaway prototypes, just clean, mergeable code. Used by enterprise teams and backed by $4M in funding, AutonomyAI eliminates front-end bottlenecks so developers can focus on logic, not layout.

Detailed Agenda
LLM Inference 101: Efficient Serving with vLLM
Asaf Gardin, Senior Software Engineer (Inference Team), AI21 Labs
Serving large language models (LLMs) efficiently is a difficult problem: requests arrive unpredictably, batching is hard to optimize, and storing key–value (KV) states quickly consumes GPU capacity. vLLM is an open-source library designed to solve these challenges and deliver high-throughput, cost-effective inference.
In this talk, we’ll cover the fundamentals of why LLM serving is hard, introduce vLLM’s core ideas like PagedAttention and continuous batching, and explain how they enable dynamic, efficient execution. We’ll also highlight features such as quantization, Automatic Prefix Caching, and support for hybrid models like Jamba, which combine attention and Mamba layers. By the end, you’ll understand both the challenges of LLM inference and how vLLM makes large-scale deployment practical.
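For a feel of the paging idea ahead of the talk, here is a minimal toy sketch in Python. It is not vLLM's implementation: the class name `PagedKVCache`, its methods, and the block size are all hypothetical, chosen only to illustrate the general technique of storing each sequence's KV states in fixed-size blocks tracked by a per-sequence block table, so memory is claimed one block at a time instead of being reserved up front for the maximum length.

```python
class PagedKVCache:
    """Toy block allocator (illustrative only, not vLLM's actual code)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size                # tokens per block (assumed value)
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}                      # seq_id -> list of block ids
        self.seq_lens = {}                          # seq_id -> tokens stored so far

    def append_token(self, seq_id: str) -> int:
        """Reserve room for one more token; return the physical block used."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:
            # The last block is full (or the sequence is new): take a fresh one.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks left in the pool")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1]

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


# Two sequences of different lengths share one small pool.
cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(5):
    cache.append_token("request-a")   # 5 tokens occupy 2 blocks
for _ in range(2):
    cache.append_token("request-b")   # 2 tokens occupy 1 block
cache.free("request-a")               # its blocks become available immediately
```

Because blocks go back to the pool the moment a request finishes, a scheduler can admit new requests mid-flight, which is the property continuous batching relies on.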
LLM Inference 201: Two Phases, Two Bottlenecks: How GPU Performance Shapes LLM Inference
Tomer Asida, Senior LLM Inference Software Engineer, NVIDIA
Every LLM request goes through two very different execution phases: prefill and decode. Each is limited by a different hardware bottleneck: prefill is compute-bound and dominated by matrix multiplications, while decode is memory-bound and constrained by moving data from GPU DRAM to SRAM.
In this talk, we’ll cover:
- What compute-bound vs. memory-bound actually means
- The main factors contributing to GPU performance
- The differences between the prefill and decode phases, and why they are bound by different factors
- Techniques modern inference systems use to optimize both phases
Together, these concepts form a clear, GPU-level foundation for understanding how inference engines like vLLM extract maximum efficiency.
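As a warm-up for the compute-bound vs. memory-bound distinction, here is a rough back-of-the-envelope sketch in Python. It is a simplification under stated assumptions, not material from the talk: it models a single square linear layer in 16-bit precision, ignores attention and KV-cache traffic entirely, and uses made-up hardware numbers (`PEAK_FLOPS`, `MEM_BANDWIDTH`) that do not describe any real GPU. The function names are hypothetical too.

```python
def linear_layer_intensity(num_tokens: int, d_model: int, bytes_per_value: int = 2) -> float:
    """Arithmetic intensity (FLOPs per byte moved) of one d_model x d_model linear layer."""
    flops = 2 * num_tokens * d_model * d_model                 # one multiply + one add per weight per token
    weight_bytes = bytes_per_value * d_model * d_model          # weights are read once for the whole batch
    activation_bytes = bytes_per_value * 2 * num_tokens * d_model  # inputs read, outputs written
    return flops / (weight_bytes + activation_bytes)


def classify(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Compare intensity with the ridge point where compute and memory time are equal."""
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Hypothetical accelerator figures, for illustration only.
PEAK_FLOPS = 1.0e15      # 1,000 TFLOP/s of 16-bit math
MEM_BANDWIDTH = 3.0e12   # 3 TB/s of DRAM bandwidth

prefill = linear_layer_intensity(num_tokens=2048, d_model=4096)  # whole prompt at once
decode = linear_layer_intensity(num_tokens=1, d_model=4096)      # one new token

print(f"prefill: {prefill:.1f} FLOP/byte, {classify(prefill, PEAK_FLOPS, MEM_BANDWIDTH)}")
print(f"decode:  {decode:.3f} FLOP/byte, {classify(decode, PEAK_FLOPS, MEM_BANDWIDTH)}")
```

Under these assumptions, the prompt-sized batch does about a thousand FLOPs for every byte it moves, while the single-token step does about one, so the same layer lands on opposite sides of the ridge point. Raising `num_tokens` in the decode case, which is what batching several requests together does, pushes the intensity back up.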