AI Tinkerers Tel Aviv: Inference Night in the White City

Monday, November 17th, 2025 • 6:30PM to 9:30PM (IST)

Event Ended

This event has already taken place.

Attendees 153+ registered

We are bringing together a highly technical group of attendees: the most common skills and interests are AI/ML and Python, followed by data science, computer vision, and NLP. Our attendees include senior engineering leaders, researchers, and founders from companies like NVIDIA, IBM Research, Amazon, and Wix, including PhDs, university lecturers, and recipients of the Israel Defense Prize.

AI Tinkerers Tel Aviv: Inference-focused Meetup

When and Where

Date: November 17, 2025
Time: 18:30-21:30
Location: Da vinci TLV, Leonardo da Vinci St 14, 4th Floor, Tel Aviv

What

The next AI Tinkerers Tel Aviv meetup is all about inference - the unsung engine behind every LLM application. From paged attention to GPU memory bottlenecks, we’re diving deep into how modern inference stacks actually work.

We’ll kick things off with a short meet-and-greet, followed by a lightning talk on enterprise vibe coding by our sponsor AutonomyAI.

After that, we’ll go all in on inference:

🧩 LLM Inference 101: Efficient Serving with vLLM
Asaf Gardin, Senior Software Engineer (Inference Team), AI21 Labs
Serving large language models efficiently is notoriously hard: unpredictable request patterns, tricky batching, and GPU-hungry KV states. Asaf will break down how vLLM tackles these challenges using PagedAttention, continuous batching, prefix caching, and hybrid-model support (ahem, Jamba, ahem). You’ll walk away knowing what makes vLLM the go-to open-source inference engine.

⚙️ LLM Inference 201: From Compute-Bound to Memory-Bound
Tomer Asida, Senior LLM Inference Engineer, NVIDIA
Every inference request lives in two worlds: prefill (compute-bound) and decode (memory-bound). Tomer will show how GPU math throughput and memory bandwidth dictate performance, and how cutting-edge systems squeeze efficiency from both sides of the pipeline.

Come hungry and curious.
Hosted at AI21 Labs HQ, this night is for anyone building, optimizing, or just obsessed with how inference really works.

Sponsors

AI21 Labs - AI21 is pioneering the development of enterprise AI Systems and Foundation Models. Their mission is to build trustworthy artificial intelligence that powers humanity towards superproductivity. They offer privately deployed models with unmatched performance and reliability with tailored solutions for every organization.

AutonomyAI - AutonomyAI helps engineering teams ship faster without compromising quality. It integrates directly into your codebase to generate production-grade interfaces that follow your existing design systems, frameworks, and conventions. No black box, no throwaway prototypes, just clean, mergeable code. Used by enterprise teams and backed by $4M in funding, AutonomyAI eliminates front-end bottlenecks so developers can focus on logic, not layout.

Detailed Agenda

LLM Inference 101: Efficient Serving with vLLM

Asaf Gardin, Senior Software Engineer (Inference Team), AI21 Labs

Serving large language models (LLMs) efficiently is a difficult problem: requests arrive unpredictably, batching is hard to optimize, and storing key–value (KV) states quickly consumes GPU capacity. vLLM is an open-source library designed to solve these challenges and deliver high-throughput, cost-effective inference.
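To get a feel for why KV states consume GPU capacity so quickly, here is a back-of-the-envelope sketch. The function name and the model dimensions are hypothetical, chosen only for illustration; they do not describe any specific model, and real engines add their own overheads.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Approximate KV-cache size for a standard attention transformer.

    Every token stores one key vector and one value vector per layer,
    hence the factor of 2. bytes_per_elem=2 assumes 16-bit storage.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Hypothetical configuration (an assumption, not a real model card).
per_token = kv_cache_bytes(32, 8, 128, seq_len=1, batch_size=1)
total = kv_cache_bytes(32, 8, 128, seq_len=8192, batch_size=16)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{total / 2**30:.1f} GiB for 16 sequences of 8192 tokens")
```

Under these assumed numbers, a modest batch of long sequences already needs tens of gigabytes for KV states alone, which is why how that memory is managed matters so much.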

In this talk, we’ll cover the fundamentals of why LLM serving is hard, introduce vLLM’s core ideas like PagedAttention and continuous batching, and explain how they enable dynamic, efficient execution. We’ll also highlight features such as quantization, Automatic Prefix Caching, and support for hybrid models like Jamba, which combine attention and Mamba layers. By the end, you’ll understand both the challenges of LLM inference and how vLLM makes large-scale deployment practical.
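The idea behind paged KV-cache management can be sketched with a toy allocator: the cache is split into fixed-size blocks, and each sequence keeps a block table mapping its logical blocks to physical ones, so memory is claimed only as tokens arrive. This is a simplified illustration, not vLLM's implementation; the class and method names are hypothetical.

```python
class PagedKVAllocator:
    """Toy block allocator: one block table per sequence (illustrative only)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token; take a new block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length == len(table) * self.block_size:  # allocated blocks are full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=8, block_size=4)
for _ in range(5):
    alloc.append_token("a")  # 5 tokens need 2 blocks of 4
for _ in range(3):
    alloc.append_token("b")  # 3 tokens need 1 block
blocks_in_use = 8 - len(alloc.free_blocks)
alloc.free("a")              # blocks go back to the pool immediately
```

Because blocks are returned to the pool as soon as a sequence finishes, new requests can join a running batch without waiting for a large contiguous region to free up, which is what makes continuous batching practical.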

LLM Inference 201: Two Phases, Two Bottlenecks: How GPU Performance Shapes LLM Inference

Tomer Asida, Senior LLM Inference Software Engineer, NVIDIA

Every LLM request goes through two very different execution phases: prefill and decode. Each is limited by a different hardware bottleneck: prefill is compute-bound and dominated by matrix multiplications, while decode is memory-bound and constrained by moving data from GPU DRAM to SRAM.

In this talk, we’ll cover:

What compute-bound vs. memory-bound actually means
The main factors contributing to GPU performance
The differences between the prefill and decode phases, and why they are bound by different factors
Techniques modern inference systems use to optimize both phases

Together, these concepts form a clear, GPU-level foundation for understanding how inference engines like vLLM extract maximum efficiency.
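The compute-bound vs. memory-bound distinction can be made concrete with a rough roofline-style estimate: compare a matmul's arithmetic intensity (FLOPs per byte moved) with the hardware's ratio of peak FLOPs to memory bandwidth. The sketch below is a simplification that looks at a single weight matmul and ignores attention, caching effects, and kernel details; the hardware figures and function names are hypothetical assumptions, not specs of any real GPU.

```python
def arithmetic_intensity(num_tokens, d_model, bytes_per_elem=2):
    """FLOPs per byte for one (num_tokens x d_model) @ (d_model x d_model) matmul."""
    flops = 2 * num_tokens * d_model * d_model
    weight_bytes = d_model * d_model * bytes_per_elem
    # Read the input activations once and write the output once.
    activation_bytes = 2 * num_tokens * d_model * bytes_per_elem
    return flops / (weight_bytes + activation_bytes)


def bound(num_tokens, d_model, peak_flops, mem_bandwidth):
    """Classify the matmul against the hardware's ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs per byte the GPU can sustain
    intensity = arithmetic_intensity(num_tokens, d_model)
    return "compute-bound" if intensity > ridge else "memory-bound"


PEAK_FLOPS = 1e15  # hypothetical: 1000 TFLOP/s
MEM_BW = 3e12      # hypothetical: 3 TB/s

prefill = bound(4096, 4096, PEAK_FLOPS, MEM_BW)  # many prompt tokens at once
decode = bound(1, 4096, PEAK_FLOPS, MEM_BW)      # one new token per step
print(prefill, decode)
```

With one token per step, the weights are read from memory for roughly one FLOP per byte, far below the assumed ridge point. Processing more tokens per weight read, as batching does, raises the intensity, which is one reason batching helps decode throughput.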
