AI Tinkerers Tel Aviv: Inference Night in the White City [AI Tinkerers - Tel Aviv]

Monday, November 17th, 2025, 6:30 PM to 9:30 PM (IST)

This event has already taken place.

Attendees 153+ registered
We are bringing together a highly technical group of attendees: the most common skills and interests are AI/ML and Python, followed by data science, computer vision, and NLP. Our attendees include senior engineering leaders, researchers, and founders from companies like NVIDIA, IBM Research, Amazon, and Wix, including PhDs, university lecturers, and recipients of the Israel Defense Prize.

AI Tinkerers Tel Aviv: Inference-focused Meetup

(Banner) AI Tinkerers Tel Aviv presents "(TL)vLLM: Inference in the White City".

When and Where

  • Date: November 17, 2025
  • Time: 18:30-21:30
  • Location: Da Vinci TLV, Leonardo da Vinci St 14, 4th Floor, Tel Aviv

What

The next AI Tinkerers Tel Aviv meetup is all about inference - the unsung engine behind every LLM application. From paged attention to GPU memory bottlenecks, we’re diving deep into how modern inference stacks actually work.

We’ll kick things off with a short meet-and-greet, followed by a lightning talk on enterprise vibe coding from our sponsor AutonomyAI.

After that, we’ll go all in on inference:

🧩 LLM Inference 101: Efficient Serving with vLLM
Asaf Gardin, Senior Software Engineer (Inference Team), AI21 Labs
Serving large language models efficiently is notoriously hard: unpredictable request patterns, tricky batching, and GPU-hungry KV states. Asaf will break down how vLLM tackles these challenges using PagedAttention, continuous batching, prefix caching, and hybrid-model support (ahem, Jamba, ahem). You’ll walk away knowing what makes vLLM the go-to open-source inference engine.

⚙️ LLM Inference 201: From Compute-Bound to Memory-Bound
Tomer Asida, Senior LLM Inference Engineer, NVIDIA
Every inference request lives in two worlds: prefill (compute-bound) and decode (memory-bound). Tomer will show how GPU math throughput and memory bandwidth dictate performance, and how cutting-edge systems squeeze efficiency from both sides of the pipeline.

Come hungry and curious.
Hosted at AI21 Labs HQ, this night is for anyone building, optimizing, or just obsessed with how inference really works.

Sponsors

AI21 Labs - AI21 is pioneering the development of enterprise AI Systems and Foundation Models. Their mission is to build trustworthy artificial intelligence that powers humanity towards superproductivity. They offer privately deployed models with unmatched performance and reliability with tailored solutions for every organization.

AutonomyAI - AutonomyAI helps engineering teams ship faster without compromising quality. It integrates directly into your codebase to generate production-grade interfaces that follow your existing design systems, frameworks, and conventions. No black box, no throwaway prototypes, just clean, mergeable code. Used by enterprise teams and backed by $4M in funding, AutonomyAI eliminates front-end bottlenecks so developers can focus on logic, not layout.

Detailed Agenda

LLM Inference 101: Efficient Serving with vLLM

Asaf Gardin, Senior Software Engineer (Inference Team), AI21 Labs

Serving large language models (LLMs) efficiently is a difficult problem: requests arrive unpredictably, batching is hard to optimize, and storing key–value (KV) states quickly consumes GPU capacity. vLLM is an open-source library designed to solve these challenges and deliver high-throughput, cost-effective inference.
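To give a feel for why KV states consume GPU capacity so quickly, here is a minimal back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128, FP16) is a hypothetical example chosen for illustration and is not taken from the talk; the formula is the standard one for a plain attention KV cache.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV-cache size in bytes for a plain attention model.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape (an assumption for illustration only).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

# Cost of caching a single token for a single request.
per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=1, batch_size=1)

# Cost of 32 concurrent requests, each holding 8192 tokens of context.
total = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=8192, batch_size=32)

print(f"per token: {per_token / 1024:.0f} KiB")
print(f"32 x 8192 tokens: {total / 2**30:.0f} GiB")
```

Under these assumptions each cached token costs 128 KiB, and 32 long requests together need about 32 GiB for the cache alone, before counting model weights.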

In this talk, we’ll cover the fundamentals of why LLM serving is hard, introduce vLLM’s core ideas like PagedAttention and continuous batching, and explain how they enable dynamic, efficient execution. We’ll also highlight features such as quantization, Automatic Prefix Caching, and support for hybrid models like Jamba, which combine attention and Mamba layers. By the end, you’ll understand both the challenges of LLM inference and how vLLM makes large-scale deployment practical.
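As a rough illustration of the idea behind PagedAttention, the sketch below allocates KV-cache memory in fixed-size blocks and keeps a per-sequence block table, much like pages in virtual memory. This is a toy model, not vLLM's implementation: the class `PagedKVAllocator` and its methods are hypothetical names invented for this example.

```python
class PagedKVAllocator:
    """Toy block-table allocator in the spirit of PagedAttention."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (block_id, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        # A new physical block is needed only when the last one is full.
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):          # 17 tokens span two 16-token blocks
    alloc.append_token("a")
for _ in range(5):           # 5 tokens fit in one block
    alloc.append_token("b")
blocks_a = len(alloc.block_tables["a"])
free_before = len(alloc.free_blocks)
alloc.free("a")              # finished request releases its blocks at once
free_after = len(alloc.free_blocks)
```

Because blocks are handed out on demand and returned as soon as a request finishes, memory wasted per sequence is bounded by one partially filled block, which is what lets a scheduler admit new requests into a running batch.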

LLM Inference 201: Two Phases, Two Bottlenecks: How GPU Performance Shapes LLM Inference

Tomer Asida, Senior LLM Inference Software Engineer, NVIDIA

Every LLM request goes through two very different execution phases: prefill and decode. Each is limited by a different hardware bottleneck: prefill is compute-bound and dominated by matrix multiplications, while decode is memory-bound and constrained by moving data from GPU DRAM to on-chip SRAM.

In this talk, we’ll cover:

  • What compute-bound vs. memory-bound actually means
  • The main factors contributing to GPU performance
  • The differences between the prefill and decode phases, and why they are bound by different factors
  • Techniques modern inference systems use to optimize both phases

Together, these concepts form a clear, GPU-level foundation for understanding how inference engines like vLLM extract maximum efficiency.
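The compute-bound versus memory-bound distinction described above can be sketched with a simple roofline-style calculation. This is a minimal sketch under stated assumptions: the GPU figures (1e15 FLOP/s peak, 2e12 bytes/s memory bandwidth) and the 4096 x 4096 FP16 weight matrix are hypothetical round numbers, not the specifications of any particular device or model.

```python
def arithmetic_intensity(n_tokens, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte moved for one linear layer applied to n_tokens."""
    flops = 2 * n_tokens * d_in * d_out  # one multiply-add per weight per token
    # Weights are read once; input and output activations scale with n_tokens.
    bytes_moved = bytes_per_elem * (d_in * d_out + n_tokens * (d_in + d_out))
    return flops / bytes_moved


def bound(n_tokens, d_in, d_out, peak_flops, mem_bandwidth):
    """Classify the layer against the GPU's ridge point (FLOPs per byte)."""
    ridge = peak_flops / mem_bandwidth
    if arithmetic_intensity(n_tokens, d_in, d_out) >= ridge:
        return "compute-bound"
    return "memory-bound"


# Hypothetical GPU: ridge point = 1e15 / 2e12 = 500 FLOPs per byte.
PEAK_FLOPS = 1e15
MEM_BANDWIDTH = 2e12

# Prefill processes the whole prompt at once; decode handles one token per step.
prefill = bound(4096, 4096, 4096, PEAK_FLOPS, MEM_BANDWIDTH)
decode = bound(1, 4096, 4096, PEAK_FLOPS, MEM_BANDWIDTH)

print("prefill:", prefill)
print("decode:", decode)
```

With one token per step, the layer does roughly one FLOP per byte of weights read, far below the ridge point, so decode waits on memory. Processing thousands of prompt tokens against the same weights raises the intensity past the ridge point, which is also why batching more decode requests together helps.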
