AI Seminar
Small Hybrid Language Model
This event is free and open to the public.
AI Seminar and Guest Lecture at CSE595 NLP Fall 2024
Location: George G. Brown Laboratories (GGBL) Building, Room 2505
Zoom: https://umich.zoom.us/j/9932414410
Abstract:
The quadratic computational cost and high memory demands of Transformers pose efficiency challenges, while state space models (SSMs) such as Mamba offer constant complexity and efficient hardware optimization but struggle with memory recall tasks. In this talk, I will introduce Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates Transformer attention mechanisms with SSMs for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. Additionally, we introduce learnable meta tokens that are prepended to prompts, storing critical information and alleviating the "forced-to-attend" burden associated with attention mechanisms. The model is further optimized by incorporating cross-layer key-value (KV) sharing and partial sliding-window attention, resulting in a compact cache size. During development, we conducted a controlled study comparing various architectures under identical settings and observed significant advantages of our proposed architecture. Notably, Hymba achieves state-of-the-art results for small LMs: our Hymba-1.5B-Base model surpasses all sub-2B public models and even outperforms Llama-3.2-3B, with 1.32% higher average accuracy, an 11.67× cache size reduction, and 3.49× higher throughput.
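For readers curious how a hybrid-head parallel block might look in code, below is a minimal, self-contained sketch, not the official Hymba implementation: an attention head and a simplified SSM-style head (a gated linear recurrence standing in for a Mamba-style selective SSM) run in parallel over the same input, with learnable meta tokens prepended to the sequence. All dimensions, module names, and the fusion scheme are illustrative assumptions.

# Illustrative sketch of a hybrid-head parallel block (not the official Hymba code).
import torch
import torch.nn as nn

class HybridHeadBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4, n_meta_tokens: int = 8):
        super().__init__()
        # Learnable meta tokens prepended to every prompt.
        self.meta_tokens = nn.Parameter(torch.randn(1, n_meta_tokens, d_model) * 0.02)
        # Attention heads: high-resolution recall over the context.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Simplified stand-in for an SSM head (gated linear recurrence);
        # a real implementation would use a Mamba-style selective SSM.
        self.ssm_in = nn.Linear(d_model, d_model)
        self.ssm_gate = nn.Linear(d_model, d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def ssm_head(self, x: torch.Tensor) -> torch.Tensor:
        # Sequential gated recurrence: h_t = g_t * h_{t-1} + (1 - g_t) * u_t.
        u = self.ssm_in(x)
        g = torch.sigmoid(self.ssm_gate(x))
        h = torch.zeros_like(u[:, 0])
        outs = []
        for t in range(u.size(1)):
            h = g[:, t] * h + (1.0 - g[:, t]) * u[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.size(0)
        # Prepend meta tokens so attention has somewhere to attend
        # when nothing in the prompt is relevant.
        x = torch.cat([self.meta_tokens.expand(b, -1, -1), x], dim=1)
        x = self.norm(x)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        ssm_out = self.ssm_head(x)
        # The parallel heads are fused into a single representation.
        return self.fuse(torch.cat([attn_out, ssm_out], dim=-1))

if __name__ == "__main__":
    block = HybridHeadBlock()
    tokens = torch.randn(2, 16, 256)  # (batch, seq_len, d_model)
    print(block(tokens).shape)        # torch.Size([2, 24, 256]) after 8 meta tokens

The key design point the sketch illustrates is that the attention and SSM paths see the same (meta-token-augmented) input and their outputs are combined, rather than being stacked in alternating layers.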
Bio:
Shizhe Diao is a research scientist at NVIDIA Research, passionate about research on efficient training and alignment of foundation models. He completed his PhD at the Hong Kong University of Science and Technology, advised by Professor Tong Zhang. He is the main developer of LMFlow, a widely used framework for post-training adaptation of LLMs, which has received over 8k GitHub stars and won the Best Demo Paper Award at NAACL 2024. Another of his works, R-Tuning, which focuses on the alignment of LLMs, received the Outstanding Paper Award at NAACL 2024.
Slides:
https://file.notion.so/f/f/1828dad3-68b8-4436-a6df-879b76bc0a25/364cb9b1-d99e-4839-97fb-455dc4b6623b/Guest_Lecture_Small_Hybrid_Language_Model.pdf?table=block&id=14d0a3ee-5f66-8076-84f9-e50dda457522&spaceId=1828dad3-68b8-4436-a6df-879b76bc0a25&expirationTimestamp=1733011200000&signature=Gg6N7IzHodkRTYQRo03xv461xjW2iL0jYjVpMTTCbvc&downloadName=%5BGuest+Lecture%5D+Small+Hybrid+Language+Model.pdf