Sehoon Kim

Member of Technical Staff at xAI.
Contact: kssteven418 at gmail dot com
Research Interests: Efficient AI, AI Systems, Model Compression, Large Language Models

Education

Ph.D. in Computer Science

UC Berkeley, 2020 - 2024

M.S. in Computer Science

UC Berkeley, 2020 - 2023

B.S. in Electrical and Computer Engineering

Seoul National University, 2015 - 2020

About


I am a member of technical staff at xAI. My research interests include efficient AI, AI systems, and model compression, with a focus on large language models. Before joining xAI, I was a Ph.D. student at Berkeley AI Research (BAIR) at UC Berkeley, where I was fortunate to work with Prof. Kurt Keutzer. During my time at Berkeley, I was selected as an MLCommons ML and Systems Rising Star and was a finalist for the NVIDIA Graduate Fellowship in 2024. Before Berkeley, I graduated from Seoul National University, where I ranked 1st in the entire class of 2020 (overall GPA: 4.29/4.30, major GPA: 4.30/4.30). During my undergraduate years, I was honored to work with Prof. Jangwoo Kim and Prof. Byung-Gon Chun.

Honors and Awards


MLCommons ML and Systems Rising Star | 2024
One of 41 Ph.D. students selected worldwide in the field of ML and systems

NVIDIA Graduate Fellowship Program Finalist | 2024
One of 15 Ph.D. students selected worldwide in the field of computing innovation

KFAS Doctoral Study Abroad Scholarship | 2020
Around 40 students selected nationally

Kwanjeong Educational Foundation Scholarship | 2017


Selected Publications


See the Full Publications section below for the complete list.

Squeezed Attention: Accelerating Long Context Length LLM Inference

Coleman Hooper*, Sehoon Kim*, Hiva Mohammadzadeh, Monishwaran Maheswaran, June Paik, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

Preprint, 2024

An LLM Compiler for Parallel Function Calling

Sehoon Kim*, Suhong Moon*, Ryan Tabrizi, Nicholas Lee, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

International Conference on Machine Learning (ICML), 2024

SqueezeLLM: Dense-and-Sparse Quantization

Sehoon Kim*, Coleman Hooper*, Amir Gholami*, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer

International Conference on Machine Learning (ICML), 2024

Speculative Decoding with Big Little Decoder

Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W. Mahoney, Amir Gholami, Kurt Keutzer

Conference on Neural Information Processing Systems (NeurIPS), 2023

Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

Sehoon Kim*, Amir Gholami*, Albert Shaw†, Nicholas Lee†, Karttikeya Mangalam, Jitendra Malik, Michael W. Mahoney, Kurt Keutzer

Conference on Neural Information Processing Systems (NeurIPS), 2022

A Fast Post-Training Pruning Framework for Transformers

Woosuk Kwon*, Sehoon Kim*, Michael W. Mahoney, Joseph Hassoun, Kurt Keutzer, Amir Gholami

Conference on Neural Information Processing Systems (NeurIPS), 2022

Learned Token Pruning for Transformers

Sehoon Kim*, Sheng Shen*, David Thorsley*, Amir Gholami*, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer

Conference on Knowledge Discovery and Data Mining (KDD), 2022

Integer-only Zero-shot Quantization for Efficient Speech Recognition

Sehoon Kim, Amir Gholami, Zhewei Yao, Nicholas Lee, Patrick Wang, Aniruddha Nrusimha, Bohan Zhai, Tianren Gao, Michael W. Mahoney, Kurt Keutzer

International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022

I-BERT: Integer-only BERT Quantization

Sehoon Kim*, Amir Gholami*, Zhewei Yao*, Michael W. Mahoney, Kurt Keutzer

International Conference on Machine Learning (ICML, Oral), 2021

Full Publications


ETS: Efficient Tree Search for Inference-Time Scaling

Coleman Hooper, Sehoon Kim, Suhong Moon, Kerem Dilmen, Monishwaran Maheswaran, Nicholas Lee, Michael W Mahoney, Sophia Shao, Kurt Keutzer, Amir Gholami

Preprint, 2025

Code

QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache

Rishabh Tiwari*, Haocheng Xi*, Aditya Tomar*, Coleman Hooper, Sehoon Kim, Maxwell Horton, Mahyar Najibi, Michael W Mahoney, Kurt Keutzer, Amir Gholami

Preprint, 2025

Squeezed Attention: Accelerating Long Context Length LLM Inference

Coleman Hooper*, Sehoon Kim*, Hiva Mohammadzadeh, Monishwaran Maheswaran, June Paik, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

Preprint, 2024

TinyAgent: Function Calling at the Edge

Lutfi Eren Erdogan*, Nicholas Lee*, Siddharth Jha*, Sehoon Kim, Ryan Tabrizi, Suhong Moon, Coleman Hooper, Gopala Anumanchipalli, Kurt Keutzer, Amir Gholami

Empirical Methods in Natural Language Processing (EMNLP) Demo Track, 2024

Code | Blog

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami

Conference on Neural Information Processing Systems (NeurIPS), 2024

Code

Efficient and Scalable Estimation of Tool Representations in Vector Space

Suhong Moon*, Siddharth Jha*, Lutfi Eren Erdogan, Sehoon Kim, Woosang Lim, Kurt Keutzer, Amir Gholami

Preprint, 2024

Code

Characterizing Prompt Compression Methods for Long Context Inference

Siddharth Jha, Lutfi Eren Erdogan, Sehoon Kim, Kurt Keutzer, Amir Gholami

ICML Workshop on Efficient Systems for Foundation Models (Oral), 2024

Learned Best-Effort LLM Serving

Siddharth Jha, Coleman Hooper, Xiaoxuan Liu, Sehoon Kim, Kurt Keutzer

ICML Workshop on Efficient Systems for Foundation Models, 2024

LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement

Nicholas Lee*, Thanakul Wattanawong*, Sehoon Kim, Karttikeya Mangalam, Sheng Shen, Gopala Anumanchipalli, Michael W Mahoney, Kurt Keutzer, Amir Gholami

The Association for Computational Linguistics (ACL), 2024

Code

An LLM Compiler for Parallel Function Calling

Sehoon Kim*, Suhong Moon*, Ryan Tabrizi, Nicholas Lee, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

International Conference on Machine Learning (ICML), 2024

Code | Talk | LlamaIndex | LangChain

SqueezeLLM: Dense-and-Sparse Quantization

Sehoon Kim*, Coleman Hooper*, Amir Gholami*, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer

International Conference on Machine Learning (ICML), 2024

Code

AI and Memory Wall

Amir Gholami, Zhewei Yao, Sehoon Kim, Coleman Hooper, Michael W. Mahoney, Kurt Keutzer

IEEE Micro Journal Special Issue, 2024

Blog

SPEED: Speculative Pipelined Execution for Efficient Decoding

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao

NeurIPS Workshop on Efficient Natural Language and Speech Processing, 2023

Full Stack Optimization of Transformer Inference: a Survey

Sehoon Kim*, Coleman Hooper*, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, Hasan Genc, Grace Dinh, Qijing Huang, Kurt Keutzer, Michael W. Mahoney, Yakun Sophia Shao, Amir Gholami

Preprint, 2023 (Short Version at ISCA ASSYST Workshop 2023)

Speculative Decoding with Big Little Decoder

Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W. Mahoney, Amir Gholami, Kurt Keutzer

Conference on Neural Information Processing Systems (NeurIPS), 2023

Code

Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

Sehoon Kim*, Amir Gholami*, Albert Shaw†, Nicholas Lee†, Karttikeya Mangalam, Jitendra Malik, Michael W. Mahoney, Kurt Keutzer

Conference on Neural Information Processing Systems (NeurIPS), 2022

Code | NVIDIA NeMo

A Fast Post-Training Pruning Framework for Transformers

Woosuk Kwon*, Sehoon Kim*, Michael W. Mahoney, Joseph Hassoun, Kurt Keutzer, Amir Gholami

Conference on Neural Information Processing Systems (NeurIPS), 2022

Code

Learned Token Pruning for Transformers

Sehoon Kim*, Sheng Shen*, David Thorsley*, Amir Gholami*, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer

Conference on Knowledge Discovery and Data Mining (KDD), 2022

Code

Integer-only Zero-shot Quantization for Efficient Speech Recognition

Sehoon Kim, Amir Gholami, Zhewei Yao, Nicholas Lee, Patrick Wang, Aniruddha Nrusimha, Bohan Zhai, Tianren Gao, Michael W. Mahoney, Kurt Keutzer

International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022

Code

Hessian-Aware Pruning and Optimal Neural Implant

Shixing Yu*, Zhewei Yao*, Amir Gholami*, Zhen Dong*, Sehoon Kim, Michael W. Mahoney, Kurt Keutzer

Winter Conference on Applications of Computer Vision (WACV), 2022

Code

A Survey of Quantization Methods for Efficient Neural Network Inference

Amir Gholami*, Sehoon Kim*, Zhen Dong*, Zhewei Yao*, Michael W. Mahoney, Kurt Keutzer

Book Chapter: Low-Power Computer Vision: Improving the Efficiency of Artificial Intelligence, 2021

WindTunnel: Towards Differentiable ML Pipelines Beyond a Single Model

Gyeong-In Yu, Saeed Amizadeh, Sehoon Kim, Artidoro Pagnoni, Ce Zhang, Byung-Gon Chun, Markus Weimer, Matteo Interlandi

International Conference on Very Large Data Bases (VLDB), 2021

Terra: Imperative-Symbolic Co-Execution of Imperative Deep Learning Programs

Taebum Kim, Eunji Jeong, Geon-Woo Kim, Yunmo Koo, Sehoon Kim, Gyeong-In Yu, Byung-Gon Chun

Conference on Neural Information Processing Systems (NeurIPS), 2021

I-BERT: Integer-only BERT Quantization

Sehoon Kim*, Amir Gholami*, Zhewei Yao*, Michael W. Mahoney, Kurt Keutzer

International Conference on Machine Learning (ICML, Oral), 2021

Code | HuggingFace

Memory-Efficient Hardware Performance Counters with Approximate-Counting Algorithms

Jingyi Xu, Sehoon Kim, Borivoje Nikolic, Yakun Sophia Shao

International Symposium on Performance Analysis of Systems and Software (ISPASS), 2021