
Sehoon Kim
Member of Technical Staff at xAI.
Contact: kssteven418 at gmail dot com
Research Interests:
Efficient AI, AI Systems, Model Compression, Large Language Models
Education

Ph.D. in Computer Science
UC Berkeley, 2020 - 2024

M.S. in Computer Science
UC Berkeley, 2020 - 2023

B.S. in Electrical and Computer Engineering
Seoul National University, 2015 - 2020
About
I am a member of technical staff at xAI. My research interests include efficient AI, AI systems, and model compression, with a focus on large language models. Before joining xAI, I was a Ph.D. student at Berkeley AI Research (BAIR) at UC Berkeley, where I was fortunate to work with Prof. Kurt Keutzer. During my time at Berkeley, I was selected as an MLCommons ML and Systems Rising Star and was a finalist for the NVIDIA Graduate Fellowship in 2024. Before Berkeley, I graduated from Seoul National University, where I ranked 1st in the class of 2020 (overall GPA: 4.29/4.30, major GPA: 4.30/4.30). During my undergraduate years, I was honored to work with Profs. Jangwoo Kim and Byung-Gon Chun.
Honors and Awards
MLCommons ML and Systems Rising Star | 2024
One of 41 Ph.D. students selected worldwide in the field of ML and systems
NVIDIA Graduate Fellowship Program Finalist | 2024
One of 15 Ph.D. students selected worldwide in the field of computing innovation
KFAS Doctoral Study Abroad Scholarship | 2020
Around 40 students selected nationally
Kwanjeong Educational Foundation Scholarship | 2017
Selected Publications
Click to view the full list of publications.

Squeezed Attention: Accelerating Long Context Length LLM Inference
Coleman Hooper*, Sehoon Kim*, Hiva Mohammadzadeh, Monishwaran Maheswaran, June Paik, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
Preprint, 2024

An LLM Compiler for Parallel Function Calling
Sehoon Kim*, Suhong Moon*, Ryan Tabrizi, Nicholas Lee, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
International Conference on Machine Learning (ICML), 2024

SqueezeLLM: Dense-and-Sparse Quantization
Sehoon Kim*, Coleman Hooper*, Amir Gholami*, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer
International Conference on Machine Learning (ICML), 2024

Speculative Decoding with Big Little Decoder
Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W. Mahoney, Amir Gholami, Kurt Keutzer
Conference on Neural Information Processing Systems (NeurIPS), 2023

Squeezeformer: An Efficient Transformer for Automatic Speech Recognition
Sehoon Kim*, Amir Gholami*, Albert Shaw†, Nicholas Lee†, Karttikeya Mangalam, Jitendra Malik, Michael W. Mahoney, Kurt Keutzer
Conference on Neural Information Processing Systems (NeurIPS), 2022

A Fast Post-Training Pruning Framework for Transformers
Woosuk Kwon*, Sehoon Kim*, Michael W. Mahoney, Joseph Hassoun, Kurt Keutzer, Amir Gholami
Conference on Neural Information Processing Systems (NeurIPS), 2022

Learned Token Pruning for Transformers
Sehoon Kim*, Sheng Shen*, David Thorsley*, Amir Gholami*, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer
Conference on Knowledge Discovery and Data Mining (KDD), 2022

Integer-only Zero-shot Quantization for Efficient Speech Recognition
Sehoon Kim, Amir Gholami, Zhewei Yao, Nicholas Lee, Patrick Wang, Anirudda Nrusimha, Bohan Zhai, Tianren Gao, Michael W. Mahoney, Kurt Keutzer
International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022

I-BERT: Integer-only BERT Quantization
Sehoon Kim*, Amir Gholami*, Zhewei Yao*, Michael W. Mahoney, Kurt Keutzer
International Conference on Machine Learning (ICML, Oral), 2021
Full Publications
ETS: Efficient Tree Search for Inference-Time Scaling
Coleman Hooper, Sehoon Kim, Suhong Moon, Kerem Dilmen, Monishwaran Maheswaran, Nicholas Lee, Michael W Mahoney, Sophia Shao, Kurt Keutzer, Amir Gholami
Preprint, 2025

QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache
Rishabh Tiwari*, Haocheng Xi*, Aditya Tomar*, Coleman Hooper, Sehoon Kim, Maxwell Horton, Mahyar Najibi, Michael W Mahoney, Kurt Keutzer, Amir Gholami
Preprint, 2025

Squeezed Attention: Accelerating Long Context Length LLM Inference
Coleman Hooper*, Sehoon Kim*, Hiva Mohammadzadeh, Monishwaran Maheswaran, June Paik, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
Preprint, 2024

TinyAgent: Function Calling at the Edge
Lutfi Eren Erdogan*, Nicholas Lee*, Siddharth Jha*, Sehoon Kim, Ryan Tabrizi, Suhong Moon, Coleman Hooper, Gopala Anumanchipalli, Kurt Keutzer, Amir Gholami
Empirical Methods in Natural Language Processing (EMNLP) Demo Track, 2024

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami
Conference on Neural Information Processing Systems (NeurIPS), 2024

Efficient and Scalable Estimation of Tool Representations in Vector Space
Suhong Moon*, Siddharth Jha*, Lutfi Eren Erdogan, Sehoon Kim, Woosang Lim, Kurt Keutzer, Amir Gholami
Preprint, 2024

Characterizing Prompt Compression Methods for Long Context Inference
Siddharth Jha, Lutfi Eren Erdogan, Sehoon Kim, Kurt Keutzer, Amir Gholami
ICML Workshop on Efficient Systems for Foundation Models (Oral), 2024

Learned Best-Effort LLM Serving
Siddharth Jha, Coleman Hooper, Xiaoxuan Liu, Sehoon Kim, Kurt Keutzer
ICML Workshop on Efficient Systems for Foundation Models, 2024

LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement
Nicholas Lee*, Thanakul Wattanawong*, Sehoon Kim, Karttikeya Mangalam, Sheng Shen, Gopala Anumanchipalli, Michael W Mahoney, Kurt Keutzer, Amir Gholami
The Association for Computational Linguistics (ACL), 2024

An LLM Compiler for Parallel Function Calling
Sehoon Kim*, Suhong Moon*, Ryan Tabrizi, Nicholas Lee, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
International Conference on Machine Learning (ICML), 2024

SqueezeLLM: Dense-and-Sparse Quantization
Sehoon Kim*, Coleman Hooper*, Amir Gholami*, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer
International Conference on Machine Learning (ICML), 2024

AI and Memory Wall
Amir Gholami, Zhewei Yao, Sehoon Kim, Coleman Hooper, Michael W. Mahoney, Kurt Keutzer
IEEE MICRO Journal Special Issue, 2024

SPEED: Speculative Pipelined Execution for Efficient Decoding
Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao
NeurIPS Workshop on Efficient Natural Language and Speech Processing, 2023

Full Stack Optimization of Transformer Inference: a Survey
Sehoon Kim*, Coleman Hooper*, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, Hasan Genc, Grace Dinh, Qijing Huang, Kurt Keutzer, Michael W. Mahoney, Yakun Sophia Shao, Amir Gholami
Preprint, 2023 (Short Version at ISCA ASSYST Workshop 2023)

Speculative Decoding with Big Little Decoder
Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W. Mahoney, Amir Gholami, Kurt Keutzer
Conference on Neural Information Processing Systems (NeurIPS), 2023

Squeezeformer: An Efficient Transformer for Automatic Speech Recognition
Sehoon Kim*, Amir Gholami*, Albert Shaw†, Nicholas Lee†, Karttikeya Mangalam, Jitendra Malik, Michael W. Mahoney, Kurt Keutzer
Conference on Neural Information Processing Systems (NeurIPS), 2022

A Fast Post-Training Pruning Framework for Transformers
Woosuk Kwon*, Sehoon Kim*, Michael W. Mahoney, Joseph Hassoun, Kurt Keutzer, Amir Gholami
Conference on Neural Information Processing Systems (NeurIPS), 2022

Learned Token Pruning for Transformers
Sehoon Kim*, Sheng Shen*, David Thorsley*, Amir Gholami*, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer
Conference on Knowledge Discovery and Data Mining (KDD), 2022

Integer-only Zero-shot Quantization for Efficient Speech Recognition
Sehoon Kim, Amir Gholami, Zhewei Yao, Nicholas Lee, Patrick Wang, Anirudda Nrusimha, Bohan Zhai, Tianren Gao, Michael W. Mahoney, Kurt Keutzer
International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022

Hessian-Aware Pruning and Optimal Neural Implant
Shixing Yu*, Zhewei Yao*, Amir Gholami*, Zhen Dong*, Sehoon Kim, Michael W. Mahoney, Kurt Keutzer
Winter Conference on Applications of Computer Vision (WACV), 2022

A Survey of Quantization Methods for Efficient Neural Network Inference
Amir Gholami*, Sehoon Kim*, Zhen Dong*, Zhewei Yao*, Michael W. Mahoney, Kurt Keutzer
Book Chapter: Low-Power Computer Vision: Improving the Efficiency of Artificial Intelligence, 2021

WindTunnel: Towards Differentiable ML Pipelines Beyond a Single Model
Gyeong-In Yu, Saeed Amizadeh, Sehoon Kim, Artidoro Pagnoni, Ce Zhang, Byung-Gon Chun, Markus Weimer, Matteo Interlandi
International Conference on Very Large Data Bases (VLDB), 2021

Terra: Imperative-Symbolic Co-Execution of Imperative Deep Learning Programs
Taebum Kim, Eunji Jeong, Geon-Woo Kim, Yunmo Koo, Sehoon Kim, Gyeong-In Yu, Byung-Gon Chun
Conference on Neural Information Processing Systems (NeurIPS), 2021

I-BERT: Integer-only BERT Quantization
Sehoon Kim*, Amir Gholami*, Zhewei Yao*, Michael W. Mahoney, Kurt Keutzer
International Conference on Machine Learning (ICML, Oral), 2021

Memory-Efficient Hardware Performance Counters with Approximate-Counting Algorithms
Jingyi Xu, Sehoon Kim, Borivoje Nikolic, Yakun Sophia Shao
International Symposium on Performance Analysis of Systems and Software (ISPASS), 2021