Seattle · Zoom · ML Engineer · LLM Systems

Hangrui Cao

Machine Learning Engineer · LLM Inference & Systems

I am a Machine Learning Engineer at Zoom in Seattle, where I work on LLM inference systems — kernel-level optimization, speculative decoding, deployment auto-tuning for vLLM, and the infrastructure behind Zoom's AI services across ASR, translation, and reasoning.

Previously, I worked on ML inference at AWS Neuron (Amazon's Trainium accelerator team) and on distributed systems at Amazon. I received my M.S. in Computational Data Science from Carnegie Mellon University, and bachelor's degrees from the University of Michigan and Shanghai Jiao Tong University.

I contribute to MLC-AI (WebLLM, MLC-LLM), and previously contributed to the CMU Catalyst Lab and Intel Analytics-Zoo. My research interests span LLM inference systems, speculative decoding, ML compilation on custom accelerators, and federated learning.

Google Scholar · Email · GitHub · LinkedIn · Twitter

Recent

News & updates

2026 "Humanity's Last Exam" published in Nature (vol. 649, pp. 1139–1146) — an expert-level academic benchmark for evaluating frontier AI capabilities.
2024 · 12 E-Tamba — efficient Transformer→Mamba layer transplantation — presented at the NeurIPS 2024 Workshop on Fine-Tuning in Modern ML.
2024 · 12 Co-authored the WebLLM paper (MLC-AI), a high-performance in-browser LLM inference engine, released on arXiv.
2024 · 06 Federated client-selection work extended to IEEE Transactions on Mobile Computing.
2024 · 05 Graduated from Carnegie Mellon University with an M.S. in Computational Data Science (System Track, 4.0 / 4.0). Served as TA for Advanced Cloud Computing.

Publications

Selected research

A complete list is available on Google Scholar. ✱ denotes equal contribution.

A Benchmark of Expert-level Academic Questions to Assess AI Capabilities

L. Phan, A. Gatti, Z. Han, …, H. Cao, … et al.

Nature 649, pp. 1139–1146 · 2026 · Nature · Benchmark

Also known as Humanity's Last Exam. A 3,000-question benchmark spanning expert-level academic domains, designed to be adversarial to frontier LLMs at the limit of human knowledge.

arXiv Project

WebLLM: A High-Performance In-Browser LLM Inference Engine

C. F. Ruan, Y. Qin, X. Zhou, R. Lai, H. Jin, Y. Dong, B. Hou, M.-S. Yu, Y. Zhai, S. Agarwal, H. Cao, S. Feng, T. Chen

CoRR (arXiv:2412.15803) · MLC-AI · 2024 · MLSys

A high-performance LLM inference engine that runs entirely in the browser, leveraging WebGPU for hardware acceleration. Brings privacy-preserving, serverless LLM inference to web applications without any backend infrastructure.

arXiv Code Demo

Contextual Client Selection for Efficient Federated Learning over Edge Devices

Q. Pan✱, H. Cao✱, Y. Zhu, J. Liu, B. Li

IEEE Transactions on Mobile Computing · 2024 · Journal

Extended journal version of our FL-AAAI work. A contextual-bandit-based client selection method for federated learning over heterogeneous edge devices, with formal regret analysis and improvements over Oort in convergence speed and accuracy.

IEEE Preprint

E-Tamba: Efficient Transformer-Mamba Layer Transplantation

D. Peng, H. Cao

NeurIPS 2024 Workshop on Fine-Tuning in Modern ML: Principles & Scalability

A surgical layer-transplantation procedure that converts attention layers in pretrained Transformers into Mamba-style state-space layers. It gains sub-quadratic inference and ~3× lower inference memory while preserving most of the original model's quality, with only 0.9 B tokens of fine-tuning.

Scholar

Birds of a Feather Help: Context-aware Client Selection for Federated Learning

H. Cao, Q. Pan, Y. Zhu, J. Liu

Int'l Workshop on Trustable, Verifiable & Auditable Federated Learning, AAAI 2022 Oral

A novel neural combinatorial contextual bandit (NCCB) intelligently selects clients in each federated round while preserving privacy, surpassing the state-of-the-art Oort in both convergence speed and final accuracy.

Paper Venue

Automatic Image Colorization with Convolutional Neural Networks and Generative Adversarial Networks

C. Qiu✱, H. Cao✱, Q. Ren✱, R. Li✱, Y. Qiu✱

CoRR (arXiv) · 2020

GAN-based image colorization compared against classification-based approaches, evaluated by PSNR and SSIM.

arXiv Code

Research

Selected projects

Jan 2024 — Now

WebLLM & MLC-LLM

Contributed to the TVM-based MLC-LLM / WebLLM projects (>10k stars). Compiler-level optimizations for LLaMA, Gemma, Phi, Qwen; paged KV cache for memory-efficient inference; WebGPU in-browser execution with IndexedDB client-side caching for privacy-preserving on-device LLMs.

WebLLM MLC-LLM

Jan — Jun 2024

E-Tamba · Transformer→Mamba Transplantation

Resource-efficient framework constructing hybrid Transformer–SSM architectures via layer-importance-guided transplantation. ~3× lower inference memory with only 0.9 B fine-tuning tokens, preserving long-context performance.

Scholar

Sep — Dec 2023

Reinforcement-Learning LLM Router

An RL-based LLM router using contextualized query vectors and PPO to select among different quantized models, balancing quality vs. latency vs. cost.

Jan 2021 — Jan 2022

Driver-Behavior Classification with Bayesian CNNs

UMTRI Multidisciplinary Development Program. Lightweight CNN (90.2% accuracy), Java labeling software, Bayesian CNN for weight uncertainty, and an end-to-end eye-gaze collector with PyGaze + React.

Slides

Dec 2021 — Jun 2022

Collaborative Mobile Super-Resolution

Lightweight collaborative super-resolution for on-device video inference (TensorFlow). 33.8 dB PSNR on 2K gaming datasets (7.8% over BasicVSR); optimized Android inference pipelines for an 11.2% mobile deployment efficiency gain.

Aug — Dec 2021

Large-scale GitHub Archive Analysis

Constructed user-interaction graphs from GitHub Archive event streams with Spark; trained an unsupervised GraphSAGE model to embed contributor behavior and uncover correlations between linguistic / emoji patterns and group productivity.

Spring 2022

Real-time Carbon Emission for Low-Carbon Buildings

Reinforcement-learning controller for building emissions; real-time dashboard backed by a streaming database.

Demo

2021

Email Voice Assistant

React + Flask interface for voice-driven email; 7.8% WER speech-to-text with smart-reply via Dialogflow / Rasa.

Code

Software

Open-source contributions

@mlc-ai

MLC-AI · WebLLM

In-browser LLM inference via WebGPU. Compiler-driven model deployment to web, mobile, and edge (10k+ stars).

@acai-systems

CMU Catalyst Group

Systems-for-AI research group at Carnegie Mellon working on efficient model serving and training.

@analytics-zoo

Intel Analytics-Zoo

Unified analytics + AI platform for Apache Spark and Ray. Contributed during a 2021 Intel internship.

@DiegoCao

Personal Repositories

Public repos across ML systems, web infrastructure, and research codebases.

Stack

Tools I work with

LLM Inference

vLLM · SGLang · TensorRT-LLM · CuTe DSL · Cutile · Triton · FlashAttention · TVM / MLC-LLM

ML & Systems

PyTorch · TensorFlow · MLIR · cuDNN · MPI · OpenMP · NCCL

Languages

C++ · Python · CUDA · Rust · Go · Java · TypeScript · Scala · JavaScript · SQL · Bash

Cloud & Infra

AWS · Kubernetes · Docker · Terraform · Spark · Step Functions · Smithy · Azure · GCP

Data

PostgreSQL · MongoDB · Redis · Kafka · Cassandra · BigQuery · Redshift · InfluxDB · TimescaleDB

Academic

Education & training

2022 — May 2024

Carnegie Mellon University

M.S. Computational Data Science · System Track · School of Computer Science · GPA 4.0 / 4.0 · TA, Advanced Cloud Computing

Distributed Systems · Advanced Cloud Computing · Parallel Computer Architecture & Programming · Databases & Management Systems · Compiler · Storage Systems · Advanced NLP · Machine Learning · Search Engine · Computer Systems

2020 — May 2022

University of Michigan, Ann Arbor

B.S.E. Computer Science · Minor in Mathematics · GPA 3.978 / 4.00 · Dean's List · University Honors

Data Structures & Algorithms · Computer Organization · Computer Networks · Operating Systems · Computing Systems · DBMS · Computer Foundation · Machine Learning · Computer Vision · NLP

2018 — Aug 2020

Shanghai Jiao Tong University

B.S.E. Electrical & Computer Engineering · UM-SJTU Dual Degree · GPA 3.89 / 4.00 · Fan Xuji Scholarship · Undergrad Excellence

2020 · Winter

Ritsumeikan University

Winter Exchange Program · Kyoto, Japan

Reach out

Get in touch

Happy to chat about ML infrastructure, LLM inference, research collaborations, or open-source contributions. Email is the fastest way:

hangrui@umich.edu
hangruic@alumni.cmu.edu
