Hangrui Cao

Machine Learning Engineer  ·  LLM Inference & Systems

I am a Machine Learning Engineer at Zoom in Seattle, where I work on LLM inference systems — kernel-level optimization, speculative decoding, deployment auto-tuning for vLLM, and the infrastructure behind Zoom's AI services across ASR, translation, and reasoning.

Previously, I worked on ML inference at AWS Neuron (Amazon's Trainium accelerator team) and on distributed systems at Amazon. I received my M.S. in Computational Data Science from Carnegie Mellon University, and bachelor's degrees from the University of Michigan and Shanghai Jiao Tong University.

I contribute to MLC-AI (WebLLM, MLC-LLM) and previously contributed to the CMU Catalyst Lab and Intel Analytics-Zoo. My research interests span LLM inference systems, speculative decoding, ML compilation on custom accelerators, and federated learning.


Selected research

A complete list is available on Google Scholar.

Contextual Client Selection for Efficient Federated Learning over Edge Devices

Q. Pan, H. Cao, Y. Zhu, J. Liu, B. Li

IEEE Transactions on Mobile Computing · 2024 · Journal

Extended journal version of our FL-AAAI work. A contextual-bandit-based client selection method for federated learning over heterogeneous edge devices, with formal regret analysis and improvements over Oort in convergence speed and accuracy.

E-Tamba: Efficient Transformer-Mamba Layer Transplantation

D. Peng, H. Cao

NeurIPS 2024 Workshop on Fine-Tuning in Modern ML: Principles & Scalability · Workshop

A surgical layer-transplantation procedure that converts attention layers in pretrained Transformers into Mamba-style state-space layers, replacing quadratic attention with sub-quadratic inference. It preserves most of the original model's quality with only 0.9 B tokens of fine-tuning and ~3× lower inference memory.

Birds of a Feather Help: Context-aware Client Selection for Federated Learning

H. Cao, Q. Pan, Y. Zhu, J. Liu

Int'l Workshop on Trustable, Verifiable & Auditable Federated Learning, AAAI 2022 · Oral

A novel neural combinatorial contextual bandit (NCCB) intelligently selects clients in each federated round while preserving privacy, surpassing the state-of-the-art Oort in both convergence speed and final accuracy.

Automatic Image Colorization with Convolutional Neural Networks and Generative Adversarial Networks

C. Qiu, H. Cao, Q. Ren, R. Li, Y. Qiu

CoRR (arXiv) · 2020

GAN-based image colorization compared against classification-based approaches, evaluated by PSNR and SSIM.


Selected projects

Jan 2024 — Now

WebLLM & MLC-LLM

Contributed to the TVM-based MLC-LLM / WebLLM projects (>10k stars). Compiler-level optimizations for LLaMA, Gemma, Phi, Qwen; paged KV cache for memory-efficient inference; WebGPU in-browser execution with IndexedDB client-side caching for privacy-preserving on-device LLMs.

Jan — Jun 2024

E-Tamba · Transformer→Mamba Transplantation

Resource-efficient framework constructing hybrid Transformer–SSM architectures via layer-importance-guided transplantation. ~3× lower inference memory with only 0.9 B fine-tuning tokens, preserving long-context performance.

Sep — Dec 2023

Reinforcement-Learning LLM Router

An RL-based LLM router using contextualized query vectors and PPO to select among different quantized models, balancing quality vs. latency vs. cost.

Jan 2021 — Jan 2022

Driver-Behavior Classification with Bayesian CNNs

UMTRI Multidisciplinary Development Program. Lightweight CNN (90.2% accuracy), Java labeling software, Bayesian CNN for weight uncertainty, and an end-to-end eye-gaze collector built with PyGaze + React.

Dec 2021 — Jun 2022

Collaborative Mobile Super-Resolution

Lightweight collaborative super-resolution for on-device video inference (TensorFlow). Reached 33.8 dB PSNR on 2K gaming datasets, 7.8% higher than BasicVSR, and optimized Android inference pipelines for an 11.2% gain in mobile deployment efficiency.

Aug — Dec 2021

Large-scale GitHub Archive Analysis

Constructed user-interaction graphs from GitHub Archive event streams with Spark; trained an unsupervised GraphSAGE model to embed contributor behavior and uncover correlations between linguistic / emoji patterns and group productivity.

Spring 2022

Real-time Carbon Emission for Low-Carbon Buildings

Reinforcement-learning controller for building emissions; real-time dashboard backed by a streaming database.

2021

Email Voice Assistant

React + Flask interface for voice-driven email; 7.8% WER speech-to-text with smart-reply via Dialogflow / Rasa.



Tools I work with

LLM Inference
vLLM · SGLang · TensorRT-LLM · CuTe DSL · Cutile · Triton · FlashAttention · TVM / MLC-LLM

ML & Systems
PyTorch · TensorFlow · MLIR · cuDNN · MPI · OpenMP · NCCL

Languages
C++ · Python · CUDA · Rust · Go · Java · TypeScript · Scala · JavaScript · SQL · Bash

Cloud & Infra
AWS · Kubernetes · Docker · Terraform · Spark · Step Functions · Smithy · Azure · GCP

Data
PostgreSQL · MongoDB · Redis · Kafka · Cassandra · BigQuery · Redshift · InfluxDB · TimescaleDB

Education & training

2022 — May 2024

Carnegie Mellon University

M.S. Computational Data Science · System Track · School of Computer Science · GPA 4.0 / 4.0 · TA, Advanced Cloud Computing

Coursework: Distributed Systems · Advanced Cloud Computing · Parallel Computer Architecture & Programming · Databases & Management Systems · Compiler · Storage Systems · Advanced NLP · Machine Learning · Search Engine · Computer Systems

2020 — May 2022

University of Michigan, Ann Arbor

B.S.E. Computer Science · Minor in Mathematics · GPA 3.978 / 4.00 · Dean's List · University Honors

Coursework: Data Structures & Algorithms · Computer Organization · Computer Networks · Operating Systems · Computing Systems · DBMS · Computer Foundation · Machine Learning · Computer Vision · NLP

2018 — Aug 2020

Shanghai Jiao Tong University

B.S.E. Electrical & Computer Engineering · UM-SJTU Dual Degree · GPA 3.89 / 4.00 · Fan Xuji Scholarship · Undergrad Excellence

2020 · Winter

Ritsumeikan University

Winter Exchange Program · Kyoto, Japan


Get in touch

Happy to chat about ML infrastructure, LLM inference, research collaborations, or open-source contributions. Email is the fastest way:


hangruic@alumni.cmu.edu