Research Interests
GPU Scheduling and Execution Optimization for LLM Inference
This research focuses on inference execution and scheduling techniques that maximize GPU utilization and throughput for large-scale LLM inference. Using token length prediction and continuous batching, we classify requests by their execution characteristics and optimize GPU scheduling and runtime behavior accordingly.
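As a rough illustration of how length prediction and continuous batching can interact, the following toy simulation admits requests shortest-predicted-first and refills batch slots as soon as a request finishes. It is only a sketch under stated assumptions: the `Request` class, the `run_continuous_batching` function, the shortest-first admission policy, and the use of the predicted length as the true output length are all hypothetical simplifications, not a description of an actual scheduler.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    predicted_tokens: int  # hypothetical output-length prediction
    generated: int = 0     # tokens decoded so far


def run_continuous_batching(requests, max_batch_size):
    """Simulate decode iterations; return (finish_order, total_steps).

    Assumption: the predicted length is treated as the true output length.
    """
    # Assumed admission policy: shortest predicted output first.
    waiting = deque(sorted(requests, key=lambda r: r.predicted_tokens))
    running, finished, steps = [], [], 0

    while waiting or running:
        # Continuous batching: refill free slots before every iteration
        # instead of waiting for the whole batch to drain.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())

        # One decode iteration: every running request emits one token.
        for r in running:
            r.generated += 1
        steps += 1

        # Retire finished requests so their slots free up immediately.
        still_running = []
        for r in running:
            if r.generated >= r.predicted_tokens:
                finished.append(r.req_id)
            else:
                still_running.append(r)
        running = still_running

    return finished, steps


order, steps = run_continuous_batching(
    [Request("a", 3), Request("b", 1), Request("c", 2)], max_batch_size=2
)
print(order, steps)
```

In a real system the prediction would be imperfect, so a scheduler along these lines would also need a way to handle requests that run longer than predicted.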
Cloud-based LLM Serving and Infrastructure-Aware KV Cache Management
This line of research studies cost-efficient LLM serving architectures that meet service-level objectives (SLOs) by optimizing the placement, migration, and eviction of KV caches across hierarchical cloud infrastructures composed of GPU, CPU, and SSD resources.
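One simple way to picture placement, migration, and eviction across such a hierarchy is an LRU cache whose overflow cascades from faster to slower tiers. The sketch below is illustrative only: the `TieredKVCache` class, its per-tier entry-count capacities, and the LRU demotion and promote-on-access policies are assumptions made for the example, and stored values are assumed to be non-`None`.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy tiered cache: LRU entries are demoted to the next slower tier."""

    def __init__(self, capacities):
        # capacities: dict in fast-to-slow order, e.g. {"gpu": 1, "cpu": 1, "ssd": 1}
        # (capacity counted in entries; a hypothetical simplification).
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities.items()]

    def _remove(self, key):
        # Remove the key from whichever tier holds it and return its value.
        for _, _, store in self.tiers:
            if key in store:
                return store.pop(key)
        return None

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return  # fell off the slowest tier: evicted entirely
        _, cap, store = self.tiers[level]
        store[key] = value
        store.move_to_end(key)  # mark as most recently used
        if len(store) > cap:
            # Migrate the least recently used entry one tier down.
            old_key, old_value = store.popitem(last=False)
            self._insert(level + 1, old_key, old_value)

    def put(self, key, value):
        self._remove(key)
        self._insert(0, key, value)

    def get(self, key):
        value = self._remove(key)
        if value is not None:
            self._insert(0, key, value)  # promote to the fastest tier on access
        return value

    def location(self, key):
        for name, _, store in self.tiers:
            if key in store:
                return name
        return None


cache = TieredKVCache({"gpu": 1, "cpu": 1, "ssd": 1})
for session in ["s1", "s2", "s3", "s4"]:
    cache.put(session, f"kv-blocks-of-{session}")
print([cache.location(s) for s in ["s1", "s2", "s3", "s4"]])
```

A cost- and SLO-aware policy would replace the plain LRU rule with decisions that weigh transfer cost, reuse likelihood, and latency targets.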
Vector Database and Storage Acceleration for RAG
This work targets VectorDB workloads for retrieval-augmented generation (RAG), accelerating the storage and retrieval of large-scale embedding data. We investigate data access and execution scheduling techniques that improve GPU utilization across the end-to-end retrieval–inference pipeline.
Cross-Layer LLM System Design across Memory, Storage, VectorDB, and GPU
We explore cross-layer system designs that integrate the LLM runtime, KV cache, VectorDB, file system, storage, and GPU scheduler to simultaneously improve throughput and tail latency for cloud-based RAG-enabled LLM services.