Todd Mowry

Todd Mowry, professor, Computer Science Department, CMU



Office 9113 Gates and Hillman Centers


Phone (412) 268-3725

Computer Science Department

Administrative Support Person
Marcella Baker

Research Interests
Computer Architecture

Sam Arch
Patrick Coppock
Hongyi Jin
Ruihang Lai

CSD Courses Taught

15-418 - Spring 2024

15-618 - Spring 2024

Research/Teaching Statement

The goal of my research is to dramatically boost the performance of future microprocessor-based systems. To accomplish this, we exploit various forms of parallelism through a combination of novel architectural, compiler, and operating system support. In particular, we have been focusing on the opportunities and challenges created by two important VLSI technology trends that are expected to reshape computer systems over the next decade: the potential for single-chip multiprocessing due to higher levels of single-chip integration, and the need to tolerate off-chip latency as the gap between processor speed and the speed of memory and I/O continues to widen.

Single-Chip Multiprocessing: The STAMPede Project. As advances in integrated circuit technology continue to provide more and more transistors on a chip, processor architects are faced with the pleasant challenge of finding the best way to translate these additional resources into improved performance. One of the more compelling options is to integrate multiple processors onto the same chip. While this will certainly increase computational throughput, it will only reduce the execution time of a given application if that application can be run in parallel. Hence the key question is: how do we convert the applications that we care about into parallel programs? Expecting programmers to write only parallel programs from now on is unrealistic. Instead, the preferred solution would be for the compiler to parallelize programs automatically. Unfortunately, compilers have so far been successful only at parallelizing the numeric applications commonly run on supercomputers. For single-chip multiprocessing to have an impact on the majority of users, we must also find a way to automatically parallelize the non-numeric applications (e.g., spreadsheets, web software, graphics codes, etc.) which account for the bulk of the software run on commercial microprocessors.

Based on our preliminary studies, we believe that a breakthrough in our ability to automatically parallelize non-numeric applications may be possible through "thread-level data speculation," a technique that allows the compiler to safely parallelize applications in cases where it believes that dependences are unlikely but cannot statically prove that they do not exist. To accomplish this, we add modest hardware support to track data dependence violations at run time and alert the software so that it can recover appropriately. Developing the architectural, compiler, and operating system support necessary to turn this potential into a reality is the goal of the STAMPede (Single-chip Tightly-coupled Architecture for MultiProcessing) project.

Coping with Large Latencies. Processor speeds continue to increase far more rapidly than those of off-chip components such as DRAM, disks, and networks, largely due to physical limitations such as distance and the speed of light. The challenge presented by this trend is that, from the processor's perspective, the latency of main memory and I/O is growing at a dramatic rate and threatens to become an increasingly important performance bottleneck. The good news, however, is that the bandwidth of these off-chip devices has been improving through innovations such as synchronous (i.e., pipelined) DRAM, disk arrays, and fiber-optic networks. Therefore, we are exploring new ways in which the compiler (with varying degrees of help from the hardware and the operating system) can use prefetching and other techniques to intelligently trade increased bandwidth consumption for reduced overall latency. Recent work in this area has included prefetching pointer-based codes, prefetching to hide disk latency in out-of-core numeric applications, and hiding network communication latency in workstation clusters.

Publications
ED-Batch: Efficient Automatic Batching of Dynamic Neural Networks via Learned Finite State Machines

2023 • Proceedings of Machine Learning Research • 202:5162-5177
Chen S, Fegade P, Chen T, Gibbons PB, Mowry TC