Thesis Proposal - Mingjie Sun

Location:
In Person - Traffic21 Classroom, Gates Hillman 6501

Speaker:
MINGJIE SUN, Ph.D. Student, Computer Science Department, Carnegie Mellon University
https://eric-mingjie.github.io/

The Transformer is a neural network architecture centered on the self-attention mechanism. In recent years, it has become the de facto architecture for deep learning, underpinning models such as Large Language Models (LLMs) and Vision Transformers (ViTs). However, these models, with millions to billions of parameters, remain largely opaque, and their mechanisms are difficult to interpret. As their real-world applications grow, a deep understanding of their internal representations is essential for using and improving them effectively.

In this work, we closely examine the activation landscape in Transformers and demonstrate that understanding its intriguing activation phenomena has practical implications. First, we identify a fundamental limitation of the well-established magnitude pruning method: it fails to account for features with large activations in large-scale Transformers. Leveraging this insight, we develop a simple and effective pruning approach. Second, we discover and study the presence of a very small number of activations with extremely large magnitudes, which we call massive activations. We investigate the role of massive activations in Transformers and show how they are fundamentally connected to the self-attention mechanism. Finally, we discuss proposed extensions of this work, focusing primarily on developing a unified framework for LLM compression through a principled investigation of existing work.

Thesis Committee:

J. Zico Kolter (Chair)
Graham Neubig
Aditi Raghunathan
Kaiming He (Massachusetts Institute of Technology)
