A standard transformer architecture contains two main blocks: multi-head attention (MHA) and a feed-forward network (FFN).
In MHA, the Query (Q), Key (K), and Value (V) matrices are derived from the input through learned linear transformations. The attention mechanism computes a weighted sum of the values based on the similarity between queries and keys, allowing the model to focus on relevant parts of the input sequence. The process can be summarized as follows:

1. Linear Transformations: The input $X \in \mathbb{R}^{n \times d_{\text{model}}}$ is linearly transformed into query, key, and value matrices:
   - $Q = XW^Q$ with $W^Q \in \mathbb{R}^{d_{\text{model}} \times h d_k}$
   - $K = XW^K$ with $W^K \in \mathbb{R}^{d_{\text{model}} \times h d_k}$
   - $V = XW^V$ with $W^V \in \mathbb{R}^{d_{\text{model}} \times h d_v}$

   where $d_{\text{model}}$ is the model dimension, $h$ is the number of attention heads, and $d_k$ and $d_v$ are the dimensions of the key and value vectors, respectively.
2. Multi-Head Attention: The attention mechanism is applied to the queries, keys, and values:
   - $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$
   - Reshape $Q, K$ to $\mathbb{R}^{n \times h \times d_k}$ and $V$ to $\mathbb{R}^{n \times h \times d_v}$, and then transpose them to $\mathbb{R}^{h \times n \times d_k}$ and $\mathbb{R}^{h \times n \times d_v}$.
   - The multi-head attention jointly extracts information from diverse subspaces by projecting the input into multiple subspaces and computing attention in each subspace.
   - The results are then concatenated: $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O$, where $W^O$ is a learned linear transformation.
The FFN module transforms the features of each token independently, typically consisting of two linear layers with a non-linear activation function in between: $\mathrm{FFN}(x) = \sigma(xW_1 + b_1)W_2 + b_2$, where $b_1$ and $b_2$ are bias vectors.
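To make the two blocks concrete, here is a minimal PyTorch sketch of the MHA and FFN modules described above (the class names, the ReLU activation, and the assumption $d_k = d_v = d_{\text{model}}/h$ are illustrative choices, not taken from the papers below):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MHA(nn.Module):
    """Multi-head attention: project, split into heads, attend, concatenate."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        # Learned linear transformations W^Q, W^K, W^V and output projection W^O
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, n, _ = x.shape
        # Reshape to (B, n, h, d_k) and transpose to (B, h, n, d_k)
        q = self.W_q(x).view(B, n, self.h, self.d_k).transpose(1, 2)
        k = self.W_k(x).view(B, n, self.h, self.d_k).transpose(1, 2)
        v = self.W_v(x).view(B, n, self.h, self.d_k).transpose(1, 2)
        # Scaled dot-product attention in each subspace (head)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, n, self.h * self.d_k)
        return self.W_o(out)   # concatenate heads and apply W^O

class FFN(nn.Module):
    """Two linear layers with a non-linearity, applied to each token independently."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.relu(self.fc1(x)))
```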
1.
ASVD: Activation-aware Singular Value Decomposition for Compressing
Large Language Models
Considering activations when compressing LLM weights, the optimization objective is $\min_{\hat W} \|WX - \hat W X\|_F$, where $X$ denotes the input activations and $\hat W$ is the low-rank approximation of $W$.
1.1
Activation-aware singular value decomposition.
The diagonal matrix $S$ is constructed to represent the impact of input channels on the weights, essentially adjusting $W$ to better adapt to the activation patterns of the input $X$: SVD is applied to the scaled weight, $WS = U\Sigma V^\top$, so that $W \approx U\Sigma V^\top S^{-1}$.
How to get $S$? Each entry $S_{ii}$ scales the influence of the i-th input channel. There are two ways to measure $S_{ii}$: the absolute mean value of the activations in the i-th channel over the calibration set, or their absolute maximum value.
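A minimal sketch of the activation-aware decomposition described above, assuming the absolute-mean statistic for $S_{ii}$ and a target rank $k$ (the function name and the small clamping constant are my own):

```python
import torch

def asvd_compress(W: torch.Tensor, X: torch.Tensor, k: int) -> torch.Tensor:
    """W: (d_out, d_in) weight; X: (d_in, n_tokens) calibration activations."""
    # S_ii: significance of the i-th input channel, e.g. its absolute mean activation
    s = X.abs().mean(dim=1).clamp(min=1e-6)
    S, S_inv = torch.diag(s), torch.diag(1.0 / s)
    # SVD of the scaled weight WS, keeping the k largest singular values
    U, Sigma, Vt = torch.linalg.svd(W @ S, full_matrices=False)
    W_approx = U[:, :k] @ torch.diag(Sigma[:k]) @ Vt[:k, :] @ S_inv
    return W_approx   # low-rank, activation-aware approximation of W
```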
1.2
Sensitivity-based Truncation Rank Searching
Different layers in LLMs exhibit varying degrees of sensitivity to information compression, which is reflected in the distribution of their singular values. WikiText is selected as the calibration set, and the change in perplexity, which measures how effectively an LM predicts a sequence of tokens, is the metric for assessing each layer's sensitivity.
Define the truncation (parameter) ratio $\alpha = \frac{k(m+n)}{mn}$ for a weight matrix $W \in \mathbb{R}^{m \times n}$ truncated to rank $k$.
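A sketch of the rank-searching loop described above; `perplexity`, `truncate_layer_`, and `restore_layer_` are hypothetical helpers standing in for the actual evaluation and truncation code:

```python
def rank_sensitivity(model, layer_names, calib_data, ratios=(0.9, 0.8, 0.7)):
    """Measure how much truncating each layer affects WikiText perplexity."""
    base_ppl = perplexity(model, calib_data)          # hypothetical helper
    sensitivity = {}
    for name in layer_names:
        for ratio in ratios:
            truncate_layer_(model, name, ratio)       # low-rank truncation in place
            sensitivity[(name, ratio)] = perplexity(model, calib_data) - base_ppl
            restore_layer_(model, name)               # undo before the next trial
    # Layers whose perplexity barely moves are less sensitive and can be
    # truncated more aggressively.
    return sensitivity
```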
Observations:
- MLP layers exhibit higher sensitivity to compression.
- Some layers exhibit relatively lower sensitivity.
1.3 Absorbing singular values
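A brief sketch of absorbing the singular values into the two low-rank factors by splitting $\sqrt{\Sigma}$ between them, which is my reading of what this step refers to (continuing the $WS = U\Sigma V^\top$ notation from above):

```python
import torch

def absorb(U, Sigma, Vt, S_inv, k):
    """Fold sqrt(Sigma) into both factors so the layer is stored as A @ B."""
    sqrt_sigma = torch.diag(Sigma[:k].sqrt())
    A = U[:, :k] @ sqrt_sigma                  # shape (d_out, k)
    B = sqrt_sigma @ Vt[:k, :] @ S_inv         # shape (k, d_in)
    return A, B                                # W is approximated by A @ B
```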
2.
SVD-LLM: Truncation-aware Singular Value Decomposition for Large
Language Model Compression
SVD-LLM leverages a truncation-aware data whitening technique that ensures a direct mapping between singular values and compression loss.
2.1 Customized SVD Loss
Following ASVD's optimization goal of decomposing the scaled weight $WS$, SVD-LLM enforces $S^{-1}X$ to be orthonormal, i.e., $(S^{-1}X)(S^{-1}X)^\top = I$.
Definition: The Frobenius norm of a matrix $A$ with dimension $m \times n$ is defined as $\|A\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} |a_{ij}|^2} = \sqrt{\mathrm{trace}(A^\top A)}$.
The compression loss when truncating the i-th singular value $\sigma_i$ of $WS = U\Sigma V^\top$ is defined as $L_i = \|\sigma_i u_i v_i^\top S^{-1}X\|_F$. $U$ and $V$ are orthonormal matrices, and $S^{-1}X$ is orthonormal, so the trace inside the Frobenius norm collapses and $L_i$ is equal to the i-th singular value, $L_i = \sigma_i$. If we truncate the smallest $k$ of the $r$ singular values in $\Sigma$ for compression, the compression loss is $L = \sqrt{\sum_{i=r-k+1}^{r}\sigma_i^2}$. Note that $S$ is the Cholesky factorization of the covariance matrix, $XX^\top = SS^\top$, which is exactly what makes $S^{-1}X$ orthonormal.
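A minimal sketch of the whitening-based compression just described, assuming a small damping term `eps` to keep $XX^\top$ positive definite (the damping and the particular factor split are illustrative choices):

```python
import torch

def svdllm_compress(W: torch.Tensor, X: torch.Tensor, k: int, eps: float = 1e-5):
    """W: (d_out, d_in); X: (d_in, n_tokens) calibration activations."""
    cov = X @ X.T + eps * torch.eye(X.shape[0])       # damped covariance XX^T
    S = torch.linalg.cholesky(cov)                    # S S^T = X X^T
    S_inv = torch.linalg.inv(S)
    U, Sigma, Vt = torch.linalg.svd(W @ S, full_matrices=False)
    # Keep the k largest singular values; the truncated ones give the exact loss
    W_u = U[:, :k] @ torch.diag(Sigma[:k].sqrt())
    W_v = torch.diag(Sigma[:k].sqrt()) @ Vt[:k, :] @ S_inv
    loss = Sigma[k:].pow(2).sum().sqrt()              # sqrt(sum of truncated sigma_i^2)
    return W_u, W_v, loss
```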
2.2 Parameter update
The final compressed weight matrix is $W' = U_k\Sigma_k V_k^\top S^{-1}$, stored as the two low-rank factors $W_u = U_k\sqrt{\Sigma_k}$ and $W_v = \sqrt{\Sigma_k}V_k^\top S^{-1}$.
Instead of directly applying LoRA fine-tuning to the compressed weight matrix $W'$, SVD-LLM applies LoRA on top of $W_u$ and $W_v$ separately to preserve the low-rank structure, and a sequential low-rank fine-tuning is applied to these two matrices to further enhance the compression performance.
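A sketch of attaching LoRA adapters to the two factors separately (the class name, rank `r`, and initialization are illustrative; this is not the authors' exact training code):

```python
import torch
import torch.nn as nn

class LoRAFactor(nn.Module):
    """A frozen factor matrix plus a trainable low-rank update B @ A."""
    def __init__(self, weight: torch.Tensor, r: int = 8):
        super().__init__()
        d_out, d_in = weight.shape
        self.weight = nn.Parameter(weight, requires_grad=False)
        self.A = nn.Parameter(torch.zeros(r, d_in))           # zero-init: update starts at 0
        self.B = nn.Parameter(torch.randn(d_out, r) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.weight + self.B @ self.A).T

# Compressed layer: y = LoRAFactor(W_u)(LoRAFactor(W_v)(x));
# the two adapters are then fine-tuned sequentially rather than jointly.
```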
3.
SVD-LLM V2: Optimizing Singular Value Truncation for Large Language
Model Compression
Drawbacks of SVD-LLM: 1. It applies a homogeneous compression ratio to all weight matrices. 2. The Cholesky decomposition requires the covariance matrix $XX^\top$ to be positive definite, which is not always the case in practice.
3.1 Heterogeneous
Compression Ratio Allocation
Motivation: weights that would incur a larger compression loss should be allocated a smaller compression ratio. Process: assign the compression ratio to each weight of the same type based on the ratio of its normalized compression loss to the total loss (see Line 11 of the algorithm in the paper).
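A sketch of the allocation rule as I read it: spread a global keep-ratio budget over all weight matrices of one type so that matrices with a larger share of the total loss are compressed less aggressively (the names and the clamp are illustrative, not the paper's exact algorithm):

```python
def allocate_keep_ratios(losses: dict, avg_keep_ratio: float) -> dict:
    """losses: normalized compression loss per weight matrix of the same type."""
    total = sum(losses.values())
    n = len(losses)
    # Keep ratio = global average scaled by this matrix's share of the total loss,
    # so more sensitive matrices retain more parameters.
    return {name: min(1.0, avg_keep_ratio * n * loss / total)
            for name, loss in losses.items()}
```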
3.2 Loss-optimized Weight
Truncation
Motivation: the numerical instability of the Cholesky decomposition.
Process: SVD-LLM V2 bypasses the Cholesky decomposition and directly truncates the singular values based on loss-optimized weight truncation. The process is as follows:
1. $U_1, \Sigma_1, V_1$ are obtained by SVD($X$), i.e., $X = U_1\Sigma_1 V_1^\top$.
2. $U_2, \Sigma_2, V_2$ are obtained by SVD($XX^\top$).
3. $XX^\top = U_1\Sigma_1 V_1^\top V_1 \Sigma_1 U_1^\top = U_1\Sigma_1^2 U_1^\top$. Since $XX^\top$ is symmetric, its SVD coincides with its eigendecomposition. So we get $U_2 = V_2 = U_1$ and $\Sigma_2 = \Sigma_1^2$.
4. Suppose $S = U_1\Sigma_1$. Then $S^{-1}X = \Sigma_1^{-1}U_1^\top U_1\Sigma_1 V_1^\top = V_1^\top$, hence $S^{-1}X$ is orthonormal.
> For any matrix $A$ and an orthonormal matrix $B$, $\|AB\|_F = \|A\|_F$.
5. $U, \Sigma, V$ are obtained by SVD decomposition of $WS$, and the compression loss of truncating $\sigma_i$ is again exactly $\sigma_i$.
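A minimal sketch of this procedure: the whitening matrix is built from the SVD of $X$ instead of a Cholesky factor, so it stays well-defined even when $XX^\top$ is ill-conditioned (the clamp on small singular values is an illustrative safeguard):

```python
import torch

def svdllm_v2_compress(W: torch.Tensor, X: torch.Tensor, k: int, eps: float = 1e-6):
    """W: (d_out, d_in); X: (d_in, n_tokens) calibration activations."""
    U1, Sig1, V1t = torch.linalg.svd(X, full_matrices=False)
    Sig1 = Sig1.clamp(min=eps)
    S = U1 @ torch.diag(Sig1)                 # S S^T = X X^T
    S_inv = torch.diag(1.0 / Sig1) @ U1.T     # S^{-1} X = V_1^T is orthonormal
    U, Sigma, Vt = torch.linalg.svd(W @ S, full_matrices=False)
    W_u = U[:, :k] @ torch.diag(Sigma[:k])    # keep the k largest singular values
    W_v = Vt[:k, :] @ S_inv
    return W_u, W_v                           # W is approximated by W_u @ W_v
```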
4.
BASIS SHARING: CROSS-LAYER PARAMETER SHARING FOR LARGE LANGUAGE MODEL
COMPRESSION