A standard transformer architecture contains two main blocks: multi-head attention (MHA) and a feed-forward network (FFN).
In MHA, the Query (Q), Key (K), and Value (V) matrices are derived from the input through learned linear transformations. The attention mechanism computes a weighted sum of the values based on the similarity between queries and keys, allowing the model to focus on relevant parts of the input sequence. The process can be summarized as follows:

1. Linear Transformations: The input $X \in \mathbb{R}^{n \times d_{\text{model}}}$ is linearly transformed into query, key, and value matrices:
   - $Q = XW^Q$ with $W^Q \in \mathbb{R}^{d_{\text{model}} \times h d_k}$
   - $K = XW^K$ with $W^K \in \mathbb{R}^{d_{\text{model}} \times h d_k}$
   - $V = XW^V$ with $W^V \in \mathbb{R}^{d_{\text{model}} \times h d_v}$

   where $d_{\text{model}}$ is the model dimension, $h$ is the number of attention heads, and $d_k$ and $d_v$ are the dimensions of the key and value vectors, respectively.
2. Multi-Head Attention: The attention mechanism is applied to the queries, keys, and values:
   - $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$
   - Reshape $Q, K$ to $\mathbb{R}^{n \times h \times d_k}$ and $V$ to $\mathbb{R}^{n \times h \times d_v}$, and then transpose them to $\mathbb{R}^{h \times n \times d_k}$ and $\mathbb{R}^{h \times n \times d_v}$.
   - The multi-head attention jointly extracts information from diverse subspaces by projecting the input into multiple subspaces and computing attention in each subspace.
   - The results are then concatenated: $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O$, where $W^O$ is a learned linear transformation.
The FFN module transforms the features of each token independently, typically consisting of two linear layers with a non-linear activation function in between: $\mathrm{FFN}(x) = \sigma(xW_1 + b_1)W_2 + b_2$, where $b_1$ and $b_2$ are bias vectors.
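To make the two blocks concrete, here is a minimal PyTorch sketch of the MHA and FFN modules described above (the class names, the ReLU activation, and the assumption $d_k = d_v = d_{\text{model}}/h$ are illustrative choices, not taken from the papers below):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MHA(nn.Module):
    """Multi-head attention: project, split into heads, attend, concatenate."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        # Learned linear transformations W^Q, W^K, W^V and output projection W^O
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, n, _ = x.shape
        # Reshape to (B, n, h, d_k) and transpose to (B, h, n, d_k)
        q = self.W_q(x).view(B, n, self.h, self.d_k).transpose(1, 2)
        k = self.W_k(x).view(B, n, self.h, self.d_k).transpose(1, 2)
        v = self.W_v(x).view(B, n, self.h, self.d_k).transpose(1, 2)
        # Scaled dot-product attention in each subspace (head)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, n, self.h * self.d_k)
        return self.W_o(out)   # concatenate heads and apply W^O

class FFN(nn.Module):
    """Two linear layers with a non-linearity, applied to each token independently."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.relu(self.fc1(x)))
```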
1.
ASVD: Activation-aware Singular Value Decomposition for Compressing
Large Language Models
Considering activations when compressing LLM weights, the optimization objective is $\min_{\hat W} \|WX - \hat W X\|_F$, where $X$ denotes the input activations and $\hat W$ is the low-rank approximation of $W$.
1.1
Activation-aware singular value decomposition.
The diagonal matrix $S$ is constructed to represent the impact of input channels on the weights, essentially adjusting $W$ to better adapt to the activation patterns of the input $X$: SVD is applied to the scaled weight, $WS = U\Sigma V^\top$, so that $W \approx U\Sigma V^\top S^{-1}$.
How to get $S$? Each entry $S_{ii}$ scales the influence of the i-th input channel. There are two ways to measure $S_{ii}$: the absolute mean value of the activations in the i-th channel over the calibration set, or their absolute maximum value.
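A minimal sketch of the activation-aware decomposition described above, assuming the absolute-mean statistic for $S_{ii}$ and a target rank $k$ (the function name and the small clamping constant are my own):

```python
import torch

def asvd_compress(W: torch.Tensor, X: torch.Tensor, k: int) -> torch.Tensor:
    """W: (d_out, d_in) weight; X: (d_in, n_tokens) calibration activations."""
    # S_ii: significance of the i-th input channel, e.g. its absolute mean activation
    s = X.abs().mean(dim=1).clamp(min=1e-6)
    S, S_inv = torch.diag(s), torch.diag(1.0 / s)
    # SVD of the scaled weight WS, keeping the k largest singular values
    U, Sigma, Vt = torch.linalg.svd(W @ S, full_matrices=False)
    W_approx = U[:, :k] @ torch.diag(Sigma[:k]) @ Vt[:k, :] @ S_inv
    return W_approx   # low-rank, activation-aware approximation of W
```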
1.2
Sensitivity-based Truncation Rank Searching
Different layers in LLMs exhibit varying degrees of sensitivity to information compression, which is reflected in the distribution of their singular values. WikiText is selected as the calibration set, and the change in perplexity, which measures how effectively an LM predicts a sequence of tokens, is the metric for assessing each layer's sensitivity.
Define the truncation (parameter) ratio $\alpha = \frac{k(m+n)}{mn}$ for a weight matrix $W \in \mathbb{R}^{m \times n}$ truncated to rank $k$.
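A sketch of the rank-searching loop described above; `perplexity`, `truncate_layer_`, and `restore_layer_` are hypothetical helpers standing in for the actual evaluation and truncation code:

```python
def rank_sensitivity(model, layer_names, calib_data, ratios=(0.9, 0.8, 0.7)):
    """Measure how much truncating each layer affects WikiText perplexity."""
    base_ppl = perplexity(model, calib_data)          # hypothetical helper
    sensitivity = {}
    for name in layer_names:
        for ratio in ratios:
            truncate_layer_(model, name, ratio)       # low-rank truncation in place
            sensitivity[(name, ratio)] = perplexity(model, calib_data) - base_ppl
            restore_layer_(model, name)               # undo before the next trial
    # Layers whose perplexity barely moves are less sensitive and can be
    # truncated more aggressively.
    return sensitivity
```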
Observations:
- MLP layers exhibit higher sensitivity to compression.
- Some layers exhibit relatively lower sensitivity.
1.3 Absorbing singular values
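A brief sketch of absorbing the singular values into the two low-rank factors by splitting $\sqrt{\Sigma}$ between them, which is my reading of what this step refers to (continuing the $WS = U\Sigma V^\top$ notation from above):

```python
import torch

def absorb(U, Sigma, Vt, S_inv, k):
    """Fold sqrt(Sigma) into both factors so the layer is stored as A @ B."""
    sqrt_sigma = torch.diag(Sigma[:k].sqrt())
    A = U[:, :k] @ sqrt_sigma                  # shape (d_out, k)
    B = sqrt_sigma @ Vt[:k, :] @ S_inv         # shape (k, d_in)
    return A, B                                # W is approximated by A @ B
```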
2.
SVD-LLM: Truncation-aware Singular Value Decomposition for Large
Language Model Compression
SVD-LLM leverages a truncation-aware data whitening technique that ensures a direct mapping between singular values and compression loss.
2.1 Customized SVD Loss
Following ASVD's optimization goal of decomposing the scaled weight $WS$, SVD-LLM enforces $S^{-1}X$ to be orthonormal, i.e., $(S^{-1}X)(S^{-1}X)^\top = I$.
Definition: The Frobenius norm of a matrix $A$ with dimension $m \times n$ is defined as $\|A\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} |a_{ij}|^2} = \sqrt{\mathrm{trace}(A^\top A)}$.
The compression loss when truncating the i-th singular value $\sigma_i$ of $WS = U\Sigma V^\top$ is defined as $L_i = \|\sigma_i u_i v_i^\top S^{-1}X\|_F$. $U$ and $V$ are orthonormal matrices, and $S^{-1}X$ is orthonormal, so the trace inside the Frobenius norm collapses and $L_i$ is equal to the i-th singular value, $L_i = \sigma_i$. If we truncate the smallest $k$ of the $r$ singular values in $\Sigma$ for compression, the compression loss is $L = \sqrt{\sum_{i=r-k+1}^{r}\sigma_i^2}$. Note that $S$ is the Cholesky factorization of the covariance matrix, $XX^\top = SS^\top$, which is exactly what makes $S^{-1}X$ orthonormal.
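A minimal sketch of the whitening-based compression just described, assuming a small damping term `eps` to keep $XX^\top$ positive definite (the damping and the particular factor split are illustrative choices):

```python
import torch

def svdllm_compress(W: torch.Tensor, X: torch.Tensor, k: int, eps: float = 1e-5):
    """W: (d_out, d_in); X: (d_in, n_tokens) calibration activations."""
    cov = X @ X.T + eps * torch.eye(X.shape[0])       # damped covariance XX^T
    S = torch.linalg.cholesky(cov)                    # S S^T = X X^T
    S_inv = torch.linalg.inv(S)
    U, Sigma, Vt = torch.linalg.svd(W @ S, full_matrices=False)
    # Keep the k largest singular values; the truncated ones give the exact loss
    W_u = U[:, :k] @ torch.diag(Sigma[:k].sqrt())
    W_v = torch.diag(Sigma[:k].sqrt()) @ Vt[:k, :] @ S_inv
    loss = Sigma[k:].pow(2).sum().sqrt()              # sqrt(sum of truncated sigma_i^2)
    return W_u, W_v, loss
```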
2.2 Parameter update
The final compressed weight matrix is $W' = U_k\Sigma_k V_k^\top S^{-1}$, stored as the two low-rank factors $W_u = U_k\sqrt{\Sigma_k}$ and $W_v = \sqrt{\Sigma_k}V_k^\top S^{-1}$.
Instead of directly applying LoRA fine-tuning to the compressed weight matrix $W'$, SVD-LLM applies LoRA on top of $W_u$ and $W_v$ separately to preserve the low-rank structure, and a sequential low-rank fine-tuning is applied to these two matrices to further enhance the compression performance.
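A sketch of attaching LoRA adapters to the two factors separately (the class name, rank `r`, and initialization are illustrative; this is not the authors' exact training code):

```python
import torch
import torch.nn as nn

class LoRAFactor(nn.Module):
    """A frozen factor matrix plus a trainable low-rank update B @ A."""
    def __init__(self, weight: torch.Tensor, r: int = 8):
        super().__init__()
        d_out, d_in = weight.shape
        self.weight = nn.Parameter(weight, requires_grad=False)
        self.A = nn.Parameter(torch.zeros(r, d_in))           # zero-init: update starts at 0
        self.B = nn.Parameter(torch.randn(d_out, r) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.weight + self.B @ self.A).T

# Compressed layer: y = LoRAFactor(W_u)(LoRAFactor(W_v)(x));
# the two adapters are then fine-tuned sequentially rather than jointly.
```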
3.
SVD-LLM V2: Optimizing Singular Value Truncation for Large Language
Model Compression
Drawbacks of SVD-LLM: 1. It applies a homogeneous compression ratio to all weight matrices. 2. The Cholesky decomposition requires the covariance matrix $XX^\top$ to be positive definite, which is not always the case in practice.
3.1 Heterogeneous
Compression Ratio Allocation
Motivation: weights that would incur a larger compression loss should be allocated a smaller compression ratio. Process: assign the compression ratio to each weight of the same type based on the ratio of its normalized compression loss to the total loss (see Line 11 of the algorithm in the paper).
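A sketch of the allocation rule as I read it: spread a global keep-ratio budget over all weight matrices of one type so that matrices with a larger share of the total loss are compressed less aggressively (the names and the clamp are illustrative, not the paper's exact algorithm):

```python
def allocate_keep_ratios(losses: dict, avg_keep_ratio: float) -> dict:
    """losses: normalized compression loss per weight matrix of the same type."""
    total = sum(losses.values())
    n = len(losses)
    # Keep ratio = global average scaled by this matrix's share of the total loss,
    # so more sensitive matrices retain more parameters.
    return {name: min(1.0, avg_keep_ratio * n * loss / total)
            for name, loss in losses.items()}
```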
3.2 Loss-optimized Weight
Truncation
Motivation: the numerical instability of the Cholesky decomposition.
Process: SVD-LLM V2 bypasses the Cholesky decomposition and directly truncates the singular values based on loss-optimized weight truncation. The process is as follows:
1. $U_1, \Sigma_1, V_1$ are obtained by SVD($X$), i.e., $X = U_1\Sigma_1 V_1^\top$.
2. $U_2, \Sigma_2, V_2$ are obtained by SVD($XX^\top$).
3. $XX^\top = U_1\Sigma_1 V_1^\top V_1 \Sigma_1 U_1^\top = U_1\Sigma_1^2 U_1^\top$. Since $XX^\top$ is symmetric, its SVD coincides with its eigendecomposition. So we get $U_2 = V_2 = U_1$ and $\Sigma_2 = \Sigma_1^2$.
4. Suppose $S = U_1\Sigma_1$. Then $S^{-1}X = \Sigma_1^{-1}U_1^\top U_1\Sigma_1 V_1^\top = V_1^\top$, hence $S^{-1}X$ is orthonormal.
> For any matrix $A$ and an orthonormal matrix $B$, $\|AB\|_F = \|A\|_F$.
5. $U, \Sigma, V$ are obtained by SVD decomposition of $WS$, and the compression loss of truncating $\sigma_i$ is again exactly $\sigma_i$.
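A minimal sketch of this procedure: the whitening matrix is built from the SVD of $X$ instead of a Cholesky factor, so it stays well-defined even when $XX^\top$ is ill-conditioned (the clamp on small singular values is an illustrative safeguard):

```python
import torch

def svdllm_v2_compress(W: torch.Tensor, X: torch.Tensor, k: int, eps: float = 1e-6):
    """W: (d_out, d_in); X: (d_in, n_tokens) calibration activations."""
    U1, Sig1, V1t = torch.linalg.svd(X, full_matrices=False)
    Sig1 = Sig1.clamp(min=eps)
    S = U1 @ torch.diag(Sig1)                 # S S^T = X X^T
    S_inv = torch.diag(1.0 / Sig1) @ U1.T     # S^{-1} X = V_1^T is orthonormal
    U, Sigma, Vt = torch.linalg.svd(W @ S, full_matrices=False)
    W_u = U[:, :k] @ torch.diag(Sigma[:k])    # keep the k largest singular values
    W_v = Vt[:k, :] @ S_inv
    return W_u, W_v                           # W is approximated by W_u @ W_v
```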
4.
BASIS SHARING: CROSS-LAYER PARAMETER SHARING FOR LARGE LANGUAGE MODEL
COMPRESSION