Over the years, applications for neural networks have steadily increased. However, neural networks pose two main challenges: they are compute-intensive and memory-intensive. This makes them difficult to use in production environments, which typically impose strict latency requirements on inference. Pruning is one technique that addresses these issues by compressing trained networks, enabling faster inference and more efficient storage.
In this topic, we will explore various pruning methods and learn how to select the most appropriate settings.
The setup
Many neural networks are believed to be over-parameterized, meaning that a large portion of the learned connections are redundant. This concept was first discussed in the "Optimal Brain Damage" paper by LeCun et al. in 1989. Pruning operates on the idea that not all weights in a trained neural network contribute equally to its performance.
In fact, only a fraction of the weights carries the majority of the important information. Thus, it is often safe to remove parts of the network without significantly impacting its accuracy. In extreme cases, pruning can reduce a model's size to less than 10% of its original size, though this involves trade-offs that will be discussed later. Below is an illustration showing how accuracy degrades as more parameters are removed from the model:
The general setting of pruning
In the most basic case, pruning can be illustrated as follows:
In its simplest form, pruning zeroes out parameters so that they are effectively absent from the pruned network. Weights and activations are the usual targets, while biases are rarely pruned because they make up only a small fraction of a model's parameters.
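Conceptually, this amounts to building a binary mask from the weights and multiplying it in. Below is a minimal PyTorch sketch of that idea; the layer shape and the threshold value are illustrative assumptions, not values from the text.

```python
import torch
import torch.nn as nn

# A minimal sketch of zeroing out weights by magnitude (layer size and
# threshold are illustrative choices).
layer = nn.Linear(128, 64)
threshold = 0.05                              # |w| below this is treated as unimportant

with torch.no_grad():
    mask = layer.weight.abs() >= threshold    # keep only sufficiently large weights
    layer.weight.mul_(mask)                   # zero out the rest in place

sparsity = 1.0 - mask.float().mean().item()
print(f"Fraction of weights set to zero: {sparsity:.1%}")
```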
Pruning can occur in two modes: one-shot or incremental. In one-shot pruning, the model is trained and pruned in a single pass, possibly followed by fine-tuning. Two considerations apply here: the compression-accuracy trade-off can usually be improved further by pruning incrementally, and one-shot pruning can be too aggressive, leading to significant performance degradation. In contrast, incremental pruning repeats cycles of training, pruning, and fine-tuning. This approach gradually removes less important weights and lets the remaining ones absorb more of the useful information. Incremental pruning can be illustrated as follows:
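As a complement, the train-prune-fine-tune cycle can be sketched in code. The following is a minimal sketch using PyTorch's torch.nn.utils.prune; the architecture, the number of rounds, and the fine_tune placeholder are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# A sketch of incremental pruning: prune a little, fine-tune, repeat.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

def fine_tune(model):
    pass  # placeholder: run a few training epochs so the remaining weights recover

for _ in range(3):                                   # several prune/fine-tune rounds
    for module in model.modules():
        if isinstance(module, nn.Linear):
            # Each call removes 20% of the still-unpruned weights with the smallest |w|.
            prune.l1_unstructured(module, name="weight", amount=0.2)
    fine_tune(model)

# Fold the accumulated masks into the weights to make the pruning permanent.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")
```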
Another way to classify pruning is by its granularity, that is, the structure of the parameters being removed. Unstructured (or fine-grained) pruning removes individual weights, while structured (or coarse-grained) pruning removes entire groups of parameters, such as channels in convolutional neural networks (CNNs). Structured pruning is generally easier to accelerate because the remaining parameters keep a regular, dense layout, whereas fine-grained pruning produces irregular sparsity that is harder to exploit.
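The two granularities can be contrasted with PyTorch's built-in pruning utilities. In this sketch, the layer shapes and pruning ratios are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Two convolutional layers used only to contrast the two granularities.
fine_grained = nn.Conv2d(16, 32, kernel_size=3)
coarse_grained = nn.Conv2d(16, 32, kernel_size=3)

# Unstructured (fine-grained): zero out 50% of individual weights, ranked by |w|.
prune.l1_unstructured(fine_grained, name="weight", amount=0.5)

# Structured (coarse-grained): remove 25% of whole output channels (dim=0),
# ranked by the L2 norm of each channel's weights.
prune.ln_structured(coarse_grained, name="weight", amount=0.25, n=2, dim=0)
```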
The choice of pruning method often depends on the architecture being used. For example, there has been substantial research on pruning CNNs and large language models (LLMs), as their specific design elements (e.g., regularization, BatchNorm, or the choice of optimizer) interact with the pruning process.
Pruning criteria
Now, we will consider how to determine whether a parameter is "important" and should be kept, as well as how to establish the sparsity (pruning ratio) of the final model.
One common method is magnitude-based pruning, where a weight's absolute value determines its importance. The assumption is that the absolute value (the L1 norm) of a weight reflects how much it contributes to the network's accuracy. To choose a threshold, several pruning levels, expressed as percentages (or sparsity), are set for every parameterized layer in the network. After pruning at each level, the accuracy on the test set is logged for each layer. This process, referred to as sensitivity analysis, reveals how sensitive different layers are to pruning.
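A minimal sketch of such a sensitivity analysis is shown below. The `model` and `evaluate` arguments (the latter returning accuracy on the test set) are hypothetical placeholders supplied by the caller, and the sparsity levels are illustrative.

```python
import copy
import torch.nn as nn
import torch.nn.utils.prune as prune

# Prune one layer at a time, at several sparsity levels, and record accuracy.
def sensitivity_analysis(model, evaluate, levels=(0.1, 0.3, 0.5, 0.7, 0.9)):
    results = {}
    for name, module in model.named_modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            for level in levels:
                trial = copy.deepcopy(model)              # prune a copy, keep the original intact
                target = dict(trial.named_modules())[name]
                prune.l1_unstructured(target, name="weight", amount=level)
                results[(name, level)] = evaluate(trial)  # accuracy after pruning this layer
    return results
```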
Another criterion is the average percentage of zero activations (APoZ). APoZ measures how often a neuron's output is zero across a set of inputs; neurons whose activations are zero most of the time contribute little to the model's performance and are good candidates for removal.
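The following is a minimal sketch of computing APoZ for a fully connected ReLU layer using a forward hook; `model`, `layer`, and `data_loader` are hypothetical placeholders supplied by the caller.

```python
import torch
import torch.nn as nn

# Count how often each neuron's post-ReLU activation is exactly zero.
def average_percentage_of_zeros(model, layer, data_loader):
    zero_counts, total = None, 0

    def hook(_module, _inputs, output):
        nonlocal zero_counts, total
        flat = output.detach().flatten(1)             # shape: (batch, num_neurons)
        zeros = (flat == 0).sum(dim=0)
        zero_counts = zeros if zero_counts is None else zero_counts + zeros
        total += flat.shape[0]

    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        for inputs, _labels in data_loader:
            model(inputs)
    handle.remove()
    return zero_counts.float() / total                # high APoZ -> pruning candidate
```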
The pruning ratio (also known as the sparsity ratio and expressed as a percentage) determines the final sparsity of the model. If too many parameters are removed, performance can degrade sharply. For example, some studies show that pruning more than 95% of the weights can severely harm accuracy. Thus, there is a clear trade-off between model compression and accuracy. The pruning ratio can be uniform across all layers or vary depending on the layer's importance, as seen in channel-level pruning.
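To see how a single overall ratio translates into different per-layer sparsities, here is a sketch using PyTorch's global unstructured pruning; the architecture and the 80% ratio are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))

parameters_to_prune = [
    (module, "weight") for module in model.modules() if isinstance(module, nn.Linear)
]

# Global pruning: 80% of all weights are removed, but magnitudes are compared
# across layers, so each layer usually ends up with a different sparsity.
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.8,
)

for idx, (module, _) in enumerate(parameters_to_prune):
    layer_sparsity = (module.weight == 0).float().mean().item()
    print(f"layer {idx}: {layer_sparsity:.1%} of weights pruned")
```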
Automated solutions, such as AutoML for Model Compression, can help determine the appropriate pruning hyperparameters using reinforcement learning (RL).
The lottery ticket hypothesis
An interesting theory related to pruning is the Lottery Ticket Hypothesis, which proposes that within a large network there exists a smaller subnetwork (the "winning ticket") that, when trained from its original initialization, can match the accuracy of the full network in a comparable or smaller number of training iterations.
The general process for finding a winning lottery ticket looks as follows:
The process involves training, pruning, and resetting the remaining weights to their initial values, repeating this over multiple rounds. The hypothesis supports the idea that over-parameterization helps networks explore a broader range of optimal weights.
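A minimal sketch of this iterative procedure is given below for fully connected layers; the `train` function (a full training run) and the round/ratio settings are hypothetical placeholders, not the paper's exact recipe.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Train, prune, then rewind the surviving weights to their initial values.
def find_winning_ticket(model, train, rounds=3, prune_amount=0.2):
    initial_state = copy.deepcopy(model.state_dict())     # remember the initialization
    for _ in range(rounds):
        train(model)                                       # 1. train the (sub)network
        for module in model.modules():                     # 2. prune the smallest weights
            if isinstance(module, nn.Linear):
                prune.l1_unstructured(module, name="weight", amount=prune_amount)
        with torch.no_grad():                              # 3. rewind surviving weights
            for name, module in model.named_modules():     #    to their initial values;
                if isinstance(module, nn.Linear):          #    the pruning masks stay in place
                    module.weight_orig.copy_(initial_state[f"{name}.weight"])
    return model
```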
Conclusion
In summary:
Pruning is a technique to compress neural networks by removing unnecessary weights while maintaining accuracy.
There are multiple pruning techniques, often chosen based on the architecture of the model.
Pruning can be classified as one-shot or incremental, depending on how the process is carried out.
The pruning ratio or specific criteria, such as weight magnitude, can guide the pruning process.
By understanding these concepts, we can effectively apply pruning to optimize neural networks for real-world applications.