What is DeepSpeed for Generative AI?
DeepSpeed is an open-source (Apache 2.0 license) library that optimizes training and inference for foundation models. It is a lightweight wrapper around PyTorch and optimizes for both speed and scale.
Training optimization using DeepSpeed
DeepSpeed optimizes training by managing distributed training, mixed precision, gradient accumulation, and checkpointing. Some of its features are:
- It can train models with up to 13 billion parameters on a single GPU.
- It implements the Zero Redundancy Optimizer (ZeRO), which partitions optimizer states, gradients, and parameters across devices to eliminate memory redundancy in distributed training.
- It supports combinations of data, model, and pipeline parallelism, which it calls 3D parallelism.
- It increases communication efficiency using compressed optimizers such as 1-bit Adam (Adam with 1-bit gradient compression), 0/1 Adam, and 1-bit LAMB, which reduce the communication volume between workers.
- It includes a library called Data Efficiency, which improves training efficiency and model quality by making better use of data, through two techniques: curriculum learning and random layerwise token dropping (random-LTD).
- It supports long sequence lengths using sparse attention kernels.
- It improves training efficiency by using large-batch optimizers such as LAMB.
- It enables distributed training with mixed precision.
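Several of the features above (ZeRO, mixed precision, gradient accumulation, 1-bit Adam) are enabled declaratively through DeepSpeed's JSON configuration file. The sketch below is a minimal example; the batch size, learning rate, and `freeze_step` values are illustrative placeholders, not recommendations, and exact option names can vary between DeepSpeed versions.

```json
{
  "train_batch_size": 32,
  "gradient_accumulation_steps": 4,
  "fp16": { "enabled": true },
  "zero_optimization": { "stage": 2 },
  "optimizer": {
    "type": "OneBitAdam",
    "params": { "lr": 1e-4, "freeze_step": 1000 }
  }
}
```

A config like this is typically passed to `deepspeed.initialize()` or supplied on the command line when launching with the `deepspeed` runner, so the same training script can switch between ZeRO stages or optimizers without code changes.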
Inference Optimization using DeepSpeed
There are two main challenges in inference: latency and cost. DeepSpeed has the following features to optimize inference:
- Splitting inference across multiple GPUs and selecting the best parallelism strategy for multi-GPU inference.
- Increasing efficiency per GPU using:
- Deep fusion: combining multiple operations into a single kernel, which cuts kernel-launch overhead and memory traffic.
- Novel kernel scheduling: at small batch sizes, kernel invocation overhead dominates and general matrix multiplication (GEMM) libraries are not tuned for small shapes; DeepSpeed provides custom kernels that address both problems.
- The DeepSpeed quantization toolkit reduces inference cost and contains:
- Different quantization schemes for parameters and activations.
- Specialized INT8 inference kernels.
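To make the quantization idea concrete, here is a minimal pure-Python sketch of symmetric per-tensor INT8 quantization, the basic scheme such kernels build on. This is an illustration of the technique, not DeepSpeed's implementation, which uses optimized GPU kernels.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats onto the integer
    range [-127, 127] using a single scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the INT8 representation."""
    return [x * scale for x in q]

weights = [0.5, -1.0, 0.25, 0.9]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)

# The round-trip error is bounded by half a quantization step.
assert max(abs(a - b) for a, b in zip(weights, recovered)) <= scale / 2 + 1e-9
```

Storing INT8 values instead of 32-bit floats cuts weight memory by 4x, which is where much of the inference cost saving comes from; the specialized kernels then compute directly on the INT8 data.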
DeepSpeed also contains a component known as the compression composer. It offers multiple compression methods, such as quantization, head/row/channel pruning, knowledge distillation, and layer reduction, and provides an API to combine these methods in various combinations.
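Of the methods listed above, knowledge distillation is the easiest to sketch in isolation: a smaller student model is trained to match the teacher model's softened output distribution. The pure-Python sketch below shows the core loss computation; it is a conceptual illustration under standard distillation assumptions, not DeepSpeed's API.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature softens the
    distribution, exposing the teacher's 'dark knowledge' about
    relative class similarities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the softened student distribution to the
    softened teacher distribution; zero when they match exactly."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher_logits = [3.0, 1.0, 0.2]
student_logits = [2.5, 1.2, 0.3]
loss = distillation_loss(student_logits, teacher_logits)
assert loss >= 0.0
```

In practice this term is combined with the ordinary cross-entropy loss on the true labels, and the composer's value is that distillation can be stacked with pruning or layer reduction in a single compression pipeline.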