Simple Guide to Convert an FP16 Model to FP8

Overview

This guide walks you through converting a model from FP16 to FP8, an 8-bit data format that significantly improves inference efficiency without sacrificing output quality. FP8 is well suited to quantizing large language models (LLMs), enabling faster and more cost-effective deployments.

Requirements

  • VM with GPUs: Ensure your VM has enough GPU memory and disk space to download the FP16 model and run the conversion.
  • Supported GPU Architectures: The conversion process requires GPUs with NVIDIA Ada Lovelace or Hopper architectures, such as the L4 or H100 GPUs.

Step 1: Set Up the Environment

  1. Access your VM or GPU environment and open a terminal.
  2. Install Python and Pip:
    sudo apt install python3-pip
  3. Install the required Python packages:
    pip install transformers
    pip install -U "huggingface_hub[cli]"
  4. Clone the AutoFP8 repository:
    git clone https://github.com/neuralmagic/AutoFP8.git
  5. Navigate to the AutoFP8 directory:
    cd AutoFP8
  6. Install AutoFP8:
    pip install -e .

Step 2: Download the FP16 Model

In a new terminal, use the Hugging Face CLI to download the FP16 model:

huggingface-cli download [modelName]
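
For example, to fetch the Llama 3 8B Instruct checkpoint used in the quantization step below (access to this gated repository is assumed):

huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct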

Step 3: Quantize the Model to FP8

  1. Create a quantize_model.py script in a text editor:
    nano quantize_model.py
  2. Add the following, setting pretrained_model_dir to the model you downloaded:
    from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
    
    pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
    quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8-Dynamic"
    
    # Define quantization config with dynamic activation scales
    quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="dynamic")
    # For dynamic activation scales, there is no need for calibration examples
    examples = []
    
    # Load the model, quantize, and save checkpoint
    model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
    model.quantize(examples)
    model.save_quantized(quantized_model_dir)
  3. Run the quantization script:
    python3 quantize_model.py
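
The script above uses dynamic activation scales, so no calibration data is needed. If you prefer static per-tensor activation scales, AutoFP8's example usage passes a small batch of tokenized calibration prompts instead. Here is a rough sketch based on that pattern (the calibration prompt and output directory name are illustrative):

from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8-Static"

# Tokenize a handful of calibration prompts used to compute static activation scales
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = tokenizer(["FP8 quantization calibration example."], return_tensors="pt").to("cuda")

quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")

# Load the model, calibrate and quantize, then save the checkpoint
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)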

Step 4: Upload the Quantized FP8 Model

  1. Log in to Hugging Face:
    huggingface-cli login
  2. Paste your Hugging Face token when prompted.
  3. Navigate to the quantized model's directory:
    cd [path_to_model_weights]
  4. Upload the FP8 model:
    huggingface-cli upload [modelName]
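
For example, to push the contents of the current directory to a repository under your account (the repository name below is just a placeholder):

huggingface-cli upload your-username/Meta-Llama-3-8B-Instruct-FP8-Dynamic .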

Conclusion

You have successfully converted your FP16 model to FP8 and uploaded it to Hugging Face! This conversion allows for faster and more efficient inference, especially for large language models.

Check out our FP8 models.

Understanding FP8 Quantization

TL;DR: FP8 is an 8-bit data format that offers an alternative to INT8 for quantizing LLMs. Thanks to its higher dynamic range, FP8 is suitable for quantizing more of an LLM’s components, most notably its activations, making inference faster and more efficient. FP8 quantization is also safer for smaller models, like 7B parameter LLMs, than INT8 quantization, offering better performance improvements with less degradation of output quality.

An Introduction to Floating Point Numbers

Floating point number formats were a revelation in the math that underpins computer science, and their history stretches back over 100 years. Today, floating point number formats are codified in the IEEE 754-2019 spec, which sets international standards for how floating point numbers are expressed.

A floating point number has 3 parts:

  • Sign: A single bit indicating if the number is positive or negative.
  • Range (Exponent): The power of two that scales the number, determining how large or small it can be.
  • Precision (Mantissa): The significant digits of the number.

In contrast, an integer representation consists almost entirely of significant digits (precision). Depending on the format it may or may not include a sign bit, but it has no exponent.
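
As a small illustration, this Python snippet (numpy assumed available) pulls apart the sign, exponent, and mantissa bits of an FP16 value and rebuilds it:

import numpy as np

# FP16 layout: 1 sign bit, 5 exponent bits (bias 15), 10 mantissa bits
bits = int(np.array(-3.14, dtype=np.float16).view(np.uint16))

sign     = (bits >> 15) & 0x1     # positive or negative
exponent = (bits >> 10) & 0x1F    # range: power-of-two scaling
mantissa = bits & 0x3FF           # precision: significant digits

# Reconstruct a normal FP16 value: (-1)^sign * 2^(exponent - 15) * (1 + mantissa / 2^10)
value = (-1) ** sign * 2.0 ** (exponent - 15) * (1 + mantissa / 1024)
print(sign, exponent, mantissa, value)  # 1 16 584 -3.140625 (nearest FP16 value to -3.14)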

FP8 vs INT8 Data Formats

FP8 and INT8 are both 8-bit values, but the way they use those bits determines their utility as data formats for model inference. Here’s a comparison of the dynamic range of each format:

  • INT8 dynamic range: 2^8
  • E4M3 FP8 (4 exponent bits, 3 mantissa bits) dynamic range: 2^18
  • E5M2 FP8 (5 exponent bits, 2 mantissa bits) dynamic range: 2^32

This higher dynamic range means that when FP16 values are mapped to FP8, they remain easier to distinguish, so more of the information encoded in the model parameters is retained. This is what makes FP8 quantization more reliable than INT8 for smaller models.
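
As a quick sanity check of the FP8 figures above, treat dynamic range as the ratio of the largest normal value to the smallest subnormal (448 and 2^-9 for E4M3; 57344 and 2^-16 for E5M2):

import math

# E4M3: max normal 448, min subnormal 2^-9; E5M2: max normal 57344, min subnormal 2^-16
print(f"E4M3 ~ 2^{math.log2(448 / 2 ** -9):.0f}")     # ~ 2^18
print(f"E5M2 ~ 2^{math.log2(57344 / 2 ** -16):.0f}")  # ~ 2^32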

Applying FP8 in Production

In practice, FP8 enables quantizing not just an LLM’s weights but also the activations and KV cache, avoiding expensive calculations in FP16 during model inference. FP8 is supported on latest-generation GPUs such as the NVIDIA H100 GPU, where alongside other optimizations, it can deliver remarkable performance with minimal quality degradation.
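
For instance, vLLM can serve the FP8 checkpoint produced in Step 3 while also keeping the KV cache in FP8 via its kv_cache_dtype option. A minimal sketch (the model path assumes the directory saved earlier):

from vllm import LLM

# Serve the FP8-quantized checkpoint and store the KV cache in FP8 as well
llm = LLM(model="Meta-Llama-3-8B-Instruct-FP8-Dynamic", kv_cache_dtype="fp8")
print(llm.generate(["What does FP8 quantization change?"])[0].outputs[0].text)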

Alternative with vLLM: Quick Start with Online Dynamic Quantization

Dynamic quantization of an original precision BF16/FP16 model to FP8 can be achieved with vLLM without any calibration data required. You can enable the feature by specifying --quantization="fp8" in the command line or setting quantization="fp8" in the LLM constructor.

In this mode, all Linear modules (except for the final lm_head) have their weights quantized down to FP8_E4M3 precision with a per-tensor scale. Activations have their minimum and maximum values calculated during each forward pass to provide a dynamic per-tensor scale for high accuracy. As a result, latency improvements are limited in this mode.
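
A minimal sketch of this online mode, assuming the FP16 Llama 3 8B Instruct checkpoint (any BF16/FP16 model works):

from vllm import LLM, SamplingParams

# Weights are quantized to FP8_E4M3 at load time; activation scales are computed each forward pass
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", quantization="fp8")
outputs = llm.generate(["Explain FP8 quantization in one sentence."], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)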

vLLM Quantization Documentation
