Quantization

class curated_transformers.quantization.Quantizable

Mixin class for models that are quantizable.

A module using this mixin provides the configuration and parameter information needed to quantize it on the fly during the module loading phase.

abstract classmethod modules_to_not_quantize()

Return a set of prefixes that specify which modules are to be ignored during quantization.

Return type:

Set[str]

Returns:

Set of module prefixes.

If empty, all submodules will be quantized.
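
For illustration, a model class might exclude its embedding-related submodules from quantization. The following is a minimal sketch; the class name MyDecoder and the prefixes "embeddings" and "output_embeddings" are hypothetical and depend on the model's actual submodule layout.

    from typing import Set

    from torch.nn import Module

    from curated_transformers.quantization import Quantizable


    class MyDecoder(Module, Quantizable):
        @classmethod
        def modules_to_not_quantize(cls) -> Set[str]:
            # Keep these (hypothetical) submodules in full precision;
            # all other submodules are quantized while loading.
            return {"embeddings", "output_embeddings"}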

bitsandbytes

These classes can be used to specify the configuration for quantizing model parameters using the bitsandbytes library.
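
In practice, a configuration built with these classes is passed along when loading a model. The sketch below assumes a loader such as AutoCausalLM.from_hf_hub that accepts a quantization_config argument; the repository name is only a placeholder.

    import torch

    from curated_transformers.models import AutoCausalLM
    from curated_transformers.quantization.bnb import BitsAndBytesConfig

    # Load a causal LM and quantize its parameters to int8 on the fly.
    # Assumes `from_hf_hub` accepts a `quantization_config` argument;
    # the model name is a placeholder.
    model = AutoCausalLM.from_hf_hub(
        name="databricks/dolly-v2-3b",
        device=torch.device("cuda", index=0),
        quantization_config=BitsAndBytesConfig.for_8bit(),
    )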

class curated_transformers.quantization.bnb.Dtype4Bit(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Data type to use for 4-bit quantization.

FP4 = 'fp4'

FP4 - Float 4-bit.

NF4 = 'nf4'

NF4 - NormalFloat 4-bit.

class curated_transformers.quantization.bnb.BitsAndBytesConfig(inner)

Configuration for quantization using the bitsandbytes library.

static for_4bit(quantization_dtype=Dtype4Bit.FP4, compute_dtype=torch.bfloat16, double_quantization=True)

Construct a configuration for 4-bit (FP4 or NF4) quantization.

Parameters:
  • quantization_dtype (Dtype4Bit) – Data type used for storing quantized weights.

  • compute_dtype (dtype) – Data type used for performing computations. Supported types: float16, bfloat16, float32.

  • double_quantization (bool) – Whether the quantization constants should themselves be quantized.
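
For example, a configuration that stores weights in NF4, computes in bfloat16, and enables double quantization (a sketch restating the parameters above):

    import torch

    from curated_transformers.quantization.bnb import BitsAndBytesConfig, Dtype4Bit

    # NF4 storage, bfloat16 compute, quantized quantization constants.
    config = BitsAndBytesConfig.for_4bit(
        quantization_dtype=Dtype4Bit.NF4,
        compute_dtype=torch.bfloat16,
        double_quantization=True,
    )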

static for_8bit(outlier_threshold=6.0, finetunable=False)

Construct a configuration for int8 quantization.

Parameters:
  • outlier_threshold (float) – Threshold for outlier detection during weight decomposition.

  • finetunable (bool) – Whether the quantized model should support fine-tuning after quantization.
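
For example, an int8 configuration that keeps the default outlier threshold but remains fine-tunable:

    from curated_transformers.quantization.bnb import BitsAndBytesConfig

    # int8 quantization; values whose magnitude exceeds the outlier
    # threshold are decomposed and handled in higher precision.
    config = BitsAndBytesConfig.for_8bit(
        outlier_threshold=6.0,
        finetunable=True,
    )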