Quantization

class curated_transformers.quantization.Quantizable

Mixin class for models that are quantizable.

A module using this mixin provides the configuration and parameter information needed to quantize it on the fly during the module loading phase.

abstract classmethod modules_to_not_quantize()

Return a set of prefixes that specify which modules are to be ignored during quantization.

Return type:

Set[str]

Returns:

Set of module prefixes.

If empty, all submodules will be quantized.
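
For illustration, a model class might exclude its embedding-related submodules from quantization. The following is a minimal sketch; the class name MyDecoder and the prefixes "embeddings" and "output_embeddings" are hypothetical and depend on the model's actual submodule layout.

    from typing import Set

    from torch.nn import Module

    from curated_transformers.quantization import Quantizable


    class MyDecoder(Module, Quantizable):
        @classmethod
        def modules_to_not_quantize(cls) -> Set[str]:
            # Keep these (hypothetical) submodules in full precision;
            # all other submodules are quantized while loading.
            return {"embeddings", "output_embeddings"}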

bitsandbytes

These classes can be used to specify the configuration for quantizing model parameters using the bitsandbytes library.
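
In practice, a configuration built with these classes is passed along when loading a model. The sketch below assumes a loader such as AutoCausalLM.from_hf_hub that accepts a quantization_config argument; the repository name is only a placeholder.

    import torch

    from curated_transformers.models import AutoCausalLM
    from curated_transformers.quantization.bnb import BitsAndBytesConfig

    # Load a causal LM and quantize its parameters to int8 on the fly.
    # Assumes `from_hf_hub` accepts a `quantization_config` argument;
    # the model name is a placeholder.
    model = AutoCausalLM.from_hf_hub(
        name="databricks/dolly-v2-3b",
        device=torch.device("cuda", index=0),
        quantization_config=BitsAndBytesConfig.for_8bit(),
    )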

class curated_transformers.quantization.bnb.Dtype4Bit(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Data type to use for 4-bit quantization.

FP4 = 'fp4'

FP4 - Float 4-bit.

NF4 = 'nf4'

NF4 - NormalFloat 4-bit.

class curated_transformers.quantization.bnb.BitsAndBytesConfig(inner)

Configuration for quantization using the bitsandbytes library.

static for_4bit(quantization_dtype=Dtype4Bit.FP4, compute_dtype=torch.bfloat16, double_quantization=True)

Construct a configuration for 4-bit (FP4 or NF4) quantization.

Parameters:
  • quantization_dtype (Dtype4Bit) – Data type used for storing quantized weights.

  • compute_dtype (dtype) – Data type used for performing computations. Supported types: float16, bfloat16, float32.

  • double_quantization (bool) – Whether the quantization constants should themselves be quantized.
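
For example, a configuration that stores weights in NF4, computes in bfloat16, and enables double quantization (a sketch restating the parameters above):

    import torch

    from curated_transformers.quantization.bnb import BitsAndBytesConfig, Dtype4Bit

    # NF4 storage, bfloat16 compute, quantized quantization constants.
    config = BitsAndBytesConfig.for_4bit(
        quantization_dtype=Dtype4Bit.NF4,
        compute_dtype=torch.bfloat16,
        double_quantization=True,
    )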

static for_8bit(outlier_threshold=6.0, finetunable=False)

Construct a configuration for int8 quantization.

Parameters:
  • outlier_threshold (float) – Threshold for outlier detection during weight decomposition.

  • finetunable (bool) – Whether the quantized model should support fine-tuning after quantization.
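
For example, an int8 configuration that keeps the default outlier threshold but remains fine-tunable:

    from curated_transformers.quantization.bnb import BitsAndBytesConfig

    # int8 quantization; values whose magnitude exceeds the outlier
    # threshold are decomposed and handled in higher precision.
    config = BitsAndBytesConfig.for_8bit(
        outlier_threshold=6.0,
        finetunable=True,
    )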