Decoders

Base Classes

class curated_transformers.models.DecoderModule(config)

Bases: Generic[ConfigT, CacheT], TransformerModule[ConfigT]

Base class for decoder modules.

property config: ConfigT

Returns the model’s configuration.

abstract forward(piece_ids, attention_mask, *, cache=None, positions=None, store_cache=False)

Apply the decoder to the given piece identifiers.

Parameters:
  • piece_ids (Tensor) –

    Piece identifiers to apply the decoder to.

    Shape: (batch_size, seq_len)

  • attention_mask (AttentionMask) – Attention mask. Sequence elements for which the corresponding mask element is set to False are ignored during attention calculation.

  • cache (Optional[List[TypeVar(CacheT, bound= CacheProtocol)]]) – Key/value cache to avoid recomputing key/value representations for tokens that were previously seen.

  • positions (Optional[Tensor]) –

    Input positions. Positions are needed to look up position embeddings. Normally, these positions are calculated automatically. But if the positions deviate for some reason, they can be provided through this argument.

    Shape: (batch_size, seq_len)

  • store_cache (bool) – Whether to cache the key/value representations for future reuse.

Return type:

ModelOutputWithCache[TypeVar(CacheT, bound= CacheProtocol)]

Returns:

Decoder output with key/value cache.
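
The interplay of cache and store_cache can be illustrated with a toy sketch that does not depend on the library. Everything in it is hypothetical and not part of curated_transformers: ToyOutput, toy_forward, and the uniform causal averaging that stands in for attention. It handles a single sequence rather than a batch, and only shows why reusing cached values gives the same result as recomputing them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyOutput:
    # Outputs for the newly supplied positions only.
    outputs: List[float]
    # Cached per-token values, or None when store_cache is False.
    cache: Optional[List[float]]


def toy_forward(
    piece_ids: List[int],
    *,
    cache: Optional[List[float]] = None,
    store_cache: bool = False,
) -> ToyOutput:
    """Causal 'attention' that averages all values up to each position."""
    past = cache if cache is not None else []
    # Stand-in for the key/value projection of the new pieces.
    new_values = [float(piece_id) for piece_id in piece_ids]
    values = past + new_values

    outputs = []
    for offset in range(len(new_values)):
        # Each new position sees the cached values and the new values up to itself.
        visible = values[: len(past) + offset + 1]
        outputs.append(sum(visible) / len(visible))

    return ToyOutput(outputs=outputs, cache=values if store_cache else None)


# Full pass over four pieces.
full = toy_forward([1, 2, 3, 4])

# Incremental pass: process a prefix once, then only the new piece.
prefix = toy_forward([1, 2, 3], store_cache=True)
step = toy_forward([4], cache=prefix.cache, store_cache=True)
```

The output for the last piece is the same in both passes, but the incremental pass does not recompute the values of the first three pieces.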

class curated_transformers.models.TransformerDecoder(config)

Bases: Generic[ConfigT], DecoderModule[ConfigT, KeyValueCache]

Transformer decoder (Vaswani et al., 2017) base class.

This class provides an implementation of the forward method. Subclasses must set the given member attributes.

property config: ConfigT

Returns the model’s configuration.

forward(piece_ids, attention_mask, *, cache=None, positions=None, store_cache=False)

Apply the decoder to the given piece identifiers.

Parameters:
  • piece_ids (Tensor) –

    Piece identifiers to apply the decoder to.

    Shape: (batch_size, seq_len)

  • attention_mask (AttentionMask) – Attention mask. Sequence elements for which the corresponding mask element is set to False are ignored during attention calculation.

  • cache (Optional[List[KeyValueCache]]) – Key/value cache to avoid recomputing key/value representations for tokens that were previously seen.

  • positions (Optional[Tensor]) –

    Input positions. Positions are needed to look up position embeddings. Normally, these positions are calculated automatically. But if the positions deviate for some reason, they can be provided through this argument.

    Shape: (batch_size, seq_len)

  • store_cache (bool) – Whether to cache the key/value representations for future reuse.

Return type:

ModelOutputWithCache[KeyValueCache]

Returns:

Decoder output with key/value cache.
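
One situation in which positions may need to be passed explicitly is left-padded input. This is an assumption made for illustration, not a case the documentation names. The sketch below uses plain Python lists in place of tensors, and the helper name is hypothetical. It builds a boolean mask and matching positions, both of shape (batch_size, seq_len).

```python
from typing import List, Tuple


def left_padded_mask_and_positions(
    lengths: List[int],
) -> Tuple[List[List[bool]], List[List[int]]]:
    """Build a boolean mask and per-token positions for left-padded sequences.

    Both results have shape (batch_size, seq_len), where seq_len is the
    longest length in the batch.
    """
    seq_len = max(lengths)
    mask, positions = [], []
    for length in lengths:
        n_pad = seq_len - length
        # False marks padding, which attention should ignore.
        mask.append([False] * n_pad + [True] * length)
        # Real tokens count up from zero; padding gets a dummy position 0.
        positions.append([0] * n_pad + list(range(length)))
    return mask, positions


mask, positions = left_padded_mask_and_positions([2, 4])
```

In real use both results would be tensors, with the boolean mask wrapped in an AttentionMask before being passed to forward.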

Architectures

These modules represent the supported decoder-only architectures.

class curated_transformers.models.FalconDecoder(config, *, device=None)

Bases: TransformerDecoder[FalconConfig], FromHFHub

Falcon (Penedo et al., 2023) decoder.

Construct a Falcon decoder.

Parameters:
  • config (FalconConfig) – Decoder configuration.

  • device (Optional[device]) – Device to which the module is to be moved.

Returns:

The decoder.

property config: ConfigT

Returns the model’s configuration.

classmethod convert_hf_state_dict(params)

Convert a state dict of a Hugging Face model to a valid state dict for the module.

Parameters:

params (Mapping[str, Tensor]) – The state dict to convert.

Returns:

The converted state dict.
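
A state dict conversion largely amounts to mapping one parameter naming scheme onto another. The following is a minimal sketch of that idea using regular-expression renaming. The rules and parameter names are hypothetical and do not reflect the actual names used by Hugging Face models or by this library, and a real conversion may also need to split, merge, or reshape tensors. Strings stand in for tensors so that the example runs without PyTorch.

```python
import re
from typing import Dict, List, Mapping, Tuple

# Hypothetical (pattern, replacement) rules; real parameter names differ.
RENAME_RULES: List[Tuple[str, str]] = [
    (r"^transformer\.", ""),
    (r"^h\.(\d+)\.", r"layers.\1."),
    (r"\.self_attention\.", ".mha."),
]


def rename_state_dict_keys(params: Mapping[str, object]) -> Dict[str, object]:
    """Apply every rename rule, in order, to each key of a state dict."""
    converted = {}
    for name, tensor in params.items():
        for pattern, replacement in RENAME_RULES:
            name = re.sub(pattern, replacement, name)
        # Values are passed through untouched; only the keys change.
        converted[name] = tensor
    return converted


# Placeholder string "W0" stands in for a weight tensor.
hf_params = {"transformer.h.0.self_attention.dense.weight": "W0"}
converted = rename_state_dict_keys(hf_params)
```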

forward(piece_ids, attention_mask, *, cache=None, positions=None, store_cache=False)

Apply the decoder to the given piece identifiers.

Parameters:
  • piece_ids (Tensor) –

    Piece identifiers to apply the decoder to.

    Shape: (batch_size, seq_len)

  • attention_mask (AttentionMask) – Attention mask. Sequence elements for which the corresponding mask element is set to False are ignored during attention calculation.

  • cache (Optional[List[KeyValueCache]]) – Key/value cache to avoid recomputing key/value representations for tokens that were previously seen.

  • positions (Optional[Tensor]) –

    Input positions. Positions are needed to look up position embeddings. Normally, these positions are calculated automatically. But if the positions deviate for some reason, they can be provided through this argument.

    Shape: (batch_size, seq_len)

  • store_cache (bool) – Whether to cache the key/value representations for future reuse.

Return type:

ModelOutputWithCache[KeyValueCache]

Returns:

Decoder output with key/value cache.

classmethod from_hf_config(*, hf_config, device=None)

Create the module from a JSON-deserialized Hugging Face model configuration.

Parameters:
  • hf_config (Any) – Hugging Face model configuration.

  • device (Optional[device]) – Device on which to initialize the model.

Return type:

TypeVar(Self, bound= FalconDecoder)

Returns:

Module constructed using the configuration.

class curated_transformers.models.GPTNeoXDecoder(config, *, device=None)

Bases: TransformerDecoder[GPTNeoXConfig], FromHFHub

GPT-NeoX (Black et al., 2022) decoder.

Construct a GPT-NeoX decoder.

Parameters:
  • config (GPTNeoXConfig) – Decoder configuration.

  • device (Optional[device]) – Device to which the module is to be moved.

Returns:

The decoder.

property config: ConfigT

Returns the model’s configuration.

classmethod convert_hf_state_dict(params)

Convert a state dict of a Hugging Face model to a valid state dict for the module.

Parameters:

params (Mapping[str, Tensor]) – The state dict to convert.

Returns:

The converted state dict.

forward(piece_ids, attention_mask, *, cache=None, positions=None, store_cache=False)

Apply the decoder to the given piece identifiers.

Parameters:
  • piece_ids (Tensor) –

    Piece identifiers to apply the decoder to.

    Shape: (batch_size, seq_len)

  • attention_mask (AttentionMask) – Attention mask. Sequence elements for which the corresponding mask element is set to False are ignored during attention calculation.

  • cache (Optional[List[KeyValueCache]]) – Key/value cache to avoid recomputing key/value representations for tokens that were previously seen.

  • positions (Optional[Tensor]) –

    Input positions. Positions are needed to look up position embeddings. Normally, these positions are calculated automatically. But if the positions deviate for some reason, they can be provided through this argument.

    Shape: (batch_size, seq_len)

  • store_cache (bool) – Whether to cache the key/value representations for future reuse.

Return type:

ModelOutputWithCache[KeyValueCache]

Returns:

Decoder output with key/value cache.

classmethod from_hf_config(*, hf_config, device=None)

Create the module from a JSON-deserialized Hugging Face model configuration.

Parameters:
  • hf_config (Any) – Hugging Face model configuration.

  • device (Optional[device]) – Device on which to initialize the model.

Return type:

TypeVar(Self, bound= GPTNeoXDecoder)

Returns:

Module constructed using the configuration.

class curated_transformers.models.LlamaDecoder(config, *, device=None)

Bases: TransformerDecoder[LlamaConfig], FromHFHub

Llama (Touvron et al., 2023 [a], Touvron et al., 2023 [b]) decoder.

Construct a Llama decoder.

Parameters:
  • config (LlamaConfig) – Decoder configuration.

  • device (Optional[device]) – Device to which the module is to be moved.

Returns:

The decoder.

property config: ConfigT

Returns the model’s configuration.

classmethod convert_hf_state_dict(params)

Convert a state dict of a Hugging Face model to a valid state dict for the module.

Parameters:

params (Mapping[str, Tensor]) – The state dict to convert.

Returns:

The converted state dict.

forward(piece_ids, attention_mask, *, cache=None, positions=None, store_cache=False)

Apply the decoder to the given piece identifiers.

Parameters:
  • piece_ids (Tensor) –

    Piece identifiers to apply the decoder to.

    Shape: (batch_size, seq_len)

  • attention_mask (AttentionMask) – Attention mask. Sequence elements for which the corresponding mask element is set to False are ignored during attention calculation.

  • cache (Optional[List[KeyValueCache]]) – Key/value cache to avoid recomputing key/value representations for tokens that were previously seen.

  • positions (Optional[Tensor]) –

    Input positions. Positions are needed to look up position embeddings. Normally, these positions are calculated automatically. But if the positions deviate for some reason, they can be provided through this argument.

    Shape: (batch_size, seq_len)

  • store_cache (bool) – Whether to cache the key/value representations for future reuse.

Return type:

ModelOutputWithCache[KeyValueCache]

Returns:

Decoder output with key/value cache.

classmethod from_hf_config(*, hf_config, device=None)

Create the module from a JSON-deserialized Hugging Face model configuration.

Parameters:
  • hf_config (Any) – Hugging Face model configuration.

  • device (Optional[device]) – Device on which to initialize the model.

Return type:

TypeVar(Self, bound= LlamaDecoder)

Returns:

Module constructed using the configuration.

class curated_transformers.models.MPTDecoder(config, *, device=None)

Bases: TransformerDecoder[MPTConfig], FromHFHub

MosaicML MPT decoder.

Construct an MPT decoder.

Parameters:
  • config (MPTConfig) – Decoder configuration.

  • device (Optional[device]) – Device to which the module is to be moved.

Returns:

The decoder.

property config: ConfigT

Returns the model’s configuration.

classmethod convert_hf_state_dict(params)

Convert a state dict of a Hugging Face model to a valid state dict for the module.

Parameters:

params (Mapping[str, Tensor]) – The state dict to convert.

Returns:

The converted state dict.

forward(piece_ids, attention_mask, *, cache=None, positions=None, store_cache=False)

Apply the decoder to the given piece identifiers.

Parameters:
  • piece_ids (Tensor) –

    Piece identifiers to apply the decoder to.

    Shape: (batch_size, seq_len)

  • attention_mask (AttentionMask) – Attention mask. Sequence elements for which the corresponding mask element is set to False are ignored during attention calculation.

  • cache (Optional[List[KeyValueCache]]) – Key/value cache to avoid recomputing key/value representations for tokens that were previously seen.

  • positions (Optional[Tensor]) –

    Input positions. Positions are needed to look up position embeddings. Normally, these positions are calculated automatically. But if the positions deviate for some reason, they can be provided through this argument.

    Shape: (batch_size, seq_len)

  • store_cache (bool) – Whether to cache the key/value representations for future reuse.

Return type:

ModelOutputWithCache[KeyValueCache]

Returns:

Decoder output with key/value cache.

classmethod from_hf_config(*, hf_config, device=None)

Create the module from a JSON-deserialized Hugging Face model configuration.

Parameters:
  • hf_config (Any) – Hugging Face model configuration.

  • device (Optional[device]) – Device on which to initialize the model.

Return type:

TypeVar(Self, bound= MPTDecoder)

Returns:

Module constructed using the configuration.

Downloading

Each decoder type provides a from_hf_hub method that loads a model from Hugging Face Hub. To load a decoder without committing to a specific decoder type, use the AutoDecoder class. It also provides a from_hf_hub method, but tries to infer the correct decoder type automatically.

class curated_transformers.models.AutoDecoder

Decoder module loaded from the Hugging Face Model Hub.

classmethod from_fsspec(*, fs, model_path, fsspec_args=None, device=None, quantization_config=None)

Construct a module and load its parameters from a fsspec filesystem.

Parameters:
  • fs (AbstractFileSystem) – The filesystem to load the model from.

  • model_path (str) – The path of the model on the filesystem.

  • fsspec_args (Optional[FsspecArgs]) – Implementation-specific keyword arguments to pass to fsspec filesystem operations.

  • device (Optional[device]) – Device on which the model is initialized.

  • quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.

Return type:

TypeVar(ModelT)

Returns:

Module with the parameters loaded.

classmethod from_hf_hub(*, name, revision='main', device=None, quantization_config=None)

Construct and load a model or a generator from Hugging Face Hub.

Parameters:
  • name (str) – Model name.

  • revision (str) – Model revision.

  • device (Optional[device]) – Device on which to initialize the model.

  • quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.

Return type:

TypeVar(ModelT)

Returns:

Loaded model or generator.
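
A minimal sketch of calling from_hf_hub through a small wrapper. The wrapper load_decoder, its auto_cls parameter, the FakeAutoDecoder stand-in, and the model name "some-org/some-model" are all hypothetical. They exist only so that the call pattern can be exercised without network access or an installed library. By default the real AutoDecoder is imported lazily.

```python
from typing import Any, Optional


def load_decoder(
    name: str,
    *,
    revision: str = "main",
    device: Optional[Any] = None,
    auto_cls: Optional[Any] = None,
) -> Any:
    """Load a decoder by name and let the class infer its architecture."""
    if auto_cls is None:
        # Requires curated-transformers to be installed.
        from curated_transformers.models import AutoDecoder

        auto_cls = AutoDecoder
    # All arguments are keyword-only, matching the documented signature.
    return auto_cls.from_hf_hub(name=name, revision=revision, device=device)


class FakeAutoDecoder:
    """Hypothetical stand-in that records arguments instead of downloading."""

    @classmethod
    def from_hf_hub(cls, *, name, revision="main", device=None, quantization_config=None):
        return {"name": name, "revision": revision, "device": device}


# Placeholder model name; substitute a real Hugging Face Hub model name.
result = load_decoder("some-org/some-model", auto_cls=FakeAutoDecoder)
```

Without the auto_cls argument, the same call downloads the model and returns a decoder module.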

classmethod from_hf_hub_to_cache(*, name, revision='main')

Download the model’s weights from Hugging Face Hub into the local Hugging Face cache directory. Subsequent loading of the model will read the weights from disk. If the weights are already cached, this is a no-op.

Parameters:
  • name (str) – Model name.

  • revision (str) – Model revision.

classmethod from_repo(*, repo, device=None, quantization_config=None)

Construct and load a model or a generator from a repository.

Parameters:
  • repo – The repository to load from.

  • device (Optional[device]) – Device on which to initialize the model.

  • quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.

Return type:

DecoderModule[TransformerConfig, KeyValueCache]

Returns:

Loaded model or generator.

Configuration

Falcon

class curated_transformers.models.FalconConfig(*, attention_probs_dropout_prob=0.0, hidden_dropout_prob=0.0, hidden_width=2560, layer_norm_eps=1e-05, new_decoder_architecture=False, n_query_heads=71, n_key_value_heads=1, n_hidden_layers=32, rotary_embedding_base=10000, rotary_embedding_fraction=0.25, use_alibi=False, use_bias=False, use_parallel_attention=True, n_pieces=50280)

Falcon (Penedo et al., 2023) model configuration.

Parameters:
  • attention_probs_dropout_prob (float) – Dropout to apply after attention.

  • hidden_dropout_prob (float) – Dropout to apply to the hidden and embedding layers.

  • hidden_width (int) – Hidden width of the transformer.

  • layer_norm_eps (float) – Epsilon for layer normalization.

  • n_query_heads (int) – Number of query heads.

  • n_key_value_heads (int) – Number of key and value heads.

  • n_hidden_layers (int) – Number of hidden layers.

  • rotary_embedding_base (int) – Base that determines the rotary embedding period.

  • rotary_embedding_fraction (float) – Fraction of hidden width to apply rotary embeddings to. Must be in [0,1].

  • use_alibi (bool) – Use ALiBi linear biases in self-attention.

  • use_bias (bool) – Use bias in linear layers.

  • use_parallel_attention (bool) – Use parallel attention.

  • n_pieces (int) – Vocabulary size (number of embeddings).
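
To make the use_alibi option more concrete, here is a rough sketch of ALiBi biases as described in the ALiBi paper, restricted to head counts that are powers of two. The function names are hypothetical, and whether the library computes slopes in exactly this way is an assumption. Plain lists stand in for tensors.

```python
from typing import List


def alibi_slopes(n_heads: int) -> List[float]:
    """Per-head slopes for a power-of-two head count."""
    if n_heads <= 0 or (n_heads & (n_heads - 1)) != 0:
        raise ValueError("this sketch only handles power-of-two head counts")
    # Geometric sequence that starts at 2^(-8/n) and uses the same ratio.
    return [2.0 ** (-8.0 * (head + 1) / n_heads) for head in range(n_heads)]


def alibi_bias(slope: float, seq_len: int) -> List[List[float]]:
    """Bias added to attention scores for one head.

    Visible keys are penalized linearly by their distance to the query.
    Future keys are masked with -inf, as in causal attention.
    """
    return [
        [-slope * (query - key) if key <= query else float("-inf") for key in range(seq_len)]
        for query in range(seq_len)
    ]


slopes = alibi_slopes(8)
bias = alibi_bias(slopes[0], 3)
```

Heads with larger slopes penalize distant keys more strongly, so different heads attend over different effective ranges.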

GPT-NeoX

class curated_transformers.models.GPTNeoXConfig(*, attention_probs_dropout_prob=0.0, activation=Activation.GELU, hidden_dropout_prob=0.0, hidden_width=2560, intermediate_width=10240, layer_norm_eps=1e-05, n_positions=2048, model_max_length=2048, n_attention_heads=32, n_hidden_layers=32, rotary_embedding_base=10000, rotary_embedding_fraction=0.25, n_pieces=50280)

GPT-NeoX (Black et al., 2022) model configuration.

Parameters:
  • attention_probs_dropout_prob (float) – Dropout to apply after attention.

  • activation (Activation) – Activation used by the pointwise feed-forward layers.

  • hidden_dropout_prob (float) – Dropout to apply to the hidden and embedding layers.

  • hidden_width (int) – Hidden width of the transformer.

  • intermediate_width (int) – Intermediate width in the feed-forward layer. The non-linearity is applied in this intermediate width.

  • layer_norm_eps (float) – Epsilon for layer normalization.

  • model_max_length (int) – Maximum sequence length of the model.

  • n_attention_heads (int) – Number of attention heads.

  • n_hidden_layers (int) – Number of hidden layers.

  • rotary_embedding_base (int) – Base that determines the rotary embedding period.

  • rotary_embedding_fraction (float) – Fraction of hidden width to apply rotary embeddings to. Must be in [0,1].

  • n_pieces (int) – Vocabulary size (number of embeddings).
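
The effect of rotary_embedding_fraction and rotary_embedding_base can be worked through with a small calculation. This sketch assumes that the fraction is applied to the width of each attention head and that inverse frequencies follow the usual rotary embedding formula base^(-2i/width). Both are assumptions about the implementation, and the function names are hypothetical.

```python
from typing import List


def rotary_width(hidden_width: int, n_attention_heads: int, fraction: float) -> int:
    """Number of dimensions per head that receive rotary embeddings."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("rotary_embedding_fraction must be in [0, 1]")
    if hidden_width % n_attention_heads != 0:
        raise ValueError("hidden_width must be divisible by the number of heads")
    head_width = hidden_width // n_attention_heads
    return int(head_width * fraction)


def rotary_inverse_frequencies(width: int, base: int = 10000) -> List[float]:
    """Inverse frequencies base^(-2i/width), one per rotated pair of dimensions."""
    return [base ** (-2.0 * i / width) for i in range(width // 2)]


# Default values from the GPTNeoXConfig signature above.
width = rotary_width(2560, 32, 0.25)
inv_freqs = rotary_inverse_frequencies(width)
```

With the defaults, each head is 80 dimensions wide and 20 of them are rotated. A larger base yields smaller inverse frequencies and therefore longer periods.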

Llama

class curated_transformers.models.LlamaConfig(*, attention_probs_dropout_prob=0.0, activation=Activation.GELU, hidden_dropout_prob=0.0, hidden_width=2560, intermediate_width=10240, rms_norm_eps=1e-05, n_query_heads=32, n_hidden_layers=32, n_key_value_heads=32, rotary_embedding_base=10000, rotary_embedding_fraction=0.25, n_pieces=50280)

Llama (Touvron et al., 2023 [a], Touvron et al., 2023 [b]) model configuration.

Parameters:
  • attention_probs_dropout_prob (float) – Dropout to apply after attention.

  • activation (Activation) – Activation used by the pointwise feed-forward layers.

  • hidden_dropout_prob (float) – Dropout to apply to the hidden and embedding layers.

  • hidden_width (int) – Hidden width of the transformer.

  • intermediate_width (int) – Intermediate width in the feed-forward layer. The non-linearity is applied in this intermediate width.

  • rms_norm_eps (float) – Epsilon for RMS normalization.

  • n_query_heads (int) – Number of query heads.

  • n_hidden_layers (int) – Number of hidden layers.

  • n_key_value_heads (int) – Number of key-value heads.

  • rotary_embedding_base (int) – Base that determines the rotary embedding period.

  • rotary_embedding_fraction (float) – Fraction of hidden width to apply rotary embeddings to. Must be in [0,1].

  • n_pieces (int) – Vocabulary size (number of embeddings).
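
The relation between n_query_heads and n_key_value_heads can be sketched as a mapping from query heads to the key/value heads they share. The function name is hypothetical, and the grouping of consecutive query heads is a common convention that is assumed here rather than taken from the library.

```python
from typing import List


def key_value_head_for_query_heads(n_query_heads: int, n_key_value_heads: int) -> List[int]:
    """Map each query head to the key/value head it shares."""
    if n_query_heads % n_key_value_heads != 0:
        raise ValueError("n_query_heads must be divisible by n_key_value_heads")
    group_size = n_query_heads // n_key_value_heads
    # Consecutive query heads form a group that shares one key/value head.
    return [head // group_size for head in range(n_query_heads)]


# Grouped-query attention: several query heads per key/value head.
grouped = key_value_head_for_query_heads(8, 2)
# Equal counts give regular multi-head attention.
multi_head = key_value_head_for_query_heads(4, 4)
# A single key/value head gives multi-query attention.
multi_query = key_value_head_for_query_heads(4, 1)
```

Fewer key/value heads shrink the key/value cache, since only those heads need to be stored per token.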

MPT

class curated_transformers.models.MPTConfig(*, attention_probs_dropout_prob=0.0, activation=Activation.GELU, hidden_dropout_prob=0.0, hidden_width=4096, intermediate_width_multiplier=4, layer_norm_eps=1e-05, model_max_length=2048, n_attention_heads=32, n_hidden_layers=32, n_pieces=50432, use_bias=False)

MosaicML MPT model configuration.

Parameters:
  • attention_probs_dropout_prob (float) – Dropout to apply after attention.

  • activation (Activation) – Activation used by the pointwise feed-forward layers.

  • hidden_dropout_prob (float) – Dropout to apply to the hidden and embedding layers.

  • hidden_width (int) – Hidden width of the transformer.

  • intermediate_width_multiplier (int) – Multiplier for the intermediate width. The hidden width is multiplied by this value to get the intermediate width.

  • layer_norm_eps (float) – Epsilon for layer normalization.

  • model_max_length (int) – Maximum sequence length of the model.

  • n_attention_heads (int) – Number of attention heads.

  • n_hidden_layers (int) – Number of hidden layers.

  • n_pieces (int) – Vocabulary size (number of embeddings).

  • use_bias (bool) – Use bias in linear layers.
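
As a small worked example of intermediate_width_multiplier, the sketch below derives the intermediate width from the hidden width. ToyMPTWidths is a hypothetical helper that holds only the two width-related options and is not part of the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyMPTWidths:
    """Hypothetical helper holding only the two width-related options."""

    hidden_width: int = 4096
    intermediate_width_multiplier: int = 4

    @property
    def intermediate_width(self) -> int:
        # The feed-forward layer expands the hidden width by the multiplier.
        return self.hidden_width * self.intermediate_width_multiplier


# Default values from the MPTConfig signature above.
widths = ToyMPTWidths()
```

With the defaults, the feed-forward layer has an intermediate width of 4096 * 4 = 16384.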