Causal Language Models
Base Classes
- class curated_transformers.models.CausalLMModule(config)
Bases: Generic[ConfigT, CacheT], TransformerModule[ConfigT]
Base class for causal language model modules.
- property config: ConfigT
Returns the model’s configuration.
- abstract forward(piece_ids, attention_mask, *, cache=None, positions=None, store_cache=False)
Apply the causal language model to the given piece identifiers.
- Parameters:
  piece_ids (Tensor) – Piece identifiers to apply the decoder to. Shape: (batch_size, seq_len)
  attention_mask (AttentionMask) – Attention mask. Sequence elements for which the corresponding mask element is set to False are ignored during attention calculation.
  cache (Optional[List[CacheT]]) – Key/value cache to avoid recomputing key/value representations for tokens that were previously seen.
  positions (Optional[Tensor]) – Input positions. Positions are needed to look up rotary embeddings. Normally, these positions are calculated automatically, but if the positions deviate for some reason, they can be provided through this argument. Shape: (batch_size, seq_len)
  store_cache (bool) – Whether to cache the key/value representations for future reuse.
- Return type:
  CausalLMOutputWithCache[CacheT]
- Returns:
  Causal language model output with key/value cache.
- class curated_transformers.models.TransformerCausalLM(config)
Bases: Generic[ConfigT], CausalLMModule[ConfigT, KeyValueCache]
Transformer causal LM (Vaswani et al., 2017) base class.
This class provides an implementation of the forward method. Subclasses must set the given member attributes.
- property config: ConfigT
Returns the model’s configuration.
- forward(piece_ids, attention_mask, cache=None, positions=None, store_cache=False)
Apply the causal language model to the given piece identifiers.
- Parameters:
  piece_ids (Tensor) – Piece identifiers to apply the decoder to. Shape: (batch_size, seq_len)
  attention_mask (AttentionMask) – Attention mask. Sequence elements for which the corresponding mask element is set to False are ignored during attention calculation.
  cache (Optional[List[KeyValueCache]]) – Key/value cache to avoid recomputing key/value representations for tokens that were previously seen.
  positions (Optional[Tensor]) – Input positions. Positions are needed to look up rotary embeddings. Normally, these positions are calculated automatically, but if the positions deviate for some reason, they can be provided through this argument. Shape: (batch_size, seq_len)
  store_cache (bool) – Whether to cache the key/value representations for future reuse.
- Return type:
  CausalLMOutputWithCache[KeyValueCache]
- Returns:
  Causal language model output with key/value cache.
Architectures
These modules represent the supported causal LM architectures. Generally, every decoder-only architecture has a corresponding causal LM architecture.
- class curated_transformers.models.FalconCausalLM(config, *, device=None)
Bases: TransformerCausalLM[FalconConfig], FromHF[FalconConfig], Quantizable
Falcon (Penedo et al., 2023) causal language model.
Construct a Falcon causal LM.
- Parameters:
  config (FalconConfig) – Causal LM configuration.
  device (Optional[device]) – Device to which the module is to be moved.
- Returns:
The causal LM.
- property config: ConfigT
Returns the model’s configuration.
- classmethod config_from_hf(hf_config)
Convert a Hugging Face model configuration to the module’s configuration.
- classmethod config_to_hf(curated_config)
Convert the module’s configuration to a Hugging Face model configuration.
- Parameters:
  curated_config (FalconConfig) – The Curated Transformer model configuration.
- Returns:
  The converted Hugging Face configuration.
- forward(piece_ids, attention_mask, cache=None, positions=None, store_cache=False)
Apply the causal language model to the given piece identifiers.
- Parameters:
  piece_ids (Tensor) – Piece identifiers to apply the decoder to. Shape: (batch_size, seq_len)
  attention_mask (AttentionMask) – Attention mask. Sequence elements for which the corresponding mask element is set to False are ignored during attention calculation.
  cache (Optional[List[KeyValueCache]]) – Key/value cache to avoid recomputing key/value representations for tokens that were previously seen.
  positions (Optional[Tensor]) – Input positions. Positions are needed to look up rotary embeddings. Normally, these positions are calculated automatically, but if the positions deviate for some reason, they can be provided through this argument. Shape: (batch_size, seq_len)
  store_cache (bool) – Whether to cache the key/value representations for future reuse.
- Return type:
  CausalLMOutputWithCache[KeyValueCache]
- Returns:
  Causal language model output with key/value cache.
- classmethod from_fsspec(*, fs, model_path, fsspec_args=None, device=None, quantization_config=None)
Construct a module and load its parameters from an fsspec filesystem.
- Parameters:
  fs (AbstractFileSystem) – The filesystem to load the model from.
  model_path (str) – The path of the model on the filesystem.
  fsspec_args (Optional[FsspecArgs]) – Implementation-specific keyword arguments to pass to fsspec filesystem operations.
  device (Optional[device]) – Device on which the model is initialized.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  Self
- Returns:
  Module with the parameters loaded.
- from_fsspec_(*, fs, model_path, fsspec_args=None, device=None, quantization_config=None)
Load parameters from an fsspec filesystem in-place into the model.
- Parameters:
  fs (AbstractFileSystem) – The filesystem to load the model from.
  model_path (str) – The path of the model on the filesystem.
  fsspec_args (Optional[FsspecArgs]) – Implementation-specific keyword arguments to pass to fsspec filesystem operations.
  device (Optional[device]) – Device on which the model is initialized.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  Self
- Returns:
  Module with the parameters loaded.
- classmethod from_hf_config(*, hf_config, device=None)
Create the module from a Hugging Face model JSON-deserialized model configuration.
- classmethod from_hf_hub(*, name, revision='main', device=None, quantization_config=None)
Construct a module and load its parameters from Hugging Face Hub.
- Parameters:
  name (str) – Model name.
  revision (str) – Model revision.
  device (Optional[device]) – Device on which the model is initialized.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  Self
- Returns:
  Module with the parameters loaded.
- from_hf_hub_(*, name, revision='main', device=None, quantization_config=None)
Load parameters from Hugging Face Hub in-place into the model.
- Parameters:
  name (str) – Model name.
  revision (str) – Model revision.
  device (Optional[device]) – Device on which the model is initialized.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  Self
- Returns:
  Module with the parameters loaded.
- classmethod from_hf_hub_to_cache(*, name, revision='main')
Download the model’s weights from Hugging Face Hub into the local Hugging Face cache directory. Subsequent loading of the model will read the weights from disk. If the weights are already cached, this is a no-op.
- classmethod from_repo(*, repo, device=None, quantization_config=None)
Construct and load a model from a repository.
- Parameters:
  repository – The repository to load from.
  device (Optional[device]) – Device on which to initialize the model.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  Self
- Returns:
  Loaded model.
- from_repo_(*, repo, device=None, quantization_config=None)
Load parameters from a repository in-place into the model.
- Parameters:
  repository – The repository to load from.
  device (Optional[device]) – Device on which to initialize the model.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  Self
- Returns:
  Loaded model.
- classmethod is_supported(config)
Check if the model with the given configuration is supported by this class.
- classmethod modules_to_not_quantize()
Return a set of prefixes that specify which modules are to be ignored during quantization.
- classmethod state_dict_from_hf(params)
Convert a state dict of a Hugging Face model to a valid state dict for the module.
- class curated_transformers.models.GPTNeoXCausalLM(config, *, device=None)
Bases: TransformerCausalLM[GPTNeoXConfig], FromHF[GPTNeoXConfig], Quantizable
GPT-NeoX (Black et al., 2022) causal language model.
Construct a GPT-NeoX causal LM.
- Parameters:
  config (GPTNeoXConfig) – Causal LM configuration.
  device (Optional[device]) – Device to which the module is to be moved.
- Returns:
The causal LM.
- property config: ConfigT
Returns the model’s configuration.
- classmethod config_from_hf(hf_config)
Convert a Hugging Face model configuration to the module’s configuration.
- classmethod config_to_hf(curated_config)
Convert the module’s configuration to a Hugging Face model configuration.
- Parameters:
  curated_config (GPTNeoXConfig) – The Curated Transformer model configuration.
- Returns:
  The converted Hugging Face configuration.
- forward(piece_ids, attention_mask, cache=None, positions=None, store_cache=False)
Apply the causal language model to the given piece identifiers.
- Parameters:
  piece_ids (Tensor) – Piece identifiers to apply the decoder to. Shape: (batch_size, seq_len)
  attention_mask (AttentionMask) – Attention mask. Sequence elements for which the corresponding mask element is set to False are ignored during attention calculation.
  cache (Optional[List[KeyValueCache]]) – Key/value cache to avoid recomputing key/value representations for tokens that were previously seen.
  positions (Optional[Tensor]) – Input positions. Positions are needed to look up rotary embeddings. Normally, these positions are calculated automatically, but if the positions deviate for some reason, they can be provided through this argument. Shape: (batch_size, seq_len)
  store_cache (bool) – Whether to cache the key/value representations for future reuse.
- Return type:
  CausalLMOutputWithCache[KeyValueCache]
- Returns:
  Causal language model output with key/value cache.
- classmethod from_fsspec(*, fs, model_path, fsspec_args=None, device=None, quantization_config=None)
Construct a module and load its parameters from an fsspec filesystem.
- Parameters:
  fs (AbstractFileSystem) – The filesystem to load the model from.
  model_path (str) – The path of the model on the filesystem.
  fsspec_args (Optional[FsspecArgs]) – Implementation-specific keyword arguments to pass to fsspec filesystem operations.
  device (Optional[device]) – Device on which the model is initialized.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  Self
- Returns:
  Module with the parameters loaded.
- from_fsspec_(*, fs, model_path, fsspec_args=None, device=None, quantization_config=None)
Load parameters from an fsspec filesystem in-place into the model.
- Parameters:
  fs (AbstractFileSystem) – The filesystem to load the model from.
  model_path (str) – The path of the model on the filesystem.
  fsspec_args (Optional[FsspecArgs]) – Implementation-specific keyword arguments to pass to fsspec filesystem operations.
  device (Optional[device]) – Device on which the model is initialized.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  Self
- Returns:
  Module with the parameters loaded.
- classmethod from_hf_config(*, hf_config, device=None)
Create the module from a Hugging Face model JSON-deserialized model configuration.
- classmethod from_hf_hub(*, name, revision='main', device=None, quantization_config=None)
Construct a module and load its parameters from Hugging Face Hub.
- Parameters:
  name (str) – Model name.
  revision (str) – Model revision.
  device (Optional[device]) – Device on which the model is initialized.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  Self
- Returns:
  Module with the parameters loaded.
- from_hf_hub_(*, name, revision='main', device=None, quantization_config=None)
Load parameters from Hugging Face Hub in-place into the model.
- Parameters:
  name (str) – Model name.
  revision (str) – Model revision.
  device (Optional[device]) – Device on which the model is initialized.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  Self
- Returns:
  Module with the parameters loaded.
- classmethod from_hf_hub_to_cache(*, name, revision='main')
Download the model’s weights from Hugging Face Hub into the local Hugging Face cache directory. Subsequent loading of the model will read the weights from disk. If the weights are already cached, this is a no-op.
- classmethod from_repo(*, repo, device=None, quantization_config=None)
Construct and load a model from a repository.
- Parameters:
  repository – The repository to load from.
  device (Optional[device]) – Device on which to initialize the model.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  Self
- Returns:
  Loaded model.
- from_repo_(*, repo, device=None, quantization_config=None)
Load parameters from a repository in-place into the model.
- Parameters:
  repository – The repository to load from.
  device (Optional[device]) – Device on which to initialize the model.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  Self
- Returns:
  Loaded model.
- classmethod is_supported(config)
Check if the model with the given configuration is supported by this class.
- classmethod modules_to_not_quantize()
Return a set of prefixes that specify which modules are to be ignored during quantization.
- classmethod state_dict_from_hf(params)
Convert a state dict of a Hugging Face model to a valid state dict for the module.
- class curated_transformers.models.LlamaCausalLM(config, *, device=None)
Bases: TransformerCausalLM[LlamaConfig], FromHF[LlamaConfig], Quantizable
Llama (Touvron et al., 2023 [a], Touvron et al., 2023 [b]) causal language model.
Construct a Llama causal LM.
- Parameters:
  config (LlamaConfig) – Causal LM configuration.
  device (Optional[device]) – Device to which the module is to be moved.
- Returns:
The causal LM.
- property config: ConfigT
Returns the model’s configuration.
- classmethod config_from_hf(hf_config)
Convert a Hugging Face model configuration to the module’s configuration.
- classmethod config_to_hf(curated_config)
Convert the module’s configuration to a Hugging Face model configuration.
- Parameters:
  curated_config (LlamaConfig) – The Curated Transformer model configuration.
- Returns:
  The converted Hugging Face configuration.
- forward(piece_ids, attention_mask, cache=None, positions=None, store_cache=False)
Apply the causal language model to the given piece identifiers.
- Parameters:
  piece_ids (Tensor) – Piece identifiers to apply the decoder to. Shape: (batch_size, seq_len)
  attention_mask (AttentionMask) – Attention mask. Sequence elements for which the corresponding mask element is set to False are ignored during attention calculation.
  cache (Optional[List[KeyValueCache]]) – Key/value cache to avoid recomputing key/value representations for tokens that were previously seen.
  positions (Optional[Tensor]) – Input positions. Positions are needed to look up rotary embeddings. Normally, these positions are calculated automatically, but if the positions deviate for some reason, they can be provided through this argument. Shape: (batch_size, seq_len)
  store_cache (bool) – Whether to cache the key/value representations for future reuse.
- Return type:
  CausalLMOutputWithCache[KeyValueCache]
- Returns:
  Causal language model output with key/value cache.
- classmethod from_fsspec(*, fs, model_path, fsspec_args=None, device=None, quantization_config=None)
Construct a module and load its parameters from an fsspec filesystem.
- Parameters:
  fs (AbstractFileSystem) – The filesystem to load the model from.
  model_path (str) – The path of the model on the filesystem.
  fsspec_args (Optional[FsspecArgs]) – Implementation-specific keyword arguments to pass to fsspec filesystem operations.
  device (Optional[device]) – Device on which the model is initialized.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  Self
- Returns:
  Module with the parameters loaded.
- from_fsspec_(*, fs, model_path, fsspec_args=None, device=None, quantization_config=None)
Load parameters from an fsspec filesystem in-place into the model.
- Parameters:
  fs (AbstractFileSystem) – The filesystem to load the model from.
  model_path (str) – The path of the model on the filesystem.
  fsspec_args (Optional[FsspecArgs]) – Implementation-specific keyword arguments to pass to fsspec filesystem operations.
  device (Optional[device]) – Device on which the model is initialized.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  Self
- Returns:
  Module with the parameters loaded.
- classmethod from_hf_config(*, hf_config, device=None)
Create the module from a Hugging Face model JSON-deserialized model configuration.
- classmethod from_hf_hub(*, name, revision='main', device=None, quantization_config=None)
Construct a module and load its parameters from Hugging Face Hub.
- Parameters:
  name (str) – Model name.
  revision (str) – Model revision.
  device (Optional[device]) – Device on which the model is initialized.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  Self
- Returns:
  Module with the parameters loaded.
- from_hf_hub_(*, name, revision='main', device=None, quantization_config=None)
Load parameters from Hugging Face Hub in-place into the model.
- Parameters:
  name (str) – Model name.
  revision (str) – Model revision.
  device (Optional[device]) – Device on which the model is initialized.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  Self
- Returns:
  Module with the parameters loaded.
- classmethod from_hf_hub_to_cache(*, name, revision='main')
Download the model’s weights from Hugging Face Hub into the local Hugging Face cache directory. Subsequent loading of the model will read the weights from disk. If the weights are already cached, this is a no-op.
- classmethod from_repo(*, repo, device=None, quantization_config=None)
Construct and load a model from a repository.
- Parameters:
  repository – The repository to load from.
  device (Optional[device]) – Device on which to initialize the model.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  Self
- Returns:
  Loaded model.
- from_repo_(*, repo, device=None, quantization_config=None)
Load parameters from a repository in-place into the model.
- Parameters:
  repository – The repository to load from.
  device (Optional[device]) – Device on which to initialize the model.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  Self
- Returns:
  Loaded model.
- classmethod is_supported(config)
Check if the model with the given configuration is supported by this class.
- classmethod modules_to_not_quantize()
Return a set of prefixes that specify which modules are to be ignored during quantization.
- classmethod state_dict_from_hf(params)
Convert a state dict of a Hugging Face model to a valid state dict for the module.
- class curated_transformers.models.MPTCausalLM(config, *, device=None)
Bases: TransformerCausalLM[MPTConfig], FromHF[MPTConfig], Quantizable
MosaicML MPT causal language model.
Construct an MPT causal LM.
- Parameters:
  config (MPTConfig) – Causal LM configuration.
  device (Optional[device]) – Device to which the module is to be moved.
- Returns:
The causal LM.
- property config: ConfigT
Returns the model’s configuration.
- classmethod config_from_hf(hf_config)
Convert a Hugging Face model configuration to the module’s configuration.
- classmethod config_to_hf(curated_config)
Convert the module’s configuration to a Hugging Face model configuration.
- forward(piece_ids, attention_mask, cache=None, positions=None, store_cache=False)
Apply the causal language model to the given piece identifiers.
- Parameters:
  piece_ids (Tensor) – Piece identifiers to apply the decoder to. Shape: (batch_size, seq_len)
  attention_mask (AttentionMask) – Attention mask. Sequence elements for which the corresponding mask element is set to False are ignored during attention calculation.
  cache (Optional[List[KeyValueCache]]) – Key/value cache to avoid recomputing key/value representations for tokens that were previously seen.
  positions (Optional[Tensor]) – Input positions. Positions are needed to look up rotary embeddings. Normally, these positions are calculated automatically, but if the positions deviate for some reason, they can be provided through this argument. Shape: (batch_size, seq_len)
  store_cache (bool) – Whether to cache the key/value representations for future reuse.
- Return type:
  CausalLMOutputWithCache[KeyValueCache]
- Returns:
  Causal language model output with key/value cache.
- classmethod from_fsspec(*, fs, model_path, fsspec_args=None, device=None, quantization_config=None)
Construct a module and load its parameters from an fsspec filesystem.
- Parameters:
  fs (AbstractFileSystem) – The filesystem to load the model from.
  model_path (str) – The path of the model on the filesystem.
  fsspec_args (Optional[FsspecArgs]) – Implementation-specific keyword arguments to pass to fsspec filesystem operations.
  device (Optional[device]) – Device on which the model is initialized.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  Self
- Returns:
  Module with the parameters loaded.
- from_fsspec_(*, fs, model_path, fsspec_args=None, device=None, quantization_config=None)
Load parameters from an fsspec filesystem in-place into the model.
- Parameters:
  fs (AbstractFileSystem) – The filesystem to load the model from.
  model_path (str) – The path of the model on the filesystem.
  fsspec_args (Optional[FsspecArgs]) – Implementation-specific keyword arguments to pass to fsspec filesystem operations.
  device (Optional[device]) – Device on which the model is initialized.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  Self
- Returns:
  Module with the parameters loaded.
- classmethod from_hf_config(*, hf_config, device=None)
Create the module from a Hugging Face model JSON-deserialized model configuration.
- classmethod from_hf_hub(*, name, revision='main', device=None, quantization_config=None)
Construct a module and load its parameters from Hugging Face Hub.
- Parameters:
  name (str) – Model name.
  revision (str) – Model revision.
  device (Optional[device]) – Device on which the model is initialized.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  Self
- Returns:
  Module with the parameters loaded.
- from_hf_hub_(*, name, revision='main', device=None, quantization_config=None)
Load parameters from Hugging Face Hub in-place into the model.
- Parameters:
  name (str) – Model name.
  revision (str) – Model revision.
  device (Optional[device]) – Device on which the model is initialized.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  Self
- Returns:
  Module with the parameters loaded.
- classmethod from_hf_hub_to_cache(*, name, revision='main')
Download the model’s weights from Hugging Face Hub into the local Hugging Face cache directory. Subsequent loading of the model will read the weights from disk. If the weights are already cached, this is a no-op.
- classmethod from_repo(*, repo, device=None, quantization_config=None)
Construct and load a model from a repository.
- Parameters:
  repository – The repository to load from.
  device (Optional[device]) – Device on which to initialize the model.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  Self
- Returns:
  Loaded model.
- from_repo_(*, repo, device=None, quantization_config=None)
Load parameters from a repository in-place into the model.
- Parameters:
  repository – The repository to load from.
  device (Optional[device]) – Device on which to initialize the model.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  Self
- Returns:
  Loaded model.
- classmethod is_supported(config)
Check if the model with the given configuration is supported by this class.
- classmethod modules_to_not_quantize()
Return a set of prefixes that specify which modules are to be ignored during quantization.
- classmethod state_dict_from_hf(params)
Convert a state dict of a Hugging Face model to a valid state dict for the module.
Downloading
Each causal LM type provides a from_hf_hub function that will load a model from Hugging Face Hub. If you want to load a causal LM without committing to a specific causal LM type, you can use the AutoCausalLM class. This class also provides a from_hf_hub method but will try to infer the correct type automatically.
- class curated_transformers.models.AutoCausalLM
Causal LM model loaded from the Hugging Face Model Hub.
- classmethod from_fsspec(*, fs, model_path, fsspec_args=None, device=None, quantization_config=None)
Construct a module and load its parameters from an fsspec filesystem.
- Parameters:
  fs (AbstractFileSystem) – The filesystem to load the model from.
  model_path (str) – The path of the model on the filesystem.
  fsspec_args (Optional[FsspecArgs]) – Implementation-specific keyword arguments to pass to fsspec filesystem operations.
  device (Optional[device]) – Device on which the model is initialized.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  ModelT
- Returns:
  Module with the parameters loaded.
- classmethod from_hf_hub(*, name, revision='main', device=None, quantization_config=None)
Construct and load a model or a generator from Hugging Face Hub.
- Parameters:
  name (str) – Model name.
  revision (str) – Model revision.
  device (Optional[device]) – Device on which to initialize the model.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Return type:
  ModelT
- Returns:
  Loaded model or generator.
- classmethod from_hf_hub_to_cache(*, name, revision='main')
Download the model’s weights from Hugging Face Hub into the local Hugging Face cache directory. Subsequent loading of the model will read the weights from disk. If the weights are already cached, this is a no-op.
- classmethod from_repo(*, repo, device=None, quantization_config=None)
Construct and load a model or a generator from a repository.
- Parameters:
  repository – The repository to load from.
  device (Optional[device]) – Device on which to initialize the model.
  quantization_config (Optional[BitsAndBytesConfig]) – Configuration for loading quantized weights.
- Returns:
  Loaded model or generator.
Caching
Causal language models apply causal attention, meaning that the attention mechanism only attends to preceding pieces. So, when the model predicts the next piece, the attention and hidden representations of the pieces before it do not change. This means we can avoid recomputing hidden representations of already-seen pieces by caching them. This allows us to generate text in \(\mathcal{O}(n^2)\) time rather than \(\mathcal{O}(n^3)\).
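The complexity claim above can be made concrete with a small cost model. This is an illustration only, not the library's implementation: it counts how many piece representations are computed during generation, where each such computation is itself \(\mathcal{O}(n)\) because of attention over the preceding pieces.

```python
# Toy cost model for key/value caching during autoregressive generation.
# Without a cache, step t recomputes the representations of all t pieces
# seen so far; with a cache, only the newest piece needs computing.

def generation_cost(n_pieces, use_cache):
    computed = 0
    for step in range(1, n_pieces + 1):
        computed += 1 if use_cache else step
    return computed

print(generation_cost(16, use_cache=False))  # 136 (~n^2/2 computations)
print(generation_cost(16, use_cache=True))   # 16 (exactly n computations)
```

Multiplying each computation's own \(\mathcal{O}(n)\) attention cost by these counts gives the \(\mathcal{O}(n^3)\) versus \(\mathcal{O}(n^2)\) totals stated above.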
Caching works by calling the causal language model with the store_cache argument. The model will then return the cached representations as part of its output. The cached representations can then be passed in the next call to the language model with the cache argument:
cache = None
while not_done:
    ...
    # Reuse representations from previous steps and store the
    # updated cache for the next step.
    output = lm(..., cache=cache, store_cache=True)
    cache = output.cache
    ...