Building Blocks
Curated Transformers provides building blocks to create your own transformer models.
Embedding Layers
These modules implement full embedding layers.
- class curated_transformers.layers.EmbeddingDropouts(embed_output_dropout=Identity(), proj_output_dropout=Identity())
Dropouts used in a transformer embedding layer.
By default, all dropouts are disabled by setting them to the Torch Identity module. Therefore, only the dropouts that are needed have to be set.
- Parameters:
embed_output_dropout (Module) – Dropout of the embeddings.
proj_output_dropout (Module) – Dropout of the output of the projection layer.
- class curated_transformers.layers.EmbeddingLayerNorms(embed_output_layer_norm=Identity(), proj_output_layer_norm=Identity())
Layer normalizations used in a transformer embedding layer.
By default, all normalizations are disabled by setting them to the Torch Identity module. Therefore, only the normalizations that are needed have to be set.
- Parameters:
embed_output_layer_norm (Module) – Normalization of the embeddings.
proj_output_layer_norm (Module) – Normalization of the output of the projection layer.
- class curated_transformers.layers.TransformerEmbeddings(*, dropouts, embedding_width, hidden_width, layer_norms, n_pieces, n_positions, n_types, device=None)
Transformer embeddings layer.
This is a generic transformer embedding layer. The layer always has piece embeddings and can optionally have position embeddings, type embeddings, and a projection of embeddings to the model’s hidden size.
Construct an embeddings layer.
- Parameters:
dropouts (EmbeddingDropouts) – Dropouts to use in the embeddings layer.
embedding_width (int) – Width of the embeddings.
hidden_width (int) – Hidden width of the transformer. If this width differs from embedding_width, a projection layer is added to ensure that the output of the embeddings layer has the same width as the transformer.
layer_norms (EmbeddingLayerNorms) – Layer norms to use in the embeddings layer.
n_pieces (int) – Number of piece embeddings.
n_positions (Optional[int]) – Number of position embeddings. Position embeddings are disabled by using None. Position embeddings can be used to inform the model of input order.
n_types (Optional[int]) – Number of type embeddings. Type embeddings are disabled by using None. Type embeddings can be used to inform the model of the spans of different sequences in the input.
device (Optional[device]) – Device on which the module is to be initialized.
- forward(piece_ids, *, positions=None, type_ids=None)
Apply the embedding layer to the piece identifiers.
- Parameters:
piece_ids (Tensor) – Piece identifiers to embed.
Shape:
(batch_size, seq_len)
positions (Optional[Tensor]) – Positions with which to fetch the positional embeddings for the sequences.
Shape:
(batch_size, seq_len)
type_ids (Optional[Tensor]) – Type identifiers to indicate the spans of different sequences in the input. Useful when performing tasks like sequence classification and question answering.
Shape:
(batch_size, seq_len)
- Return type:
Tensor
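The behaviour described for this layer (piece embeddings, optional position and type embeddings, and an optional projection to the hidden width) can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: the function `embed` and its table arguments are hypothetical, and dropout and layer normalization are left out.

```python
from typing import List, Optional

Matrix = List[List[float]]


def embed(
    piece_ids: List[int],
    piece_table: Matrix,
    position_table: Optional[Matrix] = None,
    type_table: Optional[Matrix] = None,
    type_ids: Optional[List[int]] = None,
    projection: Optional[Matrix] = None,
) -> Matrix:
    """Embed one sequence. All names here are hypothetical."""
    output = []
    for position, piece_id in enumerate(piece_ids):
        vector = list(piece_table[piece_id])
        if position_table is not None:
            vector = [a + b for a, b in zip(vector, position_table[position])]
        if type_table is not None:
            # Assumption: missing type ids default to type 0.
            type_id = type_ids[position] if type_ids is not None else 0
            vector = [a + b for a, b in zip(vector, type_table[type_id])]
        if projection is not None:
            # projection has shape (embedding_width, hidden_width).
            vector = [
                sum(v * row[j] for v, row in zip(vector, projection))
                for j in range(len(projection[0]))
            ]
        output.append(vector)
    return output


pieces = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
positions = [[0.1, 0.1], [0.2, 0.2]]
projection = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
embedded = embed([2, 0], pieces, position_table=positions, projection=projection)
```

Here the embedding width is 2 and the hidden width is 3, so the projection changes the width of each output vector from 2 to 3.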
Encoder/Decoder Layers
These modules implement full encoder/decoder layers.
- class curated_transformers.layers.TransformerDropouts(attn_output_dropout=Identity(), ffn_output_dropout=Identity(), parallel_attn_dropout=Identity())
Dropouts used in a transformer layer.
By default, all dropouts are disabled by setting them to the Torch Identity module. Therefore, only the dropouts that are needed have to be set.
- Parameters:
attn_output_dropout (Module) – Dropout of the output of the attention layer.
ffn_output_dropout (Module) – Dropout of the output of the feed-forward layer.
parallel_attn_dropout (Module) – Dropout after summing the attention and feed-forward layers. Only used when parallel attention is enabled.
- classmethod layer_output_dropouts(p)
Utility method to construct attention and feed-forward layer dropouts.
- Parameters:
p (float) – Dropout probability.
- Return type:
- Returns:
Dropouts of attention and feed-forward layers set to p.
- class curated_transformers.layers.TransformerLayerNorms(attn_input_layer_norm=Identity(), attn_residual_layer_norm=Identity(), ffn_input_layer_norm=Identity(), ffn_residual_layer_norm=Identity())
Layer normalizations used in a transformer layer.
By default, all normalizations are disabled by setting them to the Torch Identity module. Therefore, only the normalizations that are needed have to be set.
- Parameters:
attn_input_layer_norm (Module) – Normalization of the input to the attention layer.
attn_residual_layer_norm (Module) – Normalization of the output of the attention layer after the residual connection.
ffn_input_layer_norm (Module) – Normalization of the input to the feed-forward layer.
ffn_residual_layer_norm (Module) – Normalization of the output of the feed-forward layer after the residual connection.
- class curated_transformers.layers.DecoderLayer(*, attention_layer, dropouts, feed_forward_layer, layer_norms, use_parallel_attention)
Transformer decoder layer (Vaswani et al., 2017).
Construct a transformer layer.
- Parameters:
attention_layer (SelfAttention) – The attention layer to use in the transformer layer.
dropouts (TransformerDropouts) – Dropouts to use in the transformer layer.
feed_forward_layer (PointwiseFeedForward) – The pointwise feed-forward layer to use in the transformer layer.
layer_norms (TransformerLayerNorms) – Layer norms to use in the layer.
use_parallel_attention (bool) – Use parallel attention.
- forward(input, attention_mask, *, cache=None, positions=None, store_cache=False)
Apply the decoder layer to the given piece hidden representations.
- Parameters:
input (Tensor) – Hidden representations to apply the layer to.
Shape:
(batch_size, seq_len, width)
attention_mask (AttentionMask) – Attention mask. Sequence elements for which the corresponding mask element is set to False are ignored during attention calculation.
cache (Optional[KeyValueCache]) – Key/value cache to avoid recomputing key/value representations for tokens that were previously seen.
positions (Optional[Tensor]) – Input positions. Positions are needed to look up rotary embeddings. Normally, these positions are calculated automatically. But if the positions deviate for some reason, they can be provided through this argument.
store_cache (bool) – Whether to cache the key/value representations for future reuse.
- Return type:
Tuple[Tensor, Optional[KeyValueCache]]
- Returns:
Layer output and the key/value cache.
Shape:
(batch_size, seq_len, width)
- class curated_transformers.layers.EncoderLayer(*, attention_layer, dropouts, feed_forward_layer, layer_norms, use_parallel_attention)
Transformer encoder layer (Vaswani et al., 2017).
Construct a transformer layer.
- Parameters:
attention_layer (SelfAttention) – The attention layer to use in the transformer layer.
dropouts (TransformerDropouts) – Dropouts to use in the transformer layer.
feed_forward_layer (PointwiseFeedForward) – The pointwise feed-forward layer to use in the transformer layer.
layer_norms (TransformerLayerNorms) – Layer norms to use in the layer.
use_parallel_attention (bool) – Use parallel attention.
- forward(input, attention_mask)
Apply the encoder layer to the given piece hidden representations.
- Parameters:
input (Tensor) – Hidden representations to apply the layer to.
Shape:
(batch_size, seq_len, width)
attention_mask (AttentionMask) – Attention mask. Sequence elements for which the corresponding mask element is set to False are ignored during attention calculation.
- Return type:
Tuple[Tensor, Optional[KeyValueCache]]
- Returns:
Layer output and the key/value cache.
Shape:
(batch_size, seq_len, width)
Attention
These modules and their helper classes implement the Transformer attention mechanism.
- class curated_transformers.layers.QkvMode(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
How the query, key and value projections are handled in the self-attention layer.
- MERGED_SPLIT_AFTER = 2
Use a merged projection for query, key and value, and split heads after splitting the query, key and value representations.
- MERGED_SPLIT_BEFORE = 1
Use a merged projection for query, key and value, and split heads before splitting the query, key and value representations. This ordering is incompatible with head sharing in keys or values.
- SEPARATE = 0
Use separate projections for query, key and value.
- class curated_transformers.layers.QkvSplit
Query, key, value splitting strategies.
After the input projection of the attention layer, we have an array with shape (batch_size, seq_len, n_heads * head_width), where n_heads is the sum of the number of query, key, and value heads. We need to split up the array into separate arrays for query, key, and value heads.
Subclasses of this class implement different splitting strategies.
- abstract split(*, projection, head_width, n_query_heads, n_key_value_heads)
Split attention heads in the projection into query, key, and value heads.
- Parameters:
- Return type:
Tuple[Tensor, Tensor, Tensor]
- Returns:
Query, key, value tensors.
Shapes:
Query:
(batch_size, n_query_heads, seq_len, head_width)
Key:
(batch_size, n_key_value_heads, seq_len, head_width)
Value:
(batch_size, n_key_value_heads, seq_len, head_width)
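As an illustration of what a splitting strategy has to do, the sketch below splits the merged projection of a single token into per-head query, key, and value vectors. It is a plain-Python sketch, not a subclass of QkvSplit. The layout it assumes (all query heads first, then all key heads, then all value heads) is only one possible strategy, and the function name `split_heads` is hypothetical.

```python
from typing import List, Tuple

Heads = List[List[float]]


def split_heads(
    projection: List[float],
    *,
    head_width: int,
    n_query_heads: int,
    n_key_value_heads: int,
) -> Tuple[Heads, Heads, Heads]:
    """Split one token's merged projection into query, key, and value heads."""
    n_heads = n_query_heads + 2 * n_key_value_heads
    if len(projection) != n_heads * head_width:
        raise ValueError("projection width must equal n_heads * head_width")
    heads = [
        projection[i * head_width : (i + 1) * head_width] for i in range(n_heads)
    ]
    # Assumed layout: query heads, then key heads, then value heads.
    query = heads[:n_query_heads]
    key = heads[n_query_heads : n_query_heads + n_key_value_heads]
    value = heads[n_query_heads + n_key_value_heads :]
    return query, key, value


merged = [float(i) for i in range(12)]
query, key, value = split_heads(
    merged, head_width=2, n_query_heads=4, n_key_value_heads=1
)
```

With four query heads and one key/value head, the twelve projected values are divided into four query heads, one key head, and one value head of width 2 each.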
- class curated_transformers.layers.AttentionHeads(*, n_query_heads, n_key_value_heads, qkv_split)
Construct an attention head configuration. This constructor must not be used directly; its signature may change even within a semver version. Use the factory methods instead.
- Parameters:
n_query_heads (int) – Number of query heads.
n_key_value_heads (int) – Number of key/value heads.
qkv_split (QkvSplit) – How query, key, and value should be split when using MERGED_SPLIT_AFTER. Not used for other query, key, value modes.
- classmethod key_value_broadcast(*, n_query_heads, n_key_value_heads, qkv_split)
Construct a head configuration where query has a larger number of heads than key and value. Key/value heads are broadcast to correspond to the number of query heads.
- Parameters:
n_query_heads (int) – Number of attention heads. Must be a multiple of n_key_value_heads.
n_key_value_heads (int) – Number of key and value heads.
qkv_split (QkvSplit) – How query, key, and value should be split when using MERGED_SPLIT_AFTER. Not used for other query, key, value modes.
- Return type:
- classmethod multi_query(n_query_heads, qkv_split)
Construct a multi-query attention configuration: key has one head, value has one head, query has n_query_heads heads (Shazeer et al., 2019). The key head and the value head are broadcast to the shape of the query.
- Parameters:
n_query_heads (int) – Number of query heads.
qkv_split (QkvSplit) – How query, key, and value should be split when using MERGED_SPLIT_AFTER. Not used for other query, key, value modes.
- Return type:
- classmethod uniform(n_attention_heads, qkv_split)
Construct a head configuration where query, key, and value have the same number of attention heads.
- Parameters:
n_attention_heads (int) – Number of attention heads.
qkv_split (QkvSplit) – How query, key, and value should be split when using MERGED_SPLIT_AFTER. Not used for other query, key, value modes.
- Return type:
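The three factory methods differ only in how many key/value heads serve the query heads. The sketch below shows one way to map a query head to the key/value head it would attend with after broadcasting. The helper `key_value_head_for` is hypothetical, and the grouping order (consecutive query heads share a key/value head) is an assumption rather than something the reference states.

```python
def key_value_head_for(
    query_head: int, *, n_query_heads: int, n_key_value_heads: int
) -> int:
    """Index of the key/value head used by a query head after broadcasting."""
    if n_query_heads % n_key_value_heads != 0:
        raise ValueError("n_query_heads must be a multiple of n_key_value_heads")
    group_size = n_query_heads // n_key_value_heads
    # Assumption: consecutive query heads share one key/value head.
    return query_head // group_size


uniform = [
    key_value_head_for(h, n_query_heads=4, n_key_value_heads=4) for h in range(4)
]
broadcast = [
    key_value_head_for(h, n_query_heads=4, n_key_value_heads=2) for h in range(4)
]
multi_query = [
    key_value_head_for(h, n_query_heads=4, n_key_value_heads=1) for h in range(4)
]
```

The uniform case gives every query head its own key/value head, the broadcast case shares each key/value head between two query heads, and the multi-query case shares a single key/value head among all query heads.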
- class curated_transformers.layers.AttentionMask(bool_mask)
Mask for attention calculation. Sequence elements for which the corresponding mask element is set to False are ignored during attention calculation.
- Parameters:
bool_mask (Tensor) – The boolean mask.
- apply_logit_mask(input)
Use the attention mask to mask attention logits.
- Parameters:
input (Tensor) – Attention logits to apply the mask to.
Shape:
(batch_size, heads, query_len, key_len)
- Return type:
Tensor
- Returns:
Logits with the attention mask applied.
Shape:
(batch_size, heads, query_len, key_len)
- property device: device
Return the device of the mask.
- extend_length(count, fill_value)
Extend the attention mask in the sequence length dimension by the given value.
- Parameters:
- Return type:
- Returns:
Extended mask.
- filter_batch_items(mask)
Filter batch sequences from the attention mask.
Sequences for which the mask is True are retained.
- Parameters:
mask (Tensor) – Mask of batch items to retain.
Shape:
(batch_size,)
- Return type:
- Returns:
Filtered mask.
- logit_mask(dtype)
Generate the logit mask for the given dtype.
Elements of the mask that are False are set to the minimum value of the dtype and the rest to zero. During softmax calculation, adding this mask to the logits will result in (near-)zero probabilities for the elements that are False.
- Parameters:
dtype (dtype) – Data type of the logit mask.
- Return type:
Tensor
- Returns:
Logit mask.
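The effect of adding a logit mask before the softmax can be illustrated without tensors. The sketch below uses Python floats as a stand-in for a tensor dtype, so "minimum value of the dtype" becomes the most negative finite float. The helper names are hypothetical and this is not the library's code.

```python
import math
import sys
from typing import List


def logit_mask(bool_mask: List[bool]) -> List[float]:
    """False becomes the most negative finite float, True becomes zero."""
    minimum = -sys.float_info.max
    return [0.0 if keep else minimum for keep in bool_mask]


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 3.0]
mask = logit_mask([True, True, False])
probs = softmax([logit + m for logit, m in zip(logits, mask)])
```

The third element has the largest logit, but because its mask entry is False it receives zero probability and the remaining probability mass is shared by the first two elements.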
- merge_mask(other)
Merge this attention mask with another attention mask.
- Parameters:
other (AttentionMask) – Attention mask to merge.
- Return type:
- Returns:
Merged mask.
- property shape: Size
Return the shape of the mask.
- class curated_transformers.layers.KeyValueCache(key, value)
Cache type for layers that cache keys and values.
- Parameters:
key (Tensor) – Key.
value (Tensor) – Value.
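The purpose of such a cache during incremental decoding can be sketched as follows. The class `ToyKeyValueCache` and the function `extend_cache` are hypothetical stand-ins that store one list entry per token. That new keys and values are appended along the sequence dimension is the usual approach and is assumed here, not taken from the reference.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyKeyValueCache:
    """Hypothetical stand-in for a key/value cache, one entry per token."""

    key: List[List[float]]
    value: List[List[float]]


def extend_cache(
    cache: Optional[ToyKeyValueCache],
    new_key: List[List[float]],
    new_value: List[List[float]],
) -> ToyKeyValueCache:
    """Append representations of new tokens to previously seen ones."""
    if cache is None:
        return ToyKeyValueCache(key=list(new_key), value=list(new_value))
    return ToyKeyValueCache(
        key=cache.key + new_key, value=cache.value + new_value
    )


# First step: a two-token prompt. Second step: one newly generated token.
cache = extend_cache(None, [[1.0], [2.0]], [[10.0], [20.0]])
cache = extend_cache(cache, [[3.0]], [[30.0]])
```

In the second step only the new token's key and value are computed. The representations of the first two tokens are reused from the cache.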
- class curated_transformers.layers.AttentionScorer(*args, **kwargs)
Base class of attention scoring implementations.
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- abstract forward(*, query, key, value, attention_mask, use_causal_mask)
Apply attention scores to the given key, query and value.
Sequence elements that are marked with False in the attention mask are ignored by the attention mechanism (if a mask is provided).
- Parameters:
query (Tensor) – Query tensor.
Shape:
(batch_size, heads, seq_len, width)
key (Tensor) – Key tensor.
Shape:
(batch_size, heads, seq_len, width)
value (Tensor) – Value tensor.
Shape:
(batch_size, heads, seq_len, width)
attention_mask (AttentionMask) – Attention mask. Sequence elements for which the corresponding mask element is set to False are ignored in attention.
use_causal_mask (bool) – Mask out succeeding sequence elements when True.
- Return type:
Tensor
- Returns:
Attention values.
Shape:
(batch_size, heads, seq_len, width)
- class curated_transformers.layers.AttentionLinearBiases(*, n_attention_heads, is_causal, is_inverted)
Bases:
Module
ALiBi: Linear biases for attention (Press et al., 2022).
Construct an ALiBi module.
- Parameters:
- forward(*, attention_scores, inplace=True)
Apply linear biases to (unmasked) attention scores.
- Parameters:
attention_scores (Tensor) – Attention scores.
Shape:
(batch_size, heads, query_len, key_len)
inplace (bool) – Update attention scores in place.
- Return type:
Tensor
- Returns:
Attention scores with linear biases.
Shape:
(batch_size, heads, query_len, key_len)
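For orientation, the biases proposed by Press et al. (2022) can be sketched in plain Python. The slope formula below is the one from the paper for a power-of-two number of heads. The symmetric distance used in `alibi_biases` is a simplifying assumption, and how the is_causal and is_inverted options change the biases in this library is not covered. Both function names are hypothetical.

```python
from typing import List


def alibi_slopes(n_attention_heads: int) -> List[float]:
    """Per-head slopes for a power-of-two number of heads."""
    is_power_of_two = (
        n_attention_heads > 0
        and (n_attention_heads & (n_attention_heads - 1)) == 0
    )
    if not is_power_of_two:
        raise ValueError("this sketch only handles power-of-two head counts")
    ratio = 2.0 ** (-8.0 / n_attention_heads)
    return [ratio ** (head + 1) for head in range(n_attention_heads)]


def alibi_biases(slope: float, query_len: int, key_len: int) -> List[List[float]]:
    """Bias added to the attention scores of one head.

    The penalty grows linearly with the query/key distance.
    """
    return [
        [-slope * abs(q - k) for k in range(key_len)] for q in range(query_len)
    ]


slopes = alibi_slopes(8)
biases = alibi_biases(slopes[0], query_len=3, key_len=3)
```

Each head gets a different slope, so some heads penalize distant keys strongly while others remain nearly unbiased.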
- class curated_transformers.layers.ScaledDotProductAttention(*, dropout_prob, linear_biases)
Bases:
AttentionScorer
Scaled dot-product attention (Vaswani et al., 2017).
Construct a scaled dot-product attention module.
- Parameters:
dropout_prob (float) – Dropout to apply to the final hidden representation.
linear_biases (Optional[AttentionLinearBiases]) – ALiBi (Press et al., 2022) for attention scores. Not applied if None.
- forward(*, query, key, value, attention_mask, use_causal_mask)
Apply attention scores to the given key, query and value.
Sequence elements that are marked with False in the attention mask are ignored by the attention mechanism (if a mask is provided).
- Parameters:
query (Tensor) – Query tensor.
Shape:
(batch_size, heads, seq_len, width)
key (Tensor) – Key tensor.
Shape:
(batch_size, heads, seq_len, width)
value (Tensor) – Value tensor.
Shape:
(batch_size, heads, seq_len, width)
attention_mask (AttentionMask) – Attention mask. Sequence elements for which the corresponding mask element is set to False are ignored in attention.
use_causal_mask (bool) – Mask out succeeding sequence elements when True.
- Return type:
Tensor
- Returns:
Attention values.
Shape:
(batch_size, heads, seq_len, width)
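The computation from Vaswani et al. (2017), together with the two kinds of masking described above, can be sketched for a single head and a single sequence. This is a plain-Python illustration, not the module's implementation: the function `attention` is hypothetical, dropout and linear biases are omitted, masked positions use negative infinity instead of a dtype minimum, and every query is assumed to see at least one key.

```python
import math
from typing import List

Matrix = List[List[float]]


def attention(
    query: Matrix,
    key: Matrix,
    value: Matrix,
    mask: List[bool],
    use_causal_mask: bool,
) -> Matrix:
    """Single-head attention; rows of each matrix are sequence elements."""
    width = len(query[0])
    outputs = []
    for q_idx, q in enumerate(query):
        logits = []
        for k_idx, k in enumerate(key):
            # A key is visible if it is not masked out and, with a causal
            # mask, does not come after the query.
            visible = mask[k_idx] and not (use_causal_mask and k_idx > q_idx)
            score = sum(a * b for a, b in zip(q, k)) / math.sqrt(width)
            logits.append(score if visible else -math.inf)
        peak = max(logits)
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        outputs.append(
            [
                sum(w / total * v[d] for w, v in zip(weights, value))
                for d in range(len(value[0]))
            ]
        )
    return outputs


q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
causal = attention(q, k, v, mask=[True, True], use_causal_mask=True)
padded = attention(q, k, v, mask=[True, False], use_causal_mask=False)
```

With the causal mask, the first element can only attend to itself. With the second element masked out, both outputs are copies of the first value vector.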
- class curated_transformers.layers.SelfAttention(*, attention_heads, attention_scorer, hidden_width, qkv_mode, rotary_embeds=None, use_bias, device=None)
Bases:
Module
Transformer self-attention layer (Vaswani et al., 2017).
Construct a self-attention layer with rotary position embeddings and attention linear biases.
- Parameters:
attention_heads (AttentionHeads) – Attention head configuration.
attention_scorer (AttentionScorer) – Attention scorer used to calculate the attention values.
hidden_width (int) – Hidden width of the layer.
qkv_mode (QkvMode) – Handling mode for query, key and value.
rotary_embeds (Optional[QueryKeyRotaryEmbeddings]) – Rotary embeddings. Rotary embeddings will not be used when set to None.
use_bias (bool) – Use biases for linear layers.
device (Optional[device]) – Device on which the module is to be initialized.
- forward(input, attention_mask, use_causal_mask=False, cache=None, store_cache=False, positions=None)
Apply self-attention layer to the input.
- Parameters:
input (Tensor) – Input to apply self-attention to.
Shape:
(batch_size, seq_len, width)
attention_mask (AttentionMask) – Attention mask. Sequence elements for which the corresponding mask element is set to False are ignored in attention.
use_causal_mask (bool) – Mask out succeeding sequence elements when True.
cache (Optional[KeyValueCache]) – Key/value cache to avoid recomputing key/value representations for tokens that were previously seen.
store_cache (bool) – Whether to cache the key/value representations for future reuse.
positions (Optional[Tensor]) – Input positions. Positions are needed to look up rotary embeddings. Normally, these positions are calculated automatically. But if the positions deviate for some reason, they can be provided through this argument.
Shape:
(batch_size, seq_len)
- Return type:
Tuple[Tensor, Optional[KeyValueCache]]
- Returns:
Layer output.
Shape:
(batch_size, seq_len, width)
Embeddings
These modules implement various positional embeddings used by the Transformer.
- class curated_transformers.layers.SinusoidalPositionalEmbedding(*, width, max_len, normalize=True, device=None)
Bases:
Module
Sinusoidal positional embeddings (Vaswani et al., 2017).
Construct a sinusoidal positional embedding module.
- Parameters:
- forward(input)
Returns the positional embedding for the input.
- Parameters:
input (Tensor) – Input tensor.
Shape:
(batch_size, seq_len)
- Return type:
Tensor
- Returns:
Positional embedding for the input.
Shape:
(seq_len, width)
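The embedding table defined by Vaswani et al. (2017) can be sketched in plain Python. The interleaved sine/cosine layout follows the paper. Whether the module uses the same layout, and what its normalize option does, is not reproduced here. The function name `sinusoidal_embeddings` is hypothetical.

```python
import math
from typing import List


def sinusoidal_embeddings(
    seq_len: int, width: int, base: float = 10000.0
) -> List[List[float]]:
    """Table of shape (seq_len, width): sine on even, cosine on odd indices."""
    if width % 2 != 0:
        raise ValueError("this sketch requires an even width")
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, width, 2):
            angle = pos / base ** (i / width)
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


table = sinusoidal_embeddings(seq_len=3, width=4)
```

Each position gets a distinct vector, and lower dimensions oscillate faster than higher ones.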
- class curated_transformers.layers.RotaryEmbeddings(width, *, seq_len=512, base=10000, device=None)
Bases:
Module
Rotary embeddings (Su et al., 2021).
Construct a rotary embedding module. The rotary embedding will be precomputed for up to seq_len positions. The embedding will be recomputed when a longer sequence is found in the input.
- Parameters:
- forward(input, *, positions=None)
Apply rotary embeddings to the input.
- Parameters:
input (Tensor) – Input to apply the rotary embeddings to.
Shape:
(batch_size, n_heads, seq_len, width_per_head)
positions (Optional[Tensor]) – Positions of the inputs. If no positions are provided, they are assumed to be [0, seq_len).
Shape:
(batch_size, seq_len)
- Returns:
Input with the rotary embeddings applied.
Shape:
(batch_size, n_heads, seq_len, width_per_head)
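The idea behind rotary embeddings (Su et al., 2021) is that each pair of dimensions is rotated by an angle proportional to the position, so that dot products between rotated vectors depend only on relative position. The sketch below illustrates this for a single head vector. Pairing element i with element i + width/2 is an assumed layout, and the function `rotate` is hypothetical rather than part of the library.

```python
import math
from typing import List


def rotate(vector: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Apply a rotary embedding to one head vector of even width."""
    width = len(vector)
    half = width // 2
    out = [0.0] * width
    for i in range(half):
        theta = position * base ** (-2.0 * i / width)
        cos, sin = math.cos(theta), math.sin(theta)
        a, b = vector[i], vector[i + half]
        out[i] = a * cos - b * sin
        out[i + half] = a * sin + b * cos
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 2.0, 3.0, 4.0]
key = [0.5, -1.0, 2.0, 0.0]
# Both pairs of positions are two steps apart, so the scores agree.
score_a = dot(rotate(query, 5), rotate(key, 3))
score_b = dot(rotate(query, 2), rotate(key, 0))
```

Position 0 leaves a vector unchanged, and shifting query and key by the same offset does not change their dot product.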
- class curated_transformers.layers.QueryKeyRotaryEmbeddings(*, base=10000, fraction, head_width, device=None)
Bases:
Module
Rotary embeddings (Su et al., 2021) applied to query and key representations.
Construct a rotary embedding module.
- Parameters:
- forward(*, query, key, cache=None, positions=None)
Apply rotary embeddings to the query and key.
- Parameters:
query (Tensor) – Query representations.
Shape:
(batch_size, head, seq_len, width_per_head)
key (Tensor) – Key representations.
Shape:
(batch_size, head, seq_len, width_per_head)
cache (Optional[KeyValueCache]) – Key/value cache to avoid recomputing key/value representations for tokens that were previously seen.
positions (Optional[Tensor]) – Input positions. Positions are needed to look up rotary embeddings. Normally, these positions are calculated automatically. But if the positions deviate for some reason, they can be provided through this argument.
Shape:
(batch_size, seq_len)
- Return type:
Tuple[Tensor, Tensor]
- Returns:
Query and key with the rotary embeddings applied.
Shape:
(batch_size, head, seq_len, width_per_head)
Feed-forward Layers
- class curated_transformers.layers.PointwiseFeedForward(*, activation, hidden_width, intermediate_width, use_bias, use_gate, device=None)
Bases:
Module
Point-wise feed-forward layer (Vaswani et al., 2017).
This layer is applied pointwise, meaning that the same transformation is applied to each sequence element. This transformation is:
\[g(xW_1 + b_1)W_2 + b_2\]
\(W_1\) and \(b_1\) transform the input to an intermediate width, \(g\) is a non-linear activation function, and \(W_2\) and \(b_2\) transform the output of the activation back to the input width.
Gated Linear Units (Dauphin et al., 2016; Shazeer, 2020) are also supported. Gating applies the following transformation:
\[(g(xW_g + b_g) * (xW_1 + b_1))W_2 + b_2\]
\(W_g\) and \(b_g\) are the affine transformation for the gate.
Construct a pointwise feed-forward layer module.
- Parameters:
activation (Module) – Activation used by the pointwise feed-forward layers. The hidden input shape must be the same as the output shape (as is typical for elementwise activations).
hidden_width (int) – The input and output width of the layer.
intermediate_width (int) – The width of the projection to which the non-linearity is applied.
use_bias (bool) – Use biases for linear layers.
use_gate (bool) – Use Gated Linear Units.
device (Optional[device]) – Device on which the module is to be initialized.
- forward(input)
Apply the point-wise feed-forward layer to the input.
- Parameters:
input (Tensor) – Input.
Shape:
(batch_size, seq_len, width)
- Return type:
Tensor
- Returns:
Layer output.
Shape:
(batch_size, seq_len, width)
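The two transformations given for this layer can be written out for a single sequence element. The sketch below is a direct plain-Python transcription of the formulas, with ReLU chosen as an example for \(g\). The function names are hypothetical and this is not the module's implementation.

```python
from typing import Callable, List

Matrix = List[List[float]]


def linear(x: List[float], weight: Matrix, bias: List[float]) -> List[float]:
    """Compute xW + b for a weight of shape (len(x), len(bias))."""
    return [
        sum(xi * row[j] for xi, row in zip(x, weight)) + bias[j]
        for j in range(len(bias))
    ]


def relu(x: float) -> float:
    return max(0.0, x)


def feed_forward(x, w1, b1, w2, b2, g: Callable[[float], float] = relu):
    """g(xW_1 + b_1)W_2 + b_2"""
    hidden = [g(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


def gated_feed_forward(x, wg, bg, w1, b1, w2, b2, g: Callable[[float], float] = relu):
    """(g(xW_g + b_g) * (xW_1 + b_1))W_2 + b_2"""
    gate = [g(h) for h in linear(x, wg, bg)]
    hidden = [a * b for a, b in zip(gate, linear(x, w1, b1))]
    return linear(hidden, w2, b2)


x = [1.0, -1.0]
w1 = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # hidden width 2 -> intermediate width 3
b1 = [0.0, 0.0, 0.5]
w2 = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # intermediate width 3 -> hidden width 2
b2 = [0.0, 0.0]
plain = feed_forward(x, w1, b1, w2, b2)
gated = gated_feed_forward(x, w1, b1, w1, b1, w2, b2)
```

In both variants the output has the same width as the input. The gated variant needs one extra projection, whose activated output scales the intermediate representation elementwise.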
Activations
- class curated_transformers.layers.Activation(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
Bases:
Enum
Activation functions.
- GELU = 'gelu'
Gaussian Error Linear Unit (Hendrycks et al., 2016).
- GELUFast = 'gelu_fast'
Gaussian Error Linear Unit (Hendrycks et al., 2016) approximation used by GPT-NeoX (Black et al., 2022).
- GELUNew = 'gelu_new'
Gaussian Error Linear Unit (Hendrycks et al., 2016) approximation.
- ReLU = 'relu'
Rectified Linear Unit (Fukushima, 1969).
- SiLU = 'silu'
Sigmoid Linear Unit (Hendrycks et al., 2016).
- class curated_transformers.layers.GELUFast(*args, **kwargs)
Bases:
Module
GELU (Hendrycks et al., 2016) approximation used by GPT-NeoX (Black et al., 2022).
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(input)
Apply the GELU activation on the input.
- Parameters:
input (Tensor) – Input tensor.
Shape:
(batch_size, seq_len, width)
- Return type:
Tensor
- class curated_transformers.layers.GELUNew(*args, **kwargs)
Bases:
Module
GELU (Hendrycks et al., 2016) approximation, called gelu_new in many transformer models.
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(input)
Apply the GELU activation on the input.
- Parameters:
input (Tensor) – Input tensor.
Shape:
(batch_size, seq_len, width)
- Return type:
Tensor
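To give a sense of what a GELU approximation looks like, the sketch below compares the exact GELU with the tanh-based approximation from Hendrycks et al. (2016). That GELUNew and GELUFast use exactly this expression is an assumption. The function names here are hypothetical and work on single floats.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


grid = [i / 10.0 for i in range(-50, 51)]
max_error = max(abs(gelu(x) - gelu_tanh(x)) for x in grid)
```

On this grid the approximation stays very close to the exact function, which is why such approximations can be used in place of the exact function.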
Normalization
- class curated_transformers.layers.RMSNorm(width, *, eps, device=None)
Bases:
Module
Root Mean Square (RMS) normalization (Zhang et al., 2019).
Construct an RMS normalization module.
- Parameters:
- forward(input)
Apply RMS normalization to a tensor.
- Parameters:
input (Tensor) – The tensor to apply normalization to.
- Return type:
Tensor
- Returns:
Normalized tensor.
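RMS normalization as described by Zhang et al. (2019) divides each vector by its root mean square and applies a learned per-dimension scale. The sketch below works on one vector. The function `rms_norm` and its weight argument are hypothetical names, and the exact placement of eps is an assumption.

```python
import math
from typing import List


def rms_norm(x: List[float], weight: List[float], eps: float) -> List[float]:
    """Scale x to unit root mean square, then apply a per-dimension weight."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale * w for v, w in zip(x, weight)]


normalized = rms_norm([3.0, 4.0], weight=[1.0, 1.0], eps=0.0)
output_mean_square = sum(v * v for v in normalized) / len(normalized)
```

Unlike layer normalization, no mean is subtracted. Only the magnitude of the vector is normalized.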
Model Outputs
These dataclasses encapsulate the outputs produced by the different modules.
- class curated_transformers.models.ModelOutput(all_outputs)
Base class for model outputs.
- Parameters:
all_outputs (List[Tensor]) – The first element is the output of the embedding layer. The rest of the elements are the states of each encoder hidden layer respectively.
Return the hidden representation of all the layers.
- Returns:
Hidden representations of all the layers.
Shape:
(batch_size, seq_len, width)
- property embedding_layer: Tensor
Return the output of the embedding layer.
- Returns:
Embedding layer output.
Shape:
(batch_size, seq_len, width)
Return the hidden representations of a given layer.
- Parameters:
idx (int) – Layer index. Must be in [0, n_hidden_layers).
- Return type:
Tensor
- Returns:
Hidden representation of the layer.
Shape:
(batch_size, seq_len, width)
Return the hidden representation of the last layer.
- Returns:
Hidden representation of the last layer.
Shape:
(batch_size, seq_len, width)
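The relationship between all_outputs and the accessors described above can be sketched with a small stand-in class. `ToyOutput` and its method names are hypothetical, and strings take the place of tensors. The indexing follows the description of all_outputs: the embedding layer output comes first, followed by one entry per hidden layer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyOutput:
    """Hypothetical stand-in that stores strings instead of tensors."""

    all_outputs: List[str]

    def embedding_layer(self) -> str:
        return self.all_outputs[0]

    def hidden_layer(self, idx: int) -> str:
        n_hidden_layers = len(self.all_outputs) - 1
        if not 0 <= idx < n_hidden_layers:
            raise IndexError("idx must be in [0, n_hidden_layers)")
        # Skip the embedding layer output at position 0.
        return self.all_outputs[idx + 1]

    def last_hidden_layer(self) -> str:
        return self.all_outputs[-1]


output = ToyOutput(["embeddings", "layer 0", "layer 1"])
first = output.hidden_layer(0)
last = output.last_hidden_layer()
```

A model with two hidden layers therefore produces three outputs, and hidden layer index 0 refers to the second element of all_outputs.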
- class curated_transformers.models.ModelOutputWithCache(all_outputs, cache)
Bases:
Generic[CacheT], ModelOutput
Output of decoder modules.
- class curated_transformers.models.CausalLMOutputWithCache(all_outputs, cache, logits)
Bases:
Generic[CacheT], ModelOutputWithCache[CacheT]
Output of causal language model modules.
- Parameters:
logits (Tensor) – Logits of the distributions of predicted tokens.
Model Configs
These dataclasses encapsulate the configurable parameters of the Transformer model.
- class curated_transformers.models.RotaryEmbeddingConfig(rotary_base, rotary_fraction)
Configuration options for rotary embeddings (Su et al., 2021).
- class curated_transformers.models.TransformerAttentionLayerConfig(dropout_prob, hidden_width, n_query_heads, n_key_value_heads, rotary_embeddings, use_alibi, use_bias, use_parallel_attention)
Configuration options for self-attention.
- Parameters:
dropout_prob (float) – Dropout probability to apply after attention.
hidden_width (int) – Hidden width of the transformer.
n_query_heads (int) – Number of attention heads.
n_key_value_heads (int) – Number of key and value heads.
rotary_embeddings (Optional[RotaryEmbeddingConfig]) – Rotary embedding configuration.
use_alibi (bool) – Use ALiBi linear biases.
use_bias (bool) – Use bias in linear layers.
use_parallel_attention (bool) – Use parallel attention.
- class curated_transformers.models.TransformerEmbeddingLayerConfig(dropout_prob, embedding_width, layer_norm_eps, n_positions, n_pieces, n_types)
Configuration options for embeddings.
- Parameters:
dropout_prob (float) – Dropout probability for the embedding layer.
embedding_width (int) – Width of the embedding representations.
layer_norm_eps (float) – Epsilon for layer normalization.
n_positions (Optional[int]) – Maximum length of position embeddings.
n_pieces (int) – Vocabulary size (number of embeddings).
n_types (Optional[int]) – Token type vocabulary size (number of token type embeddings).
- class curated_transformers.models.TransformerFeedForwardLayerConfig(activation, hidden_width, intermediate_width, use_bias, use_gate)
Configuration options for transformer feed-forward layers.
- Parameters:
activation (Activation) – Activation in the feed-forward layer.
hidden_width (int) – Hidden width of the transformer.
intermediate_width (int) – Intermediate width in the feed-forward layer.
use_bias (bool) – Use bias in linear layers.
use_gate (bool) – Use Gated Linear Units.
- class curated_transformers.models.TransformerLayerConfig(attention, dropout_prob, feedforward, layer_norm_eps, n_hidden_layers)
Configuration options for transformer layers.
- Parameters:
attention (TransformerAttentionLayerConfig) – Attention layer config.
dropout_prob (float) – Dropout probability to apply after hidden layers.
feedforward (TransformerFeedForwardLayerConfig) – Feed-forward layer config.
layer_norm_eps (float) – Epsilon for layer normalization.
n_hidden_layers (int) – Number of hidden layers.
- class curated_transformers.models.TransformerConfig(embedding, layer, dtype)
Configuration options for a transformer model.
- Parameters:
embedding (TransformerEmbeddingLayerConfig) – Embedding layer config.
layer (TransformerLayerConfig) – Transformer hidden layer config.
dtype (dtype) – Default data type used by the model’s parameters.