Tokenizers
Inputs
Each tokenizer accepts an Iterable[str] or an Iterable[InputChunks]. In most cases, passing a list of strings should suffice. However, passing InputChunks can be useful when special pieces need to be added to the input.
When the tokenizer is called with a list of strings, each string is automatically converted to a TextChunk, which represents a text chunk that should be tokenized. The other supported chunk type is SpecialPieceChunk. The piece stored by this type of chunk is not tokenized but looked up directly.
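For example, the sketch below passes explicit chunks instead of plain strings. It assumes a tokenizer has already been loaded, that the vocabulary uses BERT-style "[CLS]" and "[SEP]" special pieces, and that TextChunk is importable from the same module and takes the chunk's text as its only argument:

```python
from curated_transformers.tokenizers import (
    InputChunks,
    SpecialPieceChunk,
    TextChunk,
)

# One InputChunks per input sequence. Special pieces are looked up directly in
# the vocabulary; text chunks are tokenized as usual.
chunks = [
    InputChunks(
        [
            SpecialPieceChunk("[CLS]"),
            TextChunk("I saw a girl with a telescope."),
            SpecialPieceChunk("[SEP]"),
        ]
    )
]
pieces = tokenizer(chunks)
```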
- class curated_transformers.tokenizers.InputChunks(iterable=(), /)
Bases: List[Union[SpecialPieceChunk, TextChunk]]
A list of chunks.
- merge_text_chunks()
Merge multiple contiguous text chunks and before/after text in special piece chunks.
- Return type:
MergedInputChunks
- class curated_transformers.tokenizers.SpecialPieceChunk(piece, after=None, before=None)
A chunk that contains a special piece. This piece is not tokenized, but looked up directly in the vocabulary. Can additionally store strings that should be appended to a text chunk before or prepended to a text chunk after the special piece.
Outputs
All tokenizers encode raw strings into pieces. The pieces are stored in a special container, PiecesWithIds.
- class curated_transformers.tokenizers.PiecesWithIds(ids, pieces)
Bases: object
Encoded output of tokenizers.
- Parameters:
  - ids – Piece identifiers of each input sequence.
  - pieces – Piece strings of each input sequence.
- attention_mask(*, pad_left=False, device=None)
Generate the attention masks. The mask is equivalent to:
ids.padded_tensor(padding_id) != padding_id
- Parameters:
  - pad_left (bool) – By default sequences shorter than the longest sequence are right-padded. Use left-padding when set to True.
  - device (Optional[device]) – Device on which the attention mask is created.
- Return type:
- Returns:
The attention mask.
Shape: (batch_size, max_seq_len)
- padded_tensor(*, padding_id=0, pad_left=False, device=None)
Generate a padded tensor of the piece identifiers.
- Parameters:
  - padding_id (int) – Piece identifier of the padding piece. The actual identifier generally doesn’t matter when an attention mask is used (and as long as it is a valid vocabulary index).
  - pad_left (bool) – By default sequences shorter than the longest sequence are right-padded. Use left-padding when set to True.
  - device (Optional[device]) – Device on which the padded tensor is created.
- Return type:
Tensor
- Returns:
The padded piece ids.
Shape: (batch_size, max_seq_len)
The encoded pieces can be decoded to produce raw strings.
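A minimal sketch of working with the encoded output. It assumes `tokenizer` has already been loaded (for instance with AutoTokenizer.from_hf_hub, described in the next section) and that the ids and pieces attributes hold one list per input sequence:

```python
pieces = tokenizer(["Hello world!", "This is a longer sentence."])

# Per-sequence piece identifiers and piece strings.
print(pieces.ids)
print(pieces.pieces)

# Piece identifiers padded to the length of the longest sequence.
# Shape: (batch_size, max_seq_len)
ids = pieces.padded_tensor(padding_id=0)

# Attention mask marking non-padding positions; same shape as the padded ids.
mask = pieces.attention_mask()

# Reconstruct strings from the piece identifiers.
texts = tokenizer.decode(pieces.ids)
```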
Downloading
Each tokenizer type provides a from_hf_hub function that will load a tokenizer from Hugging Face Hub. If you want to load a tokenizer without committing to a specific tokenizer type, you can use the AutoTokenizer class. This class also provides a from_hf_hub method to load a tokenizer, but will try to infer the tokenizer type automatically.
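For example, a short sketch; the repository name is only an illustration, and any repository with a supported tokenizer works:

```python
from curated_transformers.tokenizers import AutoTokenizer

# Infer the tokenizer type from the repository contents and load it.
tokenizer = AutoTokenizer.from_hf_hub(name="bert-base-uncased", revision="main")

pieces = tokenizer(["Pining for the fjords."])
```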
- class curated_transformers.tokenizers.AutoTokenizer
Tokenizer loaded from the Hugging Face Model Hub.
- classmethod from_fsspec(*, fs, model_path, fsspec_args=None)
Construct a tokenizer and load its parameters from an fsspec filesystem.
- Parameters:
- Return type:
- Returns:
The tokenizer.
- classmethod from_hf_hub(*, name, revision='main')
Infer a tokenizer type and load it from Hugging Face Hub.
- Parameters:
- Return type:
- Returns:
The tokenizer.
- classmethod from_hf_hub_to_cache(*, name, revision='main')
Download the tokenizer’s serialized model, configuration and vocab files from Hugging Face Hub into the local Hugging Face cache directory. Subsequent loading of the tokenizer will read the files from disk. If the files are already cached, this is a no-op.
- classmethod from_repo(repo)
Construct and load a tokenizer from a repository.
- Parameters:
  - repo – The repository to load from.
- Return type:
- Returns:
Loaded tokenizer.
Architectures
Tokenizer architectures are separated into two layers: non-legacy tokenizers and legacy tokenizers. Non-legacy tokenizers wrap tokenizers from the Hugging Face tokenizers library, whereas legacy tokenizers wrap model-specific tokenizers bundled with the Hugging Face transformers library.
- class curated_transformers.tokenizers.TokenizerBase
Bases: ABC
Base class for all tokenizers.
- __call__(input)
Split one or more texts into pieces.
- Parameters:
  - input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.
- Return type:
PiecesWithIds
- Returns:
Pieces in each sequence.
- abstract decode(input, *, skip_special_pieces=True)
Reconstruct string sequences from piece identifiers.
- abstract encode(input)
Split one or more texts into pieces.
- Parameters:
  - input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.
- Return type:
PiecesWithIds
- Returns:
Pieces in each sequence.
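Since both __call__ and encode are documented to split one or more texts into pieces, either spelling can be used. A small sketch, assuming `tokenizer` is any loaded TokenizerBase implementation:

```python
# Encoding: calling the tokenizer and calling encode both produce pieces.
pieces = tokenizer.encode(["Pining for the fjords."])
same_pieces = tokenizer(["Pining for the fjords."])

# Decoding: special pieces are skipped by default; pass
# skip_special_pieces=False to keep them in the reconstructed strings.
texts = tokenizer.decode(pieces.ids)
texts_with_specials = tokenizer.decode(pieces.ids, skip_special_pieces=False)
```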
Non-Legacy
- class curated_transformers.tokenizers.Tokenizer(*, tokenizer, config, special_tokens_map)
Bases: TokenizerBase, FromHF
Wraps the tokenizers from the tokenizers package. It supports a wide range of piece tokenizers, including word piece, byte pair encoding, and sentencepiece unigram tokenizers. This is the tokenizer that should be used in the majority of cases. The other tokenizers in the curated-transformers package should only be used when you have a legacy tokenizer that is not in Hugging Face tokenizer.json format.
Construct a tokenizer.
- Parameters:
- decode(input, *, skip_special_pieces=True)
Reconstruct string sequences from piece identifiers.
- encode(input)
Split one or more texts into pieces.
- Parameters:
  - input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.
- Return type:
PiecesWithIds
- Returns:
Pieces in each sequence.
- property eos_piece: str | None
Get the end-of-sequence piece.
- Returns:
The end-of-sequence piece or None when this piece is not defined.
- classmethod from_dir(path)
Load the tokenizer from a directory with a tokenizer.json file.
- classmethod from_hf_hub_to_cache(*, name, revision='main')
Download the tokenizer’s serialized model, configuration and vocab files from Hugging Face Hub into the local Hugging Face cache directory. Subsequent loading of the tokenizer will read the files from disk. If the files are already cached, this is a no-op.
- classmethod from_json(tokenizer_json, config_json=None, special_tokens_map_json=None)
Load the tokenizer from serialized JSON strings.
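A sketch of loading the non-legacy tokenizer from local files; the paths are placeholders and from_dir is assumed to accept a pathlib.Path:

```python
from pathlib import Path

from curated_transformers.tokenizers import Tokenizer

# Load from a directory that contains a Hugging Face tokenizer.json file.
tokenizer = Tokenizer.from_dir(Path("/path/to/tokenizer_dir"))

# Alternatively, construct the tokenizer from serialized JSON strings.
tokenizer_json = Path("/path/to/tokenizer_dir/tokenizer.json").read_text(encoding="utf-8")
tokenizer = Tokenizer.from_json(tokenizer_json)
```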
Legacy
- class curated_transformers.tokenizers.legacy.LegacyTokenizer
Bases: TokenizerBase
Base class for legacy tokenizers.
- __call__(input)
Split one or more texts into pieces.
- Parameters:
  - input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.
- Return type:
PiecesWithIds
- Returns:
Pieces in each sequence.
- decode(input, skip_special_pieces=True)
Reconstruct string sequences from piece identifiers.
- class curated_transformers.tokenizers.legacy.PreEncoder
Callable applied before encoding.
- abstract __call__(chunks)
Apply the pre-encoder on the chunks.
- Parameters:
  - chunks (Iterable[InputChunks]) – Input chunks of each input sequence.
- Return type:
- Returns:
Modified input chunks.
- class curated_transformers.tokenizers.legacy.PostEncoder
Callable applied after encoding.
- abstract __call__(pieces)
Apply the post-encoder on the pieces.
- Parameters:
  - pieces (PiecesWithIds) – Encoded output of the tokenizer.
- Return type:
- Returns:
Modified encoded output.
- class curated_transformers.tokenizers.legacy.PreDecoder
Callable applied before decoding.
- class curated_transformers.tokenizers.legacy.PostDecoder
Callable applied after decoding.
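These callables let legacy tokenizers adjust inputs and outputs around the core piece tokenizer. As a hypothetical sketch, a custom PreEncoder could wrap every sequence in begin- and end-of-sequence pieces; the piece strings and the list return type are assumptions:

```python
from typing import Iterable, List

from curated_transformers.tokenizers import InputChunks, SpecialPieceChunk
from curated_transformers.tokenizers.legacy import PreEncoder


class BosEosPreEncoder(PreEncoder):
    """Hypothetical pre-encoder that adds BOS/EOS pieces to every sequence."""

    def __init__(self, bos_piece: str = "<s>", eos_piece: str = "</s>"):
        self.bos_piece = bos_piece
        self.eos_piece = eos_piece

    def __call__(self, chunks: Iterable[InputChunks]) -> List[InputChunks]:
        # Prepend the BOS piece and append the EOS piece to every sequence.
        return [
            InputChunks(
                [
                    SpecialPieceChunk(self.bos_piece),
                    *seq,
                    SpecialPieceChunk(self.eos_piece),
                ]
            )
            for seq in chunks
        ]
```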
- class curated_transformers.tokenizers.legacy.ByteBPETokenizer(*, vocab, merges, special_pieces=None)
Bases: LegacyTokenizer
Piece tokenizer using byte-level byte pair encoding (Gage, 1994, Sennrich et al., 2016).
Construct a byte BPE tokenizer.
- Parameters:
- __call__(input)
Split one or more texts into pieces.
- Parameters:
  - input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.
- Return type:
PiecesWithIds
- Returns:
Pieces in each sequence.
- decode(input, skip_special_pieces=True)
Reconstruct string sequences from piece identifiers.
- encode(input)
Split one or more texts into pieces.
- Parameters:
  - input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.
- Return type:
PiecesWithIds
- Returns:
Pieces in each sequence.
- class curated_transformers.tokenizers.legacy.WordPieceTokenizer(*, vocab, special_pieces)
Bases: LegacyTokenizer
Piece tokenizer using WordPiece tokenization (Devlin et al., 2018).
Construct a tokenizer from a curated-tokenizers WordPiece processor.
- Parameters:
- __call__(input)
Split one or more texts into pieces.
- Parameters:
  - input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.
- Return type:
PiecesWithIds
- Returns:
Pieces in each sequence.
- decode(input, skip_special_pieces=True)
Reconstruct string sequences from piece identifiers.
- encode(input)
Split one or more texts into pieces.
- Parameters:
  - input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.
- Return type:
PiecesWithIds
- Returns:
Pieces in each sequence.
- class curated_transformers.tokenizers.legacy.SentencePieceTokenizer(*, processor)
Bases: LegacyTokenizer
Piece tokenizer using SentencePiece encoding (Kudo et al., 2018).
Construct a tokenizer from a curated-tokenizers SentencePiece processor.
- Parameters:
  - processor (SentencePieceProcessor) – The processor to wrap.
- __call__(input)
Split one or more texts into pieces.
- Parameters:
  - input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.
- Return type:
PiecesWithIds
- Returns:
Pieces in each sequence.
- decode(input, skip_special_pieces=True)
Reconstruct string sequences from piece identifiers.
- encode(input)
Split one or more texts into pieces.
- Parameters:
  - input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.
- Return type:
PiecesWithIds
- Returns:
Pieces in each sequence.
Model-Specific Tokenizers
- class curated_transformers.tokenizers.legacy.BERTTokenizer(*, vocab, special_pieces=None, bos_piece='[CLS]', eos_piece='[SEP]', unk_piece='[UNK]', lowercase=False, strip_accents=False, tokenize_chinese_chars=True)
Bases: WordPieceTokenizer, LegacyFromHF
Legacy tokenizer for BERT (Devlin et al., 2018) models.
Construct a BERT tokenizer from a curated-tokenizers WordPiece processor.
- Parameters:
  - bos_piece (str) – The piece used to mark the beginning of a sequence.
  - eos_piece (str) – The piece used to mark the end of a sequence.
  - unk_piece (str) – The piece used to mark unknown tokens.
  - lowercase (bool) – Lowercase text.
  - strip_accents (bool) – Strip accents from text.
  - tokenize_chinese_chars (bool) – Tokenize Chinese characters.
- __call__(input)
Split one or more texts into pieces.
- Parameters:
  - input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.
- Return type:
PiecesWithIds
- Returns:
Pieces in each sequence.
- decode(input, skip_special_pieces=True)
Reconstruct string sequences from piece identifiers.
- encode(input)
Split one or more texts into pieces.
- Parameters:
  - input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.
- Return type:
PiecesWithIds
- Returns:
Pieces in each sequence.
- property eos_piece: str | None
Get the end-of-sequence piece.
- Returns:
The end-of-sequence piece or None when this piece is not defined.
- classmethod from_files(*, vocab_file, bos_piece='[CLS]', eos_piece='[SEP]', unk_piece='[UNK]', lowercase=False, strip_accents=False)
Construct a tokenizer from the vocabulary file.
- Parameters:
  - vocab_file (RepositoryFile) – The vocabulary file.
  - bos_piece (str) – The piece to use to mark the beginning of a sequence.
  - eos_piece (str) – The piece to use to mark the end of a sequence.
  - unk_piece (str) – The piece used to mark unknown tokens.
  - lowercase (bool) – Lowercase text.
  - strip_accents (bool) – Strip accents from text.
- Return type:
Self (bound to BERTTokenizer)
- classmethod from_fsspec(*, fs, model_path, fsspec_args=None)
Construct a tokenizer and load its parameters from an fsspec filesystem.
- classmethod from_hf_hub(*, name, revision='main')
Construct a tokenizer and load its parameters from Hugging Face Hub.
- classmethod from_hf_hub_to_cache(*, name, revision='main')
Download the tokenizer’s serialized model, configuration and vocab files from Hugging Face Hub into the local Hugging Face cache directory. Subsequent loading of the tokenizer will read the files from disk. If the files are already cached, this is a no-op.
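For example, loading the legacy BERT tokenizer directly; the repository name is only an illustration:

```python
from curated_transformers.tokenizers.legacy import BERTTokenizer

tokenizer = BERTTokenizer.from_hf_hub(name="bert-base-uncased")

# The default bos/eos pieces are [CLS] and [SEP] (see the constructor above).
pieces = tokenizer(["Pining for the fjords."])
print(pieces.pieces[0])
```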
- class curated_transformers.tokenizers.legacy.CamemBERTTokenizer(*, processor, bos_piece='<s>', eos_piece='</s>')
Bases: SentencePieceTokenizer, LegacyFromHF
Legacy tokenizer for CamemBERT (Martin et al., 2020) models.
Construct a CamemBERT tokenizer from a curated-tokenizers SentencePiece processor.
- Parameters:
- __call__(input)
Split one or more texts into pieces.
- Parameters:
  - input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.
- Return type:
PiecesWithIds
- Returns:
Pieces in each sequence.
- decode(input, skip_special_pieces=True)
Reconstruct string sequences from piece identifiers.
- encode(input)
Split one or more texts into pieces.
- Parameters:
  - input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.
- Return type:
PiecesWithIds
- Returns:
Pieces in each sequence.
- property eos_piece: str | None
Get the end-of-sequence piece.
- Returns:
The end-of-sequence piece or None when this piece is not defined.
- classmethod from_files(*, model_file, bos_piece='<s>', eos_piece='</s>')
Construct a tokenizer from a SentencePiece model file.
- Parameters:
  - model_file (RepositoryFile) – The SentencePiece model file.
  - bos_piece (str) – The piece to use to mark the beginning of a sequence.
  - eos_piece (str) – The piece to use to mark the end of a sequence.
- Return type:
Self (bound to CamemBERTTokenizer)
- classmethod from_fsspec(*, fs, model_path, fsspec_args=None)
Construct a tokenizer and load its parameters from an fsspec filesystem.
- classmethod from_hf_hub(*, name, revision='main')
Construct a tokenizer and load its parameters from Hugging Face Hub.
- classmethod from_hf_hub_to_cache(*, name, revision='main')
Download the tokenizer’s serialized model, configuration and vocab files from Hugging Face Hub into the local Hugging Face cache directory. Subsequent loading of the tokenizer will read the files from disk. If the files are already cached, this is a no-op.
- class curated_transformers.tokenizers.legacy.RoBERTaTokenizer(*, vocab, merges, special_pieces=None, bos_piece='<s>', eos_piece='</s>')
Bases: ByteBPETokenizer, LegacyFromHF
Legacy tokenizer for RoBERTa (Liu et al., 2019) models.
Construct a RoBERTa tokenizer.
- Parameters:
- __call__(input)
Split one or more texts into pieces.
- Parameters:
  - input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.
- Return type:
PiecesWithIds
- Returns:
Pieces in each sequence.
- decode(input, skip_special_pieces=True)
Reconstruct string sequences from piece identifiers.
- encode(input)
Split one or more texts into pieces.
- Parameters:
  - input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.
- Return type:
PiecesWithIds
- Returns:
Pieces in each sequence.
- property eos_piece: str | None
Get the end-of-sequence piece.
- Returns:
The end-of-sequence piece or None when this piece is not defined.
- classmethod from_files(*, vocab_file, merges_file, bos_piece='<s>', eos_piece='</s>')
Construct a tokenizer from vocabulary and merge files.
- Parameters:
  - vocab_file (RepositoryFile) – The vocabulary file.
  - merges_file (RepositoryFile) – The merges file.
  - bos_piece (str) – The piece to use to mark the beginning of a sequence.
  - eos_piece (str) – The piece to use to mark the end of a sequence.
- Return type:
Self (bound to RoBERTaTokenizer)
- classmethod from_fsspec(*, fs, model_path, fsspec_args=None)
Construct a tokenizer and load its parameters from an fsspec filesystem.
- classmethod from_hf_hub(*, name, revision='main')
Construct a tokenizer and load its parameters from Hugging Face Hub.
- classmethod from_hf_hub_to_cache(*, name, revision='main')
Download the tokenizer’s serialized model, configuration and vocab files from Hugging Face Hub into the local Hugging Face cache directory. Subsequent loading of the tokenizer will read the files from disk. If the files are already cached, this is a no-op.
- class curated_transformers.tokenizers.legacy.LlamaTokenizer(*, processor, add_bos_piece=True, add_eos_piece=False)
Bases: SentencePieceTokenizer, LegacyFromHF
Legacy tokenizer for Llama (Touvron et al., 2023 [a], Touvron et al., 2023 [b]) models.
Construct a Llama tokenizer from a curated-tokenizers SentencePiece processor.
- Parameters:
- __call__(input)
Split one or more texts into pieces.
- Parameters:
  - input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.
- Return type:
PiecesWithIds
- Returns:
Pieces in each sequence.
- decode(input, skip_special_pieces=True)
Reconstruct string sequences from piece identifiers.
- encode(input)
Split one or more texts into pieces.
- Parameters:
  - input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.
- Return type:
PiecesWithIds
- Returns:
Pieces in each sequence.
- property eos_piece: str | None
Get the end-of-sequence piece.
- Returns:
The end-of-sequence piece or None when this piece is not defined.
- classmethod from_files(*, model_file, add_bos_piece=True, add_eos_piece=False)
Construct a Llama tokenizer from a SentencePiece model.
- Parameters:
  - model_file (RepositoryFile) – The SentencePiece model file.
  - add_bos_piece (bool) – Add a begin-of-sequence piece.
  - add_eos_piece (bool) – Add an end-of-sequence piece.
- Return type:
Self (bound to LlamaTokenizer)
- classmethod from_fsspec(*, fs, model_path, fsspec_args=None)
Construct a tokenizer and load its parameters from an fsspec filesystem.
- classmethod from_hf_hub(*, name, revision='main')
Construct a tokenizer and load its parameters from Hugging Face Hub.
- classmethod from_hf_hub_to_cache(*, name, revision='main')
Download the tokenizer’s serialized model, configuration and vocab files from Hugging Face Hub into the local Hugging Face cache directory. Subsequent loading of the tokenizer will read the files from disk. If the files are already cached, this is a no-op.
- class curated_transformers.tokenizers.legacy.XLMRTokenizer(*, processor)
Bases: SentencePieceTokenizer, LegacyFromHF
Legacy tokenizer for XLM-RoBERTa (Conneau et al., 2019) models.
Construct an XLM-RoBERTa tokenizer from a curated-tokenizers SentencePiece processor.
- Parameters:
  - processor (SentencePieceProcessor) – The processor to wrap.
- __call__(input)
Split one or more texts into pieces.
- Parameters:
  - input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.
- Return type:
PiecesWithIds
- Returns:
Pieces in each sequence.
- decode(input, skip_special_pieces=True)
Reconstruct string sequences from piece identifiers.
- encode(input)
Split one or more texts into pieces.
- Parameters:
  - input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.
- Return type:
PiecesWithIds
- Returns:
Pieces in each sequence.
- property eos_piece: str | None
Get the end-of-sequence piece.
- Returns:
The end-of-sequence piece or None when this piece is not defined.
- classmethod from_files(*, model_file)
Construct an XLM-R tokenizer from a SentencePiece model.
- Parameters:
  - model_file (RepositoryFile) – The SentencePiece model file.
- Return type:
Self (bound to XLMRTokenizer)
- classmethod from_fsspec(*, fs, model_path, fsspec_args=None)
Construct a tokenizer and load its parameters from an fsspec filesystem.
- classmethod from_hf_hub(*, name, revision='main')
Construct a tokenizer and load its parameters from Hugging Face Hub.
- classmethod from_hf_hub_to_cache(*, name, revision='main')
Download the tokenizer’s serialized model, configuration and vocab files from Hugging Face Hub into the local Hugging Face cache directory. Subsequent loading of the tokenizer will read the files from disk. If the files are already cached, this is a no-op.