Tokenizers

Inputs

Each tokenizer accepts an Iterable[str] or an Iterable[InputChunks]. In most cases, passing a list of strings suffices. However, passing InputChunks can be useful when special pieces need to be added to the input.

When the tokenizer is called with a list of strings, each string is automatically converted to a TextChunk, which represents a chunk of text that should be tokenized. The other supported chunk type is SpecialPieceChunk. The piece stored in this type of chunk is not tokenized but looked up directly in the vocabulary.
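
A minimal sketch of both input forms is shown below. The checkpoint name is only a placeholder, and a BERT-style vocabulary with [CLS]/[SEP] pieces is assumed:

    from curated_transformers.tokenizers import (
        AutoTokenizer,
        InputChunks,
        SpecialPieceChunk,
        TextChunk,
    )

    # Placeholder checkpoint name; substitute any supported model.
    tokenizer = AutoTokenizer.from_hf_hub(name="bert-base-uncased")

    # Plain strings: each string becomes a TextChunk automatically.
    pieces = tokenizer(["Paris is the capital of France."])

    # Explicit chunks: special pieces are looked up directly in the
    # vocabulary instead of being tokenized.
    chunks = [
        InputChunks(
            [
                SpecialPieceChunk("[CLS]"),
                TextChunk("Paris is the capital of France."),
                SpecialPieceChunk("[SEP]"),
            ]
        )
    ]
    pieces_with_specials = tokenizer(chunks)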

class curated_transformers.tokenizers.InputChunks(iterable=(), /)

Bases: List[Union[SpecialPieceChunk, TextChunk]]

A list of chunks.

merge_text_chunks()

Merge contiguous text chunks and the before/after text stored in special piece chunks.

Return type:

MergedInputChunks

class curated_transformers.tokenizers.SpecialPieceChunk(piece, after=None, before=None)

A chunk that contains a special piece. This piece is not tokenized, but looked up directly in the vocabulary. Can additionally store strings that should be appended to a text chunk before or prepended to a text chunk after the special piece.

Parameters:
  • piece (str) – Piece to look up in the vocabulary.

  • after (Optional[str]) – Text to prepend to the succeeding text chunk.

  • before (Optional[str]) – Text to append to the preceding text chunk.

class curated_transformers.tokenizers.TextChunk(text)

A chunk of text that should be tokenized.

Parameters:

text (str) – Text that should be tokenized.

Outputs

All tokenizers encode raw strings into pieces. The pieces are stored in the PiecesWithIds container.

class curated_transformers.tokenizers.PiecesWithIds(ids, pieces)

Bases: object

Encoded output of tokenizers.

Parameters:
  • ids (List[List[int]]) – Piece identifiers of each input sequence.

  • pieces (List[List[str]]) – Piece strings of each input sequence.

attention_mask(*, pad_left=False, device=None)

Generate the attention mask. The mask is equivalent to: ids.padded_tensor(padding_id) != padding_id

Parameters:
  • pad_left (bool) – By default sequences shorter than the longest sequence are right-padded. Use left-padding when set to True.

  • device (Optional[device]) – Device on which the attention mask is created.

Return type:

AttentionMask

Returns:

The attention mask.

Shape: (batch_size, max_seq_len)

padded_tensor(*, padding_id=0, pad_left=False, device=None)

Generate a padded tensor of the piece identifiers.

Parameters:
  • padding_id (int) – Piece identifier of the padding piece. The actual identifier generally does not matter when an attention mask is used, as long as it is a valid vocabulary index.

  • pad_left (bool) – By default sequences shorter than the longest sequence are right-padded. Use left-padding when set to True.

  • device (Optional[device]) – Device on which the padded tensor is created.

Return type:

Tensor

Returns:

The padded piece ids.

Shape: (batch_size, max_seq_len)

The encoded pieces can be decoded to produce raw strings.
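
A short sketch of how the encoded output is typically consumed; the checkpoint name is again a placeholder:

    from curated_transformers.tokenizers import AutoTokenizer

    tokenizer = AutoTokenizer.from_hf_hub(name="bert-base-uncased")
    pieces = tokenizer(["Paris is a city.", "So is Amsterdam."])

    # Padded piece identifiers and the matching attention mask, both of
    # shape (batch_size, max_seq_len).
    ids = pieces.padded_tensor(padding_id=0)
    mask = pieces.attention_mask()

    # Round-trip the identifiers back to strings, dropping special pieces.
    texts = tokenizer.decode(pieces.ids, skip_special_pieces=True)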

Downloading

Each tokenizer type provides a from_hf_hub method that loads the tokenizer from the Hugging Face Hub. If you do not want to commit to a specific tokenizer type, you can use the AutoTokenizer class. It also provides a from_hf_hub method, but infers the tokenizer type automatically.
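
For example (the repository name below is a placeholder), a tokenizer can be loaded from the Hub or, alternatively, from any fsspec filesystem:

    import fsspec

    from curated_transformers.tokenizers import AutoTokenizer

    # Load directly from the Hugging Face Hub.
    tokenizer = AutoTokenizer.from_hf_hub(name="bert-base-uncased", revision="main")

    # Or load from an fsspec filesystem, e.g. a local directory that
    # contains the tokenizer files (the path is hypothetical).
    fs = fsspec.filesystem("file")
    tokenizer_local = AutoTokenizer.from_fsspec(fs=fs, model_path="/path/to/tokenizer")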

class curated_transformers.tokenizers.AutoTokenizer

Tokenizer loader that infers the correct tokenizer type and loads it from the Hugging Face Model Hub.

classmethod from_fsspec(*, fs, model_path, fsspec_args=None)

Construct a tokenizer and load its parameters from an fsspec filesystem.

Parameters:
  • fs (AbstractFileSystem) – Filesystem.

  • model_path (str) – The model path.

  • fsspec_args (Optional[FsspecArgs]) – Implementation-specific keyword arguments to pass to fsspec filesystem operations.

Return type:

TokenizerBase

Returns:

The tokenizer.

classmethod from_hf_hub(*, name, revision='main')

Infer a tokenizer type and load it from Hugging Face Hub.

Parameters:
  • name (str) – Model name.

  • revision (str) – Model revision.

Return type:

TokenizerBase

Returns:

The tokenizer.

classmethod from_hf_hub_to_cache(*, name, revision='main')

Download the tokenizer’s serialized model, configuration and vocab files from Hugging Face Hub into the local Hugging Face cache directory. Subsequent loading of the tokenizer will read the files from disk. If the files are already cached, this is a no-op.

Parameters:
  • name (str) – Model name.

  • revision (str) – Model revision.
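
A common pattern, sketched below with a placeholder model name, is to pre-populate the cache in a setup step so that later loads only read from local disk:

    from curated_transformers.tokenizers import AutoTokenizer

    # Download once; this is a no-op when the files are already cached ...
    AutoTokenizer.from_hf_hub_to_cache(name="bert-base-uncased", revision="main")

    # ... and load from the cache afterwards.
    tokenizer = AutoTokenizer.from_hf_hub(name="bert-base-uncased", revision="main")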

classmethod from_repo(repo)

Construct and load a tokenizer from a repository.

Parameters:

repo – The repository to load from.

Return type:

TokenizerBase

Returns:

Loaded tokenizer.

Architectures

Tokenizer architectures are split into two groups: non-legacy tokenizers and legacy tokenizers. Non-legacy tokenizers wrap the Hugging Face tokenizers library, whereas legacy tokenizers are model-specific implementations built on curated-tokenizers piece processors, mirroring the tokenizers bundled with the Hugging Face transformers library.

class curated_transformers.tokenizers.TokenizerBase

Bases: ABC

Base class for all tokenizers.

__call__(input)

Split one or more texts into pieces.

Parameters:

input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.

Return type:

PiecesWithIds

Returns:

Pieces in each sequence.

abstract decode(input, *, skip_special_pieces=True)

Reconstruct string sequences from piece identifiers.

Parameters:
  • input (Iterable[Iterable[int]]) – The piece identifiers to reconstruct the strings from.

  • skip_special_pieces (bool) – Skip special pieces during decoding.

Return type:

List[str]

Returns:

The decoded strings.

abstract encode(input)

Split one or more texts into pieces.

Parameters:

input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.

Return type:

PiecesWithIds

Returns:

Pieces in each sequence.

abstract property eos_piece: str | None

Get the end-of-sequence piece.

Returns:

The end-of-sequence piece or None when this piece is not defined.

abstract piece_to_id(piece)

Get the ID for a single piece.

Parameters:

piece (str) – The piece to look up the identifier for.

Return type:

Optional[int]

Returns:

The piece identifier or None when the piece is unknown.
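
A small sketch of these lookup helpers; the checkpoint name is a placeholder and the exact pieces depend on the vocabulary:

    from curated_transformers.tokenizers import AutoTokenizer

    tokenizer = AutoTokenizer.from_hf_hub(name="bert-base-uncased")

    # piece_to_id returns the identifier of a known piece, or None for
    # pieces that are not in the vocabulary.
    sep_id = tokenizer.piece_to_id("[SEP]")
    missing = tokenizer.piece_to_id("not-a-piece-in-this-vocabulary")

    # eos_piece is None when no end-of-sequence piece is defined.
    print(tokenizer.eos_piece, sep_id, missing)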

Non-Legacy

class curated_transformers.tokenizers.Tokenizer(*, tokenizer, config, special_tokens_map)

Bases: TokenizerBase, FromHF

Wraps tokenizers from the Hugging Face tokenizers package. It supports a wide range of piece tokenizers, including WordPiece, byte pair encoding (BPE), and SentencePiece unigram tokenizers. This is the tokenizer to use in the majority of cases. The other tokenizers in the curated-transformers package should only be used when you have a legacy tokenizer that is not in the Hugging Face tokenizer.json format.

Construct a tokenizer.

Parameters:
  • tokenizer (Tokenizer) – The tokenizers tokenizer to use.

  • config (Optional[Dict[str, Any]]) – Additional tokenizer configuration.

  • special_tokens_map (Optional[Dict[str, Any]]) – Map of special tokens.

decode(input, *, skip_special_pieces=True)

Reconstruct string sequences from piece identifiers.

Parameters:
  • input (Iterable[Iterable[int]]) – The piece identifiers to reconstruct the strings from.

  • skip_special_pieces (bool) – Skip special pieces during decoding.

Return type:

List[str]

Returns:

The decoded strings.

encode(input)

Split one or more texts into pieces.

Parameters:

input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.

Return type:

PiecesWithIds

Returns:

Pieces in each sequence.

property eos_piece: str | None

Get the end-of-sequence piece.

Returns:

The end-of-sequence piece or None when this piece is not defined.

classmethod from_dir(path)

Load the tokenizer from a directory with a tokenizer.json file.

Parameters:

path (Path) – Path to the tokenizer directory.

Return type:

TypeVar(Self, bound= Tokenizer)

classmethod from_hf_hub_to_cache(*, name, revision='main')

Download the tokenizer’s serialized model, configuration and vocab files from Hugging Face Hub into the local Hugging Face cache directory. Subsequent loading of the tokenizer will read the files from disk. If the files are already cached, this is a no-op.

Parameters:
  • name (str) – Model name.

  • revision (str) – Model revision.

classmethod from_json(tokenizer_json, config_json=None, special_tokens_map_json=None)

Load the tokenizer from serialized JSON strings.

Parameters:
  • tokenizer_json (str) – The JSON string of the serialized tokenizer.

  • config_json (Optional[str]) – The JSON string of the tokenizer config.

  • special_tokens_map_json (Optional[str]) – The JSON string of the special tokens map.

Return type:

TypeVar(Self, bound= Tokenizer)
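
As a sketch with hypothetical local paths, a tokenizer can be loaded from a directory containing tokenizer.json or from the serialized JSON strings directly:

    from pathlib import Path

    from curated_transformers.tokenizers import Tokenizer

    # From a directory that contains tokenizer.json.
    tokenizer = Tokenizer.from_dir(Path("/path/to/exported-tokenizer"))

    # Or from the serialized JSON strings themselves.
    tokenizer_json = Path("/path/to/exported-tokenizer/tokenizer.json").read_text(
        encoding="utf-8"
    )
    tokenizer_from_json = Tokenizer.from_json(tokenizer_json)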

classmethod from_repo(repo)

Construct and load a tokenizer from a repository.

Parameters:

repo – The repository to load from.

Return type:

TypeVar(Self, bound= Tokenizer)

Returns:

Loaded tokenizer.

piece_to_id(piece)

Get the ID for a single piece.

Parameters:

piece (str) – The piece to look up the identifier for.

Return type:

Optional[int]

Returns:

The piece identifier or None when the piece is unknown.

Legacy

class curated_transformers.tokenizers.legacy.LegacyTokenizer

Bases: TokenizerBase

Base class for legacy tokenizers.

__call__(input)

Split one or more texts into pieces.

Parameters:

input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.

Return type:

PiecesWithIds

Returns:

Pieces in each sequence.

decode(input, skip_special_pieces=True)

Reconstruct string sequences from piece identifiers.

Parameters:
  • input (Iterable[Iterable[int]]) – The piece identifiers to reconstruct the strings from.

  • skip_special_pieces (bool) – Skip special pieces during decoding.

Return type:

List[str]

Returns:

The decoded strings.

encode(input)

Split one or more texts into pieces.

Parameters:

input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.

Return type:

PiecesWithIds

Returns:

Pieces in each sequence.

class curated_transformers.tokenizers.legacy.PreEncoder

Callable applied before encoding.

abstract __call__(chunks)

Apply the pre-encoder on the chunks.

Parameters:

chunks (Iterable[InputChunks]) – Input chunks of each input sequence.

Return type:

List[InputChunks]

Returns:

Modified input chunks.
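
These callables hook into a legacy tokenizer's encoding and decoding steps. A minimal sketch of a custom pre-encoder (the class name and behavior are hypothetical) that lowercases text chunks while leaving special pieces untouched:

    from typing import Iterable, List

    from curated_transformers.tokenizers import InputChunks, TextChunk
    from curated_transformers.tokenizers.legacy import PreEncoder


    class LowercasePreEncoder(PreEncoder):
        # Hypothetical pre-encoder: lowercase all text chunks, keep
        # special piece chunks as-is.
        def __call__(self, chunks: Iterable[InputChunks]) -> List[InputChunks]:
            processed = []
            for seq in chunks:
                processed.append(
                    InputChunks(
                        TextChunk(chunk.text.lower())
                        if isinstance(chunk, TextChunk)
                        else chunk
                        for chunk in seq
                    )
                )
            return processed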

class curated_transformers.tokenizers.legacy.PostEncoder

Callable applied after encoding.

abstract __call__(pieces)

Apply the post-encoder on the pieces.

Parameters:

pieces (PiecesWithIds) – Encoded output of the tokenizer.

Return type:

PiecesWithIds

Returns:

Modified encoded output.

class curated_transformers.tokenizers.legacy.PreDecoder

Callable applied before decoding.

abstract __call__(input)

Apply the pre-decoder on the input.

Parameters:

input (Iterable[Iterable[int]]) – Piece identifiers of each input sequence.

Return type:

List[List[int]]

Returns:

Modified piece identifiers.

class curated_transformers.tokenizers.legacy.PostDecoder

Callable applied after decoding.

abstract __call__(output)

Apply the post-decoder on the output.

Parameters:

output (Iterable[str]) – Decoded strings from the tokenizer.

Return type:

List[str]

Returns:

Modified decoded strings.

class curated_transformers.tokenizers.legacy.ByteBPETokenizer(*, vocab, merges, special_pieces=None)

Bases: LegacyTokenizer

Piece tokenizer using byte-level byte pair encoding (Gage, 1994, Sennrich et al., 2016).

Construct a byte BPE tokenizer.

Parameters:
  • vocab (Dict[str, int]) – The piece vocabulary.

  • merges (List[Tuple[str, str]]) – Merges.

  • special_pieces (Optional[Dict[str, int]]) – Special pieces.

__call__(input)

Split one or more texts into pieces.

Parameters:

input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.

Return type:

PiecesWithIds

Returns:

Pieces in each sequence.

decode(input, skip_special_pieces=True)

Reconstruct string sequences from piece identifiers.

Parameters:
  • input (Iterable[Iterable[int]]) – The piece identifiers to reconstruct the strings from.

  • skip_special_pieces (bool) – Skip special pieces during decoding.

Return type:

List[str]

Returns:

The decoded strings.

encode(input)

Split one or more texts into pieces.

Parameters:

input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.

Return type:

PiecesWithIds

Returns:

Pieces in each sequence.

abstract property eos_piece: str | None

Get the end-of-sequence piece.

Returns:

The end-of-sequence piece or None when this piece is not defined.

piece_to_id(piece)

Get the ID for a single piece.

Parameters:

piece (str) – The piece to look up the identifier for.

Return type:

Optional[int]

Returns:

The piece identifier or None when the piece is unknown.

class curated_transformers.tokenizers.legacy.WordPieceTokenizer(*, vocab, special_pieces)

Bases: LegacyTokenizer

Piece tokenizer using WordPiece tokenization (Devlin et al., 2018).

Construct a tokenizer from a curated-tokenizers WordPiece processor.

Parameters:
  • vocab (Dict[str, int]) – The word piece vocabulary.

  • special_pieces (Optional[Dict[str, int]]) – Special pieces.

__call__(input)

Split one or more texts into pieces.

Parameters:

input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.

Return type:

PiecesWithIds

Returns:

Pieces in each sequence.

decode(input, skip_special_pieces=True)

Reconstruct string sequences from piece identifiers.

Parameters:
  • input (Iterable[Iterable[int]]) – The piece identifiers to reconstruct the strings from.

  • skip_special_pieces (bool) – Skip special pieces during decoding.

Return type:

List[str]

Returns:

The decoded strings.

encode(input)

Split one or more texts into pieces.

Parameters:

input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.

Return type:

PiecesWithIds

Returns:

Pieces in each sequence.

abstract property eos_piece: str | None

Get the end-of-sequence piece.

Returns:

The end-of-sequence piece or None when this piece is not defined.

piece_to_id(piece)

Get the ID for a single piece.

Parameters:

piece (str) – The piece to look up the identifier for.

Return type:

Optional[int]

Returns:

The piece identifier or None when the piece is unknown.

class curated_transformers.tokenizers.legacy.SentencePieceTokenizer(*, processor)

Bases: LegacyTokenizer

Piece tokenizer using SentencePiece encoding (Kudo et al., 2018).

Construct a tokenizer from a curated-tokenizers SentencePiece processor.

Parameters:

processor (SentencePieceProcessor) – The processor to wrap.

__call__(input)

Split one or more texts into pieces.

Parameters:

input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.

Return type:

PiecesWithIds

Returns:

Pieces in each sequence.

decode(input, skip_special_pieces=True)

Reconstruct string sequences from piece identifiers.

Parameters:
  • input (Iterable[Iterable[int]]) – The piece identifiers to reconstruct the strings from.

  • skip_special_pieces (bool) – Skip special pieces during decoding.

Return type:

List[str]

Returns:

The decoded strings.

encode(input)

Split one or more texts into pieces.

Parameters:

input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.

Return type:

PiecesWithIds

Returns:

Pieces in each sequence.

property eos_piece: str | None

Get the end-of-sequence piece.

Returns:

The end-of-sequence piece or None when this piece is not defined.

piece_to_id(piece)

Get the ID for a single piece.

Parameters:

piece (str) – The piece to look up the identifier for.

Return type:

Optional[int]

Returns:

The piece identifier or None when the piece is unknown.

Model-Specific Tokenizers

class curated_transformers.tokenizers.legacy.BERTTokenizer(*, vocab, special_pieces=None, bos_piece='[CLS]', eos_piece='[SEP]', unk_piece='[UNK]', lowercase=False, strip_accents=False, tokenize_chinese_chars=True)

Bases: WordPieceTokenizer, LegacyFromHF

Legacy tokenizer for BERT (Devlin et al., 2018) models.

Construct a BERT tokenizer from a curated-tokenizers WordPiece processor.

Parameters:
  • vocab (Dict[str, int]) – The word piece vocabulary.

  • special_pieces (Optional[Dict[str, int]]) – Special pieces.

  • bos_piece (str) – The piece used to mark the beginning of a sequence.

  • eos_piece (str) – The piece used to mark the end of a sequence.

  • unk_piece (str) – The piece used to mark unknown tokens.

  • lowercase (bool) – Lowercase text.

  • strip_accents (bool) – Strip accents from text.

  • tokenize_chinese_chars (bool) – Tokenize Chinese characters.

__call__(input)

Split one or more texts into pieces.

Parameters:

input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.

Return type:

PiecesWithIds

Returns:

Pieces in each sequence.

decode(input, skip_special_pieces=True)

Reconstruct string sequences from piece identifiers.

Parameters:
  • input (Iterable[Iterable[int]]) – The piece identifiers to reconstruct the strings from.

  • skip_special_pieces (bool) – Skip special pieces during decoding.

Return type:

List[str]

Returns:

The decoded strings.

encode(input)

Split one or more texts into pieces.

Parameters:

input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.

Return type:

PiecesWithIds

Returns:

Pieces in each sequence.

property eos_piece: str | None

Get the end-of-sequence piece.

Returns:

The end-of-sequence piece or None when this piece is not defined.

classmethod from_files(*, vocab_file, bos_piece='[CLS]', eos_piece='[SEP]', unk_piece='[UNK]', lowercase=False, strip_accents=False)

Construct a tokenizer from the vocabulary file.

Parameters:
  • vocab_file (RepositoryFile) – The vocabulary file.

  • bos_piece (str) – The piece to use to mark the beginning of a sequence.

  • eos_piece (str) – The piece to use to mark the end of a sequence.

  • unk_piece (str) – The piece used to mark unknown tokens.

  • lowercase (bool) – Lowercase text.

  • strip_accents (bool) – Strip accents from text.

Return type:

TypeVar(Self, bound= BERTTokenizer)

classmethod from_fsspec(*, fs, model_path, fsspec_args=None)

Construct a tokenizer and load its parameters from an fsspec filesystem.

Parameters:
  • fs (AbstractFileSystem) – Filesystem.

  • model_path (str) – The model path.

  • fsspec_args (Optional[FsspecArgs]) – Implementation-specific keyword arguments to pass to fsspec filesystem operations.

Return type:

TypeVar(SelfFromHF, bound= FromHF)

Returns:

The tokenizer.

classmethod from_hf_hub(*, name, revision='main')

Construct a tokenizer and load its parameters from Hugging Face Hub.

Parameters:
  • name (str) – Model name.

  • revision (str) – Model revision.

Return type:

TypeVar(SelfFromHF, bound= FromHF)

Returns:

The tokenizer.
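
For instance, a sketch with a placeholder checkpoint name:

    from curated_transformers.tokenizers.legacy import BERTTokenizer

    # Placeholder name; any BERT checkpoint with a WordPiece vocabulary works.
    tokenizer = BERTTokenizer.from_hf_hub(name="bert-base-uncased", revision="main")
    pieces = tokenizer(["Tokenized with WordPiece."])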

classmethod from_hf_hub_to_cache(*, name, revision='main')

Download the tokenizer’s serialized model, configuration and vocab files from Hugging Face Hub into the local Hugging Face cache directory. Subsequent loading of the tokenizer will read the files from disk. If the files are already cached, this is a no-op.

Parameters:
  • name (str) – Model name.

  • revision (str) – Model revision.

classmethod from_repo(repo)

Construct and load a tokenizer from a repository.

Parameters:

repo – The repository to load from.

Return type:

TypeVar(SelfLegacyFromHF, bound= LegacyFromHF)

Returns:

Loaded tokenizer.

piece_to_id(piece)

Get the ID for a single piece.

Parameters:

piece (str) – The piece to look up the identifier for.

Return type:

Optional[int]

Returns:

The piece identifier or None when the piece is unknown.

class curated_transformers.tokenizers.legacy.CamemBERTTokenizer(*, processor, bos_piece='<s>', eos_piece='</s>')

Bases: SentencePieceTokenizer, LegacyFromHF

Legacy tokenizer for CamemBERT (Martin et al., 2020) models.

Construct a CamemBERT tokenizer from a curated-tokenizers SentencePiece processor.

Parameters:
  • processor (SentencePieceProcessor) – The processor to wrap.

  • bos_piece (str) – The piece to use to mark the beginning of a sequence.

  • eos_piece (str) – The piece to use to mark the end of a sequence.

__call__(input)

Split one or more texts into pieces.

Parameters:

input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.

Return type:

PiecesWithIds

Returns:

Pieces in each sequence.

decode(input, skip_special_pieces=True)

Reconstruct string sequences from piece identifiers.

Parameters:
  • input (Iterable[Iterable[int]]) – The piece identifiers to reconstruct the strings from.

  • skip_special_pieces (bool) – Skip special pieces during decoding.

Return type:

List[str]

Returns:

The decoded strings.

encode(input)

Split one or more texts into pieces.

Parameters:

input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.

Return type:

PiecesWithIds

Returns:

Pieces in each sequence.

property eos_piece: str | None

Get the end-of-sequence piece.

Returns:

The end-of-sequence piece or None when this piece is not defined.

classmethod from_files(*, model_file, bos_piece='<s>', eos_piece='</s>')

Construct a tokenizer from a SentencePiece model file.

Parameters:
  • model_file (RepositoryFile) – The SentencePiece model file.

  • bos_piece (str) – The piece to use to mark the beginning of a sequence.

  • eos_piece (str) – The piece to use to mark the end of a sequence.

Return type:

TypeVar(Self, bound= CamemBERTTokenizer)

classmethod from_fsspec(*, fs, model_path, fsspec_args=None)

Construct a tokenizer and load its parameters from an fsspec filesystem.

Parameters:
  • fs (AbstractFileSystem) – Filesystem.

  • model_path (str) – The model path.

  • fsspec_args (Optional[FsspecArgs]) – Implementation-specific keyword arguments to pass to fsspec filesystem operations.

Return type:

TypeVar(SelfFromHF, bound= FromHF)

Returns:

The tokenizer.

classmethod from_hf_hub(*, name, revision='main')

Construct a tokenizer and load its parameters from Hugging Face Hub.

Parameters:
  • name (str) – Model name.

  • revision (str) – Model revision.

Return type:

TypeVar(SelfFromHF, bound= FromHF)

Returns:

The tokenizer.

classmethod from_hf_hub_to_cache(*, name, revision='main')

Download the tokenizer’s serialized model, configuration and vocab files from Hugging Face Hub into the local Hugging Face cache directory. Subsequent loading of the tokenizer will read the files from disk. If the files are already cached, this is a no-op.

Parameters:
  • name (str) – Model name.

  • revision (str) – Model revision.

classmethod from_repo(repo)

Construct and load a tokenizer from a repository.

Parameters:

repo – The repository to load from.

Return type:

TypeVar(SelfLegacyFromHF, bound= LegacyFromHF)

Returns:

Loaded tokenizer.

piece_to_id(piece)

Get the ID for a single piece.

Parameters:

piece (str) – The piece to look up the identifier for.

Return type:

Optional[int]

Returns:

The piece identifier or None when the piece is unknown.

class curated_transformers.tokenizers.legacy.RoBERTaTokenizer(*, vocab, merges, special_pieces=None, bos_piece='<s>', eos_piece='</s>')

Bases: ByteBPETokenizer, LegacyFromHF

Legacy tokenizer for RoBERTa (Liu et al., 2019) models.

Construct a RoBERTa tokenizer.

Parameters:
  • vocab (Dict[str, int]) – The word piece vocabulary.

  • merges (List[Tuple[str, str]]) – Merges.

  • special_pieces (Optional[Dict[str, int]]) – Special pieces.

  • bos_piece (str) – Beginning of sequence piece.

  • eos_piece (str) – End of sequence piece.

__call__(input)

Split one or more texts into pieces.

Parameters:

input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.

Return type:

PiecesWithIds

Returns:

Pieces in each sequence.

decode(input, skip_special_pieces=True)

Reconstruct string sequences from piece identifiers.

Parameters:
  • input (Iterable[Iterable[int]]) – The piece identifiers to reconstruct the strings from.

  • skip_special_pieces (bool) – Skip special pieces during decoding.

Return type:

List[str]

Returns:

The decoded strings.

encode(input)

Split one or more texts into pieces.

Parameters:

input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.

Return type:

PiecesWithIds

Returns:

Pieces in each sequence.

property eos_piece: str | None

Get the end-of-sequence piece.

Returns:

The end-of-sequence piece or None when this piece is not defined.

classmethod from_files(*, vocab_file, merges_file, bos_piece='<s>', eos_piece='</s>')

Construct a tokenizer from vocabulary and merge files.

Parameters:
  • vocab_file (RepositoryFile) – The vocabulary file.

  • merges_file (RepositoryFile) – The merges file.

  • bos_piece (str) – The piece to use to mark the beginning of a sequence.

  • eos_piece (str) – The piece to use to mark the end of a sequence.

Return type:

TypeVar(Self, bound= RoBERTaTokenizer)

classmethod from_fsspec(*, fs, model_path, fsspec_args=None)

Construct a tokenizer and load its parameters from an fsspec filesystem.

Parameters:
  • fs (AbstractFileSystem) – Filesystem.

  • model_path (str) – The model path.

  • fsspec_args (Optional[FsspecArgs]) – Implementation-specific keyword arguments to pass to fsspec filesystem operations.

Return type:

TypeVar(SelfFromHF, bound= FromHF)

Returns:

The tokenizer.

classmethod from_hf_hub(*, name, revision='main')

Construct a tokenizer and load its parameters from Hugging Face Hub.

Parameters:
  • name (str) – Model name.

  • revision (str) – Model revision.

Return type:

TypeVar(SelfFromHF, bound= FromHF)

Returns:

The tokenizer.

classmethod from_hf_hub_to_cache(*, name, revision='main')

Download the tokenizer’s serialized model, configuration and vocab files from Hugging Face Hub into the local Hugging Face cache directory. Subsequent loading of the tokenizer will read the files from disk. If the files are already cached, this is a no-op.

Parameters:
  • name (str) – Model name.

  • revision (str) – Model revision.

classmethod from_repo(repo)

Construct and load a tokenizer from a repository.

Parameters:

repo – The repository to load from.

Return type:

TypeVar(SelfLegacyFromHF, bound= LegacyFromHF)

Returns:

Loaded tokenizer.

piece_to_id(piece)

Get the ID for a single piece.

Parameters:

piece (str) – The piece to look up the identifier for.

Return type:

Optional[int]

Returns:

The piece identifier or None when the piece is unknown.

class curated_transformers.tokenizers.legacy.LlamaTokenizer(*, processor, add_bos_piece=True, add_eos_piece=False)

Bases: SentencePieceTokenizer, LegacyFromHF

Legacy tokenizer for Llama (Touvron et al., 2023 [a], Touvron et al., 2023 [b]) models.

Construct a Llama tokenizer from a curated-tokenizers SentencePiece processor.

Parameters:
  • processor (SentencePieceProcessor) – The processor to wrap.

  • add_bos_piece (bool) – Add a begin-of-sequence piece.

  • add_eos_piece (bool) – Add an end-of-sequence piece.

__call__(input)

Split one or more texts into pieces.

Parameters:

input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.

Return type:

PiecesWithIds

Returns:

Pieces in each sequence.

decode(input, skip_special_pieces=True)

Reconstruct string sequences from piece identifiers.

Parameters:
  • input (Iterable[Iterable[int]]) – The piece identifiers to reconstruct the strings from.

  • skip_special_pieces (bool) – Skip special pieces during decoding.

Return type:

List[str]

Returns:

The decoded strings.

encode(input)

Split one or more texts into pieces.

Parameters:

input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.

Return type:

PiecesWithIds

Returns:

Pieces in each sequence.

property eos_piece: str | None

Get the end-of-sequence piece.

Returns:

The end-of-sequence piece or None when this piece is not defined.

classmethod from_files(*, model_file, add_bos_piece=True, add_eos_piece=False)

Construct a Llama tokenizer from a SentencePiece model.

Parameters:
  • model_file (RepositoryFile) – The SentencePiece model file.

  • add_bos_piece (bool) – Add a begin-of-sequence piece.

  • add_eos_piece (bool) – Add an end-of-sequence piece.

Return type:

TypeVar(Self, bound= LlamaTokenizer)

classmethod from_fsspec(*, fs, model_path, fsspec_args=None)

Construct a tokenizer and load its parameters from an fsspec filesystem.

Parameters:
  • fs (AbstractFileSystem) – Filesystem.

  • model_path (str) – The model path.

  • fsspec_args (Optional[FsspecArgs]) – Implementation-specific keyword arguments to pass to fsspec filesystem operations.

Return type:

TypeVar(SelfFromHF, bound= FromHF)

Returns:

The tokenizer.

classmethod from_hf_hub(*, name, revision='main')

Construct a tokenizer and load its parameters from Hugging Face Hub.

Parameters:
  • name (str) – Model name.

  • revision (str) – Model revision.

Return type:

TypeVar(SelfFromHF, bound= FromHF)

Returns:

The tokenizer.

classmethod from_hf_hub_to_cache(*, name, revision='main')

Download the tokenizer’s serialized model, configuration and vocab files from Hugging Face Hub into the local Hugging Face cache directory. Subsequent loading of the tokenizer will read the files from disk. If the files are already cached, this is a no-op.

Parameters:
  • name (str) – Model name.

  • revision (str) – Model revision.

classmethod from_repo(repo)

Construct and load a tokenizer from a repository.

Parameters:

repo – The repository to load from.

Return type:

TypeVar(SelfLegacyFromHF, bound= LegacyFromHF)

Returns:

Loaded tokenizer.

piece_to_id(piece)

Get the ID for a single piece.

Parameters:

piece (str) – The piece to look up the identifier for.

Return type:

Optional[int]

Returns:

The piece identifier or None when the piece is unknown.

class curated_transformers.tokenizers.legacy.XLMRTokenizer(*, processor)

Bases: SentencePieceTokenizer, LegacyFromHF

Legacy tokenizer for XLM-RoBERTa (Conneau et al., 2019) models.

Construct an XLM-RoBERTa tokenizer from a curated-tokenizers SentencePiece processor.

Parameters:

processor (SentencePieceProcessor) – The processor to wrap.

__call__(input)

Split one or more texts into pieces.

Parameters:

input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.

Return type:

PiecesWithIds

Returns:

Pieces in each sequence.

decode(input, skip_special_pieces=True)

Reconstruct string sequences from piece identifiers.

Parameters:
  • input (Iterable[Iterable[int]]) – The piece identifiers to reconstruct the strings from.

  • skip_special_pieces (bool) – Skip special pieces during decoding.

Return type:

List[str]

Returns:

The decoded strings.

encode(input)

Split one or more texts into pieces.

Parameters:

input (Union[Iterable[InputChunks], Iterable[str]]) – Sequences to tokenize. If the sequences are strings, they are automatically converted to chunks.

Return type:

PiecesWithIds

Returns:

Pieces in each sequence.

property eos_piece: str | None

Get the end-of-sequence piece.

Returns:

The end-of-sequence piece or None when this piece is not defined.

classmethod from_files(*, model_file)

Construct an XLM-R tokenizer from a SentencePiece model.

Parameters:

model_file (RepositoryFile) – The SentencePiece model file.

Return type:

TypeVar(Self, bound= XLMRTokenizer)

classmethod from_fsspec(*, fs, model_path, fsspec_args=None)

Construct a tokenizer and load its parameters from an fsspec filesystem.

Parameters:
  • fs (AbstractFileSystem) – Filesystem.

  • model_path (str) – The model path.

  • fsspec_args (Optional[FsspecArgs]) – Implementation-specific keyword arguments to pass to fsspec filesystem operations.

Return type:

TypeVar(SelfFromHF, bound= FromHF)

Returns:

The tokenizer.

classmethod from_hf_hub(*, name, revision='main')

Construct a tokenizer and load its parameters from Hugging Face Hub.

Parameters:
  • name (str) – Model name.

  • revision (str) – Model revision.

Return type:

TypeVar(SelfFromHF, bound= FromHF)

Returns:

The tokenizer.

classmethod from_hf_hub_to_cache(*, name, revision='main')

Download the tokenizer’s serialized model, configuration and vocab files from Hugging Face Hub into the local Hugging Face cache directory. Subsequent loading of the tokenizer will read the files from disk. If the files are already cached, this is a no-op.

Parameters:
  • name (str) – Model name.

  • revision (str) – Model revision.

classmethod from_repo(repo)

Construct and load a tokenizer from a repository.

Parameters:

repo – The repository to load from.

Return type:

TypeVar(SelfLegacyFromHF, bound= LegacyFromHF)

Returns:

Loaded tokenizer.

piece_to_id(piece)

Get the ID for a single piece.

Parameters:

piece (str) – The piece to look up the identifier for.

Return type:

Optional[int]

Returns:

The piece identifier or None when the piece is unknown.