Tokenizers#

class nemo.collections.common.tokenizers.AutoTokenizer(
pretrained_model_name: str,
vocab_file: str | None = None,
merges_file: str | None = None,
mask_token: str | None = None,
bos_token: str | None = None,
eos_token: str | None = None,
pad_token: str | None = None,
sep_token: str | None = None,
cls_token: str | None = None,
unk_token: str | None = None,
additional_special_tokens: List | None = [],
use_fast: bool | None = True,
trust_remote_code: bool | None = False,
include_special_tokens: bool = False,
)#

Wrapper of the HuggingFace AutoTokenizer (https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoTokenizer). A usage sketch follows the parameter list below.

__init__(
pretrained_model_name: str,
vocab_file: str | None = None,
merges_file: str | None = None,
mask_token: str | None = None,
bos_token: str | None = None,
eos_token: str | None = None,
pad_token: str | None = None,
sep_token: str | None = None,
cls_token: str | None = None,
unk_token: str | None = None,
additional_special_tokens: List | None = [],
use_fast: bool | None = True,
trust_remote_code: bool | None = False,
include_special_tokens: bool = False,
)#
Parameters:
  • pretrained_model_name – corresponds to the HuggingFace AutoTokenizer’s pretrained_model_name_or_path argument. For details, refer to the documentation of the from_pretrained method: https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoTokenizer. The list of all supported models can be found at https://huggingface.co/models

  • vocab_file – path to a file containing the vocabulary, one entry per line.

  • merges_file – path to the merges file used by BPE-based tokenizers.

  • mask_token – mask token

  • bos_token – the beginning of sequence token

  • eos_token – the end of sequence token. Usually equal to sep_token

  • pad_token – token to use for padding

  • sep_token – token used for separating sequences

  • cls_token – class token. Usually equal to bos_token

  • unk_token – token to use for unknown tokens

  • additional_special_tokens – list of other tokens beside standard special tokens (bos, eos, pad, etc.). For example, sentinel tokens for T5 (<extra_id_0>, <extra_id_1>, etc.)

  • use_fast – whether to use the fast HuggingFace tokenizer implementation

  • trust_remote_code – whether to allow loading custom tokenizer code from the model repository (passed through to the HuggingFace from_pretrained call)

  • include_special_tokens – when True, converting text to ids includes special tokens / prompt tokens (if any), i.e. it returns self.tokenizer(text).input_ids
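
A minimal usage sketch. The model name "bert-base-uncased" and the text_to_ids / ids_to_text calls below are illustrative assumptions, not part of this reference:

from nemo.collections.common.tokenizers import AutoTokenizer

# Load a HuggingFace tokenizer by name; "bert-base-uncased" is an arbitrary example.
tokenizer = AutoTokenizer(pretrained_model_name="bert-base-uncased", use_fast=True)

# Convert text to token ids and back (method names assumed from the shared
# NeMo tokenizer interface).
ids = tokenizer.text_to_ids("hello world")
text = tokenizer.ids_to_text(ids)
print(ids, text)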

class nemo.collections.common.tokenizers.SentencePieceTokenizer(
model_path: str,
special_tokens: Dict[str, str] | List[str] | None = None,
legacy: bool = False,
ignore_extra_whitespaces: bool = True,
chat_template: Dict | None = None,
trim_spm_separator_after_special_token=True,
spm_separator='▁',
)#

Wrapper of the SentencePiece tokenizer google/sentencepiece (https://github.com/google/sentencepiece). A usage sketch follows the __init__ signature below.

Parameters:
  • model_path – path to the SentencePiece tokenizer model. To create the model, use create_spt_model()

  • special_tokens – either list of special tokens or dictionary of token name to token value

  • legacy – when set to True, the previous behavior of the SentencePiece wrapper is restored, including the ability to add special tokens inside the wrapper.

  • ignore_extra_whitespaces – whether to ignore extra whitespace in the input text while encoding. Note: this exists for current models whose tokenizers were trained to ignore extra whitespace by default and therefore do not handle it explicitly. To check whether a tokenizer ignores extra whitespace by default, refer to its self.removed_extra_spaces attribute. A parameter has been added to process_asr_tokenizer.py so that upcoming models handle this natively.

__init__(
model_path: str,
special_tokens: Dict[str, str] | List[str] | None = None,
legacy: bool = False,
ignore_extra_whitespaces: bool = True,
chat_template: Dict | None = None,
trim_spm_separator_after_special_token=True,
spm_separator='▁',
)#
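
A hedged usage sketch, assuming an existing SentencePiece model file at a placeholder path ("tokenizer.model"); the text_to_ids / ids_to_text calls are assumed from the shared NeMo tokenizer interface:

from nemo.collections.common.tokenizers import SentencePieceTokenizer

# "tokenizer.model" is a placeholder path to a model produced by SentencePiece,
# e.g. via create_spt_model().
tokenizer = SentencePieceTokenizer(model_path="tokenizer.model")

ids = tokenizer.text_to_ids("hello world")
print(tokenizer.ids_to_text(ids))
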
class nemo.collections.common.tokenizers.TokenizerSpec#

Inherit this class to implement a new tokenizer.

__init__()#
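
A minimal subclass sketch. This reference does not list the methods a subclass must provide; the method names below (text_to_tokens, tokens_to_ids, and so on) are assumptions based on the interface the other tokenizers in this module expose, and the whitespace splitting is purely illustrative:

from nemo.collections.common.tokenizers import TokenizerSpec

class WhitespaceTokenizer(TokenizerSpec):
    """Toy tokenizer that splits text on whitespace; for illustration only."""

    def __init__(self, vocab):
        super().__init__()
        # Map each vocabulary entry to an integer id and back.
        self.vocab = {tok: i for i, tok in enumerate(vocab)}
        self.inv_vocab = {i: tok for tok, i in self.vocab.items()}

    def text_to_tokens(self, text):
        return text.split()

    def tokens_to_text(self, tokens):
        return " ".join(tokens)

    def tokens_to_ids(self, tokens):
        return [self.vocab[t] for t in tokens]

    def ids_to_tokens(self, ids):
        return [self.inv_vocab[i] for i in ids]

    def text_to_ids(self, text):
        return self.tokens_to_ids(self.text_to_tokens(text))

    def ids_to_text(self, ids):
        return self.tokens_to_text(self.ids_to_tokens(ids))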