Tokenization in NLP

Tokenization converts raw text into subword or token sequences a model can consume, trading off vocabulary size against coverage: a smaller vocabulary guarantees coverage through finer splits but yields longer sequences, while a larger one shortens sequences at the cost of a bigger embedding table and sparser rare tokens. The choice of algorithm (byte-pair encoding, the unigram language model, WordPiece) affects how rare words are segmented, how well multilingual text is covered, and how efficiently downstream models run.
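To make the byte-pair encoding idea concrete, below is a minimal sketch of the BPE merge-learning loop in plain Python, following the formulation popularized by Sennrich et al. (2016): each word is split into characters plus an end-of-word marker, and the most frequent adjacent symbol pair is repeatedly merged into a new symbol. The toy corpus, the "</w>" marker, and the merge count of 10 are illustrative choices for this sketch, not fixed parts of the algorithm.

    import re
    from collections import Counter

    def get_pair_counts(vocab):
        """Count adjacent symbol pairs, weighted by word frequency."""
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs

    def merge_pair(pair, vocab):
        """Rewrite the vocabulary so `pair` becomes one merged symbol."""
        # Lookarounds keep the match aligned to whole symbols, so a pair
        # like ("w", "e") never matches inside a longer symbol like "xw".
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
        merged = "".join(pair)
        return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

    # Toy corpus; each word becomes space-separated characters plus an
    # end-of-word marker so merges cannot cross word boundaries.
    corpus = ["low", "low", "lower", "newest", "newest", "widest"]
    vocab = Counter(" ".join(word) + " </w>" for word in corpus)

    merges = []
    for _ in range(10):          # the merge count sets the final vocabulary size
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)

    print(merges)                # learned merge rules, in the order applied

The learned merge list is the tokenizer: to segment new text, the same merges are replayed in order on each character-split word. Production tokenizers differ mainly in scale and in the pair-selection criterion (WordPiece scores pairs by likelihood gain rather than raw frequency; the unigram model instead prunes a large seed vocabulary), but the train-then-replay structure is the same.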