Byte-Pair Encoding Algorithm
The focus of the second issue of One Minute NLP is BPE, a popular tokenization algorithm used by most LLMs, including GPT, Llama, and Mistral.
Byte-Pair Encoding (BPE) is a tokenization algorithm used by most LLMs (e.g., GPT, Llama) to build their vocabularies. Tokens created by BPE are called subwords because they are often smaller than words; the resulting vocabulary consists of the most common character sequences in the pretraining set. To determine these subwords, BPE starts with individual characters as the token set and then iteratively merges the two most frequent consecutive tokens into a new token until a target vocabulary size is reached. BPE can deal with unknown words during inference because any new word can be represented as a sequence of existing subwords. WordPiece tokenization is a popular alternative to BPE and works similarly, differing mainly in how merge candidates are scored, while the SentencePiece library provides implementations of BPE and the related Unigram algorithm.
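To make the merge loop concrete, here is a minimal Python sketch of BPE training. The function name train_bpe and the whitespace pre-splitting are simplifications of my own, not a reference implementation:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merges from a whitespace-split corpus (toy sketch)."""
    # Start with every word as a sequence of single-character tokens.
    words = [list(word) for word in corpus.split()]
    merges = []  # learned merge rules, in the order they were applied

    for _ in range(num_merges):
        # Count every pair of adjacent tokens across the whole corpus.
        pair_counts = Counter()
        for tokens in words:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break  # nothing left to merge

        # Merge the most frequent pair into a single new token.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged = best[0] + best[1]

        # Apply the merge everywhere it occurs in the corpus.
        new_words = []
        for tokens in words:
            out, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            new_words.append(out)
        words = new_words

    return merges, words
```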
Consider an example corpus: “llama llama red pajama”.
In BPE tokenization, the starting vocabulary consists of individual characters. The corresponding corpus representation would look like this (dashes indicate token boundaries):
Iteration: 0
Vocabulary: l a m r e d p j
Corpus representation: l-l-a-m-a l-l-a-m-a r-e-d p-a-j-a-m-a
Vocabulary and corpus representation after the first iteration (a followed by m was the most frequent pair of consecutive tokens and is merged into a new token, am):
Iteration: 1
Vocabulary: l a m r e d p j am
Corpus representation: l-l-am-a l-l-am-a r-e-d p-a-j-am-a
Vocabulary and corpus representation after the second iteration:
Iteration: 2
Vocabulary: l a m r e d p j am ama
Corpus representation: l-l-ama l-l-ama r-e-d p-a-j-ama
Vocabulary and corpus representation after the fifth iteration:
Iteration: 5
Vocabulary: l a m r e d p j am ama ll llama re
Corpus representation: llama llama re-d p-a-j-ama
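Running the sketch above on the example corpus reproduces this sequence of merges, at least with the tie-breaking used in the sketch; ties between equally frequent pairs could be broken differently in other implementations:

```python
merges, words = train_bpe("llama llama red pajama", num_merges=5)

for i, (left, right) in enumerate(merges, start=1):
    print(f"iteration {i}: merge {left} + {right} -> {left + right}")

# Final corpus representation, with dashes marking token boundaries:
print("  ".join("-".join(tokens) for tokens in words))
# llama  llama  re-d  p-a-j-ama
```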
Further Reading
Neural Machine Translation of Rare Words with Subword Units by Sennrich et al. — This work was the first to introduce the BPE algorithm to NLP through an application in Neural Machine Translation.
Summary of the Tokenizers (Hugging Face docs) — This page provides a fantastic overview of different tokenization techniques, including a detailed explanation of the BPE algorithm.
There are many implementations of BPE; three notable ones are the tiktoken library (used by OpenAI’s models), minbpe (Andrej Karpathy’s minimal implementation), and sentencepiece (Google’s implementation of BPE which, unlike the original implementation, doesn’t require the input text to be pre-split into words).
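For a quick hands-on look at a production BPE vocabulary, the tiktoken library can be used along these lines (assuming the package is installed; cl100k_base is the encoding used by GPT-3.5-turbo and GPT-4):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

token_ids = enc.encode("llama llama red pajama")
subwords = [enc.decode([token_id]) for token_id in token_ids]

print(token_ids)  # integer IDs of the BPE tokens
print(subwords)   # the subword strings the sentence was split into
```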
Do you want to learn more NLP concepts?
Each week I pick one core NLP concept and create a one-slide, one-minute explanation of it. To receive new posts in your inbox every week, subscribe here:
Reach out to me:
Connect with me on LinkedIn
Read my technical blog on Medium
Or send me a message by responding to this post
Is there a concept you would like me to cover in a future issue? Let me know!