How to train your LLM: Tokenizer (5/10)

Thu, Jun 1, 2023

Read in 1 minute

In this part, we'll focus on training a tokenizer for your Large Language Model (LLM) in Python. Tokenization is the process of splitting text into tokens (words, subwords, or individual characters) that the model can map to integer IDs. We'll explore several tokenization approaches, with practical examples and Python code snippets to guide you through training and using a tokenizer with your LLM. By the end of this part, you'll have the knowledge and tools to train a tokenizer that fits your LLM's requirements.

Tokenizers are made up of an algorithm and a vocabulary
Many standard tokenizers are available from Hugging Face
We can train our own custom vocabulary from the underlying training data
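
To make the "algorithm plus vocabulary" idea concrete, here is a minimal sketch in pure Python. It assumes byte-pair encoding (BPE) as the algorithm, which is one common choice and not necessarily the one used later in this series. The names `train_bpe`, `merge_pair`, and `tokenize`, the `</w>` end-of-word marker, and the toy corpus are all made up for illustration. The vocabulary is learned from the training text by repeatedly merging the most frequent adjacent pair of symbols.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules and a vocabulary from an iterable of strings."""
    # Start from characters: each whitespace-separated word becomes a tuple of
    # symbols plus an end-of-word marker (an assumption of this sketch).
    word_freqs = Counter()
    for line in corpus:
        for word in line.split():
            word_freqs[tuple(word) + ("</w>",)] += 1

    vocab = {symbol for word in word_freqs for symbol in word}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # Every word is already a single symbol.

        # Pick the most frequent pair; ties are broken by the pair itself
        # so that training is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        vocab.add(best[0] + best[1])

        # Apply the new merge rule to every word in the corpus.
        updated = Counter()
        for word, freq in word_freqs.items():
            updated[merge_pair(word, best)] += freq
        word_freqs = updated
    return merges, vocab


def tokenize(word, merges):
    """Split one word into tokens by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy training data, purely illustrative.
corpus = ["low lower lowest", "new newer newest"]
merges, vocab = train_bpe(corpus, num_merges=10)
print(sorted(vocab))
print(tokenize("lowest", merges))
```

Here the algorithm is the merge procedure and the vocabulary is the set of symbols it produces, so training on different data yields a different vocabulary. Characters never seen in training simply stay as single symbols in this sketch. A production tokenizer would also need special tokens and a mapping from tokens to integer IDs.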