How to train your LLM: Preprocessing (4/10)

Wed, May 31, 2023

In this part, we focus on preprocessing your data for training a Large Language Model (LLM) in Python. Preprocessing transforms raw text into a format suitable for training, and includes cleaning, tokenization, normalization, and feature extraction. Follow along as we explore practical techniques for preparing your data for LLM training, setting the foundation for the later parts of the series.

Downloading the dataset
Merging all the datasets
Anonymize the data by removing emails, IP addresses, and secret keys
Remove auto-generated code
- Detected using standard regex and other heuristics
Remove code that doesn’t compile or is not parseable
- Only possible for a subset of languages
Filter files based on average line length, maximum line length, and percentage of alphanumeric characters
Use quality metrics (number of GitHub issues, stars, etc.) to filter out buggy code
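
As a rough illustration of the filtering steps in this list, here is a minimal Python sketch. It is not the pipeline used in this series: every function name, regex, placeholder token, and threshold below is an assumption chosen for the example, and the parse check only covers Python source (as noted above, such checks are only possible for a subset of languages).

```python
import ast
import re

# Illustrative patterns only (assumptions); real pipelines use broader detectors.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# Assumed shape of a secret: key-like name, '=' or ':', then a long quoted token.
SECRET_RE = re.compile(
    r"(api_key|secret|token|password)(\s*[:=]\s*)(['\"])[A-Za-z0-9_\-/+]{16,}\3",
    re.IGNORECASE,
)
# Assumed markers that commonly appear in headers of generated files.
AUTOGEN_RE = re.compile(r"auto-?generated|do not edit|generated by", re.IGNORECASE)


def anonymize(text: str) -> str:
    """Replace emails, IPv4 addresses, and key-like secrets with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = IPV4_RE.sub("<IP_ADDRESS>", text)
    # Keep the variable name and quotes, replace only the secret value.
    return SECRET_RE.sub(r"\1\2\3<KEY>\3", text)


def is_autogenerated(text: str, head_lines: int = 5) -> bool:
    """Heuristic: look for a generator marker in the first few lines."""
    head = "\n".join(text.splitlines()[:head_lines])
    return AUTOGEN_RE.search(head) is not None


def is_parseable_python(text: str) -> bool:
    """Return True if the text parses as Python source."""
    try:
        ast.parse(text)
        return True
    except (SyntaxError, ValueError):
        return False


def passes_line_filters(
    text: str,
    max_avg_line_len: float = 100,   # assumed threshold
    max_line_len: int = 1000,        # assumed threshold
    min_alnum_frac: float = 0.25,    # assumed threshold
) -> bool:
    """Filter on average line length, maximum line length, and alphanumeric fraction."""
    lines = text.splitlines()
    if not lines:
        return False
    lengths = [len(line) for line in lines]
    avg_len = sum(lengths) / len(lengths)
    alnum_frac = sum(ch.isalnum() for ch in text) / len(text)
    return (
        avg_len <= max_avg_line_len
        and max(lengths) <= max_line_len
        and alnum_frac >= min_alnum_frac
    )


def keep_file(text: str) -> bool:
    """Combine the heuristics into a single keep/drop decision for one file."""
    return (
        not is_autogenerated(text)
        and is_parseable_python(text)
        and passes_line_filters(text)
    )
```

In this sketch, `keep_file` would be applied to each file in the merged dataset, and `anonymize` to the files that survive. The thresholds would need tuning per language, since minified JavaScript and Python, for example, have very different line-length profiles.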