Wed, May 31, 2023
We'll focus on the essential step of preprocessing your data for training a Large Language Model (LLM) in Python. Preprocessing transforms raw text data into a format suitable for training, and includes cleaning, tokenization, normalization, and feature extraction. Follow along as we explore practical techniques for preparing your data for LLM training, laying the foundation for the later parts of this series. The pipeline covers the following steps:
Download the dataset
Merge all the datasets
Anonymize the data by removing emails, IP addresses, and secret keys
Remove auto-generated code
Remove code that does not compile or cannot be parsed
Filter on average line length, maximum line length, and percentage of alphanumeric characters
Filter on repository quality metrics (number of GitHub issues, stars, etc.) to drop code that is likely to be buggy
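Several of the steps above (anonymization, dropping auto-generated or unparseable files, and the line-length and alphanumeric filters) can be sketched with the standard library alone. This is a minimal illustration, not the pipeline used later in the series: it assumes the files are Python source, the regular expressions are deliberately simple, and every function name, marker string, and threshold below is a hypothetical placeholder you would tune for your own data. Downloading, merging, and the repository-level quality metrics are left out because they depend on where the data comes from.

```python
import ast
import re

# Illustrative patterns only; production pipelines usually rely on broader detectors.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# Assumed heuristic: a key-like name assigned a quoted string of 16+ characters.
SECRET_RE = re.compile(
    r"\b(api_key|secret|token|password)\b(\s*[:=]\s*)(['\"])[^'\"]{16,}\3",
    re.IGNORECASE,
)

# Assumed markers that often appear at the top of generated files.
GENERATED_MARKERS = ("auto-generated", "autogenerated",
                     "automatically generated", "do not edit")


def anonymize(text: str) -> str:
    """Replace emails, IPv4 addresses, and key-like secrets with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = IPV4_RE.sub("<IP_ADDRESS>", text)
    # Keep the variable name and quotes, replace only the secret value.
    return SECRET_RE.sub(r"\1\2\3<SECRET>\3", text)


def is_autogenerated(source: str) -> bool:
    """Look for a generated-file marker in the first five lines."""
    head = "\n".join(source.splitlines()[:5]).lower()
    return any(marker in head for marker in GENERATED_MARKERS)


def is_parseable(source: str) -> bool:
    """Return True if the text parses as Python source."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


def line_stats(source: str) -> dict:
    """Compute average line length, maximum line length, and alphanumeric fraction."""
    lengths = [len(line) for line in source.splitlines()] or [0]
    alnum = sum(ch.isalnum() for ch in source)
    return {
        "avg_line_length": sum(lengths) / len(lengths),
        "max_line_length": max(lengths),
        "alnum_fraction": alnum / max(len(source), 1),
    }


def keep_file(source: str, max_avg: float = 100, max_line: int = 1000,
              min_alnum: float = 0.25) -> bool:
    """Apply the filters in order; the default thresholds are placeholders."""
    if is_autogenerated(source) or not is_parseable(source):
        return False
    stats = line_stats(source)
    return (stats["avg_line_length"] <= max_avg
            and stats["max_line_length"] <= max_line
            and stats["alnum_fraction"] >= min_alnum)


sample = "def add(a, b):\n    return a + b\n"
print(keep_file(sample))
print(anonymize("contact a.b@example.com from 192.168.0.1"))
```

In this sketch, filtering runs before anonymization so that placeholder tokens do not skew the line statistics; the reverse order is equally defensible if you would rather never hold unredacted text in intermediate storage.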