Sat, Mar 25, 2023
How do you know if code is well written, well structured, readable, and likely to stand the test of time? Following this general set of rules will ensure that best practices are applied and that the code is of high quality.
Version control is critical for tracking every code modification, and it is especially valuable when working with other developers.
Version control is handled with git. The repository can be hosted locally, in-house, or in the cloud.
For the cloud, the most popular platforms are GitHub, GitLab, and Bitbucket.
Every project should contain the code files plus additional Markdown files, organized in an understandable structure. At a minimum that means a README and a license file alongside the source code.
Additionally, and highly recommended, is to use a widespread and trustworthy structure such as the one provided by *cookiecutter*.
The documentation is available on the cookiecutter website.
A typical use for Data Science projects goes like this (using the cookiecutter-data-science template):
cookiecutter -c v1 https://github.com/drivendata/cookiecutter-data-science
And results in a project structure like this:
├── LICENSE
├── Makefile <- Makefile with commands like `make data` or `make train`
├── README.md <- The top-level README for developers using this project.
├── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
│
├── docs <- A default Sphinx project; see sphinx-doc.org for details
│
├── models <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering),
│ the creator's initials, and a short `-` delimited description, e.g.
│ `1.0-jqp-initial-data-exploration`.
│
├── references <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated graphics and figures to be used in reporting
│
├── requirements.txt <- The requirements file for reproducing the analysis environment, e.g.
│ generated with `pip freeze > requirements.txt`
│
├── setup.py <- makes project pip installable (pip install -e .) so src can be imported
├── src <- Source code for use in this project.
│ ├── __init__.py <- Makes src a Python module
│ │
│ ├── data <- Scripts to download or generate data
│ │ └── make_dataset.py
│ │
│ ├── features <- Scripts to turn raw data into features for modeling
│ │ └── build_features.py
│ │
│ ├── models <- Scripts to train models and then use trained models to make
│ │ │ predictions
│ │ ├── predict_model.py
│ │ └── train_model.py
│ │
│ └── visualization <- Scripts to create exploratory and results oriented visualizations
│ └── visualize.py
│
└── tox.ini <- tox file with settings for running tox; see tox.readthedocs.io
Every project needs some form of documentation. Inline comments are the bare minimum, but one should at least write docstrings for classes and methods.
The recommended approach is Sphinx or MkDocs, with the documentation built and deployed automatically using CI/CD tools like GitHub Actions.
def get_random_ingredients(kind=None):
    """
    Return a list of random ingredients as strings.

    :param kind: Optional "kind" of ingredients.
    :type kind: list[str] or None
    :raise lumache.InvalidKindError: If the kind is invalid.
    :return: The ingredients list.
    :rtype: list[str]
    """
    return ["shells", "gorgonzola", "parsley"]
In Python there is one style standard above all: PEP 8.
Every programmer needs to know how to write Python code that follows the PEP 8 convention.
There are a number of tools that check this automatically: using pre-commit hooks with black and flake8 ensures that only PEP 8 compliant code gets committed; otherwise the hook fails and points out the offending lines.
Other tools are integrated into major IDEs like PyCharm or VS Code.
Code that doesn’t work should never be pushed to the main development branch.
Instead, when a defect is detected, branch the code at that point, fix it as soon as possible, and then open a pull request to merge the fix back into the main development branch.
This approach, known as the “zero defects” methodology, is effective at protecting the main development timeline.
The main power of Python is the widespread availability and ease of use of its vast ecosystem of packages, covering almost every use case.
So whenever you need some special, repeatable piece of functionality, first look around: chances are someone has faced the same problem before… and packaged the solution.
So don’t waste your time.
Search PyPI, conda, GitHub… but don’t re-write what already exists.
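As a small illustration (a sketch using only the standard library), compare re-implementing a word counter by hand with simply reusing `collections.Counter`:

# Hand-rolled solution: it works, but it re-invents the wheel
def count_words_by_hand(words):
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

# The standard library already ships a tested implementation
from collections import Counter

words = ["spam", "eggs", "spam"]
assert count_words_by_hand(words) == Counter(words)
print(Counter(words).most_common(1))  # [('spam', 2)]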
In Python there are several data structures, and each one is better suited to different use cases. Choosing badly can lead to slow performance or major rewrites of the code later on.
Always take into account the general built-in data structures (lists, tuples, dictionaries, sets) as well as user- or package-defined ones such as NumPy arrays or pandas DataFrames.
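For example (a minimal sketch), a membership test on a list scans every element, while the same test on a set is a near constant-time hash lookup:

import timeit

items_list = list(range(100_000))
items_set = set(items_list)

# Look up the last element 1,000 times in each structure
print(timeit.timeit(lambda: 99_999 in items_list, number=1_000))  # slow: linear scan
print(timeit.timeit(lambda: 99_999 in items_set, number=1_000))   # fast: hash lookup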
If you follow the PEP 8 convention, you are almost there.
The code also has to be easy to understand: no unused imports or methods, only necessary comments, a consistent comment style, and a maximum line length of 79 or 80 characters.
Variable names should be descriptive enough that anyone can understand them without reading the rest of the code.
# Don't write like this
x = 1
y = 2

def my_function(a, b):
    z = a * b
    return z

compute = my_function(x, y)
print("Result is: ", compute)

#
# Should be LIKE THIS
#
height = 1
width = 2

def area_rectangle(side_a: float, side_b: float) -> float:
    """
    Return the area of a rectangle by multiplying both sides.

    :param side_a: Length of the first side.
    :param side_b: Length of the second side.
    :return: The computed area.
    :rtype: float
    """
    return side_a * side_b

area = area_rectangle(height, width)
print(f"The rectangle has {area} sqm")
And yes, f-strings are very useful for that matter!
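For instance, f-strings also accept format specifiers and, since Python 3.8, a handy debugging shorthand:

price = 1234.5678
item = "keyboard"
print(f"{item}: ${price:.2f}")  # format specifier inside the f-string
print(f"{price=}")              # prints "price=1234.5678", useful while debugging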
One virtual environment for every project: this is a golden rule. Every project starts with the creation of its virtual environment.
There are many options for this: venv, virtualenv, conda, pipenv, poetry, etc.
The reason behind virtual environments is avoiding package collisions and keeping configurations local to the project.
This makes the code reproducible and, therefore, reliable.
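The usual way is the command line (`python -m venv .venv`, then activate it), but as a minimal Python-only sketch the standard-library venv module does the same job:

import venv

# Create ./.venv with its own interpreter and a local pip
venv.create(".venv", with_pip=True)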
There is good reason not to have everything in one file with thousands of functions.
Python is a very good, easy-to-understand object-oriented language. To exploit this, write code so that each module, class, and method has a single, well-defined responsibility.
By following this principle it is really easy to make object-oriented code reusable, readable, modular, encapsulated, and inheritable.
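A minimal sketch of the idea, with hypothetical class names: keep storage and computation in separate, small classes instead of one class that does everything.

class RectangleRepository:
    """Only responsible for storing rectangle dimensions."""

    def __init__(self):
        self._rectangles = []

    def add(self, height: float, width: float) -> None:
        self._rectangles.append((height, width))

    def all(self) -> list:
        return list(self._rectangles)


class AreaCalculator:
    """Only responsible for computing areas."""

    @staticmethod
    def area(height: float, width: float) -> float:
        return height * width


repo = RectangleRepository()
repo.add(1.0, 2.0)
print([AreaCalculator.area(h, w) for h, w in repo.all()])  # [2.0]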
Import only the methods you need, not all of them! You will avoid crashes and name collisions, make better use of memory, and gain speed. It is also a good security measure.
# This is bad
from sklearn.metrics import *

# This is good
from sklearn.metrics import confusion_matrix

# Importing multiple methods at once is also good practice
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score
Do not turn off warning or error reporting while coding and testing; this should be avoided for obvious reasons. Only turn it off in production code when the warnings are known and do not affect the actual product.
# Never do it like this
import warnings
warnings.filterwarnings("ignore")
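If a specific warning really has to be silenced, a safer pattern (a sketch with a hypothetical noisy function) is to ignore only that category, and only within a limited scope:

import warnings

def noisy_function():
    # Hypothetical helper that emits a FutureWarning
    warnings.warn("this API will change", FutureWarning)
    return 42

# Ignore only FutureWarning, and only inside this block
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    value = noisy_function()

print(value)  # warnings raised outside the block are still reported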
Use distutils for altering path variables.
Handle secret keys and sensitive information using an external package; never hard-code them in your Python files!
Several packages do this: python-dotenv, AWS’s boto3, HashiCorp’s hvac, etc.
Example with python-dotenv:
Create a .env file and write the key-value pairs inside it:
API_KEY=test-key
API_SECRET=test-secret
Add the .env file to your .gitignore so it never gets committed.
In the main.py file, import and load the dotenv package and read each secret by its key; the values come from the environment instead of being hard-coded.
from dotenv import load_dotenv
import os
load_dotenv()
api_key = os.getenv("API_KEY")
api_secret = os.getenv("API_SECRET")
print("API_KEY: ", api_key)
print("API_SECRET: ", api_secret)
Writing good Python code is achievable if one follows a simple set of rules.
In this article I presented my general rules for that purpose, but there may be others.
This is the foundation of the checks needed for MLOps, Data Science, DevOps, and general Python coding.