This week I conducted several interviews with data scientists about challenges in their workflow, and one issue stood out universally: the time spent deciphering a colleague's Jupyter Notebook. To help you avoid causing this frustration, here are four best practices to ensure your team isn't pulling their hair out when reviewing your work.
1. Organize with Clear Sections and Headings
Use markdown cells to divide your notebook into logical sections like Introduction, Data Loading, Data Cleaning, Modeling, Results, etc. This helps create a flow that is easy to follow. Within these cells, use descriptive headings (#, ##, ###) for each section and add context about what the code is doing.
Example:

```markdown
# Project Title: Analyzing Sales Data

In this project, we will analyze sales data to identify trends and patterns. The notebook is divided into several sections, each with a specific focus.

## 1. Data Loading

We will begin by importing the necessary libraries and loading the dataset. The data consists of monthly sales figures for different regions.

### Step 1.1: Importing Libraries

We will import common libraries such as pandas and matplotlib for data manipulation and visualization.
```
2. Add Explanatory Comments
Notebooks are great for quickly exploring ideas, but this scratchpad approach can lead to a bad habit of skipping code comments. Provide concise, meaningful comments throughout your code to explain what each section or key line does; they will also help you when you return to the notebook later.
Example:

```python
# Remove rows with missing values to avoid errors in the model
df = df.dropna()
```
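As a fuller sketch, here is how a commented cleaning cell might read on a small DataFrame (the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical sales data with one missing value (for illustration only)
df = pd.DataFrame({
    "region": ["North", "South", "East"],
    "sales": [100.0, None, 250.0],
})

# Remove rows with missing values to avoid errors in the model
df = df.dropna()

# Reset the index so downstream code sees contiguous row labels
df = df.reset_index(drop=True)
```

A one-line comment per step is usually enough; the goal is to explain intent, not to narrate syntax.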
3. Limit the Length of Code Cells
One of the biggest offenders is a code cell that does too much; it's easy to forget to break up code while in the flow of writing. Avoid long, cluttered code blocks. Instead, break your code into smaller, focused cells, each handling a single task. This enhances readability and makes debugging easier.
Example:

Instead of one long cell:

```python
# Load data, clean, and preprocess all in one cell
```

Split into smaller cells:

```python
# Load data
df = pd.read_csv('data.csv')
```

```python
# Clean data
df = df.dropna()
```

```python
# Preprocess data
df['column'] = df['column'].apply(lambda x: x.lower())
```
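Put together, the smaller cells might run like this; an in-memory CSV string stands in for `data.csv` (and the column names are invented) so the sketch is self-contained:

```python
import io

import pandas as pd

# Stand-in for 'data.csv' so the example runs on its own
csv_data = io.StringIO("column,value\nHello,1\nWorld,2\n,3")

# Load data
df = pd.read_csv(csv_data)

# Clean data: drop the row whose 'column' entry is missing
df = df.dropna()

# Preprocess data: normalize text to lowercase
df["column"] = df["column"].apply(lambda x: x.lower())
```

Because each step lives in its own cell, a failure in preprocessing can be re-run without repeating the (possibly slow) load step.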
4. Use Descriptive Variable and Function Names
Lazy variable names are another victim of the scratchpad mindset, and they come back to bite when the notebook needs to be revisited or shared. Choose variable and function names that describe their purpose or the data they hold. This practice also reduces the need for excessive comments and helps make the code self-explanatory.
Example:

Instead of:

```python
x = df.groupby('region').mean()
```

Use:

```python
average_sales_by_region = df.groupby('region').mean()
```
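With a descriptive name and a small made-up dataset, the intent is obvious at a glance. One caveat: recent pandas versions raise an error if `.mean()` hits non-numeric columns, so this sketch selects the numeric column explicitly:

```python
import pandas as pd

# Made-up monthly sales figures for two regions
df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "sales": [100, 200, 50, 150],
})

# The name says exactly what the result holds
average_sales_by_region = df.groupby("region")["sales"].mean()
```

A reviewer who sees `average_sales_by_region` three cells later needs no comment to know what it contains; `x` would force them to scroll back.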
Conclusion
All of these tips are basic, but they are easy to toss aside in the moment. Follow them and your teammates will surely thank you! Once you have optimized your notebook's readability and are checking it into Git for review, GitNotebooks makes reviewing the diffs as easy as reading your notebook.