Small PRs, Big Impact: A Git Workflow for Data Scientists
Collaborating on projects using Git can be challenging – especially for scientists, academics, and those without a software engineering background. This was certainly true for me early in my career. Since then, I’ve learned a lot from the incredible people I’ve worked with as a machine learning engineer, and from reading quite a bit on the topic.
The main bottleneck in any project – software, data science, or otherwise – is almost always the human element. A small investment in learning how to smooth out the collaboration process pays off substantially. I want to share some of what I’ve learned to make Git workflows more accessible to data scientists.
A Typical Git Workflow
Here’s the general workflow when working on a Git project:
- git clone a repository
- Plan a change to the repository
- git checkout -b <branch> to create a new branch
- Make changes
- git push
- Open a pull request (PR) on GitHub to merge <branch> into main
A major problem arises when a branch includes too many changes. As a reviewer, it’s hard not to shudder at the sight of 800+ changed lines. This leads to several issues:
- Reviewers procrastinate – reviewing so many changes is a gruelling task
- Reviews take longer
- Important details are missed
Reading code is much harder than writing it. There’s significant cognitive load involved in giving good-quality reviews (which raises the question – why bother with reviews if they’re ineffective?). Large PRs are unsustainable and don’t scale well across a team.
The key takeaway:
Smaller and more focused PRs tend to spend less time stuck in review limbo.
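One rough way to notice a branch drifting towards 800-line territory is to count its changed lines before opening the PR. The sketch below builds a throwaway repository so it can be run as-is; the file name, the branch name `feature`, and the 400-line budget are illustrative assumptions, not thresholds taken from any tool or from this post.

```shell
#!/bin/sh
# Sketch: count the lines a branch changes relative to main.
# Everything happens in a temporary directory, so it is safe to run.
set -eu

workdir=$(mktemp -d)
cd "$workdir"
git init -q .
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"

# A tiny "main" branch with one file.
printf 'a\nb\nc\n' > pipeline.txt
git add pipeline.txt
git commit -q -m 'Initial commit'
git branch -M main

# A feature branch that edits one line and appends two.
git checkout -q -b feature
printf 'a\nB\nc\nd\ne\n' > pipeline.txt
git commit -q -am 'Change pipeline'

# --numstat prints "<added> <deleted> <path>" per file; sum both columns.
# main...HEAD compares against the point where the branch left main.
changed=$(git diff --numstat main...HEAD | awk '{ total += $1 + $2 } END { print total + 0 }')
echo "Changed lines: $changed"

max_lines=400   # hypothetical budget; pick whatever suits your team
if [ "$changed" -gt "$max_lines" ]; then
  echo "Consider splitting this branch into smaller PRs"
else
  echo "Size looks reviewable"
fi
```

In a real clone, the same count is available without the scaffolding by running `git diff --shortstat origin/main...HEAD` on your branch.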
Tips for a High-Quality Review
To ensure your PRs are easier to review and more likely to be approved quickly:
- Keep them small and focused
- Include tests and docstrings early
- Provide code that reviewers can run to verify the changes work
- Bonus points for screenshots or other testing evidence
- Don’t wait until the end to get feedback – ping someone on Slack and start a conversation
Breaking Down Large Changes
For large changes, I often create a “dev” branch where I can experiment freely. Once I’ve got a working draft, I split it into smaller, reviewable PRs.
For example, say I’m implementing a new data processing pipeline. I might split the work into PRs like:
- Implement skeleton code with placeholder endpoints
- Add data downloading function
- Add data cleaning function
- Add data transformation function
- Add data uploading function
This keeps each PR focused and easier to reason about.
Example Workflow
We’ll clone the repo twice on our local machine, once for small PRs and once for development:
$ git clone git@github.com:<user>/<repo>.git <repo>
$ git clone git@github.com:<user>/<repo>.git <repo>-dev
Go to your <repo>-dev copy of the repo and use it purely for development:
$ cd <repo>-dev
$ git checkout -b <name>/dev/<task-name>
# Make *all* the changes needed
$ git add .
$ git commit -m 'Draft implementation of feature'
# -u sets the upstream, which a brand-new branch needs on its first push
$ git push -u origin <name>/dev/<task-name>
You can open a draft PR for this branch, but if it’s too large, it’ll be hard to review. So let’s break it up using our other repo copy:
$ cd ../<repo>
$ git pull # Make sure main is up to date
$ git checkout -b <name>/feat/add-skeleton-and-docs
$ code .
# Make *only* the changes needed for skeleton code with placeholders
# You can copy-paste code or use `git diff`/`git apply`
$ git add .
$ git commit -m 'feat: Add skeleton for new service'
# -u sets the upstream, which a brand-new branch needs on its first push
$ git push -u origin <name>/feat/add-skeleton-and-docs
Now we can open a PR for <name>/feat/add-skeleton-and-docs. It’s smaller, more focused, and will get reviewed much faster.
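The `git diff`/`git apply` option mentioned in the comments above can be sketched as follows. This is a minimal, self-contained demo under some simplifying assumptions: it uses a single throwaway repository with local branches named `dev` and `feat/add-skeleton` (standing in for the two clones and the pushed dev branch), and the file names `skeleton.py` and `cleaning.py` are made up for illustration.

```shell
#!/bin/sh
# Sketch: carry only one file's changes from a large dev branch
# into a small PR branch using git diff piped into git apply.
set -eu

workdir=$(mktemp -d)
cd "$workdir"
git init -q .
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"

echo "# pipeline" > README.md
git add README.md
git commit -q -m 'Initial commit'
git branch -M main

# Stand-in for the large dev branch: two new files in one draft commit.
git checkout -q -b dev
echo "def run(): pass" > skeleton.py
echo "def clean(df): return df" > cleaning.py
git add skeleton.py cleaning.py
git commit -q -m 'Draft implementation of feature'

# Start the small PR branch from main, not from dev.
git checkout -q -b feat/add-skeleton main

# Take the dev branch's changes for skeleton.py only and apply them
# to the working tree; cleaning.py is deliberately left behind.
git diff main dev -- skeleton.py | git apply

git add skeleton.py
git commit -q -m 'feat: Add skeleton for new service'

# List what this small branch changes relative to main.
git diff --name-only main HEAD
```

With two separate clones, the dev branch would instead be reachable as `origin/<name>/dev/<task-name>` after a `git fetch` in the main clone, so that ref would take the place of `dev` in the `git diff` command.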
If the PR changed in response to review feedback, we can pick up those changes by syncing our dev branch with main after the PR is merged:
$ cd ../<repo>-dev
$ git checkout <name>/dev/<task-name>
$ git fetch
$ git merge origin/main
# Check what's left to merge
$ git diff origin/main
# Check which files still have differences
$ git diff origin/main --numstat
Repeat this process until all the code from the dev branch has been merged via small PRs. Then, you can safely delete the dev branch!
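One way to confirm that nothing is left before deleting is to check that the dev branch's files are identical to main. The sketch below simulates this in a throwaway repository; the local `main` branch stands in for `origin/main`, and the branch name `dev` and file name `skeleton.py` are illustrative assumptions.

```shell
#!/bin/sh
# Sketch: verify a dev branch has nothing left to merge, then delete it.
set -eu

workdir=$(mktemp -d)
cd "$workdir"
git init -q .
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"

echo "# pipeline" > README.md
git add README.md
git commit -q -m 'Initial commit'
git branch -M main

# The dev branch holds the draft.
git checkout -q -b dev
echo "def run(): pass" > skeleton.py
git add skeleton.py
git commit -q -m 'Draft implementation of feature'

# Simulate the small PR landing on main with the same content.
git checkout -q main
echo "def run(): pass" > skeleton.py
git add skeleton.py
git commit -q -m 'feat: Add skeleton for new service'

# Sync dev with main, as in the workflow above.
git checkout -q dev
git merge -q --no-edit main

# git diff --quiet exits 0 when the working tree matches main exactly.
if git diff --quiet main; then
  echo "Nothing left to merge; deleting dev"
  git checkout -q main
  # -D rather than -d: dev still has its own commits (the draft and the
  # merge), so git would refuse a plain -d even though the files match.
  git branch -D dev
else
  echo "Dev branch still has unmerged changes:"
  git diff main --numstat
fi
```

If the dev branch was also pushed, `git push origin --delete <name>/dev/<task-name>` removes the remote copy as well.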