ProbableOdyssey

Small PRs, Big Impact: A Git Workflow for Data Scientists

Collaborating on projects using Git can be challenging – especially for scientists, academics, and those without a software engineering background. This was certainly true for me early in my career. Since then, I’ve learned a lot from the incredible people I’ve worked with as a machine learning engineer, and it’s a topic I’ve read quite a bit about.

The main bottleneck in any project – software, data science, or otherwise – is almost always the human element. A small investment in learning how to smooth out the collaboration process pays off substantially. I want to share some of what I’ve learned to make Git workflows more accessible to data scientists.

A Typical Git Workflow

Here’s the general workflow when working on a Git project:

A major problem arises when a branch includes too many changes. As a reviewer, it’s hard not to shudder at the sight of 800+ changed lines. This leads to several issues:

Reading code is much harder than writing it. There’s significant cognitive load involved in giving good-quality reviews (which raises the question – why bother with reviews if they’re ineffective?). Large PRs are unsustainable and don’t scale well across a team.

The key takeaway:

Smaller and more focused PRs tend to spend less time stuck in review limbo.
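
One way to catch an oversized PR before a reviewer does is to measure the branch against main before pushing. The sketch below builds a throwaway repository so it can run anywhere. The file names, the branch name, and the 400-line budget are assumptions I made for illustration, not a rule from any particular team.

```shell
#!/bin/sh
# Sketch: measure how large a branch is relative to main before opening a PR.
# File names, branch name, and the 400-line budget are illustrative assumptions.
set -e

demo=$(mktemp -d)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the initial branch "main"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# One commit on main
printf 'import pandas as pd\n' > pipeline.py
git add pipeline.py
git commit -q -m 'Initial commit'

# A feature branch that touches two files
git checkout -q -b feat/add-loader
printf 'def load(path):\n    return pd.read_csv(path)\n' >> pipeline.py
printf 'def test_load():\n    pass\n' > test_pipeline.py
git add .
git commit -q -m 'feat: Add loader'

# Per-file added/deleted line counts since the branch left main
git diff --numstat main...HEAD

# Total changed lines (added + deleted) across the whole branch
changed=$(git diff --numstat main...HEAD | awk '{ total += $1 + $2 } END { print total + 0 }')
budget=400
if [ "$changed" -gt "$budget" ]; then
    echo "Branch changes ${changed} lines; consider splitting it."
else
    echo "Branch changes ${changed} lines; within the ${budget}-line budget."
fi
```

In a real repository only the two `git diff --numstat main...HEAD` lines matter; the three-dot form compares against the point where the branch left main, so unrelated commits on main don't inflate the count.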

Tips for a High-Quality Review

To ensure your PRs are easier to review and more likely to be approved quickly:


Breaking Down Large Changes

For large changes, I often create a “dev” branch where I can experiment freely. Once I’ve got a working draft, I split it into smaller, reviewable PRs.

For example, say I’m implementing a new data processing pipeline. I might split the work into PRs like:

This keeps each PR focused and easier to reason about.


Example Workflow

We’ll check out two copies of the repo on our local machine:

$ git clone git@github.com:<user>/<repo>.git <repo>
$ git clone git@github.com:<user>/<repo>.git <repo>-dev

Go to the <repo>-dev copy and use it purely for development:

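$ cd <repo>-dev
$ git checkout -b <name>/dev/<task-name>
# Make *all* the changes needed
$ git add .
$ git commit -m 'Draft implementation of feature'
# -u sets the upstream; a plain `git push` fails on a brand-new branch by default
$ git push -u origin <name>/dev/<task-name>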

You can open a draft PR for this branch, but if it’s too large, it’ll be hard to review. So let’s break it up using our other repo copy:

$ cd ../<repo>
$ git checkout main
$ git pull  # Make sure main is up to date
$ git checkout -b <name>/feat/add-skeleton-and-docs
$ code .  # Open the repo in VS Code (or your editor of choice)
# Make *only* the changes needed for skeleton code with placeholders
# You can copy-paste code or use `git diff`/`git apply`
$ git add .
$ git commit -m 'feat: Add skeleton for new service'
# -u sets the upstream for the new branch
$ git push -u origin <name>/feat/add-skeleton-and-docs

Now we can open a PR for <name>/feat/add-skeleton-and-docs. It’s smaller, more focused, and will get reviewed much faster.
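
The `git diff`/`git apply` route mentioned in the comments deserves a concrete sketch. The script below fakes the two-clone setup in a temporary directory, with a local bare repository standing in for GitHub, and carries over only one file from the dev branch as a patch. The file names, branch names, and patch location are assumptions made for the demo.

```shell
#!/bin/sh
# Sketch: move a subset of changes from the dev clone to the clean clone.
# All names below (repo, repo-dev, pipeline/*.py, skeleton.patch) are illustrative.
set -e

work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Seed a repository with one commit on main
git init -q seed
cd seed
git symbolic-ref HEAD refs/heads/main
printf '# demo project\n' > README.md
git add README.md
git commit -q -m 'Initial commit'
cd ..

# A local bare repository stands in for the GitHub remote
git clone -q --bare seed origin.git
git clone -q origin.git repo
git clone -q origin.git repo-dev

# Dev clone: the full draft, committed in one go
cd repo-dev
git checkout -q -b dev/new-pipeline
mkdir pipeline
printf 'def run():\n    raise NotImplementedError\n' > pipeline/service.py
printf 'def clean(df):\n    return df.dropna()\n' > pipeline/transform.py
git add .
git commit -q -m 'Draft implementation of feature'

# Export only the skeleton file as a patch against main
git diff origin/main -- pipeline/service.py > ../skeleton.patch

# Clean clone: apply just that patch on a fresh branch
cd ../repo
git checkout -q -b feat/add-skeleton-and-docs
git apply --check ../skeleton.patch   # dry run: stops here if the patch does not apply
git apply ../skeleton.patch
git add .
git commit -q -m 'feat: Add skeleton for new service'

# Only the skeleton file should differ from main
git diff --name-only main HEAD
```

Listing paths after the `--` is what keeps the patch small: anything not named there stays behind on the dev branch for a later PR.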

Review feedback often changes the code before it lands, so once the PR is merged, sync the dev branch with main:

$ cd ../<repo>-dev
$ git checkout <name>/dev/<task-name>
$ git fetch
$ git merge origin/main
# Check what's left to merge
$ git diff origin/main
# Check which files still have differences
$ git diff --numstat origin/main

Repeat this process until all the code from the dev branch has been merged via small PRs. Then, you can safely delete the dev branch!
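
To make "until all the code has been merged" checkable, the exit status of `git diff --quiet` can serve as the stopping condition. The sketch below is again a self-contained demo in a temporary directory: pushing straight to main stands in for a merged PR, and the file and branch names are assumptions of mine.

```shell
#!/bin/sh
# Sketch: sync the dev branch after a small PR lands, then list what is left.
# Repo layout, file names, and branch names are illustrative assumptions.
set -e

work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Seed repository, bare "remote", and the two working clones
git init -q seed
cd seed
git symbolic-ref HEAD refs/heads/main
printf '# demo project\n' > README.md
git add README.md
git commit -q -m 'Initial commit'
cd ..
git clone -q --bare seed origin.git
git clone -q origin.git repo
git clone -q origin.git repo-dev

# Dev clone: the full draft (two files)
cd repo-dev
git checkout -q -b dev/new-pipeline
printf 'def run():\n    raise NotImplementedError\n' > service.py
printf 'def clean(df):\n    return df.dropna()\n' > transform.py
git add .
git commit -q -m 'Draft implementation of feature'

# Clean clone: pushing straight to main stands in for a merged PR
cd ../repo
printf 'def run():\n    raise NotImplementedError\n' > service.py
git add service.py
git commit -q -m 'feat: Add skeleton for new service'
git push -q origin main

# Back in the dev clone: sync with main, then see what still differs
cd ../repo-dev
git fetch -q
git merge -q --no-edit origin/main
remaining=$(git diff --name-only origin/main)

if git diff --quiet origin/main; then
    echo "Nothing left: the dev branch can be deleted."
else
    echo "Still to split into PRs:"
    echo "$remaining"
fi
```

Here one file is still outstanding, so the loop continues; once the check reports nothing left, the dev branch holds no code that main lacks.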
