How to reproduce your analysis
What do you do when things don’t seem to be behaving the same way now as they did before? Here are a few techniques to avoid that dreaded scenario.
“But didn’t you say you got 80% when you ran this analysis last month?”
That little question prompted a very miserable few weeks of data science detective work for me in grad school.
What do you do when things don’t seem to be behaving the same way now as they did before?
It’s tempting, oh my god so tempting, to just stick with the new answer. Until some voice inside wonders, ‘But what if the previous answer was right, and it’s this new one that’s wrong?’ Or it could be even worse - maybe they’re both wrong. Or maybe you’re just a dummy, and you’ve misremembered the number from before.
Some version of this dilemma will strike everyone working in data science, probably more than once. When this problem first bit me in grad school, I had to become the world’s most boring archaeologist. I pored over my own version control changelogs, filesystem backups, figures and tables from old presentations, and entries from my lab journal. I tried different random seeds, re-ran the analysis on multiple machines, and painstakingly reverted to older versions of the analysis library. I went through everything I could think of that might have changed, trying to understand what was different.
This kind of debugging is painful and slow, and you may never get to the bottom of things.
After that experience, I swore I’d put in extra effort up front wherever I could, to spare myself that laborious detective work of reproducing old results. Of course, I’ve been bitten by reproducibility bugs since then anyway… but hopefully less often.
So this is about reproducibility - can you re-run the same analysis 6 months later and get the same answer? Or for a stricter test - could someone else run your analysis long after you’ve gone and get the same answer?
Let’s start with some basics.
Have a front door
Your analysis should have a front door that you always enter through. If it’s a machine learning modelling problem, then create a pipeline that you always call in the same way.
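For example, that front door could be a single entry-point script with a small command-line interface, so every run is invoked the same way. A minimal sketch, assuming a hypothetical run_pipeline.py and a pipeline that reports one accuracy metric:

```python
"""run_pipeline.py: the single front door for this analysis."""
import argparse

def run_pipeline(dataset: str, seed: int) -> float:
    # Placeholder for the real work: load the named dataset,
    # fit and evaluate the model, return the headline metric.
    return 0.0

def main() -> None:
    parser = argparse.ArgumentParser(description="Run the analysis pipeline.")
    parser.add_argument("--dataset", required=True,
                        help="Name of the (immutable) dataset to run on")
    parser.add_argument("--seed", type=int, default=42,
                        help="Random seed, pinned for reproducibility")
    args = parser.parse_args()
    accuracy = run_pipeline(args.dataset, args.seed)
    # Print in a stable, machine-readable format so runs are comparable.
    print(f"accuracy={accuracy:.3f}")

if __name__ == "__main__":
    main()
```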
Document the important things
Document things, at least somewhat. How do you call your pipeline? Are there any gotchas? In six months’ time, your memory for all these arcane details will have faded, so do your future self a favour and provide some hints.
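Even a module docstring at the top of the entry point helps. A sketch, with hypothetical usage and gotchas:

```python
"""Analysis pipeline - always run it through this script.

Usage:
    python run_pipeline.py --dataset sales_2019_v3 --seed 42

Gotchas (examples of the kind of thing worth writing down):
    - The feature join needs ~8 GB of RAM.
    - The dataset name must exactly match a file under data/.
"""
```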
Use version control
Use version control. If you’re new to it, maybe start with GitHub Desktop rather than the command line. Use version control even if you’re working on your own. Commit often, and use descriptive messages. That way, if you do have trouble reproducing your results from months ago, you can try checking out a commit from that time.
- If the code from the past reproduces the results from the past, you’re in luck. You’ve found the smoking gun, and there’s a straightforward debugging method: do a binary search through your commits between then and now to find the moment when the results changed. git bisect automates exactly this - see the sketch after this list.
- If the code from the past doesn’t reproduce the results from the past, then it’s time to bring out the bigger guns.
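Concretely, git bisect does the binary search for you: mark the current commit bad (`git bisect bad`), mark the old one good (`git bisect good <commit>`), then let `git bisect run` execute a check script at each step, treating exit code 0 as ‘old result reproduced’ and 1 as ‘result changed’. A sketch of such a script, with hypothetical file names and the 80% figure from earlier as the expected value:

```python
"""check_result.py: exit 0 if the analysis still gives the old answer.

Run it with: git bisect run python check_result.py
"""
import subprocess
import sys

EXPECTED = 0.80    # the number you remember getting last month
TOLERANCE = 0.005  # allow a little harmless floating-point jitter

# Re-run the analysis through its front door, exactly as always.
result = subprocess.run(
    [sys.executable, "run_pipeline.py",
     "--dataset", "sales_2019_v3", "--seed", "42"],
    capture_output=True, text=True, check=True,
)

# The pipeline prints a final line like "accuracy=0.800"; parse it back out.
last_line = result.stdout.strip().splitlines()[-1]
accuracy = float(last_line.split("=")[1])

# Exit 0 (good commit) if the old result reproduces, 1 (bad commit) if not.
sys.exit(0 if abs(accuracy - EXPECTED) <= TOLERANCE else 1)
```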
Make your datasets immutable
You also need to make sure your datasets haven’t changed underfoot. There are lots of ways to achieve this. You could stuff your data into version control too, but this quickly becomes infeasible for large or changing datasets. You could store your data in an S3 bucket with versioning and logging enabled.
Better still, make your datasets immutable. That is, you don’t ever change a dataset - instead, make a copy with a new name, and modify that. Each dataset should have a unique and permanent name, kind of like a ‘primary key’. That way, each analysis can refer unambiguously to the dataset it was run on. For very large datasets, you might have to find a more efficient way to store things while still treating the data as immutable.
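One lightweight way to enforce this, as a sketch (the helper names and file layout are hypothetical): record a content hash alongside each dataset’s name, and verify it before every run, so any accidental modification fails loudly instead of silently changing your results.

```python
"""Pin each dataset to a content hash so it can't change underfoot."""
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    # Hash in chunks so large datasets needn't fit in memory.
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def load_immutable(name: str, expected_sha256: str,
                   data_dir: Path = Path("data")) -> Path:
    """Return the path to dataset `name`, refusing to run if it has changed."""
    path = data_dir / name
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise RuntimeError(
            f"{name} has changed (expected {expected_sha256[:12]}..., got "
            f"{actual[:12]}...). Don't edit it - copy it to a new name instead."
        )
    return path

# Example: each analysis pins the exact (name, hash) pair it was run on.
# load_immutable("sales_2019_v3.csv", "9f86d081884c7d65...")
```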
Store your results
Store the results from running your analysis somewhere safe and searchable. Don’t do what I did and just stuff them into a shoebox full of receipts, fortune cookie messages, Nietzsche quotes, and phone numbers from girls in gas masks that you met in an underground Berlin techno bar. A spreadsheet is a good start.
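Better still, have the pipeline append a row to the results log itself, capturing not just the number but everything needed to reproduce it: when it ran, which commit, which dataset (and hash), and which seed. A sketch, with a hypothetical results.csv layout:

```python
"""Append one row per run to results.csv, so every number is traceable."""
import csv
import datetime
import subprocess
from pathlib import Path

def log_result(accuracy: float, dataset: str, dataset_sha256: str, seed: int,
               log_path: Path = Path("results.csv")) -> None:
    # Record the exact commit the code was at when this number was produced.
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    write_header = not log_path.exists()
    with log_path.open("a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["timestamp", "commit", "dataset",
                             "dataset_sha256", "seed", "accuracy"])
        writer.writerow([
            datetime.datetime.now().isoformat(timespec="seconds"),
            commit, dataset, dataset_sha256, seed, accuracy,
        ])
```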
Keep a lab journal
Keep a lab journal, a kind of diary that you scribble in as you proceed. Describe your thinking process, and make notes of your progress.
To sum up
So: your code is versioned. Your datasets are versioned and immutable. You’re storing your results. And you’re keeping a lab journal. That will get you a long way towards being able to reproduce your results.