Originally published back in 2020, now updated with extra anger.
While mentoring some juniors, I started to think about the rules of thumb for analysing data that I've built up over the years. I'm certainly not the world's greatest data scientist (or bioinformatician, epi-informatician, statistician, AI engineer, or anything else), but I do have a lot of hard-earned, battle-worn experience, and those rules are worth capturing here. They are obviously shaped by my experience in analysing healthcare and biomedical data. In no particular order:
Analysis should be code
Non-trivial1 data analysis should be done with code, proper code, in a script—not Excel, JMP, a plotting package, or macros. Why?
It's reproducible
It's documentable
It's versionable
It can be worked on by multiple people
It's debuggable
It's easier to revise than to repeat wholesale
You can re-run it easily on other datasets or the same dataset
It might seem like the long (hard) way. But for non-trivial tasks, in the long run, it’s actually faster.
I remember with horror a colleague who insisted on splitting and filtering a 20K-strong patient cohort on a huge list of complex criteria, by using a single massive Excel spreadsheet. It involved a stupendous number of sub-sheets, macros and scripting, linked in a thick mesh of cross-references. It took months to put together. Did it work? There was no way of knowing - it just didn't obviously not work.
That’s no way to get good results. Which leads me to …
Not obviously wrong does not necessarily mean right
There’s an old programming quip along the lines of “If it compiles, ship it”. But a lot of data analysis code implicitly works like that - if there’s not an obvious problem or error, we quietly assume that it works.
Reality isn’t so kind.
Treat your analysis like proper code, like a program that has to run robustly and independently from you, that lives in an unreliable and variable world. Sprinkle it with asserts, validations, sanity checks, preconditions & postconditions (e.g. how many records should we be processing at this stage, what range should the values be in, are missing values possible or allowed, are the IDs well-formatted and clean) - there's a sketch of what this looks like after the list below. Trust but verify but with less trust2. Why?
For many problems, our ability to spot a bad answer is severely limited.
You might accidentally load the wrong dataset - the one that hasn't been cleaned or is from a previous analysis - and silently get the wrong answers.
Your code might work flawlessly with good data, but with bad data, it might silently hand you the wrong answer. This is true of all too many R and bioinformatics scripts. Try this: take a gene expression dataframe, transpose the axes (swap the gene names with the record IDs), feed it through an analytic pipeline and see whether it fails - or whether it quietly hands you an answer anyway.
It's better to fail fast, to stop immediately upon an error occurring rather than having to check the end results and repeat the whole analysis. Fail early and fail often.
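As a flavour of what this looks like in practice, here's a minimal sketch in Python/pandas. The file name, column names and thresholds are all hypothetical; the shape of it is the point.

```python
import pandas as pd

# Hypothetical cohort extract: file name, columns and thresholds are illustrative.
df = pd.read_csv("cohort_cleaned.csv", dtype={"patient_id": str})

# Precondition: we expect roughly the whole 20K cohort, not a stray subset.
assert 18_000 <= len(df) <= 22_000, f"unexpected record count: {len(df)}"

# IDs should be well-formed and unique.
assert df["patient_id"].str.fullmatch(r"PAT-\d{6}").all(), "malformed patient IDs"
assert df["patient_id"].is_unique, "duplicate patient IDs"

# Values should be plausible, and missing ages are not allowed in this analysis.
assert df["age"].notna().all(), "missing ages"
assert df["age"].between(0, 120).all(), "ages outside plausible range"

# Postcondition: fail fast if a filter quietly wiped out the cohort.
eligible = df[df["age"] >= 18]
assert len(eligible) > 0, "no eligible patients - wrong input file?"
```

If any of those assumptions breaks, the script dies loudly at the point of failure instead of quietly producing a plausible-looking table.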
Don't invent your own data formats
This is an old adage in bioinformatics that applies elsewhere. Why?
You're probably not as good at designing formats as you think
How will anyone else know how to parse your data? Will they have to write a custom parser?
You'll probably never get around to formally documenting your format, just making things hard for other people (see previous point).
Is there a published or popular format that's 90% right for you? Maybe you should use that. Meta-formats - formats built on top of other formats, like storing data in JSON or YAML - are acceptable. At least then there are tools to read and write them, and the base format will stop a lot of bad design decisions.
As an example lesson, look at the long history of ambiguities, extensions and inconsistencies in the NEXUS phylogenetic format, or even something as simple as NEWICK trees. Another piece of university-created phylogenetics software had to revise and correct its own proprietary format so often that saved analyses were essentially locked to the version they were created with.
Store data in humane common ways
Store data in CSV. If you can't do that, store in JSON or YAML. If you can't do that, SQLite. Why?
It's human-readable & thus debuggable
(Except for SQLite) It's plain text, so in a pinch you can look at it or edit it with a text editor
We can deal with different text encodings easily
It can be read and written by a huge variety of software
These formats are well-defined, and their behaviour is well understood.
Obviously, there are datasets that don't fit easily into this scheme because they're too big, inherently binary, or just plain weird. But much of the data we use can be readily handled by the tools above. If you have to go big, think about using something common like HDF5.
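For illustration, a minimal Python sketch (the file and table names are made up): be explicit about encoding, don't let the reader silently guess types, and reach for SQLite only once plain text stops being practical.

```python
import sqlite3

import pandas as pd

# Illustrative file names. Be explicit about the encoding and don't let
# pandas silently guess the column types.
df = pd.read_csv("measurements.csv", encoding="utf-8", dtype=str)

# ... analysis ...

# Plain UTF-8 CSV back out, readable in any text editor.
df.to_csv("measurements_clean.csv", index=False, encoding="utf-8")

# Only once plain text stops being practical, drop down to SQLite.
with sqlite3.connect("measurements.sqlite") as conn:
    df.to_sql("measurements", conn, if_exists="replace", index=False)
```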
I dislike storing data in Excel because it tends to think it knows what I want better than I do and starts changing data3. But it’s so prevalent in industry (a huge number of biotechs store their lab data in Excel 😭) that it might be easier to go with the flow. You also get something that your stakeholders understand and can use, and there are some nice toolkits in Python for writing and manipulating Excel tables.
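If you do have to hand results over in Excel, something like this keeps the spreadsheet as an output of the code rather than the place where the analysis lives (the table is illustrative, and pandas needs openpyxl installed to write .xlsx):

```python
import pandas as pd

# Illustrative summary table; openpyxl must be installed for .xlsx output.
summary = pd.DataFrame(
    {"site": ["A", "B"], "patients": [812, 764], "median_age": [61, 58]}
)
summary.to_excel("cohort_summary.xlsx", sheet_name="summary", index=False)
```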
XML has its place, but as a data format for computers, not people. As a rule of thumb, most systems that insist on using XML upfront are deeply pathological.
Reports are generated straight from analysis
Summaries and results should be auto-generated from analysis rather than cut and pasted into Word or PowerPoint (a minimal sketch follows the list below). Why?
It’s faster, once it's set up, and faster to repeat
It’s reproducible
It prevents mistakes and omissions when you update
Copy-and-pasting into PowerPoint tends to encourage a style with excess verbiage and showmanship, with discreetly massaged results being shamefully tacked on the end, and questions discouraged4
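A minimal sketch of what this can look like - the input file and column names are invented, but every number in the report is computed, never retyped:

```python
from datetime import date

import pandas as pd

# Hypothetical input and columns; the point is that every number in the
# report comes straight from the analysis.
df = pd.read_csv("trial_results.csv")
enrolled = len(df)
completed = int(df["completed"].sum())

report = f"""# Monthly trial report - {date.today():%B %Y}

- Patients enrolled to date: {enrolled}
- Patients completing follow-up: {completed} ({completed / enrolled:.1%})
"""

with open("report.md", "w", encoding="utf-8") as fh:
    fh.write(report)
```

Re-running the analysis regenerates the report; there's no opportunity to paste in last month's numbers.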
This admittedly can take a while to set up the first time, but in the long term, it pays off. I once worked on a detailed monthly report of clinical trials that I tried to help with by automating it. Unfortunately, every trial and every page needed something different - some additional text, modifications, commentary, adjustments - so that monthly report still took 2 weeks to prepare 🙄
No closed shop, proprietary tools
Why?
You'll want to study them, see how they work, and check that they do their work correctly
You'll want to share your work, and you can't do that with people who don't have a license for the tool
It avoids lock-in. The two critical questions for any analysis platform or tool are: What can't it do? When that happens, can I get my data out to use it somewhere else?
Make sure that debug and dev modifications are blindingly obvious
Make any debug code blindingly obvious, strongly labelled, easy to find and easy to switch off in a single step. Why? You will forget to switch off or delete the debug code and so will accidentally ship analyses that only look at the first 100 records, use dummy datasets, short-cut algorithms or make unsuitable assumptions.
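A minimal sketch of the idea (the names are invented): one switch, set in one place, that announces itself loudly.

```python
import warnings

DEBUG = True  # <-- MUST be False for production runs

if DEBUG:
    # Loud, greppable, and switched off in exactly one place.
    warnings.warn("*** DEBUG MODE: dummy database, first 100 records only ***")
    DB_PATH = "debug_copy.sqlite"
    RECORD_LIMIT = 100
else:
    DB_PATH = "production.sqlite"
    RECORD_LIMIT = None
```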
We once had a web-tool that ran observed epi data against a historical database. Every once in a while, it would crash for no obvious reason. The development version ran flawlessly. One day, I got frustrated and sat down to puzzle it out. I sprinkled diagnostic print statements through the code of the production version, repeatedly entering data into the tool until it crashed, narrowing down the problem. Eventually, I found the culprit: a single line of code that referred to the debug version of the database, clearly placed there to work out some bug. In development, this worked fine because the only database the code saw was the debug one. In production, most of the data was being pulled from the production db, of course, except in this one place. And the two database versions were similar but not identical. This web-tool had been running for a year, delivering potentially incorrect results.
If there's more than one substantial processing or analysis step, and this analysis will be run more than once, use a pipeline or workflow
Why?
It'll save you time in the long term
It documents your work
It makes your work reproducible
It makes your work easier to revise
The actual choice of workflow software is a Religious Issue, but I'm partial to Snakemake, and many people like Nextflow and Drake. Just use something.
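For what it's worth, a minimal Snakemake sketch looks something like this (the scripts and file names are illustrative): each rule declares its inputs and outputs, and the tool works out what needs re-running and in what order.

```snakemake
# Illustrative Snakefile: two steps from raw cohort to summary.
rule all:
    input:
        "results/summary.csv"

rule clean:
    input:
        "data/raw_cohort.csv"
    output:
        "data/cohort_clean.csv"
    shell:
        "python scripts/clean.py {input} {output}"

rule summarise:
    input:
        "data/cohort_clean.csv"
    output:
        "results/summary.csv"
    shell:
        "python scripts/summarise.py {input} {output}"
```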
Perhaps you can't prove this is true, but is it useful?
Often, you can't treat the result of an analysis as being true in a rigorous, statistical sense. Why?
Data is biased, or cherry-picked
A la Brian Wansink5, you've tried so many different approaches and slices of the data that one of them was bound to show something
The algorithms you're using have only the veneer of a statistical basis, and/or obscure assumptions that may or may not be satisfied. Biomedical data, in particular, is heterogeneous, biased, and incomplete. Who knows if it meets the necessary assumptions?
This is not to say that statistics is unimportant, but it is sometimes difficult and can be unreliable or easily misused. You can easily lie with statistics, and maybe it’s impossible to place a statistical measure on your results. So, if your results aren't necessarily true, what can you do? You can ask, "This is better than what? And how can I show that?" Look for ways of independently validating or reproducing your findings. Or, if your answers aren't guaranteed to be true, do they show that something is not true? Look for ways for answers to be useful, to be advisory on other approaches.
Another way to say this is to quote Bill Clinton’s favourite response to any statement: Compared to what? This lead binds the target well. Compared to what? This trial design is fast and efficient. Compared to what? This accurately describes the stratification of the patient population. Compared to what?
(Alice Wong rightly noted this was a bit of a drive-by statement: stats, p-hacking, multiplicity, etc. are vast subjects. But I can't do justice to them in this space, other than to underline their complexity. Sometimes, we have to work out how to move forward when the maths can’t be applied.)
Validate
The result of any model is a hypothesis. It’s something that is cooked up from a particular set of data, that relies on a set of assumptions, that may involve confounders or inconvenient correlations, that may contain mistakes. It’s not right, it’s a pointer at what might be right. Find a way to externally validate it.
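One small illustration of this (hypothetical cohorts and columns, scikit-learn assumed): fit on one cohort, judge on another the model has never seen.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical cohorts and columns; scikit-learn is assumed to be available.
dev = pd.read_csv("cohort_site_a.csv")
external = pd.read_csv("cohort_site_b.csv")

features = ["age", "biomarker"]
model = LogisticRegression().fit(dev[features], dev["outcome"])

# The honest number is the one from data the model has never seen.
auc = roc_auc_score(
    external["outcome"], model.predict_proba(external[features])[:, 1]
)
print(f"External validation AUC: {auc:.2f}")
```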
Maybe you should use less SQL
Author’s note: this hails from the original version of this article, at a point when I was encountering a lot of solutions that consisted of massive and impenetrable SQL one-liners. The times have changed, and now, a lot of analysts don’t even SQL. Which is a loss. I’m leaving this here because the basic advice is solid. Nowadays, I would instead say that a little SQL can be a powerful thing.
Why?
Complex subsetting and massaging of data can be difficult to express in SQL, because it's depauperate as a programming language, and often those operations are cumbersome and verbose.
Perhaps because of this, there's a tradition of enormous multi-line SQL statements to get work done. It is difficult to understand what these do or debug them. You end up taking the code on faith.
SQL puts the internal representation of domain objects right in your face, when you should be thinking about their external representation - what the objects represent.
Asserts, checks and defensive programming? What's that?
And there are many dialects of SQL, so forget about porting your analysis.
If you're pulling data live from a database, how do you version that?
In summary, excessive use of SQL moves towards the code being a black box.
Obviously (gestures at 50 years of use), SQL is a tremendously useful tool. But it's not a great tool for communicating intent or writing good, complex code. Fortunately, you can just use a small amount of SQL to do a broad extract of the data and then do the heavy lifting in a real programming language. Or you can use an abstraction layer (like SQLAlchemy) so that you're manipulating objects, not rows scattered across tables.
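A sketch of that division of labour (the warehouse, table and column names are invented): keep the SQL broad and dumb, and do the fiddly subsetting in code that can be asserted, tested and versioned.

```python
import sqlite3

import pandas as pd

# Illustrative warehouse and column names. The SQL does a broad, simple extract;
# the fiddly logic lives in pandas where it can be checked and versioned.
conn = sqlite3.connect("warehouse.sqlite")
patients = pd.read_sql("SELECT * FROM patients WHERE enrolled_year >= 2018", conn)

adults_on_drug_a = patients[
    (patients["age"] >= 18) & (patients["treatment"] == "drug_a")
]
```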
What’s “non-trivial”? To my mind, anything that isn’t instantly obvious, that matters, and that has to be shared with or handed over to other people.
An ex-colleague once described my programming style as “paranoid”. Guilty as charged. It’s not paranoia if the world is out to get you.
See https://stackoverflow.com/questions/155097/microsoft-excel-mangles-diacritics-in-csv-files, http://dataabinitio.com/?p=798, https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7 …
Look up “the Cognitive Style of Powerpoint”.
https://www.linkedin.com/pulse/brian-wansink-dark-side-p-hacking-sandeep-bhasin-ph-d--ud6dc/ . My suspicion is that a lot of the scientific narrative is unconsciously driven by weird little selections and biases, such that all and any results are potentially fragile to false discovery.