I thought the recent story of the Venter PNAS publication (Lippert et al., 2017), the extremely quick criticism on bioRxiv (Erlich, 2017), and the piece by Nature that summarizes the criticisms (for those who want the quick-and-dirty) would be of general interest to my Twitter followers.
One of the things I found most interesting was how quickly the rebuttal was published (one day!?). I probably do not need to remind anyone that writing and publishing an article takes a lot of time, so what was the reason for this speed? Well, it turns out that Lippert et al. first submitted to Science, and Dr. Yaniv Erlich (Associate Professor at Columbia University) was actually a reviewer; he told the editor the paper was “Arsenic-life” weak, which explains how he got the jump-start.
Two other things I thought were surprising: first, Venter chose all three reviewers at PNAS, which stinks of unethical competition. Second, the code was not made available upon publication, meaning Dr. Erlich had to engage in some forensic bioinformatics (although he probably built his model after reviewing the paper for Science). What I found funny was that Venter employed “close to ~30 authors and [used] a range of complex algorithms”, while Dr. Erlich got better results with his own model “after one hour of work” using just sex + age + ancestry (that’s right, no genetic features, which were the main claim of the paper; quote excerpts from the bioRxiv paper).
Take home thoughts:
Something that matters to many bioinformaticians is reproducible science (e.g. data availability, source code, software versions, command-line options, etc.). When I look at a paper, I want to be able to do exactly what you did (and preferably do so using raw data deposited alongside your publication, rather than digging around in databases and piecing together what you did after the fact). I think most of us can now agree that sharing data is a must (see the backlash against NEJM’s criticism of “research parasites”), yet data hoarding still happens (e.g. newly published genomes are often embargoed for a year so the lab that produced the data has more time to mine it). FWIW, I probably won’t think highly of your paper if you don’t include code and data (I’ll think you have something to hide); besides, sharing data also gets you more citations. So please use a GitHub repo (with a DOI in the manuscript) and a data repository (e.g. figshare) at the very least; if your pipeline uses many different software packages, combine it with a Docker container (to run all the software).
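Beyond depositing code and data, it helps to log the exact command lines and environment details as the pipeline runs. Here is a minimal sketch of that idea (my own illustration, not from any of the papers discussed; the file names and the toy `sort` step are placeholders for real pipeline steps):

```shell
#!/usr/bin/env bash
# Sketch: route every pipeline step through a wrapper that logs the
# exact command line, plus basic run metadata, to a file you can deposit
# alongside your data.
set -euo pipefail

LOG=analysis_log.txt

# Record when and with what shell the analysis ran.
{
  echo "run date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  echo "bash version: ${BASH_VERSION:-unknown}"
} > "$LOG"

# Wrapper: append the exact command line to the log, then execute it.
run() {
  echo "CMD: $*" >> "$LOG"
  "$@"
}

# Toy example step: deduplicate a sample list.
printf 'sampleB\nsampleA\nsampleB\n' > input_samples.txt
run sort -u input_samples.txt -o samples_dedup.txt
```

Pair a log like this with pinned software versions (a Dockerfile or conda environment file) and a reader can rerun your analysis rather than reverse-engineer it.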
This also got me thinking about the importance of being able to sniff out bad science. Just because something has been published in a journal with a high impact factor (“it’s in Nature/Cell/Science/PNAS/NEJM, it must be brilliant”), or comes from Dr. Famous’s lab, does not mean it’s “good” science. Let us all try to be skeptical about everything we read (and believe).