A warning against "research parasites" may open the door to better data sharing.
It all started with an editorial published by the editors of The New England Journal of Medicine (NEJM), innocuously titled, "Data Sharing."1
The authors, Dan L. Longo, MD, and Jeffrey M. Drazen, MD, praised the idea of data sharing in theory, but contended that one of the unintended consequences of sharing research data may be the emergence of a new class of researchers known as "research parasites," described as "people who had nothing to do with the design and execution of the study but use another group's data for their own ends, possibly stealing from the research productivity planned by the data-gatherers, or even use the data to try to disprove what the original investigators had posited."
Before long, a heated debate broke out in the research community about the merits of open data sharing, with proponents of data sharing arguing that science moves forward through collaboration, not through exclusion. Data sharing, they maintained, is the necessary evolution of medical research. Using hashtags like #researchparasite, #IAmaResearchParasite, and #opendata, they expressed their support for data sharing and for breaking down the silo mentality in scientific research. On the other side, researchers voiced concern about the unfair use of data without attribution.
The vastly opposing viewpoints on the issue suggest that data sharing might not be as simple in reality as it is in theory. The recent debate brought to light the divide that exists between the clinical investigators who collect the data and the scientists who later want access to those data – the hosts and the parasites, if you will. ASH Clinical News spoke with members of the research community about the ethics behind data sharing, the potential drawbacks of the open-data movement, and the questions that still need to be answered.
The Case for Data Sharing
"One of the big reasons why a bunch of us have been advocating for data sharing is we think that not sharing data is actually a violation of the promise we made to patients," Vinay Prasad, MD, MPH, a hematologist-oncologist and an assistant professor of medicine at the Oregon Health & Science University, said.
Much of the initial conversation about data sharing has focused on clinical trial data for several reasons, according to Dr. Prasad. Clinical trial data are ideal for data sharing because of the implicit consent process already built into trial designs. Patients, in most cases, are already told their data will be used to help future patients, with the implication that the data may be used beyond the scope of one study.
The volume of data collected during a clinical trial can also make it difficult for a single research team to have the time and resources to fully use all of the data that have been gathered, making collaboration among teams a necessity.
"I'll be perfectly honest, I've done a lot of research over the years and, in my mind, I always plan to go back to that dataset one day and look for other things," Dr. Prasad said. "I've learned over time that the reality is that life sets in, you get moved to other projects, and you just don't get to do those things."
Aside from these practical factors, proponents also argue that data sharing can accelerate and improve scientific discovery.
"For example, competing experts may apply an improved statistical analysis that finds a hidden discovery that the original data-generators missed," Rafael Irizarry, PhD, professor of biostatistics at the Dana-Farber Cancer Institute at Harvard University, told ASH Clinical News. "Furthermore, examination of data by many experts can help correct errors missed by the analyst of the original project."
Michael Hoffman, PhD, a scientist at the Princess Margaret Cancer Centre and an assistant professor at the University of Toronto, believes the "most offensive" aspect of the NEJM editorial is that the authors chastise potential "research parasites" for potentially disproving original investigators' work. He contends that a bedrock of science itself is the idea that any conclusions determined by a scientist will be under the scrutiny of other scientists in the field.
"That's what science is all about: It's about putting forward ideas and testing them with various datasets," Dr. Hoffman said. "That editors of NEJM do not seem to understand this is incredibly concerning."
So, proponents say, giving outside scientists access to research data is a matter of advancing science. At a minimum, researchers who conduct government-funded trials should be required to share any data they gather from the process with the public, as the trials are funded by taxpayers.
"They are the actual owners," Dr. Irizarry says, "so there is an argument to be made that the public's data are being held hostage."
… And the Case Against
While many scientists and publications advocate for open data sharing, the path to complete data enlightenment is not without obstacles. And, if pro–data sharing advocates get their wish, should they be worried about research parasites feeding off of all that open data?
One such caveat Drs. Longo and Drazen noted in their editorial is that "outside researchers" (those who were not part of the generation and collection of data) may not fully understand the choices the original investigators made when they defined the study's parameters. "Special problems arise if data are to be combined from independent studies and considered comparable," they wrote.
Part of the concern, some have said, is the inability of those who generate the data to control who uses the data – and to what end. The editorial also questioned whether, in an open-data society, the data-generators would receive the proper attribution they deserved, and whether outsiders could "scoop" the generators on their own planned research projects.
However, none of the sources ASH Clinical News spoke with had heard of any incidents of data theft, and Dr. Prasad said a PubMed literature search he conducted for the term "data parasite" yielded no results. That's not to say that they don't exist. By the definition provided by Drs. Longo and Drazen, anyone who has ever used and built upon pre-existing research could be considered a "data parasite."
Who falls under that category? "Pretty much anyone who has ever conducted a meta-analysis or systematic review, anyone who has written a review article, anyone who has ever used data stored in a consortium, or any fellow that uses data from their attending," Dr. Prasad said. By these definitions, "I am a data parasite."
To prevent being "scooped," Dr. Hoffman added, clinical investigators who believe there is more information they want to analyze in their data should refrain from publishing a study until they have had time to analyze the data fully. Otherwise, he said, researchers are essentially "having their cake and eating it too" by wanting to conduct publicly funded research while also monopolizing that area of research for themselves.
David Shaywitz, MD, PhD, visiting scientist at the Department of Biomedical Informatics at Harvard Medical School, said part of the larger issue is the divide between clinical investigators and data scientists – and the differing motivations for each.
"In the course of the vitriol around this, all of the positions got presented in an extreme way," he said. While most believe in the idea of data sharing in theory, it gets muddy when it comes to the details.
Clinical investigators spend years and a significant amount of their time designing studies, recruiting patients, implementing the trial, and understanding the underlying disease. Part of their motivation, he said, is to use the resulting data to make discoveries they have devoted their careers to. "They want to be the first to see what the results are," he explained. "They are collecting the data because they want to participate in the analysis."
On the other hand, data scientists are dependent upon other people to provide the data, so advocating for open data sharing does, in turn, promote their own self-interests.
The challenge will be in motivating clinical investigators to devote the time and resources necessary to recruiting patients and collecting the data, even if they won't be the sole consumers of their data.
Although Dr. Prasad is a proponent of open data sharing, he said he has experience on both sides of the issue. Just a few months ago, he was asked to share some of his data and admitted that he "hemmed and hawed" before ultimately deciding to provide the data.
"Ideally, you want to see the people who spent the time writing the grant and doing the work to have the privilege of getting the first peek at these data. At the end of the day, though, the investigators are being compensated for that work," he said. "So, you have to weigh the interests of the doctors who do the work against the interests of the public. And I think the public has to win."
No Publication Without Public Data
Some fields have embraced open data sharing more fully than others. For instance, in genomics, most journals require all published microarray gene-expression data to be entered into one of several public repositories, including GEO and ArrayExpress.
"This has been an incredible success, leading to new discoveries, new databases that combine studies, and the development of widely used statistical methods and software built with these data," Dr. Irizarry said.
Medical journals have also updated their policies with data-sharing provisions. The non-profit open-access publisher PLOS, for instance, implemented a new data policy in 2014 that requires authors to include a data-availability statement in all research articles published in their family of PLOS journals, in an effort to support data sharing and "foster scientific progress."2
"PLOS journals have requested data be available since their inception, but we believe that providing more specific instructions for authors regarding appropriate data deposition options, and providing more information in the published article as to how to access data, is important for readers and users of the research we publish," the PLOS editorial team wrote in an announcement of its new policy.
Blood, the official journal of the American Society of Hematology (ASH), asks that authors make renewable materials, datasets, and protocols available to other investigators without unreasonable restrictions. "Blood adheres to the belief that authors should include in their publications the data, algorithms, or other information that are integral to the publication or make it freely and readily accessible," the policy states. "Authors should use public repositories for data whenever possible and make patented material available under a license for research use."3
Can Investigators and Data Analysts Exist Symbiotically?
As data sharing becomes more prevalent, differing views have emerged about what, if any, relationship needs to exist between those who collect the data and those who later use it.
The NEJM editors proposed that the best path forward is a symbiotic relationship between clinical investigators and data scientists.
Dr. Shaywitz agreed, adding that, ideally, both sides would work together from the beginning to form a partnership to take advantage of each party's expertise – all the way from study conception through to its execution. "There should be an elaborate and continued dialogue, because there are a lot of subtleties to data collection," he said.
Dr. Irizarry also concurred that symbiotic data sharing is the most effective approach to repurposing data, but said he doesn't believe it needs to be compulsory. "Competition is one of the key ingredients of the scientific enterprise. Having many groups competing almost always beats out a small group of collaborators," he said, adding that those who generate the data may not have the time to collaborate with everyone who is interested in the data.
In his own experience as a trained statistician, though, he has learned that it is hard to make a meaningful contribution through data analysis or method development without a clear understanding of the scientific problem, which the original investigators most certainly have.
"Most difficult scientific challenges have nuances that only the subject-matter expert can effectively describe," Dr. Irizarry explained. "Failing to understand these usually leads an analyst to chase false leads, interpret results incorrectly, or waste time solving a problem no one cares about." Successful collaboration, he believes, involves a constant back-and-forth between a data analyst and a subject-matter expert, though the subject-matter expert doesn't necessarily have to be the one generating the data.
Others, like Dr. Hoffman, think data sharing should occur automatically, regardless of any symbiotic relationship with the original clinical investigator. "It should just be the expectation, and in many fields it is the expectation that when people publish, they make the underlying data available," he said.
Incentivizing Data Sharing
The NEJM editorial set off a Twitter firestorm, with users proudly declaring #IAmaResearchParasite, but it also shed light on the larger debate about the concept of open data sharing and data exclusivity. Is there a happy medium in sight?
Dr. Shaywitz shared an idea from Bob Wachter, MD, a professor of medicine at University of California, San Francisco: Unless the scientific community moves away from the moral case for open data and embraces the business case instead, no true solution will be possible.
"Practically, you have to acknowledge the real-world incentive that people are operating under," Dr. Shaywitz said, adding that, to change behavior, people first have to understand why data-gatherers are hesitant to share their data.
One way forward is to institute data-sharing guidelines. The International Committee of Medical Journal Editors (ICMJE), for one, is seeking comments on its proposed set of guidelines that are designed to help foster the clinical trial data sharing that's now mandated by an increasing number of foundations, government agencies, and industries.4
Under the proposed guidelines, researchers must submit a plan for data sharing as part of their clinical trial registration and give "substantial credit" to those who generate and share clinical trial data. Clinical investigators also would have to share de-identified individual patient data related to the results of the submitted article within six months of publication.
Allowing for the data sharing to occur after a paper is published, though, means that journals will have limited power in enforcing that rule, according to Dr. Hoffman. "A lot of people will let that six months pass and do nothing." In the past, when he or other colleagues have tried to follow up to gain access to data, they have been met either with no response or with authors saying the data are not ready to be made publicly available.
Aside from mandating data be made available and be attributed correctly, guidelines could also include certain provisions that would incentivize data sharing, Dr. Irizarry said. For example, guidelines could encourage funding agencies and the scientific community to reward data-generators when their data are used by others – if the resulting research was as rigorously reviewed as the original analysis.
Discussing the ICMJE's proposal in a recent editorial, ASH Clinical News' Editor-in-Chief Mikkael A. Sekeres, MD, MS, and Brian J. Bolwell, MD, noted important caveats to the call to share clinical trial participants' data: "the threat to patient privacy and the ability to conduct cancer clinical trials."
"De-identifying data is actually quite hard," Drs. Sekeres and Bolwell, both from the Cleveland Clinic, Taussig Cancer Institute, explained, and some smaller centers might not have the resources to meet these mandates. "Introducing a requirement that clinical trial data be made publicly available requires infrastructure in the form of databases, computer servers, and personnel, which adds to the price tag for these studies. At a certain financial inflection point, studies may not be conducted, even when they ask important research questions for patients who desperately need new therapies."
Adding public disclosure of data to the informed-consent process also puts patients in an unfair position, they added, since it will require clinical trial participants to "opt in" to sharing their data. "Our patients would face what amounts to a Faustian bargain: agree to allow your data to be made public, or you cannot enroll on a clinical trial," Drs. Sekeres and Bolwell wrote. "This is not the sort of choice we want to force people to make when they are about to undergo treatment for their cancer."
"While we appreciate the scientific rationale for the proposal to make clinical trial data publicly available," they concluded, "in the end it cannot trump the rights of our cancer patients to maintain their privacy, have a full range of clinical trials available, and make treatment decisions free of conflict."5
Opening Up the Dialogue About Open Data
The authors of the NEJM editorial had a different opinion on how best to motivate investigators to gather data: co-authorship. Their concept of "how data sharing should work" includes "[reporting] the new findings with relevant co-authorship to acknowledge both the group that proposed the new idea and the investigative group that accrued the data that allowed it to be tested."1
Others feel that co-authorship isn't the answer. "It's essentially people expecting to do one thing and get double-credit for it. I think that's potentially unethical and, frankly, kind of ridiculous," Dr. Hoffman said.
Dr. Shaywitz proposed another possible solution: returning the power of the data to the patients themselves, allowing them to decide where the data should be used.
"One of the exciting aspects of the National Institutes of Health's Precision Medicine Initiative is its goal of empowering study participants with their own data," he said. "They are trying to create the functionality that will allow the participants to donate their data at the click of a button, essentially. People can 'opt in' at whatever level they want to and with whatever project they want to."
Despite the fervent reactions that the "research parasites" argument elicited in people, everyone we spoke with agreed that it raises important questions about data sharing – and all members of the research community will have to work together to find the answers.
By Jill Sederstrom
1. Longo DL, Drazen JM. Data Sharing. N Engl J Med. 2016;374:276-7.
2. Bloom T, Ganley E, Winker M, PLOS Data Group. "Data access for the open access literature: PLOS's data policy." Accessed April 5, 2016 from http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001797.
3. Blood. "Author Guide – Editorial Policies for Authors." Accessed April 5, 2016 from http://www.bloodjournal.org/page/authors/author-guide/Editorial-policies-for-authors?sso-checked=true#data_share.
4. Taichman DB, Backus J, Baethge C, et al. Sharing clinical trial data: a proposal from the International Committee of Medical Journal Editors. N Engl J Med. 2016;374:384-6.
5. Sekeres MA, Bolwell BJ. Will cancer patients be the next victims of the data privacy debate? FoxNews.com. Accessed April 19, 2016 from http://www.foxnews.com/opinion/2016/02/27/are-cancer-patients-next-victim-data-privacy-blurred-lines.html.