The structural biology community has a long history of data sharing, facilitated in large part by the Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB). Established in 1971, PDB has served as the primary repository for structural data for over 50 years and, as of June 2024, houses over 221,000 structures. Here, structural biology researchers within the CIVICs Program describe how these data sharing practices have advanced their own research and share their perspectives on the importance and future of structural biology data sharing.
Ian Wilson, Ph.D., is the Chair of the Department of Integrative Structural and Computational Biology at The Scripps Research Institute and an investigator within the Sinai-Emory Multi-Institutional CIVIC (SEM CIVIC). Dr. Wilson’s laboratory focuses on structural studies of the immune system, with special emphasis on understanding the interaction between host immune factors and viral pathogens. Dr. Wilson recalls the evolution of data sharing within the structural biology community. When PDB was first launched, depositing structures was optional. The requirement to deposit structures evolved over many years, through continuous discussions about whether and when structures should be shared. According to Dr. Wilson, it took time for the community to embrace these sharing obligations, but journals making data deposition a prerequisite for publication finally accelerated these efforts. More important, from Dr. Wilson’s perspective, was the role of PDB in creating and enforcing data standards for structure deposition: “It’s not just a matter of depositing data, but depositing structured, searchable datasets.”
Jarrod Mousa, Ph.D., of Florida State University and the Center for Influenza Vaccine Research for High-Risk Populations (CIVR-HRP), credits these accepted data sharing practices with the rapid advancement of the field. “It’s made the field move a lot faster […] You can find either very similar structures or things that have never been done. If you’re trying to build a new antibody, you can find an antibody structure that’s usually pretty similar and you just have to model.” For early career researchers, such as John Dzimianski, Ph.D., a postdoctoral fellow in Rebecca DuBois’s laboratory at the University of California, Santa Cruz and CIVR-HRP, access to these data has been vital to developing the rigorous technical skills necessary to succeed in structural biology research. “The community can benefit from the knowledge, but also reapply new methods to old data […] If you wanted to train somebody on a really nice dataset, there’s access to really clean data.” Dr. Dzimianski also believes data sharing promotes accountability and transparency among structural biologists. “There’s also the intrinsic benefit of a greater degree of accountability […] So there’s a practical usefulness and then the inherent incentive for things to be done at a high quality and double checked by others to make sure they’re meeting a reasonable standard.”
As depositing structures in PDB or other repositories, such as the Electron Microscopy Data Bank (EMDB) and the Electron Microscopy Public Image Archive (EMPIAR), has become commonplace, the field has moved to consider what other types of structural biology data should be regularly shared. For the CIVICs investigators interviewed for this article, the next logical step in data sharing is deposition of the raw, original data used to solve the structures, something that is not currently common practice. According to Rebecca DuBois, Ph.D., the value of these raw data varies by methodology: “X-ray crystallography is decades old now, so the software to process that data is pretty established. And so, it’s unlikely that you’re going to get a lot more information out of that raw data. But on the other hand, cryogenic electronic microscopy [cryo-EM], that data could potentially have more information if someone analyzed them in a different way or as the software evolves.” Though the value of sharing raw data is apparent, a major hurdle to storing raw cryo-EM data is file size. A single cryo-EM run generates a few terabytes of data. For reference, in 2023, PDB reported that storing 214,121 structures used a total of 1,242 gigabytes (1.242 terabytes), so the raw data from one cryo-EM run can exceed the size of the entire structure archive. Dr. Wilson believes that concerted financial and labor investments are necessary for raw data storage to be successful, particularly when depositing cryo-EM data. “A challenge is that there are so much data it can be overwhelming. Where do you store them and who manages it? For data to be useful to the community, there has to be somebody on the backend, full time and devoted to curating the data for future use, and this will require adequate funding and resources to move forward.”
Beyond implementing infrastructure and standards to support raw data deposition, what’s next for the structural biology field and what role will data sharing play in this future? Goran Bajic, Ph.D., of the Icahn School of Medicine at Mount Sinai and SEM CIVIC, believes there are already hints of what’s to come. “It’s integrative biology, integrating structural biology with cell biology, meaning you have the atomic resolution information, but on a whole cell or whole tissue level […] This is where I think data sharing practices will help, because we’ll be able to develop new methods and make everything more streamlined. The question is, how to standardize the data and make them freely available to everyone?” For this vision of integrative biology to come to fruition, other biology fields will need to adopt data standards that enable structured, searchable data access.
The structural biology community serves as an example of the merits of data sharing. As other fields develop their own data sharing policies and standards, it’s worth using the structural biology field as a blueprint. For the structural biology community, the legacy of data sharing over decades made these practices foundational to the field and left no doubt about their value or essential role in advancing structural biology discoveries.