A comprehensive data management strategy can be hard to implement, but it will save everyone involved in the research a lot of time over the course of the project. Check out the data management and storage platforms available at the School of Medicine, and then dive into the guidance on good data management practices in the sections below. Overwhelmed? Becker Library is here to help. Submit a consultation request for assistance working through the data management considerations for your research. Data Interviews can also help investigators develop a workflow for good research data management and sharing (DMS) practices and write a DMS plan, which is mandated by many funders, including the NIH.


Poor Data Management Examples

Does this data management approach sound familiar? 

“I will store all data on at least one, and possibly up to 50, hard drives in my lab. The directory structure will be custom, not self-explanatory, and in no way documented or described. Students working with the data will be encouraged to make their own copies and modify them as they please, in order to ensure that no one can ever figure out what the actual real raw data is. Backups will rarely, if ever, be done.” (Brown 2010)

How about this video? 

Data Management and Sharing in 3 Short Acts: NYU Health Sciences Library (CC-BY license)

Brown’s comical data management practices and the video summarize typical problems with storing data without data management in mind. Data lacking organization, quality controls, and security can quickly become an incomprehensible and unusable data dumpster. The best time to formulate a data management plan is before you start collecting the research data. Having a plan in place before the study begins ensures the data will be organized, well documented, and stored securely during the project and long after it is completed. Waiting until the end of a project often results in lost data, missing documentation, or sometimes a lack of the permissions needed to analyze and publish a particular dataset.


Data Management and Storage Platforms

The following data management and storage platforms are offered by Washington University in St. Louis to the School of Medicine research community.

REDCap

REDCap (Research Electronic Data Capture) is a secure, web-based application for building and managing online databases and surveys. REDCap was developed at Vanderbilt University and is made freely available to other institutions; the Washington University instance is managed by the Institute for Informatics, Data Science and Biostatistics.

LabArchives (WashU ELN)

LabArchives is a cloud-based electronic lab notebook (ELN) that makes organizing, storing, and sharing lab data fast, simple and accessible on all digital platforms. The professional (research) and classroom editions of LabArchives, as well as the Scheduler and Inventory modules, are available to everyone in the WashU community at no cost, courtesy of the Vice Chancellor for Research and the Chief Information Officer. Check out the LabArchives ELN information page and FAQs for additional details. The LabArchives HELP page is also available for learning about all LabArchives products and how to use them.

If you have any technical or access issues with WashU LabArchives, please contact the LabArchives support team via email (support@labarchives.com).

Additional WashU data storage platforms


Data Security Policies

At the beginning of your research project, it is important to review all applicable policies regarding data security from Washington University in St. Louis. Below is the list of WashU data security policies currently in place.


FAIR Principles, Metadata and Data Curation

NIH encourages data management and sharing practices consistent with the FAIR (Findable, Accessible, Interoperable, and Reusable) data principles. The FAIR principles emphasize machine actionability (i.e., the capacity of computational systems to find, access, interoperate with, and reuse data with no or minimal human intervention) because the growing volume, complexity, and creation speed of data mean that humans increasingly rely on computational support to work with them.

To learn more, visit the GO FAIR initiative.

A key element of making data findable is metadata. Metadata and other documentation (e.g., data dictionaries, README files, study protocols) associated with a dataset allow users to understand how the data were collected and how to interpret them.
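
As a minimal sketch of what a data dictionary can look like, assuming hypothetical variable names and fields, the Python snippet below writes one to a CSV file that can travel alongside the dataset:

```python
import csv

# Illustrative data dictionary: each row documents one variable in the
# dataset. Variable names, units, and allowed values are hypothetical.
data_dictionary = [
    {"variable": "participant_id", "type": "string", "units": "",
     "allowed_values": "P001-P999",
     "description": "De-identified participant code"},
    {"variable": "sbp_mmhg", "type": "integer", "units": "mmHg",
     "allowed_values": "60-250",
     "description": "Systolic blood pressure at baseline visit"},
    {"variable": "smoking_status", "type": "categorical", "units": "",
     "allowed_values": "never; former; current; unknown",
     "description": "Self-reported smoking status"},
]

with open("data_dictionary.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=data_dictionary[0].keys())
    writer.writeheader()
    writer.writerows(data_dictionary)
```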

Data curation is the processing of data to ensure long-term accessibility, sharing, and preservation. The Data Curation Network (DCN) developed a standardized set of CURATED steps and checklists to aid individuals in the curation process.


Data Management Planning

Before an idea can bear fruit, there must be a description of the methods or processes that will generate data for analysis. This includes: 1) an experimental protocol or description of the experimental workflow, 2) the variables to be measured, 3) information about the study population or model organism, 4) the biological materials used in the study (reagents, biospecimens, etc.), and 5) the equipment and software used to collect the data. The processes and methods include all research activities a researcher performs, as well as all reagents, research participants, equipment, and software necessary to produce the data.

This stage is the time to decide how best to organize the data, what quality control criteria and standards will be used when collecting the data, and what workflows and parameters need to be documented.

  • The experimental workflow and parameters must also be documented (see the sketch after this list). This includes information about:
    • Study participants (sex, age, ethnicity, behaviors, etc. for human participants)
    • Equipment (technology, manufacturer, settings, software, etc.)
    • Materials (reagents, chemicals, specimens, etc.)
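
One lightweight way to capture these parameters, sketched here for a hypothetical imaging session (the field names and values are illustrative, not a community standard), is a structured metadata file saved next to each data file:

```python
import json

# Hypothetical acquisition metadata for one imaging session; field names
# and values are illustrative examples only.
session_metadata = {
    "participant": {"id": "P012", "sex": "F", "age": 34},
    "equipment": {
        "instrument": "ExampleScan 3T",   # hypothetical manufacturer/model
        "software_version": "5.2.1",
        "settings": {"tr_ms": 2000, "te_ms": 30},
    },
    "materials": {"contrast_agent": "none"},
}

# Write the metadata alongside the session's data file.
with open("session_P012_metadata.json", "w") as f:
    json.dump(session_metadata, f, indent=2)
```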

Furthermore, it is important at this stage to identify discipline-specific standards for data collection. The use of data standards and common data elements (CDEs) makes the data interoperable with other existing datasets and makes it easier for others in the field to properly understand your data.


Data Collection

There are several ways that a dataset can be collected and generated, including:

  • Quantitative measurements collected from humans, animals, cells, or other biological organisms
  • Observations, such as a study of human behavior
  • Simulated data produced by a computational model, where outputs are calculated for different inputs, or different assumptions about how the system functions
  • New data generated from existing datasets, such as using subsets of U.S. Census data to better understand the health status of a particular geographic region (secondary data)
  • Data collected through interviews, focus groups, or surveys (qualitative data collection)

Good data management practices lay out file naming conventions, file structure, variable naming conventions, and storage and backup plans and schedules. However, establishing these conventions and plans is not enough to ensure that they will be adhered to.
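
One way to make such conventions enforceable rather than aspirational is to encode them in a small script that runs as part of the collection workflow. The sketch below assumes a hypothetical naming pattern of project_instrument_subject_date_version; substitute whatever convention your group agrees on:

```python
import re
from pathlib import Path

# Hypothetical convention: projectID_instrument_subjectID_YYYYMMDD_vN.ext
# e.g., "cardio01_mri_sub-012_20240315_v2.csv". Adjust the pattern to
# match your lab's own agreed convention.
NAME_PATTERN = re.compile(
    r"^[a-z0-9]+_[a-z0-9]+_sub-\d{3}_\d{8}_v\d+\.[a-z0-9]+$"
)

def check_names(data_dir):
    """Report files whose names violate the agreed convention."""
    for path in sorted(Path(data_dir).glob("*")):
        if path.is_file() and not NAME_PATTERN.match(path.name):
            print(f"Nonconforming file name: {path.name}")

check_names("raw_data")
```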

Good intentions for data management at the beginning of a project can easily be neglected in the busy, high-pressure environment of a research lab unless workflows are established that make adherence an integrated part of the research process.

During the collection process, make sure researchers understand the importance of data management and have sufficient time to adhere to file naming conventions, file structures, and storage and backup plans; to collect data according to discipline standards; and to document variable names.


Data Analysis

Processing Data: Raw data often requires some form of processing, either by a computer or by a human, to put the data in a form that can be analyzed. There are many ways that data can be processed for analysis, for example:

  • Cleaning the data, such as assigning a consistent value to non-responses in a survey
  • Extracting data from an image, such as measuring the size of a tumor in an MRI
  • Filtering signals, such as removing noise in an EKG measurement to enhance the signal
  • Normalizing, transforming, or formatting data to better suit the analysis process

Whatever processing the data undergoes, the end result is that the data are in a form ready to be analyzed.
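
As a concrete illustration of the first cleaning example in the list above, the short pandas sketch below maps the various ways a survey non-response might be recorded to a single consistent missing-value code; the column names and sentinel values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical raw survey export where non-responses appear inconsistently
# as "", "NA", "refused", or the sentinel code -99.
raw = pd.DataFrame({
    "participant_id": ["P001", "P002", "P003", "P004"],
    "exercise_hours": [3, -99, 5, np.nan],
    "diet_quality":   ["good", "NA", "refused", "fair"],
})

# Assign one consistent missing value (NaN) to every form of non-response.
cleaned = raw.replace({-99: np.nan, "": np.nan, "NA": np.nan, "refused": np.nan})
print(cleaned)
```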

Data Analysis and Research Findings: Data analysis is the process of applying quantitative, statistical and visualization approaches to a dataset to test a hypothesis, evaluate findings or glean new insight into a phenomenon. Some examples of data analysis include:

  • Statistical analysis using a programming language such as R or Python (see the sketch after this list)
  • Using a statistical program such as Stata, SAS, SPSS, or GraphPad Prism
  • Creating visualizations such as charts, graphs, and maps
  • Coding interviews or focus groups using a qualitative research method
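
As a minimal sketch of the first bullet, assuming a hypothetical two-group comparison, the snippet below runs Welch's t-test with SciPy; the measurements and group labels are illustrative only:

```python
import numpy as np
from scipy import stats

# Hypothetical measurements from two study groups (e.g., treatment vs. control).
treatment = np.array([5.1, 4.8, 6.2, 5.9, 5.4, 6.0])
control   = np.array([4.2, 4.6, 4.1, 4.9, 4.4, 4.7])

# Welch's t-test (does not assume equal variances between groups).
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```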

Before data are processed and analyzed, it is essential to design and implement a plan for storing the analysis files. Not only should the analysis process be documented, but documentation should also include information about any software used in the analysis, including the version and operating system. For collaborative data analysis, virtual research environments can be set up to ensure that all collaborators have access to the data as well as the necessary tools to analyze it.
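
One lightweight way to capture that documentation, assuming a Python-based analysis (the package list is illustrative), is to log the interpreter version, operating system, and key package versions at the start of each run:

```python
import platform
import sys
from importlib.metadata import version

# Record the analysis environment so results can be traced back to the
# exact software versions used; package names here are examples.
with open("analysis_environment.txt", "w") as log:
    log.write(f"Python: {sys.version}\n")
    log.write(f"OS: {platform.platform()}\n")
    for pkg in ("numpy", "pandas", "scipy"):
        try:
            log.write(f"{pkg}: {version(pkg)}\n")
        except Exception:
            log.write(f"{pkg}: not installed\n")
```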

It is crucial to establish a clear connection between analyzed and raw data files using file naming conventions and documentation. This allows researchers to easily identify and, if necessary, reanalyze the raw data files behind published tables, figures, and plots.

The following checklist can help to create a workflow for managing processed and analyzed data:

  • Design and implement a plan for storing analysis files
  • Determine where the analysis files will live in relation to the raw data
  • Document the analysis process
  • Document software used for analysis

Additional Resources

For more information on data management, view the slides from Becker Library’s Data Management Planning and Practice workshop.

RDMkit

The ELIXIR Research Data Management Kit (RDMkit) is designed as a guide for life scientists to better manage their research data following the FAIR principles. The contents are generated and maintained by the ELIXIR community as part of the ELIXIR-CONVERGE project and are organized around the stages of the data lifecycle (Plan, Collect, Process, Analyze, Preserve, Share and Reuse). It provides a combination of tools and resources for research data management based on role, domain, and task.