With the development of new experimental technologies, biologists face an avalanche of data that must be computationally analyzed for scientific advances and discoveries to emerge. Given the complexity of analysis pipelines, the large number of computational tools, and the enormous amount of data to manage, there is compelling evidence that many if not most scientific discoveries will not stand the test of time: increasing the reproducibility of computed results is of paramount importance.
The objective of this paper is to place scientific workflows in the context of reproducibility. To do so, we define several kinds of reproducibility that can be reached when scientific workflows are used to perform experiments. We characterize the criteria that reproducibility-friendly scientific workflow systems must meet, and use these criteria to position several representative and widely used workflow systems and companion tools within this framework. We also discuss the remaining challenges posed by reproducible scientific workflows in the life sciences. Our study was guided by three use cases from the life science domain involving in silico experiments.
• Use cases from the life sciences highlighting reproducibility and reuse needs.
• Terminology to describe reproducibility levels in scientific workflows.
• Criteria defining reproducibility-friendly workflow systems, and an evaluation of existing systems.
• Challenges and opportunities in scientific workflow reproducibility.
Monitoring the ecological status of coastal ecosystems is essential to track the consequences of anthropogenic pressures and assess conservation actions. Monitoring requires periodic measurements collected in situ, replicated over large areas and able to capture their spatial distribution over time. This means developing tools and protocols that are cost-effective and provide consistent and high-quality data, which is a major challenge. A new tool and protocol with these capabilities for non-extractively assessing the status of fishes and benthic habitats is presented here: the KOSMOS 3.0 underwater video system.
The KOSMOS 3.0 was conceived from the pre-existing and successful STAVIRO lander, and was developed within a digital fabrication laboratory where collective intelligence was contributed, mostly on a voluntary basis, within a managed project. Mechanical, electrical, and software engineering skills were combined with ecological knowledge and fieldwork experience.
Pool and aquarium tests of the KOSMOS 3.0 met all required technical specifications and passed operational testing. The prototype demonstrated high optical performance and high consistency with image data from the STAVIRO. The project's outcomes are shared under a Creative Commons Attribution-ShareAlike (CC-BY-SA) license. The low cost of a KOSMOS unit (~1,400 €) makes multiple units affordable on modest research or monitoring budgets.
We report the development of the ReproGenomics Viewer (RGV), a multi- and cross-species working environment for the visualization, mining and comparison of published omics data sets for the reproductive science community. The system currently embeds 15 published data sets related to gametogenesis from nine model organisms. Data sets have been curated and conveniently organized into broad categories including biological topics, technologies, species and publications. RGV's modular design for both organisms and genomic tools enables users to upload and compare their data with that from the data sets embedded in the system in a cross-species manner. The RGV is freely available at http://rgv.genouest.org.
There is an ongoing explosion of scientific datasets being generated, brought on by recent technological advances in many areas of the natural sciences. As a result, the life sciences have become increasingly computational in nature, and bioinformatics has taken on a central role in research studies. However, basic computational skills, data analysis, and stewardship are still rarely taught in life science educational programs, resulting in a skills gap in many of the researchers tasked with analysing these big datasets. In order to address this skills gap and empower researchers to perform their own data analyses, the Galaxy Training Network (GTN) has previously developed the Galaxy Training Platform (https://training.galaxyproject.org), an open access, community-driven framework for the collection of FAIR (Findable, Accessible, Interoperable, Reusable) training materials for data analysis utilizing the user-friendly Galaxy framework as its primary data analysis platform. Since its inception, this training platform has thrived, with the number of tutorials and contributors growing rapidly, and the range of topics extending beyond the life sciences to include climatology, cheminformatics, and machine learning. While initially aimed at supporting researchers directly, the GTN framework has proven to be an invaluable resource for educators as well. We have focused our efforts in recent years on adding increased support for this growing community of instructors. New features have been added to facilitate the use of the materials in a classroom setting and to simplify the contribution flow for new materials, and a set of train-the-trainer lessons has been added. Here, we present the latest developments in the GTN project, aimed at facilitating the use of the Galaxy Training materials by educators, and their usage in different learning environments.
The SHARC Interest Group of the Research Data Alliance was established to improve research crediting and rewarding mechanisms for scientists who wish to organise their data (and material resources) for community sharing. This requires that data are findable and accessible on the Web, and comply with shared standards making them interoperable and reusable in alignment with the FAIR principles. Doing so takes considerable time, energy, expertise and motivation, so it is imperative to facilitate the processes that encourage scientists to share their data. To that aim, supporting FAIR compliance processes and increasing the human understanding of FAIRness criteria (i.e., promoting FAIRness literacy), and not only their machine-readability, are critical steps in the data sharing process. Appropriate human-understandable criteria must first be identified in the FAIRness assessment processes and roadmap. This paper reports on the lessons learned from the RDA SHARC Interest Group on identifying the processes required to prepare FAIR implementation in various communities that are not specifically data-skilled, and on the procedures and training that must be deployed and adapted to each practice and level of understanding. These are essential milestones in developing the adapted support and credit-back mechanisms not yet in place.
In the French West Indies, more than 20 species of cetaceans have been observed over the last decades. The recognition of this hotspot of marine mammal biodiversity, observed in the French Exclusive Economic Zone of the West Indies, motivated the French government to create in 2010 a marine protected area (MPA) dedicated to the conservation of marine mammals: the Agoa Sanctuary. The threats that cetacean populations face are multiple, but well documented. Cetacean conservation can only be achieved if relevant and reliable data are available, starting with occurrence data. In the Guadeloupe Archipelago, in addition to some data collected by the Agoa Sanctuary, occurrence data are mainly available through the contribution of citizen science and of local stakeholders (i.e. non-profit organisations (NPOs) and whale-watchers). However, no observation network has been coordinated and no standards exist for cetacean presence data collection and management.
In recent years, several whale watchers and NPOs regularly collected cetacean observation data around the Guadeloupe Archipelago. Our objective was to gather datasets from three Guadeloupean whale watchers, two NPOs and the Agoa Sanctuary, that agreed to share their data. These heterogeneous data went through a careful process of curation and standardisation in order to create a new extended database, using a newly-designed metadata set. This aggregated dataset contains a total of 4,704 records of 21 species collected in the Guadeloupe Archipelago from 2000 to 2019. The database was called Kakila ("who is there?" in Guadeloupean Creole). The Kakila database was developed following the FAIR principles with the ultimate objective of ensuring sustainability. All these data were transferred into the PNDB repository (Pôle National de Données de Biodiversité, Biodiversity French Data Hub, https://www.pndb.fr).
In the Agoa Sanctuary and surrounding waters, marine mammals face increasing pressure from growing human activities. In this context, the Kakila database fulfils the need for an organised system structuring the marine mammal occurrences collected by multiple local stakeholders with a common objective: contributing to the knowledge and conservation of cetaceans living in French Antilles waters. Much-needed data analyses will make it possible to identify areas of high cetacean presence, to document the presence of rarer species and to determine areas of possible negative interactions with anthropogenic activities.
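The curation and standardisation step behind an aggregated database like Kakila can be sketched as a mapping from each stakeholder's field names onto a common, Darwin Core-like schema. The source formats, field names and mapping below are hypothetical illustrations, not the actual Kakila metadata set:

```python
# Sketch of curating heterogeneous occurrence records into one schema.
# All field names and source labels here are illustrative assumptions.

FIELD_MAP = {
    "whale_watcher": {"sp": "scientificName", "dt": "eventDate",
                      "lat": "decimalLatitude", "lon": "decimalLongitude"},
    "npo": {"species": "scientificName", "date": "eventDate",
            "latitude": "decimalLatitude", "longitude": "decimalLongitude"},
}

def standardise(record, source):
    """Map one stakeholder record onto the common schema, keeping provenance."""
    mapping = FIELD_MAP[source]
    out = {target: record[src] for src, target in mapping.items()}
    out["dataSource"] = source  # provenance retained for traceability
    return out

rec = standardise({"sp": "Megaptera novaeangliae", "dt": "2015-02-10",
                   "lat": 16.24, "lon": -61.54}, "whale_watcher")
```

Keeping the original source label on every record is what later allows analyses to weight or filter contributions per stakeholder.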
Environmental DNA (eDNA) data enables biodiversity to be monitored at unprecedented resolution and scale. There is great potential in combining knowledge from traditional and innovative methods such as eDNA for biodiversity assessment. eDNA use cases are increasing in aquatic and marine environments, and studies on soils have been developed in recent years.
PatriNat*1 (under the guardianship of the French Biodiversity Office (OFB), National Museum of Natural History (MNHN), National Center for Scientific Research (CNRS), and Research Institute for Development (IRD)) is a French data and expertise center working with environmental and research stakeholders to develop data exchanges at all levels. We discuss what eDNA data is (Fig. 1), the different types of data, and the importance of their storage and accessibility.
As the amount of eDNA data increases, public agencies need to propose FAIR (Findable, Accessible, Interoperable, Reusable) tools and methods to facilitate their use and foster the development of relevant scientific expertise. We give an overview of the French eDNA data landscape and links with existing standards SINP*2 (National Heritage Inventory Information System), Darwin Core*3 (Wieczorek et al. 2012) and workflows (Fig. 2).
A priority for eDNA data is to have reliable reference databases and FAIR metadata. PatriNat's new tool will provide access to expertly validated genetic sequence data on species present in France; it is urgently needed for research, but also for knowledge-building, monitoring, public policy, and potential law enforcement purposes. We therefore present this technical database, built in conjunction with, among other initiatives, DiSSCo*4, iBOL*5 and TaxRef*6 (Gargominy et al. 2022).
It will manage three data types:
Intrinsic sequence data (marker, sequencing methods, etc.).
Sequence management (organization, identifiers, URLs, etc.).
Voucher data
It will use the nomenclature of Chakrabarty et al. (2013) as well as:
The Global Genome Biodiversity Network (GGBN) standard*7 (Droege et al. 2016)
The Minimum Information about any (x) Sequence (MIxS) standard (Yilmaz et al. 2011)
The Access to Biological Collection Data (ABCD) standard*8 (Holetschek et al. 2012)
The Collections Descriptions terms*9 (Woodburn et al. 2021)
The DwC DNA extension*10 can be used to share part of this content with the Global Biodiversity Information Facility (GBIF), but referencing sequences associated with specimens/vouchers will require other standards.
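The three data types above (intrinsic sequence data, sequence management, voucher data) can be pictured as one record combining Darwin Core occurrence terms with terms in the style of the DwC DNA extension. The exact term set the PatriNat database will adopt is not specified here; the field names and values below are examples only:

```python
# Illustrative record mixing Darwin Core voucher terms with
# DNA-extension-style sequence terms. Field names are examples,
# not the definitive PatriNat schema.

record = {
    # voucher / occurrence side (Darwin Core)
    "occurrenceID": "urn:example:voucher:0001",  # hypothetical identifier
    "scientificName": "Salmo trutta",
    "basisOfRecord": "MaterialSample",
    # intrinsic sequence side (DNA-extension style)
    "DNA_sequence": "ACGTACGT",
    "target_gene": "COI",
}

def is_linkable(rec):
    """A sequence can be referenced back to its voucher only if
    both the voucher identifier and the sequence are present."""
    return "occurrenceID" in rec and "DNA_sequence" in rec
```

The `is_linkable` check mirrors the point made above: sharing the sequence side through GBIF is possible, but the voucher link is what requires the additional standards (GGBN, MIxS, ABCD).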
Dendritic cells are sentinels of the immune system distributed throughout the body that, following danger signals, migrate to secondary lymphoid organs to induce effector T cell responses. We have identified, in a rodent model of graft rejection, a new molecule expressed by dendritic cells that we have named LIMLE (RGD1310371). To characterize this new molecule, we analyzed the regulation of its expression and its function. We observed that LIMLE mRNAs were rapidly and strongly upregulated in dendritic cells following inflammatory stimulation. We demonstrated that LIMLE inhibition does not alter dendritic cell maturation or cytokine production following Toll-like receptor stimulation. However, it reduces their ability to stimulate effector T cells in a mixed leukocyte reaction or a T cell receptor transgenic system. Interestingly, we observed that LIMLE protein localized with actin in some areas under the plasma membrane. Moreover, LIMLE is highly expressed in testis, trachea, lung and ciliated cells, and it has been shown that cilia formation bears similarities to the formation of the immunological synapse, which is required for T cell activation by dendritic cells. Taken together, these data suggest a role for LIMLE in specialized structures of the cytoskeleton that are important for dynamic cellular events such as immune synapse formation. In the future, LIMLE may represent a new target to reduce the capacity of dendritic cells to stimulate T cells and to regulate an immune response.
Linux container technologies, as represented by Docker, provide an alternative to the complex and time-consuming installation processes needed for scientific software. The ease of deployment and the process isolation they enable, as well as the reproducibility they permit across environments and versions, are among the qualities that make them interesting candidates for the construction of bioinformatic infrastructures, at any scale from single workstations to high throughput computing architectures. The Docker Hub is a public registry which can be used to distribute bioinformatic software as Docker images. However, its lack of curation and its genericity make it difficult for a bioinformatics user to find the most appropriate images. BioShaDock is a bioinformatics-focused Docker registry, which provides a local and fully controlled environment to build and publish bioinformatic software as portable Docker images. It provides a number of improvements over the base Docker registry in authentication and permissions management, which enable its integration in existing bioinformatic infrastructures such as computing platforms. The metadata associated with the registered images are domain-centric, including for instance concepts defined in the EDAM ontology, a shared and structured vocabulary of commonly used terms in bioinformatics. The registry also includes user-defined tags to facilitate discovery, as well as a link to the tool description in the ELIXIR registry if it already exists. If it does not, BioShaDock will synchronize with the ELIXIR registry to create a new description there, based on the BioShaDock entry metadata. This link helps users get more information on the tool, such as its EDAM operations and input and output types. This allows integration with the ELIXIR Tools and Data Services Registry, thus providing the appropriate visibility of such images to the bioinformatics community.
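The domain-centric metadata described above is what makes a curated registry searchable by biological function rather than by image name alone. The record structure and EDAM labels below are illustrative assumptions, not BioShaDock's actual data model:

```python
# Sketch of searching image metadata by EDAM operation, in the spirit
# of a domain-centric registry. Records and labels are illustrative.

images = [
    {"name": "samtools",
     "edam_operations": ["Sequence alignment analysis"],
     "tags": ["bam", "sam"]},
    {"name": "blast",
     "edam_operations": ["Sequence similarity search"],
     "tags": ["alignment", "search"]},
]

def find_by_operation(catalog, operation):
    """Return the names of images whose metadata lists the given
    EDAM operation concept."""
    return [img["name"] for img in catalog
            if operation in img["edam_operations"]]

hits = find_by_operation(images, "Sequence similarity search")  # ["blast"]
```

A plain-name search on Docker Hub would miss such matches; annotating images with a controlled vocabulary is what enables this kind of functional lookup.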
"FAIR (Findable, Accessible, Interoperable, Reusable) principles" (Wilkinson et al. 2016) and "open science" are two complementary movements in biodiversity science. Although we need to transition to making scientific data and associated material more FAIR, this does not necessarily imply open data or open source algorithms. Here, based on the experience of the French Biodiversity Data Hub ("Pôle national de données de Biodiversité" - PNDB), an e-infrastructure for and by researchers, we want to showcase how focusing on openness can be a strategy to efficiently reach greater FAIRness. Following DataONE guidance, we can build a complete data/metadata ecosystem allowing us to structure heterogeneous environmental information systems. Using the Galaxy analysis platform and its related initiatives (Galaxy Training Network, the European Erasmus+ Gallantries project, Bioconda, BioContainers), we illustrate how to create transparent, peer-reviewed and accessible tools, workflows and collaborative training material. The Galaxy platform also facilitates use of high performance computing infrastructure, notably through the European Open Science Cloud marketplace. Finally, through our experiences contributing to open source projects like the EML (Ecological Metadata Language (Michener et al. 1997)) Assembly Line, EDI (Environmental Data Initiative), and PAMPA (Indicators of Marine Protected Areas performance for managing coastal ecosystems, resources and their uses), a French platform helping protected area managers standardize and analyse their data, we also show how building open source "doors" to these environments with the R Shiny framework can be beneficial for all.