Abstract Motivation Hi-C is gaining prominence as a method for mapping genome organization. With declining sequencing costs and a growing demand for higher-resolution data, efficient tools for processing Hi-C datasets at different resolutions are crucial. Over the past decade, the .hic and Cooler file formats have become the de facto standards for storing interaction matrices produced by Hi-C experiments in binary format. Interoperability issues make it unnecessarily difficult to convert between the two formats and to develop applications that can process each format natively. Results We developed hictk, a toolkit that can transparently operate on .hic and .cool files with excellent performance. The toolkit is written in C++ and consists of a C++ library with Python and R bindings as well as CLI tools to perform common operations directly from the shell, including converting between .hic and .mcool formats. We benchmark the performance of hictk and compare it with other popular tools and libraries. We conclude that hictk significantly outperforms existing tools while providing the flexibility of natively working with both file formats without code duplication. Availability and implementation The hictk library, Python bindings and CLI tools are released under the MIT license as a multi-platform application available at github.com/paulsengroup/hictk. Pre-built binaries for Linux and macOS are available on bioconda. Python bindings for hictk are available on GitHub at github.com/paulsengroup/hictkpy, while R bindings are available on GitHub at github.com/paulsengroup/hictkR.
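As a rough illustration of what handling an interaction matrix at several resolutions involves, the sketch below sums sparse pixels into coarser bins. It is not hictk's implementation or API: the function name `coarsen_pixels` and the `(bin1, bin2, count)` tuple layout are hypothetical, and the sketch assumes all bins lie on a single chromosome so that integer division of bin indices is a valid mapping to the coarser grid.

```python
from collections import defaultdict


def coarsen_pixels(pixels, factor):
    """Sum sparse (bin1, bin2, count) pixels into bins `factor` times wider.

    Hypothetical helper for illustration only; assumes every bin index
    refers to the same chromosome.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    coarse = defaultdict(int)
    for bin1, bin2, count in pixels:
        # Integer division maps each fine bin onto its enclosing coarse bin.
        coarse[(bin1 // factor, bin2 // factor)] += count
    # Return pixels sorted by (bin1, bin2), as sparse matrix formats usually store them.
    return sorted((b1, b2, n) for (b1, b2), n in coarse.items())


# Toy upper-triangular matrix with four non-zero pixels.
pixels = [(0, 0, 5), (0, 1, 2), (1, 1, 3), (2, 3, 4)]
coarse = coarsen_pixels(pixels, 2)
print(coarse)
```

Because `bin1 <= bin2` implies `bin1 // factor <= bin2 // factor`, the coarsened pixels remain upper-triangular, which is the layout sparse Hi-C matrices typically rely on.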
Abstract Summary Genome-wide DNA methylation (DNAm) profiling is indispensable for unveiling how DNAm regulates biological pathways and individual phenotypes. However, managing and analyzing extensive DNAm data generated from large cohort studies present computational obstacles. Apache Parquet is a data file format that allows for efficient data storage, retrieval, and manipulation, alleviating computational hurdles associated with conventional row-based formats. Here we introduce MethParquet, the first R package leveraging the columnar Parquet format for efficient DNAm data analysis. It can be used for data extraction, methylation risk score calculation, epigenome-wide association analyses (EWAS), and other standard post-quality control tasks. The package flexibly implements diverse regression models. Using a public methylation dataset, we show the efficiency of this package in reducing running time and RAM usage in large-scale EWAS. Availability and implementation The MethParquet R package is publicly available on the GitHub repository https://github.com/ZWangTen/MethParquet. It includes a vignette and a toy dataset derived from a public resource.
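A methylation risk score is commonly defined as a weighted sum of per-CpG methylation values. The sketch below illustrates that general definition only; it is not MethParquet's API, and the function name, the CpG identifiers, and the choice to skip CpG sites missing from either input are assumptions made for the example.

```python
def methylation_risk_score(betas, weights):
    """Weighted sum of methylation beta values over CpG sites present in both inputs.

    `betas` maps CpG ID -> methylation beta value for one sample.
    `weights` maps CpG ID -> effect-size weight.
    Hypothetical helper for illustration; sites missing from either
    mapping are skipped (an assumption, not a documented behavior).
    """
    shared = betas.keys() & weights.keys()
    return sum(betas[cpg] * weights[cpg] for cpg in shared)


# Made-up CpG IDs and values for one sample.
sample = {"cg0001": 0.5, "cg0002": 0.25, "cg0003": 0.75}
weights = {"cg0001": 2.0, "cg0002": -2.0, "cg9999": 1.0}
score = methylation_risk_score(sample, weights)
print(score)
```

With a columnar layout, only the columns for the weighted CpG sites would need to be read to evaluate such a score, which is the kind of access pattern the abstract attributes to Parquet.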
Abstract Motivation With single-cell DNA methylation studies yielding vast datasets, existing data formats struggle with the unique challenges of storage and efficient operations, highlighting a need for improved solutions. Results BAllC (Binary All Cytosines) emerges as a tailored format for methylation data, addressing these challenges. BAllCools, its complementary software toolkit, enhances parsing, indexing, and querying capabilities, promising superior operational speeds and reduced storage needs. Availability and implementation https://github.com/jksr/ballcools
Abstract Summary Subcluster analysis is a powerful means to improve clustering and characterization of single-cell RNA-Seq data. However, there are no existing tools to systematically integrate results from multiple subclusters, which creates hurdles for accurate data quantification, visualization, and interpretation in downstream analysis. To address this issue, we developed Ragas, an R package that integrates multi-level subclustering objects for streamlined analysis and visualization. A new data structure was implemented to seamlessly connect and assemble miscellaneous single-cell analyses from different levels of subclustering, along with several new or enhanced visualization functions. Moreover, a re-projection algorithm was developed to integrate nearest-neighbor graphs from multiple subclusters in order to maximize their separability on the combined cell embeddings, which significantly improved the presentation of rare and homogeneous subpopulations. Availability and implementation The Ragas package and its documentation can be accessed through https://github.com/jig4003/Ragas and its source code is also available at https://zenodo.org/records/11244921.
Abstract Motivation Metabolomics, as an essential tool in systems biology, is now widely accessible to researchers of all levels. Yet challenges remain in data analysis and result interpretation. To address these challenges, we introduced MetaboReport, a versatile and interactive web app that simplifies metabolomics experiment design, data preprocessing, exploration, statistical analysis, visualization, and reporting. Results MetaboReport produces a comprehensive HTML report, including project details, an introduction, interactive plots and tables, statistical results, and in-depth explanations and interpretation of the results. MetaboReport is particularly tailored for research labs and metabolomics core facilities that provide metabolomics services, allowing them to efficiently manage and document different metabolomics projects and to effectively report the metabolomics results to users. Availability and implementation MetaboReport is freely accessible at https://metaboreport.com, with source code available on GitHub (https://github.com/YonghuiDong/MetReport). Alternatively, users can install MetaboReport as a standalone desktop app (https://metaboreport.sourceforge.io).