The accurate detection of variants is essential for genomics-based studies. Currently, there are various tools designed to detect genomic variants; however, it has always been a challenge to decide which tool to use, especially when various major genome projects have chosen to use different tools. Thus far, most of the existing tools were mainly developed to work on short-read data (i.e., Illumina); however, other sequencing technologies (e.g., PacBio and Oxford Nanopore) have recently shown that they can also be used for variant calling. In addition, with the emergence of artificial intelligence (AI)-based variant calling tools, there is a pressing need to compare these tools in terms of efficiency, accuracy, computational power, and ease of use.
In this study, we evaluated five of the most widely used conventional and AI-based variant calling tools (BCFTools, GATK4, Platypus, DNAscope, and DeepVariant) in terms of accuracy and computational cost using both short-read and long-read data derived from three different sequencing technologies (Illumina, PacBio HiFi, and ONT) for the same set of samples from the Genome In A Bottle project. The analysis showed that AI-based variant calling tools outperform conventional ones in most aspects when calling SNVs and INDELs from both long and short reads. In addition, we demonstrate the advantages and drawbacks of each tool and rank the tools in each aspect of these comparisons.
This study provides best practices for variant calling using AI-based and conventional variant callers with different types of sequencing data.
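Accuracy comparisons of this kind are commonly summarised as precision, recall, and F1 against a truth set. The sketch below illustrates that calculation under stated assumptions: the abstract does not say which metrics or benchmarking software were used, variants are matched by exact (chromosome, position, ref, alt) identity, and the helper `score_calls` and the toy variants are hypothetical.

```python
from typing import NamedTuple, Set, Tuple

# A variant is identified here by (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


class Scores(NamedTuple):
    precision: float
    recall: float
    f1: float


def score_calls(calls: Set[Variant], truth: Set[Variant]) -> Scores:
    """Score a call set against a truth set by exact matching (hypothetical helper)."""
    tp = len(calls & truth)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # present in the truth set but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return Scores(precision, recall, f1)


# Toy data (invented for illustration only).
truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
         ("chr2", 40, "G", "GA"), ("chr2", 90, "T", "C")}
calls = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
         ("chr2", 40, "G", "GA"), ("chr3", 7, "A", "C")}

scores = score_calls(calls, truth)
print(scores)
```

Exact matching is a simplification: real benchmarks usually normalise variant representation first, because the same INDEL can be written in more than one way.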
Summary
Soya bean is a major source of edible oil and protein for human consumption as well as animal feed. Understanding the genetic basis of different traits in soya bean will provide important insights for improving breeding strategies for this crop. A genome‐wide association study (GWAS) was conducted to accelerate molecular breeding for the improvement of agronomic traits in soya bean. A genotyping‐by‐sequencing (GBS) approach was used to provide dense genome‐wide marker coverage (>47 000 SNPs) for a panel of 304 short‐season soya bean lines. A subset of 139 lines, representative of the diversity among these, was characterized phenotypically for eight traits under six environments (3 sites × 2 years). Marker coverage proved sufficient to ensure highly significant associations between the genes known to control simple traits (flower, hilum and pubescence colour) and flanking SNPs. Between one and eight genomic loci associated with more complex traits (maturity, plant height, seed weight, seed oil and protein) were also identified. Importantly, most of these GWAS loci were located within genomic regions identified by previously reported quantitative trait loci (QTL) for these traits. In some cases, the reported QTLs were also successfully validated by additional QTL mapping in a biparental population. This study demonstrates that integrating GBS and GWAS can be used as a powerful complementary approach to classical biparental mapping for dissecting complex traits in soya bean.
Key message
Next-generation sequencing (NGS) has revolutionized plant and animal research by providing powerful genotyping methods. This review describes and discusses the advantages, challenges and, most importantly, solutions to facilitate data processing, the handling of missing data, and cross-platform data integration.
Next-generation sequencing technologies provide powerful and flexible genotyping methods to plant breeders and researchers. These methods offer a wide range of applications from genome-wide analysis to routine screening with a high level of accuracy and reproducibility. Furthermore, they provide a straightforward workflow to identify, validate, and screen genetic variants quickly and at low cost. NGS-based genotyping methods include whole-genome re-sequencing, SNP arrays, and reduced representation sequencing, which are widely applied in crops. The main challenges facing breeders and geneticists today are how to choose an appropriate genotyping method and how to integrate genotyping data sets obtained from various sources. Here, we review and discuss the advantages and challenges of several NGS methods for genome-wide genetic marker development and genotyping in crop plants. We also discuss how imputation methods can be used both to fill in missing data in genotypic data sets and to integrate data sets obtained using different genotyping tools. It is our hope that this synthetic view of genotyping methods will help geneticists and breeders to integrate these NGS-based methods in crop plant breeding and research.
Highly parallel SNP genotyping platforms have been developed for some important crop species, but these platforms typically carry a high cost per sample for first-time or small-scale users. In contrast, recently developed genotyping by sequencing (GBS) approaches offer a highly cost-effective alternative for simultaneous SNP discovery and genotyping. In the present investigation, we have explored the use of GBS in soybean. In addition to developing a novel analysis pipeline to call SNPs and indels from the resulting sequence reads, we have devised a modified library preparation protocol to alter the degree of complexity reduction. We used a set of eight diverse soybean genotypes to conduct a pilot scale test of the protocol and pipeline. Using ApeKI for GBS library preparation and sequencing on an Illumina GAIIx machine, we obtained 5.5 M reads and these were processed using our pipeline. A total of 10,120 high quality SNPs were obtained and the distribution of these SNPs mirrored closely the distribution of gene-rich regions in the soybean genome. A total of 39.5% of the SNPs were present in genic regions and 52.5% of these were located in the coding sequence. Validation of over 400 genotypes at a set of randomly selected SNPs using Sanger sequencing showed a 98% success rate. We then explored the use of selective primers to achieve a greater complexity reduction during GBS library preparation. The number of SNP calls could be increased by almost 40% and their depth of coverage was more than doubled, thus opening the door to an increase in the throughput and a significant decrease in the per sample cost. The approach to obtain high quality SNPs developed here will be helpful for marker assisted genomics as well as assessment of available genetic resources for effective utilisation in a wide number of species.
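The nested percentages reported above can be turned into approximate SNP counts. The short calculation below is only a worked illustration: it reads "52.5% of these" as 52.5% of the genic SNPs, and because the published percentages are rounded, the resulting counts are approximate rather than figures from the study. The variable names are illustrative.

```python
# Figures taken from the abstract.
total_snps = 10_120
genic_fraction = 0.395             # share of all SNPs in genic regions
coding_fraction_of_genic = 0.525   # share of genic SNPs in coding sequence

# Derived values (approximate, since the percentages are rounded).
genic_snps = total_snps * genic_fraction
coding_snps = genic_snps * coding_fraction_of_genic
coding_fraction_overall = genic_fraction * coding_fraction_of_genic

print(f"genic SNPs:   ~{genic_snps:.0f}")
print(f"coding SNPs:  ~{coding_snps:.0f}")
print(f"coding share of all SNPs: ~{coding_fraction_overall:.1%}")
```

On this reading, roughly one SNP in five falls within coding sequence.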
Mineral nutrients play a crucial role in the biochemical and physiological functions of biological systems. The enhancement of seed mineral content via genetic improvement is considered the most promising and cost-effective approach, compared with alternative means, for meeting dietary needs. The overall objective of this study was to perform a GWAS of mineral content (Ca, K, P and S) in seeds of a core set of 137 soybean lines that are representative of the diversity of early maturing soybeans cultivated in Canada (maturity groups 000-II).
This panel of 137 soybean lines was grown in a total of five environments and the seed mineral content was measured using a portable X-ray fluorescence (XRF) spectrometer. The association analyses were carried out using three statistical models and a set of 2.2 million SNPs obtained from a combined dataset of genotyping-by-sequencing and whole-genome sequencing. Eight QTLs significantly associated with Ca, K, P and S content were identified by at least two of the three statistical models used (in two environments), each contributing 17 to 31% of the phenotypic variation. The effect of seven out of these eight QTLs was strongly reproducible in three other environments. In total, three candidate genes involved in the transport and assimilation of these mineral elements were identified.
There have been very few GWAS studies to identify QTLs associated with the mineral element content of soybean seeds. In addition to being new, the QTLs identified in this study and candidate genes will be useful for the genetic improvement of soybean nutritional quality through marker-assisted selection. Moreover, this study also provides details on the range of phenotypic variation encountered within the Canadian soybean germplasm.
Key message
E10 is a new maturity locus in soybean and FT4 is the predicted/potential functional gene underlying the locus.
Flowering and maturity time traits play crucial roles in economic soybean production. Early maturity is critical for north and west expansion of soybean in Canada. To date, 11 genes/loci have been identified which control time to flowering and maturity; however, the molecular bases of almost half of them are not yet clear. We have identified a new maturity locus called “E10” located at the end of chromosome Gm08. The gene symbol E10e10 has been approved by the Soybean Genetics Committee. The e10e10 genotype results in 5–10 days earlier maturity than E10E10. A set of presumed E10E10 and e10e10 genotypes was used to identify contrasting SSR and SNP haplotypes. These haplotypes, and their association with maturity, were maintained through five backcross generations. A functional genomics approach using a predicted protein–protein interaction (PPI) approach (Protein–protein Interaction Prediction Engine, PIPE) was used to investigate approximately 75 genes located in the genomic region that SSR and SNP analyses identified as the location of the E10 locus. The PPI analysis identified FT4 as the most likely candidate gene underlying the E10 locus. Sequence analysis of the two FT4 alleles identified three SNPs, in the 5′UTR, 3′UTR and fourth exon in the coding region, which result in differential mRNA structures. Allele-specific markers were developed for this locus and are available for soybean breeders to efficiently develop earlier maturing cultivars using molecular marker assisted breeding.
Genotyping-by-sequencing (GBS) is a rapid, flexible, low-cost, and robust genotyping method that simultaneously discovers variants and calls genotypes within a broad range of samples. These characteristics make GBS an excellent tool for many applications and research questions from conservation biology to functional genomics in both model and non-model species. Continued improvement of GBS relies on a more comprehensive understanding of data analysis, development of fast and efficient bioinformatics pipelines, accurate missing data imputation, and active post-release support. Here, we present the second generation of Fast-GBS (v2.0) that offers several new options (e.g., processing paired-end reads and imputation of missing data) and features (e.g., summary statistics of genotypes) to improve the GBS data analysis process. The performance assessment analysis showed that Fast-GBS v2.0 outperformed other available analytical pipelines, such as GBS-SNP-CROP and Gb-eaSy. Fast-GBS v2.0 provides an analysis platform that can be run with different types of sequencing data, requires only modest computational resources, and allows for missing-data imputation for various species in different contexts.
Silicon (Si) confers several benefits to many plant species when absorbed as silicic acid through nodulin 26-like intrinsic proteins (NIPs). The NIPs belong to the major intrinsic protein (MIP) family, members of which form channels with high selectivity to control transport of water and different solutes. Here, comparative genomic analysis of the MIPs was performed to investigate the presence of Si transporter MIPs in soybean. Thorough analysis of phylogeny, gene organization, transcriptome profiling and protein modeling was performed to characterize MIPs in rice, Arabidopsis and soybean. Based on several attributes, two putative Si transporter genes, GmNIP2-1 and GmNIP2-2, were identified, characterized and cloned from soybean. Expression of both genes was detected in shoot and root tissues, and decreased as Si increased. The protein encoded by GmNIP2-2 showed functionality for Si transport when expressed in Xenopus oocytes, thus confirming the genetic capability of soybean to absorb the element. Comparative analysis of MIPs in plants provides opportunities to decipher gene evolution, functionality and selectivity of nutrient uptake mechanisms. Exploitation of this strategy has helped to uncover unique features of MIPs in soybean. The identification and functional characterization of Si transporters can be exploited to optimize the benefits that plants can derive from Si absorption.
Sclerotinia stem rot (SSR) is one of the most important pests in cool soybean growing regions of the Northeastern United States and Canada. However, the intensity of infestations varies considerably from year to year according to weather conditions, thus making it difficult for breeders to select under uniform disease pressure. Selection for resistance to SSR would be greatly facilitated by the use of molecular markers. In this work, a collection of 130 lines was inoculated using the cotton pad method and was genetically characterized using a genotyping‐by‐sequencing (GBS) protocol optimized for soybean. Genome‐wide association mapping (AM) and linkage disequilibrium (LD) analyses were performed with 7864 single nucleotide polymorphisms (SNPs). Linkage disequilibrium varied considerably over physical distance, reaching an r² value of 0.2 after 8.5 Mb in the pericentromeric region and 0.5 Mb in the telomeric region. The mixed linear model (MLM) performed very well in accounting for population structure and relatedness, as only 5.5% of the observed p‐values were < 0.05. The strongest association was found on chromosome Gm15 (p‐value = 1.38 × 10⁻⁶; q‐value adjusted p‐value = 0.011). Two additional SNP markers in the vicinity had a q‐value < 0.1. This marker was validated in the progeny of a biparental cross, where F4:6 lines carrying the susceptibility allele developed lesions 17.6 mm longer than lines carrying the resistance allele. Interestingly, other genes contributing to resistance to pathogens have been reported in this region of Gm15. Three other association peaks having a q‐value < 0.1 were detected on chromosomes Gm01, Gm19, and Gm20.
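The LD statistic reported above, r squared, can be computed for a pair of biallelic SNPs as the squared correlation between their allele codes. The sketch below is a minimal illustration, not the software used in the study: it assumes inbred lines whose genotypes can be coded 0 or 1 at each SNP, and the function name `ld_r2` and the example genotype vectors are hypothetical.

```python
from typing import Sequence


def ld_r2(a: Sequence[int], b: Sequence[int]) -> float:
    """Squared correlation between two SNPs coded 0/1 across the same lines.

    For 0/1 codes this equals D^2 / (pA(1-pA) * pB(1-pB)), where D is the
    covariance of the allele codes and pA, pB are the allele frequencies.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("genotype vectors must be non-empty and equal in length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    if var_a == 0 or var_b == 0:
        raise ValueError("r^2 is undefined for a monomorphic SNP")
    return cov * cov / (var_a * var_b)


# Toy genotypes for eight inbred lines (invented for illustration only).
snp1 = [0, 0, 1, 1, 0, 1, 0, 1]
snp2 = [0, 0, 1, 1, 0, 1, 0, 1]   # identical to snp1: complete LD
snp3 = [0, 1, 0, 1, 0, 1, 0, 1]   # only partly associated with snp1

print(ld_r2(snp1, snp2))
print(ld_r2(snp1, snp3))
```

An LD decay curve such as the one summarised above would then come from computing this value for many SNP pairs and relating it to the physical distance between them.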
Key message
We were able to obtain good prediction accuracy in genomic selection with ~ 2000 GBS-derived SNPs. SNPs in genic regions did not improve prediction accuracy compared to SNPs in intergenic regions.
Since genotyping can represent an important cost in genomic selection, it is important to minimize it without compromising the accuracy of predictions. The objectives of the present study were to explore how a decrease in the unit cost of genotyping impacted: (1) the number of single nucleotide polymorphism (SNP) markers; (2) the accuracy of the resulting genotypic data; (3) the extent of coverage on both physical and genetic maps; and (4) the prediction accuracy (PA) for six important traits in barley. Variations on the genotyping by sequencing protocol were used to generate 16 SNP sets ranging from ~ 500 to ~ 35,000 SNPs. The accuracy of SNP genotypes fluctuated between 95 and 99%. Marker distribution on the physical map was highly skewed toward the terminal regions, whereas a fairly uniform coverage of the genetic map was achieved with all but the smallest set of SNPs. We estimated the PA using three statistical models capturing (or not) the epistatic effect; the one modeling both additivity and epistasis was selected as the best model. The PA obtained with the different SNP sets was measured and found to remain stable, except with the smallest set, where a significant decrease was observed. Finally, we examined if the localization of SNP loci (genic vs. intergenic) affected the PA. No gain in PA was observed using SNPs located in genic regions. In summary, we found that there is considerable scope for decreasing the cost of genotyping in barley (to capture ~ 2000 SNPs) without loss of PA.