Single-cell RNA-Seq (scRNA-Seq) profiles gene expression of individual cells. Recent scRNA-Seq datasets have incorporated unique molecular identifiers (UMIs). Using negative controls, we show UMI counts follow multinomial sampling with no zero inflation. Current normalization procedures such as log of counts per million and feature selection by highly variable genes produce false variability in dimension reduction. We propose simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature selection using deviance. These methods outperform the current practice in a downstream clustering assessment using ground truth datasets.
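As an illustration of feature selection by deviance, the sketch below ranks genes by a binomial deviance computed against a null model in which each gene takes a constant share of every cell's total UMI count. This is a minimal sketch of one common form of the idea, not the authors' implementation; the function names and toy counts are hypothetical.

```python
import math

def xlogy_ratio(y, mu):
    """Return y * log(y / mu), using the convention 0 * log(0) = 0."""
    return 0.0 if y == 0 else y * math.log(y / mu)

def binomial_deviance(gene_counts, cell_totals):
    """Deviance of one gene against a constant-proportion null model.

    gene_counts[i] is the UMI count of the gene in cell i and
    cell_totals[i] is the total UMI count of cell i (assumed inputs).
    """
    pi = sum(gene_counts) / sum(cell_totals)  # pooled proportion under the null
    if pi == 0.0 or pi == 1.0:
        return 0.0  # a gene that is never (or always) drawn carries no signal
    dev = 0.0
    for y, n in zip(gene_counts, cell_totals):
        dev += xlogy_ratio(y, n * pi)
        dev += xlogy_ratio(n - y, n * (1.0 - pi))
    return 2.0 * dev

def rank_genes(counts_by_gene, cell_totals):
    """Return gene names sorted from most to least informative."""
    scores = {g: binomial_deviance(c, cell_totals) for g, c in counts_by_gene.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy data: three cells with different sequencing depths.
cell_totals = [100, 200, 400]
counts_by_gene = {
    "flat": [10, 20, 40],      # scales with depth, so deviance is near zero
    "variable": [40, 20, 10],  # departs from the null, so deviance is large
}
ranking = rank_genes(counts_by_gene, cell_totals)
```

Genes whose counts merely track sequencing depth score near zero, so they fall to the bottom of the ranking without any log transformation or pseudocount.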
Classification and regression trees (CART) prove to be a true alternative to full parametric models such as linear models (LM) and generalized linear models (GLM). Although CART suffer from a biased variable selection issue, they are commonly applied to various topics and used for tree ensembles and random forests because of their simplicity and computation speed. Conditional inference trees and model-based tree algorithms, for which variable selection is tackled via fluctuation tests, are known to give more accurate and interpretable results than CART, but require longer computation times. Using a closed-form maximum likelihood estimator for GLM, this paper proposes a split point procedure based on the explicit likelihood in order to save time when searching for the best split for a given splitting variable. A simulation study for non-Gaussian response is performed to assess the computational gain when building GLM trees. We also propose a benchmark on simulated and empirical datasets of GLM trees against CART, conditional inference trees and LM trees in order to identify situations where GLM trees are efficient. This approach is extended to multiway split trees and log-transformed distributions. Making GLM trees possible through a new split point procedure allows us to investigate the use of GLM in ensemble methods. We propose a numerical comparison of GLM forests against other random forest-type approaches. Our simulation analyses show cases where GLM forests are good challengers to random forests.
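To make the idea of a likelihood-based split search concrete, the following sketch scans candidate split points for a single variable using the closed-form maximized log-likelihood of an intercept-only Poisson model in each child node. The Poisson intercept-only setting, the function names and the data are assumptions chosen for illustration; the paper's actual procedure may differ.

```python
import math

def poisson_loglik(total, n):
    """Maximized Poisson log-likelihood (up to an additive constant)
    for n observations whose sum is `total`; the MLE of the mean is total / n."""
    if total == 0:
        return 0.0
    return total * math.log(total / n) - total

def best_split(x, y):
    """Return (split_point, log_likelihood) maximizing the two-node likelihood.

    Because the MLE is explicit, each candidate split is evaluated in O(1)
    from a running sum instead of refitting a model by iteration.
    """
    pairs = sorted(zip(x, y))
    n = len(pairs)
    grand_total = sum(v for _, v in pairs)
    best_point, best_ll = None, -math.inf
    left_total = 0
    for i in range(1, n):
        left_total += pairs[i - 1][1]
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical x values
        ll = poisson_loglik(left_total, i) + poisson_loglik(grand_total - left_total, n - i)
        if ll > best_ll:
            best_point = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            best_ll = ll
    return best_point, best_ll

# Hypothetical count response with a jump between x = 3 and x = 4.
split_point, split_ll = best_split([1, 2, 3, 4, 5, 6], [1, 0, 2, 9, 11, 10])
```

The running sum is what yields the speed-up in this sketch: after sorting, the whole scan is linear in the number of observations.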
Floods are among the natural hazards that have adverse effects on human lives, livelihoods, economies and infrastructure. Dry climates of southern Africa have, over the years, experienced an increase in the frequency of tropical cyclone induced floods. However, understanding the key factors that influence susceptibility to floods has remained largely unexplored in these dry climates. Therefore, this study sought to model flood hazards and determine key factors that significantly explain the probability of flood occurrence in the southern parts of Beitbridge District, Zimbabwe. To achieve these objectives, logistic regression was used to predict spatial variations in flood hazards following cyclone Dineo in 2017. Before spatial prediction of flood hazard, environmental variables were tested for multicollinearity using the Pearson correlation coefficient. Only two environmental variables, i.e., elevation and rainfall, were not significantly correlated and were thus used in the subsequent flood hazard modelling. Results demonstrate that two variables significantly (p < 0.05) predicted spatial variations in flood hazard in the southern parts of the Beitbridge District with relatively high accuracy defined by the area under the curve (AUC = 0.98). In addition, results indicate that ~56 % of the study area is regarded as highly susceptible to floods. Given the projected increase in extreme events such as intense rainfall as a result of climate change, floods will be expected to correspondingly increase in these semi-arid regions. Results presented in this study underscore the importance of geospatial techniques in flood-hazard modelling, which is the key input in sustainable land-use planning. It can thus be concluded that spatial analytical techniques play a key role in flood early warning systems aimed at supporting and building resilient communities in the face of climate change-induced floods.
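The modelling workflow described above (a logistic regression on two predictors, evaluated by AUC) can be sketched as follows. This is a minimal illustration fitted by plain gradient ascent; the rescaled elevation and rainfall values are invented for the example and are not the study's data.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, n_iter=2000):
    """Fit intercept b and weights w by gradient ascent on the log-likelihood."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(n_iter):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = yi - sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        b += lr * grad_b / n
        w = [wj + lr * g / n for wj, g in zip(w, grad_w)]
    return b, w

def auc(y_true, scores):
    """Area under the ROC curve: the share of (flooded, non-flooded) pairs
    ranked correctly, counting ties as half."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical sites: (elevation, rainfall), both rescaled to [0, 1]; 1 = flooded.
X = [(0.1, 0.9), (0.2, 0.8), (0.3, 0.7), (0.8, 0.2), (0.9, 0.3), (0.7, 0.1)]
y = [1, 1, 1, 0, 0, 0]
b, w = fit_logistic(X, y)
probs = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]
model_auc = auc(y, probs)
```

On real data the AUC would normally be computed on sites held out from fitting; the toy sites here are perfectly separable, so the in-sample value is uninformative.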
We consider estimation in generalized linear models when there are many potential predictors and some of them may not have influence on the response of interest. In the context of two competing models where one model includes all predictors and the other restricts variable coefficients to a candidate linear subspace based on subject matter or prior knowledge, we investigate the relative performances of Stein-type shrinkage, pretest, and penalty estimators (L1GLM, adaptive L1GLM, and SCAD) with respect to the unrestricted maximum likelihood estimator (MLE). The asymptotic properties of the pretest and shrinkage estimators including the derivation of asymptotic distributional biases and risks are established. In particular, we give conditions under which the shrinkage estimators are asymptotically more efficient than the unrestricted MLE. A Monte Carlo simulation study shows that the mean squared error (MSE) of an adaptive shrinkage estimator is comparable to the MSE of the penalty estimators in many situations and in particular performs better than the penalty estimators when the dimension of the restricted parameter space is large. The Steinian shrinkage and penalty estimators all improve substantially on the unrestricted MLE. A real data set analysis is also presented to compare the suggested methods.
This paper attempts to identify the impact of food wastage on economic growth using the data for 165 countries over the 2014–2018 period. With the help of ordinary least squares (OLS) and generalized linear model (GLM), the study shows that food wastage and poverty impact GDP growth negatively. Poverty and food wastage are positively related. Reducing food wastage can lead to poverty reduction, stimulating GDP growth. Measures are required to reduce food wastage, especially in middle‐income countries with a high undernourishment rate. Food wastage reflects poor regulatory capacity, and strengthening the institutional quality can also reduce the wastage of food.
1. Freshwater conservation has received less attention than its terrestrial or marine counterparts. Given the accelerated rate of change and intensive human use that freshwater ecosystems are subjected to, it is urgent to focus more attention on fresh waters. Existing conservation planning tools - such as Marxan - need to be modified to account for the special nature of these systems. Connectivity plays a key role in freshwater ecosystems. Threats are mediated along river corridors, and the condition of the entire catchment influences river biodiversity downstream. This needs to be considered in conservation planning. 2. The probabilities of occurrence of nine native freshwater fish species in a Mediterranean river basin, obtained from Multivariate Adaptive Regression Splines‐Generalized Linear Model (MARS‐GLM) models, were used as features to develop spatial conservation priorities. The priorities accounted for complementarity and spatial design issues. 3. To deal with the connected nature of rivers, we modified Marxan's boundary length penalty, avoiding the selection of isolated planning units and forcing the inclusion of closer upstream areas. We introduced ‘virtual boundaries’ between non‐headwater stream segments and added distance‐weighted penalties to the overall connectivity cost (CP) when stream segments upstream of the selected planning units are not selected. 4. This approach to prioritising connectivity is concordant with ecological theory, as it considers the natural and roughly exponential decay of upstream influences with distance. It accounts for the natural capacity of rivers to mitigate impacts when designing reserves. When connectivity was not emphasised, Marxan prioritised natural corridors for longitudinal movements. In contrast, whole sub‐basins were prioritised when connectivity was emphasised. Changing the relative emphasis on connectivity substantially changed the spatial prioritisation; our conservation investment could move from one basin to another. 5. Our novel approach to dealing with directional connectivity enables managers of freshwater systems to set ecologically meaningful spatial conservation priorities.
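The distance-weighted connectivity penalty described in point 3 can be illustrated with a small sketch. The inverse-distance weight, the data layout and the function name are assumptions made for this example; the abstract states only that the penalties are distance-weighted and does not give the exact weighting or how Marxan is configured.

```python
def connectivity_penalty(selected, upstream):
    """Sum a penalty for every upstream segment left out of the reserve.

    selected: set of selected planning-unit ids.
    upstream: dict mapping a unit id to a list of (upstream_unit, distance)
              pairs, where distance is measured along the river network.
    Each missing upstream unit adds 1 / distance (assumed weighting), so
    nearby omissions cost more than distant ones.
    """
    penalty = 0.0
    for unit in selected:
        for up_unit, distance in upstream.get(unit, []):
            if up_unit not in selected:
                penalty += 1.0 / distance
    return penalty

# Hypothetical three-segment river: A (headwater) -> B -> C (downstream).
upstream = {
    "C": [("B", 1.0), ("A", 4.0)],
    "B": [("A", 2.0)],
}
only_downstream = connectivity_penalty({"C"}, upstream)
two_segments = connectivity_penalty({"C", "B"}, upstream)
whole_river = connectivity_penalty({"A", "B", "C"}, upstream)
```

Because the penalty falls to zero only when every upstream segment is included, weighting it more heavily pushes a solution towards whole sub-basins, consistent with the behaviour reported in point 4.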
We introduce glmulti, an R package for automated model selection and multi-model inference with glm and related functions. From a list of explanatory variables, the provided function glmulti builds all possible unique models involving these variables and, optionally, their pairwise interactions. Restrictions can be specified for candidate models, by excluding specific terms, enforcing marginality, or controlling model complexity. Models are fitted with standard R functions like glm. The n best models and their support (e.g., (Q)AIC, (Q)AICc, or BIC) are returned, allowing model selection and multi-model inference through standard R functions. The package is optimized for large candidate sets by avoiding memory limitation, facilitating parallelization and providing, in addition to exhaustive screening, a compiled genetic algorithm method. This article briefly presents the statistical framework and introduces the package, with applications to simulated and real data.
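To show how quickly the candidate set grows, and what enforcing marginality does to it, the sketch below enumerates model formulas built from main effects and pairwise interactions. It is written in Python for illustration only and does not use or reproduce the glmulti API; the function name and formula strings are hypothetical, and fitting and ranking by an information criterion are omitted.

```python
from itertools import combinations

def candidate_models(variables, marginality=True):
    """List every unique formula with main effects and pairwise interactions.

    With marginality=True an interaction a:b is allowed only when both
    main effects a and b are in the model.
    """
    models = []
    for k in range(len(variables) + 1):
        for mains in combinations(variables, k):
            # Interactions are drawn from the included main effects only
            # when marginality is enforced, otherwise from all variables.
            pool = mains if marginality else variables
            pairs = list(combinations(pool, 2))
            for m in range(len(pairs) + 1):
                for inter in combinations(pairs, m):
                    terms = list(mains) + [f"{a}:{b}" for a, b in inter]
                    models.append("y ~ " + (" + ".join(terms) if terms else "1"))
    return models

marginal = candidate_models(["a", "b", "c"])
unrestricted = candidate_models(["a", "b", "c"], marginality=False)
```

With three variables the unrestricted set has 64 formulas and the marginal set 18. The count grows exponentially in the number of terms, which is why a heuristic search such as the genetic algorithm mentioned above becomes useful for larger problems.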
Salt-affected soils are a major problem worldwide for crop production. Bioinocula such as plant growth-promoting bacteria (PGPB) and arbuscular mycorrhizal fungi (AMF) can help plants to thrive in these areas but interactions between them and with soil conditions can modulate the effects on their host. To test potential synergistic effects of bioinoculants with intrinsically different functional relationships with their host in buffering the effect of saline stress, maize plants were grown under increasing soil salinity (0–5 g NaCl kg−1 soil) and inoculated with two PGPB strains (Pseudomonas reactans EDP28, and Pantoea alli ZS 3-6), one AMF (Rhizoglomus irregulare), and with the combination of both. We then modelled biomass, ion and nutrient content in maize plants in response to increasing salt concentration and microbial inoculant treatments using generalized linear models. The impacts of the different treatments on the rhizosphere bacterial communities were also analyzed. Microbial inoculants tended to mitigate ion imbalances in plants across the gradient of NaCl, promoting maize growth and nutritional status. These effects were most prominent in the treatments comprising the dual inoculation (AMF and PGPB), occurring throughout the gradient of salinity in the soil. The composition of bacterial communities of the soil was not affected by microbial treatments and was mainly driven by salt exposure. The tested bioinocula are most efficient for maize growth and health when co-inoculated, increasing the content of K+ accompanied by an effective decrease of Na+ in plant tissues. Moreover, synergistic effects potentially contribute to expanding crop production to otherwise unproductive soils. Results suggest that the combination of AMF and PGPB leads to interactions that may have a potential role in alleviating the stress and improving crop productivity in salt-affected soils.
• Salinity is increasing in soils throughout the world, affecting crop production.
• Maize growth under saline conditions can benefit from bioinoculation with PGPB and AMF.
• GLM modelling helps to predict bioinocula outcomes under increasing salinity.
• Co-inoculation of PGPB with AMF improves maize nutritional status and biomass.
• Bioinoculation is a promising tool to optimize the management of saline areas.
Accurate and reliable predictions of invasive species distributions are urgently needed by land managers for developing management plans and monitoring new potential areas of establishment. Presence-only species distribution models are commonly used in these evaluations; however, they are rarely tested with independent data over time or compared with presence-absence models fit with the same presence data. Using Maxent, we developed a presence-only model of invasive cheatgrass (Bromus tectorum L.) distribution in Rocky Mountain National Park, Colorado, USA in 2007 fit with limited data, and then tested the model with independent presence and absence data collected between 2008 and 2013. This model was verified using threshold dependent and threshold independent evaluation metrics. Next, we developed a Maxent model with cheatgrass presence data from 2007 through 2013 (i.e., Maxent 2013), and compared this model to a presence-absence method (i.e., generalized linear model; GLM 2013) using the same data. Threshold dependent and threshold independent evaluation metrics suggested Maxent 2013 outperformed GLM 2013, and a two-tailed Wilcoxon signed rank test indicated relative probability outputs were not significantly different between the models in geographic space. Based on known presences and absences of cheatgrass collected in the field, the Maxent 2013 and GLM 2013 relative probability outputs were highly correlated at absence locations but less correlated at presence locations. A Kappa comparison of Maxent 2007 and Maxent 2013 binary output provides evidence that Maxent is robust when fit with limited data. Our results indicate Maxent is an appropriate model for use when land management objectives are supported by limited resources and thus require a conservative, but highly accurate estimate of habitat suitability for invasive species on the landscape.
• A Maxent model fit with limited presence data is tested with independent test data.
• Presence-only habitat suitability model is comparable to a presence-absence model.
• A suite of model validation methods is highlighted, including comparisons in geographic space.
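The Kappa comparison of two binary (suitable / unsuitable) outputs mentioned in the abstract can be sketched with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The cell values below are invented for illustration and do not come from the study.

```python
def cohens_kappa(map_a, map_b):
    """Cohen's kappa for two equally long sequences of 0/1 cell values."""
    if len(map_a) != len(map_b) or not map_a:
        raise ValueError("maps must be non-empty and of equal length")
    n = len(map_a)
    observed = sum(1 for a, b in zip(map_a, map_b) if a == b) / n
    share_a = sum(map_a) / n  # proportion of cells predicted suitable in map A
    share_b = sum(map_b) / n
    expected = share_a * share_b + (1.0 - share_a) * (1.0 - share_b)
    if expected == 1.0:
        return 1.0  # both maps are constant and identical
    return (observed - expected) / (1.0 - expected)

# Hypothetical binary outputs of two models over the same eight grid cells.
model_2007 = [1, 1, 0, 0, 1, 0, 1, 0]
model_2013 = [1, 1, 0, 0, 1, 0, 0, 1]
kappa = cohens_kappa(model_2007, model_2013)
```

In this toy case the maps agree on six of eight cells (0.75), chance agreement is 0.5, and kappa is therefore 0.5. A value near 1 would indicate that the model fit with limited data produces nearly the same binary map as the model fit with all data.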