In high throughput applications, such as those found in bioinformatics and finance, it is important to determine accurate probability distribution functions despite only minimal information about data characteristics, and without using human subjectivity. An automated process for univariate data is implemented to achieve this goal by merging the maximum entropy method with single order statistics and maximum likelihood. The only required properties of the random variables are that they are continuous and that they are, or can be approximated as, independent and identically distributed. A quasi-log-likelihood function based on single order statistics for sampled uniform random data is used to empirically construct a sample size invariant universal scoring function. Then a probability density estimate is determined by iteratively improving trial cumulative distribution functions, where better estimates are quantified by the scoring function that identifies atypical fluctuations. This criterion resists under- and overfitting of data, as an alternative to employing the Bayesian or Akaike information criterion. Multiple estimates for the probability density reflect uncertainties due to statistical fluctuations in random samples. Scaled quantile residual plots are also introduced as an effective diagnostic to visualize the quality of the estimated probability densities. Benchmark tests show that estimates for the probability density function (PDF) converge to the true PDF as sample size increases on particularly difficult test probability densities that include cases with discontinuities, multi-resolution scales, heavy tails, and singularities. These results indicate the method has general applicability for high throughput statistical inference.
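The scaled quantile residual mentioned in this abstract can be illustrated with a minimal sketch. It assumes the residual is the difference between each sorted trial-CDF value and the mean of the corresponding uniform order statistic, k/(N+1), scaled by sqrt(N+2); that scaling follows from the variance of a uniform order statistic and is an assumption here, not a formula quoted from the abstract. The function name `scaled_quantile_residuals` is hypothetical.

```python
import numpy as np

def scaled_quantile_residuals(u):
    """Return (mu, sqr) for trial-CDF values u in [0, 1].

    Assumption: SQR_k = sqrt(N + 2) * (u_(k) - k / (N + 1)), where u_(k)
    is the k-th smallest value and k / (N + 1) is the mean of the k-th
    order statistic of N uniform random numbers.
    """
    u = np.sort(np.asarray(u, dtype=float))
    n = u.size
    k = np.arange(1, n + 1)
    mu = k / (n + 1)                      # expected position of u_(k)
    sqr = np.sqrt(n + 2) * (u - mu)       # residual scaled to be size-invariant
    return mu, sqr

# For uniform random data the residuals stay on the same scale
# regardless of the sample size.
rng = np.random.default_rng(0)
for n in (100, 10_000):
    mu, sqr = scaled_quantile_residuals(rng.uniform(size=n))
    print(n, float(np.max(np.abs(sqr))))
```

If the trial CDF is a good estimate, plotting `sqr` against `mu` should show fluctuations of similar magnitude for any sample size; systematic excursions suggest a poor fit.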
A MATLAB function is presented for nonparametric probability density estimation, based on an iterative method that employs the principle of maximum entropy and characteristic properties of single order statistics. Featuring a robust and adaptive design, the method is well-suited for high throughput applications. The implementation comprises a MATLAB interface and underlying C++ code with extensible components that can be easily integrated into third party software. The functionality includes plotting capabilities and model independent diagnostics featuring the scaled quantile residual that is invariant to sample size, distribution, and estimation method.
Previously, we developed a high throughput non-parametric maximum entropy method (PLOS ONE, 13(5): e0196937, 2018) that employs a log-likelihood scoring function to characterize uncertainty in trial probability density estimates through a scaled quantile residual (SQR). The SQR for the true probability density has universal sample size invariant properties equivalent to sampled uniform random data (SURD). Alternative scoring functions are considered that include the Anderson-Darling test. Scoring function effectiveness is evaluated using receiver operator characteristics to quantify efficacy in discriminating SURD from decoy-SURD, and by comparing overall performance characteristics during density estimation across a diverse test set of known probability distributions.
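As an illustration of how an Anderson-Darling score can separate uniform random data from non-uniform data, the sketch below computes the standard Anderson-Darling statistic against the uniform distribution on [0, 1]. The function name `anderson_darling_uniform` and the squared-uniform "decoy" are hypothetical choices for this example; they are not the scoring function or decoy construction used in the paper.

```python
import numpy as np

def anderson_darling_uniform(u):
    """Anderson-Darling statistic A^2 for testing u against Uniform(0, 1).

    A^2 = -N - (1/N) * sum_k (2k - 1) * [ln u_(k) + ln(1 - u_(N+1-k))]
    """
    u = np.sort(np.asarray(u, dtype=float))
    n = u.size
    # Clip to avoid log(0) when a value sits exactly on the boundary.
    u = np.clip(u, 1e-12, 1.0 - 1e-12)
    k = np.arange(1, n + 1)
    s = np.sum((2 * k - 1) * (np.log(u) + np.log1p(-u[::-1])))
    return -n - s / n

rng = np.random.default_rng(0)
surd = rng.uniform(size=2000)     # sampled uniform random data
decoy = surd ** 2                 # hypothetical decoy: visibly non-uniform
print(anderson_darling_uniform(surd), anderson_darling_uniform(decoy))
```

A small score is typical of uniform data, while the decoy produces a much larger value, which is the kind of separation a receiver operating characteristic analysis would quantify.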
The application of deep neural networks towards solving problems in science and engineering has demonstrated encouraging results with the recent formulation of physics-informed neural networks (PINNs). Through the development of refined machine learning techniques, the high computational cost of obtaining numerical solutions for partial differential equations governing complicated physical systems can be mitigated. However, solutions are not guaranteed to be unique, and are subject to uncertainty caused by the choice of network model parameters. For critical systems with significant consequences for errors, assessing and quantifying this model uncertainty is essential. In this paper, an application of PINN for laser bio-effects with limited training data is provided for uncertainty quantification analysis. Additionally, an efficacy study is performed to investigate the impact of the relative weights of the loss components of the PINN and how the uncertainty in the predictions depends on these weights. Network ensembles are constructed to empirically investigate the diversity of solutions across an extensive sweep of hyper-parameters to determine the model that consistently reproduces a high-fidelity numerical simulation.
• A physics-informed neural network is designed for solving the heat diffusion equation.
• An ensemble method increases accuracy of predictions and quantifies uncertainty.
• A weighting heuristic automatically normalizes individual components of loss function.
• Equitable convergence amongst competing minimization objectives is enforced.
• Network design parameters are optimized for both accuracy and reliability.
We present a novel nonparametric adaptive partitioning and stitching (NAPS) algorithm to estimate a probability density function (PDF) of a single variable. Sampled data is partitioned into blocks using a branching tree algorithm that minimizes deviations from a uniform density within blocks of various sample sizes arranged in a staggered format. The block sizes are constructed to balance the load in parallel computing as the PDF for each block is independently estimated using the nonparametric maximum entropy method (NMEM) previously developed for automated high throughput analysis. Once all block PDFs are calculated, they are stitched together to provide a smooth estimate throughout the sample range. Each stitch is an averaging process over weight factors based on the estimated cumulative distribution function (CDF) and a complementary CDF that characterize how data from flanking blocks overlap. Benchmarks on synthetic data show that our PDF estimates are fast and accurate for sample sizes ranging from 2⁹ to 2²⁷, across a diverse set of distributions that account for single and multi-modal distributions with heavy tails or singularities. We also generate estimates by replacing NMEM with kernel density estimation (KDE) within blocks. Our results indicate that NAPS(NMEM) is the best-performing method overall, while NAPS(KDE) improves estimates near boundaries compared to standard KDE.
Our study measured heterotrophic carbon dioxide (CO₂) emissions in a drained peatland under potato cultivation in south-western Uganda. Soil carbon losses have not previously been reported for this land use, and our study set out to capture the range and temporal variation in emissions, as well as investigate relationships with key environmental variables. Soil chamber-based emission measurements were taken over five days at four points in time over the year to capture daily and monthly variability, including day and night sampling to capture any diurnal variations in temperatures and soil flux. Differences in soil microtopography from mounding of soils for potato beds and drainage trenches had a significant effect on the rate of soil flux. Diurnal sampling showed no significant difference in emissions or soil temperatures in the raised potato beds between day and night. More significant effects on soil flux from environmental drivers, such as water table depth, were observed between months, rather than hours and days. There were significant differences in the relationships between environmental variables and soil flux, depending on if soils had been recently disturbed or not. Area-weighted emissions based on microtopography gave a mean annual emissions factor of 98.79 ± 1.7 t CO₂ ha⁻¹ y⁻¹ (± standard error) from this peatland use.
Nonparametric estimation for a probability density function that describes multivariate data has typically been addressed by kernel density estimation (KDE). A novel density estimator recently developed by Farmer and Jacobs offers an alternative high-throughput automated approach to univariate nonparametric density estimation based on maximum entropy and order statistics, improving accuracy over univariate KDE. This article presents an extension of the single variable case to multiple variables. The univariate estimator is used to recursively calculate a product array of one-dimensional conditional probabilities. In combination with interpolation methods, a complete joint probability density estimate is generated for multiple variables. Good accuracy and speed performance in synthetic data are demonstrated by a numerical study using known distributions over a range of sample sizes from 100 to 10⁶ for two to six variables. Performance in terms of speed and accuracy is compared to KDE. The multivariate density estimate developed here tends to perform better as the number of samples and/or variables increases. As an example application, measurements are analyzed over five filters of photometric data from the Sloan Digital Sky Survey Data Release 17. The multivariate estimation is used to form the basis for a binary classifier that distinguishes quasars from galaxies and stars with up to 94% accuracy.
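The "product of one-dimensional conditional probabilities" described in this abstract is consistent with the chain rule of probability, written out here for d variables. The ordering of the variables and the symbols x_1, ..., x_d are assumptions made for illustration; the abstract does not state how the article orders or conditions the variables.

```latex
p(x_1, x_2, \dots, x_d)
  = p(x_1)\,\prod_{j=2}^{d} p\!\left(x_j \mid x_1, \dots, x_{j-1}\right)
```

Each factor on the right is a density in a single variable, so in principle each can be estimated with a univariate estimator applied to a suitably conditioned subset of the data.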
Molecular dynamics simulation is commonly employed to explore protein dynamics. Despite the disparate timescales between functional mechanisms and molecular dynamics (MD) trajectories, functional differences are often inferred from differences in conformational ensembles between two proteins in structure-function studies that investigate the effect of mutations. A common measure to quantify differences in dynamics is the root mean square fluctuation (RMSF) about the average position of residues defined by Cα-atoms. Using six MD trajectories describing three native/mutant pairs of beta-lactamase, we make comparisons with additional measures that include Jensen-Shannon, modifications of Kullback-Leibler divergence, and local p-values from 1-sample Kolmogorov-Smirnov tests. These additional measures require knowing a probability density function, which we estimate by using a nonparametric maximum entropy method that quantifies rare events well. The same measures are applied to distance fluctuations between Cα-atom pairs. Results from several implementations for quantitative comparison of a pair of MD trajectories are made based on fluctuations for on-residue and residue-residue local dynamics. We conclude that there is almost always a statistically significant difference between pairs of 100 ns all-atom simulations on moderate-sized proteins as evident from extraordinarily low p-values.
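To make the Jensen-Shannon measure concrete, here is a minimal sketch that compares two densities tabulated on a shared grid. It assumes the densities are already available as arrays, which stands in for the nonparametric maximum entropy estimates used in the study; the function name `jensen_shannon` and the example values are hypothetical.

```python
import numpy as np

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    p and q are non-negative weights on the same grid; they are
    normalized here, so unnormalized histogram counts are acceptable.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)                     # mixture distribution

    def kl(a, b):
        mask = a > 0                      # 0 * log(0) is treated as 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical fluctuation histograms for a native and a mutant trajectory.
native = [5, 20, 50, 20, 5]
mutant = [2, 10, 40, 35, 13]
print(jensen_shannon(native, mutant))
```

With base-2 logarithms the result lies between 0 (identical distributions) and 1 (no overlap), which makes values comparable across residues.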
Without accurate data on soil heterotrophic respiration (Rh), assessments of soil carbon (C) sequestration rate and C balance are challenging to produce. Accordingly, it is essential to determine the contribution of the different sources of the total soil CO₂ efflux (Rs) in different ecosystems, but to date, there are still many uncertainties and unknowns regarding the soil respiration partitioning procedures currently available. This study compared the suitability and relative accuracy of five different Rs partitioning methods in a subtropical forest: (1) regression between root biomass and CO₂ efflux, (2) lab incubations with minimally disturbed soil microcosm cores, (3) root exclusion bags with hand-sorted roots, (4) root exclusion bags with intact soil blocks and (5) soil δ¹³C–CO₂ natural abundance. The relationship between Rh and soil moisture and temperature was also investigated. A qualitative evaluation table of the partition methods with five performance parameters was produced. The Rs was measured weekly from 3 February to 19 April 2017 and found to average 6.1 ± 0.3 Mg C ha⁻¹ yr⁻¹. During this period, the Rh measured with the in situ mesh bags with intact soil blocks and hand-sorted roots was estimated to contribute 49 ± 7 and 79 ± 3 % of Rs, respectively. The Rh percentages estimated with the root biomass regression, microcosm incubation and δ¹³C–CO₂ natural abundance were 54 ± 41, 8–17 and 61 ± 39 %, respectively. Overall, no systematically superior or inferior Rs partition method was found. The paper discusses the strengths and weaknesses of each technique with the conclusion that combining two or more methods optimizes Rh assessment reliability.
The role of soils in provision of energy
Smith, Jo; Farmer, Jenny; Smith, Pete ...
Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 09/2021, Volume 376, Issue 1834
Journal Article, Peer reviewed, Open access
Soils have both direct and indirect impacts on available energy, but energy provision, in itself, has direct and indirect impacts on soils. Burning peats provides only approximately 0.02% of global energy supply yet emits approximately 0.7-0.8% of carbon losses from land-use change and forestry (LUCF). Bioenergy crops provide approximately 0.3% of energy supply and occupy approximately 0.2-0.6% of harvested area. Increased bioenergy demand is likely to encourage switching from forests and pastures to rotational energy cropping, resulting in soil carbon loss. However, with protective policies, incorporation of residues from energy provision could sequester approximately 0.4% of LUCF carbon losses. All organic wastes available in 2018 could provide approximately 10% of global energy supply, but at a cost to soils of approximately 5% of LUCF carbon losses; not using manures avoids soil degradation but reduces energy provision to approximately 9%. Wind farms and hydroelectric, solar and geothermal schemes provide approximately 3.66% of energy supply and occupy less than approximately 0.3% of harvested area, but if sited on peatlands could result in carbon losses that exceed reductions in fossil fuel emissions. To ensure renewable energy provision does not damage our soils, comprehensive policies and management guidelines are needed that (i) avoid peats, (ii) avoid converting permanent land uses (such as perennial grassland or forestry) to energy cropping, and (iii) return residues remaining from energy conversion processes to the soil. This article is part of the theme issue 'The role of soils in delivering Nature's Contributions to People'.