We develop a new approach for feature selection via gain penalization in tree-based models. First, we show that previous methods do not perform sufficient regularization and often exhibit sub-optimal out-of-sample performance, especially when correlated features are present. Instead, we develop a new gain penalization idea that exhibits a general local-global regularization for tree-based models. The new method allows for full flexibility in the choice of feature-specific importance weights, while also applying a global penalization. We validate our method on both simulated and real data, exploring how the hyperparameters interact, and we provide the implementation as an extension of the popular R package ranger.
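The idea of penalizing split gains can be illustrated with a small sketch. It assumes a commonly used form of gain penalization in which the gain of a feature not yet used by the model is multiplied by a weight in (0, 1], and it assumes a local-global weight of the form lambda_i = (1 - gamma) * lambda_0 + gamma * g_i, where g_i is a feature-specific importance score. Both formulas and all names below (`local_global_weights`, `penalized_gain`, `pick_split`) are illustrative assumptions; they are not taken from the abstract and are not the ranger interface.

```python
def local_global_weights(importances, lambda0, gamma):
    """Blend a global penalty lambda0 with per-feature scores g_i in [0, 1].

    Assumed form: lambda_i = (1 - gamma) * lambda0 + gamma * g_i.
    """
    return {f: (1 - gamma) * lambda0 + gamma * g for f, g in importances.items()}


def penalized_gain(gain, feature, used, weights):
    """Leave the gain untouched for features already in the model,
    shrink it by the feature's weight otherwise."""
    return gain if feature in used else weights[feature] * gain


def pick_split(gains, used, weights):
    """Return the feature with the largest penalized gain and that gain."""
    scored = {f: penalized_gain(g, f, used, weights) for f, g in gains.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Toy example: x2 is a new feature with a slightly higher raw gain than the
# already-used x1, as can happen when the two are correlated.
raw_gains = {"x1": 0.50, "x2": 0.55}
weights = local_global_weights({"x1": 0.9, "x2": 0.4}, lambda0=0.8, gamma=0.5)
best_feature, best_gain = pick_split(raw_gains, used={"x1"}, weights=weights)
print(best_feature, round(best_gain, 3))
```

In this toy case the penalty keeps the split on the feature already in the model, which is the behaviour that discourages adding redundant correlated features.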
The relative contributions of both copy number variants (CNVs) and single nucleotide polymorphisms (SNPs) to the additive genetic variance of carcass traits in cattle are not well understood. A detailed understanding of the relative importance of CNVs in cattle may have implications for study design of both genomic predictions and genome-wide association studies. The first objective of the present study was to quantify the relative contributions of CNV data and SNP genotype data to the additive genetic variance of carcass weight, fat, and conformation for 945 Charolais, 923 Holstein-Friesian, and 974 Limousin sires. The second objective was to jointly consider SNP and CNV data in a least absolute shrinkage and selection operator (LASSO) regression model to identify genomic regions associated with carcass weight, fat, and conformation within each of the three breeds separately. A genomic relationship matrix (GRM) based on just CNV data did not capture any variance in the three carcass traits when jointly evaluated with a SNP-derived GRM. In the LASSO regression analysis, a total of 987 SNPs and 18 CNVs were associated with at least one of the three carcass traits in at least one of the three breeds. The quantitative trait loci (QTLs) corresponding to the associated SNPs and CNVs overlapped with several candidate genes, including previously reported candidate genes such as MSTN and RSAD2, and several potential novel candidate genes such as ACTN2 and THOC1. The results of the LASSO regression analysis demonstrated that CNVs can be used to detect associations with carcass traits which were not detected using the set of SNPs available in the present study. Therefore, the CNVs and SNPs available in the present study were not redundant forms of genomic data.
The ability to conduct in-situ real-time process-structure-property checks has the potential to overcome process and material uncertainties, which are key obstacles to improved uptake of metal powder bed fusion in industry. Efforts are underway for live process monitoring such as thermal and image-based data gathering for every layer printed. Current crystal plasticity finite element (CPFE) modelling is capable of predicting the associated strength based on a microstructural image and material data but is computationally expensive. This work utilizes a large database of input–output samples from CPFE modelling to develop a trained deep neural network (DNN) model which instantly estimates the output (strength prediction) associated with a given input (microstructure) of multi-phase additively manufactured stainless steels. The DNN model successfully recognizes phase regions and the associated unique crystallographic orientation variations. It also captures differences in macroscopic stress response due to the varying microstructure. However, it is less reliable in terms of fatigue life predictions. The DNN model exhibits high accuracy for the structure–property relationship as a surrogate prediction tool compared to CPFE while significantly reducing the computational cost to just a few seconds.
Archaeological evidence shows that a predecessor of the 2004 Indian Ocean tsunami devastated nine distinct communities along a 40-km section of the northern coast of Sumatra in about 1394 CE. Our evidence is the spatial and temporal distribution of tens of thousands of medieval ceramic sherds and over 5,000 carved gravestones, collected and recorded during a systematic landscape archaeology survey near the modern city of Banda Aceh. Only the trading settlement of Lamri, perched on a headland above the reach of the tsunami, survived into and through the subsequent 15th century. It is of historical and political interest that by the 16th century, however, Lamri was abandoned, while low-lying coastal sites destroyed by the 1394 tsunami were resettled as the population center of the new economically and politically ascendant Aceh Sultanate. Our evidence implies that the 1394 tsunami was large enough to severely impact many of the areas inundated by the 2004 tsunami and to provoke a significant reconfiguration of the region’s political and economic landscape that shaped the history of the region in subsequent centuries.
Nowadays, children have access to the Internet on a regular basis. Just like the real world, the Internet has many unsafe locations where kids may be exposed to inappropriate content in the form of obscene, aggressive, erotic or rude comments. In this work, we address the problem of detecting erotic/sexual content in text documents using Natural Language Processing (NLP) techniques. Following an approach based on Machine Learning techniques, we have assessed twelve models resulting from the combination of three text encoders (Bag of Words, Term Frequency-Inverse Document Frequency (TF-IDF) and Word2vec) together with four classifiers (Support Vector Machines (SVMs), Logistic Regression, k-Nearest Neighbors and Random Forests). We evaluated these alternatives on a newly created dataset extracted from public data on the Reddit website. The best performance was achieved by the combination of the TF-IDF text encoder and the SVM classifier with a linear kernel, with an accuracy of 0.97 and an F-score of 0.96 (precision 0.96/recall 0.95). This study demonstrates that it is possible to detect erotic content in text documents and, therefore, to develop filters for minors or according to users’ preferences.
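Of the three encoders compared, TF-IDF is simple enough to sketch with the standard library alone. The version below uses one common variant (relative term frequency times log(N/df), with whitespace tokenization). The function name `tfidf` and the toy documents are hypothetical, and the study's actual preprocessing and weighting variant are not described in the abstract.

```python
import math
from collections import Counter


def tfidf(docs):
    """Encode each document as a {term: weight} dictionary.

    weight = (count of term in doc / doc length) * log(N / document frequency)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


toy_docs = ["the cat sat", "the dog ran"]
vectors = tfidf(toy_docs)
# A term that occurs in every document carries no weight under this variant,
# while terms unique to one document get a positive weight.
print(vectors[0])
```

The resulting sparse vectors are the kind of input a linear classifier such as an SVM would then be trained on.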
Late Holocene relative sea-level (RSL) reconstructions can be used to estimate rates of land-level (subsidence or uplift) change and therefore to modify global sea-level projections for regional conditions. These reconstructions also provide the long-term benchmark against which modern trends are compared and an opportunity to understand the response of sea level to past climate variability. To address a spatial absence of late Holocene data in Florida and Georgia, we reconstructed ~1.3 m of RSL rise in northeastern Florida (USA) during the past ~2600 years using plant remains and foraminifera in a dated core of high salt-marsh sediment. The reconstruction was fused with tide-gauge data from nearby Fernandina Beach, which measured 1.91 ± 0.26 mm/year of RSL rise since 1900 CE. The average rate of RSL rise prior to 1800 CE was 0.41 ± 0.08 mm/year. Assuming negligible change in global mean sea level from meltwater input/removal and thermal expansion/contraction, this sea-level history approximates net land-level (subsidence and geoid) change, principally from glacio-isostatic adjustment. Historic rates of rise commenced at 1850–1890 CE and it is virtually certain (P=0.99) that the average rate of 20th century RSL rise in northeastern Florida was faster than during any of the preceding 26 centuries. The linearity of RSL rise in Florida is in contrast to the variability reconstructed at sites further north on the U.S. Atlantic coast and may suggest a role for ocean dynamic effects in explaining these more variable RSL reconstructions. Comparison of the difference between reconstructed rates of late Holocene RSL rise and historic trends measured by tide gauges indicates that 20th century sea-level trends along the U.S. Atlantic coast were not dominated by the characteristic spatial fingerprint of melting of the Greenland Ice Sheet.
• 2600 years of relative sea-level change reconstructed in Florida.
• Historic rate of rise began in the late 19th century.
• Spatial variability among reconstructions suggests role for ocean dynamics.
• Greenland melt was not the primary cause of 20th century sea-level change.
Summary
We consider the analysis of count data in which the observed frequency of zero counts is unusually large, typically with respect to the Poisson distribution. We focus on two alternative modelling approaches: over‐dispersion (OD) models and zero‐inflation (ZI) models, both of which can be seen as generalisations of the Poisson distribution; we refer to these as implicit and explicit ZI models, respectively. Although sometimes seen as competing approaches, they can be complementary; OD is a consequence of ZI modelling, and ZI is a by‐product of OD modelling. The central objective in such analyses is often concerned with inference on the effect of covariates on the mean, in light of the apparent excess of zeros in the counts. Typically, the modelling of the excess zeros per se is a secondary objective, and there are choices to be made between, and within, the OD and ZI approaches. The contribution of this paper is primarily conceptual. We contrast, descriptively, the impact on zeros of the two approaches. We further offer a novel descriptive characterisation of alternative ZI models, including the classic hurdle and mixture models, by providing a unifying theoretical framework for their comparison. This in turn leads to a novel and technically simpler ZI model. We develop the underlying theory for univariate counts and touch on its implication for multivariate count data.
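The statement that OD is a consequence of ZI modelling can be checked numerically with the two classic explicit ZI models named above. The sketch below uses the textbook Poisson mixture and Poisson hurdle forms, not the novel model the paper proposes; the function names and parameter values are illustrative assumptions.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability, computed on the log scale for numerical stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def mixture_zi_pmf(k, lam, pi):
    """Mixture model: a structural zero with probability pi, else Poisson(lam)."""
    base = (1 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


def hurdle_pmf(k, lam, p0):
    """Hurdle model: zero with probability p0, else a zero-truncated Poisson(lam)."""
    if k == 0:
        return p0
    return (1 - p0) * poisson_pmf(k, lam) / (1 - math.exp(-lam))


def mean_and_variance(pmf, kmax=100):
    """Moments from the pmf, truncating the support at kmax (tail is negligible here)."""
    mean = sum(k * pmf(k) for k in range(kmax))
    var = sum((k - mean) ** 2 * pmf(k) for k in range(kmax))
    return mean, var


lam, pi = 2.0, 0.3
zi_mean, zi_var = mean_and_variance(lambda k: mixture_zi_pmf(k, lam, pi))
hu_mean, hu_var = mean_and_variance(lambda k: hurdle_pmf(k, lam, 0.5))
# For the mixture model the closed forms are mean = (1 - pi) * lam and
# variance = (1 - pi) * lam * (1 + pi * lam), so variance exceeds the mean.
print(zi_mean, zi_var)
print(hu_mean, hu_var)
```

With these values both models give a variance larger than the mean, whereas a plain Poisson distribution would have the two equal.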
Enabling control over macromolecular ordering and the spatial distribution of structures formed via the mechanisms of molecular self-assembly is a challenge that could yield a range of new functional materials. In particular, using the self-assembly of minimalist peptides to drive the incorporation of large complex molecules will allow a functionalization strategy for the next generation of biomaterial engineering. Here, for the first time, we show that, through co-assembly with increasing concentrations of a highly charged polysaccharide, fucoidan, the microscale ordering of Fmoc-FRGDF peptide fibrils and the subsequent mechanical properties of the resultant hydrogel can be easily and effectively manipulated without disruption to the nanofibrillar structure of the assembly.
In the last decade, various spatial and temporal methodologies were developed to investigate the processes that drive ecological and evolutionary patterns. However, these methods frequently fail to acknowledge that the observed patterns result from the overlap of different underlying processes. In order to understand how the patterns are formed, we must have recourse to methods that allow us to disentangle these simultaneous processes. Here we develop a hierarchical spatial predictive process (PP) combined with a separable temporal PP to disentangle and describe those overlapping processes in one very frequent setting in ecology and evolution: multilevel spatio-temporally indexed data. We present our methodology through a case study of fisheries discards and investigate, for example, whether the inclusion of the hierarchical structure and the temporal processes of the system alters the observed spatial patterns. It has recently been recognized that understanding the processes driving discards is essential to sustainably manage and conserve marine resources. The results show that consideration of multiple underlying processes dramatically changes the pattern and characteristics of the discards hot- and coldspots. In the Irish Sea, the inclusion of the hierarchical structure of the system leads to the reduction of the hot- and coldspots. Simultaneously, our model identifies key bi-annual fluctuations in the temporal process which, together with the variance associated at the level of individual fishing trips in the hierarchical structure of the data, explained most of the variance driving discards. Whether the hierarchical, spatial and temporal processes are considered together or not can profoundly alter our understanding of what constitutes an appropriate mitigation measure. Misidentification of hotspots can culminate in inappropriate mitigation practices which can sometimes be irreversible.
As the proposed method offers a unified approach for understanding the processes that drive observed patterns, many areas in ecology such as conservation and epidemiological studies can benefit from its use, increasing the effectiveness of management plans.