Challenges of Big Data analysis Fan, Jianqing; Han, Fang; Liu, Han
National Science Review, 06/2014, Volume 1, Issue 2
Journal Article
Peer-reviewed
Open access
Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that are not possible with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinguished and require new computational and statistical paradigms. This paper gives overviews of the salient features of Big Data and how these features impact paradigm changes in statistical and computational methods as well as computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in the high-confidence set and point out that the exogeneity assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity. They can lead to wrong statistical inferences and consequently wrong scientific conclusions.
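To make the spurious-correlation point concrete, here is a minimal simulation sketch (not from the paper; the sample sizes are arbitrary): with far more features than samples, some feature will correlate strongly with an independent target purely by chance.

```python
# Minimal sketch (not from the paper): with many more features than
# samples, some feature correlates strongly with a target purely by
# chance -- the "spurious correlation" the abstract warns about.
import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 5000                      # few samples, many dimensions
X = rng.standard_normal((n, p))      # features, independent of y
y = rng.standard_normal(n)           # target, pure noise

# Correlation of each feature with y; all true correlations are 0.
Xc = (X - X.mean(0)) / X.std(0)
yc = (y - y.mean()) / y.std()
corr = Xc.T @ yc / n

print(f"max |corr| across {p} null features: {np.abs(corr).max():.2f}")
# Typically around 0.5: an apparently 'strong' predictor that is noise.
```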
Exosomes are well-known key mediators of intercellular communication and contribute to various physiological and pathological processes. Their biogenesis involves four key steps: cargo sorting, multivesicular body (MVB) formation and maturation, transport of MVBs, and MVB fusion with the plasma membrane. Each process is modulated through the competition or coordination of multiple mechanisms, whereby diverse repertoires of molecular cargos are sorted into distinct subpopulations of exosomes, resulting in the high heterogeneity of exosomes. Intriguingly, cancer cells exploit various strategies, such as aberrant gene expression, posttranslational modifications, and altered signaling pathways, to regulate the biogenesis, composition, and eventually functions of exosomes to promote cancer progression. Therefore, exosome biogenesis-targeted therapy is being actively explored. In this review, we systematically summarize recent progress in understanding the machinery of exosome biogenesis and how it is regulated in the context of cancer. In particular, we highlight pharmacological targeting of exosome biogenesis as a promising cancer therapeutic strategy.
Hydrogeological models require reliable uncertainty intervals that honestly reflect the total uncertainties of model predictions. A conventional Bayesian framework only produces realistic (interpretable in the context of the natural system) inference results if the model structure matches the data‐generating process; that is, applying Bayes' theorem implicitly assumes the underlying model to be true. With an imperfect model, we may obtain a too‐narrow‐for‐its‐bias uncertainty interval when conditioning on a long time series of calibration data, because the assumption of a quasi‐true model becomes too strict. To overcome the problem of overconfident posteriors, we propose a non‐parametric Bayesian method, called the Tau‐averaging method: it applies Bayesian analysis on sliding time windows along the calibration data time series, obtaining so‐called transitional posteriors per time window. Then, we average these into a wider predictive posterior (a minimal algorithmic sketch follows the key points below). With the proposed routine, we explicitly capture the time‐varying impact of model error on prediction uncertainty. The length of the calibration window is optimized to maximize goal‐oriented statistical skill scores for predictive coverage. Our method loosens the perfect‐model assumption by conditioning only on small windows of the data set at a time; that is, it assumes that "the model is sufficient to follow the system dynamics for a smaller duration." We test our method on two cases of soil moisture modeling and show how it improves predictive coverage compared to the conventional Bayesian approach. Our findings demonstrate that the proposed method convincingly overcomes the overconfidence drawback of Bayesian inference under model misspecification and long calibration time series.
Plain Language Summary
Mathematical models mimic environmental systems to match what we observe and to predict what will happen. Unfortunately, such models are always simplifications of reality, balancing their complexity between manageability and accuracy. Consequently, interpreting model‐based conclusions requires caution. Assume a model has ten adjustable parameters to make it match a system. The best achievable fit to observations is imperfect. Yet, statistical tools indicate that we know these parameters perfectly well after adjustment, especially when adjusting to long data series. Then, we might start believing that this model's adjusted predictions are perfect. We call this "overconfidence." Ways to overcome overconfidence include extending models with statistical components, making them predict intervals and probabilities rather than exact numbers. However, adjusting these additional statistical components has been difficult to date. In our new approach, we force the model to match only short time windows of the data, and we move this window through the whole data set. As we use little data per window, we reduce the overconfidence effect. Instead, the model adjusts parameters and predicts outputs differently in each window. To make predictions, we combine the outputs into a more robust result, such that the testing data fall inside the intervals generated by our method.
Key Points
We propose a data‐driven Bayesian method to obtain realistic uncertainty estimates despite model errors
Our method builds on a statistically rigorous, time‐windowed Bayesian framework without prior assumptions about error sources or patterns
The method is confirmed to provide realistic predictive coverage with two synthetic test cases and a real‐world lysimeter case study
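The sketch below illustrates the Tau-averaging idea under strong simplifying assumptions; it is not the authors' code. The forward model, Gaussian likelihood, prior, window length, and noise level are all placeholders.

```python
# Minimal sketch (not the authors' code) of Tau-averaging: condition
# on sliding windows of the calibration series and average the
# per-window ("transitional") predictive posteriors.
import numpy as np

def model(theta, t):
    # Hypothetical forward model: exponential recession with rate theta.
    return np.exp(-theta * t)

def tau_average(t, y, prior_samples, window=20, sigma=0.05):
    """Average transitional posteriors over sliding windows."""
    preds = []
    for start in range(len(t) - window + 1):
        sl = slice(start, start + window)
        # Gaussian likelihood of each prior sample on this window only.
        resid = y[sl] - np.array([model(th, t[sl]) for th in prior_samples])
        logw = -0.5 * np.sum(resid**2, axis=1) / sigma**2
        w = np.exp(logw - logw.max())
        w /= w.sum()
        # Posterior-weighted prediction of the full series for this window.
        preds.append(w @ np.array([model(th, t) for th in prior_samples]))
    # Equal-weight average over windows -> widened predictive posterior.
    return np.mean(preds, axis=0), np.std(preds, axis=0)

t = np.linspace(0, 5, 100)
rng = np.random.default_rng(1)
y = model(0.8, t) + 0.05 * rng.standard_normal(t.size)  # synthetic data
mean, spread = tau_average(t, y, prior_samples=rng.uniform(0.1, 2.0, 500))
```

Averaging the per-window predictions widens the predictive distribution wherever the windows disagree, which is what restores honest coverage under model error.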
Structural variations are the greatest source of genetic variation, but they remain poorly understood because of technological limitations. Single-molecule long-read sequencing has the potential to dramatically advance the field, although high error rates are a challenge with existing methods. Addressing this need, we introduce open-source methods for long-read alignment (NGMLR; https://github.com/philres/ngmlr ) and structural variant identification (Sniffles; https://github.com/fritzsedlazeck/Sniffles ) that provide unprecedented sensitivity and precision for variant detection, even in repeat-rich regions and for complex nested events that can have substantial effects on human health. In several long-read datasets, including healthy and cancerous human genomes, we discovered thousands of novel variants and categorized systematic errors in short-read approaches. NGMLR and Sniffles can automatically filter false events and operate on low-coverage data, thereby reducing the high costs that have hindered the application of long reads in clinical and research settings.
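A minimal pipeline sketch matching the workflow the abstract describes: align with NGMLR, sort, then call structural variants with Sniffles. The flags follow the tools' v1 command-line interfaces as documented in their READMEs, but should be checked against the installed versions; all file names are placeholders.

```python
# Sketch of the long-read SV pipeline: NGMLR alignment -> samtools
# sort/index -> Sniffles SV calling. Paths are placeholders; verify
# flags against your installed tool versions.
import subprocess

ref, reads = "reference.fasta", "reads.fastq"  # placeholder inputs

# 1) Map long reads with NGMLR (convex gap costs help span SVs).
subprocess.run(["ngmlr", "-r", ref, "-q", reads, "-o", "aln.sam"], check=True)

# 2) Sort and index with samtools (required by Sniffles).
subprocess.run(["samtools", "sort", "-o", "aln.sorted.bam", "aln.sam"], check=True)
subprocess.run(["samtools", "index", "aln.sorted.bam"], check=True)

# 3) Call SVs with Sniffles (v1 interface: -m BAM input, -v VCF output).
subprocess.run(["sniffles", "-m", "aln.sorted.bam", "-v", "svs.vcf"], check=True)
```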
Deterministic hydrological models with uncertain, but inferred‐to‐be‐time‐invariant, parameters typically show time‐dependent model errors. Such errors can occur if a hydrological process is active in certain time periods in nature, but is not resolved by the model or by its input. Such missing processes can become visible during calibration as time‐dependent best‐fit values of model parameters. We propose a formal time‐windowed Bayesian analysis to diagnose this type of model error, formalizing the question "In which period of the calibration time series does the model statistically disqualify itself as quasi‐true?" Using Bayesian model evidence (BME) as the model performance metric, we determine how much the data in time windows of the calibration time series support or refute the model. Then, we track BME over sliding time windows to obtain a dynamic, time‐windowed BME (tBME) and search for sudden decreases that indicate an onset of model error (a minimal sketch follows the key points below). tBME also allows us to perform a formal, sliding likelihood‐ratio test of the model against the data. Our proposed approach is designed to detect error occurrence on various temporal scales, which is especially useful in hydrological modeling. We illustrate this by applying the proposed method to soil moisture modeling. We test tBME as a model error indicator on several synthetic and real‐world test cases that we designed to vary in error sources (structure and input) and error time scales. The results demonstrate successful detection of errors in dynamic models. Moreover, the time sequence of posterior parameter distributions helps to investigate the reasons for model error and provides guidance for model improvement.
Key Points
We propose a data‐driven method for model‐structural error detection
Our method rests on a statistically rigorous Bayesian framework without prior assumptions about error sources or patterns
We confirm successful error detection on various temporal scales in synthetic test cases and present insights from a real‐world case study
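A minimal sketch of the tBME idea (not the authors' code): estimate the Bayesian model evidence per sliding window as the prior-averaged window likelihood and watch for sudden drops. The forward model, Gaussian likelihood, prior, and window length are placeholders.

```python
# Minimal tBME sketch: Monte-Carlo estimate of Bayesian model evidence
# (BME) per sliding window; a sharp decrease in log-BME suggests the
# onset of model error in that period. All ingredients are placeholders.
import numpy as np

def log_bme_windows(t, y, simulate, prior_samples, window=20, sigma=0.05):
    """Log-BME per sliding window, estimated by prior Monte Carlo."""
    sims = np.array([simulate(th, t) for th in prior_samples])  # (N, T)
    scores = []
    for start in range(len(t) - window + 1):
        sl = slice(start, start + window)
        # Gaussian log-likelihood of each prior sample on this window.
        loglik = (-0.5 * np.sum((y[sl] - sims[:, sl]) ** 2, axis=1) / sigma**2
                  - 0.5 * window * np.log(2 * np.pi * sigma**2))
        # BME = prior expectation of the likelihood (log-sum-exp trick).
        m = loglik.max()
        scores.append(m + np.log(np.mean(np.exp(loglik - m))))
    return np.array(scores)

# Usage sketch: flag windows whose log-BME drops far below the median.
# scores = log_bme_windows(t, y, simulate, prior_samples)
# suspect = scores < np.median(scores) - 3 * scores.std()
```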
There are about 3.7σ positive and 2.4σ negative deviations in the muon and electron anomalous magnetic moments (g−2), respectively. Also, some ratios testing lepton universality in τ decays show almost 2σ deviations from the Standard Model. In this paper, we propose a lepton-specific inert two-Higgs-doublet model. After imposing all the relevant theoretical and experimental constraints, we show that these lepton anomalies can be explained simultaneously in large regions of parameter space with m > 200 GeV and mA (mH±) > 500 GeV for appropriate Yukawa couplings between the leptons and the inert Higgs bosons. The key point is that these Yukawa couplings for μ and τ/e have opposite signs.
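For orientation, the standard definitions behind the abstract's notation (textbook conventions, not specific to this paper) are:

```latex
% Standard definitions: the anomalous magnetic moment of a lepton
% \ell and its deviation from the Standard Model prediction.
\[
  a_\ell \;\equiv\; \frac{g_\ell - 2}{2},
  \qquad
  \Delta a_\ell \;\equiv\; a_\ell^{\mathrm{exp}} - a_\ell^{\mathrm{SM}} .
\]
% The deviations quoted above correspond to \Delta a_\mu > 0 at
% roughly 3.7\sigma and \Delta a_e < 0 at roughly 2.4\sigma.
```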
Screen-Shooting Resilient Watermarking Fang, Han; Zhang, Weiming; Zhou, Hang ...
IEEE Transactions on Information Forensics and Security, 06/2019, Volume 14, Issue 6
Journal Article
Peer-reviewed
This paper proposes a novel screen-shooting resilient watermarking scheme: if the watermarked image is displayed on a screen and the screen is captured with a camera, we can still extract the watermark message from the captured photo. To meet this demand, we analyzed the particular distortions caused by the screen-shooting process, including lens distortion, light source distortion, and moiré distortion. To resist the geometric deformation caused by lens distortion, we propose an intensity-based scale-invariant feature transform (I-SIFT) algorithm that can accurately locate the embedding regions. For the loss of image detail caused by light source distortion and moiré distortion, we put forward a small-size template algorithm that repeatedly embeds the watermark into different regions, so that at least one complete information region survives the distortions. At the extraction side, we designed a cross-validation-based extraction algorithm to cope with the repeated embedding; the validity and correctness of the extraction method are verified by hypothesis testing. Furthermore, to boost extraction speed, we propose a SIFT feature editing algorithm that enhances the intensity of the keypoints, on the basis of which both extraction accuracy and extraction speed can be greatly improved. The experimental results show that the proposed watermarking scheme achieves high robustness to the screen-shooting process. Compared with previous schemes, our algorithm provides significant improvements in robustness to screen shooting and in extraction efficiency.
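For intuition, here is a minimal OpenCV sketch of the keypoint-based region-location step. This uses plain SIFT, not the paper's I-SIFT; the response-based ranking merely stands in for the intensity-based selection, and the file name and region count are placeholders.

```python
# Minimal sketch (requires opencv-python >= 4.4): detect SIFT
# keypoints and pick the strongest ones as candidate embedding
# regions, standing in for the paper's I-SIFT region location.
import cv2

img = cv2.imread("cover.png", cv2.IMREAD_GRAYSCALE)  # placeholder image
sift = cv2.SIFT_create()
keypoints = sift.detect(img, None)

# Rank keypoints by response strength; strong, repeatable keypoints
# are more likely to be re-detected in a photo of the screen.
keypoints = sorted(keypoints, key=lambda k: k.response, reverse=True)

# Take the top keypoints as centers of (hypothetical) embedding regions
# where the watermark template would be repeatedly embedded.
regions = [(int(k.pt[0]), int(k.pt[1])) for k in keypoints[:10]]
print(regions)
```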
In this paper, the convergence and generalization performance of Random Forest are used to improve the classification accuracy for target variables, and the robustness and classification accuracy of Random Forest are dramatically improved by a conditional Random Forest, which is trained to generate a Random Forest model for head pose estimation. The improved Random Forest algorithm is designed using logistic regression, and a new classroom teaching model for vocational education is constructed with it. Taking the students of secondary school A in G city as the research subjects, the teaching model constructed in this study is applied to the "Information Technology Teaching Literacy" classroom and analyzed empirically from three aspects: learners' cognitive level, their emotional state, and a comparison of students' performance. The results show that, compared with achievement before teaching, students' achievement improves greatly, by 0.2542, and the average score is 86.49, an increase of 18.28 points. This shows that the new classroom teaching design for vocational education presented in this paper yields significant results, improving students' achievement and effectively enhancing learning efficiency.
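The abstract does not specify how logistic regression is combined with the random forest; one common construction consistent with the description is stacking, sketched below on synthetic data (all names and hyperparameters are placeholders):

```python
# Minimal sketch (not the paper's exact design): stack a random
# forest with a logistic-regression meta-learner -- the forest's
# out-of-fold class probabilities feed the logistic regression.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

clf = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200, random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",  # forest probabilities as meta-features
)
print(cross_val_score(clf, X, y, cv=5).mean())
```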
The endoplasmic reticulum (ER) is an important organelle involved in cellular homeostasis and control of protein quality. The unfolded protein response (UPR) is a cellular response to ER stress and promotes cell survival; severe or prolonged stress instead activates apoptosis signaling to trigger cell death. In mammals, the UPR is initiated by three major ER stress sensors: inositol-requiring transmembrane kinase 1, double-stranded RNA-activated protein kinase-like ER kinase, and activating transcription factor 6. UPR dysfunction plays an important role in the pathogenesis of neurodegenerative diseases, including Alzheimer's disease, Parkinson's disease, amyotrophic lateral sclerosis, and Huntington's disease, which are characterized by the accumulation and aggregation of misfolded proteins. ER stress also mediates the pathogenesis of psychiatric diseases, such as depression, schizophrenia, sleep fragmentation, and post-traumatic stress disorder. The role of the UPR in the neuropathology of humans, cell lines, and animal models is established. Therefore, inhibition of specific ER stress mediators may contribute to the treatment and prevention of neurodegeneration, and preclinical studies have shed light on potential therapeutic strategies. Here, we review the evidence of UPR activation in neurodegenerative disorders and psychiatric diseases, along with the methodology.
Emulation of photonic synapses through photo‐recordable devices has attracted tremendous attention owing to the low energy consumption, high parallelism, and fault tolerance of artificial neuromorphic networks. Nonvolatile flash‐type photomemory with short photo‐programming time, long‐term storage, and linear plasticity has become the most promising candidate. Nevertheless, systematic studies of the mechanism behind the charge‐transfer process in photomemory are limited. Herein, the influence of the physical properties of APbBr3 perovskite quantum dots (PQDs) on the photoresponsive characteristics of the derived poly(3‐hexylthiophene‐2,5‐diyl) (P3HT)/PQDs‐based photomemory is explored through a facile A‐site substitution approach. Benefiting from the lowest valence band maximum and longest exciton lifetime of FAPbBr3 quantum dots (FA‐QDs), the P3HT/FA‐QDs‐derived photomemory not only exhibits the shortest photoresponse time compared to FA0.5Cs0.5PbBr3 quantum dots (Mix‐QDs) and CsPbBr3 quantum dots (Cs‐QDs), but also displays an excellent ON/OFF current ratio of 2.2 upon an extremely short illumination duration of 1 ms. Moreover, the device not only achieves linear synaptic plasticity via optical potentiation and electrical depression, but also successfully emulates photonic synaptic features such as paired‐pulse facilitation, long‐term plasticity, and multiple spike‐dependent plasticity, and exhibits an extremely low energy consumption of 3 × 10−17 J per synaptic event.
Engineering of the minimum photo‐recording time in poly(3‐hexylthiophene)/APbBr3 perovskite quantum dots‐based photomemory via a facile A‐site substitution approach is demonstrated. The poly(3‐hexylthiophene‐2,5‐diyl)/FAPbBr3 quantum dot‐derived photomemory displays an extremely short programming time of 1 ms and enables an extremely low energy consumption of 3 × 10−17 J per synaptic event in photonic synapse applications.