This review focuses on recent and potential advances in chemometric methods in relation to data processing in metabolomics, especially for data generated from mass spectrometric techniques. Metabolomics is gradually being regarded as a valuable and promising biotechnology rather than an ambitious advancement. Herein, we outline significant developments in metabolomics, especially in combination with modern chemical analysis techniques and dedicated statistical and chemometric data analysis strategies. Advanced skills in the preprocessing of raw data, identification of metabolites, variable selection, and modeling are illustrated. We believe that insights from these developments will help narrow the gap between the original dataset and current biological knowledge. We also discuss the limitations and perspectives of extracting information from high-throughput datasets.
•Advanced chemometric methods for metabolomics data processing are illustrated.
•The limitations and perspectives in extracting information from high-throughput datasets are discussed.
•Four questions of great importance to advancing data processing in metabolomics are highlighted.
Measuring the flatness error of large precision workpieces quickly and accurately is a difficult problem. A new method for preprocessing flatness measurement data based on MSE (mean squared error) is proposed. A mathematical model of the new data preprocessing method was established, and the mathematical formula for solving the model was derived in detail. The data were measured with a digital level on the plane of a granite base with dimensions of 2340 mm × 1540 mm. The new method and SmartLevel (basic measurement system of the level computer) were used to calculate and process the data. The flatness errors after diagonal evaluation were 4.07 μm and 3.90 μm, respectively. The relative error between the two was 4.36%, which confirmed the reliability and accuracy of the new method. The results show that this method can be effectively used for the engineering measurement of the flatness of large precision workpieces.
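The 4.36% figure can be checked with a few lines of arithmetic. The sketch below assumes the SmartLevel result (3.90 μm) is taken as the reference value, which is the choice that reproduces the reported percentage; the function name `relative_error` is illustrative and not from the paper.

```python
def relative_error(value: float, reference: float) -> float:
    """Return |value - reference| / |reference| as a fraction."""
    return abs(value - reference) / abs(reference)


# Flatness errors after diagonal evaluation, as quoted in the abstract (in micrometres).
new_method_um = 4.07
smartlevel_um = 3.90  # assumed here to be the reference value

rel = relative_error(new_method_um, smartlevel_um)
print(f"{rel:.2%}")  # prints 4.36%
```

Taking the new method's value as the reference instead would give a slightly smaller percentage, so the reported figure is consistent with SmartLevel being the baseline.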
Data preprocessing is a major and essential stage whose main goal is to obtain final data sets that can be considered correct and useful for further data mining algorithms. This paper summarizes the most influential data preprocessing algorithms according to their usage, popularity and extensions proposed in the specialized literature. For each algorithm, we provide a description, a discussion of its impact, and a review of current and further research on it. These most influential algorithms cover missing values imputation, noise filtering, dimensionality reduction (including feature selection and space transformations), instance reduction (including selection and generation), discretization, and the treatment of imbalanced data. Together they constitute some of the most important topics in data preprocessing research and development. This paper emphasizes the most well-known preprocessing methods and their practical study, selected after a recent, generic book on data preprocessing that does not cover them in depth. This manuscript also presents an illustrative study in two sections with different data sets that provide useful tips for the use of preprocessing algorithms. First, we graphically present the effects of the preprocessing methods on two benchmark data sets. The reader may find useful insights into the different characteristics and outcomes generated by them. Second, we use a real-world problem presented in the ECDBL’2014 Big Data competition to provide a thorough analysis of the application of some preprocessing techniques, their combination and their performance. As a result, five different cases are analyzed, providing tips that may be useful for readers.
Data are naturally collected in their raw state and must undergo a series of preprocessing steps to obtain data in their input state for Artificial Intelligence (AI) and other applications. The data preprocessing phase is not only necessary to fit input requirements but also effective in improving AI training efficiency and output accuracy. Data preprocessing is a time-consuming and complex phase that lacks a unified and structured approach. We survey data preprocessing techniques under different categories to provide an extended and structured scope of data preprocessing relevant to numerical time-series data. We also provide an empirical analysis of the impact of preprocessing techniques on the quality of the data and on the performance of AI algorithms. In addition, we discuss the feasibility of distributing some of the surveyed techniques to the edge. Leveraging edge computing to distribute data preprocessing reduces the workload on central systems, creates more manageable data lakes, reduces the consumption of resources (e.g., energy) and enables EdgeAI.
Network representation learning (NRL) advances the conventional graph mining of social networks, knowledge graphs, and complex biomedical and physics information networks. Dozens of NRL algorithms have been reported in the literature. Most of them focus on learning node embeddings for homogeneous networks, but they differ in the specific encoding schemes and the specific types of node semantics captured and used for learning node embeddings. This article reviews the design principles and the different node embedding techniques for NRL over homogeneous networks. To facilitate the comparison of different node embedding algorithms, we introduce a unified reference framework to divide and generalize the node embedding learning process on a given network into preprocessing steps, node feature extraction steps, and node embedding model training for an NRL task such as link prediction and node clustering. With this unifying reference framework, we highlight the representative methods, models, and techniques used at different stages of the node embedding model learning process. This survey not only helps researchers and practitioners gain an in-depth understanding of different NRL techniques but also provides practical guidelines for designing and developing the next generation of NRL algorithms and systems.
•Data complexity analysis.
•Dynamic selection of normalization technique.
•Deciding whether to use min–max/z-score normalization.
•Kernelized Extreme Learning Machine.
Data preprocessing is an important step in designing a classification model. Normalization is one of the preprocessing techniques used to handle out-of-bounds attributes. This work develops 14 classification models using different learning algorithms for dynamic selection of the normalization technique. It extracts 12 data complexity measures for 48 datasets drawn from the KEEL dataset repository. Each of these datasets is normalized using the min–max and z-score normalization techniques. The G-mean index is estimated for these normalized datasets using a Gaussian Kernel Extreme Learning Machine (KELM) in order to determine the best-suited normalization technique. The data complexity measures, along with the best-suited normalization technique, are used as input for developing the aforementioned dynamic models. These models predict the most suitable normalization technique based on the estimated data complexity measures of the dataset. The results show that the models developed using Gaussian Kernel ELM (KELM) and Support Vector Machine (SVM) give promising results for most of the evaluated classification problems.
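For readers unfamiliar with the two normalization techniques and the G-mean index named above, the following is a minimal sketch using their standard textbook definitions. The function names are illustrative, the z-score uses the population standard deviation, and the G-mean is computed as the geometric mean of per-class recalls; the abstract does not state which exact variants the authors used, so these are assumptions.

```python
import math
from typing import List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values: Sequence[float]) -> List[float]:
    """Centre values on zero mean and scale to unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0.0:  # constant attribute
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def g_mean(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Geometric mean of the recall obtained on each class present in y_true."""
    classes = sorted(set(y_true))
    product = 1.0
    for c in classes:
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        product *= hits / total
    return product ** (1.0 / len(classes))


print(min_max([2.0, 4.0, 6.0]))            # [0.0, 0.5, 1.0]
print(g_mean([0, 0, 1, 1], [0, 1, 1, 1]))  # recalls 0.5 and 1.0
```

Because the G-mean collapses to zero when any class is entirely misclassified, it is a common choice for comparing preprocessing options on imbalanced classification problems.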
To reduce the torque ripple of a switched reluctance motor (SRM), a learning error preprocessing-based torque–flux linkage recurrent neural network adaptive inversion control (TFRNNAIC) for SRM torque is proposed, based on filter preprocessing and the non-linear mechanism characteristics of the SRM. In TFRNNAIC, which combines the advantages of the parallel and series–parallel structures, the torque feedback error learning method is employed to update the weights of the torque–flux linkage recurrent neural network. To effectively suppress the ripple in the torque error used for this weight learning, and so obtain an accurate TFRNNAIC (namely, the torque–flux linkage model of the SRM), the torque error is preprocessed with a low-pass filter. A second low-pass filter is applied to reduce the ripple in the output of the PD torque control. The superposition of the TFRNNAIC and PD torque control outputs is taken as the reference flux linkage. Compared with other control strategies, such as the classical parallel neural network control, the classical series–parallel neural network control and TFRNNAIC, the simulation results show that the learning error preprocessing-based TFRNNAIC effectively reduces the SRM torque ripple with good recurrent performance.
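The effect of low-pass filtering a rippled error signal before it is used for learning can be illustrated with a short sketch. The abstract does not give the filter's order or cutoff, so the first-order discrete filter below, its smoothing factor `alpha`, and the synthetic alternating ripple are all assumptions chosen for illustration only.

```python
from typing import List, Sequence


def low_pass(signal: Sequence[float], alpha: float) -> List[float]:
    """First-order discrete low-pass filter: y[k] = y[k-1] + alpha * (x[k] - y[k-1]).

    A smaller alpha gives stronger smoothing. The state starts at the first sample.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    filtered: List[float] = []
    state = signal[0] if signal else 0.0
    for x in signal:
        state += alpha * (x - state)
        filtered.append(state)
    return filtered


# Hypothetical torque error: a constant offset of 1.0 with a +/-0.2 alternating ripple.
torque_error = [1.2, 0.8] * 50
smoothed = low_pass(torque_error, alpha=0.1)

raw_ripple = max(torque_error[-10:]) - min(torque_error[-10:])
filtered_ripple = max(smoothed[-10:]) - min(smoothed[-10:])
print(raw_ripple, filtered_ripple)  # the filtered peak-to-peak ripple is much smaller
```

The filtered signal keeps the slowly varying part of the error, which is what a weight update should follow, while the fast ripple is strongly attenuated.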
Though most faces are axis-symmetrical objects, few real-world face images are axis-symmetrical images. In recent years there have been many studies on face recognition, but little attention has been paid to this issue, and few studies have explored and exploited the axis-symmetrical property of faces for face recognition. In this paper, we take the axis-symmetrical nature of faces into consideration and design a framework to produce an approximately axis-symmetrical virtual dictionary for enhancing the accuracy of face recognition. It is noteworthy that the novel algorithm for producing axis-symmetrical virtual face images is mathematically very tractable and quite easy to implement. Extensive experimental results demonstrate that, for face recognition, the virtual face images obtained using our method are superior to the original face images. Moreover, experimental results on different databases also show that the proposed method achieves satisfactory classification accuracy in comparison with state-of-the-art image preprocessing algorithms. The MATLAB code of the proposed method is available at http://www.yongxu.org/lunwen.html.
•Developed a novel method to automatically produce approximately axis-symmetrical virtual face images.
•Treated as an effective image preprocessing method.
•Used as a virtual image dictionary learning method for image classification.
•Extensive experiments on different face databases show its effectiveness as an image preprocessing algorithm.
•The strong identification capability of our method is verified in comparison with state-of-the-art dictionary learning algorithms.
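To make the notion of an axis-symmetrical image concrete, the sketch below averages a grayscale image with its left-right mirror, which yields an image that is exactly symmetric about its vertical axis. This is a simple stand-in chosen for illustration and is not claimed to be the authors' algorithm, whose details the abstract does not give; the function names are hypothetical.

```python
from typing import List

Image = List[List[float]]  # rows of grayscale pixel values


def mirror_average(image: Image) -> Image:
    """Average each pixel with its horizontally mirrored counterpart."""
    return [
        [(row[j] + row[-1 - j]) / 2.0 for j in range(len(row))]
        for row in image
    ]


def is_axis_symmetric(image: Image) -> bool:
    """True if every row reads the same left-to-right and right-to-left."""
    return all(row == row[::-1] for row in image)


face = [
    [1.0, 2.0, 3.0],
    [4.0, 0.0, 8.0],
]
virtual = mirror_average(face)
print(virtual)                     # [[2.0, 2.0, 2.0], [6.0, 0.0, 6.0]]
print(is_axis_symmetric(virtual))  # True
```

A symmetric image produced this way could be added to a training set alongside the original, which is the general idea of a virtual-image dictionary.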
As construction sites increase in size, it becomes more difficult for a manager to keep track of the status of the site in a timely manner. However, with the development of unmanned aerial vehicles (UAVs), it is possible to collect a large amount of visual data of the construction site in a short time. Using these data, a large-scale construction site can be monitored in a timely and frequent manner with computer vision technologies. This paper proposes a method to generate a panorama of a construction site by using an image stitching technique with a focus on preprocessing. To create high-quality panoramas, blurred frames of videos are filtered out, key frames are selected, and camera lens distortion is corrected. The proposed method produced a high-quality panorama of a construction site, which was evaluated by comparing it with an aerial photograph and the panorama produced by the existing image stitching technique. The proposed method is expected to help managers easily identify various construction site conditions with the help of high-quality image data.
•A preprocessing methodology to generate a high-resolution panorama was presented.
•The preprocessing includes blur removal, key frame selection, and lens correction.
•Using a UAV, a high-resolution and high-quality panorama is generated.
•A panorama is suitable for representing a large construction site.
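The blur-removal step can be sketched with a common sharpness heuristic, the variance of the Laplacian: sharp frames have strong edges and therefore a widely spread Laplacian response, while blurred frames do not. The abstract does not say which blur measure the authors used, so this metric, the threshold value, and the function names are assumptions for illustration.

```python
from typing import List

Frame = List[List[float]]  # grayscale frame as rows of pixel intensities


def laplacian_variance(gray: Frame) -> float:
    """Variance of the 4-neighbour Laplacian over interior pixels (frame must be at least 3x3)."""
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4.0 * gray[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def keep_sharp_frames(frames: List[Frame], threshold: float) -> List[Frame]:
    """Drop frames whose sharpness score falls below a (hypothetical) threshold."""
    return [f for f in frames if laplacian_variance(f) >= threshold]


# Synthetic frames: a high-contrast checkerboard (sharp) and a uniform grey frame (no detail).
sharp = [[255.0 if (x + y) % 2 else 0.0 for x in range(6)] for y in range(6)]
flat = [[128.0] * 6 for _ in range(6)]

kept = keep_sharp_frames([sharp, flat], threshold=100.0)
print(len(kept))  # 1: only the checkerboard frame passes
```

In practice the threshold would need tuning per camera and scene, since the score depends on resolution and image content as well as on blur.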
Machine learning (ML) is continuously unleashing its power in a wide range of applications. It has been pushed to the forefront in recent years partly owing to the advent of big data. ML algorithms have never held more promise, nor faced greater challenges, than they do with big data. Big data enables ML algorithms to uncover more fine-grained patterns and make more timely and accurate predictions than ever before; on the other hand, it presents major challenges to ML such as model scalability and distributed computing. In this paper, we introduce a framework of ML on big data (MLBiD) to guide the discussion of its opportunities and challenges. The framework is centered on ML, which follows the phases of preprocessing, learning, and evaluation. In addition, the framework comprises four other components, namely big data, user, domain, and system. The phases of ML and the components of MLBiD provide directions for identifying associated opportunities and challenges and open up future work in many unexplored or underexplored research areas.