Linear regression without correspondences is the problem of performing a linear regression fit to a dataset for which the correspondences between the independent samples and the observations are unknown. Such a problem naturally arises in diverse domains such as computer vision, data mining, communications and biology. In its simplest form, it is tantamount to solving a linear system of equations for which the entries of the right-hand side vector have been permuted. This type of data corruption renders the linear regression task considerably harder, even in the absence of other corruptions such as noise, outliers or missing entries. Existing methods are either applicable only to noiseless data, very sensitive to initialization, or able to handle only partially shuffled data. In this paper we address these issues via an algebraic-geometric approach, which uses symmetric polynomials to extract permutation-invariant constraints that the parameters <inline-formula> <tex-math notation="LaTeX">\xi ^{*} \in \mathbb {R} ^{\text {n}} </tex-math></inline-formula> of the linear regression model must satisfy. This naturally leads to a polynomial system of n equations in n unknowns, which contains <inline-formula> <tex-math notation="LaTeX">\xi ^{*} </tex-math></inline-formula> in its root locus. Using the machinery of algebraic geometry, we prove that as long as the independent samples are generic, this polynomial system is always consistent with at most n! complex roots, regardless of any type of corruption inflicted on the observations. The algorithmic implication of this fact is that one can always solve this polynomial system and use its most suitable root as initialization to the Expectation Maximization algorithm. To the best of our knowledge, the resulting method is the first working solution for small values of n able to handle thousands of fully shuffled noisy observations in milliseconds.
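The permutation-invariant construction can be illustrated with power sums, which are symmetric in the observations. A minimal sketch with synthetic integer data and n = 2 (sympy's generic solver stands in for the paper's algebraic machinery):

```python
import numpy as np
from sympy import symbols, solve

# Sketch: y = P @ (A @ xi_true) for an unknown permutation P. Power sums
# sum_i y_i^k are invariant under P, giving n polynomial equations in xi:
#   sum_i (a_i . xi)^k = sum_i y_i^k,   k = 1..n
A = np.array([[1, 0], [0, 1], [1, 1], [2, -1], [1, 2], [3, 1]])
xi_true = np.array([2, -1])
y = A @ xi_true
rng = np.random.default_rng(0)
y_shuffled = y[rng.permutation(len(y))]   # correspondences lost

x1, x2 = symbols("x1 x2")
eqs = []
for k in (1, 2):
    lhs = sum((int(A[i, 0]) * x1 + int(A[i, 1]) * x2) ** k for i in range(len(A)))
    rhs = int(sum(int(v) ** k for v in y_shuffled))
    eqs.append(lhs - rhs)

# At most n! = 2 complex roots; xi_true = (2, -1) is among them.
roots = solve(eqs, [x1, x2], dict=True)
```

In a practical pipeline, the root closest to satisfying the data would seed the Expectation Maximization refinement mentioned in the abstract.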
•Common existing regression approaches are statistically unsound.•The regression can be conducted using residuals which uphold the mass balance.•Regression approach impacts mainly the apparent energetic heterogeneity, nf.•Equilibration time and applied particle size fraction impact both coefficients.•Fine particle size fractions give a reasonable estimate of the equilibrium.
An accurate understanding of equilibrium conditions is an essential starting point for dimensioning and modelling efforts in adsorption systems. However, key assumptions and methodologies in the generation and analysis of adsorption data sets might be contributing to the error in the description of the equilibrium. This study investigates the potential impacts of the equilibration time, test adsorbent particle size selection, and the regression technique applied, on the fitted coefficients of the Freundlich equilibrium model in the adsorption of arsenate on Granular Ferric Hydroxide. The choice of regression algorithm was found to impact primarily the exponent describing the energetic heterogeneity of the adsorption surface and resulting non-linearity of the isotherm, nf. Non-linear regression with a hybrid error function on the observed solute concentration rather than loading was found to be a suitable approach. Insufficient equilibration time was found to impact predominantly the strength of adsorption, given by Kf, when applying a small adsorbent size fraction, and both coefficients when applying a coarser technical size fraction. Application of the small size fractions in lieu of the larger in accelerated trials appeared to over-estimate both coefficients, but is a reasonable alternative to long contact times.
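The non-linear fitting route favoured by the study can be sketched as follows (synthetic, noiseless data; the study's hybrid error function and concentration-based residuals are not reproduced here, this is plain least squares on the loading):

```python
import numpy as np
from scipy.optimize import curve_fit

# Freundlich isotherm: q = Kf * C**(1/nf)
def freundlich(C, Kf, nf):
    return Kf * C ** (1.0 / nf)

C = np.array([0.05, 0.1, 0.5, 1.0, 2.0, 5.0])   # equilibrium concentration
q = freundlich(C, 3.0, 2.0)                      # loading, with Kf = 3, nf = 2

# Linearized fit: log q = log Kf + (1/nf) * log C
slope, intercept = np.polyfit(np.log(C), np.log(q), 1)
Kf_lin, nf_lin = float(np.exp(intercept)), 1.0 / float(slope)

# Non-linear least squares on the untransformed data
(Kf_nl, nf_nl), _ = curve_fit(freundlich, C, q, p0=(1.0, 1.0))
```

On noiseless data both routes recover Kf and nf exactly; the regression-method sensitivity the study reports in nf appears once experimental error enters.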
As machine learning becomes widely used for automated decisions, attackers have strong incentives to manipulate the results and models generated by machine learning algorithms. In this paper, we perform the first systematic study of poisoning attacks and their countermeasures for linear regression models. In poisoning attacks, attackers deliberately influence the training data to manipulate the results of a predictive model. We propose a theoretically-grounded optimization framework specifically designed for linear regression and demonstrate its effectiveness on a range of datasets and models. We also introduce a fast statistical attack that requires limited knowledge of the training process. Finally, we design a new principled defense method that is highly resilient against all poisoning attacks. We provide formal guarantees about its convergence and an upper bound on the effect of poisoning attacks when the defense is deployed. We extensively evaluate our attacks and defenses on three realistic datasets from the health care, loan assessment, and real estate domains.
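The leverage that a poisoner has over ordinary least squares can be illustrated crudely: a handful of adversarially placed points shifts the fitted slope (the paper's attacks choose such points by optimization, not by hand as here):

```python
import numpy as np

def ols_slope(x, y):
    # slope coefficient of an intercept + slope OLS fit
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

rng = np.random.default_rng(7)
x = rng.uniform(0, 1, 100)
y = 2.0 * x + rng.normal(0, 0.05, 100)    # clean data, true slope = 2
slope_clean = ols_slope(x, y)

# Attacker appends 10 high-leverage points pulling the slope down.
x_p = np.concatenate([x, np.full(10, 1.0)])
y_p = np.concatenate([y, np.full(10, -2.0)])
slope_poisoned = ols_slope(x_p, y_p)      # far below the clean slope
```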
In the period 1991-2015, algorithmic advances in Mixed Integer Optimization (MIO) coupled with hardware improvements have resulted in an astonishing 450 billion factor speedup in solving MIO problems. We present a MIO approach for solving the classical best subset selection problem of choosing k out of p features in linear regression given n observations. We develop a discrete extension of modern first-order continuous optimization methods to find high quality feasible solutions that we use as warm starts to a MIO solver that finds provably optimal solutions. The resulting algorithm (a) provides a solution with a guarantee on its suboptimality even if we terminate the algorithm early, (b) can accommodate side constraints on the coefficients of the linear regression and (c) extends to finding best subset solutions for the least absolute deviation loss function. Using a wide variety of synthetic and real datasets, we demonstrate that our approach solves problems with n in the 1000s and p in the 100s in minutes to provable optimality, and finds near-optimal solutions for n in the 100s and p in the 1000s in minutes. We also establish via numerical experiments that the MIO approach performs better than Lasso and other widely used sparse learning procedures, in terms of achieving sparse solutions with good predictive power.
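The problem being optimized can be stated concretely by brute force: enumerate all k-subsets and keep the one with the lowest residual sum of squares. This is only tractable for small p; the MIO formulation above is what makes realistic sizes reachable:

```python
import itertools
import numpy as np

def best_subset(X, y, k):
    # Exhaustive best subset selection: O(C(p, k)) least-squares fits.
    best = (np.inf, None, None)
    for S in itertools.combinations(range(X.shape[1]), k):
        cols = list(S)
        beta, _, _, _ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        r = y - X[:, cols] @ beta
        rss = float(r @ r)
        if rss < best[0]:
            best = (rss, S, beta)
    return best

rng = np.random.default_rng(1)
X = rng.standard_normal((60, 8))
beta_true = np.zeros(8)
beta_true[[1, 4]] = [3.0, -2.0]
y = X @ beta_true                  # noiseless, so the true support wins
rss, S, beta = best_subset(X, y, 2)
```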
The linearization of adsorption equations is controversial. The estimation of fitting parameters strongly depends on the linearization method, magnitude of experimental error, and data range. Although many studies contrast linear versions of these equations with their non-linear counterparts, linearization is preferred due to its simplicity, since a line can be represented with fewer experimental points than a curve. An in-depth analysis was carried out to compare the accuracy of linear and non-linear models. Although different transformations linearize Langmuir isotherms, only one form yields reliable fitting parameters. Linear transformations could also lead to a statistical bias, favoring a model that does not represent the experimental behavior. Similar observations are discussed regarding the pseudo-second-order kinetic model. Linearization of Freundlich isotherms, pseudo-first-order kinetic models, and fixed-bed adsorption models through logarithms implies that care must be taken with the logarithm limits by properly selecting the data range. Linearization also promotes the incorrect interpretation of models due to oversimplification. The linearized van't Hoff equation would yield a reasonable fit with fewer experimental points than the non-linear regression, which requires more data to assure convergence. In this sense, there is convincing evidence that non-linear regression is a more robust and reliable tool for adsorption modeling.
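The Langmuir case can be sketched with one common linearization, the double-reciprocal form, against non-linear least squares on the raw data; with noisy data the two routes generally disagree, which is the bias described above (synthetic data, illustrative parameter values):

```python
import numpy as np
from scipy.optimize import curve_fit

# Langmuir isotherm: q = qm * KL * C / (1 + KL * C)
def langmuir(C, qm, KL):
    return qm * KL * C / (1 + KL * C)

rng = np.random.default_rng(8)
C = np.array([0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0])
q = langmuir(C, 10.0, 1.5) * (1 + rng.normal(0, 0.03, C.size))  # 3% noise

# Linearized (double-reciprocal) fit: 1/q = 1/qm + (1/(qm*KL)) * (1/C)
slope, intercept = np.polyfit(1 / C, 1 / q, 1)
qm_lin, KL_lin = 1 / intercept, intercept / slope

# Non-linear fit on the untransformed data
(qm_nl, KL_nl), _ = curve_fit(langmuir, C, q, p0=(5.0, 1.0))
```

The reciprocal transform inflates the weight of the low-concentration points, which is one mechanism behind the unreliable parameters the text attributes to most linearized Langmuir forms.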
The COVID-19 pandemic, which originated in the city of Wuhan, China, has severely affected the health, socio-economic and financial situations of countries across the world. India is one of the countries affected by the disease, and thousands of people are infected on a daily basis. In this paper, an analysis of the daily statistics of people affected by the disease is carried out to predict the coming days' trend in active cases in Odisha as well as India.
A valid global data set is collected from the WHO daily statistics, and correlations among the total confirmed, active, deceased and positive cases are stated in this paper. Regression models such as Linear and Multiple Linear Regression are applied to the data set to visualize the trend of affected cases.
Here a comparison of the Linear Regression and Multiple Linear Regression models is performed, where the R2 scores of the models tend to be 0.99 and 1.0, indicating a strong prediction model for forecasting active cases in the coming days. Using the Multiple Linear Regression model as of July, a forecast of 52,290 active cases in India and 9,358 active cases in Odisha is predicted for 15th August if the situation continues in this way.
These models achieved remarkable accuracy in COVID-19 recognition. A strong correlation factor determines the relationship between the dependent variable (active) and the independent variables (positive, deceased, recovered).
•Multiple linear regression model is proposed for prediction of active cases in COVID-19 daily data.•The model predicts a value of 52,290 active cases in India and 9,358 active cases in Odisha towards the 15th of August.•The ANOVA results show a significant p-value that supports the proposed model.•Statistical results show the MLR model has fair predictive potential over the LR model.
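A multiple linear regression of this kind can be sketched with ordinary least squares (illustrative synthetic data; the WHO case counts are not reproduced here):

```python
import numpy as np

# Regress the dependent variable (active cases) on the independent ones
# (positive, deceased, recovered) and report R^2.
def fit_ols(X, y):
    X1 = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
    coef, _, _, _ = np.linalg.lstsq(X1, y, rcond=None)
    return coef, X1 @ coef

rng = np.random.default_rng(2)
X = rng.uniform(0, 1000, size=(30, 3))           # positive, deceased, recovered
y = 5.0 + 1.0 * X[:, 0] - 0.8 * X[:, 1] - 0.9 * X[:, 2]
coef, y_hat = fit_ols(X, y)
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
```

On an exactly linear relationship such as this synthetic one, R2 reaches 1.0, which mirrors the near-perfect scores the abstract reports.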
When the outcome is binary, psychologists often use nonlinear modeling strategies such as logit or probit. These strategies are often neither optimal nor justified when the objective is to estimate causal effects of experimental treatments. Researchers need to take extra steps to convert logit and probit coefficients into interpretable quantities, and when they do, these quantities often remain difficult to understand. Odds ratios, for instance, are described as obscure in many textbooks (e.g., Gelman & Hill, 2006, p. 83). I draw on econometric theory and established statistical findings to demonstrate that linear regression is generally the best strategy to estimate causal effects of treatments on binary outcomes. Linear regression coefficients are directly interpretable in terms of probabilities and, when interaction terms or fixed effects are included, linear regression is safer. I review the Neyman-Rubin causal model, which I use to prove analytically that linear regression yields unbiased estimates of treatment effects on binary outcomes. Then, I run simulations and analyze existing data on 24,191 students from 56 middle schools (Paluck, Shepherd, & Aronow, 2013) to illustrate the effectiveness of linear regression. Based on these grounds, I recommend that psychologists use linear regression to estimate treatment effects on binary outcomes.
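The simulation argument can be sketched in a few lines: with random assignment, the OLS coefficient on a treatment dummy (the linear probability model) recovers the average treatment effect on a binary outcome (synthetic data, not the Paluck et al. study):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
T = rng.integers(0, 2, n)            # randomized binary treatment
p = 0.30 + 0.10 * T                  # true effect on Pr(Y = 1): +0.10
Y = (rng.random(n) < p).astype(float)

# OLS of the binary outcome on an intercept and the treatment dummy
X = np.column_stack([np.ones(n), T])
beta, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
ate_hat = beta[1]                    # direct estimate of the +0.10 effect
```

The coefficient is read directly in probability units, which is the interpretability advantage the abstract claims over odds ratios.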
The current compliance networks of automatic air-quality monitoring stations in large urban environments are not sufficient to provide spatial and temporal measurement resolution for realistic assessment of personal exposure to pollutants. Small low-cost sensor platforms with greater mobility and expected lower maintenance costs are increasingly being used as a supplement to compliance monitoring stations. However, low-cost sensor platforms usually provide data with uncertain precision. To improve the precision, these sensor platforms require in-field calibration. Our paper aims to demonstrate that data from each individual sensor system can be corrected using that sensor system's own data to achieve much improved data quality compared to a reference. However, in this procedure, there are practical difficulties, such as individual sensor outputs from the multi-sensor system not being sufficiently available due to malfunctions, for instance. We explore how this can be dealt with. In our opinion, this is a novel approach, of practical importance both to users and manufacturers. We present a detailed comparative analysis of Linear Regression (univariate), Multivariate Linear Regression and Artificial Neural Networks used with a specific aim of calibrating field-deployed low-cost CO and O3 sensors. For Artificial Neural Network models, the performance of three common training algorithms was compared (Levenberg-Marquardt, Resilient back-propagation and Conjugate Gradient Powell-Beale algorithm). Data for this study were obtained from two campaigns conducted with 25 multi-sensor AQMESH v.3.5 platforms used within the activities of the CITI-SENSE project. The platforms were co-located to reference gas monitors at the Automatic Monitoring Station Stari Grad, in Belgrade, Serbia. This paper demonstrates that Multivariate Linear Regression and Artificial Neural Network calibration models can improve the output signal.
This improvement can be measured by changes in the median and interquartile ranges of statistical parameters used for model evaluation. Artificial Neural Networks showed the best results compared to Linear Regression and Multivariate Linear Regression models. The best predictors for CO, in addition to CO low-cost sensor data, were PM2.5 and NO2, while for O3, in addition to O3 low-cost sensor data, the most suitable input predictors were NO and aH. Based on residual error analysis, we have shown that for CO and O3, a certain range of concentrations exists in which calibrated values differ by less than 10% from the reference method results. In addition, it was noted that for all models, CO sensors consistently showed lower variability between platforms compared to O3 sensors.
•Precision of measured concentrations can be enhanced by using other available sensors.•Calibration predictors for CO were the low-cost CO signal, PM2.5 and NO2; for O3 they were the low-cost O3 signal, NO and aH.•ANN models perform better compared to linear or multilinear regression models.•Within certain ranges of CO and O3 the error of calibrated data is less than 10%.•CO sensors show less variability between the platforms than O3 sensors.
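The multivariate calibration step can be sketched as an ordinary least-squares fit of the reference concentration on the low-cost CO signal plus auxiliary predictors (synthetic data standing in for the co-location campaign; coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
co_lc = rng.uniform(0.2, 2.0, n)     # raw low-cost CO signal
pm25 = rng.uniform(5, 60, n)         # auxiliary predictor: PM2.5
no2 = rng.uniform(10, 80, n)         # auxiliary predictor: NO2
co_ref = 0.1 + 0.9 * co_lc + 0.002 * pm25 - 0.001 * no2 + rng.normal(0, 0.02, n)

# Multivariate linear calibration: reference ~ intercept + sensors
X = np.column_stack([np.ones(n), co_lc, pm25, no2])
coef, _, _, _ = np.linalg.lstsq(X, co_ref, rcond=None)
co_cal = X @ coef                    # calibrated output
rmse = float(np.sqrt(np.mean((co_cal - co_ref) ** 2)))
```

In deployment the fitted coefficients would be applied to new raw readings from the same platform, which is the per-sensor-system correction the paper describes.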
We investigate the choice of the bandwidth for the regression discontinuity estimator. We focus on estimation by local linear regression, which was shown to have attractive properties (Porter, J., 2003, "Estimation in the Regression Discontinuity Model" (unpublished, Department of Economics, University of Wisconsin, Madison)). We derive the asymptotically optimal bandwidth under squared error loss. This optimal bandwidth depends on unknown functionals of the distribution of the data and we propose simple and consistent estimators for these functionals to obtain a fully data-driven bandwidth algorithm. We show that this bandwidth estimator is optimal according to the criterion of Li (1987, "Asymptotic Optimality for Cp, CL, Cross-validation and Generalized Cross-validation: Discrete Index Set", Annals of Statistics, 15, 958–975), although it is not unique in the sense that alternative consistent estimators for the unknown functionals would lead to bandwidth estimators with the same optimality properties. We illustrate the proposed bandwidth, and the sensitivity to the choices made in our algorithm, by applying the methods to a data set previously analysed by Lee (2008, "Randomized Experiments from Non-random Selection in U.S. House Elections", Journal of Econometrics, 142, 675–697) as well as by conducting a small simulation study.
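The estimator whose bandwidth is being chosen can be sketched directly: fit a local linear regression with a triangular kernel on each side of the cutoff and take the difference of the two boundary intercepts (synthetic data; the bandwidth h is fixed by hand here rather than by the paper's data-driven rule):

```python
import numpy as np

def llr_at_cutoff(x, y, h, side):
    # Weighted least squares on one side of the cutoff at x = 0;
    # the intercept is the local linear fit evaluated at the boundary.
    m = (x >= 0) if side == "right" else (x < 0)
    xs, ys = x[m], y[m]
    w = np.clip(1 - np.abs(xs) / h, 0, None)      # triangular kernel weights
    X = np.column_stack([np.ones(xs.size), xs])
    beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * ys))
    return beta[0]

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, 4000)                       # running variable
tau = 2.0                                          # true discontinuity
y = 1.0 + 0.5 * x + tau * (x >= 0) + rng.normal(0, 0.1, 4000)

h = 0.3                                            # bandwidth (hand-picked)
tau_hat = llr_at_cutoff(x, y, h, "right") - llr_at_cutoff(x, y, h, "left")
```

The bias-variance trade-off the paper optimizes is visible in h: shrinking it reduces boundary bias from curvature but inflates the variance of both intercepts.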
Regularized Label Relaxation Linear Regression
Fang, Xiaozhao; Xu, Yong; Li, Xuelong
IEEE Transactions on Neural Networks and Learning Systems, 04/2018, Volume 29, Issue 4
Journal Article
Linear regression (LR) and some of its variants have been widely used for classification problems. Most of these methods assume that during the learning phase, the training samples can be exactly transformed into a strict binary label matrix, which has too little freedom to fit the labels adequately. To address this problem, in this paper, we propose a novel regularized label relaxation LR method, which has the following notable characteristics. First, the proposed method relaxes the strict binary label matrix into a slack variable matrix by introducing a nonnegative label relaxation matrix into LR, which provides more freedom to fit the labels and simultaneously enlarges the margins between different classes as much as possible. Second, the proposed method constructs the class compactness graph based on manifold learning and uses it as the regularization item to avoid the problem of overfitting. The class compactness graph is used to ensure that the samples sharing the same labels can be kept close after they are transformed. Two different algorithms, which are, respectively, based on <inline-formula> <tex-math notation="LaTeX">\ell _{2} </tex-math></inline-formula>-norm and <inline-formula> <tex-math notation="LaTeX">\ell _{2,1} </tex-math></inline-formula>-norm loss functions are devised. These two algorithms have compact closed-form solutions in each iteration so that they are easily implemented. Extensive experiments show that these two algorithms outperform the state-of-the-art algorithms in terms of the classification accuracy and running time.
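The baseline this method starts from can be sketched as ridge-regularized LR against a strict one-hot label matrix; the proposed approach replaces this rigid target with a relaxed slack matrix (synthetic two-class data, not the paper's algorithm or experiments):

```python
import numpy as np

def lr_classifier(X, Y, lam=1e-2):
    # Ridge-regularized LR-for-classification, closed form:
    #   W = (X'X + lam * I)^{-1} X'Y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

rng = np.random.default_rng(6)
X0 = rng.normal(-2, 1, (50, 2))          # class 0 cluster
X1 = rng.normal(2, 1, (50, 2))           # class 1 cluster
X = np.vstack([X0, X1])
labels = np.array([0] * 50 + [1] * 50)

Y = np.eye(2)[labels]                    # strict binary (one-hot) label matrix
W = lr_classifier(X, Y)
pred = np.argmax(X @ W, axis=1)          # predict by largest regression score
acc = float(np.mean(pred == labels))
```

Forcing the regression outputs to hit exactly 0 or 1 is the rigidity the paper's nonnegative relaxation matrix is designed to remove.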