Artificial intelligence (AI) has recently gained prominence in the global economy due to the strong capability it has demonstrated for analysis and modeling in many disciplines. This situation is accelerating the shift towards a more automated society, where these new techniques can be consolidated as a valid tool to face the difficult challenge of credit fraud detection (CFD). However, tight regulations make it difficult for financial entities to remain compliant while using modern techniques. From a methodological perspective, autoencoders have demonstrated their effectiveness in discovering nonlinear features across several problem domains. However, autoencoders are opaque and often regarded as black boxes. In this work, we propose an interpretable and model-agnostic methodology for CFD. This type of approach offers a double advantage: on the one hand, it can be applied together with any machine learning (ML) technique, and on the other hand, it provides the necessary traceability between inputs and outputs, hence escaping the black-box model. We first applied the state-of-the-art feature selection technique defined in the companion paper. Second, we proposed a novel technique, based on autoencoders, capable of evaluating the relationship between the input and the output of a sophisticated ML model for every sample submitted to the analysis, through a single transaction-level explanation (STE) approach. This technique allows each instance to be analyzed individually by applying small fluctuations to the input space and evaluating how the output responds, thereby shedding light on the underlying dynamics of the model. Based on this, an individualized transaction ranking (ITR) can be formulated, leveraging the contributions of each feature through STE. These rankings represent a close estimate of the most important features playing a role in the decision process. The results obtained in this work were consistent with previously published papers, and showed that certain features, such as living beyond one's means, lack or absence of a transaction trail, and car loans, have a strong influence on the model outcome. Additionally, this proposal using the latent space outperformed, in terms of accuracy, our previous results, which had already improved on prior published papers, by 5.5% and 1.5% for the datasets under study, from baselines of 76% and 93%. The contribution of this paper is twofold: a new, better-performing CFD classification model is presented, and at the same time, we developed a novel methodology, applicable across classification techniques, that allows black-box models to be opened up, removing the dependencies and, eventually, undesirable biases. We conclude that it is possible to develop an effective, individualized, unbiased, and traceable ML technique, not only to comply with regulations, but also to be able to cope with transaction-level inquiries from clients and authorities.
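As an illustration of the STE/ITR idea described above, the following Python sketch perturbs each feature of a single transaction and ranks the features by the resulting change in the model's fraud score. It is a minimal sketch, not the paper's implementation: it assumes a scikit-learn-style classifier exposing `predict_proba`, and the function names and the perturbation scale `epsilon` are illustrative.

```python
import numpy as np

def single_transaction_explanation(model, x, epsilon=0.05):
    """Perturb each input feature of a single transaction and measure how the
    model's fraud score responds (illustrative STE-style sketch)."""
    x = np.asarray(x, dtype=float)
    base_score = model.predict_proba(x.reshape(1, -1))[0, 1]
    contributions = np.zeros_like(x)
    for j in range(x.size):
        x_pert = x.copy()
        # apply a small fluctuation proportional to the feature magnitude
        x_pert[j] += epsilon * (abs(x[j]) if x[j] != 0 else 1.0)
        pert_score = model.predict_proba(x_pert.reshape(1, -1))[0, 1]
        contributions[j] = pert_score - base_score
    return contributions

def individualized_transaction_ranking(contributions, feature_names):
    """Rank features by the magnitude of their contribution for this transaction."""
    order = np.argsort(-np.abs(contributions))
    return [(feature_names[j], contributions[j]) for j in order]
```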
The Discounted Cash Flow (DCF) method is probably the most widespread approach used in company valuation, its main drawbacks being its well-known extreme sensitivity to key variables such as the Weighted Average Cost of Capital (WACC) and the Free Cash Flow (FCF) estimations, neither of which can be obtained unquestionably. In this paper we propose an unbiased and systematic DCF method which allows us to value private equity by leveraging stock market evidence, based on a twofold approach: first, the use of the inverse method assesses the existence of a coherent WACC that compares favorably with market observations; second, different FCF forecasting methods are benchmarked and shown to correspond with actual valuations. We use historical financial data from 42 companies in five sectors, extracted from Eikon-Reuters. Our results show that WACC and FCF forecasting are not coherent with market expectations over time, across sectors, or across market regions when only historical and endogenous variables are taken into account. The best estimates are found when exogenous variables, operational normalization of the input space, and data-driven linear techniques are considered (Root Mean Square Error of 6.51). Our method suggests that FCFs and their positive alignment with Market Capitalization and the subordinate enterprise value are the most influential variables. The fine-tuning of the methods presented here, along with an exhaustive analysis using nonlinear machine-learning techniques, is developed and discussed in the companion paper.
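For reference, the DCF mechanics underlying the discussion above can be summarized in a few lines: the enterprise value is the sum of the discounted forecast FCFs plus a discounted terminal value. This is a minimal sketch of the standard textbook formulation, not the paper's inverse method; the terminal growth rate and the numbers in the example are illustrative.

```python
def discounted_cash_flow_value(fcf_forecast, wacc, terminal_growth=0.02):
    """Enterprise value as the present value of forecast FCFs plus a
    Gordon-growth terminal value (standard DCF mechanics)."""
    if wacc <= terminal_growth:
        raise ValueError("WACC must exceed the terminal growth rate")
    # present value of the explicit forecast horizon
    ev = sum(fcf / (1.0 + wacc) ** t for t, fcf in enumerate(fcf_forecast, start=1))
    # terminal value, discounted back from the end of the horizon
    terminal_value = fcf_forecast[-1] * (1.0 + terminal_growth) / (wacc - terminal_growth)
    ev += terminal_value / (1.0 + wacc) ** len(fcf_forecast)
    return ev

# Example: a five-year FCF forecast (in millions) discounted at an 8% WACC
print(discounted_cash_flow_value([110, 118, 125, 131, 140], wacc=0.08))
```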
Despite the broad interest and use of sentiment analysis nowadays, most of the conclusions in the current literature are driven by simple statistical representations of sentiment scores. On that basis, sentiment evaluation currently consists of encoding and aggregating emotional information from a number of individuals and their population trends. We hypothesized that the stochastic processes that sentiment analysis systems aim to measure exhibit nontrivial statistical and temporal properties. We established an experimental setup consisting of analyzing the short text messages (tweets) of six user groups of a different nature (universities, politics, musicians, communication media, technological companies, and financial companies), each group including ten users who regularly generate high-intensity traffic on social networks. Statistical descriptors were checked to converge at about 2000 messages per user, for which the messages from the last two weeks were compiled using a custom-made tool. The messages were subsequently processed for sentiment scoring in terms of different lexicons currently available and widely used. Not only were the temporal dynamics of the resulting score time series per user scrutinized, but also their statistical description as given by the score histogram, the temporal autocorrelation, the entropy, and the mutual information. Our results showed that the actual dynamic range of lexicons is in general moderate, and hence not much resolution is given within their end-of-scales. We found that seasonal patterns were more present in the time evolution of the number of tweets, but to a much lesser extent in the sentiment intensity. Additionally, we found that the presence of retweets added negligible effects over standard statistical modes, while it hindered informational and temporal patterns. The innovative Compounded Aggregated Positivity Index developed in this work proved to be characteristic of industries and, at the same time, an interesting way to identify singularities among peers. We conclude that the temporal properties of messages provide information about the sentiment dynamics, which differs across lexicons and users, but commonalities can be exploited in this field using appropriate temporal digital processing tools.
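Two of the statistical descriptors mentioned above, the temporal autocorrelation and the entropy of the score histogram for a per-user series, can be computed as in the following sketch. It assumes the sentiment scores have already been obtained from a lexicon; the bin count and maximum lag are illustrative choices, not the paper's settings.

```python
import numpy as np

def autocorrelation(scores, max_lag=50):
    """Normalized temporal autocorrelation of a sentiment score series."""
    s = np.asarray(scores, dtype=float) - np.mean(scores)
    acf = np.correlate(s, s, mode="full")[s.size - 1:]  # non-negative lags
    return acf[:max_lag + 1] / acf[0]

def histogram_entropy(scores, bins=20):
    """Shannon entropy (in bits) of the score histogram."""
    counts, _ = np.histogram(scores, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())
```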
Artificial intelligence (AI) is rapidly shaping the global financial market and its services due to the strong capability it has shown for analysis and modeling in many disciplines. What is especially remarkable is the potential that these techniques could offer to the challenging reality of credit fraud detection (CFD); but it is not easy, even for financial institutions, to remain in strict compliance with non-discrimination and data protection regulations while extracting all the potential that these powerful new tools can provide to them. This reality effectively restricts nearly all possible AI applications to simple, easy-to-trace neural networks, preventing more advanced and modern techniques from being applied. The aim of this work was to create a reliable, unbiased, and interpretable methodology to automatically evaluate CFD risk. Therefore, we propose a novel methodology to address the mentioned complexity when applying machine learning (ML) to the CFD problem, one that uses state-of-the-art algorithms capable of quantifying the information of the variables and their relationships. This approach offers a new form of interpretability to cope with this multifaceted situation. First, a recently published feature selection technique is applied, the informative variable identifier (IVI), which is capable of distinguishing among informative, redundant, and noisy variables. Second, a set of innovative recurrent filters defined in this work is applied, which aim to minimize the training-data bias, namely, the recurrent feature filter (RFF) and the maximally-informative feature filter (MIFF). Finally, the output is classified by using compelling ML techniques, such as gradient boosting, support vector machine, linear discriminant analysis, and linear regression. These models were applied first to a synthetic database, for better descriptive modeling and fine tuning, and then to a real database. Our results confirm that our proposal yields valuable interpretability by identifying the informative features’ weights that link original variables with final objectives. Informative features were living beyond one’s means, lack or absence of a transaction trail, and unexpected overdrafts, which are consistent with other published works. Furthermore, we obtained 76% accuracy in CFD, which represents an improvement of more than 4% on the real databases compared to other published works. We conclude that with the presented methodology we not only reduce dimensionality, but also improve the accuracy and trace the relationships between input and output features, bringing transparency to the ML reasoning process. The results obtained here were used as a starting point for the companion paper, which reports on extending the interpretability to nonlinear ML architectures.
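A rough sketch of such a pipeline (feature filtering followed by a gradient boosting classifier) is shown below. Because the IVI, RFF, and MIFF algorithms are specific to the paper, a standard mutual-information filter is used here as a stand-in, and the synthetic data only mimics the paper's use of a synthetic database for fine tuning.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the fraud data: informative, redundant, and noisy features
X, y = make_classification(n_samples=2000, n_features=30, n_informative=8,
                           n_redundant=6, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Stand-in for the IVI/RFF/MIFF stages: keep the most informative features,
# then classify with gradient boosting
pipeline = make_pipeline(
    SelectKBest(mutual_info_classif, k=10),
    GradientBoostingClassifier(random_state=0),
)
pipeline.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, pipeline.predict(X_test)))
```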
The search for an unbiased company valuation method to reduce uncertainty, whether or not it is automatic, has been a relevant topic in the social sciences and business development for decades. Many methods have been described in the literature, but consensus has not been reached. In the companion paper we aimed to review the assessment capabilities of the traditional company valuation model, based on the company’s intrinsic value using the Discounted Cash Flow (DCF). In this paper, we capitalized on the potential of exogenous information combined with Machine Learning (ML) techniques. To do so, we performed an extensive analysis to evaluate the predictive capabilities of up to 18 different ML techniques. Endogenous variables (features) related to value creation (DCF) proved to be crucial elements for the models, while the incorporation of exogenous, industry- and country-specific ones incrementally improves the ML performance. Bagging Trees, Support Vector Machine Regression, and Gaussian Process Regression methods consistently provided the best results. We concluded that an unbiased model can be created based on endogenous and exogenous information to build a reference framework to price and benchmark Enterprise Value for valuation and credit risk assessment.
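The kind of benchmark described above can be mocked up as follows, comparing the three best-performing model families by cross-validated RMSE. The data here are random placeholders rather than the Eikon-Reuters sample, and the hyperparameters are scikit-learn defaults rather than the paper's tuned settings.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# X: endogenous (DCF-related) plus exogenous (industry/country) features,
# y: observed enterprise value; random placeholders, not the paper's data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=200)

models = {
    "Bagging Trees": BaggingRegressor(random_state=0),
    "SVM Regression": make_pipeline(StandardScaler(), SVR()),
    "Gaussian Process": GaussianProcessRegressor(),
}
for name, model in models.items():
    rmse = -cross_val_score(model, X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    print(f"{name}: RMSE = {rmse:.2f}")
```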
With the emergence of information and communication technologies, a large amount of data has become available to organizations, which creates expectations about its value and content for management purposes. However, the exploratory analysis of available organizational data based on emerging Big Data technologies is still developing in terms of operative tools for solid and interpretable data description. In this work, we addressed the exploratory analysis of organization databases at early stages, when little quantitative information is available about their efficiency. Categorical and metric single-variable tests are proposed and formalized in order to provide a mass criterion to identify regions in forms with clusters of significant variables. Bootstrap resampling techniques are used to provide nonparametric criteria in order to establish easy-to-use statistical tests, so that each single-variable test is represented on a visual and quantitative statistical plot, whereas all the variables in a given form are jointly visualized in the so-called chromosome plots. More detailed profile plots offer deep comparison knowledge for categorical variables across the organization's physical and functional structures, while histogram plots for numerical variables incorporate the statistical significance of the variables under study for preselected Pareto groups. Performance grouping is addressed by identifying two or three groups according to a representative empirical distribution of some convenient grouping feature. The method is applied to perform a Big-Data exploratory analysis on the follow-up forms of the Spanish Red Cross, based on the number of interventions and on a by-record basis. Results showed that a simple one-variable blind-knowledge exploratory Big-Data analysis, such as the one developed in this paper, offers unbiased comparative graphical and numerical information that characterizes organizational dynamics in terms of applied resources, available capacities, and productivity. In particular, the graphical and numerical outputs of the present analysis proved to be a valid tool to isolate the underlying overloaded or under-performing resources in complex organizations. As a consequence, the proposed method provides a systematic and principled way to analyze efficiency in complex organizations, which, combined with organizational internal knowledge, could leverage and validate efficient decision-making.
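As a simple illustration of the bootstrap-based, nonparametric single-variable testing described above, the following sketch compares one numerical variable between two performance groups via a percentile bootstrap confidence interval. The function name and parameters are illustrative, not the paper's exact statistic.

```python
import numpy as np

def bootstrap_group_difference(x_group_a, x_group_b, n_boot=2000, alpha=0.05, seed=0):
    """Nonparametric bootstrap test of the difference in means between two
    performance groups; returns the observed difference, its percentile CI,
    and whether the CI excludes zero."""
    rng = np.random.default_rng(seed)
    a = np.asarray(x_group_a, dtype=float)
    b = np.asarray(x_group_b, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        diffs[i] = (rng.choice(a, a.size, replace=True).mean()
                    - rng.choice(b, b.size, replace=True).mean())
    lower, upper = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    significant = not (lower <= 0.0 <= upper)
    return a.mean() - b.mean(), (lower, upper), significant
```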
► Empirical models of sales promotion are relevant for marketing strategies. ► A simple statistical tool allows operative comparisons among promotional models. ► Bootstrap statistical description is used to evaluate the models in terms of average and scatter measurements. ► Different figures of merit, and structured parameter selection, allowed an optimized promotion modeling. ► Prediction quality was robust with respect to the design parameter selection.
Sales promotions have become in recent years a paramount issue in the marketing strategies of many companies, and they have even more relevance in the present economic situation. Currently, empirical models aimed at assessing consumer behavior in response to certain sales promotion activities, such as temporary price reductions, are receiving growing attention in this relevant research field, mainly for two reasons: (1) the complexity of the interactions among the different elements incorporated inside promotional campaigns attracts growing attention; and (2) the increased availability of electronic records on sales history. Hence, it becomes important that the performance description and comparison among all available machine learning promotion models, as well as their design parameter selection, be performed using a robust and statistically rigorous procedure, while keeping functionality and usefulness. In this paper, we first propose a simple nonparametric statistical tool, based on paired bootstrap resampling, to allow an operative result comparison among different learning-from-samples promotional models. Second, we use the bootstrap statistical description to evaluate the models in terms of average and scatter measurements, for a more complete efficiency characterization of the promotional sales models. These statistical characterizations allow us to readily work with the distribution of the actual risk, in order to avoid overoptimistic performance evaluation in the machine learning based models. We also present the analysis performed to determine whether the figure of merit has a significant impact on the final result, together with an in-depth design parameter selection to optimize final results during the promotion evaluation using statistical learning techniques. No significant difference was obtained in terms of the figure of merit choice, and the Mean Absolute Error was selected for performance measurement. In summary, the applied technique clarifies the design of the promotional sales models for a real database (milk category), according to the influence of the figure of merit used for design parameter selection, showing the robustness of the machine learning techniques in this setting. Results obtained in this paper are subsequently applied, and presented in the companion paper, devoted to a more detailed quality analysis, to evaluate four well-known machine learning algorithms on real databases for two categories with different promotional behavior.
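The paired bootstrap comparison described above can be sketched as follows: both promotional models are evaluated on the same resampled test indices, so the distribution of the MAE difference directly reflects the paired comparison. This is a minimal illustration with assumed variable names, not the paper's implementation.

```python
import numpy as np

def paired_bootstrap_mae(y_true, pred_a, pred_b, n_boot=2000, seed=0):
    """Paired bootstrap over test samples: resample indices, compute the MAE of
    both promotional models on the same resample, and return the distribution
    of the difference (model A minus model B)."""
    rng = np.random.default_rng(seed)
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    n = y_true.size
    deltas = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)
        mae_a = np.mean(np.abs(y_true[idx] - pred_a[idx]))
        mae_b = np.mean(np.abs(y_true[idx] - pred_b[idx]))
        deltas[i] = mae_a - mae_b
    return deltas  # e.g., np.quantile(deltas, [0.025, 0.975]) for a 95% CI
```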
Objective: Heart rate turbulence (HRT) has been successfully explored for cardiac risk stratification. While HRT is known to be influenced by the heart rate (HR) and the coupling interval (CI), nonconcordant results have been reported on how the CI influences HRT. The purpose of this study is to investigate HRT changes in terms of CI and HR by means of a specially designed protocol. Methods: A dataset was acquired from 11 patients with structurally normal hearts in whom CI was altered by different pacing trains and HR by isoproterenol during an electrophysiological study (EPS). The protocol was designed so that, first, the effect of HR changes on HRT and, second, the combined effect of HR and CI could be explored. As a complement to the EPS dataset, a database of 24-h Holters from 61 acute myocardial infarction (AMI) patients was studied for the purpose of assessing risk. Data analysis was performed by using different nonlinear ridge regression models, and the relevance of the model variables was assessed using resampling methods. The EPS subjects, with and without isoproterenol, were analyzed separately. Results: The proposed nonlinear regression models were found to account for the influence of HR and CI on HRT, both in patients undergoing EPS without isoproterenol and in low-risk AMI patients, whereas this influence was absent in high-risk AMI patients. Moreover, model coefficients related to CI were not statistically significant (p > 0.05) in EPS subjects with isoproterenol. Conclusion: The observed relationship between CI and HRT, which is in agreement with the baroreflex hypothesis, was statistically significant (p < 0.05) when decoupling the effect of HR and normalizing the CI by the HR. Significance: The results of this study can help to provide new risk indicators that take into account the physiological influence on HRT, as well as to model how this influence changes in different cardiac conditions.
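A rough analogue of the nonlinear ridge regression analysis mentioned in the Methods can be sketched with kernel ridge regression, modeling an HRT parameter as a nonlinear function of HR and CI. The data below are synthetic placeholders, not the EPS or Holter datasets, and the kernel and regularization choices are illustrative rather than the study's model.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score

# Synthetic (HR, CI) -> HRT-parameter data standing in for the study's measurements
rng = np.random.default_rng(0)
hr = rng.uniform(60, 110, 200)    # heart rate, beats per minute
ci = rng.uniform(0.4, 0.8, 200)   # coupling interval (here already normalized by HR)
hrt = 0.02 * (110 - hr) + 1.5 * ci + rng.normal(scale=0.2, size=200)

X = np.column_stack([hr, ci])
model = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.1)  # nonlinear ridge via RBF kernel
r2 = cross_val_score(model, X, hrt, cv=5, scoring="r2").mean()
print(f"cross-validated R^2: {r2:.2f}")
```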
Great effort has been devoted in recent years to the development of sudden cardiac risk predictors as a function of electric cardiac signals, mainly obtained from electrocardiogram (ECG) analysis. However, these prediction techniques are still seldom used in clinical practice, partly due to their limited diagnostic accuracy and to the lack of consensus about the appropriate computational signal processing implementation. This paper takes a three-fold approach, based on ECG indices, to structure this review on sudden cardiac risk stratification: first, through the computational techniques that have been widely proposed in the technical literature for obtaining these indices; second, through the scientific evidence, which, although supported by observational clinical studies, is not always representative enough; and third, through the limited technology transfer of academia-accepted algorithms, which requires further reflection for future systems. We focus on three families of ECG-derived indices, which are tackled from the aforementioned viewpoints, namely, heart rate turbulence (HRT), heart rate variability (HRV), and T-wave alternans. In terms of computational algorithms, we still need clearer scientific evidence, standardization, and benchmarking, relying on advanced algorithms applied over large and representative datasets. New scenarios such as electronic health records, big data, long-term monitoring, and cloud databases will eventually open new frameworks for suitable new paradigms in the near future.