Improving OLAP (Online Analytical Processing) query performance in a distributed system on top of Hadoop is a challenging task. An OLAP cube query comprises several relational operations, such as selection, join, and group-by aggregation. It is well known that the star join and the group-by aggregation are the most costly operations in a Hadoop database system. These operations increase network traffic and may overflow memory; to overcome these difficulties, numerous partitioning and data load balancing techniques have been proposed in the literature. However, some issues remain open, such as reducing the number of Spark stages and the network I/O of an OLAP query executed on a distributed system. In previous work, we proposed a novel data placement strategy for a big data warehouse over a Hadoop cluster. This data warehouse schema enhances the projection, selection, and star-join operations of an OLAP query, so that the system’s query optimizer can perform the star join locally, in a single Spark stage and without a shuffle phase. The system can also skip loading unnecessary data blocks when evaluating predicates. In this paper, we extend our previous work with further technical details and experiments, and we propose a new dynamic approach to improve the group-by aggregation. To evaluate our approach, we conduct experiments on a 15-node cluster. Experimental results show that our method outperforms existing approaches in terms of OLAP query evaluation time.
• The group-by aggregation may incur a high communication cost during the shuffle phase.
• We propose a dynamic technique for the Partitioning and Load Balancing (PLB) of data.
• Our approach enhances OLAP query execution time over Hadoop clusters compared to existing approaches.
• Our approach combines data-driven and workload-driven models.
• The star join operation is performed in a single Spark stage, without a shuffle phase.
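As a rough illustration of the shuffle-free star join highlighted above, the following PySpark sketch buckets a hypothetical fact table and dimension table on their join key so that Spark can plan the sort-merge join without an exchange. This is a minimal sketch of the co-location idea only, not the paper's implementation; all table names, paths, and the bucket count are invented.

```python
# Minimal PySpark sketch: bucketing both sides of a star join on the join
# key co-locates matching keys, so the sort-merge join needs no shuffle.
# Table names, paths, and the bucket count are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("bucketed-star-join-sketch")
         .enableHiveSupport()        # persistent bucketed tables need a catalog
         .getOrCreate())

fact = spark.read.parquet("/warehouse/raw/sales")      # hypothetical path
dim = spark.read.parquet("/warehouse/raw/products")    # hypothetical path

fact.write.bucketBy(16, "product_id").sortBy("product_id") \
    .mode("overwrite").saveAsTable("sales_bucketed")
dim.write.bucketBy(16, "product_id").sortBy("product_id") \
    .mode("overwrite").saveAsTable("products_bucketed")

# With identical bucketing on both sides, Spark plans this join without an
# Exchange (shuffle) operator.
joined = spark.table("sales_bucketed").join(
    spark.table("products_bucketed"), "product_id")
joined.explain()   # the plan should contain no Exchange before the join
```

Bucketing trades a one-time write cost for repeated shuffle-free joins; it is only one way to realize co-location, and the paper's own placement strategy may differ.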
Nowadays, multidimensional models are recognized to best reflect decision makers’ analytical view of data. Classical multidimensional models were meant to analyze conventional data (numerical and categorical). However, they fail to handle data complexity, which manifests in the multiplicity of data sources, the heterogeneity of formats, the diversity of structures, etc. To this end, new multidimensional models have been proposed for OLAP purposes. Nevertheless, data complexity is only partially covered by these models, which can weaken decision making. In our previous work, we proposed to integrate data complexity within a complex object-based multidimensional model. In this paper, based on our proposed model, we provide adapted OLAP operators that take data complexity into account. Thus, we define operators to create complex data cubes, to visualize them, and to analyze them.
In this paper, we present a mixed approach for building XML data warehouses from both XML data sources and user requirements. Our approach aims at obtaining a unique multidimensional schema of the XML data warehouse and follows three steps. In the first step, an intermediate SBVR model extended with template rules is used to accommodate a data warehousing system and to facilitate the automatic identification of facts and dimensions from the user requirements. After modelling the XML data sources in UML, the second step identifies candidate DW schemata from these data sources. The third step compares the candidate schemata with the reference model obtained from the user requirements. In this step, we propose to adapt similarity metric-extended Boolean models (BIR) and to use them to measure, rank, and select the most appropriate data warehouse schema. Such a schema should best describe the data sources and exhaustively cover the user requirements. To demonstrate our approach, we present a case study of the bibliographic database dblp.
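To make the schema-ranking step concrete, here is a minimal sketch, assuming schemata reduced to bags of element names and a simple Jaccard similarity in place of the paper's BIR-based extended Boolean metric; the dblp-like schemata are invented for illustration.

```python
# Hypothetical sketch of ranking candidate DW schemata against a reference
# schema derived from user requirements. Schemata are reduced to sets of
# element names (facts, dimensions, attributes); similarity is Jaccard-style.
# This illustrates the ranking idea only, not the paper's BIR-based metric.
def schema_terms(schema):
    """Flatten a schema dict {fact: [dimension/attribute names]} into a set."""
    terms = set()
    for fact, elements in schema.items():
        terms.add(fact.lower())
        terms.update(e.lower() for e in elements)
    return terms

def similarity(candidate, reference):
    """Jaccard similarity between two schemata's element sets."""
    c, r = schema_terms(candidate), schema_terms(reference)
    return len(c & r) / len(c | r) if c | r else 0.0

# Invented dblp-like schemata (the paper's case study is dblp).
reference = {"publication": ["author", "venue", "year", "title"]}
candidates = {
    "cand_A": {"publication": ["author", "year", "venue"]},
    "cand_B": {"article": ["editor", "volume", "number"]},
}
ranked = sorted(candidates.items(),
                key=lambda kv: similarity(kv[1], reference), reverse=True)
for name, schema in ranked:
    print(name, round(similarity(schema, reference), 3))
```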
This paper presents a multidimensional model and a language to construct cubes for the purpose of on-line analytical processing. Both the multidimensional model and the cube model are based on the concept of complex object, which models complex entities of the real world. The multidimensional model is presented at two layers: the class diagram layer and the package layer. Both layers are used by a projection operation that aims at extracting cubes: at the package diagram layer, the projection dynamically assigns the roles of fact and dimensions to the complex objects of the multidimensional model, whereas at the class diagram layer, it allows designing the measures. We also provide operations that optimize the construction of new cubes by reusing existing ones. The set of operations for cube construction is expressed by formal operators, thus forming a language. To show the feasibility of our multidimensional model and operators, we present implementation details of a real case study.
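A minimal sketch of the role-assignment idea behind the projection operation follows, assuming complex objects reduced to plain records; the class names and example cube are hypothetical and much simpler than the paper's two-layer model.

```python
# Hypothetical sketch of a projection that builds a cube by dynamically
# assigning fact/dimension roles to complex objects, as described above.
from dataclasses import dataclass, field

@dataclass
class ComplexObject:
    name: str
    attributes: dict = field(default_factory=dict)

@dataclass
class Cube:
    fact: ComplexObject
    dimensions: list
    measures: list   # attribute names of the fact chosen as measures

def project(objects, fact_name, dimension_names, measure_attrs):
    """Assign roles: one complex object becomes the fact, others dimensions."""
    by_name = {o.name: o for o in objects}
    return Cube(fact=by_name[fact_name],
                dimensions=[by_name[d] for d in dimension_names],
                measures=measure_attrs)

objects = [ComplexObject("Sale", {"amount": 0.0, "qty": 0}),
           ComplexObject("Customer"), ComplexObject("Product")]
cube = project(objects, "Sale", ["Customer", "Product"], ["amount", "qty"])
print(cube.fact.name, [d.name for d in cube.dimensions], cube.measures)
```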
Efficient Compression and Storage of XML OLAP Cubes
Boukraa, Doulkifli; Bouchoukh, Mohammed Amin; Boussaid, Omar
International Journal of Data Warehousing and Mining, 07/2015, Volume 11, Issue 3
Journal article, peer reviewed
In this paper, the authors present an approach to efficiently compress XML OLAP cubes. They propose a multidimensional snowflake schema of the cube as the basic physical configuration. The cube is then composed of one XML fact document and as many XML documents as there are dimension hierarchy members. The basic configuration is reorganized in two ways by deliberately adding data redundancy, in order to achieve a better compression ratio on the one hand and to improve query response time on the other. In the second configuration, all the documents of the cube are merged into a single XML document. In the third configuration, each reference between the fact and the dimensions, or between the members of a dimension hierarchy, is replaced by the whole referenced XML fragment. To the three physical configurations of the cube, the authors apply a new compression technique named XCC. They demonstrate the efficiency of the third configuration before and after compression, and they also show the efficiency of their compression technique when applied to XML OLAP cubes.
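To illustrate the third configuration, the following sketch replaces a fact-to-dimension reference by a deep copy of the referenced fragment, introducing the deliberate redundancy described above; the element and attribute names are invented.

```python
# Hypothetical sketch of the third configuration: each reference from the
# fact to a dimension member is replaced by the whole referenced XML
# fragment. Element and attribute names are invented for illustration.
import xml.etree.ElementTree as ET
import copy

dimensions = ET.fromstring(
    "<products><product id='p1'><name>Phone</name>"
    "<category>Electronics</category></product></products>")
fact = ET.fromstring(
    "<facts><sale><productRef ref='p1'/><amount>120</amount></sale></facts>")

members = {p.get("id"): p for p in dimensions.iter("product")}
for sale in fact.iter("sale"):
    ref = sale.find("productRef")
    # Inline a deep copy of the referenced member in place of the reference.
    sale.remove(ref)
    sale.append(copy.deepcopy(members[ref.get("ref")]))

print(ET.tostring(fact, encoding="unicode"))
```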
Enabling Self-Service OLAP on NoSQL Data Lakes
Boukraa, Doulkifli; Sellamna, Hamza; Chidekh, Yacine
2024 6th International Conference on Pattern Analysis and Intelligent Systems (PAIS), 24 April 2024
Conference proceeding
Online analytical processing plays a crucial role in data analysis by providing a powerful framework for efficiently exploring and understanding multidimensional data. With the increasing complexity and volume of data in modern organisations, the main challenge is to build OLAP cubes from distributed heterogeneous sources, while accounting for attribute and relationship overlap across the sources, and to let analysts create their cubes on demand to maximise their usefulness. We propose a self-service, metadata-driven approach for designing and generating OLAP cubes from the heterogeneous NoSQL sources composing a data lake. We illustrate our approach through a realistic scenario where the data is scattered and overlaps across different sources. We present a proof-of-concept application that implements the proposed approach, and we assess our proposal to highlight the efficacy and usefulness of the generated cubes for analytical purposes.
Object Part Appearance Module built into Yolo for Occlusion
Remmouche, Brahim; Taffar, Mokhtar; Boukraa, Doulkifli
2024 8th International Conference on Image and Signal Processing and their Applications (ISPA), 21 April 2024
Conference proceeding
Despite recent advances in detection and recognition models, the development of robust and accurate algorithms for describing and classifying object visual content in real-life scenarios remains a significant challenge. This paper introduces an object part appearance module, OPA, within the OPA-Yolo framework, specifically designed for occluded object detection. Our approach combines handcrafted features with deep ones across network layers to capture the visual content parts of occluded object classes. Mapped deep features identify different regions of interest, characterizing object part appearances with a significant amount of statistical information. To enrich this feature representation with invariant local features, we train the network to learn an optimal combination of object appearances, providing it with an additional attention mechanism. The model is built on a recent YOLO object detector architecture. Evaluated on the PASCAL-VOC dataset, OPA-Yolo achieved an average detection rate of approximately 24.76% on occluded objects, some with an occlusion ratio of up to 60%.
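In the same spirit as the handcrafted/deep feature combination described above, and purely as a hypothetical sketch (this is not the OPA module itself), one could fuse a simple handcrafted statistic with a deep feature map as follows; the module name, the chosen statistic, and all shapes are assumptions.

```python
# Hypothetical fusion sketch (not the actual OPA module): augment a deep
# feature map with a handcrafted statistic (per-location channel variance)
# and fuse both with a 1x1 convolution.
import torch
import torch.nn as nn

class HandcraftedFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(channels + 1, channels, kernel_size=1)

    def forward(self, x):                    # x: (N, C, H, W)
        stat = x.var(dim=1, keepdim=True)    # handcrafted map: (N, 1, H, W)
        return self.fuse(torch.cat([x, stat], dim=1))

feat = torch.randn(2, 64, 32, 32)            # dummy backbone feature map
print(HandcraftedFusion(64)(feat).shape)     # torch.Size([2, 64, 32, 32])
```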
Nowadays, huge amounts of data are continuously created from different sources in different formats and stored in data lakes for further use. A typical use case is to conduct online analyses in order to gain business insights and make better decisions. However, there are many obstacles to overcome when it comes to analysing Big Data online, such as dealing with schema-free data, reconciling different data formats, managing different locations, and allowing BI professionals and analysts to create their analytical data by themselves. In this paper, we propose solutions to overcome these obstacles: a data lake metadata model as well as a metadata-driven approach to create OLAP cubes from data lakes on demand and in a self-service manner. We apply our work to the Twitter social network and present a dedicated proof-of-concept application.
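A minimal sketch of what such a data lake metadata model might look like follows, assuming sources described by named, typed attributes and cubes specified declaratively on top of them; all class and attribute names are invented, not the paper's model.

```python
# Hypothetical sketch of a metadata model driving on-demand cube creation,
# in the spirit of the approach above (all names are invented).
from dataclasses import dataclass, field

@dataclass
class SourceAttribute:
    name: str
    dtype: str                    # e.g. "string", "int", "timestamp"

@dataclass
class LakeSource:
    name: str                     # e.g. a collection, topic, or file set
    fmt: str                      # e.g. "json", "parquet"
    attributes: list = field(default_factory=list)

@dataclass
class CubeSpec:
    fact_source: str
    measures: list                # attribute names to aggregate
    dimensions: dict              # dimension name -> (source, key attribute)

tweets = LakeSource("tweets", "json",
                    [SourceAttribute("user_id", "string"),
                     SourceAttribute("retweets", "int"),
                     SourceAttribute("created_at", "timestamp")])
users = LakeSource("users", "json", [SourceAttribute("user_id", "string")])

spec = CubeSpec(fact_source="tweets", measures=["retweets"],
                dimensions={"user": ("users", "user_id")})
print(spec)
```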
The data produced by educational activities can be exploited to extract useful knowledge, assist educational decision makers in making better decisions, and help students achieve better results. In this study, we report our findings on the application of a data mining technique following the CRISP-DM model at the Department of Computer Science of the University of Jijel, Algeria. Our proposed system is able to classify undergraduate and postgraduate students according to their results and to predict their performance for the coming years based on their current results and on historical data. The system can also be used as an early-warning tool for students at risk and to help graduates choose the appropriate Master's disciplines to pursue their studies.
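As a hedged illustration of the classification and prediction task described above, the following sketch trains a small decision tree on invented grade features to flag at-risk students; the features, data, and choice of classifier are assumptions, not the study's actual setup.

```python
# Hypothetical sketch of student performance classification: a decision
# tree trained on past results to flag at-risk students. Features and
# data are invented; the study's actual technique may differ.
from sklearn.tree import DecisionTreeClassifier

# Each row: [year-1 average, year-2 average, attendance rate]
X = [[12.5, 13.0, 0.95], [8.0, 9.5, 0.60],
     [15.0, 14.5, 0.90], [7.5, 8.0, 0.55]]
y = ["pass", "at_risk", "pass", "at_risk"]     # historical outcomes

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(clf.predict([[9.0, 10.0, 0.70]]))        # early-warning prediction
```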