In this article, we provide a comprehensive introduction to knowledge graphs, which have recently garnered significant attention from both industry and academia in scenarios that require exploiting diverse, dynamic, large-scale collections of data. After some opening remarks, we motivate and contrast various graph-based data models, as well as languages used to query and validate knowledge graphs. We explain how knowledge can be represented and extracted using a combination of deductive and inductive techniques. We conclude with high-level future research directions for knowledge graphs.
Graph database models can be defined as those in which data structures for the schema and instances are modeled as graphs or generalizations of them, and data manipulation is expressed by graph-oriented operations and type constructors. These models took off in the eighties and early nineties alongside object-oriented models. Their influence gradually died out with the emergence of other database models, in particular geographical, spatial, semistructured, and XML. Recently, the need to manage information with graph-like nature has reestablished the relevance of this area. The main objective of this survey is to present the work that has been conducted in the area of graph database modeling, concentrating on data structures, query languages, and integrity constraints.
Graph databases are becoming increasingly popular for modeling different kinds of networks for data analysis. They are built over the property graph data model, where nodes and edges are annotated with property-value pairs. Most existing work in the field is based on graphs where the temporal dimension is not considered. However, time is present in most real-world problems. Many different kinds of changes may occur in a graph as the world it represents evolves across time. For instance, edges, nodes, and properties can be added and/or deleted, and property values can be updated. This paper addresses the problem of modeling, storing, and querying temporal property graphs, which allows the history of a graph database to be kept. It introduces a temporal graph data model, where nodes and relationships contain attributes (properties) timestamped with a validity interval. Graphs in this model can be heterogeneous, that is, relationships may be of different kinds. Associated with the model, a high-level graph query language, denoted T-GQL, is presented, together with a collection of algorithms for computing different kinds of temporal paths in a graph, capturing different temporal path semantics. T-GQL can express queries like “Give me the friends of the friends of Mary, who lived in Brussels at the same time as her, and also give me the periods when this happened”. As a proof-of-concept, a Neo4j-based implementation of the above is also presented, and a client-side interface allows submitting queries in T-GQL to a Neo4j server. Finally, experiments were carried out over synthetic and real-world data sets, with a twofold goal: on the one hand, to show the plausibility of the approach; on the other hand, to analyze the factors that affect performance, like the length of the paths mentioned in the query, and the size of the graph.
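The example query about Mary can be illustrated with a minimal Python sketch. It is not T-GQL and not the paper's implementation: the `Interval` class, the toy data, and the function names are all hypothetical, and friendship edges are left untimestamped for brevity, whereas the model described above also timestamps relationships.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed validity interval [start, end], here in years (an assumption)."""
    start: int
    end: int

    def intersect(self, other):
        lo, hi = max(self.start, other.start), min(self.end, other.end)
        return Interval(lo, hi) if lo <= hi else None


# Hypothetical toy data: a "lived in" property timestamped with validity intervals.
lived_in = {
    "Mary": [("Brussels", Interval(2000, 2010))],
    "Ann": [("Brussels", Interval(2005, 2015))],
    "Bob": [("Paris", Interval(2000, 2020))],
    "Carl": [("Brussels", Interval(2008, 2012)), ("Paris", Interval(2013, 2020))],
    "Dora": [("Brussels", Interval(2011, 2014))],
}

# Hypothetical friendship edges (untimestamped in this sketch).
friend = {
    "Mary": {"Ann", "Bob"},
    "Ann": {"Mary", "Carl"},
    "Bob": {"Mary", "Dora"},
    "Carl": {"Ann"},
    "Dora": {"Bob"},
}


def friends_of_friends(person):
    """People two friendship hops away, excluding the person themselves."""
    result = set()
    for f in friend.get(person, ()):
        result |= friend.get(f, set())
    result.discard(person)
    return result


def co_residents(person, city):
    """Map each friend-of-friend to the periods they shared `city` with `person`."""
    own = [iv for c, iv in lived_in.get(person, []) if c == city]
    shared = {}
    for fof in sorted(friends_of_friends(person)):
        periods = []
        for c, iv in lived_in.get(fof, []):
            if c != city:
                continue
            for mine in own:
                common = mine.intersect(iv)
                if common is not None:
                    periods.append(common)
        if periods:
            shared[fof] = periods
    return shared


print(co_residents("Mary", "Brussels"))
```

In this toy data, Carl overlaps with Mary in Brussels while Dora arrives after Mary has left, so only Carl is returned together with the shared period.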
We survey foundational features underlying modern graph query languages. We first discuss two popular graph data models: edge-labelled graphs, where nodes are connected by directed, labelled edges, and property graphs, where nodes and edges can further have attributes. Next we discuss the two most fundamental graph querying functionalities: graph patterns and navigational expressions. We start with graph patterns, in which a graph-structured query is matched against the data. Thereafter, we discuss navigational expressions, in which patterns can be matched recursively against the graph to navigate paths of arbitrary length; we give an overview of what kinds of expressions have been proposed and how they can be combined with graph patterns. We also discuss several semantics under which queries using the previous features can be evaluated, what effects the selection of features and semantics has on complexity, and offer examples of such features in three modern languages that are used to query graphs: SPARQL, Cypher, and Gremlin. We conclude by discussing the importance of formalisation for graph query languages; a summary of what is known about SPARQL, Cypher, and Gremlin in terms of expressivity and complexity; and an outline of possible future directions for the area.
Numerous irregular graph datasets, for example social networks or web graphs, may contain even trillions of edges. Often, their structure changes over time and they have domain-specific rich data associated with vertices and edges. Graph database systems such as Neo4j enable storing, processing, and analyzing such large, evolving, and rich datasets. Due to the sheer size and irregularity of such datasets, these systems face unique design challenges. To facilitate the understanding of this emerging domain, we present the first survey and taxonomy of graph database systems. We focus on identifying and analyzing fundamental categories of these systems (e.g., document stores, tuple stores, native graph database systems, or object-oriented systems), the associated graph models (e.g., Resource Description Framework or Labeled Property Graph), data organization techniques (e.g., storing graph data in indexing structures or dividing data into records), and different aspects of data distribution and query execution (e.g., support for sharding and Atomicity, Consistency, Isolation, Durability). Fifty-one graph database systems are presented and compared, including Neo4j, OrientDB, and Virtuoso. We outline graph database queries and relationships with associated domains (NoSQL stores, graph streaming, and dynamic graph algorithms). Finally, we outline future research and engineering challenges related to graph databases.
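The two querying functionalities named above, graph patterns and navigational expressions, can be sketched over a tiny edge-labelled graph. This is an illustrative Python sketch only: the data, the `?`-prefixed variable convention, and the function names are assumptions, the matcher uses homomorphism-based semantics (one of several semantics the survey discusses), and the navigational part covers only a single-label `label+` expression.

```python
from collections import deque

# Hypothetical edge-labelled graph as a set of (source, label, target) triples.
edges = {
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "knows", "dave"),
    ("alice", "worksAt", "acme"),
    ("carol", "worksAt", "acme"),
}


def is_var(term):
    """Terms starting with '?' are treated as variables (a convention of this sketch)."""
    return term.startswith("?")


def match_pattern(pattern, graph):
    """Return all variable bindings under which every triple pattern maps to an edge."""
    solutions = [{}]
    for triple in pattern:
        extended = []
        for binding in solutions:
            for edge in graph:
                candidate = dict(binding)
                ok = True
                for term, value in zip(triple, edge):
                    if is_var(term):
                        # Bind the variable, or check consistency with an earlier binding.
                        if candidate.setdefault(term, value) != value:
                            ok = False
                            break
                    elif term != value:
                        ok = False
                        break
                if ok:
                    extended.append(candidate)
        solutions = extended
    return solutions


def reachable(start, label, graph):
    """Nodes reachable from `start` via one or more `label` edges (label+)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for s, p, o in graph:
            if s == node and p == label and o not in seen:
                seen.add(o)
                queue.append(o)
    return seen


# Graph pattern: who knows someone that works at acme?
print(match_pattern([("?x", "knows", "?y"), ("?y", "worksAt", "acme")], edges))
# Navigational expression: everyone reachable from alice through knows+.
print(sorted(reachable("alice", "knows", edges)))
```

The pattern matcher is a naive nested-loop join, chosen for readability rather than efficiency; real engines would use indexes and join ordering.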
The vast majority of security breaches encountered today are a direct result of insecure code. Consequently, the protection of computer systems critically depends on the rigorous identification of vulnerabilities in software, a tedious and error-prone process requiring significant expertise. Unfortunately, a single flaw suffices to undermine the security of a system and thus the sheer amount of code to audit plays into the attacker's hands. In this paper, we present a method to effectively mine large amounts of source code for vulnerabilities. To this end, we introduce a novel representation of source code called a code property graph that merges concepts of classic program analysis, namely abstract syntax trees, control flow graphs and program dependence graphs, into a joint data structure. This comprehensive representation enables us to elegantly model templates for common vulnerabilities with graph traversals that, for instance, can identify buffer overflows, integer overflows, format string vulnerabilities, or memory disclosures. We implement our approach using a popular graph database and demonstrate its efficacy by identifying 18 previously unknown vulnerabilities in the source code of the Linux kernel.
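The idea of one graph carrying several kinds of program-analysis edges can be sketched in a few lines of Python. This is a heavily simplified illustration under stated assumptions, not the paper's implementation: the `CodePropertyGraph` class, the edge kinds `AST`, `CFG`, and `PDG` as plain strings, the toy statements, and the `tainted_sinks` traversal are all hypothetical.

```python
from collections import defaultdict


class CodePropertyGraph:
    """Toy property graph whose edges are tagged AST, CFG, or PDG (assumed labels)."""

    def __init__(self):
        self.nodes = {}                # node id -> property dict
        self.out = defaultdict(list)   # node id -> [(edge kind, target id)]

    def add_node(self, nid, **props):
        self.nodes[nid] = props

    def add_edge(self, src, kind, dst):
        self.out[src].append((kind, dst))

    def successors(self, nid, kind):
        return [dst for k, dst in self.out[nid] if k == kind]

    def reaches(self, src, dst, kind):
        """True if `dst` can be reached from `src` using only edges of `kind`."""
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.successors(node, kind))
        return False


# Hypothetical three-statement program.
cpg = CodePropertyGraph()
cpg.add_node("s1", type="call", name="read_len", code="len = read_len()")
cpg.add_node("s2", type="call", name="memcpy", code="memcpy(buf, src, len)")
cpg.add_node("s3", type="call", name="memcpy", code="memcpy(dst, tmp, 16)")
cpg.add_node("a1", type="identifier", name="len")
cpg.add_edge("s1", "CFG", "s2")   # control flows from s1 to s2
cpg.add_edge("s2", "CFG", "s3")
cpg.add_edge("s1", "PDG", "s2")   # s2 uses the value defined in s1
cpg.add_edge("s2", "AST", "a1")   # the argument `len` is a syntactic child of the call


def tainted_sinks(graph, source, sink):
    """Calls to `sink` that are data-dependent on a call to `source`."""
    calls = {n: p for n, p in graph.nodes.items() if p.get("type") == "call"}
    sources = [n for n, p in calls.items() if p["name"] == source]
    sinks = [n for n, p in calls.items() if p["name"] == sink]
    return sorted(s for s in sinks if any(graph.reaches(src, s, "PDG") for src in sources))


print(tainted_sinks(cpg, source="read_len", sink="memcpy"))
```

Only the first `memcpy` call depends on the value returned by `read_len`, so the traversal flags `s2` and leaves the constant-size copy `s3` alone. A realistic template would also check for sanitizing conditions along the control flow.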
Graph processing is becoming increasingly prevalent across many application domains. In spite of this prevalence, there is little research about how graphs are actually used in practice. We performed an extensive study that consisted of an online survey of 89 users, a review of the mailing lists, source repositories, and white papers of a large suite of graph software products, and in-person interviews with 6 users and 2 developers of these products. Our online survey aimed at understanding: (i) the types of graphs users have; (ii) the graph computations users run; (iii) the types of graph software users use; and (iv) the major challenges users face when processing their graphs. We describe the participants’ responses to our questions, highlighting common patterns and challenges. Based on our interviews and survey of the rest of our sources, we were able to answer some new questions that were raised by participants’ responses to our online survey and understand the specific applications that use graph data and software. Our study revealed surprising facts about graph processing in practice. In particular, real-world graphs represent a very diverse range of entities and are often very large; scalability and visualization are undeniably the most pressing challenges faced by participants; and data integration, recommendations, and fraud detection are very popular applications supported by existing graph software. We hope these findings can guide future research.
We propose a recommender system that, starting from a set of users’ skills, identifies the most suitable jobs as they emerge from a large dataset of Online Job Vacancies (OJVs). To this aim, we process 2.5M+ OJVs posted in three different countries (United Kingdom, France, and Germany), training several embeddings and performing an intrinsic evaluation of their quality. In addition, we compute a measure of skill importance for each occupation in each country, the Revealed Comparative Advantage (rca). The best vector model, one for each country, together with the rca, is used to feed a graph database, which will serve as the keystone for the recommender system. Results are evaluated through a user study of 10 labor market experts, using P@3 and nDCG as scores. Results show a high precision for the recommendations provided by skills2job, and the high values of nDCG (0.985 and 0.984 in a [0, 1] range) indicate a strong correlation between the experts’ scores and the rankings generated by skills2job.
• We explore labor market data using word embeddings and frequency measures accordance.
• We organize those resources as a graph to perform graph-traversal queries.
• We process 5M+ vacancies for ICT occupations in United Kingdom, Germany, and France.
• We present skills2job, a system recommending the suitable occupations to the user.
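The measures named in the skills2job abstract (rca, P@3, and nDCG) can be sketched in Python as follows. The abstract does not give the exact formulas, so these are assumptions: `rca` follows the standard Balassa-style index applied to skill counts per occupation, and `ndcg` uses linear gains with a log2 discount, one of several common variants. All names and the toy counts are hypothetical.

```python
import math


def rca(counts, occupation, skill):
    """Share of `skill` within `occupation`, divided by its share across all occupations."""
    occ_total = sum(counts[occupation].values())
    skill_total = sum(c.get(skill, 0) for c in counts.values())
    grand_total = sum(sum(c.values()) for c in counts.values())
    return (counts[occupation].get(skill, 0) / occ_total) / (skill_total / grand_total)


def precision_at_k(relevant_flags, k=3):
    """Fraction of the top-k recommendations judged relevant (1) rather than not (0)."""
    return sum(relevant_flags[:k]) / k


def dcg(gains):
    """Discounted cumulative gain with linear gains and a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(gains):
    """DCG of the given ranking, normalized by the DCG of the ideal ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Hypothetical skill counts per occupation.
counts = {
    "developer": {"python": 8, "excel": 2},
    "analyst": {"python": 2, "excel": 8},
}
print(rca(counts, "developer", "python"))   # values above 1 mean over-representation
print(precision_at_k([1, 0, 1, 1]))
print(ndcg([3, 2, 1]), ndcg([1, 2, 3]))
```

An rca above 1 indicates that a skill is requested more often for an occupation than across the whole dataset, which is what makes it usable as a skill-importance weight.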
With the digitization of historical databases and endowments, care must be taken when designing the framework for a web-based information system. Because conflicts arise frequently in practice, the preservation of waqf property calls for distinct data management requirements. Creating and deploying information systems for the historical records and endowments of this kind of inheritance requires an appropriate management plan. Waqf is an example of an inheritance that is considered distinct from customary ones because it is governed by its own law. Such inheritances typically comprise histories and endowments that need to be protected to ensure sustenance among the population and to ensure they live up to the standards of the community and country. This research was compiled and analyzed to support stakeholders in producing a more practical, focused, and value-delivering framework. The datasets were mapped based on relationships, graph databases, and semantic networks. Moreover, the framework was developed using several data representation models to ensure easier, faster, and more accurate methods of displaying the data. The design was made available to users, administrators, and managers, with the latter group being in charge of maintaining data control over each entity. The case study was conducted using historical information and waqf from the Nadzir Pangeran Sumedang Indonesia Waqf Foundation (YNWPS) in the Kingdom of Sumedang Larang Indonesia (KSL). The structured framework design simplified the creation of a web-based information system that keeps track of the data in each entity and ensures better preservation of historical genealogical databases and endowments.