Knowledge Graphs versus Property Graphs: Similarities, Differences and Some Guidance on Capabilities - TopQuadrant
We are in the era of graphs. Graphs are hot. Why? Flexibility is one strong driver: heterogeneous data, integrating new data sources, and analytics all require flexibility. Graphs deliver it in spades.

Over the last few years, a number of new graph databases came to market. As we start the next decade, dare we say “the semantic twenties,” we also see vendors that never before mentioned graphs starting to position their products and solutions as graphs or graph-based.

Graph databases are one thing, but “Knowledge Graphs” are an even hotter topic. TopBraid EDG is a solution for creating Knowledge Graphs and putting them to work. (See the TopBraid EDG section below for more information.) As a result, we are often asked to explain Knowledge Graphs.
• What are they?
• Why and where are they useful?
• How are they different from “just graphs?”

At the recent Data Governance Vision conference, we gave a talk on the topic of supporting Data Governance using Knowledge Graphs. One of the questions asked at the end of the talk was whether we were using Microsoft’s SQL Graph, and if not, then why not. After answering the question there on the fly, we decided that it was time to write a short paper explaining the differences between distinct implementations of graphs.

Today, there are two main graph data models:
• Property Graphs (also known as Labeled Property Graphs)
• RDF Graphs (Resource Description Framework)
Other graph data models are possible as well, but over 90% of the implementations use one of these two models. We will start by describing each of them.

Graph Data Models: Property Graphs and RDF Graphs
When we say that over 90% of implementations use either Property Graphs or RDF Graphs, we mean implementations that use some kind of industry-recognized graph data model. Due to the current expansive popularity of graphs, many vendors are starting to represent their technology as graph based, when in reality they use a home-grown object repository that can resemble certain aspects of graphs.

This white paper is not intended to cover such implementations since they do not use a recognized data model and, thus, there is no basis for comparison. If you are considering a technology that claims to be graph based, our recommendation is to always ask what graph data model it uses.
Property Graphs
While there are core commonalities in property graph implementations, there is no true standard property graph data model. Each implementation of a Property Graph is, therefore, somewhat different. In the following, we will focus our discussion on the characteristics that are common for any property graph database.

The most well-known implementation, which popularized property graphs as a concept, is the Neo4j graph database. At minimum, everything stated here is true for Neo4j. Other examples of property graph implementations are TigerGraph and Titan. MS SQL Graph is based on the same underlying concept, but it currently offers more limited capabilities than either Neo4j or some of the other products that are using the property graph data model.

Generally, the property graph data model consists of three elements:
• Nodes are the entities in the graph. Nodes can be tagged with zero to many text labels representing their type. Nodes are also called vertices.
• Edges are the directed links between nodes. Edges are also called relationships. The “from node” of a relationship is called the source node. The “to node” is called the target node. Each edge has a type. While edges are directed, they can be navigated and queried in either direction.
• Properties are the key-value pairs associated with a node or with an edge.

If you have worked with object databases, you will find it easy to understand the Property Graph data model. It is really more of an object data model than a graph data model.
• Nodes are entities
• Edges are relationships
• Properties are attributes
Both entities and relationships can have attributes.

Property values can have data types. Supported data types depend on the vendor. For example, Neo4j data types are similar, but not identical, to Java language data types.

Figure 1 shows a fragment of a property graph with data about actors, directors and films or TV programs they worked on. Nodes are represented as ovals. For example, the node with ID 123, as we can see from its properties, represents Tom Hanks. Node labels are shown in dark blue. Node 123’s labels are Person, Actor and Director.
[Figure 1 diagram: a property graph fragment with data about actors, directors, and films or TV programs. Nodes shown: Tom Hanks (ID 123; labels Person, Actor, Director; “Year Born” 1956), Sarah Paulson (ID 128; labels Person, Actor), The Post (ID 125; label Movie; “Released” 2017), A League of Their Own (ID 124; label TV Series; “Released” 1993), New York City (ID 126; labels Location, City) and White Plains (ID 127; labels Location, City; “Population” 58811). Edge types shown: ACTED_IN (with a “Role” property, e.g. Ben Bradlee or Tony Bradlee), DIRECTED and FILMED_IN. The legend identifies nodes, node labels, property key-value pairs, edges and edge types.]
Figure 1: Simple Property graph excerpt with information about people and works of art
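The three elements of the property graph model (nodes with labels, typed edges, and key-value properties on both) can be illustrated with a small sketch. This is only an illustration in plain Python under stated assumptions, not how any particular property graph database stores data: the class names, the helper functions `outgoing` and `incoming`, and the specific edge ID are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: int
    labels: set = field(default_factory=set)       # zero to many text labels
    properties: dict = field(default_factory=dict)  # key-value pairs


@dataclass
class Edge:
    id: int
    type: str     # each edge has a single type
    source: int   # ID of the "from" node
    target: int   # ID of the "to" node
    properties: dict = field(default_factory=dict)  # edges can carry properties too


nodes = {
    123: Node(123, {"Person", "Actor", "Director"},
              {"First Name": "Tom", "Last Name": "Hanks", "Year Born": 1956}),
    125: Node(125, {"Movie"}, {"Name": "The Post", "Released": 2017}),
}

# The edge ID is illustrative; in a real database it is assigned internally.
edges = [Edge(11, "ACTED_IN", source=123, target=125,
              properties={"Role": "Ben Bradlee"})]


def outgoing(node_id, edge_type):
    """Follow edges of the given type in their stored direction."""
    return [nodes[e.target] for e in edges
            if e.source == node_id and e.type == edge_type]


def incoming(node_id, edge_type):
    """Follow edges of the given type against their stored direction."""
    return [nodes[e.source] for e in edges
            if e.target == node_id and e.type == edge_type]
```

Although the edge is stored as directed from the person to the movie, `incoming(125, "ACTED_IN")` finds the actor from the movie, which mirrors the point that edges can be navigated in either direction.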
Relationships are depicted as grey arrows. Each relationship has a single type that is shown in red. Properties are shown in the rounded rectangles with the gold background. Properties are connected to the nodes and relationships that they belong to using red arrows.

A key part of any data model is having a query language available for working with it. After all, users need to have a way to access and manipulate the data in the graph. No industry standard query language exists for property graphs. Instead, each database offers its own, unique query language that is incompatible with others:
• Neo4j offers Cypher, also known as CQL, its own query language that, to some extent, took SQL as an inspiration;
• TigerGraph offers GSQL, its own query language that also took SQL as an inspiration;
• MS SQL Graph has its own extension to SQL to support graph query;
• Some vendors, in addition to their own query language, also implement some subset of Cypher. For example, SAP Hana offers its own extensions to SQL and its own GraphScript language, plus it supports a subset of Cypher.

There is also Apache TinkerPop, an open source graph computing framework that is integrated with some property graph and RDF graph databases. It offers the Gremlin language, which is more of an API language than a query language.

A key requirement for working with any data model is the ability to reference nodes, properties and relationships (edges). In the case of property graphs, internally, nodes and edges have IDs. IDs are assigned by a database and are internal to a database. Referencing is done by using text strings: node labels, relationship types, and property names.

The fastest way to load bulk data is by importing a text file. For property graph data, there is no standard serialization (a way to represent graph data as a text file). It is typical for a property graph vendor to define a CSV format that users should follow in order to prepare files for bulk load.
RDF Graphs
RDF graphs use a standard graph data model. The standard for the RDF technology stack is managed by the World Wide Web Consortium (W3C), the same standards body that manages HTML, XML and many other web standards. Every database that supports RDF is expected to support the model in the same way.

The RDF graph data model basically consists of two elements:
• Nodes, the vertices in a graph. Nodes can be resources with unique identifiers or they can be “literals” with values that are strings, integers, etc.
• Edges, the directed links between nodes. Edges are also called predicates and/or properties. The “from node” of an edge is called the subject. The “to node” is called the object. Two nodes connected by an edge form a subject-predicate-object statement, also known as a Triple or a Triple Statement. While edges are directed, they can be navigated and queried in either direction.

Everything in an RDF graph is called a resource. “Edge” and “Node” are just the roles played by a resource in a given statement. Fundamentally in RDF, there is no difference between resources playing an edge role and resources playing a node role. An edge in one statement can be a node in another. The diagrams that follow give examples that make this core idea clearer.

There is a standard query language for RDF Graphs called SPARQL. It is both a full-featured query language and an HTTP protocol, making it possible to send query requests to endpoints over HTTP.

A key part of the RDF standard is the definition of serializations. The most commonly used serialization format is called Turtle. There is also a JSON serialization called JSON-LD as well as an XML serialization. All RDF databases are able to export and import graph content in standard serializations, making it easy and seamless to interchange data.

Built-in Semantics
The RDF Data Model provides a richer, more semantically consistent foundation than property graphs. Let’s see how a graph we showed earlier (Figure 1) is represented as an RDF Graph (Figure 2).

Note that the diagrams depict relationships using the recommended conventions of the property graph and RDF graph communities. Relationships in Property Graphs are typically capitalized with multiple words joined together by an underscore as in ACTED_IN. Relationships (or any property) in RDF graphs are typically identified using the lower camel case convention as in ex:actedIn. In both cases, these are simply recommended practices, not a “must have.”
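The subject-predicate-object model described in this section can be sketched with plain tuples. This is a simplified illustration, not an RDF library: resources are written as qualified-name strings, literals as ordinary strings, and the helper names `objects` and `subjects` are hypothetical.

```python
# Each statement (triple) is a (subject, predicate, object) tuple.
triples = {
    ("wikidata:Q2263", "rdf:type", "ex:Actor"),
    ("wikidata:Q2263", "ex:actedIn", "ex:125"),
    ("ex:125", "rdf:type", "schema:Movie"),
    ("ex:125", "rdfs:label", "The Post"),
    # A resource used as a predicate above plays the subject role here.
    ("ex:actedIn", "rdfs:label", "ACTED IN"),
}


def objects(subject, predicate):
    """Navigate an edge in its stated direction (subject to object)."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def subjects(predicate, obj):
    """Navigate the same edge in the opposite direction (object to subject)."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)
```

Because `ex:actedIn` is just another resource, the same lookup that finds the label of a movie also finds the label of a predicate: `objects("ex:actedIn", "rdfs:label")`.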
The graph in Figure 2 appears larger than the property graph in Figure 1 because all literal values are also depicted as nodes in the graph. All nodes are depicted as rounded rectangles with the light yellow background.

When visualizing RDF Graph data, it is common not to show literal values as nodes in order to make a cleaner and simpler looking diagram. That said, from the data structure perspective, they are part of the graph just like any other node. The only difference is that they can’t serve as a source node, i.e., a subject of a statement. They can only be targets or objects. Throughout this paper, we will continue to show them in the diagrams as nodes.

Although this makes the diagrams larger and busier, we believe it helps to illustrate the differences between the two data models and the implications of these differences on knowledge capture, graph design and graph evolution.

Literal values in an RDF Graph can have datatypes. The datatypes are taken from the XML Schema (e.g., xsd:string, xsd:integer, etc.). Text values can also have language tags to support internationalization of data. For example, instead of a single value for rdfs:label for New York City we could have multiple values such as:
• “New York City” xsd:string @en
• “Nueva York” xsd:string @es

The identifier is a very important concept for RDF graphs. Every non-literal node is assigned an identifier, typically a URI/IRI. Local, non-URI identifiers are possible, but rarely used because they are not interoperable. Globally unique identifiers bring many benefits to graph data models. An RDF-based solution can auto-generate URIs based on selected URI construction rules. Alternatively, when adding data (e.g., loading a serialized file), users can provide URIs that they want to use.

The URIs identifying nodes are displayed in the diagram using qualified names, commonly called Qname notation. To form a Qname, the namespace part of the URI is abbreviated using a prefix. For example, “rdf:” and “rdfs:” represent the built-in standard namespaces w3.org/1999/02/22-rdf-syntax-ns# and w3.org/2000/01/rdf-schema#, respectively.

These namespaces define the semantics of (the model behind) the RDF Data Model. The built-in resources such as rdf:type carry semantics that are defined in the standard. The built-in resources can be used as either nodes or edges in a graph. For an example of such semantics in edges, see the predicates (aka properties) rdf:type and rdfs:label in the RDF graph diagram in Figure 2. For an example of such semantics in nodes, see the node rdfs:Class that is the object of the rdf:type predicate in the diagram shown in Figure 3.
[Figure 2 diagram: an RDF graph with the same data about actors, directors, films or TV programs. Resources shown include wikidata:Q6 (New York City), wikidata:Q462177 (White Plains), wikidata:Q2263 (Tom Hanks), wikidata:Q257442, ex:124 (A League of Their Own), ex:125 (The Post), ex:Actor, ex:Director, schema:City, schema:Movie and schema:TVSeries. Predicates shown include rdf:type, rdfs:label, ex:actedIn, ex:directed, ex:filmedIn, ex:released, ex:population, schema:givenName, schema:familyName and schema:birthDate.]
Figure 2: An RDF graph representing the information in the Property graph in Figure 1
A key differentiator that we will be introducing is how the underlying model (schema) is represented in the same way as the data. Just to serve as a primer, “rdf:type” is a predicate used to connect a resource with a class it belongs to; “rdfs:label” is used to provide a display name for a resource.

The uniformity of the data model makes RDF Graphs more easily evolvable and gives them more flexibility compared to Property Graphs. We will see examples of this later in the white paper.

Enrichment through Composition
With the inherent composability of RDF Graphs, when two nodes have the same URI, they are automatically merged. This means that you can load different files and their content will be joined together, forming a larger and more interesting graph.

Examples of composability can be found in the use of schema.org and Wikidata. Schema.org is a namespace jointly set up by Google, Bing and Yahoo to create and support a common set of schemas for structured data markup. The prefix ‘schema:’ stands for schema.org. It provides a number of predicates and classes with commonly agreed and understood semantics. Similarly, ‘wikidata:’ is the namespace of Wikidata, the Wikimedia project that provides data in a structured, knowledge graph format. In the example, we are using schema:givenName, schema:familyName and schema:City. In this way, graphs developed by different organizations can link and share common semantics.

When organizations create their own knowledge graphs, they may use URIs of community defined resources as well as create resources for which they “mint” their own URIs. In the latter case, they would normally use a web domain they own as a namespace because a reference to a resource in an RDF Graph is expected to resolve and return information about it. In our example, in addition to using URIs from RDF, RDFS, Wikidata and Schema.org, we are also demonstrating the use of our own URIs. These URIs have the ‘ex:’ prefix to illustrate that they are provided as an example.

For human users browsing data, a reference to a resource URI will typically return information about a resource presented as a web page. For APIs making a call, information can be returned in JSON, any standard serialization of RDF or any other machine processable format.

The part of the Qname after the prefix is called a local name. A local name could be formed by using a display label if it can uniquely identify a resource within a namespace and is considered immutable. It could also be formed using a counter, much like in relational databases where a record gets the next sequential number as its ID. It could also be formed using a machine-generated random ID or be based on the value of one or more predicates that can establish a locally unique identity.
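The composition described in this section, where statements from different sources join on shared URIs, can be sketched as a set union of triples. This is a simplified illustration with hypothetical file contents; real RDF databases perform the same merge when loading standard serializations.

```python
# Statements as they might arrive from two separately maintained files.
file_a = {
    ("ex:125", "rdfs:label", "The Post"),
    ("ex:125", "ex:filmedIn", "wikidata:Q462177"),
}
file_b = {
    ("wikidata:Q462177", "rdfs:label", "White Plains"),
    ("wikidata:Q462177", "rdf:type", "schema:City"),
}

# Loading both files is a union; the shared identifier wikidata:Q462177
# connects the statements without any extra mapping step.
merged = file_a | file_b


def describe(resource):
    """Return all (predicate, object) pairs stated about a resource."""
    return {(p, o) for s, p, o in merged if s == resource}


# Follow ex:filmedIn from the first file into facts only the second file holds.
locations = [o for s, p, o in merged if s == "ex:125" and p == "ex:filmedIn"]
labels = [o for loc in locations for p, o in describe(loc) if p == "rdfs:label"]
```

After the merge, a traversal that starts in data from one file ends in data from the other, which is the practical effect of globally unique identifiers.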
TopBraid EDG: An Enterprise Knowledge Graph Infrastructure for Data Governance
• TopBraid EDG is a rich set of interconnected Knowledge Graphs expressing knowledge about how data is used and managed in the enterprise ecosystem.
• These integrated Knowledge Graphs are ready to be enriched with your enterprise specific knowledge.
• When this enrichment takes place, your enterprise is ready for implementing comprehensive Data Governance.

A knowledge graph contains facts about entities in the world together with the meaning of those facts expressed as models and rules.
FACTS: James has blue eyes. James’ father is Andrew. James is a person.
MODELS: A person has eye color. A person has two parents. A person’s father is also a person and he is male.
RULES: If both of a person’s parents have blue eyes, they will also have blue eyes.
Differences in Terminology and Capability
Certain key terms used when describing graphs actually mean very different things depending on the graph data model one talks about. This is important to understand to avoid confusion. It is also important to understand in order to appreciate differences in the capabilities that these two graph data models provide.

We will now describe the differences in the meaning and use of some key concepts: LABELS, TYPES and PROPERTIES.

What are Labels and Types
In RDF Graphs, a label is a standard predicate defined in the RDFS namespace: rdfs:label. It is used to point to the value of a display name for any resource. For example, the label for resource wikidata:Q6 in the graph shown in Figure 2 is “New York City.” You could also use another predicate for this purpose, but rdfs:label is widely accepted as a unique identifier of a property that connects a resource to its display name. In Property Graphs it is typical to create a property called “name” and use it to hold a display name for a node. You could also use a differently named property.

In Property Graphs, the term “label” is used to identify the type of a node. It is called a label rather than a type because it is simply a string, a textual tag. It has no meaning beyond the text. No information about it can be captured in a graph. Edges in a Property Graph also have a tag that identifies the type of an edge. It is called a “type” or, sometimes, “relationship type”. It is used in queries when matching relationships, and it is also used as a display name for edges when graphs are shown visually.

In contrast, in RDF Graphs, the type of a node or of a property is a resource, i.e., another node in the graph, typically with additional information associated with it to define its intended use and semantics. A node is connected to its type using the rdf:type predicate.

Note that some Property Graph databases (e.g., SAP Hana) do not use the term “label” at all and instead use the term “type” or “node type.” The underlying implementation, however, is the same: type is a tag for a node or a tag for a property. It is not a node itself.

Let’s take a look, in Figure 3, at a fragment of the same RDF graph we showed in Figure 2, now expanded with more information about types or classes and other schema elements.

The green border around nodes or edges indicates graph elements that describe the data model. In RDF, as in Property Graphs, nodes can belong to more than one set (class). We see this with Actor and Director. Tom Hanks is both. However, if one of the classes is a subclass of another, there is no need in RDF to specify a “parent type.” Instead, this information is provided at the class level for all resources that belong to a class, because class information is also a part of the RDF graph.
[Figure 3 diagram: modeling information, represented the same way as facts, can expand an RDF graph. In addition to the facts from Figure 2, the diagram shows rdfs:subClassOf links from schema:Movie and schema:TVSeries to schema:CreativeWork, from schema:City to schema:AdministrativeArea and on to schema:Place, and from ex:Actor and ex:Director to schema:Person; rdf:type links to rdfs:Class; and an rdfs:label “ACTED IN” on the predicate ex:actedIn.]
Figure 3: Part of the RDF graph diagram of Figure 2 expanded with modeling information
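The effect of the rdfs:subClassOf links in Figure 3 can be sketched as a small type inference: a resource's stated classes are expanded by following subclass links upward. This is a minimal illustration in plain Python, not a standards-compliant RDFS reasoner; the dictionary layout and the function name `all_types` are assumptions made for the example.

```python
# Class-level statements: each class maps to its direct superclasses.
SUBCLASS_OF = {
    "ex:Actor": {"schema:Person"},
    "ex:Director": {"schema:Person"},
    "schema:City": {"schema:AdministrativeArea"},
    "schema:AdministrativeArea": {"schema:Place"},
}

# Instance-level statements: only the most specific types are asserted.
ASSERTED_TYPES = {
    "wikidata:Q2263": {"ex:Actor", "ex:Director"},
    "wikidata:Q462177": {"schema:City"},
}


def all_types(resource):
    """Return asserted types plus every class reachable via rdfs:subClassOf."""
    found = set()
    pending = list(ASSERTED_TYPES.get(resource, ()))
    while pending:
        cls = pending.pop()
        if cls not in found:
            found.add(cls)
            pending.extend(SUBCLASS_OF.get(cls, ()))
    return found
```

Nothing states directly that Tom Hanks is a person or that White Plains is a place; both facts follow from statements made once at the class level.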
For example, unlike the Property Graph in Figure 1, we do not say in Figure 2 that Tom Hanks is a person in addition to being an actor and a director or that Sarah Paulson is a person in addition to being an actor. We simply say that there is an rdfs:subClassOf relationship between the class of Actors and the class of People. And the same for the class of Directors. The semantics of rdf:type and rdfs:subClassOf are defined in the standard: the graph depicted in Figure 3 says that every resource of type Actor is also of type Person.

We also do not say that the type of New York City or White Plains is a place (location) in addition to a city. We do not need to repeat this fact for each city. We already said it in the model: each city is also a place and what is defined for a place will apply to a city.

In an RDF Graph, we can capture any information about the model of the data that is stored in a graph. This information will be stored, accessed and processed the same way as any other data. For example, the graph diagram in Figure 3 shows that we can add a label to the predicate ex:actedIn. Similarly, we could also say that when the relationship ex:actedIn is used to navigate in the opposite direction (from a movie to an actor), the display name of the relationship should be shown as ‘actors’. In an RDF Graph, a resource that is used as a predicate in one statement can be used as a subject or object in another statement. This is an example of the additional flexibility that, among other things, lets us store information about predicates and their usage. The edges in Property Graphs offer nothing comparable.

We can extend the RDF graph further to explicitly define how a predicate should be used. For example, we could say that any resource of type schema:CreativeWork can have a property ex:released and the value of that property must be a date. This would apply to a Movie or a TVSeries since they both are subclasses of schema:CreativeWork. The diagram in Figure 4 shows what this looks like in a graph.

In Figure 4, the sh: prefix (e.g. in sh:property) stands for w3.org/ns/shacl#, the standard namespace that is used for SHACL, a language for defining rules and constraints for RDF Graphs, turning them into fully fledged Knowledge Graphs. SHACL offers a very strong approach to ensuring the integrity of RDF data and more.

For instance, we can:
• Consult a graph to find out what properties are appropriate for, let’s say, a movie and what are the valid values for these properties.
• Define constraints, also known as rich data quality/validity rules. For example, as shown in Figure 4, we have defined a minimum allowed date value for the ‘released’ property of a creative work (e.g., a movie or a TV Series). Now, if a movie released prior to 1900 is added to a graph, the graph can identify it as a problem. While this example is simple, we can add to the graph much more sophisticated rules. For instance, we could specify copyright regulations that must be in place for resources released or published after a certain date.
• Define rich inference rules. Inference rules generate new facts from the facts in the graph.
These key capabilities turn RDF Graphs into Knowledge Graphs.

What are Properties
In RDF Graphs, an edge is called a property (predicate) and an object that a property points to may be called a property value. All property values (literals and URIs alike) are stored as nodes. For example, as shown in Figure 2:
• The rdfs:label for the resource ex:125 is “The Post.” In this example, rdfs:label is a property and “The Post” is a value.
• The edge ex:filmedIn is also a property. Its values for ex:125 are wikidata:Q6 and wikidata:Q462177.
• In “data modeling speak,” in an RDF Graph properties can be either attributes or relationships.

In Property Graphs, properties can only have literal values. These are stored and treated differently from the nodes in a graph. In data modeling speak, properties in a Property Graph are always attributes. This is why property graphs are formally described as directed, edge labeled, attributed graphs.
[Figure 4 diagram: schema:TVSeries and schema:Movie are linked by rdfs:subClassOf to schema:CreativeWork, which is linked by sh:property to a property definition for ‘released’ with sh:path ex:released, sh:datatype xsd:date and sh:minValue 1900.]
Figure 4: Extending an RDF graph with more modeling information about the ex:released property
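The kind of check that Figure 4 expresses (the released value must be a date and must not be earlier than 1900, for creative works and their subclasses) can be sketched in plain Python. This is not SHACL syntax and not a SHACL engine; the rule dictionary, its keys and the function name `violations` are hypothetical stand-ins used only to show the logic.

```python
from datetime import date

# Class-level statements from the model.
SUBCLASS_OF = {
    "schema:Movie": "schema:CreativeWork",
    "schema:TVSeries": "schema:CreativeWork",
}

# Plain-Python stand-in for the property definition in Figure 4.
RELEASED_RULE = {
    "target_class": "schema:CreativeWork",
    "path": "ex:released",
    "datatype": date,
    "min_year": 1900,
}


def violations(resource_type, properties, rule=RELEASED_RULE):
    """Return a list of problems found for one resource (empty if none)."""
    # The rule applies to the target class and to its direct subclasses.
    if rule["target_class"] not in (resource_type, SUBCLASS_OF.get(resource_type)):
        return []
    value = properties.get(rule["path"])
    if value is None:
        return []  # the property is optional in this sketch
    if not isinstance(value, rule["datatype"]):
        return [f"{rule['path']} must be a date"]
    if value.year < rule["min_year"]:
        return [f"{rule['path']} must not be earlier than {rule['min_year']}"]
    return []
```

Because the rule is attached to schema:CreativeWork, it is checked for movies and TV series alike without being repeated for each subclass, which is the point the figure makes.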
A property structure is that of key-value pairs. This means that a property key can only have a single value. If it has more than one value, then the single value is turned into an array of comma separated values. For an example, see Figure 5.

[Figure 5 diagram: node 127 (labels Location and City) with the properties “Name”: White Plains and “Population”: 58811, 56853.]
Figure 5: In Property graphs, the property structure is that of key-value pairs; multiple values must be turned into an array of comma separated values

Turning multi-valued properties into arrays makes it harder to efficiently answer queries such as “all cities with population over 58,000.” The first value in the array is the population of White Plains in 2018. The second value is the population of White Plains in 2010. There is no way in a Property Graph to capture what each of these values represents beyond the fact that the key part of the key-value pair is Population. This brings us to the next important difference: how to capture additional information about a property value. In saying this, we mean any property, whether it is an attribute or a relationship. As we see with the population example, it may be important to qualify a measurement by the date it was measured on. There are also other important information qualifiers, including source and confidence.

For example, Wikidata captures many details about the source of the information about Tom Hanks’ birth date in order to give users confidence in the reliability of the data. As shown in Figure 6, it got the information from 9 sources which all agree on the date. The sources include the Encyclopedia Britannica, Internet Broadway Database and others.

Figure 6: A screenshot from Wikidata showing the sources of information about Tom Hanks’ birth date

Differences in Attaching Information about an Edge
In RDF Graphs, unlike in Property Graphs, edges are typically re-used:
• In the Property Graph shown in Figure 1, there are two ACTED_IN edges with different IDs: an edge connecting the node representing Tom Hanks to the node representing the movie The Post and an edge connecting Sarah Paulson to this movie. The two edges have the same type, but different identity.
• In RDF, it is the same edge. This means that if you need to say something about a relationship between Tom Hanks and The Post (e.g., the role he played in the movie), you can’t simply add a statement to the ex:actedIn property. If you do this, it will apply everywhere this property is used.

In other words, in the Property Graph data model, edges uniquely identify the combination of source node, edge and target node. In the RDF data model, they tend not to. Of course, one could create a unique edge and simply give it the type ex:actedIn. However, this is normally not done because RDF databases are optimized for working with edges that represent types instead of occurrences of types.

To support the need to attach information on an edge between two specific nodes, RDF provides a way to create a new node that uniquely identifies the source-edge-target combination (the subject-predicate-object triple in RDF speak). With that in place, we can make statements about the new node using the regular approach: it can be a subject or an object of any statement. This is shown in Figure 7, where we created a new node ex:126 to represent the statement (triple) of Tom Hanks’ acting in The Post. The new node is connected to the statement about Tom’s acting in The Post using rdf:subject, rdf:predicate, rdf:object and rdf:Statement, built-in elements of the RDF data model that support this use case.
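The statement-about-a-statement approach described above, with the node ex:126 standing for the triple about Tom Hanks acting in The Post, can be sketched with plain tuples. The triples follow the description in the text; the helper functions `value` and `roles` are hypothetical and only illustrate how the extra information might be looked up.

```python
triples = {
    # The original statement.
    ("wikidata:Q2263", "ex:actedIn", "ex:125"),
    # ex:126 is a new node that identifies the statement above.
    ("ex:126", "rdf:type", "rdf:Statement"),
    ("ex:126", "rdf:subject", "wikidata:Q2263"),
    ("ex:126", "rdf:predicate", "ex:actedIn"),
    ("ex:126", "rdf:object", "ex:125"),
    # Information attached to that one statement only.
    ("ex:126", "ex:role", "Ben Bradlee"),
}


def value(subject, predicate):
    """Return one object for (subject, predicate), or None if absent."""
    matches = [o for s, p, o in triples if s == subject and p == predicate]
    return matches[0] if matches else None


def roles(actor, work):
    """Find roles recorded on statements saying that actor ex:actedIn work."""
    result = []
    for s, p, o in triples:
        if (p == "rdf:type" and o == "rdf:Statement"
                and value(s, "rdf:subject") == actor
                and value(s, "rdf:predicate") == "ex:actedIn"
                and value(s, "rdf:object") == work):
            role = value(s, "ex:role")
            if role is not None:
                result.append(role)
    return result
```

The role is attached to ex:126 rather than to ex:actedIn itself, so it applies only to this one actor and movie and not everywhere the predicate is used.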
Compared to Property Graphs, this approach is more powerful and flexible because it supports:
• Adding other edges (relationships) to edges. For example, instead of having a role as a string, we may want to have a connection to a node representing Ben Bradlee, a person. This is fundamentally not possible with Property Graphs without changing (restructuring) the original graph.
• Adding more information to any property, not just a relationship. For example, we can use it to specify the effective date of each population measurement for White Plains. This is also not possible with Property Graphs.

[Figure 7 diagram: RDF graph with an example of making a statement about another statement. The node ex:126 has rdf:type rdf:Statement, rdf:subject wikidata:Q2263 (Tom Hanks), rdf:predicate ex:actedIn, rdf:object ex:125 (“The Post”, a schema:Movie) and ex:role “Ben Bradlee”.]
Figure 7: RDF graph showing making a statement about another statement, to attach information on an edge between two specific nodes.

For Property Graphs, the solution to the need to add edges to other edges is to create intermediate nodes, as shown in Figure 8.

This requires restructuring of a graph and changing all queries and logic because the path between actors and movies is now different (compare with the original graph in Figure 1).

[Figure 8 diagram: in Property Graphs, adding edges to other edges requires refactoring the graph. The graph of Figure 1 is restructured with intermediate nodes; it adds a Person node named Ben Bradlee and edges of types including ROLE_IN and PLAYED_BY between the nodes for Tom Hanks (ID 123) and The Post (ID 125).]
Figure 8: Refactored Property Graph with Ben Bradlee as a Person

With RDF, you do not need to make changes to the graph structure to make a link to the resource representing Ben Bradlee. You simply change the node at the end of the ex:role relationship from a string to a URI. This is demonstrated in Figure 9. The approach is evolutionary and does not require any refactoring other than the change of the value itself.

There may, however, be some other situations where you would want to introduce new intermediate nodes. If you do so, SHACL rules can be used to deliver the original relationship path, inferring its value from the new, more complex path. In this way, your existing queries and programs can remain the same.

The Property Graph solution to adding more information to a property (e.g., population) is to change the structure of the graph to turn a property into an edge and a value into a node. This requires restructuring of a graph and changes to all queries and logic for its processing because the storage and access of properties is fundamentally different and separate from the graph traversal. This makes Property Graphs less evolvable or flexible than RDF Graphs.
Flexibility is acknowledged as the key differentiating advantage of graph databases. For example, the leading vendor of property graph databases says, “With graph databases, IT and data architect teams move at the speed of business because the structure and schema of a graph model flexes as applications and industries change. Rather than exhaustively modeling a domain ahead of time, data teams can add to the existing graph structure without endangering current functionality.” We agree that this would be a very important and desirable advantage. However, as we describe in this paper, changes in the model of the Property Graph data will require refactoring and changes to queries. In a Property Graph, edges and properties are different data structures and their handling in queries is fundamentally different.

As you can see, compared to an RDF Graph, it is harder to organically grow a Property Graph in response to changes in your information requirements.

A current downside of the RDF Statement approach to capturing information about edges is what is sometimes called “graph bloat.” To capture a role that Tom Hanks had in The Post, we need to add at least three extra statements (rdf:subject, rdf:predicate and rdf:object) in addition to the role information, or four if you also add a type link to rdf:Statement. That is quite a lot of overhead for just one fact. If, however, you need to capture several facts about Tom’s acting in this movie, then the relative overhead is lower, because the extra statements are added only once.

A new extension to the RDF data model called RDF* (RDF Star) and its variation called RDF Plus address this issue. The extension is currently in the process of being added to the standard. In the meantime, TopBraid EDG can create a new node with a URI composed from the subject-predicate-object nodes of the statement you need to add information to. The new node uniquely identifies the original statement and can be used as the subject of other statements, avoiding graph bloat. For standard-compliant information exchange, EDG serializes such nodes as RDF Statements.

Graph Analytics, Named Graphs and Other Topics

This white paper is not intended to completely cover all capabilities of Property Graphs or Knowledge Graphs. We have focused only on critical differentiators. With this, we need to at least mention two important topics:

• Algorithms for Graph Analytics
• Named Graphs

Graph analytics is a key application for property graphs. By analytics, we mean node centrality, node similarity, shortest paths, clustering and other algorithms. Property Graphs are known for offering these algorithms, and many applications of property graphs rely on them. Having said this, there isn’t anything special in the property graph data model that makes these algorithms possible. They can be applied equally well over RDF Graphs. In fact, many RDF-based solutions also offer similar algorithms.
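The claim that graph analytics algorithms apply equally well over RDF data can be illustrated with a small sketch. It assumes triples are plain Python tuples whose subjects and objects are resources, and it runs a breadth-first shortest path and a degree centrality directly over them. The data and the `ex:` names are illustrative only, and the functions are generic textbook algorithms, not the API of any particular product.

```python
from collections import defaultdict, deque


def adjacency(triples):
    """Build an undirected adjacency map from triples, ignoring predicates."""
    adj = defaultdict(set)
    for s, _p, o in triples:
        adj[s].add(o)
        adj[o].add(s)
    return adj


def shortest_path(adj, start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in adj[node]:
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


def degree_centrality(adj):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(adj)
    if n < 2:
        return {}
    return {node: len(neighbours) / (n - 1) for node, neighbours in adj.items()}


# Illustrative triples between resources (literal-valued triples are left out).
triples = [
    ("ex:TomHanks", "ex:actedIn", "ex:ThePost"),
    ("ex:MerylStreep", "ex:actedIn", "ex:ThePost"),
    ("ex:StevenSpielberg", "ex:directed", "ex:ThePost"),
    ("ex:TomHanks", "ex:actedIn", "ex:ForrestGump"),
]

adj = adjacency(triples)
path = shortest_path(adj, "ex:MerylStreep", "ex:ForrestGump")
centrality = degree_centrality(adj)
most_central = max(centrality, key=centrality.get)
```

Nothing in the sketch depends on whether the edges came from a Property Graph or from RDF statements; the algorithms only need nodes and connections.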
IN AN RDF GRAPH, YOU DO NOT NEED TO CHANGE THE GRAPH STRUCTURE TO MAKE A LINK TO A RESOURCE
[Diagram: a schema:Movie resource with rdfs:label "The Post"; an rdf:Statement node linked by rdf:subject, rdf:predicate and rdf:object and carrying ex:role; an ex:actedIn link from a wikidata: resource typed ex:Actor and ex:Director, with schema:givenName Tom, schema:familyName Hanks and schema:birthDate 1956.]
Figure 9: RDF Graph with Ben Bradlee as a Person
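The "graph bloat" arithmetic behind the rdf:Statement pattern shown in Figure 9 can be made concrete with a short sketch. It assumes triples are plain tuples; `reify` uses the standard RDF reification vocabulary, while `statement_uri` is a purely hypothetical way of composing one URI from subject, predicate and object. It illustrates the idea of a derived statement identifier and is not TopBraid EDG's actual scheme. The `ex:` names and the placeholder facts are likewise illustrative.

```python
from urllib.parse import quote


def reify(statement_id, triple, typed=True):
    """Describe `triple` with the standard RDF reification vocabulary."""
    subject, predicate, obj = triple
    extra = [
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", obj),
    ]
    if typed:
        extra.append((statement_id, "rdf:type", "rdf:Statement"))
    return extra


def annotate(statement_id, triple, facts, typed=True):
    """Reify once, then attach any number of facts to the statement node."""
    return reify(statement_id, triple, typed) + [
        (statement_id, key, value) for key, value in facts.items()
    ]


def statement_uri(triple):
    """Hypothetical scheme: derive one identifier from subject, predicate, object."""
    return "urn:example:stmt:" + "/".join(quote(part, safe="") for part in triple)


acted_in = ("ex:TomHanks", "ex:actedIn", "ex:ThePost")

# One fact about the edge: 3 (or 4) extra statements for 1 statement of content.
one_fact = annotate("ex:stmt1", acted_in, {"ex:role": "Ben Bradlee"})

# Several facts share the same extra statements, so relative overhead drops.
three_facts = annotate("ex:stmt1", acted_in, {
    "ex:role": "Ben Bradlee",
    "ex:note1": "placeholder",
    "ex:note2": "placeholder",
})

# With a derived identifier, only the fact itself needs to be stored.
compact = [(statement_uri(acted_in), "ex:role", "Ben Bradlee")]
```

Here `one_fact` holds five triples (four of overhead plus the role), `three_facts` holds seven, and `compact` holds one, which mirrors the trade-off the text describes.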
The ability to partition data is important. Relational databases partition data using tables and views. Both Property Graphs and RDF Graphs let users work with sets of nodes of a specific type (in the case of Property Graphs, nodes carrying a specific label), e.g., a query can be limited to only work with actors or to only work with directors. This provides a very basic, limited partitioning.

RDF data can also be partitioned in named graphs. A named graph offers us a way to say that some group of triple statements belong to a “sub-graph.” We can then give it a uniquely identifying name (hence, the term “named graph”) and associate any other information with it that we see as important. The idea is somewhat similar to views in relational databases. A single statement can belong to many named graphs. Thus, it is a different concept from physically partitioning distinct graphs across different machines.

We can query a named graph individually, or we can query all available graphs, or a subset of available graphs. We can load a named graph, clear it and perform any other manipulations with it. This again follows the idea of “separate, but connectable.”

For example, in TopBraid EDG, a given business glossary or a taxonomy is a named graph. Resources in it can be connected to resources in other graphs, but it can also be manipulated as a distinct set of statements. For example, there could be a purpose associated with a glossary as a whole, e.g., its users and uses can be identified and so on. There is no similar concept in the Property Graph world.

Limitations of Property Graphs

In this white paper, we describe some limitations of Property Graphs and their differences with Knowledge Graphs that are based on RDF.

The main vendor for property graph technology, Neo4J, offers a mature system with some attractive, easy to get started with capabilities. There are also a few other Property Graph databases on the market today.

However, we increasingly hear of customers hitting the wall with Property Graphs because as they start to use them, they recognize the need for one or more of the following capabilities:

• Capture of Schema in a Graph
• Support for Validation and Data Integrity
• Capture of Rich Rules
• Support for Inheritance and Inference
• Globally Unique Identifiers
• Resolvable Identifiers
• Connectivity Across Graphs
• Better Solution to Graph Evolvability

Note that these are fundamental limitations that are not addressed in the design of property graphs. In principle, it may be possible to add at least some of these capabilities to a Property Graph, but not that easily or elegantly. Some of you may have already started on the road to doing this.

However, it is a lot of effort, both conceptual (i.e., design and architecture) and implementation work. Even if you succeed in
accomplishing it, you will end up with a proprietary home-grown version of capabilities that already exist, are standardized and well proven.

Inherent Semantics make it easy for RDF Graphs to become Knowledge Graphs

As illustrated in the previous sections, RDF-based graphs capture more than just data. They capture the meaning or semantics of data, including rich constraints and highly expressive rules. All information is stored in a graph and is available for query and any other algorithms that can help us reason and discover new knowledge based on the available knowledge. And the amount of the available knowledge with Knowledge Graphs is practically unlimited, just as it is on the world wide web. We can reach out and take advantage of the information available in other graphs. Separate, but connectable is a key feature of the web, and of Knowledge Graphs.

With Property Graphs, data modeling happens on paper or on a white board, separate from the graph itself. Property Graphs are not self-describing and the meaning of the data they store is not a part of a graph.

Some Guidance for Moving from a Property Graph to a Knowledge Graph

It is fairly easy to generate one of the RDF standard serializations from a property graph. In fact, Neo4J offers a library for doing this. You can readily get the data out, but you will not be able to get the semantics of the data; this is due to the fact that the data model only exists in your initial design sketches and, partially, within Cypher queries and programs.

Further, as we have discussed, the structure of the graph data may be influenced by the specific limitations of the property graph data model and optimizations that were required due to the architecture of a property graph database. We already demonstrated how a decision to use intermediate nodes in a property graph may be based on the need to add information to a property, which is only possible if a property is turned into an edge.

Further, in property graphs some property values such as dates or names are often turned into entities because there is no efficient way of querying literal values, especially if they are multi-valued. As a result, you may have an entity for a number 58,811 or a year 1956. This, however, could result in having so-called “dense nodes” or nodes that participate in many relationships. Typically, nodes that are targets of thousands of relationships are considered to be dense in Neo4J with the potential of performance issues when such nodes are deleted. The design of the model may, therefore, be impacted by the density considerations. Similarly, you may have relationships that represent specific dates e.g., BORN_IN_1956, BORN_IN_1957, etc. This is a design pattern used in property graphs
because with a generic BORN_IN relationship, Cypher queries looking for people born in, let’s say 1956, do not perform well. Once you move to RDF, you may decide to revisit some of these design decisions.

The simplest way forward is to export property graph data as-is and then create a data model in RDF that represents the structure of the data. For example, if you created intermediate nodes in order to link roles to people portrayed by roles, you would mirror this in your RDF model (often called an ontology) even if strictly speaking this is not necessary in the RDF-based implementation.

TopBraid EDG can use data to reverse engineer an ontology. This will speed up your migration efforts and will make the data model explicit. You can then decide if you want to adjust the model and change the data or move forward with it as-is, evolving it later if necessary.

Many applications today use GraphQL to read and write data. Neo4J and some other Property Graph offerings support GraphQL access to data. If you have used GraphQL to build your solution on top of a Property Graph, you will be able to keep much of your code as you move to an RDF platform like TopBraid EDG that also supports GraphQL.

For property graphs, GraphQL Schemas need to be manually created and then manually maintained as the graph structures get extended and changed. One of the advantages of a self-describing graph is that GraphQL Schemas can be automatically generated from the data model. This delivers on the promise of frictionless development and graceful systems maintenance by rendering unnecessary any manual effort for defining and maintaining schemas. For more information on how TopBraid EDG works with GraphQL, visit topquadrant.com/technology/graphql/.

For the types of queries that can’t be easily supported by GraphQL, you will typically use SPARQL. TopBraid EDG lets you use either of the query languages and it also lets you put SPARQL expressions into GraphQL.

Summary

Neo4J is a mature solution that popularized Property Graphs and made them easy to get started with. People tend to think that RDF based Knowledge Graphs are hard to understand, complex and hard to get started with. In the past, there was some truth to that characterization. Today, with products like TopBraid EDG, it is no longer the case.

Many users are discovering the limitations of property graphs. Even if you started your first graph project using a property graph, it is likely that sooner or later you will be hindered by limitations and will want to adopt or at least explore the feasibility of an RDF / Semantic Knowledge Graph based system. You will not be alone, as a number of organizations are graduating from property graphs to knowledge graphs. We hope that this paper has provided some insight and value in your decision making.
About TopQuadrant

TopQuadrant helps organizations succeed in Data Governance. Its flagship product, TopBraid EDG, delivers easy and meaningful access for all data stakeholders to enterprise metadata, business terms, reference data, data and application catalogs, data lineage, requirements, policies, and processes.

TopQuadrant’s customer list includes over 120 organizations in financial services, pharma, healthcare, digital media, government and other sectors.

GOVERNANCE PACKAGES AVAILABLE IN TOPBRAID EDG

• Vocabulary Management
• Metadata Management
• Reference Data Management
• Business Glossaries

In addition to the above, TopBraid Tagger and AutoClassifier is a popular additional module that is part of a comprehensive information management and governance environment where packages for other types of assets can be easily added if needed.

In ramping up a Data Governance program, different organizations may have different starting points. With TopBraid EDG, you can start incrementally and add capabilities as you go. For details on available EDG packages and additional modules visit topquadrant.com/products/topbraid-enterprise-data-governance/
©2020 TopQuadrant, Inc. All rights reserved. TopBraid Enterprise Data Governance–Vocabulary Management, and the TopQuadrant logo
are trademarks of TopQuadrant Inc. in the U.S. All other trademarks are the property of their respective owners. Specifications subject
to change without notice.
For more details or to schedule a demo, contact us at: edg-info@topquadrant.com