Knowledge Graphs versus Property Graphs: Similarities, Differences and Some Guidance on Capabilities - TopQuadrant

Page created by Harold Caldwell
 
We are in the era of graphs. Graphs are hot. Why? Flexibility is one strong driver: heterogeneous data, integrating new data sources, and analytics all require flexibility. Graphs deliver it in spades.

Over the last few years, a number of new graph databases came to market. As we start the next decade, dare we say “the semantic twenties,” we also see vendors that never before mentioned graphs starting to position their products and solutions as graphs or graph-based.

Graph databases are one thing, but “Knowledge Graphs” are an even hotter topic. TopBraid EDG is a solution for creating Knowledge Graphs and putting them to work. (See the section “TopBraid EDG: An Enterprise Knowledge Graph Infrastructure for Data Governance” for more information on TopBraid EDG.) As a result, we are often asked to explain Knowledge Graphs.
• What are they?
• Why and where are they useful?
• How are they different from “just graphs?”

At the recent Data Governance Vision conference, we gave a talk on the topic of supporting Data Governance using Knowledge Graphs. One of the questions asked at the end of the talk was whether we were using Microsoft’s SQL Graph, and if not, then why not. After answering the question there on the fly, we decided that it was time to write a short paper explaining the differences between distinct implementations of graphs.

Today, there are two main graph data models:
• Property Graphs (also known as Labeled Property Graphs)
• RDF Graphs (Resource Description Framework)

Other graph data models are possible as well, but over 90% of the implementations use one of these two models. We will start by describing each of them.

Graph Data Models: Property Graphs and RDF Graphs

When we say that over 90% of implementations use either Property Graphs or RDF Graphs, we mean implementations that use some kind of industry-recognized graph data model. Due to the current popularity of graphs, many vendors are starting to represent their technology as graph based, when in reality they use a home-grown object repository that can resemble certain aspects of graphs.

This white paper is not intended to cover such implementations since they do not use a recognized data model and, thus, there is no basis for comparison. If you are considering a technology that claims to be graph based, our recommendation is to always ask what graph data model it uses.

Property Graphs

While there are core commonalities in property graph implementations, there is no true standard property graph data model. Each implementation of a Property Graph is, therefore, somewhat different. In the following, we will focus our discussion on the characteristics that are common to any property graph database.

The most well-known implementation, which popularized property graphs as a concept, is the Neo4j graph database. At minimum, everything stated here is true for Neo4j. Other examples of property graph implementations are TigerGraph and Titan. MS SQL Graph is based on the same underlying concept, but it currently offers more limited capabilities than either Neo4j or some of the other products that use the property graph data model.

Generally, the property graph data model consists of three elements:
• Nodes are the entities in the graph. Nodes can be tagged with zero to many text labels representing their type. Nodes are also called vertices.
• Edges are the directed links between nodes. Edges are also called relationships. The “from node” of a relationship is called the source node. The “to node” is called the target node. Each edge has a type. While edges are directed, they can be navigated and queried in either direction.
• Properties are the key-value pairs associated with a node or with an edge.

If you have worked with object databases, you will find it easy to understand the Property Graph data model. It is really more of an object data model than a graph data model.
• Nodes are entities
• Edges are relationships
• Properties are attributes

Both entities and relationships can have attributes.

Property values can have data types. Supported data types depend on the vendor. For example, Neo4j data types are similar, but not identical, to Java language data types.

Figure 1 shows a fragment of a property graph with data about actors, directors and films or TV programs they worked on. Nodes are represented as ovals. For example, the node with ID 123, as we can see from its properties, represents Tom Hanks. Node labels are shown in dark blue. Node 123’s labels are Person, Actor and Director.
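
The three elements described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Neo4j or any other product stores data: the class names, field names and helper functions are hypothetical, and the edge ID used for the ACTED_IN relationship is illustrative. The node data is taken from the Tom Hanks example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Node:
    node_id: int                                          # internal ID assigned by the database
    labels: set[str] = field(default_factory=set)         # zero to many text labels
    properties: dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    edge_id: int
    edge_type: str                                        # each edge has a single type
    source: int                                           # ID of the "from node"
    target: int                                           # ID of the "to node"
    properties: dict[str, Any] = field(default_factory=dict)


nodes = {
    123: Node(123, {"Person", "Actor", "Director"},
              {"First Name": "Tom", "Last Name": "Hanks", "Year Born": 1956}),
    125: Node(125, {"Movie"}, {"Name": "The Post", "Released": 2017}),
}
# Edge ID 10 is illustrative; the edge itself carries a property ("Role").
edges = [Edge(10, "ACTED_IN", 123, 125, {"Role": "Ben Bradlee"})]


def outgoing(node_id: int, edge_type: str) -> list[Node]:
    """Follow edges of the given type in their stored direction."""
    return [nodes[e.target] for e in edges
            if e.source == node_id and e.edge_type == edge_type]


def incoming(node_id: int, edge_type: str) -> list[Node]:
    """Follow edges of the given type against their stored direction."""
    return [nodes[e.source] for e in edges
            if e.target == node_id and e.edge_type == edge_type]


films = [n.properties["Name"] for n in outgoing(123, "ACTED_IN")]
cast = [n.properties["Last Name"] for n in incoming(125, "ACTED_IN")]
```

Note that the labels and the edge type are plain strings attached to nodes and edges; nothing else in the graph describes them.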

A PROPERTY GRAPH FRAGMENT WITH DATA ABOUT ACTORS, DIRECTORS, AND FILMS OR TV PROGRAMS

[Figure 1 diagram; the data recoverable from it is listed below. Legend: node, node label, property key-value pairs, edge, edge type.]
• Node 123: labels Person, Actor, Director; “First Name” Tom, “Last Name” Hanks, “Year Born” 1956
• Node 124: label TV Series; “Name” A League of Their Own, “Released” 1993
• Node 125: label Movie; “Name” The Post, “Released” 2017
• Node 126: labels Location, City; “Name” New York City
• Node 127: labels Location, City; “Name” White Plains, “Population” 58811
• Node 128: labels Person, Actor; “First Name” Sara, “Last Name” Paulson
• Edges (IDs 10 to 14): types ACTED_IN, DIRECTED and FILMED_IN; the ACTED_IN edges carry a “Role” property (Ben Bradlee, Tony Bradlee)

Figure 1: Simple Property graph excerpt with information about people and works of art

Relationships are depicted as grey arrows. Each relationship has a single type that is shown in red. Properties are shown in the rounded rectangles with the gold background. Properties are connected to the nodes and relationships that they belong to using red arrows.

A key part of any data model is having a query language available for working with it. After all, users need to have a way to access and manipulate the data in the graph. No industry standard query language exists for property graphs. Instead, each database offers its own query language that is incompatible with the others:
• Neo4j offers Cypher, also known as CQL: its own query language that, to some extent, took SQL as an inspiration;
• TigerGraph offers GSQL: its own query language that also took SQL as an inspiration;
• MS SQL Graph has its own extension to SQL to support graph queries;
• Some vendors, in addition to their own query language, also implement some subset of Cypher. For example, SAP HANA offers its own extensions to SQL and its own GraphScript language, and it also supports a subset of Cypher.

There is also Apache TinkerPop, an open source graph computing framework that is integrated with some property graph and RDF graph databases. It offers the Gremlin language, which is more of an API language than a query language.

A key requirement for working with any data model is the ability to reference nodes, properties and relationships (edges). In the case of property graphs, internally, nodes and edges have IDs. IDs are assigned by a database and are internal to that database. Referencing is done by using text strings: node labels, relationship types, and property names.

The fastest way to load bulk data is by importing a text file. For property graph data, there is no standard serialization (a way to represent graph data as a text file). It is typical for a property graph vendor to define a CSV format that users should follow in order to prepare files for bulk load.
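
To make the bulk-load point concrete, here is a minimal Python sketch of reading nodes from a CSV file and then referencing them by label and property name rather than by internal ID. The column layout (an `id` column, a `labels` column with `;` as separator, one column per property) is a hypothetical format invented for this illustration; each vendor defines its own, so consult the product documentation for the real one.

```python
import csv
import io

# Hypothetical node file: one row per node, labels separated by ';'.
NODES_CSV = """id,labels,Name,Population
126,Location;City,New York City,
127,Location;City,White Plains,58811
"""


def load_nodes(text: str) -> dict:
    """Parse the assumed CSV layout into {id: {"labels": ..., "properties": ...}}."""
    nodes = {}
    for row in csv.DictReader(io.StringIO(text)):
        node_id = int(row.pop("id"))
        labels = set(row.pop("labels").split(";"))
        # Keep only non-empty cells as properties; values stay as strings here.
        properties = {key: value for key, value in row.items() if value}
        nodes[node_id] = {"labels": labels, "properties": properties}
    return nodes


nodes = load_nodes(NODES_CSV)

# Referencing is done with text strings (a label and a property name).
cities = [n["properties"]["Name"] for n in nodes.values() if "City" in n["labels"]]
```

Because the format is vendor specific, a file prepared this way for one product would generally need to be reshaped before it could be loaded into another.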

RDF Graphs

RDF graphs use a standard graph data model. The standard for the RDF technology stack is managed by the World Wide Web Consortium (W3C), the same standards body that manages HTML, XML and many other web standards. Every database that supports RDF is expected to support the model in the same way.

The RDF graph data model basically consists of two elements:
• Nodes, the vertices in a graph. Nodes can be resources with unique identifiers or they can be “literals” with values that are strings, integers, etc.
• Edges, the directed links between nodes. Edges are also called predicates and/or properties. The “from node” of an edge is called the subject. The “to node” is called the object. Two nodes connected by an edge form a subject-predicate-object statement, also known as a Triple or a Triple Statement. While edges are directed, they can be navigated and queried in either direction.

Everything in an RDF graph is called a resource. “Edge” and “Node” are just the roles played by a resource in a given statement. Fundamentally, in RDF there is no difference between resources playing an edge role and resources playing a node role. An edge in one statement can be a node in another. The diagrams that follow give examples that will make this core idea clearer.

There is a standard query language for RDF Graphs called SPARQL. It is both a full-featured query language and an HTTP protocol, making it possible to send query requests to endpoints over HTTP.

A key part of the RDF standard is the definition of serializations. The most commonly used serialization format is called Turtle. There is also a JSON serialization called JSON-LD as well as an XML serialization. All RDF databases are able to export and import graph content in standard serializations, making it easy and seamless to interchange data.

Built-in Semantics

The RDF Data Model provides a richer, more semantically consistent foundation than property graphs. Let’s see how the graph we showed earlier (Figure 1) is represented as an RDF Graph (Figure 2).

Note that the diagrams depict relationships using the recommended conventions of the property graph and RDF graph communities. Relationships in Property Graphs are typically capitalized with multiple words joined together by an underscore as in ACTED_IN. Relationships (or any property) in RDF graphs are typically identified using the lower camel case convention as in ex:actedIn. In both cases, these are simply recommended practices, not a “must have.”
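
The subject-predicate-object model described in this section can be sketched with plain Python tuples. This is a simplification under stated assumptions rather than a real RDF library: prefixed strings stand in for URIs, literals are ordinary strings, the identifiers are illustrative, and the “acted in” label on ex:actedIn is a hypothetical statement added only to show a predicate appearing as the subject of another triple.

```python
# Each statement is a (subject, predicate, object) triple.
triples = {
    ("wikidata:Q2263", "ex:actedIn", "ex:125"),
    ("wikidata:Q2263", "ex:directed", "ex:124"),
    ("ex:125", "rdfs:label", "The Post"),
    # Hypothetical statement: the predicate ex:actedIn is itself described,
    # so it plays the subject (node) role here.
    ("ex:actedIn", "rdfs:label", "acted in"),
}


def objects(subject: str, predicate: str) -> set:
    """Navigate an edge from subject to object."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def subjects(predicate: str, obj: str) -> set:
    """Navigate the same edge in the opposite direction."""
    return {s for s, p, o in triples if p == predicate and o == obj}


def roles(resource: str) -> set:
    """Report which roles a resource plays across all statements."""
    found = set()
    for s, p, o in triples:
        if resource == s:
            found.add("subject")
        if resource == p:
            found.add("predicate")
        if resource == o:
            found.add("object")
    return found


acted_in = objects("wikidata:Q2263", "ex:actedIn")
actors = subjects("ex:actedIn", "ex:125")
edge_roles = roles("ex:actedIn")
```

Here `roles("ex:actedIn")` reports both the predicate and the subject role, which is the sense in which an edge in one statement can be a node in another.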

The graph in Figure 2 appears larger than the property graph in Figure 1 because all literal values are also depicted as nodes in the graph. All nodes are depicted as rounded rectangles with a light yellow background.

When visualizing RDF Graph data, it is common not to show literal values as nodes in order to make a cleaner and simpler looking diagram. That said, from the data structure perspective, they are part of the graph just like any other node. The only difference is that they can’t serve as a source node, i.e., the subject of a statement. They can only be targets or objects. Throughout this paper, we will continue to show them in the diagrams as nodes.

Although this makes the diagrams larger and busier, we believe it helps to illustrate the differences between the two data models and the implications of these differences for knowledge capture, graph design and graph evolution.

Literal values in an RDF Graph can have datatypes. The datatypes are taken from XML Schema (e.g., xsd:string, xsd:integer, etc.). Text values can also have language tags to support internationalization of data. For example, instead of a single value for rdfs:label for New York City we could have multiple values such as:
• “New York City”@en
• “Nueva York”@es
(In RDF 1.1, language-tagged strings have the datatype rdf:langString.)

The identifier is a very important concept for RDF graphs. Every non-literal node is assigned an identifier, typically a URI/IRI. Local, non-URI identifiers are possible, but rarely used because they are not interoperable. Globally unique identifiers bring many benefits to graph data models. An RDF-based solution can auto-generate URIs based on selected URI construction rules. Alternatively, when adding data (e.g., loading a serialized file), users can provide the URIs that they want to use.

The URIs identifying nodes are displayed in the diagram using qualified names, commonly called QName notation. To form a QName, the namespace part of the URI is abbreviated using a prefix. For example, “rdf:” and “rdfs:” represent the built-in standard namespaces http://www.w3.org/1999/02/22-rdf-syntax-ns# and http://www.w3.org/2000/01/rdf-schema#, respectively.

These namespaces define the semantics of (the model behind) the RDF Data Model. The built-in resources such as rdf:type carry semantics that are defined in the standard. The built-in resources can be used as either nodes or edges in a graph. For an example of such semantics in edges, see the predicates (aka properties) rdf:type and rdfs:label in the RDF graph diagram in Figure 2. For an example of such semantics in nodes, see the node rdfs:Class that is the object of the rdf:type predicate in the diagram shown in Figure 3.
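
The QName abbreviation described above amounts to a prefix lookup. The following Python sketch shows one way to expand and abbreviate names; the function names are hypothetical, the rdf: and rdfs: namespaces are written as full URIs, and the `ex` namespace is an assumed placeholder, not the namespace used in the paper’s figures.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "ex": "http://example.org/",  # assumed placeholder namespace
}


def expand(qname: str) -> str:
    """Turn a QName such as 'rdf:type' into a full URI."""
    prefix, local_name = qname.split(":", 1)
    return PREFIXES[prefix] + local_name


def abbreviate(uri: str) -> str:
    """Turn a full URI back into a QName when a known namespace matches."""
    for prefix, namespace in PREFIXES.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri  # no known prefix: leave the URI unchanged


type_uri = expand("rdf:type")
label_qname = abbreviate("http://www.w3.org/2000/01/rdf-schema#label")
```

The part after the prefix (`type`, `label`) is the local name; the prefix itself is only a shorthand and carries no meaning outside the prefix table.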
graph design and graph evolution.

AN RDF GRAPH WITH THE SAME DATA ABOUT ACTORS, DIRECTORS, FILMS OR TV PROGRAMS

[Figure 2 diagram. Resources wikidata:Q6 (New York City), wikidata:Q462177 (White Plains), wikidata:Q2263 (Tom Hanks), wikidata:Q257442 (Sara Paulson), ex:124 (A League of Their Own) and ex:125 (The Post) are connected to each other, to their classes (schema:City, schema:Movie, schema:TVSeries, ex:Actor, ex:Director) and to literal values by the predicates rdf:type, rdfs:label, schema:givenName, schema:familyName, schema:birthDate, ex:actedIn, ex:directed, ex:filmedIn, ex:released and ex:population.]

Figure 2: An RDF graph representing the information in the Property graph in Figure 1

A key differentiator that we will be introducing is how the underlying model (schema) is represented in the same way as the data. Just to serve as a primer, “rdf:type” is a predicate used to connect a resource with a class it belongs to; “rdfs:label” is used to provide a display name for a resource.

The uniformity of the data model makes RDF Graphs more easily evolvable and gives them more flexibility compared to Property Graphs. We will see examples of this later in the white paper.

Enrichment through Composition

With the inherent composability of RDF Graphs, when two nodes have the same URI, they are automatically merged. This means that you can load different files and their content will be joined together, forming a larger and more interesting graph.

Examples of composability can be found in the use of schema.org and Wikidata. Schema.org is a namespace jointly set up by Google, Bing and Yahoo to create and support a common set of schemas for structured data markup. The prefix ‘schema:’ stands for schema.org. It provides a number of predicates and classes with commonly agreed and understood semantics. Similarly, ‘wikidata:’ is the namespace of Wikidata, a Wikimedia knowledge base that provides structured data in a knowledge graph format. In the example, we are using schema:givenName, schema:familyName and schema:City. In this way, graphs developed by different organizations can link and share common semantics.

When organizations create their own knowledge graphs, they may use URIs of community-defined resources as well as create resources for which they “mint” their own URIs. In the latter case, they would normally use a web domain they own as a namespace because a reference to a resource in an RDF Graph is expected to resolve and return information about it. In our example, in addition to using URIs from RDF, RDFS, Wikidata and Schema.org, we are also demonstrating the use of our own URIs. These URIs have the ‘ex:’ prefix to illustrate that they are provided as an example.

For human users browsing data, a reference to a resource URI will typically return information about the resource presented as a web page. For APIs making a call, information can be returned in JSON, any standard serialization of RDF or any other machine-processable format.

The part of the QName after the prefix is called a local name. A local name could be formed by using a display label if it can uniquely identify a resource within a namespace and is considered immutable. It could also be formed using a counter, much as a record in a relational database gets the next sequential number as its ID. It could also be formed using a machine-generated random ID or be based on the value of one or more predicates that can establish a locally unique identity.
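
The merging behaviour described under Enrichment through Composition can be sketched with plain Python sets. This is a simplified illustration, not an RDF store: prefixed strings stand in for URIs, the two “files” and their contents are hypothetical, and the helper name is an assumption.

```python
# Two independently produced sets of triples that mention the same resource.
file_a = {
    ("wikidata:Q2263", "schema:givenName", "Tom"),
    ("wikidata:Q2263", "schema:familyName", "Hanks"),
}
file_b = {
    ("wikidata:Q2263", "ex:actedIn", "ex:125"),
    ("ex:125", "rdfs:label", "The Post"),
}

# Loading both is a set union: statements about the same URI end up on one node.
merged = file_a | file_b


def describe(resource: str, graph: set) -> set:
    """Collect every (predicate, object) pair stated about a resource."""
    return {(p, o) for s, p, o in graph if s == resource}


about_tom = describe("wikidata:Q2263", merged)
```

No join key or mapping step is needed in this sketch: the shared identifier alone brings the name statements from one source and the film statement from the other together.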

TopBraid EDG: An Enterprise Knowledge Graph Infrastructure for Data Governance

• TopBraid EDG is a rich set of interconnected Knowledge Graphs expressing knowledge about how data is used and managed in the enterprise ecosystem.
• These integrated Knowledge Graphs are ready to be enriched with your enterprise-specific knowledge.
• When this enrichment takes place, your enterprise is ready for implementing comprehensive Data Governance.

A knowledge graph contains facts about entities in the world together with the meaning of those facts expressed as models and rules.

RULES: If both of a person’s parents have blue eyes, they will also have blue eyes.
MODELS: A person has eye color. A person has two parents. A person’s father is also a person and he is male.
FACTS: James has blue eyes. James’ father is Andrew. James is a person.

Differences in Terminology and Capability

Certain key terms used when describing graphs actually mean very different things depending on the graph data model one talks about. This is important to understand to avoid confusion. It is also important to understand in order to appreciate differences in the capabilities that these two graph data models provide.

We will now describe the differences in the meaning and use of some key concepts:
• LABELS
• TYPES
• PROPERTIES

What are Labels and Types

In RDF Graphs, a label is a standard predicate defined in the RDFS namespace: rdfs:label. It is used to point to the value of a display name for any resource. For example, the label for resource wikidata:Q6 in the graph shown in Figure 2 is “New York City.” You could also use another predicate for this purpose, but rdfs:label is widely accepted as the identifier of the property that connects a resource to its display name. In Property Graphs it is typical to create a property called “name” and use it to hold a display name for a node. You could also use a differently named property.

In Property Graphs, the term “label” is used to identify the type of a node. It is called a label rather than a type because it is simply a string, a textual tag. It has no meaning beyond the text. No information about it can be captured in a graph. Edges in a Property Graph also have a tag that identifies the type of an edge. It is called a “type” or, sometimes, “relationship type”. It is used in queries when matching relationships, and it is also used as a display name for edges when graphs are shown visually.

In contrast, in RDF Graphs, the type of a node or of a property is a resource, i.e., another node in the graph, typically with additional information associated with it to define its intended use and semantics. A node is connected to its type using the rdf:type predicate.

Note that some Property Graph databases (e.g., SAP HANA) do not use the term “label” at all and instead use the term “type” or “node type.” The underlying implementation, however, is the same: type is a tag for a node or a tag for a property. It is not a node itself.

Let’s take a look, in Figure 3, at a fragment of the same RDF graph we showed in Figure 2, now expanded with more information about types or classes and other schema elements.

The green border around nodes or edges indicates graph elements that describe the data model. In RDF, as in Property Graphs, nodes can belong to more than one set (class). We see this with Actor and Director. Tom Hanks is both. However, if one of the classes is a subclass of another, there is no need in RDF to specify a “parent type.” Instead, this information is provided at the class level for all resources that belong to a class, because class information is also a part of the RDF graph.
as a unique identifier of a property that
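The difference between a type that is a node and a label that is a tag can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the API of any graph product: triples are plain tuples, the Property Graph node is a plain dictionary, and the helper `facts_about` is a hypothetical name introduced here. The identifiers come from the figures in this paper.

```python
# Triples are modeled as plain (subject, predicate, object) tuples.
# This is an illustrative sketch, not the API of any RDF library.
rdf_graph = {
    ("wikidata:Q462177", "rdfs:label", "White Plains"),
    ("wikidata:Q462177", "rdf:type", "schema:City"),
    # The type is itself a node, so facts can be stated about it.
    ("schema:City", "rdf:type", "rdfs:Class"),
    ("schema:City", "rdfs:subClassOf", "schema:AdministrativeArea"),
}

# A Property Graph node: the labels are bare strings attached to the node.
pg_node = {
    "id": 127,
    "labels": ["City", "Location"],
    "properties": {"Name": "White Plains"},
}


def facts_about(graph, resource):
    """Return every (predicate, object) pair stated about a resource."""
    return sorted((p, o) for s, p, o in graph if s == resource)


# The RDF type can be looked up like any other resource.
node_type = next(o for s, p, o in rdf_graph
                 if s == "wikidata:Q462177" and p == "rdf:type")
type_facts = facts_about(rdf_graph, node_type)

# The Property Graph label is only text; there is nothing to look up.
label_is_plain_text = all(isinstance(label, str) for label in pg_node["labels"])
```

Here `type_facts` holds the statements made about schema:City itself, while the Property Graph labels remain strings with no further description in the graph.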

[Figure 3 diagram, titled MODELING INFORMATION, REPRESENTED THE SAME WAY AS FACTS, CAN EXPAND AN RDF GRAPH: the data nodes of Figure 2 (ex:124 “A League of Their Own”, ex:125 “The Post”, wikidata:Q2263 for Tom Hanks, wikidata:Q462177 “White Plains”) are joined by class nodes (rdfs:Class, schema:Movie, schema:TVSeries, schema:CreativeWork, schema:City, schema:AdministrativeArea, schema:Place, ex:Actor, ex:Director, schema:Person) through rdf:type and rdfs:subClassOf edges; the predicate ex:actedIn carries the rdfs:label “ACTED IN”.]

Figure 3: Part of the RDF graph diagram of Figure 2 expanded with modeling information
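How class-level information removes the need to repeat a “parent type” on every resource can be sketched as follows. This is a simplified illustration, assuming triples held as plain Python tuples; the function names `superclasses` and `types_of` are hypothetical, and a real RDF store would apply the standard rdfs:subClassOf semantics itself.

```python
def superclasses(graph, cls):
    """Return cls plus every class reachable through rdfs:subClassOf."""
    found, pending = set(), [cls]
    while pending:
        current = pending.pop()
        if current in found:
            continue
        found.add(current)
        pending.extend(o for s, p, o in graph
                       if s == current and p == "rdfs:subClassOf")
    return found


def types_of(graph, resource):
    """Return the asserted types of a resource and all inferred ones."""
    inferred = set()
    for s, p, o in graph:
        if s == resource and p == "rdf:type":
            inferred |= superclasses(graph, o)
    return inferred


graph = {
    ("wikidata:Q2263", "rdf:type", "ex:Actor"),
    ("wikidata:Q2263", "rdf:type", "ex:Director"),
    ("ex:Actor", "rdfs:subClassOf", "schema:Person"),
    ("ex:Director", "rdfs:subClassOf", "schema:Person"),
    ("wikidata:Q462177", "rdf:type", "schema:City"),
    ("schema:City", "rdfs:subClassOf", "schema:AdministrativeArea"),
    ("schema:AdministrativeArea", "rdfs:subClassOf", "schema:Place"),
}

hanks_types = types_of(graph, "wikidata:Q2263")
city_types = types_of(graph, "wikidata:Q462177")
```

Nothing in the data says directly that Tom Hanks is a person or that White Plains is a place; both facts follow from the subclass statements made once at the class level.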

For example, unlike the Property Graph in Figure 1, we do not say in Figure 2 that Tom Hanks is a person in addition to being an actor and a director or that Sarah Paulson is a person in addition to being an actor. We simply say that there is an rdfs:subClassOf relationship between the class of Actors and the class of People. And the same for the class of Directors. The semantics of rdf:type and rdfs:subClassOf are defined in the standard: the graph depicted in Figure 3 says that every resource of type Actor is also of type Person.

We also do not say that the type of New York City or White Plains is a place (location) in addition to a city. We do not need to repeat this fact for each city. We already said it in the model: each city is also a place and what is defined for a place will apply to a city.

In an RDF Graph, we can capture any information about the model of the data that is stored in a graph. This information will be stored, accessed and processed the same way as any other data. For example, the graph diagram in Figure 3 shows that we can add a label to the predicate ex:actedIn. Similarly, we could also say that when the relationship ex:actedIn is used to navigate in the opposite direction (from a movie to an actor), the display name of the relationship should be shown as ‘actors’. In an RDF Graph, a resource that is used as a predicate in one statement can be used as a subject or object in another statement. This is an example of the additional flexibility that, among other things, lets us store information about predicates and their usage. The edges in Property Graphs offer nothing comparable.

We can extend the RDF graph further to explicitly define how a predicate should be used. For example, we could say that any resource of type schema:CreativeWork can have a property ex:released and the value of that property must be a date. This would apply to a Movie or a TVSeries since they both are subclasses of schema:CreativeWork. The diagram in Figure 4 shows what this looks like in a graph.

In Figure 4, the sh: prefix (e.g. in sh:property) stands for http://www.w3.org/ns/shacl#, the standard namespace that is used for SHACL, a language for defining rules and constraints for RDF Graphs, turning them into fully fledged Knowledge Graphs. SHACL offers a very strong approach to ensuring the integrity of RDF data and more.

For instance, we can:
• Consult a graph to find out what properties are appropriate for, let’s say, a movie and what the valid values for these properties are.
• Define constraints, also known as rich data quality/validity rules. For example, as shown in Figure 4, we have defined a minimum allowed date value for the ‘released’ property of a creative work (e.g., a movie or a TV Series). Now, if a movie released prior to 1900 is added to a graph, the graph can identify it as a problem. While this example is simple, we can add to the graph much more sophisticated rules. For instance, we could specify copyright regulations that must be in place for resources released or published after a certain date.

• Define rich inference rules. Inference rules generate new facts from the facts in the graph.

These key capabilities turn RDF Graphs into Knowledge Graphs.

What are Properties

In RDF Graphs, an edge is called a property (predicate) and an object that a property points to may be called a property value. All property values (literals and URIs alike) are stored as nodes. For example, as shown in Figure 2:
• The rdfs:label for the resource ex:125 is “The Post.” In this example, rdfs:label is a property and “The Post” is a value.
• The edge ex:filmedIn is also a property. Its values for ex:125 are wikidata:Q6 and wikidata:Q462177.
• In “data modeling speak,” in an RDF Graph properties can be either attributes or relationships.

In Property Graphs, properties can only have literal values. These are stored and treated differently from the nodes in a graph. In data modeling speak, properties in a Property Graph are always attributes. This is why property graphs are formally described as directed, edge labeled, attributed graphs.

[Figure 4 diagram: schema:CreativeWork, with subclasses schema:TVSeries and schema:Movie, is linked by sh:property to the node ex:CreativeWork-released, which has sh:path ex:released, sh:datatype xsd:date and sh:minValue 1900.]

Figure 4: Extending an RDF graph with more modeling information about the ex:released property
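The kind of check that the shape in Figure 4 enables can be sketched in plain Python. This is a simplified stand-in and not SHACL itself: the `shape` dictionary, the `validate` function and the resource ex:900 are assumptions made for illustration, and only the year 2017 for The Post comes from this paper (month and day are placeholders). In the SHACL standard, a lower bound of this kind is expressed with sh:minInclusive or sh:minExclusive.

```python
from datetime import date

# A simplified stand-in for the shape in Figure 4 (not real SHACL syntax).
shape = {
    "target_class": "schema:CreativeWork",
    "path": "ex:released",
    "datatype": date,
    "min_value": date(1900, 1, 1),
}

graph = {
    ("schema:Movie", "rdfs:subClassOf", "schema:CreativeWork"),
    ("schema:TVSeries", "rdfs:subClassOf", "schema:CreativeWork"),
    ("ex:125", "rdf:type", "schema:Movie"),
    ("ex:125", "ex:released", date(2017, 1, 1)),    # year from the paper
    ("ex:900", "rdf:type", "schema:Movie"),         # hypothetical resource
    ("ex:900", "ex:released", date(1895, 12, 28)),  # hypothetical value
}


def is_subclass_of(graph, cls, ancestor):
    """True if cls equals ancestor or reaches it via rdfs:subClassOf."""
    seen, pending = set(), [cls]
    while pending:
        current = pending.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        pending.extend(o for s, p, o in graph
                       if s == current and p == "rdfs:subClassOf")
    return False


def validate(graph, shape):
    """Return (resource, value, reason) for every violation of the shape."""
    targets = {s for s, p, o in graph
               if p == "rdf:type"
               and is_subclass_of(graph, o, shape["target_class"])}
    violations = []
    for s, p, o in graph:
        if s in targets and p == shape["path"]:
            if not isinstance(o, shape["datatype"]):
                violations.append((s, o, "wrong datatype"))
            elif o < shape["min_value"]:
                violations.append((s, o, "below minimum"))
    return violations


violations = validate(graph, shape)
```

Because the constraint is attached to schema:CreativeWork, it reaches movies and TV series through the subclass links without being restated for each class.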

A property structure is that of key-value pairs. This means that a property key can only have a single value. If it has more than one value, then the single value is turned into an array of comma separated values. For an example, see Figure 5.

[Figure 5 diagram: a node with “ID” 127, labels City and Location, “Name” White Plains and “Population” 58811, 56853.]

Figure 5: In Property graphs, the property structure is that of key-value pairs; multiple values must be turned into an array of comma separated values

Turning multi-valued properties into arrays makes it harder to efficiently answer queries such as “all cities with population over 58,000.” The first value in the array is the population of White Plains in 2018. The second value is the population of White Plains in 2010. There is no way in a Property Graph to capture what each of these values represents beyond the fact that the key part of the key-value pair is Population. This brings us to the next important difference: how to capture additional information about a property value. In saying this, we mean any property, whether it is an attribute or a relationship. As we see with the population example, it may be important to qualify a measurement by the date it was measured on. There are also other important information qualifiers, including source and confidence.

For example, Wikidata captures many details about the source of the information about Tom Hanks’ birth date in order to give users confidence in the reliability of the data. As shown in Figure 6, it got the information from 9 sources which all agree on the date. The sources include the Encyclopedia Britannica, Internet Broadway Database and others.

Differences in Attaching Information about an Edge

In RDF Graphs, unlike in Property Graphs, edges are typically re-used:
• In the Property Graph shown in Figure 1, there are two ACTED_IN edges with different IDs: an edge connecting the node representing Tom Hanks to the node representing the movie The Post and an edge connecting Sarah Paulson to this movie. The two edges have the same type, but different identity.
• In RDF, it is the same edge. This means that if you need to say something about a relationship between Tom Hanks and The Post (e.g., the role he played in the movie), you can’t simply add a statement to the ex:actedIn property. If you do this, it will apply everywhere this property is used.

In other words, in the Property Graph data model, edges uniquely identify the combination of source node, edge and target node. In the RDF data model, they tend not to. Of course, one could create a unique edge and simply give it the type ex:actedIn. However, this is normally not done because RDF databases are optimized for working with edges that represent types instead of occurrences of types.
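The difference in edge identity can be sketched with two small data structures. This is an illustration only: the edge IDs, the node ID used for Sarah Paulson and the identifier ex:SarahPaulson are hypothetical values chosen for the example, and neither structure is the storage format of a real product.

```python
# Property Graph: every edge is its own record with its own identity,
# so a property can be attached to one specific edge.
pg_edges = [
    {"id": 15, "type": "ACTED_IN", "source": 123, "target": 125,
     "properties": {}},
    {"id": 16, "type": "ACTED_IN", "source": 124, "target": 125,
     "properties": {}},
]
pg_edges[0]["properties"]["role"] = "Ben Bradlee"

# RDF: both statements reuse the single predicate resource ex:actedIn.
rdf_graph = {
    ("wikidata:Q2263", "ex:actedIn", "ex:125"),
    ("ex:SarahPaulson", "ex:actedIn", "ex:125"),  # hypothetical identifier
}
# A statement about ex:actedIn describes the predicate itself,
# so it applies wherever the predicate is used.
rdf_graph.add(("ex:actedIn", "rdfs:label", "ACTED IN"))

edge_ids = {edge["id"] for edge in pg_edges}
predicates = {p for s, p, o in rdf_graph if o == "ex:125"}
```

The Property Graph holds two distinct edge records of the same type, while the RDF graph holds two statements that share one predicate, which is why a role cannot be attached to ex:actedIn directly.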
To support the need to attach information on an edge between two specific nodes, RDF provides a way to create a new node that uniquely identifies the source-edge-target triple (or, in RDF speak, the subject-predicate-object combination). With that in place, we can make statements about the new node using the regular approach: it can be a subject or an object of any statement. This is shown in Figure 7 where we created a new node ex:126 to represent the statement (triple) of Tom Hanks’ acting in The Post. The new node is connected to the statement about Tom’s acting in The Post using rdf:subject, rdf:predicate, rdf:object and rdf:Statement, built-in elements of the RDF data model that support this use case.
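The statement node described above can be sketched as follows. This is a minimal illustration that assumes triples are held as plain Python tuples in a set; the helper name `reify` is hypothetical, while the rdf: terms and the identifiers ex:126, ex:125 and ex:role are the ones used in Figure 7.

```python
def reify(graph, statement_node, triple):
    """Add the rdf:Statement description of a triple and return the node."""
    s, p, o = triple
    graph.update({
        (statement_node, "rdf:type", "rdf:Statement"),
        (statement_node, "rdf:subject", s),
        (statement_node, "rdf:predicate", p),
        (statement_node, "rdf:object", o),
    })
    return statement_node


graph = {("wikidata:Q2263", "ex:actedIn", "ex:125")}
acting = reify(graph, "ex:126", ("wikidata:Q2263", "ex:actedIn", "ex:125"))

# The statement node can now be the subject of further statements.
graph.add((acting, "ex:role", "Ben Bradlee"))

role = next(o for s, p, o in graph if s == "ex:126" and p == "ex:role")
```

The graph grows from one triple to six: the original statement, four triples describing the statement node and one triple carrying the role.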

Figure 6: A screenshot from Wikidata showing the sources of information about Tom Hanks’ birth date

Compared to Property Graphs, this approach is more powerful and flexible because it supports:
• Adding other edges (relationships) to edges. For example, instead of having a role as a string, we may want to have a connection to a node representing Ben Bradlee, a person. This is fundamentally not possible with Property Graphs without changing (restructuring) the original graph.
• Adding more information to any property, not just a relationship. For example, we can use it to specify the effective date of each population measurement for White Plains. This is also not possible with Property Graphs.

[Figure 7 diagram, titled RDF GRAPH WITH AN EXAMPLE OF MAKING A STATEMENT ABOUT ANOTHER STATEMENT: the node ex:126, of type rdf:Statement, points through rdf:subject to wikidata:Q2263 (Tom Hanks), through rdf:predicate to ex:actedIn and through rdf:object to ex:125 (“The Post”), and carries ex:role “Ben Bradlee”.]

Figure 7: RDF graph showing making a statement about another statement, to attach information on an edge between two specific nodes.

For Property Graphs, the solution to the need to add edges to other edges is to create intermediate nodes, as shown in Figure 8.

This requires restructuring of a graph and changing all queries and logic because the path between actors and movies is now different (compare with the original graph in Figure 1). With RDF, you do not need to make changes to the graph structure to make a link to the resource representing Ben Bradlee. You simply change the node at the end of the ex:role relationship from a string to a URI. This is demonstrated in Figure 9. The approach is evolutionary and does not require any refactoring other than the change of the value itself.

[Figure 8 diagram, titled IN PROPERTY GRAPHS ADDING EDGES TO OTHER EDGES REQUIRES REFACTORING THE GRAPH: an intermediate node labeled Movie Role (“ID” 129) is connected by ROLE_IN, PLAYED_BY and PORTRAYIN edges to the Movie node The Post (“ID” 125, “Released” 2017), the Person, Director and Actor node Tom Hanks (“ID” 123, “Year Born” 1956) and a new Person node Ben Bradlee (“ID” 130). Legend: node, node label, property key-value pairs, edge, edge type.]

Figure 8: Refactored Property Graph with Ben Bradlee as a Person

There may, however, be some other situations where you would want to introduce new intermediate nodes. If you do so, SHACL rules can be used to deliver the original relationship path, inferring its value from the new, more complex path. In this way, your existing queries and programs can remain the same.

The Property Graph solution to adding more information to a property (e.g., population) is to change the structure of the graph to turn a property into an edge and a value into a node. This requires restructuring of a graph and changes to all queries and logic for its processing because the storage and access of properties is fundamentally different and separate from the graph traversal. This makes Property Graphs less evolvable or flexible than RDF Graphs.
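The evolutionary change on the RDF side, replacing the string at the end of ex:role with a resource, can be sketched as follows. This is an illustration under stated assumptions: triples are plain Python tuples, literals and URIs are both written as strings, the function `role_of` is a hypothetical name, and ex:BenBradlee is a hypothetical identifier standing in for whatever URI identifies Ben Bradlee.

```python
def role_of(graph, statement_node):
    """Follow ex:role from a statement node; the path never changes."""
    return next(o for s, p, o in graph
                if s == statement_node and p == "ex:role")


graph = {
    ("ex:126", "rdf:type", "rdf:Statement"),
    ("ex:126", "rdf:subject", "wikidata:Q2263"),
    ("ex:126", "rdf:predicate", "ex:actedIn"),
    ("ex:126", "rdf:object", "ex:125"),
    ("ex:126", "ex:role", "Ben Bradlee"),  # role held as a string
}
before = role_of(graph, "ex:126")

# Evolve the graph: swap the string for a resource and describe it.
graph.remove(("ex:126", "ex:role", "Ben Bradlee"))
graph.add(("ex:126", "ex:role", "ex:BenBradlee"))  # hypothetical URI
graph.add(("ex:BenBradlee", "rdf:type", "schema:Person"))
graph.add(("ex:BenBradlee", "rdfs:label", "Ben Bradlee"))
after = role_of(graph, "ex:126")
```

The lookup in `role_of` is untouched by the change; only the value it returns moves from a string to a resource that can carry its own statements.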

Flexibility is acknowledged as the key differentiating advantage of graph databases. For example, the leading vendor of property graph databases says, “With graph databases, IT and data architect teams move at the speed of business because the structure and schema of a graph model flexes as applications and industries change. Rather than exhaustively modeling a domain ahead of time, data teams can add to the existing graph structure without endangering current functionality.” We agree that this would be a very important and desired advantage. However, as we describe in this paper, changes in the model of the Property Graph data will require refactoring and changes to queries. In a Property Graph, edges and properties are different data structures and their handling in queries is fundamentally different.

As you can see, compared to an RDF Graph, it is harder to organically grow a Property Graph in response to changes in your information requirements.

A current downside of the RDF Statement approach to capturing information about edges is what is sometimes called “graph bloat.” To capture a role that Tom Hanks had in The Post, we need to add at least three extra statements (rdf:subject, rdf:predicate and rdf:object) in addition to the role information, or four if you also add a type link to rdf:Statement. Quite a lot of overhead for just one fact. If, however, you need to capture several facts about Tom’s acting in this movie, then this approach has less overhead.

A new extension to the RDF data model called RDF* (RDF Star) and its variation called RDF Plus address this issue. It is currently in the process of being added to the standard. In the meantime, TopBraid EDG can create a new node with the URI composed from the subject-predicate-object nodes of the statement you need to add information to. The new node uniquely identifies the original statement and can be used as a subject of other statements, avoiding graph bloat. For standard-compliant information exchange, EDG serializes such nodes as RDF Statements.

Graph Analytics, Named Graphs and Other Topics

This white paper is not intended to completely cover all capabilities of Property Graphs or Knowledge Graphs. We have focused only on critical differentiators. With this, we need to at least mention two important topics:
• Algorithms for Graph Analytics
• Named Graphs

Graph analytics is a key application for property graphs. By analytics, we mean node centrality, node similarity, shortest paths, clustering and other algorithms. Property Graphs are known for offering these algorithms and many applications of property graphs rely on such algorithms. Having said this, there isn’t anything special in a property graph data model that makes these algorithms possible. They can be applied equally well over RDF Graphs. In fact, many RDF-based solutions are also offering similar algorithms.
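That graph analytics does not depend on the Property Graph data model can be illustrated with a shortest-path search run directly over triples. This is a minimal sketch, assuming triples held as plain Python tuples and treating edges as undirected; the function name `shortest_path` and the identifier ex:SarahPaulson are hypothetical, and production systems would use their own optimized implementations.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search over triples, ignoring edge direction."""
    neighbours = {}
    for s, p, o in graph:
        neighbours.setdefault(s, set()).add(o)
        neighbours.setdefault(o, set()).add(s)
    queue, previous = deque([start]), {start: None}
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in sorted(neighbours.get(node, ())):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


graph = {
    ("wikidata:Q2263", "ex:actedIn", "ex:125"),
    ("ex:SarahPaulson", "ex:actedIn", "ex:125"),  # hypothetical identifier
    ("ex:125", "ex:filmedIn", "wikidata:Q462177"),
}
path = shortest_path(graph, "wikidata:Q2263", "wikidata:Q462177")
```

The search needs only nodes and the connections between them, which both graph data models provide.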

[Figure 9 diagram, titled IN AN RDF GRAPH, YOU DO NOT NEED TO CHANGE THE GRAPH STRUCTURE TO MAKE A LINK TO A RESOURCE: the same graph as in Figure 7, except that ex:role on the statement node ex:126 now points to a resource identified by a wikidata: URI instead of the string “Ben Bradlee”.]

Figure 9: RDF Graph with Ben Bradlee as a Person

The ability to partition data is important. Relational databases partition data using tables and views. Both Property Graphs and RDF Graphs let users work with sets of nodes of a specific type (in the case of Property Graphs, nodes carrying a specific label); e.g., a query can be limited to only work with actors or to only work with directors. This provides a very basic, limited partitioning.

RDF data can also be partitioned in named graphs. A named graph offers us a way to say that some group of triple statements belongs to a "sub-graph." We can then give it a uniquely identifying name (hence, the term "named graph") and associate any other information with it that we see as important. The idea is somewhat similar to views in relational databases. A single statement can belong to many named graphs. Thus, it is a different concept from physically partitioning distinct graphs across different machines.

We can query a named graph individually, or we can query all available graphs, or a subset of available graphs. We can load a named graph, clear it and perform any other manipulations with it. This again follows the idea of "separate, but connectable."

For example, in TopBraid EDG, a given business glossary or a taxonomy is a named graph. Resources in it can be connected to resources in other graphs, but it can also be manipulated as a distinct set of statements. For example, there could be a purpose associated with a glossary as a whole, e.g., its users and uses can be identified and so on. There is no similar concept in the Property Graph world.

Limitations of Property Graphs

In this white paper, we describe some limitations of Property Graphs and their differences with Knowledge Graphs that are based on RDF.

The main vendor for property graph technology, Neo4J, offers a mature system with some attractive capabilities that are easy to get started with. There are also a few other Property Graph databases on the market today.

However, we increasingly hear of customers hitting the wall with Property Graphs because, as they start to use them, they recognize the need for one or more of the following capabilities:

• Capture of Schema in a Graph
• Support for Validation and Data Integrity
• Capture of Rich Rules
• Support for Inheritance and Inference
• Globally Unique Identifiers
• Resolvable Identifiers
• Connectivity Across Graphs
• Better Solution to Graph Evolvability

Note that these are fundamental limitations that are not addressed in the design of property graphs. In principle, it may be possible to add at least some of these capabilities to a Property Graph, but not that easily or elegantly. Some of you may have already started on the road to doing this.

However, it is a lot of effort, both conceptual (i.e., design and architecture) and implementation work. Even if you succeed in accomplishing it, you will end up with a proprietary home-grown version of capabilities that already exist, are standardized and well proven.

Inherent Semantics make it easy for RDF Graphs to become Knowledge Graphs

As illustrated in the previous sections, RDF-based graphs capture more than just data. They capture the meaning or semantics of data, including rich constraints and highly expressive rules. All information is stored in a graph and is available for query and any other algorithms that can help us reason and discover new knowledge based on the available knowledge. And the amount of the available knowledge with Knowledge Graphs is practically unlimited, just as it is on the world wide web. We can reach out and take advantage of the information available in other graphs. "Separate, but connectable" is a key feature of the web, and of Knowledge Graphs.

With Property Graphs, data modeling happens on paper or on a white board, separate from the graph itself. Property Graphs are not self-describing and the meaning of the data they store is not a part of a graph.

Some Guidance for Moving from a Property Graph to a Knowledge Graph

It is fairly easy to generate one of the RDF standard serializations from a property graph. In fact, Neo4J offers a library for doing this. You can readily get the data out, but you will not be able to get the semantics of the data; this is due to the fact that the data model only exists in your initial design sketches and, partially, within Cypher queries and programs.

Further, as we have discussed, the structure of the graph data may be influenced by the specific limitations of the property graph data model and optimizations that were required due to the architecture of a property graph database. We already demonstrated how a decision to use intermediate nodes in a property graph may be based on the need to add information to a property, which is only possible if a property is turned into an edge.

Further, in property graphs some property values such as dates or names are often turned into entities because there is no efficient way of querying literal values, especially if they are multi-valued. As a result, you may have an entity for a number 58,811 or a year 1956. This, however, could result in having so-called "dense nodes," or nodes that participate in many relationships. Typically, nodes that are targets of thousands of relationships are considered to be dense in Neo4J, with the potential of performance issues when such nodes are deleted. The design of the model may, therefore, be impacted by the density considerations. Similarly, you may have relationships that represent specific dates, e.g., BORN_IN_1956, BORN_IN_1957, etc. This is a design pattern used in property graphs because with a generic BORN_IN relationship, Cypher queries looking for people born in, let's say, 1956 do not perform well. Once you move to RDF, you may decide to revisit some of these design decisions.

The simplest way forward is to export property graph data as-is and then create a data model in RDF that represents the structure of the data. For example, if you created intermediate nodes in order to link roles to people portrayed by roles, you would mirror this in your RDF model (often called an ontology) even if strictly speaking this is not necessary in the RDF-based implementation.

TopBraid EDG can use data to reverse engineer an ontology. This will speed up your migration efforts and will make the data model explicit. You can then decide if you want to adjust the model and change the data or move forward with it as-is, evolving it later if necessary.

Many applications today use GraphQL to read and write data. Neo4J and some other Property Graph offerings support GraphQL access to data. If you have used GraphQL to build your solution on top of a Property Graph, you will be able to keep much of your code as you move to an RDF platform like TopBraid EDG that also supports GraphQL. For property graphs, GraphQL Schemas need to be manually created and then manually maintained as the graph structures get extended and changed. One of the advantages of a self-describing graph is that GraphQL Schemas can be automatically generated from the data model. This delivers on the promise of frictionless development and graceful systems maintenance by rendering unnecessary any manual effort for defining and maintaining schemas. For more information on how TopBraid EDG works with GraphQL, visit topquadrant.com/technology/graphql/.

For the types of queries that can't be easily supported by GraphQL, you will typically use SPARQL. TopBraid EDG lets you use either of the query languages and it also lets you put SPARQL expressions into GraphQL.

Summary

Neo4J is a mature solution that popularized Property Graphs and made them easy to get started with. People tend to think that RDF-based Knowledge Graphs are hard to understand, complex and hard to get started with. In the past, there was some truth to that characterization. Today, with products like TopBraid EDG, it is no longer the case.

Many users are discovering the limitations of property graphs. Even if you started your first graph project using a property graph, it is likely that sooner or later you will be hindered by limitations and will want to adopt or at least explore the feasibility of an RDF / Semantic Knowledge Graph based system. You will not be alone, as a number of organizations are graduating from property graphs to knowledge graphs. We hope that this paper has provided some insight and value in your decision making.
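
As a small companion to the migration guidance in this paper, the following Python sketch illustrates one way the BORN_IN_1956 design pattern could be revisited when property graph data is exported. Year-specific relationship types are folded into a single literal-valued statement per person, which can then be filtered directly by year. The sample edges, the property name ex:birthYear and both helper functions are hypothetical and exist only for illustration; this is not a description of how Neo4J's export library or TopBraid EDG work.

```python
import re

# Hypothetical property graph export: (source node, relationship type, target node).
edges = [
    ("TomHanks", "BORN_IN_1956", "Year1956"),
    ("MerylStreep", "BORN_IN_1949", "Year1949"),
    ("TomHanks", "ACTED_IN", "ThePost"),
]

# Matches year-specific relationship types such as BORN_IN_1956.
BORN_IN = re.compile(r"^BORN_IN_(\d{4})$")


def to_triples(edges):
    """Turn exported edges into (subject, predicate, object) triples.

    BORN_IN_<year> relationships become one statement with an integer
    literal as its object; every other edge is carried over unchanged.
    """
    triples = []
    for source, rel_type, target in edges:
        match = BORN_IN.match(rel_type)
        if match:
            # "ex:birthYear" is an assumed, illustrative property name.
            triples.append((source, "ex:birthYear", int(match.group(1))))
        else:
            triples.append((source, rel_type, target))
    return triples


def born_in(triples, year):
    """Return the subjects whose birth-year literal equals the given year."""
    return sorted(s for s, p, o in triples if p == "ex:birthYear" and o == year)


triples = to_triples(edges)
print(born_in(triples, 1956))
```

With the year stored as a literal, the per-year relationship types and the year entities they point to are no longer needed, so the "dense node" concern described above does not arise for this data. In a real migration the same decision would be made in the RDF data model rather than in ad hoc code.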

GOVERNANCE PACKAGES AVAILABLE IN TOPBRAID EDG

• Vocabulary Management
• Metadata Management
• Reference Data Management
• Business Glossaries

In addition to the above, TopBraid Tagger and AutoClassifier is a popular additional module that is part of a comprehensive information management and governance environment where packages for other types of assets can be easily added if needed.

In ramping up a Data Governance program, different organizations may have different starting points. With TopBraid EDG, you can start incrementally and add capabilities as you go. For details on available EDG packages and additional modules visit topquadrant.com/products/topbraid-enterprise-data-governance/

About TopQuadrant

TopQuadrant helps organizations succeed in Data Governance. Its flagship product, TopBraid EDG, delivers easy and meaningful access for all data stakeholders to enterprise metadata, business terms, reference data, data and application catalogs, data lineage, requirements, policies, and processes.

TopQuadrant's customer list includes over 120 organizations in financial services, pharma, healthcare, digital media, government and other sectors.

                                                  ©2020 TopQuadrant, Inc. All rights reserved. TopBraid Enterprise Data Governance–Vocabulary Management, and the TopQuadrant logo
                                                  are trademarks of TopQuadrant Inc. in the U.S. All other trademarks are the property of their respective owners. Specifications subject
                                                  to change without notice.

           For more details or to schedule a demo, contact us at: edg-info@topquadrant.com
