Information Prediction using Knowledge Graphs for Contextual Malware Threat Intelligence


Nidhi Rastogi, Sharmishtha Dutta, Ryan Christian, Mohammad Zaki, Alex Gittens
Rensselaer Polytechnic Institute, Troy, NY

Charu Aggarwal
IBM Research, Yorktown, NY

arXiv:2102.05571v2 [cs.CR] 18 Feb 2021

Abstract

Large amounts of threat intelligence information about malware attacks are available in disparate, typically unstructured, formats. Knowledge graphs can capture this information and its context using RDF triples represented by entities and relations. Sparse or inaccurate threat information, however, leads to challenges such as incomplete or erroneous triples. Named entity recognition (NER) and relation extraction (RE) models used to populate the knowledge graph cannot fully guarantee accurate information retrieval, further exacerbating this problem. This paper proposes an end-to-end approach to generate a Malware Knowledge Graph called MalKG, the first open-source automated knowledge graph for malware threat intelligence. The MalKG dataset, called MT40K¹, contains approximately 40,000 triples generated from 27,354 unique entities and 34 relations. We demonstrate the application of MalKG in predicting missing malware threat intelligence information in the knowledge graph. For ground truth, we manually curate a knowledge graph called MT3K, with 3,027 triples generated from 5,741 unique entities and 22 relations. For entity prediction via a state-of-the-art entity prediction model (TuckER), our approach achieves 80.4 for the hits@10 metric (which predicts the top 10 options for missing entities in the knowledge graph) and 0.75 for the MRR (mean reciprocal rank). We also propose a framework to automate the extraction of thousands of entities and relations into RDF triples, both manually and automatically, at the sentence level from 1,100 malware threat intelligence reports and from the Common Vulnerabilities and Exposures (CVE) database.

1 Introduction

Malware threat intelligence informs security analysts and researchers about trending attacks and defense mechanisms. They use this intelligence in their increasingly complex jobs of finding zero-day attacks, protecting intellectual property, and preventing intrusions from adversaries. This information is available in abundance; however, it is typically in non-standard and unstructured formats. Besides, information is not always accurate or complete. For example, threat reports² about the recent attacks on SolarWinds software captured detailed analysis of the Sunspot malware that inserted the Sunburst malware into SolarWinds' Orion software. The timeline has changed in subsequent reports as more information was revealed. However, no information is available about the threat actor, although the Sunburst malware and a backdoor linked to the Turla advanced persistent threat (APT) group share similarities. Knowledge graphs can enable the identification of this missing information about the attack actor.

A primary goal for security threat researchers is to automate threat detection and prevention with minimal human involvement in the loop. AI-enabled technology, together with big data about threats, has the potential to deliver on this to a great extent. Increasing sources of threat information on the Internet, such as analysis reports, blogs, and common vulnerabilities and exposures databases (CVE, NVD), provide enough data to evaluate the current threat landscape and prepare our defenses against future ones. Structuring this mostly unstructured data and converting it into a usable format solves only one problem. It misses out on capturing the context of the threat information, which can significantly boost threat intelligence. However, several challenges need addressing before this step, for instance: (a) lack of standardized data aggregation formats that collect both data and context, (b) limitations in data extraction models that lead to incomplete information gathered about attacks, and (c) inaccurate data extracted from threat information. In this paper, we address the first two challenges. The third challenge, which concerns the trustworthiness of source information, is left as future work.

One might argue that data sources such as Common Vulnerabilities and Exposures (CVE) [38] and National Vulnerability Database (NVD) [27], and standards like Structured Threat Information Expression (STIX) [3], can effectively aggregate

¹ Anonymous GitHub link: https://github.com/malkg-researcher/MalKG
² https://www.cyberscoop.com/turla-solarwinds-kaspersky/

threat and vulnerability data in an automated way. However, these data sources and standards overlook the importance of context, which can be favorably added via linking and semantic metadata. Text data is not a mere aggregation of words. There is meaning in the information, and linking concepts to outside sources is crucial to its effective use. For example, a human security analyst can correctly comprehend the meaning of the term "Aurora" in a threat report. It refers to the series of cyber attacks conducted by Advanced Persistent Threat (APT) groups from China, first detected by Google in 2009. However, a computer or an AI model collecting this information might mistake it for the Northern Lights unless the security context is provided.

Knowledge graphs bridge this gap by adding context as well as allowing reasoning over data. The smallest unit of a knowledge graph, referred to as an RDF triple [21], comprises data instances using entities and relations. Entities capture real-world objects, events, abstract concepts, and facts. For example, in the cybersecurity domain, entities represent malware, organizations, vulnerabilities, and malware behavioral characteristics. Relations connect entities using key phrases that describe the relationship between them. Specifically, adding context using knowledge graphs in malware threat intelligence opens up many opportunities:

1. Entity prediction: Predict the missing entities in a knowledge graph. Given a malware knowledge graph, complete missing information such as malware author and vulnerabilities. For an incomplete triple ⟨multi-homed windows systems, hasVulnerability, ?⟩, entity prediction would return an ordered set of entities, e.g., {spoofed route pointer, multiple eval injection vulnerability, heap overflow}. In this case, spoofed route pointer is the correct tail entity for the incomplete triple.

2. Entity classification: Determine if the class label of an entity is accurate. For example, correctly determining that spoofed route pointer is an instance of class:Vulnerability.

3. Entity resolution: Determine whether two entities refer to the same object. For example, Operation AppleJeus and Operation AppleJeus Sequel might have similar properties. However, they are different malware campaigns and should not be resolved to the same entity.

4. Triple classification: Verify whether an RDF triple in the malware threat intelligence knowledge graph is accurately classified. For example, ⟨ZeroCleare, targets, Energy Sector In The Middle-East⟩ should be classified as a correct triple, whereas ⟨ZeroCleare, targets, Energy Sector In Russia⟩ should be classified as an incorrect triple.

Knowledge graphs like MalKG are rich in information that can be leveraged by statistical learning systems [7]. For instance, in MalKG, a low-dimensional embedding vector space can represent an entity, and a matrix can represent a relation. Therefore, entities and relations can be embedded using various approaches [2, 6, 7, 40] into machine-readable forms that are amenable to automated analysis and inference.

Motivation: In the examples described in the previous section, knowledge graph completion would predict the missing head entity, i.e., the author of the "Aurora Botnet" malware. MalKG's completeness primarily relies on two factors: (1) the information available in the threat reports and (2) the efficacy of the NER and RE models. Entity prediction is the task of predicting the best candidate for a missing entity or inferring new facts based on existing information stored in a knowledge graph. Rule-based inference using a malware threat intelligence ontology governs which head and tail entity classes can form relations. In this paper, we present entity prediction based on the TuckER [2] entity prediction model. We also compare and contrast the performance of entity prediction models that otherwise perform well for datasets in other domains. Translation-based embedding models [6, 20, 40] fare especially well, as does tensor factorization [2, 37, 43]. Other models perform adequately on specific datasets but do not fully express some properties of MalKG, such as one-to-many relations among entities, the absence of inverse relations, and very few instances of symmetric relations.

Cybersecurity intelligence is a broad domain, and it is a massive undertaking to represent this domain in one knowledge graph. Therefore, in this research we concentrate on malware threat intelligence and study malware examples to illustrate the work. Nonetheless, the challenges and approach are broad enough to be applied to other areas of the security domain. Our malware threat intelligence knowledge graph, MalKG, is instantiated from malware threat intelligence reports collected from the Internet. We tokenize approximately 1,100 threat reports in PDF format, along with CVE vulnerability descriptions, to extract entities covering malware threat intelligence. The instantiation process is described in the sections below and includes algorithms for named entity recognition (NER) and relation extraction (RE). Together, the NER and RE algorithms are deployed and automated to generate thousands of RDF triples that capture malware threat information as units of a knowledge graph. We also discuss the challenges encountered in this process. Specifically, the main contributions of this paper are:

1. We present an end-to-end, automated, ontology-defined knowledge graph generation framework for malware threat intelligence. 40,000 triples are extracted from 1,100 threat reports and CVE vulnerability descriptions written in unstructured text format. These triples create the first open-source knowledge graph for extracting structural and contextual malware threat intelligence for almost one thousand malware samples since 1999.

2. The triples are encoded into vector embeddings using tensor-factorization-based models that can extract latent

information from graphs and find hidden information. This is a crucial goal for cybersecurity analysts because running machine learning models on vector embeddings can produce more accurate and contextual predictions for autonomous intrusion detection systems.

3. We provide a ground truth dataset, MT3K, which comprises 5,741 hand-labeled entities and their classes, 22 relations, and 3,027 triples extracted from 81 unstructured threat reports written between 2006 and 2019. Security experts validated this data, which is extremely valuable for training the automation process for larger malware threat intelligence knowledge graphs.

4. We adapt state-of-the-art entity and relation extraction models to the cybersecurity domain and map their default semantic classification to the malware ontology. We also ensure that the models scale up to the threat corpus's size without compromising precision.

5. Given a MalKG with missing information (or entities), we demonstrate the effectiveness of entity completion using embedding models that learn and predict the missing information from MalKG by finding semantic and syntactic similarities in its other triples.

6. MalKG is open source, and along with the text corpus, triples, and entities, it provides a new benchmark framework for novel contextual security research problems.

2 Background

2.1 Definitions

The malware knowledge graph, MalKG, expresses heterogeneous multi-modal data as a directed graph, MalKG = {E, R, T}, where E, R, and T indicate the sets of entities, relations, and triples (also called facts), respectively. Each triple ⟨e_head, r, e_tail⟩ ∈ T indicates that there is a relation r ∈ R between e_head ∈ E and e_tail ∈ E. For the entities and relation e_head, e_tail, r, we use the boldface h, t, r to indicate their low-dimensional embedding vectors, respectively. We use d_e and d_r to denote the embedding dimensions of entities and relations, respectively. The symbols n_e, n_r, and n_t denote the numbers of entities, relations, and triples in the KG, respectively. We use the notation f_φ for the scoring function of each triple ⟨h, r, t⟩. The scoring function measures the plausibility of a fact in T, based on translational distance or semantic similarity [19]. A summary of the notation appears in Table 1.

Table 1: Notations and Descriptions

Notation              Description
E                     the set of all entities
R                     the set of all relations
T                     the set of all triples
⟨e_head, r, e_tail⟩   head entity, relation, and tail entity
h, r, t               embeddings of a head, relation, and tail
d_e, d_r              embedding dimensions of entities and relations
n_e, n_r, n_t         numbers of entities, relations, and triples
f_φ                   scoring function

2.2 Malware Knowledge Graph (MalKG)

Knowledge graphs store large amounts of domain-specific information in the form of triples using entities and the relationships between them [21]. General-purpose knowledge graphs such as Freebase [5], Google's Knowledge Graph³, and WordNet [10] have emerged in the past decade. They enhance "big-data" results with semantically structured information that is interpretable by computers, a property deemed indispensable for building more intelligent machines. No open-source malware knowledge graph exists, and therefore, in this research, we propose the first-of-its-kind automatically generated knowledge graph for malware threat intelligence.

The MalKG generation process uses a combination of curated and automated unstructured data [21] to populate the KG. This has helped us achieve the high accuracy of a curated approach and the efficiency of an automated approach. MalKG was constructed using existing ontologies [28, 34], as they provide a structure for MalKG generation with achievable end goals in mind. Ontologies are constructed using competency questions that pave the way for systematic identification of semantic categories (called classes) and properties (also called relations) [28]. An ontology also avoids the generation of redundant triples and accelerates the triple generation process.

MalKG comprises hierarchical entities. For example, a Malware class can have sub-classes such as TrojanHorse and Virus. The hierarchy captures the categories of malware based on how they spread and infect computers. The relations in MalKG, on the other hand, are not hierarchical. For instance, a Virus (sub-class of Malware) known as WickedRose exploits a Microsoft PowerPoint vulnerability and installs malicious code. Here "exploits" is the relation between the entities Virus (a subclass) and Software, and therefore does not require a sub-relation to describe a different category of exploit. MalKG stores facts like these in the form of triples.

2.3 Information Prediction

Consider an incomplete triple from MalKG, ⟨Adobe Flash Player, hasVulnerability, ?⟩. Knowledge graph completion would predict its missing tail t, i.e., a vulnerability of the software titled "Adobe Flash Player." MalKG can predict missing entities from an incomplete triple, which forms the smallest malware threat intelligence unit. During knowledge graph construction, triples can have a missing head or tail.

³ https://developers.google.com/knowledge-graph

Knowledge graph completion, sometimes also known as entity prediction, is the task of predicting the best candidate for a missing entity. It can also infer new facts based on existing information represented in the knowledge graph. Formally, the task of entity prediction is to predict the value of h given ⟨?, r, t⟩, or t given ⟨h, r, ?⟩, where "?" indicates a missing entity.

2.4 MalKG Vector Embedding

Triples of MalKG are represented by relatively low-dimensional (e.g., 50) vector space embeddings [2, 36, 39, 42, 43], without losing information about the structure and while preserving key features. We construct embeddings for all the triples in MalKG, where each triple has three vectors, e1, r, e2 ∈ R^d, and the entities are hierarchical. Therefore, to narrow down the choice of embedding models, we ensured that the embedding model meets the following requirements:

1. An embedding model has to be expressive enough to handle different kinds of relations (as opposed to general graphs, where all edges are equivalent). It should create separate embeddings for triples sharing the same relation. To elaborate, the embedding of a triple ⟨e1, hasVulnerability, e3⟩ where e1 ∈ class:ExploitTarget should be different from the embedding of a triple ⟨e2, hasVulnerability, e3⟩ where e2 ∈ class:Software.

2. An embedding model should also be able to create different embeddings for an entity when it is involved in different kinds of relations [40]. This requirement ensures creating embeddings of triples with 1-to-n, n-to-1, and n-to-n relations. For example, TransE [6], a straightforward and expressive model, has flaws in dealing with 1-to-n, n-to-1, and n-to-n relations [20, 40]. It would create very similar embeddings for the DUSTMAN and ZeroCleare malware because they both involve the TDL file, although these are two separate entities and should have separate embeddings.

3. Entities that belong to the same class should be projected into the same neighborhood of the embedding space [11]. This requirement allows using an embedding's neighborhood information when predicting the class label of an unlabeled entity (this task is called entity classification).

2.5 Challenges in MalKG Construction

Generating triples from the vast number of threat reports is not a trivial task, and unlike for a small curated benchmark dataset, the manual annotation process is not replicable or scalable for a large amount of data. Some of the challenges that need to be addressed are:

(a) Malware threat reports cover technical and social aspects of a malware attack in the form of unstructured text. There is no standardized format for the reports used to describe malware threat intelligence. Therefore, information extracted from raw text contains noise and other complex content such as code snippets, images, and tables.

(b) Threat reports contain both generic and domain-specific information. Named entity recognition (NER) models trained on generic datasets or on specific domains such as health cannot be used for cybersecurity threat reports. Besides, domain-specific concepts sometimes have titles that can mislead NER models into categorizing them as generic concepts. For example, when a CVE description mentions "Yosemite backup 8.7," it refers to the Yosemite server backup software, not the national park in California.

(c) Entity extraction can sometimes lead to multiple classifications of the same concept. For example, SetExpan [29] may classify annotated entities as similar to an input seed semantic class. For instance, the entity "Linux" was classified both as a software and as a vulnerability.

(d) Relation extraction methods like DocRED [44] are trained on a medical dataset and are sparsely documented. Hard-coded variables in the source code that work specifically for the sample data make it challenging to re-use. However, we generalized it to target malware threat intelligence entities. Another crucially important aspect of [44] is the exponential growth in memory requirements with the number of entities in a given document.

(e) The semantic classes defined by entity extraction models [1, 5, 29] are distinct from the ones required by the malware threat intelligence domain and therefore cannot be applied directly.

3 MalKG: System Architecture

The input to our proposed framework for malware KG construction, as shown in Figure 1, consists of a training corpus and an existing ontology (e.g., [28]) that has classes and relations to provide structure to MalKG. The training corpus contains malware threat intelligence reports gathered from the Internet, manual annotations of the entities and relations in those reports using the ontology, and the CVE vulnerability database. The training corpus and the ontology classes are fed into entity extractors such as SetExpan [29] to provide entity instances from the unstructured malware reports. A relation extraction model [44] uses these entity instances and ontology relations as input and predicts relations among the entity pairs. Entity pairs and the corresponding relations are called triples, and they form an initial knowledge graph. We use a tensor-factorization-based approach, TuckER [2], to create vector embeddings, predict the missing entities, and obtain the final version of MalKG with the completed RDF triples.

Table 2: Triples corresponding to the threat report.

Head           Relation     Tail
DUSTMAN        similarTo    ZeroCleare
DUSTMAN        involves     Turla Driver Loader (TDL)
ZeroCleare     involves     Turla Driver Loader (TDL)
DUSTMAN        involves     dustman.exe
dustman.exe    drops        assistant.sys
dustman.exe    drops        elrawdisk.sys
dustman.exe    drops        agent.exe

3.1 Generating Contextual and Structured Information

This section describes the process of generating contextual and structured malware threat information from unstructured data.
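The output of the extraction stage can be pictured as plain (head, relation, tail) records carrying provenance. A minimal sketch, using the Table 2 triples with hypothetical report identifiers (the real system tracks report IDs and CVE IDs; the function below is illustrative, not the paper's actual code):

```python
def build_kg(extractions):
    """Deduplicate (head, relation, tail) facts, merging provenance per triple."""
    kg = {}
    for head, relation, tail, source in extractions:
        kg.setdefault((head, relation, tail), set()).add(source)
    return kg

# Triples from Table 2, tagged with hypothetical report identifiers.
rows = [
    ("DUSTMAN", "similarTo", "ZeroCleare", "report-0042"),
    ("DUSTMAN", "involves", "Turla Driver Loader (TDL)", "report-0042"),
    ("ZeroCleare", "involves", "Turla Driver Loader (TDL)", "report-0042"),
    ("DUSTMAN", "involves", "dustman.exe", "report-0042"),
    ("dustman.exe", "drops", "assistant.sys", "report-0042"),
    ("dustman.exe", "drops", "agent.exe", "report-0042"),
    # the same fact extracted from a second source merges rather than repeats:
    ("DUSTMAN", "similarTo", "ZeroCleare", "report-0043"),
]
kg = build_kg(rows)
print(len(kg))  # 6 unique triples
print(sorted(kg[("DUSTMAN", "similarTo", "ZeroCleare")]))
```

Keeping provenance as a set per triple lets later stages trace any fact in the graph back to every report or CVE entry that asserted it.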
Corpus - Threat reports, blogs, threat advisories, tweets, and reverse-engineered code can be investigated to gather details about malware characteristics. Our work relies on threat reports, which capture detailed accounts of malware attacks, and on CVE vulnerability descriptions. Security organizations publish in-depth information on how cybercriminals abuse various software and hardware platforms and security tools. See Figure 2 for an excerpt of a threat report on the Backdoor.Winnti malware family. Information collected from a corpus of threat reports includes vulnerabilities, platforms impacted, modus operandi, attackers, the first sighting of the attack, and facts such as indicators of compromise. These reports serve as research material for other security analysts who acquire essential information to study and an-

MalKG, this provenance information contains a unique identifier of the threat reports from which it was extracted or the unique CVE IDs.

Information Extraction - Following the ontology for malware threat intelligence, we created ground truth triples by manually annotating threat reports. We generated MT3K, a knowledge graph for malware threat intelligence, using hand-annotated triples. While this was an essential step for creating ground truth, the hand-annotated approach had several limitations; nonetheless, it is invaluable for testing large-scale models.

Inference Engine - We used the classes and relations defined in MALOnt [28] and the Malware ontology [34] to extract triples from unstructured text. While using an ontology is not
alyze malware samples, adversary techniques, learn about
                                                                        the only approach to instantiate MalKG, it can map disparate
zero-day vulnerabilities exploited, and much more. However,
                                                                        data from multiple sources into a shared structure and main-
not all reports are consistent, accurate, or comprehensive. Ag-
                                                                        tains a logic in that structure using rules. An ontology uses a
gregating valuable data from these threat reports to incor-
                                                                        common vocabulary that can facilitate the collection, aggre-
porate all knowledge about a given malware and obtain this
                                                                        gation, and analysis of that data [24]. For example, there are
information during threat incident research (e.g., investigating
                                                                        classes named Malware and Location in MALOnt. According
false-positive events) is an important research goal.
                                                                        to the relations specified in MALOnt, an entity of Malware
   Contextualization - MalKG captures threat information
                                                                        class can have a relation “similarTo” with another Malware
in a triple, which is the smallest unit of information in the
                                                                        class entity. But due to the rule engine, Malware class can
knowledge graph [21]. For instance, consider the following
                                                                        never be related to an entity of Location class.
snippet from a threat report:4
                                                                            A knowledge graph contains 1-to-1, 1-to-n, n-to-1, and n-
   “... DUSTMAN can be considered as a new variant of                   to-n relations among entities. An entity-level 1-to-n relation
  ZeroCleare malware ... both DUSTMAN and ZeroCleare                    means there are multiple correct tails in a knowledge graph
  utilized a skeleton of the modified “Turla Driver Loader              for a particular pair of head and relation. In Figure 3, the
(TDL)” ... The malware executable file “dustman.exe” is not             entity “dustman.exe” is an instance of a Malware File. It drops
    the actual wiper, however, it contains all the needed               three different files. Entities DUSTMAN and ZeroCleare both
     resources and drops three other files [assistant.sys,              *involve* the Turla Driver Loader (TDL) file, and elaborate
         elrawdisk.sys, agent.exe] upon execution...”                   how n-to-1 relations exist in a knowledge graph; n-to-n works
                                                                        similarly.
   Table 2 shows a set of triples generated from this text. To-
gether, many such triples form the malware knowledge graph
where the entities and relations model nodes and directed               3.2    Predicting Missing Information
edges, respectively. Figure 3 shows the sample knowledge
                                                                        Knowledge graph completion enables inferring missing facts
graph from the text.
                                                                        based on existing triples in MalKG when information (entities
   Instantiating Ontology - Each entity in MalKG has a
                                                                        and relations) about a malware threat is sparse. For MalKG,
unique identifier, which captures the provenance informa-
                                                                        we demonstrate entity prediction through TuckER [2] and
tion comprising references to the triples’ threat sources. For
                                                                        TransH [40]. Standard datasets such as FB15K [6], FB15k-
   4 https://www.lastwatchdog.com/wp/wp-content/uploads/                237 [35], WN18 [6] , and WN18RR [9] were used to com-
Saudi-Arabia-Dustman-report.pdf                                         pare with other models and create baselines for other more

                                                                    5
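The triple structure in Table 2 and the ontology rule engine described above can be sketched in a few lines of Python. The relation-to-class rule table below is a simplified, hypothetical stand-in for MALOnt's actual domain/range constraints; the entities and relations are taken from the DUSTMAN example, while the class labels are illustrative assumptions.

```python
# Sketch of MalKG-style triples plus an ontology rule check.
# The RULES table is a hypothetical simplification of MALOnt's
# domain/range constraints, shown only to illustrate the mechanism.

TRIPLES = [
    ("DUSTMAN", "similarTo", "ZeroCleare"),
    ("DUSTMAN", "involves", "Turla Driver Loader(TDL)"),
    ("dustman.exe", "drops", "assistant.sys"),
]

# Entity -> ontology class (assumed labels for illustration).
ENTITY_CLASS = {
    "DUSTMAN": "Malware",
    "ZeroCleare": "Malware",
    "Turla Driver Loader(TDL)": "MalwareFile",
    "dustman.exe": "MalwareFile",
    "assistant.sys": "MalwareFile",
    "Riyadh": "Location",
}

# Allowed (head class, tail class) pairs per relation.
RULES = {
    "similarTo": {("Malware", "Malware")},
    "involves": {("Malware", "MalwareFile"), ("Malware", "Malware")},
    "drops": {("MalwareFile", "MalwareFile")},
}

def is_valid(head, relation, tail):
    """Admit a triple only if its classes satisfy the relation's rule."""
    pair = (ENTITY_CLASS[head], ENTITY_CLASS[tail])
    return pair in RULES.get(relation, set())

assert all(is_valid(*t) for t in TRIPLES)
# Per the rule engine, a Malware entity can never be 'similarTo' a Location:
assert not is_valid("DUSTMAN", "similarTo", "Riyadh")
```

A production instantiation would read these constraints from the OWL definitions of MALOnt rather than a hand-written dictionary, but the admit/reject logic is the same.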
Figure 1: System architecture for MalKG construction and entity prediction.

Figure 2: An excerpt of a threat report on the Backdoor.Winnti malware family.

Figure 3: Sample MalKG. Nodes and edges represent entities and relations, respectively; the edge labels denote the relation type.

However, the datasets used for comparison differ starkly from our MT3K and MT40K triple datasets. The following protocol describes the prediction of missing entities in the malware knowledge graph:

 1. The head entity h is replaced with all other entities in the knowledge graph for each triple in the test set. This generates a set of candidate triples to compare with each true test triple. This step is known as corrupting the head of a test triple, and the resultant triple is called a corrupted triple.

 2. Dissimilarity scores fφ(h, r, t) are calculated for each true test triple and its corresponding corrupted triples. The scores are sorted in ascending order to identify the rank of the true test triple in the ordered set. This rank assesses the model's performance in predicting the head given ⟨?, relation, tail⟩. A better (smaller) rank indicates that the model has learned effective embeddings for the entities and relations.

 3. The previous two steps are repeated for the tail entity t, and the ranks of the true test triples are stored. This rank assesses the model's performance in predicting the missing t given ⟨h, r, ?⟩.

 4. Mean Rank (MR), Mean Reciprocal Rank (MRR), and Hits@n, the metrics conventionally used in the literature, are calculated from the ranks of all true test triples. Mean Rank is the average rank of all true test triples; it is significantly affected when even a single test triple has a bad rank. Recent works [2, 17] therefore report the Mean Reciprocal Rank, the average of the inverse ranks of all true test triples, for a more robust estimate. Hits@n denotes the ratio of true test triples that appear in the top n positions of the ranking. When comparing embedding models, smaller values of Mean Rank and larger values of MRR and Hits@n indicate a better model. We report MRR and Hits@n for various tensor factorization-based and translation-based models.

4    Experimental Evaluation

This section describes our experimental setup, results, and implementation details for knowledge graph completion.

4.1    Datasets

The malware threat intelligence corpus comprises threat reports and CVE vulnerability descriptions. The threat reports were written between 2006 and 2021, and the CVE vulnerability descriptions were created between 1990 and 2021.
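The corruption-and-ranking protocol above can be sketched as follows. The toy dissimilarity function stands in for the learned fφ(h, r, t) of TuckER or TransH, so the specific scores are illustrative assumptions; the rank and metric computations follow the protocol's definitions.

```python
# Sketch of the head-corruption evaluation protocol. `score` is a toy
# dissimilarity function standing in for f_phi(h, r, t); lower scores
# mean a more plausible triple.

def rank_head(true_triple, entities, score):
    """Corrupt the head with every entity; return the 1-based rank of the
    true triple when candidates are sorted by ascending score."""
    h, r, t = true_triple
    candidates = [(e, r, t) for e in entities]
    scores = sorted(score(c) for c in candidates)
    return scores.index(score(true_triple)) + 1

def metrics(ranks, n=10):
    """Mean Rank, Mean Reciprocal Rank, and Hits@n over all test triples."""
    mr = sum(ranks) / len(ranks)
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(r <= n for r in ranks) / len(ranks)
    return mr, mrr, hits

# Toy example: four candidate head entities with made-up scores.
entities = ["DUSTMAN", "ZeroCleare", "dustman.exe", "agent.exe"]
toy_score = {"DUSTMAN": 0.1, "ZeroCleare": 0.4, "dustman.exe": 0.7, "agent.exe": 0.9}

def score(triple):
    return toy_score[triple[0]]

# DUSTMAN has the lowest dissimilarity, so the true triple ranks first.
r = rank_head(("DUSTMAN", "similarTo", "ZeroCleare"), entities, score)
```

The tail-prediction case is symmetric: fix the head and relation and corrupt the tail instead.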
We perform several pre-processing steps on the raw corpus data. We assign a unique identifier to every individual document and CVE description. Individual sentences were extracted from each document and assigned a sentence ID. Identifying documents and sentences also provides provenance for future work in trustworthiness, contextual entity classification, and relation extraction. We tokenize and parse the plain-text sentences into words with start and end positions using the spaCy default tokenizer5 library and save them in sequential order for each document.

                Table 3: Description of datasets.

                   Dataset     ne       nr     nt       docs    avgDeg   density
                   MT3K        5,741    22     3,027    81      0.5273   0.00009
                   MT40K       27,354   34     40,000   1,100   1.46     5.34 × 10−5
                   FB15K       14,951   1,345  592,213  -       39.61    0.00265
                   WN18        40,943   18     151,442  -       3.70     0.00009
                   FB15K-237   14,541   237    310,116  -       21.32    0.00147
                   WN18RR      40,943   11     93,003   -       2.27     0.00005

4.2    Manually Curated Triples: MT3K

The MT3K dataset is a collection of 3,027 triples (over 5,741 unique entities; see Table 3) manually annotated by our research team and validated by security experts. 81 threat reports were annotated (see Figure 4) using the Brat annotation tool [32]. In Figure 4, "PowerPoint file" and "installs malicious code" are labeled with the classes Software and Vulnerability, respectively. The arrow from Software to Vulnerability denotes the semantic relationship hasVulnerability between them. The annotation followed rules defined in the malware threat ontologies [28, 34]. The 22 relation types (nr) are defined in the malware threat intelligence ontology [28].

Figure 4: Annotation using the Brat annotation tool.

   The MT3K dataset serves as the benchmark for evaluating our automated approach to triple generation for malware threat intelligence. Other triple datasets, such as FB15K [6], FB15k-237 [35], WN18 [6], and WN18RR [9], have been used to evaluate the performance of generic knowledge graphs and their applications, such as link prediction and knowledge graph completion. However, there is no open-source knowledge graph for malware threat intelligence, and therefore no triples exist for malware KG evaluation. We hope that MT3K and MT40K will serve as benchmark datasets for future security research.

4.3    Automated Dataset: MT40K

The MT40K dataset is a collection of 40,000 triples generated from 27,354 unique entities and 34 relations. The corpus consists of approximately 1,100 de-identified plain-text threat reports written between 2006 and 2021 and all CVE vulnerability descriptions created between 1990 and 2021. The annotated keyphrases were classified into entities derived from semantic categories defined in the malware threat ontologies [28, 34]. The inference rule engine was applied when forming relations between entities to produce the triples used to build MalKG.

   Our information extraction framework comprises entity and relation extraction and is grounded in the requirements of the cybersecurity domain. The framework takes as input the plain text of threat reports and raw CVE vulnerability descriptions. It locates and classifies atomic elements (mentions) in the text belonging to pre-defined categories such as Attacker, Vulnerability, and so on. We split the task into entity extraction and entity classification. The entity extraction (EE) step combines an ensemble of state-of-the-art entity extraction techniques. The entity classification (EC) step addresses the disambiguation of an entity classified into more than one semantic category and assigns exactly one category based on an empirically learned algorithm. Results from EE and EC are combined to give the final extraction results. Details are provided in the next section.

4.3.1    Entity Extraction (EE)

We segregate the extraction of tokenized words into two categories: domain-specific key-phrase extraction and generic key-phrase extraction (e.g., class Location). The cybersecurity domain-specific EE task was further subdivided into factual key-phrase extraction (e.g., class Malware) and contextual key-phrase extraction (e.g., class Vulnerability). We use precision and recall as evaluation criteria to narrow down the EE models for different classes. Specifically, the Flair framework [1] gave high precision scores (0.88-0.98) for the classification of generic and factual domain-specific key-phrases such as Person, Date, Geo-Political Entity, and Product (Software, Hardware). The diversity in writing style across threat reports, and the noise introduced when converting reports to text, led to different precision scores per class, though within the range of 0.9-0.99. For contextual entity extraction, we expanded the SetExpan [29] model to generate all entities in the text corpus. The feedback loop that inputs entity seed values based on confidence scores allowed for a better training dataset.

   5 https://spacy.io/api/tokenizer
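The pre-processing described in Section 4.1 (document IDs, sentence IDs, and token start/end offsets kept in sequential order) can be sketched as follows. A naive period-and-whitespace splitter stands in for the spaCy default tokenizer actually used; the sample corpus line is invented for illustration.

```python
# Sketch of the corpus pre-processing: each document gets an ID, each
# sentence a sentence ID, and each token its start/end character offsets
# within its sentence. A naive splitter stands in for spaCy's tokenizer.

def preprocess(corpus):
    records = []
    for doc_id, text in enumerate(corpus):
        sentences = (s.strip() for s in text.split(".") if s.strip())
        for sent_id, sentence in enumerate(sentences):
            pos = 0
            for token in sentence.split():
                start = sentence.index(token, pos)  # start offset in sentence
                end = start + len(token)            # end offset (exclusive)
                records.append((doc_id, sent_id, token, start, end))
                pos = end
    return records

corpus = ["DUSTMAN is a wiper. It drops three files"]
rows = preprocess(corpus)
# First record: token 'DUSTMAN' of sentence 0 in document 0, offsets (0, 7).
```

The resulting (document, sentence, token, start, end) records provide the provenance hooks mentioned above for trustworthiness and downstream entity/relation extraction.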
Some of the semantic classes used in this approach are Malware, Malware Campaign, Attacker Organization, and Vulnerability.

4.3.2    Entity Classification (EC)

A shortcoming of using multiple EE models is that some entities get classified multiple times, sometimes inaccurately. We observed empirically that this phenomenon was not limited to any specific class. An explanation is the variation and grammatical inconsistency in the writing styles of different threat reports. For example, the entity "linux" was classified as both Software and Organization for the same position in a document. Since the entity extraction models run independently of each other, closely related class types are likely to overlap. We followed a three-step approach to resolving ambiguity in entity classification:

 1. We compared the confidence scores of entities occurring at the same position in the same document across classes.

 2. When the confidence scores were equal, we marked those classes as ambiguous and retained the instances until the relation extraction phase. At that point, we relied on the ontology's rule engine to determine the correct class of the head and tail entities for a given relation (described in the next section).

 3. For entities that passed the first two steps, we classified them using the broader semantic class.

4.3.3    Relation Extraction

We base relation extraction for MalKG on DocRED, a distantly supervised artificial neural network model [44] pre-trained on Wikidata and Wikipedia with three features. DocRED [44] requires analyzing multiple sentences in a document to extract entities and infer their relations by amalgamating all knowledge in the document, a shift from sentence-level relation extraction. The MT3K triples and text corpus form the training dataset; the entities generated in the previous steps and the processed text corpus form the testing dataset. As described in Section 2.5, while training and testing were not a straightforward implementation, we generated 40,000 triples using 34 relations between 27,354 entities. The instantiated entities, their classification, and their position in the threat report or CVE description formed the RE model's input. The testing batch size was 16 documents. Due to scalability issues in DocRED, the number of entities per document was limited to fewer than 80; text documents with more entities were split into smaller pieces. Training ran for 4-6 hours, with an epoch size of 984 and the threshold set to 0.5.

4.4    Predicting Missing Information

Knowledge graph completion (or entity prediction) predicts missing information in MalKG. It also evaluates the entity prediction accuracy for a given triple and recommends a ranking of predicted entities. For the missing entities in the MalKG triples, the tensor factorization-based approach TuckER [2] meets the requirements of the MalKG structure (see Section 2.4) and can predict the missing entities. However, before settling on an entity prediction model, we evaluated its performance against other state-of-the-art models.

4.4.1    Benchmark Dataset

Due to the lack of a benchmark dataset in the cybersecurity domain, we rely on other datasets to observe and compare the performance of existing embedding models. The most commonly used benchmark datasets are FB15K, WN18, FB15K-237, and WN18RR. Collected from the Freebase knowledge graph, FB15K and FB15K-237 are collaborative knowledge bases containing general facts about famous people, locations, and events. WN18 and WN18RR are derived from WordNet, a lexical database of the English language. Simpler models perform better on datasets in which a test triple's corresponding inverse relation exists in the training set; such instances were eliminated from FB15K and WN18 to generate FB15K-237 and WN18RR. Experimental results by OpenKE6 for entity prediction (knowledge graph completion) are summarized in Table 4.

   Table 3 compares the structure of the graphs we generated (the automatically constructed MalKG and the hand-annotated graph), their properties, and corpus details. For example, it compares the number of entities, relations, and triples (facts), the average degree of entities, and the graph density of our datasets (MT3K, MT40K) and the benchmark datasets. Note that the average degree across all relations and the graph density are defined as nt/ne and nt/ne^2, respectively [25]. A lower score on these two metrics indicates a sparser graph.

4.4.2    Implementation

The MT3K validation set is used for tuning the hyper-parameters for MT40K. For the TuckER experiment on MT3K, we achieved the best performance with de = 200, dr = 30, learning rate = 0.0005, a batch size of 128, and 500 iterations. For the TransH experiment, de = dr = 200 and learning rate = 0.001 gave the best results. For MT3K and MT40K, we split each dataset into 70% training, 15% validation, and 15% test data. However, for the larger MT40K dataset, we relied on the MT3K dataset for testing. Knowledge graph completion for MalKG can be described under two scenarios:

 1. predict the head given ⟨?, relation, tail⟩

 2. predict the tail given ⟨head, relation, ?⟩

   6 https://github.com/thunlp/OpenKE
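The two graph statistics defined in Section 4.4.1 are straightforward to reproduce; using the MT3K counts from Table 3 (5,741 entities, 3,027 triples) recovers the table's avgDeg and density values:

```python
# Average degree and graph density as defined in Section 4.4.1:
#   avgDeg = n_t / n_e,   density = n_t / n_e**2
# where n_t is the number of triples and n_e the number of unique entities.

def avg_degree(n_triples: int, n_entities: int) -> float:
    return n_triples / n_entities

def density(n_triples: int, n_entities: int) -> float:
    return n_triples / n_entities ** 2

# MT3K, from Table 3: 5,741 entities and 3,027 triples.
assert round(avg_degree(3027, 5741), 4) == 0.5273   # matches Table 3
assert round(density(3027, 5741), 5) == 0.00009     # matches Table 3
```

Both values being small relative to the benchmark datasets confirms that MT3K is a comparatively sparse graph.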
5    Results and Discussion

Table 4 displays the entity prediction results of existing embedding models on the benchmark datasets. An upward arrow (↑) next to an evaluation metric indicates that a larger value is preferred for that metric, and vice versa. For ease of interpretation, we present Hits@n as a percentage rather than a ratio between 0 and 1. TransE has the largest Hits@10 and MRR scores on the FB15K-237 dataset, whereas TransR performs better on the other three datasets. However, TransH is more expressive (capable of modeling more relations) than TransE while retaining TransE's simplicity (time complexity O(de)). The space complexity of TransR, O(ne de + nr de dr), is higher than that of TransH, O(ne de + nr de) [39]. Additionally, on average, TransR takes six times longer to train than TransH7 [20]. Considering these trade-offs between model capability and complexity, we choose TransH as the representative translation-based model. DistMult performs better than TuckER, by a small margin of 0.37%, in only one case (Hits@10 on WN18). TuckER and ComplEx achieve very close results in several cases.

   Therefore, we look closely at the results on the recently refined datasets (FB15K-237 and WN18RR), as our MT3K dataset shares similar properties with them (MT3K has hierarchical relations, as in WN18RR, and no redundant triples, as in FB15K-237). Since the space complexities are very close (ComplEx: O(ne de + nr de); TuckER: O(ne de + nr dr)), we choose TuckER for evaluation on MT3K, as it subsumes ComplEx [2].

   We evaluate the performance of TransH [40] and TuckER [2] on our datasets. Tables 5 and 6 display the performance of TransH and TuckER on MT3K and MT40K. TransH's MRR score on MT3K improved compared to the benchmarks; TuckER's performance improved on all measures compared to the benchmark results. Evidently, TuckER outperforms TransH by a large margin on the MT3K dataset. Four reasons underlie this result:

 1. MT3K has only 22 relations compared to 5,741 unique entities. This makes the knowledge graph dense, with a high entities-to-relations ratio, which implies a strong correlation among entities. Tensor factorization-based approaches use tensors and non-linear transformations that capture such correlations in embeddings more accurately than linear models do [40]. For larger knowledge graphs such as MalKG, this property is especially pertinent: entities may be missing from the KG due to shortcomings of the entity extraction models, or information may need to be predicted for a sparsely populated graph. Accurate predictions help in both scenarios, especially in the field of cybersecurity.

 2. TuckER does not store all the latent knowledge in separate entity and relation embeddings. Instead, it enforces parameter sharing through its core tensor, so that all entities and relations learn the common knowledge stored there. This is a form of multi-task learning, which no other model except ComplEx [37] employs. Therefore, TuckER creates embeddings that reflect the complex relations (1-n, n-1, n-n) between entities in MalKG.

 3. Moreover, we ensure, with near certainty, that there are no inverse relations or redundant information in the validation and test sets. It is hard for simple yet expressive models like TransH to perform well on such datasets compared to bi-linear models like TuckER.

 4. The number of parameters of TransH grows linearly with the number of unique entities (ne), whereas for TuckER it grows linearly with the embedding dimensions (de and dr). A low-dimensional TuckER embedding can be learned once and reused across various machine learning models for tasks such as classification and anomaly detection. As future work, we will explore anomaly detection in MalKG to identify trustworthy triples.

   Since TuckER performed significantly better than TransH on the ground truth dataset, we employ only TuckER on the MT40K dataset. Notably, all measures except Mean Rank (MR) improved on MT40K compared to MT3K. This confirms the general intuition that with more data, the model learns the embeddings more effectively. MR is not a stable measure and fluctuates even when a single entity is ranked poorly. The MRR score improved even though MR worsened, which supports the conclusion that only a few poor predictions drastically affected the MR on MT40K. A Hits@10 of 80.4 indicates that when the entity prediction model produced an ordered set of entities, the true entity appeared in its top 10 in 80.4% of cases. The Hits@1 measure denotes that in 73.9% of cases the model ranked the true entity at position 1, i.e., the model's best prediction was the missing entity of the corresponding incomplete triple. Hits@3 can be interpreted similarly. We consider the overall entity prediction results on MT40K sufficient for the trained model to be improved further and used to predict unseen entities.

6    Related Work

Several generic and domain-specific knowledge graphs, such as Freebase [5], Google's Knowledge Graph8, and WordNet [10], are available as open source. Linked Open Data [4] refers to all the data available on the Web that is accessible using standards such as Uniform Resource Identifiers (URIs) to identify entities, and

   7 On benchmark datasets
   8 https://developers.google.com/knowledge-graph
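The parameter-sharing argument in point 2 can be made concrete: TuckER scores a triple by contracting the head, relation, and tail embeddings with a single core tensor W that is shared across all entities and relations. The numpy sketch below uses random parameters and small illustrative dimensions, not the trained MalKG model or the paper's hyper-parameters.

```python
import numpy as np

# TuckER scoring sketch: phi(h, r, t) = W x1 e_h x2 w_r x3 e_t, where the
# core tensor W is shared across ALL entities and relations -- the
# "multi-task" parameter sharing discussed above. Dimensions are toy values.

rng = np.random.default_rng(0)
de, dr = 8, 4                     # entity / relation embedding dimensions
n_entities, n_relations = 10, 3

E = rng.normal(size=(n_entities, de))    # entity embedding matrix
R = rng.normal(size=(n_relations, dr))   # relation embedding matrix
W = rng.normal(size=(dr, de, de))        # shared core tensor

def score(h: int, r: int, t: int) -> float:
    """Contract the shared core tensor with the three embeddings
    to produce a single plausibility score for triple (h, r, t)."""
    return float(np.einsum("rij,i,r,j->", W, E[h], R[r], E[t]))

s = score(0, 1, 2)
# Parameter count grows with (de, dr), not with the number of entities:
# E and R grow with the graph, but W stays (dr x de x de) regardless.
```

Because W is reused by every relation, knowledge learned while fitting one relation's triples constrains the embeddings used by all the others, which is the behavior credited above for TuckER's advantage on MT3K.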
Table 4: Entity Prediction for different Data sets
                                  FB15K                 WN18                 FB15K-237                                WN18-RR
              Model         Hits@10↑ MRR↑ Hits@10↑ MRR↑ Hits@10↑ MRR↑                                             Hits@10↑ MRR↑
              TransE           44.3     0.227      75.4        0.395        32.2       0.169                         47.2   0.176
              TransH           45.5     0.177      76.2        0.434        30.9       0.157                         46.9   0.178
              TransR           48.8     0.236      77.8        0.441        31.4       0.164                         48.1   0.184
             DistMult          45.1     0.206      80.9        0.531        30.3       0.151                         46.2   0.264
             ComplEx           43.8     0.205      81.5        0.597        29.9       0.158                         46.7   0.276
             TuckER            51.3     0.260      80.6        0.576        35.4       0.197                         46.8   0.272
             The top three rows are models from the translation-based approach, rest are from tensor factorization-based approach
             Bold numbers denote the best scores in each category
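For reference, the Hits@k and MRR figures reported in these tables can be computed from the rank each model assigns to the true missing entity of every test triple. A minimal sketch (function names and the example ranks are illustrative, not from the MalKG implementation):

```python
def hits_at_k(ranks, k):
    """Fraction of test triples whose true missing entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Mean of 1/rank of the true missing entity over all test triples."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Ranks assigned to the true entity for five hypothetical test triples.
ranks = [1, 3, 2, 10, 50]
print(hits_at_k(ranks, 10))          # → 0.8 (four of five ranked in the top 10)
print(mean_reciprocal_rank(ranks))   # ≈ 0.3907
```

Higher is better for both metrics (the ↑ in the table headers), while mean rank (MR, ↓) is simply the average of the same ranks.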

The Resource Description Framework (RDF) then unifies these
entities into a data model. However, Linked Open Data does not
contain domain-specific entities that deal with cyber threats and
intelligence. Other existing knowledge graphs are either not open
source, too specific, or do not cater to the breadth of information
our research intends to gather from intelligence reports.

Table 5: Experimental evaluation of MT3K for entity prediction.

  Model    Hits@1↑  Hits@3↑  Hits@10↑  MR↓    MRR↑
  TransH    26.8     50.5     65.2     32.34  0.414
  TuckER    64.3     70.5     81.21     7.6   0.697

Table 6: Experimental evaluation of MT40K for entity prediction.

  Model    Hits@1↑  Hits@3↑  Hits@10↑  MR↓   MRR↑
  TuckER    73.9     75.9     80.4     202    0.75

   Creating and completing a knowledge graph from raw text
comprises multiple tasks, e.g., entity discovery, relation
extraction, and designing embedding models. In this section, we
review notable research that accomplishes these tasks. Entity
discovery refers to categorizing entities in a text according to a
predefined set of semantic categories.

   Named Entity Recognition (NER) is a machine learning task for
entity discovery. The model learns features of entities from an
annotated training corpus and aims to identify and classify all
named entities in a given text. Deep neural network-based
approaches have become a popular choice for NER because they can
identify complex latent features of text corpora without human
intervention for feature design [19]. Long Short-Term Memory
(LSTM), a recurrent neural network architecture, can learn beyond
the limited context window of a component and can capture
long-range dependencies among words. Convolutional Neural Networks
(CNNs) learn character-level features, such as prefixes and
suffixes, when applied to NER. Chiu et al. [8] combine the
strengths of LSTMs and CNNs to learn word-level and character-level
features. Lample et al. [18] use a conditional random field layer
over an LSTM to recognize entities that span multiple words.
MGNER [41] uses a self-attention mechanism to determine how much
context information of a component should be used, and is capable
of capturing nested entities within a text.

   In addition, pre-trained off-the-shelf NER tools are available,
such as StanfordCoreNLP (https://stanfordnlp.github.io/CoreNLP/),
Polyglot (https://polyglot.readthedocs.io/en/latest/), NLTK
(https://www.nltk.org/), and spaCy
(https://spacy.io/api/entityrecognizer). These tools can recognize
entities representing universal concepts such as person,
organization, and time. However, they categorize domain-specific
concepts (e.g., malware campaign or software vulnerability in the
threat intelligence domain) as "other" or "miscellaneous" because
their models are pre-trained on corpora created from news articles,
Wikipedia articles, and tweets. While it is possible to train these
models on a domain-specific corpus, a large amount of labeled
training data is required.

   The absence of sufficient training data in a specific domain
such as malware threat intelligence makes it challenging to use the
algorithms and tools mentioned above to extract that domain's
technical terms. Corpus-based semantic class mining is well suited
to identifying domain-specific named entities when very few labeled
data instances exist. This approach requires two inputs: an
unlabeled corpus and a small set of seed entities that appear in
the corpus and belong to the same class. The system then learns
patterns from the seed entities and extracts more entities with
similar patterns from the corpus. Shi et al. [31] and Pantel et
al. [26] prepare a candidate entity set by calculating
distributional similarity with the seed entities, which often
introduces entity intrusion errors (entities from a different,
often more general, semantic class get included). Alternatively,
Gupta et al. [12] and Shi et al. [30] use an iterative approach,
which sometimes leads to semantic drift (shifting away from the
seed semantic class as undesired entities get included during
iterations). SetExpan [29] overcomes both of these shortcomings.
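To make the bootstrapping idea concrete, the following toy sketch expands a seed set of malware names by ranking other tokens according to how many distinct seed-context words they share. The corpus, seeds, and scoring rule are illustrative inventions, not SetExpan's actual feature-selection procedure:

```python
from collections import defaultdict

def expand(tokens, seeds, window=1, top_n=1):
    """Rank non-seed tokens by how many distinct seed-context words they share."""
    # 1. Collect the words observed within `window` tokens of any seed mention.
    seed_ctx = set()
    for i, tok in enumerate(tokens):
        if tok in seeds:
            seed_ctx.update(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
    # 2. Score every other token by the distinct seed-context words around it.
    shared = defaultdict(set)
    for i, tok in enumerate(tokens):
        if tok in seeds:
            continue
        ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        shared[tok].update(c for c in ctx if c in seed_ctx)
    ranked = sorted(shared, key=lambda t: len(shared[t]), reverse=True)
    return ranked[:top_n]

corpus = ("the Sunburst malware infected hosts . the NotPetya malware hit "
          "banks . the Emotet malware spread fast . the server crashed").split()
print(expand(corpus, {"Sunburst", "NotPetya"}))  # → ['Emotet']
```

Here "Emotet" shares both seed-context words ("the" and "malware"), so it outranks tokens that match only one, mimicking how pattern overlap with seeds drives set expansion.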
SetExpan uses selective context features of the seeds and resets
the feature pool at the start of each iteration.

   Following the entity discovery phase, relations between entity
pairs are extracted. Deep neural network-based approaches eliminate
the need for manual feature engineering. To extract relations for
constructing a large-scale knowledge graph, generating training
data through manual annotation is costly and time-consuming.
Subasic et al. [33] use deep learning to extract relations from
plain text and later align them with Wikidata
(https://www.wikidata.org/wiki/Wikidata:Main_Page) to select
domain-specific relations. Jones et al. [16] propose a
semi-supervised algorithm that starts with a seed dataset. The
algorithm learns patterns from the context of the seeds and
extracts similar relations; a confidence-scoring method prevents
the model from drifting away from pertinent relations and patterns.
Ji et al. [15] propose APCNN, a sentence-level attention module,
which improves prediction accuracy by using entity descriptions as
an additional input. In applications where labeled data is scarce,
distant supervision, a learning approach that automatically
generates training data by heuristic matching from a few labeled
data instances, is favored. Yao et al. [44] use distant supervision
to extract features beyond the sentence level and address
document-level relation extraction.

   Vector embeddings address the task of knowledge representation
by enabling computation on a knowledge graph. MalKG contains highly
structured and contextual data that can be leveraged statistically
for classification, clustering, and ranking. However, the
implementations of embedding vectors vary widely across models. We
can broadly place them into three categories: translation-based,
neural-network-based, and trilinear-product-based [36]. For MalKG,
we evaluate two embedding approaches that directly relate to our
work:

    1. Latent-distance based approaches: These models measure the
       distance between the latent representations of entities,
       where relations act as translation vectors. For a triple
       ⟨h, r, t⟩, they aim to create embeddings h, r, t such that
       the distance between h and t is low if ⟨h, r, t⟩ is a
       correct fact (or triple) and high otherwise. A correct fact
       is a triple where the relation r holds between the head h
       and the tail t. The simplest application of this idea is

           f_φ(h, t) = −‖h + r − t‖_{1/2}

       where ‖·‖_{1/2} denotes either the l1 or the l2 norm.
       TransE [6] creates embeddings such that h + r ≈ t for a
       correct triple ⟨h, r, t⟩. However, it can represent only
       1-to-1 relations because it projects all relations into the
       same vector space; thus, TransE cannot represent 1-to-n,
       n-to-1, and n-to-n relations. TransH [40] overcomes this
       drawback of TransE without increasing computational
       complexity: it uses a separate hyperplane for each kind of
       relation, allowing an entity to have different
       representations in different hyperplanes when involved in
       different relations. TransR [20] proposes a separate space
       to embed each kind of relation, making it a complex and
       expensive model. TransD [13] simplifies TransR by
       decomposing the projection matrix into a product of two
       vectors. TranSparse [14] is another model that simplifies
       TransR, by enforcing sparseness on the projection matrix.
       Collectively, these models are referred to as TransX models.
       An important assumption in all of them is that an entity's
       embedding vector is the same whether the entity appears as a
       triple's head or its tail.

    2. Tensor factorization-based approaches: These approaches
       consider a knowledge graph as a tensor of order three, where
       the head and tail entities form the first two modes and the
       relations form the third mode. For each observed triple
       ⟨e_i, r_k, e_j⟩, the tensor entry X_ijk is set to 1, and to
       0 otherwise. The tensor is factorized by parameterizing the
       entities and relations as low-dimensional vectors. For each
       entry X_ijk = 1, the corresponding triple is reconstructed
       so that the value returned by the scoring function on the
       reconstructed triple is very close to X_ijk, i.e.,
       X_ijk ≈ f_φ(⟨e_i, r_k, e_j⟩). The model learns the entity
       and relation embeddings with the objective of minimizing the
       triple reconstruction error:

           L = Σ_{i,j,k} (X_ijk − f_φ(⟨e_i, r_k, e_j⟩))²

       where f_φ is the scoring function. RESCAL [23] is a powerful
       bilinear model, but it may suffer from overfitting due to
       its large number of parameters. DistMult [43] is a special
       case of RESCAL that fails to create embeddings for
       asymmetric relations. ComplEx [37] overcomes this drawback
       of DistMult by extending the embedding space to complex
       numbers. HolE [22] combines the expressive power of RESCAL
       with the efficiency and simplicity of DistMult. TuckER [2]
       is a fully expressive model based on the Tucker
       decomposition that enables multi-task learning by storing
       some learned knowledge in the core tensor, allowing
       parameter sharing across relations. TuckER subsumes several
       of the tensor factorization models preceding it as special
       cases, and its number of parameters grows only linearly as
       the embedding dimensions d_e and d_r increase.

    3. Artificial neural network-based models: Neural network-based
       models are expressive but more complex, as they require many
       parameters. They are not appropriate for medium-sized
       datasets, as they are likely to overfit the data, which
       leads to below-par performance [39].
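To ground the two families of scoring functions, here is a small self-contained sketch: a translation-based (TransE-style) distance score, a DistMult-style trilinear product, and the squared reconstruction loss. The two-dimensional embeddings are toy values chosen for illustration, not learned parameters:

```python
import math

def transe_score(h, r, t, norm=1):
    """Latent-distance score f(h, t) = -||h + r - t||: close to 0 for a
    correct triple (h + r ≈ t), strongly negative otherwise."""
    diff = [a + b - c for a, b, c in zip(h, r, t)]
    if norm == 1:
        return -sum(abs(d) for d in diff)
    return -math.sqrt(sum(d * d for d in diff))

def trilinear_score(h, r, t):
    """DistMult-style trilinear product sum_d h_d * r_d * t_d, the simplest
    tensor-factorization scoring function."""
    return sum(a * b * c for a, b, c in zip(h, r, t))

def reconstruction_loss(triples, labels, score):
    """L = sum over observed entries of (X_ijk - f(e_i, r_k, e_j))^2."""
    return sum((x - score(h, r, t)) ** 2
               for (h, r, t), x in zip(triples, labels))

# Toy embeddings in which h + r lands on the correct tail.
h, r = [0.2, 0.1], [0.3, 0.4]
t_good, t_bad = [0.5, 0.5], [0.9, -0.2]
assert transe_score(h, r, t_good) > transe_score(h, r, t_bad)
```

In a real system the embeddings would be trained by gradient descent against corrupted (negative) triples, but the ordering of scores shown here is exactly what the Hits@k and MRR evaluation ranks.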
7    Conclusion

In this research, we propose a framework to generate the first
malware threat intelligence knowledge graph, MalKG, which can
produce structured and contextual threat intelligence from
unstructured threat reports. We propose an architecture that takes
as input an unstructured text corpus, a malware threat ontology,
entity extraction models, and relation extraction models. It
creates vector embeddings of the triples that can extract latent
information from graphs and surface hidden information. The
framework outputs MalKG, which comprises 27,354 unique entities, 34
relations, and approximately 40,000 triples of the form
entity_head-relation-entity_tail. Since this is a relatively new
line of research, we demonstrate how a knowledge graph can
accurately predict missing information from the text corpus through
the entity prediction application. We motivate and empirically
explore different models for encoding triples and conclude that the
tensor factorization-based approach achieved the highest overall
performance on our dataset.

   Using MT40K as the baseline dataset, in future work we will
explore robust models for entity and relation extraction that can
generalize to disparate cybersecurity datasets, including blogs,
tweets, and multimedia. Another application is to use vector
embeddings to identify possible cybersecurity attacks from the
background knowledge encoded in MalKG. The current prediction model
can generate a list of top candidates for missing information; it
will be interesting to extend this capability to predicting threat
intelligence about nascent attack vectors and to use that
information to defend against attacks that are yet to propagate
widely. A fundamental issue that we want to address through this
research is inaccurate triples in the knowledge graph, which stem
from the limitations of entity and relation extraction models and
from inaccurate descriptions in the threat reports.

References

 [1] Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul,
     Stefan Schweter, and Roland Vollgraf. FLAIR: An easy-to-use
     framework for state-of-the-art NLP. In Proceedings of the 2019
     Conference of the North American Chapter of the Association
     for Computational Linguistics (Demonstrations), pages 54–59,
     2019.

 [2] Ivana Balažević, Carl Allen, and Timothy M. Hospedales.
     TuckER: Tensor factorization for knowledge graph completion.
     arXiv preprint arXiv:1901.09590, 2019.

 [3] Sean Barnum. Standardizing cyber threat intelligence
     information with the Structured Threat Information eXpression
     (STIX). MITRE Corporation, 11:1–22, 2012.

 [4] Florian Bauer and Martin Kaltenböck. Linked Open Data: The
     Essentials. Edition mono/monochrom, Vienna, 710, 2011.

 [5] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and
     Jamie Taylor. Freebase: A collaboratively created graph
     database for structuring human knowledge. In Proceedings of
     the 2008 ACM SIGMOD International Conference on Management of
     Data, pages 1247–1250, 2008.

 [6] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason
     Weston, and Oksana Yakhnenko. Translating embeddings for
     modeling multi-relational data. In Advances in Neural
     Information Processing Systems, pages 2787–2795, 2013.

 [7] Antoine Bordes, Jason Weston, Ronan Collobert, and Yoshua
     Bengio. Learning structured embeddings of knowledge bases. In
     Proceedings of the AAAI Conference on Artificial Intelligence,
     volume 25, 2011.

 [8] Jason P. C. Chiu and Eric Nichols. Named entity recognition
     with bidirectional LSTM-CNNs. Transactions of the Association
     for Computational Linguistics, 4:357–370, 2016.

 [9] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and
     Sebastian Riedel. Convolutional 2D knowledge graph embeddings.
     In Proceedings of the AAAI Conference on Artificial
     Intelligence, volume 32, 2018.

[10] Christiane Fellbaum. WordNet. In Theory and Applications of
     Ontology: Computer Applications, pages 231–243. Springer,
     2010.

[11] Shu Guo, Quan Wang, Bin Wang, Lihong Wang, and Li Guo.
     Semantically smooth knowledge graph embedding. In Proceedings
     of the 53rd Annual Meeting of the Association for
     Computational Linguistics and the 7th International Joint
     Conference on Natural Language Processing (Volume 1: Long
     Papers), pages 84–94, 2015.

[12] Sonal Gupta and Christopher D. Manning. Improved pattern
     learning for bootstrapped entity extraction. In Proceedings of
     the Eighteenth Conference on Computational Natural Language
     Learning, pages 98–108, 2014.

[13] Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao.
     Knowledge graph embedding via dynamic mapping matrix. In
     Proceedings of the 53rd Annual Meeting of the Association for
     Computational Linguistics and the 7th International Joint
     Conference on Natural Language Processing (Volume 1: Long
     Papers), pages 687–696, 2015.