Deliverable D3.1 Title: Machine Learning and Automation for Crime Prevention and Investigation (Initial Release) - PREVISION

Ref. Ares(2020)3160083 - 17/06/2020

 Funded by the Horizon 2020 Framework
 Programme of the European Union
 PREVISION - Grant Agreement 833115

 Deliverable D3.1
 Title: Machine Learning and Automation for Crime Prevention and Investigation (Initial Release)

 Dissemination Level: PU
 Nature of the Deliverable: R
 Date: 03/06/2020
 Distribution: WP3
 Editors: IOSB
 Reviewers: UPV, SPH
 Contributors: IOSB, ICCS, ETRA, ITTI, IfmPt, BPTI, CERTH, SIV,
 CNRS, PARCS, UM, CTL

Abstract: This document is the first in a series of two deliverables associated with work package 3 on semantic
reasoning, predictive policing, behavioural analysis, and high-level data fusion.
It is an initial draft on the topic; a refined version will be released as deliverable D3.2 “Machine
Learning and Automation for Crime Prevention and Investigation (Refined Release)”.

 * Dissemination Level: PU= Public, RE= Restricted to a group specified by the Consortium, PP= Restricted to other
 program participants (including the Commission services), CO= Confidential, only for
 members of the Consortium (including the Commission services)
 ** Nature of the Deliverable: P= Prototype, R= Report, S= Specification, T= Tool, O= Other
D3.1 Machine Learning and Automation for Crime Prevention and Investigation (Initial Release)

Disclaimer
This document contains material, which is copyright of certain PREVISION consortium parties and may not
be reproduced or copied without permission. The information contained in this document is the
proprietary confidential information of certain PREVISION consortium parties and may not be disclosed
except in accordance with the consortium agreement.
The commercial use of any information in this document may require a license from the proprietor of that
information.
Neither the PREVISION consortium as a whole, nor any certain party of the PREVISION consortium
warrants that the information contained in this document is capable of use, or that use of the information
is free from risk, and accepts no liability for loss or damage suffered by any person using the information.
The contents of this document are the sole responsibility of the PREVISION consortium and can in no way
be taken to reflect the views of the European Commission.

H2020-SU-FCT03-2018-833115 PREVISION Project Page 2 of 174

Revision History

 Date | Rev. | Description | Partner

 16/03/2020 | 0.1 | Document template | IOSB
 31/03/2020 | 0.2 | Classification, Regression algorithms | SIV
 16/04/2020 | 0.3 | Behavioural Analysis and Anomaly Detection tool for videos | CERTH
 17/04/2020 | 0.4 | Jargon detection | CNRS-IRIT
 17/04/2020 | 0.5 | Smart Fusion and Incomplete Data Handling | CNRS, ETRA, ITTI, PARCS
 17/04/2020 | 0.6 | Estimating Information Check-Worthiness | CNRS-IRIT
 17/04/2020 | 0.7 | Behavioural Analysis and Anomaly Detection tool for text analysis and sentiment and radicalization detection | CNRS, BPTI
 23/04/2020 | 0.8 | Behavioural Analysis and Anomaly detection tool for telecom and financial data | ICCS
 24/04/2020 | 0.9 | Cyber-attack detection tool | UM
 28/04/2020 | 0.10 | Further extensions of the PREVISION Ontology | CTL
 29/04/2020 | 0.11 | Predictive Analytics and Trend Analysis | IfmPt
 29/04/2020 | 0.12 | Semantic Information Processing and AI-based Evidence Discovery | IOSB
 08/05/2020 | 0.13 | Harmonization of document structure and content | IOSB
 13/05/2020 | 1.0 | Abstract, Executive Summary, Introduction, Summary and Conclusions | IOSB
 03/06/2020 | 1.1 | Implemented modification proposal by reviewers | IOSB, CNRS, ETRA, PARCS
 10/06/2020 | 1.2 | Security assessment | ROSPP

List of Authors

 Partner Author

 IOSB Ernst Josef Behmer, Christian Ellmauer, Dirk Pallmer, Uwe Zeltmann
 CNRS Ngoc Hoang, Josiane Mothe, Faneva Ramiandrisoa, Olivier Teste, Md Zia Ullah
 IfmPt Thomas Schweer, Günter Okon
 ICCS Konstantinos Demestichas, Evgenia Adamopoulou, Konstantina Remoundou, Nikos Peppes, Thodoris Alexakis, Ioannis Loumiotis
 ETRA Antonio Moreno Borrás, Luisa Pérez Devesa
 ITTI Damian Puchalski, Michał Choraś, Marek Pawlicki, Paweł Kochański, Piotrowski Rafał, Rafał Kozik
 BPTI Justina Mandravickaitė, Tomas Krilavičius
 CERTH Konstantinos Gkountakos
 SIV Iacob Crucianu
 PARCS Axel Kerep, Patrice Le Loarer, Jean-Baptiste Choteau
 UM Misha Glazunov, Apostolis Zarras
 CTL Panagiotis Mitzas, Konstantinos Avgerinakis

Table of Contents
Revision History ..... 3
List of Authors ..... 4
Table of Contents ..... 5
Index of figures ..... 9
Index of tables ..... 11
Glossary ..... 12
Executive Summary ..... 13
1 Introduction ..... 15
 1.1 Motivation ..... 15
 1.2 Intended Audience ..... 15
 1.3 Relation to Other Deliverables ..... 16
 1.4 Technical Requirements ..... 16
 1.5 Deliverable Structure ..... 16
2 Semantic Information Processing ..... 17
 2.1 Ontology Modelling ..... 17
  2.1.1 Ontology Engineering ..... 17
  2.1.2 Ontology Editor ..... 19
 2.2 Rule-based Reasoning Tools ..... 21
  2.2.1 General Aspects ..... 21
  2.2.2 Probabilistic Reasoning Based on Markov Logic Networks ..... 22
 2.3 Jargon detection ..... 31
  2.3.1 Introduction ..... 31
  2.3.2 Related work ..... 32
  2.3.3 Proposed method ..... 33
  2.3.4 Evaluation framework ..... 34
  2.3.5 Future work ..... 34
3 Smart Fusion and Incomplete Data Handling ..... 35
 3.1 Introduction ..... 35
  3.1.1 Heterogeneities in PREVISION environment ..... 35
  3.1.2 Missing or incomplete data ..... 36
 3.2 Fill in the people interaction gaps based on common jargon and content – a mixed community detection approach ..... 37
  3.2.1 Introduction ..... 37
  3.2.2 Related work ..... 37
  3.2.3 Proposed method ..... 39
  3.2.4 Evaluation framework ..... 42
  3.2.5 Future work ..... 44
 3.3 Smart Fusion ..... 44
  3.3.1 Introduction ..... 44
  3.3.2 Related work ..... 45
  3.3.3 Proposed methods ..... 45
  3.3.4 Evaluation framework ..... 47
  3.3.5 Future work ..... 48
 3.4 Smart Browser for Art Search ..... 48
  3.4.1 Introduction ..... 48
  3.4.2 Related work ..... 49
  3.4.3 Proposed methods ..... 50
  3.4.4 Evaluation framework ..... 54
  3.4.5 Future work ..... 57
 3.5 Missing data visual analytics and Cyber-Defensive Data Fusion ..... 57
  3.5.1 Introduction ..... 57
  3.5.2 Related work ..... 58
  3.5.3 Proposed methods ..... 62
  3.5.4 Evaluation framework ..... 62
4 AI-based Evidence Discovery ..... 63
 4.1 Information Flux ..... 63
  4.1.1 Data Sources ..... 63
  4.1.2 Source Data Access for Semantic Reasoning Modules ..... 63
 4.2 PREVISION Ontology ..... 65
  4.2.1 Introduction ..... 65
  4.2.2 Structure of the PREVISION Ontology ..... 66
  4.2.3 The modified MAGNETO Ontology ..... 66
  4.2.4 The Unified Cyber Ontology (UCO) ..... 71
  4.2.5 Further extensions of the PREVISION Ontology ..... 73
  4.2.6 Integration of UCO and further extensions of the PREVISION Ontology ..... 77
 4.3 Application of Semantic Reasoning to LEA Use Cases ..... 78
 4.4 Classification of Datasets Based on Machine Learning ..... 82
  4.4.1 Overview ..... 83
  4.4.2 Decision Trees ..... 83
  4.4.3 Workflow ..... 85
  4.4.4 Application in PREVISION ..... 86
  4.4.5 Data Collection and Feature Selection ..... 87
  4.4.6 Implementation ..... 88
  4.4.7 Evaluation ..... 90
  4.4.8 Random Forest Classifiers ..... 92
 4.5 Information Processing ..... 94
  4.5.1 Overview ..... 94
  4.5.2 Data acquisition. ..... 94
  4.5.3 Preprocessing ..... 94
  4.5.4 Features extraction ..... 96
  4.5.5 Data Classification [122] ..... 98
  4.5.6 Development Approach [121] ..... 102
5 Predictive Analytics and Trend Analysis ..... 109
 5.1 Predictive Analytics ..... 109
  5.1.1 Introduction ..... 109
  5.1.2 Related work ..... 109
  5.1.3 Proposed method ..... 110
  5.1.4 Evaluation framework ..... 111
  5.1.5 Future work ..... 111
 5.2 Trend Analysis ..... 112
  5.2.1 Introduction ..... 112
  5.2.2 Related work ..... 112
  5.2.3 Proposed method ..... 115
  5.2.4 Evaluation framework ..... 115
  5.2.5 Future work ..... 115
 5.3 Regression Trees and Boosting Algorithms ..... 115
  5.3.1 Overview ..... 115
  5.3.2 Regression Trees ..... 116
  5.3.3 Gradient Boosted Regression ..... 120
 5.4 Estimating Information Check-Worthiness ..... 124
  5.4.1 Introduction ..... 124
  5.4.2 Related Work ..... 124
  5.4.3 A variety of features to learn check-worthiness ..... 125
  5.4.4 Evaluation ..... 127
6 Multivariate Behavioural Analysis and Anomaly Detection ..... 129
 6.1 Behavioural Analysis and Anomaly Detection tool for text analysis and sentiment and radicalization detection ..... 129
  6.1.1 Aggression Identification in Posts - two machine learning approaches ..... 129
  6.1.2 Visual Analytics for trends and dynamics of radicalized content ..... 135
 6.2 Behavioural Analysis and Anomaly Detection tool for videos ..... 140
  6.2.1 Introduction ..... 140
  6.2.2 Related work ..... 142
  6.2.3 Datasets ..... 142
 6.3 Behavioural Analysis and Anomaly detection tool for telecom and financial data ..... 144
  6.3.1 Financial data ..... 144
  6.3.2 State-of-the-art ..... 144
  6.3.3 Telecommunication data ..... 148
 6.4 Cyber-attack detection tool ..... 154
  6.4.1 Adversarial Training ..... 154
  6.4.2 Generative Modeling ..... 154
  6.4.3 Methodology ..... 155
  6.4.4 Evaluation ..... 156
7 Summary and Conclusions ..... 157
8 References ..... 158

Index of figures
Figure 1. PREVISION Platform Architecture as shown in the Grant Agreement ..... 15
Figure 2. Example of a query in Protégé. ..... 20
Figure 3. Example of a visualisation with OntoGraf ..... 20
Figure 4. Markov network corresponding to the rules initialized with constants { , }. Source: [9]. ..... 25
Figure 5. Workflow of the MLN reasoning. ..... 29
Figure 6. Annotations for Knowledge generated by reasoning / data properties of RelationDescription. ..... 31
Figure 7. The schema of the ‘Fill-in the gap’ component. ..... 40
Figure 8. The schema of the ‘Smart similarity measure’ component. ..... 41
Figure 9. Smart fusion process ..... 47
Figure 10. Scheme of the sequences and modules. ..... 51
Figure 11. Smart browser processes. ..... 53
Figure 12. Typology against vendor’s documentation ..... 54
Figure 13. Main concepts (cf. [107]). ..... 67
Figure 14. Relations with the domain “Event” or a subclass of “Event” (first level). ..... 70
Figure 15. Layers of representing cyber-investigation information. ..... 72
Figure 16. Representation of an advertisement on a Dark Web marketplace. ..... 74
Figure 17. Representation of a vendor profile on a Dark Web marketplace. ..... 74
Figure 18. Representation of a product review on a Dark Web marketplace. ..... 75
Figure 19. Aggregated RDF graph presenting a drug advertisement, vendor details and a customer review. ..... 76
Figure 20. RDF graph modelling an online forum instantiation ..... 77
Figure 21. Basic structure and terminology of a Decision Tree. ..... 84
Figure 22. Decision Tree for classifying animals [114]. ..... 84
Figure 23. Workflow, when developing a predictor. ..... 85
Figure 24. Example Decision Tree for detecting suspicious bank transfers. ..... 87
Figure 25. Imbalanced dataset vs. balanced dataset. ..... 87
Figure 26. Implementation workflow of the PREVISION decision tree classifier ..... 90
Figure 27. Decision Tree of the FDR dataset, when the split rule based on an entropy measure is applied. ..... 92
Figure 28. Decision Tree of the FDR dataset, when the split rule based on the GINI measure is applied. ..... 92
Figure 29. Principle of classification with random forest models ..... 93
Figure 30. Information processing. ..... 97
Figure 31. Data Distribution. ..... 106
Figure 32. Decision Tree and Random Forests. ..... 107
Figure 33. ML deployment architecture. ..... 108
Figure 34. Illustration of a CNN+LSTM architecture of aggression detection inspired from [172]. ..... 133
Figure 35. Word co-occurrence network: 1st stage of Ukrainian conflict (alpha=0.5). ..... 136
Figure 36. Sentiment-based Narrative Trajectory: 1st stage of Ukrainian conflict. ..... 137
Figure 37. Topics by expected proportions in the Lithuanian data. ..... 138
Figure 38. Effect of media Type (Mainstream vs Unconventional (in radical sense)) on Topic Prevalence in Lithuanian dataset. ..... 139
Figure 39. Topic (Correlation) Network of Lithuanian media (mainstream & unconventional). Explanation: Blue – more typical to unconventional media; Red – more typical to mainstream media; Black – topics that differ most between delfi.lt (mainstream) and alternative/unconventional (sarmatas.lt and netiesa.lt) news sources. ..... 140
Figure 40. Abnormality detection: A methods categorization by data nature (left) and training objective (right). ..... 141
Figure 41. Indicative dataset samples: UCSD (left), CUHK Avenue (center), UMN (right). ..... 143
Figure 42. Indicative dataset samples: Train, Belleview and Subway exit (left), U-turn (center), LV (right). ..... 144
Figure 43. Amount of the transaction in the financial data records with respect to time. ..... 147
Figure 44. Outliers identified in the dataset and plotted in the graph of the amount of the transaction in the financial data records with respect to time ..... 147
Figure 45. Duration of calls with respect to the cell phone that initiated the call. ..... 151
Figure 46. Duration of calls with respect to the date. ..... 152
Figure 47. Outliers identified in the dataset and plotted in a graph of the duration of the calls with respect to the cell phone that initiated the call. ..... 152
Figure 48. Outliers identified in the dataset and plotted in a graph of the duration of the calls with respect to the date. ..... 153

Index of tables
Table 1. Sm, Ca and Fr abbreviate the predicates Smokes, Cancer and Friends, respectively. Source: [6]
Table 2. Arithmetic and boolean functions (Doan, Niu, Ré, Shavlik, & Zhang, 2011)
Table 3. String functions (Doan, Niu, Ré, Shavlik, & Zhang, 2011)
Table 4. Objectives of the smart fusion
Table 5. Data fusion methods that will be developed in PREVISION
Table 6. Referent typologies and databases
Table 7. Classification schemas for describing the credibility and reliability
Table 8. Selected relations with domain “Event”
Table 9. Forensic processes with different phases can be represented as an Action Lifecycle (from [109])
Table 10. The Action Lifecycle construct can be used to represent types of offender activities (from [109])
Table 11. CT-CWC-18 collection: number of sentences (#Sent.) and the number of check-worthiness (#CW) on the training and test sets
Table 12. List of features used in FR to represent texts (Facebook comments and tweets)
Table 13. Distribution of training, validation and testing data on TRAC 2018 data collection
Table 14. Result for the English (Facebook and Twitter) task. Bold values are the best performances for our approaches
Table 15. Comparison of Outlier Detection Methodologies
Table 16. Interface description of the tool that identifies outliers in the financial data
Table 17. Statistics of the CDR dataset
Table 18. Interface description of the tool that identifies outliers in the telecommunication data

Glossary

 LEA Law Enforcement Agency
 RTF Result Transferability Framework
 WP Work Package
 OC-SVM One Class Support Vector Machine
 OC-NN One-Class Neural Networks
 DAE De-noising Autoencoders
 RBM Restricted Boltzmann Machine
 LSTM Long Short-Term Memory
 DAML DARPA Agent Markup Language
 DARPA Defense Advanced Research Projects Agency
 DL Description logic
 OIL Ontology Inference Layer
 OWL Web Ontology Language
 RDF Resource Description Framework
 RDFS Resource Description Framework Schema
 UCO Unified Cyber Ontology
 URI Uniform Resource Identifier

Executive Summary
Section 2 describes the basic framework on which the PREVISION semantic reasoning components build.
The core of the system is an ontology comprising knowledge relevant for investigating criminal cases.
It ranges from simple facts of everyday life to crime-domain-specific knowledge. Section 2.1
gives an insight into technical aspects as well as abstract concepts of ontology modelling. PREVISION-specific
aspects of the ontology are deferred to section 4.2.

Facts contained in the ontology can be queried using SPARQL, and further facts can be derived via logical
inference. The basics of logical reasoning modules in PREVISION are explained in section 2.2. An emphasis
is put on probabilistic reasoning deploying Markov Logic Networks in section 2.2.2.

Some of the facts stored in the ontology originate from text files. As a basis for the analysis of textual
content, PREVISION is endowed with a jargon detection tool. It is based on word embeddings and
described in section 2.3.

In practice, available data is often incomplete. Tools are therefore needed that reconstruct
missing data within one data source from context information or by making use of information gained
from other data sources. System components with this purpose are described in section 3.

Section 3.2 introduces a system component, which is able to add missing connections between people in
social networks by detecting similarities in the content of messages they send or receive.

Section 3.3 deals with the fusion of incomplete data collected from various data sources.

Section 3.4 provides a system component supporting the detection of illicit traffic of cultural artefacts by
associating visual and textual information from various heterogeneous databases to render a coherent
picture of any case in question.

Section 3.5 approaches the completion of missing data by visualizing gaps of information as well as
connections between data values in general, while putting an emphasis on missing data in the domain of
cyber defence.

Section 4 aims at the logical combination of the functionality and result values provided by the entirety
of PREVISION system components, in order to infer crime-relevant information that emerges from a
combination of facts originating from various data sources.

To this end, section 4.1 constitutes a first step towards monitoring and standardizing the information flow
from individual system components to the PREVISION ontology, whose crime-specific concepts
are described in section 4.2. A possible technical implementation of one of the PREVISION use cases is
illustrated in section 4.3.

Sections 4.4 and 4.5 describe various techniques for data preprocessing and classification, putting an
emphasis on decision trees and random forests.

Predictive policing is the subject of section 5.

After an introduction of the topic in section 5.1, section 5.2 explains the methodology of recognizing
trends in natural language.

Regression trees and gradient boosted regression algorithms are studied in section 5.3.

The final part of section 5, section 5.4, provides a method for estimating the relevance of automatically
generated alarm signals.

Section 6 introduces system modules, which analyse behavioural changes and detect anomalies in data
sets.

Section 6.1.1 suggests two methods for detecting aggressive content in textual communication using
random forests and linear regression as well as CNN and LSTM deep learning techniques.

Section 6.1.2 visually analyses the sentiment of textual sources as well as the frequency of contained
topics and the connections between them.

Section 6.2 deals with anomaly detection by classification of image and video data.

Section 6.3 documents the design and functionality of two system modules for outlier detection in
financial and telecommunication data based on clustering techniques.

Section 6.4 introduces methods for making deep neural networks robust against adversarial attacks.

1 Introduction
1.1 Motivation
Figure 1, taken from the Grant Agreement, depicts the overall functionality of the PREVISION
platform. In terms of this diagram, work package 3, of which deliverable 3.1 forms the first
part, covers the design of the “Cognitive Services Layer” drawn in green.

 Figure 1. PREVISION Platform Architecture as shown in the Grant Agreement.

1.2 Intended Audience
This deliverable describes the functionality and current state of the PREVISION semantic reasoning
components. In this respect, it is valuable for technical partners as well as for LEA partners of the project.

The technical partners developing the stream processing tools in work package 2, which corresponds to
the red layer in Figure 1, will have an interest in providing component interfaces or algorithm return
values that are amenable to the semantic reasoning components developed in this work
package. Partners involved in work package 4, which corresponds to the blue layer in
Figure 1, can learn from this document which machine learning and machine-based logical
inference modules are at their disposal for building situation-specific applications.

LEA officers in turn get a glimpse of how the psychological and predictive policing models developed in
work package 1 are technically implemented here, which might further improve communication and
collaboration between technical and non-technical project partners. Moreover, LEAs might be interested
in the state of the art in semantic reasoning, which can be of advantage in the refinement of use case
specifications and end user needs elaborated in work package 1.

1.3 Relation to Other Deliverables
The deliverable at hand is based on the technical specification given in deliverable 5.1 “Initial PREVISION
Architecture” which aims at meeting the end user needs compiled in deliverable 1.1 “End User Needs and
Use Cases (Initial Release)”. Further technical details relevant for work package 3 are drawn from
deliverable 2.1 “Heterogeneous Data Streams Processing Tools (Initial Release)”. Finally, the tools
developed here can build on the predictive models elaborated in deliverable 1.2 “Predictive Policing –
Psycho-sociological Models”.

The present document is the first of a total of two deliverables of work package 3. It will be refined in
deliverable 3.2 “Machine Learning and Automation for Crime Prevention and Investigation (Refined
Release)”.

1.4 Technical Requirements
Deliverable 5.1 compiles 23 technical requirements arising from the user requirements defined in
deliverable 1.1. Among these, we quote below the ones that the PREVISION cognitive reasoning
system modules mainly need to conform to.

TR4. User profiles and access levels should be applicable to both system modules and use cases as well
 as data.
TR9. Data anonymization/pseudonymization techniques should be utilized, contributing to the data
 privacy principles followed in the project.
TR15. The multimedia analysis should support content matching among different files.
TR17. An authorised human operator has to approve/modify the recommendations performed by
 PREVISION before their transmission to another module within the system.
TR18. PREVISION should facilitate the utilization of multiple data sources for processing and information
 extraction.
TR19. The system shall provide an efficient search interface for content discovery.
TR20. The system’s user interface shall facilitate the operators to perform the criminal investigations of
 the PREVISION use cases. The user interface shall enable the users to perform the data collection,
 analysis and information discovery activities in the context of criminal investigations.
TR21. The user interface shall offer an intelligent and customizable notification mechanism to alert
 users.
TR22. The analytics modules shall be able to handle streaming data.

These requirements may, among others, serve as a guideline for all contributions to work package 3
that are affected by them.

1.5 Deliverable Structure
For every positive integer n smaller than 6, section n+1 of the deliverable at hand contains the results of
Task 3.n defined in the PREVISION Grant Agreement. Finally, section 7 provides a short summary and
further conclusions.

2 Semantic Information Processing
Semantic technologies provide a computable framework for systems to deal with knowledge in a
formalized manner. In the paradigm of semantic technologies, the metadata that represent data objects
are expressed in a manner in which their deeper meaning and interrelations with other concepts are made
explicit, by means of an ontology and the models generated by the machine learning algorithms.

This approach gives the underlying computing systems the capability not only to extract the
values associated with the data but also to relate pieces of data to one another, based on the details of
their inner relationships. New information can thus be extracted by means of reasoning processes. The
semantic information model, which is based on the PREVISION ontology, therefore allows navigation
through the data and discovery of correlations not initially foreseen, thus broadening the spectrum of
knowledge capabilities for the LEAs. The semantic tools developed within this task are:

 • Knowledge modeling toolkit for the semantic representation of the PREVISION ontology
 • Probabilistic reasoning based on Markov Logic Networks
 • Logical reasoning
 • Jargon detection
 • ……

2.1 Ontology Modelling
2.1.1 Ontology Engineering
An ontology consists of concepts, a hierarchical (is-a) organization of them, relations among them (in
addition to is-a and part-of), and axioms to formalize the definitions and relations. Ontologies are typically
specified using languages that allow some abstraction and expression of semantics, such as first-order
logic languages. The Web Ontology Language in its Description Logic variant (OWL DL for short) is the
best-known representative of an ontology language. OWL is a language for making ontological statements,
developed as a follow-on from RDF and RDFS, as well as earlier ontology language projects including OIL,
DAML and DAML+OIL. OWL is intended to be used over the World Wide Web, and all its elements (classes,
properties and individuals) are defined as RDF resources and identified by URIs.

Common components of ontologies include (cf. [1]):

 • Individuals
 Instances or objects (the basic or "ground level" objects).

 • Classes
 Sets, collections, concepts, types of objects, or kinds of things.

 • Attributes
 Aspects, properties, features, characteristics, or parameters that objects (and classes) can
 have.

 • Relations
 Ways in which classes and individuals can be related to one another.

 • Restrictions
 Formally stated descriptions of what must be true in order for some assertion to be
 accepted as input.

 • Rules
 Statements in the form of an if-then (antecedent-consequent) sentence that describe the
 logical inferences that can be drawn from an assertion in a particular form.

 • Axioms
 Assertions (including rules) in a logical form that together comprise the overall theory
 that the ontology describes in its domain of application.

Ontology engineering has to deal with meta questions such as the following:

 • Intrinsic property
 An intrinsic property of a thing is a property that is essential to the thing: the thing loses its
 identity when the property changes.

 • The ontological definition of a class
 A thing which is a conceptualization of a set X can be a class if and only if each element x of X
 belongs to the class X if and only if the “intrinsic property” of x satisfies the condition of X.

 • is-a relation
 This relation holds only between classes. <A is-a B> holds if and only if the instance
 set of A is a subset of the instance set of B. The skeleton of an ontology is an is-a hierarchy. Its
 appropriateness is critical to the success of ontology building.

 • <physical object is-a amount of matter>
 It is true that any physical object is made from matter. However, the identity criteria of the two are
 different. Matter of some amount is still the same matter after it is divided in half, while most
 physical objects lose their identity by such a division. Therefore, a physical object cannot be a
 subclass of amount of matter, since this contradicts the principle of identity criterion
 preservation in an is-a hierarchy.

 • <association is-a group>
 The reason is the same as above. An association can still be the same after some members
 are changed, while a group cannot, since the identity of a group (of persons) is totally based on its
 members.

 • <human is-a physical object> and <human is-a living thing>
 This is an example of multiple inheritance, which causes trouble when considering instance
 extinction. A human loses his/her existence, and hence cannot be an instance of human,
 when he/she dies, while his/her body still exists as a physical object. The system needs a special
 function for managing the extinction of an instance in such a case, together with a mechanism for
 representing the necessary information on identity inheritance. To avoid this difficulty, the ontology
 is required to have a class human body under physical object, separate from the class human, which
 has human body as its part and is placed under living thing.
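
The subset reading of the is-a relation given above can be illustrated with a minimal Python sketch. The class names and instance sets are hypothetical toy data, not part of the PREVISION ontology, and the check is purely extensional: it tests the subset condition only and does not capture the identity criteria discussed in the later bullets.

```python
# Hypothetical toy instance sets (illustrative only, not PREVISION data).
instances = {
    "Vehicle": {"car_1", "car_2", "bike_1"},
    "Car": {"car_1", "car_2"},
    "Person": {"karl", "dieter"},
}


def is_a(sub: str, sup: str) -> bool:
    """<sub is-a sup> holds iff the instance set of sub is a subset of that of sup."""
    return instances[sub] <= instances[sup]


print(is_a("Car", "Vehicle"))     # every car is a vehicle
print(is_a("Vehicle", "Car"))     # bike_1 is a vehicle but not a car
print(is_a("Person", "Vehicle"))  # disjoint instance sets
```

A subset test of this kind is necessary but not sufficient for a sound is-a link: as the physical object and amount of matter example shows, the identity criteria of the two classes must also be compatible.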

2.1.2 Ontology Editor
The PREVISION Ontology is developed using Protégé (cf. [2]), a very common ontology editor
developed at Stanford University. Protégé is a free, open source ontology editor and knowledge
management system. The tool provides a graphical user interface to define
ontologies. It also includes deductive classifiers to validate that models are consistent and to infer new
information based on the analysis of an ontology. Protégé is a framework for which various other projects
provide plugins.

Some of the main features in Protégé Desktop, versions 4, 5 and higher are (cf. [3]):

• Modularization
• Intelligent use of local/global repositories to handle import dependencies
• Loading of multiple ontologies into a single workspace
• Switching between ontologies dynamically
• UI hints for showing in which ontology statements are made
• Refactoring: merging ontologies and removal of redundant imports
• Refactoring: moving axioms between ontologies
• Refactoring Tools
• Renaming (including multiple entities)
• Handling disjoints/different
• Quick defined class creation
• Various transforms on restrictions (including covering)
• Conversion of IDs to labels
• Moving axioms between ontologies
• Reasoning Support
• Inferred axioms show up in most standard views
• DL Query tab for testing arbitrary class expressions
• Direct interface to FaCT++ reasoner
• Direct interface to Pellet reasoner
• Reasoners are plug-ins

A built-in Protégé plugin named DL Query (Description Logic Query) makes it possible, for example, to
query relationships between classes or between instances. Figure 2 shows an example of a query, which
returns the name of the brother of a person named Karl. In this example, Karl and Dieter are instances of
the concept Person. isBrotherOf is an object property with the domain Person and with the range Person.

 Figure 2. Example of a query in Protégé.

Model visualisation is enabled in Protégé by several plugins, e.g., OntoGraf [4], OWLViz [5], OntoViz [6],
or ProtégéVOWL [7]. Figure 3 shows an example of the visualisation of a small subtree of the PREVISION
taxonomy with OntoGraf. OntoGraf gives support for interactively navigating the relationships of OWL
ontologies. Different relationships are supported: subclass, individual, domain/range object properties,
and equivalence. The example illustrates that the classes "Running", "Cycling" and "Swimming" are
subclasses of the class "Movement".

 Figure 3. Example of a visualisation with OntoGraf.

2.2 Rule-based Reasoning Tools
2.2.1 General Aspects

2.2.1.1 Reasoning
Reasoning is a procedure that adds rich semantics to data, enabling the system to
automatically gather and process new information at a deeper level. Specifically, by logical reasoning
PREVISION is able to uncover derived facts that are not expressed explicitly in the knowledge base, as well
as to discover new knowledge of relations between different objects and items of data.

A reasoner is a piece of software that is capable of inferring logical consequences from stated facts in
accordance with the ontology’s axioms, and of determining whether those axioms are complete and
consistent. Reasoning is part of the PREVISION system and it is able to infer new knowledge from existing
facts available in the PREVISION knowledge base. In this way, the inputs of the reasoning system are data
collected from all the entities of the PREVISION environment, while the output from the reasoner will
assist crime analysis and investigation capabilities. Two types of reasoning are addressed in PREVISION:
logical reasoning and probabilistic reasoning. They are described in the following sections.

2.2.1.2 Rules
In order for a reasoner to infer new axioms from the ontology’s asserted axioms a set of rules should be
provided to the reasoner.

Rules are of the form of an implication between an antecedent (body) and a consequent (head). The
intended meaning can be read as: whenever the conditions specified in the antecedent hold, then the
conditions specified in the consequent must also hold, i.e.:

 antecedent ⇒ consequent

The antecedent is the precondition that has to be fulfilled, so that the rule will be applied, while the
consequent is the result of the rule that will be true in this case.

Both the antecedent and the consequent consist of zero or more atoms or predicates. The antecedent is a
single predicate or a conjunction of predicates, separated by the character ^.

An atom or predicate is of the form C(x), P(x,y) where C is an OWL class description (concept) or data
range, P is an OWL property or relation, x and y are either variables, instances or literals, as appropriate.

An empty antecedent is treated as trivially true (i.e. satisfied by every interpretation), so the consequent
must also be satisfied by every interpretation; an empty consequent is treated as trivially false (i.e., not
satisfied by any interpretation), so the antecedent must also not be satisfied by any interpretation.
Multiple atoms are treated as a conjunction. Note that rules with conjunctive consequents could easily be
transformed (via the Lloyd-Topor transformations [8]) into multiple rules each with an atomic consequent.

An example of an antecedent is “isChildOf(s1, p) ∧ isChildOf(s2, p)”. A conjunction of terms means that
the two terms (called literals) are connected with a logical “AND”; that is, the antecedent is
fulfilled if both predicates are true. The logical AND is represented by the comma character (“,”) or the
character “^”.

The consequent is usually a single predicate or a disjunction of predicates. In this example the consequent
could be “isSiblingOf(s1, s2)”. Therefore, the rule which expresses that children of the same parent are
siblings is written as:

isChildOf(s1, p) ∧ isChildOf(s2, p) ⇒ isSiblingOf(s1, s2)

If the evidence in the ontology (the CRM) is

isChildOf(Benny, Jacob) and isChildOf(Joseph, Jacob)

then the result of applying the rule is “isSiblingOf(Benny, Joseph)”.
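
How a reasoner applies the sibling rule to this evidence can be sketched in a few lines of Python. This is a minimal illustration and not the PREVISION reasoner: the function name derive_siblings and the tuple encoding of facts are assumptions made for this example. The sketch additionally assumes that s1 and s2 denote different individuals, which the rule as written does not enforce.

```python
from itertools import permutations

# Evidence, encoded as (child, parent) pairs for the predicate isChildOf.
is_child_of = {("Benny", "Jacob"), ("Joseph", "Jacob")}


def derive_siblings(child_facts):
    """Apply isChildOf(s1, p) AND isChildOf(s2, p) => isSiblingOf(s1, s2).

    Every ordered pair of distinct facts is tried as a binding for the two
    antecedent atoms; the consequent is emitted when both share the parent p.
    """
    derived = set()
    for (s1, p1), (s2, p2) in permutations(child_facts, 2):
        if p1 == p2:
            derived.add((s1, s2))
    return derived


for s1, s2 in sorted(derive_siblings(is_child_of)):
    print(f"isSiblingOf({s1}, {s2})")
```

Because the antecedent is symmetric in s1 and s2, the sketch derives isSiblingOf(Joseph, Benny) in addition to isSiblingOf(Benny, Joseph).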

Some of the rules are predefined by the PREVISION ontology definition. The following rules are generated
automatically:

 • Taxonomy related rules on classes: If the concept “Car” is subclass of the concept “Vehicle”, the
 rule “Car(x) => Vehicle(x)” is generated.
 • Taxonomy related rules on properties: If the property “isSonOf” is a sub-property of “isChildOf”,
 then the rule “isSonOf(s,p) => isChildOf(s,p)” is generated
 • Domain and Range related rules: Relations often have a single concept class for the domain or the
 range defined. The domain defines the concept that the relation arrow starts from, the range
 defines the concept that the arrow points to. For example, the relation “involvesPerson” has the
 domain “Event” and the range “Person”. Therefore, it connects an event with a person that is
 involved in this event. From the definition of the relation “involvesPerson”, two rules result:
 involvesPerson(e,p) => Event(e)
 involvesPerson(e,p) => Person(p)
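
The three kinds of automatically generated rules listed above can be sketched as follows. The function generate_rules and its input format are hypothetical and serve only to illustrate the mapping from ontology definitions to rule strings; the actual PREVISION generator may work differently.

```python
def generate_rules(subclasses, subproperties, domains_ranges):
    """Derive implication rules from taxonomy and domain/range definitions.

    subclasses:      list of (subclass, superclass) pairs
    subproperties:   list of (sub-property, super-property) pairs
    domains_ranges:  dict mapping a property to its (domain, range) classes
    """
    rules = []
    # Taxonomy-related rules on classes.
    for sub, sup in subclasses:
        rules.append(f"{sub}(x) => {sup}(x)")
    # Taxonomy-related rules on properties.
    for sub, sup in subproperties:
        rules.append(f"{sub}(s,p) => {sup}(s,p)")
    # Domain- and range-related rules: one rule for each end of the relation.
    for prop, (domain, rng) in domains_ranges.items():
        rules.append(f"{prop}(e,p) => {domain}(e)")
        rules.append(f"{prop}(e,p) => {rng}(p)")
    return rules


rules = generate_rules(
    subclasses=[("Car", "Vehicle")],
    subproperties=[("isSonOf", "isChildOf")],
    domains_ranges={"involvesPerson": ("Event", "Person")},
)
for rule in rules:
    print(rule)
```

Run on the examples from the list, the sketch reproduces the four rules given in the text.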

2.2.2 Probabilistic Reasoning Based on Markov Logic Networks
This module provides a semantic reasoning technique, which aims at the enrichment of existing
information, as well as the discovery of new knowledge and relations between different objects and items
of data. The technique employed is Markov Logic Networks (MLNs), which allow probabilistic
reasoning by combining a probabilistic graphical model with first-order logic.

2.2.2.1 Markov Logic Networks
Markov Logic Networks (MLNs) were introduced in 2006 by M. Richardson and P. Domingos, see [9]. Since
then they have been an active area of research and have been widely applied in different scenarios, e.g.
ontology matching, statistical learning and probabilistic inference (see [10]).

In the following paragraph, the basic theory of MLNs is presented, along with a simple example, which
illustrates their utilization as a reasoning tool. Finally, a framework which implements the theoretical
concepts is defined.
