The Alan Turing Institute Internship Programme 2018

Contents

Project 1 – An interdisciplinary approach to programming by example
Project 2 – Algorithms for automatic detection of new word meanings from social media to understand language and social dynamics
Project 3 – High performance, large-scale regression
Project 4 – Design, analysis and applications of efficient algorithms for graph based modelling
Project 5 – Privacy-aware neural network classification & training
Project 6 – Clustering signed networks and time series data
Project 7 – Uncovering hidden cooperation in democratic institutions
Project 8 – Deep learning for object tracking over occlusion
Project 9 – Listening to the crowd: Data science to understand the British Museum visitors

Project 1 - An interdisciplinary approach to programming
by example
Project Goal

To compare approaches to the versatile idea of ‘programming by example’, which has
relevance in various different fields and contexts, and design interdisciplinary new
techniques.

Project Supervisors

Adria Gascon (Research Fellow, The Alan Turing Institute, University of Edinburgh)

Nathanaël Fijalkow (Research Fellow, The Alan Turing Institute, University of Warwick)

Brooks Paige (Research Fellow, The Alan Turing Institute, University of Cambridge)

Project Description

Programming by example is a very natural and simple approach to programming: instead of
writing a program, give the computer a desired set of inputs and outputs, and hope that the
program will write itself from these examples. In general, nothing prevents the computer
from relying on training data, initiating an interactive dialogue with the user to resolve
uncertainties, or even relying on the Internet, e.g. StackOverflow, to produce a solution that
realises the user’s intent.

A typical application is an Excel sheet: you write 2, 4, 6 and click on “continue”, hoping that
the computer will output 8, 10, 12... Another application is robotics, where programming by
example is often called programming by demonstration. The goal there is to teach robots
complicated behaviours, not by hardcoding them, which would be too costly and
complicated, but by showing a few examples and asking the robot to imitate them.
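
In its simplest form, this amounts to searching a space of candidate programs for one that reproduces the given examples. A toy Python sketch of that idea follows; the primitive set, constant range and examples are invented purely for illustration:

    from itertools import product

    # a tiny space of candidate programs: x -> x + k and x -> x * k for small k
    PRIMITIVES = {
        "add_k": lambda k: (lambda x: x + k),
        "mul_k": lambda k: (lambda x: x * k),
    }
    CONSTANTS = range(-3, 4)

    def synthesise(examples):
        """examples: list of (input, output) pairs; returns a description of the
        first enumerated program consistent with all of them, or None."""
        for name, k in product(PRIMITIVES, CONSTANTS):
            prog = PRIMITIVES[name](k)
            if all(prog(x) == y for x, y in examples):
                return f"{name}({k})"
        return None

    # the spreadsheet example from the text, read as "next element = current + 2"
    print(synthesise([(2, 4), (4, 6)]))   # -> add_k(2)

Real systems replace this brute-force enumeration with SMT solvers, learned guidance or automata-theoretic techniques, which is precisely what the project will compare.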

Automated program synthesis, namely having programs write correct programs, is a problem
with a rich history in computer science that dates back to the origins of the field itself. In
particular, the simple paradigm of “Programming by Example” has been independently
developed within several subfields (at least formal verification, programming languages, and
learning) under different names and with different approaches. This project is about
understanding the tradeoffs between these techniques, comparing them, and possibly
devising one to beat them all.

Programming by example can be seen as a concrete framework for program synthesis. In
synthesis, the program is given by a high-level specification, for instance a logical formula.
The special case where only inputs and outputs are given is nonetheless pertinent in
synthesis (see, for example, https://dspace.mit.edu/openaccess-disseminate/1721.1/90876).
Adria Gascon has extensive experience in synthesis, in particular using SMT solvers. This
will be one of the approaches to look at.

Programming by example can be attempted by neural networks and probabilistic inference.
There is some recent work in this direction which attempts to solve the program induction
problem directly (see for instance https://arxiv.org/abs/1703.04990), as well as work which
adopts deep learning as a way to provide assistance to SMT solvers (e.g.
https://arxiv.org/abs/1611.01989). Brooks Paige is familiar with such approaches. This will
be a second approach to look at.

Programming by example can be seen as an automaton learning task. In this scenario, the
goal is to learn a weighted automaton, which is a simple recursive finite-state machine
outputting real numbers. There are powerful techniques for learning weighted automata, for
instance through spectral techniques. Nathanaël Fijalkow has worked on these questions.
This will be a third approach to look at.
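
As a concrete illustration of the spectral route, the sketch below estimates a small weighted automaton from a Hankel matrix of function values using a truncated SVD (Balle/Mohri-style formulas). The target function, alphabet and prefix/suffix basis are toy choices made for this example only:

    import numpy as np
    from itertools import product

    # toy target over {a,b}*: value decays with length and counts occurrences of 'a'
    alphabet = ['a', 'b']
    def f(w):
        return 0.5 ** len(w) * w.count('a')

    def short_strings(max_len):
        strs = ['']
        for k in range(1, max_len + 1):
            strs += [''.join(t) for t in product(alphabet, repeat=k)]
        return strs

    prefixes = short_strings(2)
    suffixes = short_strings(2)
    i_empty = suffixes.index('')

    # Hankel blocks: H[p, s] = f(ps) and H_sig[p, s] = f(p + sig + s)
    H = np.array([[f(p + s) for s in suffixes] for p in prefixes])
    H_sig = {a: np.array([[f(p + a + s) for s in suffixes] for p in prefixes])
             for a in alphabet}

    # rank-n truncated SVD gives a forward/backward factorisation of H
    n = 2
    U, S, Vt = np.linalg.svd(H, full_matrices=False)
    U, S, Vt = U[:, :n], S[:n], Vt[:n, :]
    P_fwd = U * S
    P_pinv = np.linalg.pinv(P_fwd)

    # spectral estimates of the weighted automaton's parameters
    alpha_init = H[prefixes.index('')] @ Vt.T     # initial weight vector
    alpha_final = P_pinv @ H[:, i_empty]          # final weight vector
    A = {a: P_pinv @ H_sig[a] @ Vt.T for a in alphabet}

    def f_hat(w):
        v = alpha_init
        for ch in w:
            v = v @ A[ch]
        return float(v @ alpha_final)

    print(f('abba'), f_hat('abba'))   # the two values should agree closely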

Besides studying their formal guarantees, we plan to empirically evaluate our algorithms,
and hence the project will involve a significant amount of coding.

Number of Students on Project: 2

Internship Person Specification

Essential Skills and Knowledge

   •   Interest in theoretical computer science in general

   •   Interest in various computational models: automata, neural networks

   •   Interest in programming languages

   •   Interest in interdisciplinarity (within maths and computer science), as the different
       techniques to be understood and compared are rather diverse

   •   Coding skills

Desired Skills and Knowledge

   •   Previous experience in SMT solving

   •   Previous experience in NNs

   •   Previous experience in automata learning

Return to Contents

Project 2 – Algorithms for automatic detection of new word
meanings from social media, to understand language and
social dynamics.

Project Goal

To develop computational methods for identifying the emergence of new word meanings
using social media data, advance understandings of cultural and linguistic interaction online,
and improve natural language processing tools.

Project Supervisors

Barbara McGillivray (Research Fellow, The Alan Turing Institute, University of Cambridge)

Dong Nguyen (Research Fellow, The Alan Turing Institute, University of Edinburgh)

Scott Hale (Turing Fellow, The Alan Turing Institute, University of Oxford)

Project Description

This project focuses on developing a system for identifying new word meanings as they
emerge in language, focussing on words entering English from different languages and
changes in their polarity (e.g., from neutral to negative or offensive). An example is the word
kaffir, which, starting from a neutral meaning, has acquired an offensive use as a racial or
religious insult. The proposed research furthers the state of the art in Natural Language
Processing (NLP) by developing better tools for processing language data semantically, and
has impact on important social science questions.

Language evolves constantly through social interactions. New words appear, others become
obsolete, and others acquire new meanings. Social scientists and linguists are interested in
investigating the mechanisms driving these changes. For instance, analysing the meaning of
loanwords from foreign languages using social media data helps us understand the precise
sense of what is communicated, how people interact online, and the extent to which social
media facilitate cross-cultural exchanges. In the case of offensive language, understanding
the mechanisms by which it is propagated can inform the design of collaborative online
platforms and provide recommendations to limit offensive language where this is desired.

Detecting new meanings of words is also crucial to improve the accuracy of NLP tools for
downstream tasks, for example in the estimation of the "polarity" of words in sentiment
analysis (e.g. sick has recently acquired a positive meaning of 'excellent' alongside the
original meaning of 'ill'). Work to date has mostly focused on changes over longer time
periods (cf., e.g., Hamilton et al. 2016). For instance, awful in texts from the 1850s was a
synonym of 'solemn' and nowadays stands for 'terrible'. New data on language use and new
data science methods allow for studying this change at finer timescales and higher
resolutions. In addition to social media, online collaborative dictionaries like Urban Dictionary
are excellent sources for studying language change as it happens; they are constantly
updated and the threshold for including new material is lower than for traditional dictionaries.

The meaning of words in state-of-the-art NLP algorithms is often expressed by vectors in a low-
dimensional space, where geometric closeness stands for semantic similarity. These vectors
are usually fed into neural architectures built for specific tasks. The proposed project aims at
capturing meaning change on a fine-grained, short time scale. We will use the algorithm
developed by Hamilton et al. (2016), who used it to identify new meanings using Google
Books. We will train in-house vectors on multilingual Twitter data collected from 2011 to
2017. Through this process we will identify meaning change candidates and evaluate them
against the dictionary data by focussing on analysing the factors that drive foreign words to
enter the English language and to change their polarity. In doing so, we will shed light on the
extent to which the detected meaning changes are driven by linguistically internal rather than
external (e.g. social, technological, etc.) factors.
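
As a rough sketch of the comparison step (in the spirit of Hamilton et al. 2016, though the project's actual pipeline will differ in detail), one can train word vectors on two time slices, align the two spaces with an orthogonal Procrustes rotation, and rank words by how far they move; the random vectors below merely stand in for embeddings trained on the 2011-2017 Twitter slices:

    import numpy as np

    def aligned_cosine_shift(emb_old, emb_new):
        """emb_old, emb_new: dicts mapping word -> vector from two time slices.
        Aligns the spaces with orthogonal Procrustes and returns, for each shared
        word, a cosine distance; larger values flag candidate meaning change."""
        shared = sorted(set(emb_old) & set(emb_new))
        A = np.stack([emb_old[w] for w in shared]).astype(float)
        B = np.stack([emb_new[w] for w in shared]).astype(float)
        A /= np.linalg.norm(A, axis=1, keepdims=True)
        B /= np.linalg.norm(B, axis=1, keepdims=True)
        U, _, Vt = np.linalg.svd(A.T @ B)        # min_R ||A R - B|| over rotations
        A_rot = A @ (U @ Vt)
        return dict(zip(shared, 1.0 - (A_rot * B).sum(axis=1)))

    # toy usage with made-up vectors; in practice these would be word2vec-style
    # embeddings trained separately on, e.g., 2011-2013 and 2015-2017 tweets
    rng = np.random.default_rng(0)
    emb_old = {w: rng.normal(size=50) for w in ["sick", "ill", "awful", "great"]}
    emb_new = {w: v + rng.normal(scale=0.1, size=50) for w, v in emb_old.items()}
    emb_new["sick"] = rng.normal(size=50)        # simulate a shift for one word
    print(sorted(aligned_cosine_shift(emb_old, emb_new).items(),
                 key=lambda kv: -kv[1]))         # larger distance = more change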

The original contributions of this research are:

   •   The development of an NLP system for detecting meaning change occurring in a
       relatively short time period, so as to further the state of the art in NLP.

   •   The design of an evaluation framework which compares automatically derived
       candidates for meaning change against dictionary data.

   •   The analysis of subsets of such candidates to answer social science questions about
       the dynamics of human behaviour online.

The specific tasks of this project are:

    a) Implement existing algorithms for identifying words that acquire new meanings as
        they appear in the English language using social media data from Twitter collected
        over a multiyear period (2011-2017).

b) Validate candidate words from Task (a) against Urban Dictionary and other
       dictionaries.
   c) Evaluate word meaning change in areas such as foreign loanwords and polarity
       change, and address research questions regarding cultural and linguistic exchanges
       online, as well as the creation and propagation of offensive language online.
   d) Prepare an article to be submitted to a journal or conference.

Number of Students on Project: 2

Internship Person Specification

Essential Skills and Knowledge

   •   All interns will need advanced NLP skills, an interest in linguistics, and experience
       working with large datasets and cloud computing. At least one of the interns should
       have some social data science experience.
   •   Experience developing R packages would be beneficial, although training can be
       provided.

Desired Skills and Knowledge

   •   Previous experience in SMT solving

   •   Previous experience in NNs

   •   Previous experience in automata learning

Return to Contents

Project 3 – High performance, large-scale regression

Project Goal

To investigate distributed, scalable approaches to the standard statistical task of high-
dimensional regression with very large amounts of data, with the ultimate goal of informing
current best practice in terms of algorithms, architectures and implementations.

Project Supervisors

Anthony Lee (Research Fellow, The Alan Turing Institute, University of Cambridge)

Rajen Shah (Turing Fellow, The Alan Turing Institute, University of Cambridge)

Yi Yu (University of Bristol)

Project Description

The ultimate goal is to critically understand how different, readily available, large-scale
regression algorithms/software and frameworks perform for distributed systems, and isolate
both computational and statistical performance issues. A specific challenging dataset will
also be included to add additional focus, and there is the opportunity to investigate more
sophisticated, but less readily-available algorithms for comparison.

This project aligns with the Institute’s strategic priorities in establishing leadership and
providing guidance for common data analysis tasks at scale. It can feed into a larger
data-science-at-scale software programme around performance and usability, which it is
hoped will be developed in 2018.

Phases:

First phase: benchmark and profile available approaches on the Cray Urika-GX, and
potentially other architectures, for a scalable example class of models with carefully chosen
characteristics. Different regimes can be explored where there are substantial effects on
performance.

Second phase: use the benchmarks and profiling information to identify which, if any,
recently proposed approaches to large-scale regression may improve performance, with the
advice of Yi Yu and Rajen Shah.

Third phase: apply the skills and software developed to a large and challenging data set.

Throughout the project, documentation will be written to enable other data scientists to
perform large scale regressions with greater ease, and understand the implications of using
different architectures, frameworks, algorithms, and implementations.
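
For illustration, a minimal PySpark sketch of the kind of benchmark run envisaged in the first phase; the data path, column names and penalty settings are placeholders rather than project decisions:

    import time
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("regression-benchmark").getOrCreate()
    df = spark.read.parquet("hdfs:///data/regression_benchmark.parquet")  # placeholder path

    feature_cols = [c for c in df.columns if c != "label"]
    assembled = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)

    # elastic-net penalised least squares; the regularisation settings are illustrative
    lr = LinearRegression(featuresCol="features", labelCol="label",
                          elasticNetParam=0.5, regParam=0.1)
    start = time.time()
    model = lr.fit(assembled)
    print("fit time (s):", time.time() - start)
    print("training RMSE:", model.summary.rootMeanSquaredError)

Wall-clock fit time, communication patterns and memory use under different cluster configurations are the kinds of quantities the profiling would record.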

This project is supported by Cray Computing.

Number of Students on Project: 2

Internship Person Specification

Essential Skills and Knowledge
   •   Familiarity with a cluster computing framework for data science / machine learning,
       e.g., Spark

   •   Basic statistical understanding of regression

Desirable Skills and Knowledge

   •   Some experience with high-performance computing

Return to Contents

Project 4 - Design, analysis and applications of efficient
algorithms for graph based modelling

Project Goal

To develop fast and efficient numerical methods for optimization problems on graphs,
making use of continuum (large data) limits in order to develop multi-scale methods, with
real-world applications in medical imaging and time series data.

Project Supervisors

Matthew Thorpe (University of Cambridge)
Kostas Zygalakis (Turing Fellow, The Alan Turing Institute, University of Edinburgh)
Carola-Bibiane Schönlieb (Turing Fellow, The Alan Turing Institute, University of Cambridge)
Elizabeth Soilleux (University of Cambridge)
Mihai Cucuringu (Research Fellow, The Alan Turing Institute, University of Oxford)

Project Description

Many machine learning methods use a graphical representation of data in order to capture
its geometry in the absence of a physical model. If we consider the problem of classifying
a large data set, say 10^7 data points, then one common approach is spectral clustering. The
idea behind spectral clustering is to project the data onto a small number of discriminating
directions where the data should naturally separate into classes. In practice one uses the
eigenvectors of the graph Laplacian as directions and then uses off-the-shelf methods such
as k-means for the clustering. More importantly, this methodology easily extends to the
semi-supervised learning context.

A bottleneck in the above approach is in the computation of eigenvectors of the graph
Laplacian. The dimension of the graph Laplacian is equal to the number of data points and
therefore becomes infeasible for large data sets. Our approach is to use continuum (large
data) limits of the graph Laplacian to approximate the discrete problem with a continuum
PDE problem. We can then use standard methods to discretise the continuum PDE problem
on a potentially much coarser scale compared to the original discrete problem. In particular,
instead of computing eigenvectors of the graph Laplacian, one would compute
eigenfunctions of the continuum limit of the graph Laplacian and use these instead. This
should remove the bottleneck in spectral clustering methods for large data. The approach is
amenable to multi-scale methods, in particular by computing coarse approximations and
iteratively refining them using known scaling results.
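
For reference, a minimal Python sketch of the standard discrete pipeline whose eigendecomposition step the continuum approximation is intended to replace; the graph construction and parameter choices are illustrative only:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.neighbors import kneighbors_graph
    from scipy.sparse import csgraph

    def spectral_clustering(X, n_clusters=2, n_neighbors=10):
        # symmetric k-nearest-neighbour graph on the data points
        W = kneighbors_graph(X, n_neighbors, mode="connectivity", include_self=False)
        W = 0.5 * (W + W.T)
        # normalised graph Laplacian; its low eigenvectors give the discriminating directions
        L = csgraph.laplacian(W, normed=True)
        # this dense eigendecomposition is exactly the step that becomes infeasible
        # for ~10^7 points and that eigenfunctions of the continuum limit would replace
        vals, vecs = np.linalg.eigh(L.toarray())
        return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vecs[:, :n_clusters])

    # toy usage on two well-separated blobs
    X, _ = make_blobs(n_samples=300, centers=2, random_state=0)
    labels = spectral_clustering(X, n_clusters=2)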

The project will start by implementing modifications of existing algorithms; in particular, we
will replace bottlenecks such as computing eigenvectors with an approximation based on
continuum limits. Once we have a working algorithm, we aim to take the project further by
developing classification algorithms for diagnosing coeliac disease from medical images. In
particular, using our algorithms, we aim to improve on the current state-of-the-art method of
diagnosing coeliac disease (microscopic examination of biopsies), which is inaccurate, with
around 20% misclassification.

Number of Students on Project: 1

Internship Person Specification
Essential Skills and Knowledge
   •   Good scientific computing skills, preferably in either Matlab or Python

   •   Competence in basic linear algebra

   •   Some functional analysis and PDEs

   •   Strong communication skills

Desirable Skills and Knowledge

   •   Experience with implementing Bayesian methods

Return to Contents

Project 5 - Privacy-aware neural network classification &
training

Project Goal

To invent new encrypted methods for neural network training and classification.

Project Supervisors

Matt Kusner (Research Fellow, The Alan Turing Institute, University of Warwick)

Adria Gascon (Research Fellow, The Alan Turing Institute, University of Warwick)

Varun Kanade (Turing Fellow, The Alan Turing Institute, University of Oxford)

Project Description

Neural networks crucially rely on significant amounts of data to achieve state-of-the-art
accuracy. This makes paradigms such as cloud computing and learning on distributed
datasets appealing. In the former setting, computation and storage are efficiently outsourced
to a trusted computing party, e.g. Azure, while in the latter, the computation of accurate
models is enabled by aggregating data from several sources.

For regulatory and/or ethical reasons, however, data cannot always be shared. For
instance, many hospitals may hold overlapping patient statistics which, if aggregated, could
produce highly accurate classifiers, but doing so may compromise highly personal data.
This kind of privacy concern prevents useful analysis of sensitive data. To tackle this issue,
privacy-preserving data analysis has emerged as an area involving several disciplines, such
as statistics, computer science, cryptography, and systems security. Although privacy in data
analysis is not a solved problem, many theoretical and engineering breakthroughs have
turned privacy-enhancing technologies such as homomorphic encryption, multi-party
computation, and differential privacy into approaches of practical interest.

However, such generic techniques do not scale to input sizes required for training accurate
deep learning models, and custom approaches carefully combining them are necessary to
overcome scalability issues. Recent work on sparsifying neural networks and discretising the
weights used when training neural networks would be suitable avenues to enable application
of modern encryption techniques. However, issues such as highly non-linear activation
functions and the requirement for current methods to keep track of some high-precision
parameters may inhibit direct application.
The project will focus on both these aspects:
•   Designing training procedures that use only low-precision weights and simple activation
    functions.
•   Adapting cryptographic primitives, such as those used in homomorphic encryption and
    multi-party computation, to enable private training on these modified training procedures.
The ultimate goal of the project is to integrate both of these aspects into an implementation
of a provably privacy-preserving system for neural network classification and training.
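
The toy Python sketch below illustrates the two ingredients at the smallest possible scale: fixed-point quantisation to low-precision integer weights, and additive secret sharing of those integers. It is deliberately not secure (the shares are recombined locally before the dot product); a real protocol would multiply shares obliviously, e.g. using Beaver triples or oblivious transfer. All constants are illustrative:

    import numpy as np

    P = 2**31 - 1      # arithmetic is done modulo a public prime
    SCALE = 2**8       # fixed-point scale used when quantising to low precision

    def quantise(x):
        return np.round(x * SCALE).astype(np.int64) % P

    def share(x, rng):
        r = rng.integers(0, P, size=x.shape, dtype=np.int64)
        return r, (x - r) % P     # two additive shares; each alone looks uniformly random

    rng = np.random.default_rng(0)
    w = quantise(np.array([0.25, -1.0, 0.5]))    # model owner's weights
    x = quantise(np.array([1.0, 2.0, -0.5]))     # client's input

    w0, w1 = share(w, rng)
    x0, x1 = share(x, rng)

    # for brevity the shares are recombined here before multiplying; a real MPC
    # protocol would keep the two vectors hidden from each other throughout
    dot_mod = sum(int(a) * int(b)
                  for a, b in zip((w0 + w1) % P, (x0 + x1) % P)) % P
    dot = dot_mod if dot_mod < P // 2 else dot_mod - P   # values above P/2 are negative
    print(dot / SCALE**2)   # ~ 0.25*1 + (-1)*2 + 0.5*(-0.5) = -2.0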

Number of Students on Project: 2

Internship Person Specification

Essential Skills and Knowledge

     • Interest in theoretical aspects of computer science

     • Knowledge of public-key cryptography (RSA, Paillier, GSW)

     • Knowledge of ML and NN (Residual Networks, Convolutional Networks)

     • Experience in implementing secure and/or data analysis systems

     • Experience in implementing distributed systems

Desired Skills and Knowledge
     • Experience implementing cryptographic protocols

     • Experience implementing multi-party computation protocols

Return to Contents

Project 6 - Clustering signed networks and time series data

Project Goal

To implement and compare several recent algorithms, and potentially develop new ones, for
clustering signed networks, with a focus on correlation matrices arising from real-world
multivariate time series data sets.

Project Supervisors

Mihai Cucuringu (Research Fellow, The Alan Turing Institute, University of Oxford)

Hemant Tyagi (Research Fellow, The Alan Turing Institute, University of Edinburgh)

Project Description

Clustering is one of the most widely used techniques in data analysis, and aims to identify
groups of nodes that exhibit similar features. Spectral clustering methods have become a
fundamental tool with a broad range of applications in areas including network science,
machine learning and data mining. The analysis of signed networks - with negative weights
denoting dissimilarity or distance between a pair of nodes in a network - has become an
increasingly important research topic in recent times. Examples include social networks that
contain both friend and foe links, and shopping bipartite networks that encode like and
dislike relationships between users and products. When analysing time series data, the most
popular measure of linear dependence between variables is the Pearson correlation taking
values in [−1, 1], and clustering such correlation matrices is important in certain applications.

This proposal will develop k-way clustering in signed weighted graphs, motivated by social
balance theory, where the task of clustering aims to decompose the network into disjoint
groups. These will be such that individuals within the same group are connected by as many
positive edges as possible, while those from different groups are connected by as many
negative edges as possible.

We expect that the low-dimensional embeddings obtained via the various approaches we
will investigate could be of independent interest in the context of robust dimensionality
reduction in multivariate time series analysis. Of particular interest is learning nonlinear
mappings from time series data which are able to exploit (even weak) temporal correlations
inherent in sequential data, with the end goal of improving out-of-sample prediction. We will
focus on a subset of the following problems.

(1) Signed Network Embedding via a Generalized Eigenproblem. This approach is
inspired by recent work that relies on a generalised eigenvalue formulation which can be
solved extremely fast due to recent developments in Laplacian linear system solvers, making
the approach scalable to networks with millions of nodes.
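
A rough sketch of one such signed spectral embedding is given below; the particular matrix pencil is illustrative, and the matrices used in the referenced work may differ. In the full pipeline, k-means would then be run on the resulting low-dimensional coordinates:

    import numpy as np
    from scipy.linalg import eigh

    def signed_spectral_embedding(A, dim):
        """A: symmetric signed adjacency matrix (dense); dim: embedding dimension."""
        A_pos, A_neg = np.maximum(A, 0), np.maximum(-A, 0)
        D_pos, D_neg = np.diag(A_pos.sum(1)), np.diag(A_neg.sum(1))
        L_pos, L_neg = D_pos - A_pos, D_neg - A_neg
        # small generalised eigenvalues of this pencil favour keeping positive edges
        # within clusters and pushing negative edges between clusters
        M = L_pos + D_neg
        N = L_neg + D_pos + 1e-8 * np.eye(len(A))   # small ridge keeps N definite
        _, vecs = eigh(M, N)                         # generalised eigenproblem
        return vecs[:, :dim]

    # toy usage: two groups with positive edges inside and negative edges across
    A = np.zeros((6, 6))
    A[:3, :3] = 1; A[3:, 3:] = 1; A[:3, 3:] = -1; A[3:, :3] = -1
    np.fill_diagonal(A, 0)
    emb = signed_spectral_embedding(A, 2)
    print(np.sign(emb[:, 0]))   # the first coordinate separates nodes 0-2 from 3-5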

(2) Signed clustering via Semidefinite Programming (SDP). This approach relies on a
semidefinite programming-based formulation, inspired by recent work in the context of
community detection in sparse networks. We solve the SDP efficiently via a Burer-Monteiro
approach, and extract clusters via minimal-spanning-tree-based clustering.

(3) An MBO scheme. Another direction relates to graph-based diffuse interface models
utilizing the Ginzburg-Landau functionals, based on an adaptation of the classic numerical
Merriman-Bence-Osher (MBO) scheme for minimizing such graph-based functionals. The
latter approach bears the advantage that it can easily incorporate labelled data, in the
context of semi-supervised clustering.

(4) Another research direction is along the lines of clustering time series using the Fréchet
distance. The existing algorithm in the literature is quite complicated and not directly
implementable in practice. It essentially involves a pre-processing step where each time
series is replaced with its lower complexity version via its “signature”. This leads to a faster
algorithm for clustering (in theory). The approach via signatures could prove powerful, and
one could consider forming the signature via randomized sampling of the “segments” of the
time series.

(5) Graph motifs. This approach relies on extending recent work on clustering the
motif/graphlet adjacency matrix, as proposed recently in a Science paper by Benson, Gleich,
and Leskovec.

(6) Spectrum-based deep nets. A recent approach in the literature focuses on fraud
detection in signed graphs with very few labelled training sample points. This problem and
its setup are very similar to the topic of an ongoing research grant “Accenture and Turing
alliance for Data Science”, using network analysis tools for fraud detection, that could benefit
from any algorithmic developments that would take place during the internship. The
approach proposes a novel framework that combines deep neural networks and
spectral graph analysis, by relying on the low-dimensional spectral coordinates (extracted
by our approaches (1) - (5) detailed above) as input to deep neural networks, making the
latter computationally feasible to train.

Approaches (1), (2) and (3) already have working MATLAB implementations available that
could be built upon and compared to (4) and (5). Time permitting, (6) can also be explored.

There will be freedom to pursue any subset of the above topics that align best with the
candidates’ background and maximise the chances of a publication.

A strong emphasis will be placed on assessing the performance of the algorithms on real-
world, publicly available data sets arising in economic data science, meteorology, medicine
monitoring or finance.

Number of Students on Project: 2

Internship Person Specification
Essential Skills and Knowledge
   •   Both students will have familiarity with the same programming language (either R,
       Python, or MATLAB)

   •   Solid knowledge of linear algebra and algorithms

   •   Familiarity with basic machine learning tools such as clustering, linear regression and
       PCA

Desirable Skills and Knowledge

   •   Basic familiarity with spectral methods, optimization, nonlinear dimensionality
       reduction, graph theory, model selection, LASSO/Ridge regression, SVMs, NNs

       Return to Contents
Project 7 - Uncovering hidden cooperation in democratic
institutions

Project Goal
To generalise the method of Vote-Trading Networks, previously developed to study hidden
cooperation in the US Congress, to a wider set of democratic institutions, developing a
research programme in the measurement and characterisation of hidden cooperation on a
large scale.

Project Supervisors
Omar A Guerrero (Research Fellow, The Alan Turing Institute, University College London)
Ulrich Matter (University of St Gallen)
Dong Nguyen (Research Fellow, The Alan Turing Institute, University of Edinburgh)

Project Description
The project aims at improving our understanding of cooperation in democratic institutions. In
particular, it will shed new light on cooperative behaviour that is intentionally ‘hidden’. An
example of such hidden cooperation is when two legislators agree to support each other’s
favourite bills, despite their ideological preferences, and/or despite such support being
disapproved of by their respective voters or campaign donors. This kind of behaviour is key to
the passage or blockage of critical legislation; however, we know little about it due to its
unobservable nature. The objective of this project is to exploit newly available big data on
voting behaviour from different institutional contexts and state-of-the-art methods from data
science, in order to develop two distinct research papers with clear policy implications for the
design and evaluation of political institutions.

Political institutions, such as parliaments and congresses, shape the life of every democratic
society. Hence, understanding how legislative decisions arise from hidden agreements has
direct implications for the guidelines that governments follow when conducting policy
interventions. Moreover, decision making by voting is common in areas other than legislative
law-making. It is prevalent in courts, international organizations, as well as in board rooms of
private enterprises.

The supervisors have collected comprehensive data sets on two institutions: the US Supreme
Court and the United Nations General Assembly. Each intern will work on one institution, using
the data provided by the supervisors and, sometimes, collecting
complementary data (through web scraping). The work conducted on the two institutions will
share a set of tools and methods, but also have unique requirements. In order to streamline
the workflow, the internship will be structured in three phases. Every week, there will be a
group meeting where each intern will give a presentation of his or her progress. This will be
an opportunity to share ideas, questions, challenges and solutions that the interns have
experienced. It will also serve to evaluate progress and adjust goals and objectives. In
addition, the documentation of their progress will be the basis for a final report to be handed
in during the last week.

Phase 1: Introduction (1 to 1.5 weeks)
The interns will receive an introduction to the topic of cooperation in social systems, with a
particular focus on political institutions and situations in which cooperation is intentionally
hidden, such as vote trading, and, hence, unobservable in real-world data. Some specifics
about this phase are the following:
   •   Introduction to vote trading in democratic institutions, its societal relevance, evidence,
       measurements and challenges.
   •   Introduction to web scraping and text mining.
   •   Tutorial on network science.
   •   Tutorial on stochastic and agent-based models.
   •   Tutorial on the Vote-Trading Networks framework.

Phase 2: Work with Data (3 to 4 weeks)
In this phase, the interns will conduct independent work to prepare their datasets and perform
statistical analysis to understand their structure. The supervisors will provide the ‘core’ datasets,
which will then be processed, pruned and analysed by the interns. Preparation work varies
depending on the project. The intern working with US Supreme Court data will apply natural
language processing (NLP) techniques to a large set of raw text documents, and then match
the extracted information to voting records. Given the nature of the problems related to NLP,
this work could require substantially more time than the UN project. Hence, the goals and
timelines for this project will be adjusted according to progress. The intern working with UN
data will extend a web scraper, previously developed by the supervisors, in order to download
data from the UN Library on resolutions, and match it to voting data from the UN General
Assembly.

Once the data sets have been prepared, the interns will conduct statistical analysis. This will
help the group gain a better understanding of the composition of the population, its
characteristics, voting patterns, voting outcomes, etc.

Phase 3: Computational Analysis (rest of the internship)
In this phase, the interns will bring together their understanding about institutions, ideas behind
hidden cooperation, data sets and computational methods. The interns will write up their
results, with the goal of publishing two distinct research articles.

Number of Students on Project: 2

Internship Person Specification
Essential Skills and Knowledge
   •   Knowledgeable in the Python and/or R programming language
   •   Familiar with statistical concepts such as random variables, probability distributions
       and hypothesis testing
   •   Experience working with empirical micro-data

Desirable Skills and Knowledge
   •   Knowledgeable about political institutions and economic behaviour
   •   Familiar with complexity science and complex networks
   •   Familiar with agent-based modelling and Monte Carlo simulation

Return to Contents

Project 8 - Deep learning for object tracking over occlusion

Project Goal

To use deep learning to discover occluded objects in an image.

Project Supervisors

Vaishak Belle (Turing Fellow, The Alan Turing Institute, University of Edinburgh)

Chris Russell (Turing Fellow, The Alan Turing Institute, University of Surrey)

Brooks Paige (Research Fellow, The Alan Turing Institute, University of Cambridge)

Project Description

Numerous applications in data science require us to parse unstructured data in an
automated fashion. However, many of these models are not human-interpretable. Given the
increasing need for explainable machine learning, an inherent challenge is whether
interpretable representations can be learned from data.

Consider the application of object tracking. Classically, algorithms simply track the changing
positions of objects across frames. But in many complex applications, ranging from robotics
to satellite images to security, objects get occluded and thus disappear from the
observational viewpoint. The first task here is then to learn semantic representations for
concepts such as "inside", "behind" and "contained in."

The first supervisor (V. Belle) has written a few papers on using probabilistic programming
languages to define such occlusion models -- in the sense of instantiating them as graphical
models -- and on using that construction in particle filtering (PF) problems and
decision-theoretic planning problems.

However, the main barrier to success here was that these occlusion models needed to be
defined carefully by hand, which makes them difficult to deploy in new contexts. The main
challenge of this internship is to take steps towards automating the learning of these
occlusion models directly from data.

Specifically, the idea is to jointly train a state estimation model -- specifically a particle filter
(PF) -- with a background vision segmentation model so that we can predict the next position
of an occluded object. The second supervisor (C. Russell) has extensive experience in
vision and segmentation and will serve as the principal point of contact at the ATI for the
interns. (The first supervisor will also visit regularly during the initial stages.) We will focus
on using variational autoencoders, recurrent neural nets or other relevant deep learning
architectures, such as sum-product networks, to enable the learning of semantic
representations. For instantiating deep learning architectures, B. Paige will contribute his
recent approaches to integrating the learning framework with PyTorch and/or Pyro, the
latter recently proposed by Uber.
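
As a toy illustration of the state-estimation half on its own, the sketch below runs a bootstrap particle filter on a 1-D track with a gap where the object is occluded; in the proposed system the observations would come from the segmentation model, and all dynamics and noise levels here are invented:

    import numpy as np

    def run_pf(observations, n_particles=500, dt=1.0):
        """observations: list of floats or None (None = object occluded in that frame)."""
        rng = np.random.default_rng(0)
        # state = (position, velocity); start from a diffuse prior
        particles = np.column_stack([rng.normal(0, 5, n_particles),
                                     rng.normal(0, 1, n_particles)])
        estimates = []
        for z in observations:
            # predict: constant-velocity dynamics plus process noise
            particles[:, 0] += particles[:, 1] * dt + rng.normal(0, 0.1, n_particles)
            particles[:, 1] += rng.normal(0, 0.05, n_particles)
            if z is not None:
                # weight by the Gaussian likelihood of the observation, then resample
                w = np.exp(-0.5 * ((z - particles[:, 0]) / 0.5) ** 2)
                w /= w.sum()
                particles = particles[rng.choice(n_particles, size=n_particles, p=w)]
            # during occlusion the particles simply keep propagating their velocities
            estimates.append(particles[:, 0].mean())
        return estimates

    # object drifts right at roughly one unit per frame, occluded in frames 5-8
    obs = [0.0, 1.1, 2.0, 2.9, 4.1, None, None, None, None, 9.0]
    print(run_pf(obs))

The learning problem in the project is harder: the likelihood and occlusion structure above are hand-specified, whereas the aim is to learn them jointly with the vision model rather than write them down by hand.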

For the data, we plan on using two kinds of data sets. From the object tracking community, we
will be using tracking videos and clips to annotate occluded objects to train the models.
(Russell's working relationship with our new partner QMUL gives us direct access to their
tracking expertise, and sports and commuter tracking datasets. In consultation with them, we
intend to apply PF-RNN to these problems.)

The expected outcome is the following: a learned model M such that, for any clip C where
object O gets occluded at some point in C, a query about the position of O against M would
correctly identify that O is occluded and estimate where it is, based on the velocity of O’s
movement and its position when it was last visible.

Number of Students on Project: 2

Internship Person Specification
Essential Skills and Knowledge
   •    Background in machine learning and deep learning
   •    Preferably background in handling image data
   •    Background in sum-product networks or in PyTorch, Pyro, etc. would be beneficial

Return to Contents

Project 9 – Listening to the crowd: Data science to
understand the British Museum visitors

Project Goal

To analyse and understand the British Museum visitors’ behaviour and feedback, using
different sets of data, including Trip Advisor feedback, wifi access data and “intelligent”
counting data, and methods such as natural language processing and time series analysis.

Project Supervisors

Taha Yasseri (Turing Fellow, The Alan Turing Institute, University of Oxford)

Coline Cuau (British Museum)

Harrison Pim (British Museum)

Project Description

There is more to The British Museum than Egyptian mummies and the Rosetta Stone - more
than 6 million people walk through the doors each year, travelling from every corner of the
globe to see the Museum's collection and get a better understanding of their shared
histories. Those visitors offer us a unique test bed for data science and real world testing at
scale.

In order to address some of the challenges of welcoming such a large number of visitors, the
British Museum is constantly gathering feedback and information about the visiting
experience. Research about visitors informs decisions made by teams around the Museum
and helps the Museum evolve along with its audience. The tools at the Museum’s disposal
include direct feedback channels (such as email or comment cards), “intelligent” counting
data, wifi data, audio guide data, social media conversations, satisfaction surveys, on-site
observation and conversations on online review sites such as Trip Advisor.

Trip Advisor reviews are one of the largest and richest qualitative datasets the Museum has
access to. On average, over 1,000 visitors review their visit on the platform every month.
These reviews are written in over 10 languages by visitors from all parts of the world, and
historical data stretches back over two years. In these comments, visitors discuss the
positive and negative aspects of their visits, make recommendations to others, and rate their
satisfaction. The data set is an opportunity for the Museum to learn more about its visitors, to
understand what the most talked about topics are, and which factors have the biggest
impact on satisfaction.

This research project aims to dig into a rich set of qualitative data, uncovering actionable
insights which will have a real impact on the Museum. The research will have an immediate
and tangible effect and will help the organisation improve the visiting experience currently on
offer at the Museum. The Museum is currently undergoing pivotal strategic change, and the
insights will also feed into future iterations of the display and audience strategies. As far as
we know, the British Museum is the first institution of its kind to take a programmatic
approach to this kind of qualitative data. This pioneering research could potentially impact
the rest of the cultural sector and show the way to a new method of evaluation and visitor
research.

Some of the questions we hope to answer with this data are:
   •   Understanding satisfaction – what it means, how it affects propensity to recommend,
       and which aspects of a visit have the biggest impact on overall satisfaction.
   •   Analysing the different topics talked about in different languages. Do positive and
       negative experiences vary according to language?
   •   Analysing which parts of the collection or objects visitors talk about the most, and
       how feedback differs from one area of the Museum to another.
   •   Tracking comments regarding a variety of key topics, and understanding how they
       relate to one another (tours and talks, audio guides, access, facilities, queues,
       overcrowding…).
   •   Understanding and anticipating external factors which might impact decisions made
       to visit (economy, weather, security concerns, strikes, politics…).

The Museum has recently set up a partnership with Trip Advisor, which gives us access to
the reviews in an XML format. This file includes the date and URL of the reviews, as well as
their title, score, language and full review text. The Museum could take a manual approach
to tagging and analysing reviews, but we believe that more insight can be generated through
computational approaches.

The proposed research project will therefore involve heavy use of modern Natural Language
Processing (NLP) techniques. The complete corpus of review text consists of approximately
7,500,000 words in 50,000 distinct reviews. Recent advances in machine learning and NLP
provide a wide range of potential approaches to the subject, but suggested methods include:
   •   Topic modelling
   •   Clustering/classifying reviews by topic or sentiment
   •   word2vec style approaches to training/using word embeddings
   •   Automating the tagging of new reviews
   •   Time series analysis and principal component analysis
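
For example, a minimal scikit-learn sketch of the topic-modelling step; the four reviews below are invented placeholders standing in for the Trip Advisor corpus:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    reviews = [
        "Amazing collection, the Rosetta Stone was a highlight of our visit",
        "Too crowded near the Egyptian galleries and long queues at the entrance",
        "Great free museum, the audio guide was very helpful",
        "Difficult to find the toilets, signage could be better",
    ]

    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(reviews)

    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    terms = vec.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top = [terms[i] for i in topic.argsort()[-5:][::-1]]
        print(f"topic {k}: {', '.join(top)}")

The real corpus is multilingual, so language identification and per-language models (or multilingual embeddings) would sit in front of a step like this.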

Number of Students on Project: 2

Internship Person Specification
Essential Skills and Knowledge
   •   Familiarity with large scale data analysis
   •   Experience in scientific programming (R or Python)
   •   Interest in natural language processing techniques
   •   Interest or past experience with advanced statistics methods such as time series
       analysis and PCA

Desirable Skills and Knowledge

   •   Interest in culture and museums and familiarity with context of the project

Return to Contents
