Informatics 1: Data & Analysis
Lecture 14: Example Corpora Applications

Ian Stark
School of Informatics
The University of Edinburgh

Thursday 7 March 2019
Semester 2 Week 7

https://course.inf.ed.ac.uk/inf1-da
Lecture Plan

 XML — The Extensible Markup Language
 We start with technologies for modelling and querying semistructured data.

            Semistructured Data: Trees and XML
            Schemas for structuring XML
            Navigating and querying XML with XPath

  Corpora
  One particular kind of semistructured data is large bodies of written or spoken text: each one a
  corpus, plural corpora.

            Corpora: What they are and how to build them
            Applications: corpus analysis and data extraction
Applications of Corpora

 Answering empirical questions in linguistics and cognitive science:

            Corpora can be analyzed using statistical tools;
            Hypotheses about language processing and acquisition can be tested;
            New facts about language structure can be discovered.

  Engineering natural-language systems in AI and computer science:

            Corpora represent the data that these systems have to handle;
            Algorithms can find and extract regularities from corpus data;
            Text-based or speech-based applications can learn automatically from corpus data.

Outline

  1    Finding Things and Counting Them

  2    Small Application

  3    Large Application

  4    Closing

Extracting Information from Corpora

  Once we have an annotated corpus, we can begin to use it to extract information and answer
  questions. We start with the following:

            The basic notion of a concordance in a text.

            Statistics of word frequency and relative frequency, useful for linguistic questions and
            natural language processing.

            Word groups: unigrams, bigrams and n-grams.

            Words that mean something together: collocations.

Concordances

  Concordance: all occurrences of a given word, shown in context.

 That’s the simplest form. More generally, a concordance might mean all occurrences of a certain
 part of speech, a particular combination of words, or all matches for a query expression.

            Specialist concordance programs will generate these from a given query.

            This query might specify a single word, some annotation (POS, etc.) or more complex
            information (e.g., using regular expressions).

            Results are typically displayed as keyword in context (kwic): a matched keyword in the
            middle of a line with a fixed amount of context to left and right.
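
 As a toy illustration of kwic display (a minimal Python sketch, not the cqp tool mentioned
 below; the sample text and helper name are invented for the example):

     import re

     text = ("Scrooge then remembered to have heard that ghosts in haunted "
             "houses were described as dragging chains. You will remember this.")

     def kwic(text, stem, width=25):
         # Match the stem plus any suffix: "remember", "remembered", ...
         for m in re.finditer(r"\b" + re.escape(stem) + r"\w*", text, re.IGNORECASE):
             left = text[:m.start()][-width:]    # fixed context to the left
             right = text[m.end():][:width]      # fixed context to the right
             print(f"{left:>{width}}  {m.group()}  {right}")

     kwic(text, "remember")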

Example Concordance

 These are the opening kwic lines of a concordance for all forms of the word “remember” in a
 collection of novels by Charles Dickens.
 This was generated with the Corpus Query Processor: the same cqp tool that you will use for the
 current tutorial exercises.

               ’s cellar . Scrooge then     remembered   to have heard that ghost
               , for your own sake , you    remember     what has passed between
               e-quarters more , when he    remembered   , on a sudden , that the
               corroborated everything ,    remembered   everything , enjoyed eve
               urned from them , that he    remembered   the Ghost , and became c
               ht be pleasant to them to    remember     upon Christmas Day , who
               its festivities ; and had    remembered   those he cared for at a
               wn that they delighted to    remember     him . It was a great sur
               ke ceased to vibrate , he    remembered   the prediction of old Ja
               as present myself , and I    remember     to have felt quite uncom
               ...

Frequencies

  Frequency information obtained from corpora can be used to investigate characteristics of the
  language represented.

            Token count N: the number of tokens (words, punctuation marks, etc.) in a corpus;
            i.e., the size of the corpus.
            Absolute frequency f(t) of type t: the number of tokens of type t in a corpus.
            Relative frequency of type t: the absolute frequency of t scaled by the overall token count;
            i.e., f(t)/N.
            Type count: the number of different types of token in a corpus.

  Here “tokens of type t” might mean a single word, or all its variants, or every use of a certain
  part of speech.
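
  Given a corpus that has already been split into tokens, these quantities are direct to compute.
  A minimal Python sketch, with an invented token list:

     from collections import Counter

     tokens = ["the", "capital", "of", "Scotland", "is", "Edinburgh", ",",
               "and", "the", "capital", "of", "France", "is", "Paris", "."]

     N = len(tokens)           # token count N: the size of the corpus
     freq = Counter(tokens)    # absolute frequency f(t) for every type t
     type_count = len(freq)    # number of distinct types

     print(N, type_count)                    # 15 11
     print(freq["the"], freq["the"] / N)     # f("the") = 2, f("the")/N ≈ 0.133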

Frequency Example

 Here is a comparison of frequency information between two sources: the British National Corpus
 (BNC) and the Sherlock Holmes story A Case of Identity by Sir Arthur Conan Doyle.

                                                BNC       A Case of Identity
                Token count N           100,000,000                    7,006
                Type count                  636,397                    1,621
                f(“Holmes”)                     890                       46
                f(“Sherlock”)                   209                        7
                f(“Holmes”)/N             0.0000089                   0.0066
                f(“Sherlock”)/N          0.00000209                 0.000999

Unigrams

 We can now ask questions such as: what are the most frequent words in a corpus?

            Count absolute frequencies of all word types in the corpus.

            Tabulate them in an ordered list.

            Result: list of unigram frequencies — frequencies of individual words.
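
 A minimal sketch of this, assuming a pre-tokenised corpus; Python's Counter does both the
 counting and the ordering:

     from collections import Counter

     def unigram_frequencies(tokens, k=8):
         # Absolute frequencies of all word types, most frequent first.
         return Counter(tokens).most_common(k)

     tokens = "the cat sat on the mat and the dog sat by the cat".split()
     for word, count in unigram_frequencies(tokens):
         print(f"{count:>6}  {word}")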

Unigram example

                                   BNC                  A Case of Identity
                             6,184,914    the               350   the
                             3,997,762    be                212   and
                             2,941,372    of                189   to
                             2,125,397    a                 167   of
                             1,812,161    in                163   a
                             1,372,253    have              158   I
                             1,088,577    it                132   that
                               917,292    to                117   it

 The unigram rankings are different, but we can see similarities. For example, the definite article
 “the” is the most frequent word in both corpora; and prepositions like “of” and “to” appear in
 both lists.
n-grams

 The notion of unigram generalises:

            Bigrams — pairs of adjacent words;

            Trigrams — triples of adjacent words;

            n-grams — n-tuples of adjacent words.

 These larger clusters of words carry more linguistic significance than individual words; and, again,
 we can make use of these even before finding out anything about their semantic content.
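
 Extracting n-grams from a token list is a one-line generalisation of the unigram case.
 A sketch, with an invented sentence:

     from collections import Counter

     def ngrams(tokens, n):
         # Every run of n adjacent tokens, as a tuple.
         return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

     tokens = "the use of the money and the use of the house".split()
     print(Counter(ngrams(tokens, 2)).most_common(3))
     # [(('the', 'use'), 2), (('use', 'of'), 2), (('of', 'the'), 2)]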

n-grams example

 The most frequent n-grams in A Case of Identity, for n = 2, 3, 4.

               bigrams               trigrams                            4-grams
              40   of the     5   there was no                   2   very morning of the
              23   in the     5   Mr. Hosmer Angel               2   use of the money
              21   to the     4   to say that                    2   the very morning of
              21   that I     4   that it was                    2   the use of the
              20   at the     4   that it is                     2   the King of Bohemia

 Note that frequencies of even the most common n-grams naturally get smaller with increasing n.
 As more word combinations become possible, there is an increase in data sparseness.

Bigram and POS Example Concordance

 Here is a concordance for all occurrences of bigrams in the Dickens corpus in which the second
 word is “tea” and the first is an adjective.
 This query uses the POS tagging of the corpus to search for adjectives.

            [pos="J.*"][word="tea"]
              87773:   now , notwithstanding the    they had given me before
             281162:   .’ ’ Shall I put a little    in the pot afore I go ,
             565002:   o moisten a box-full with    , stir it up on a piece
             607297:   tween eating , drinking ,    , devilled grill , muffi
             663703:   e , handed round a little    . The harp was there ;
             692255:   e so repentant over their    , at home , that by eigh
            1141472:   rs. Sparsit took a little    ; and , as she bent her
            1322382:   s illness ! Dry toast and    offered him every night
            1456507:   of robing , after which ,    and brandy were administ
            1732571:   rsty . You may give him a    , ma’am , and some dry t

Outline

  1    Finding Things and Counting Them

  2    Small Application

  3    Large Application

  4    Closing

Sample Linguistic Application: Collocations

 A collocation is a sequence of words that occur close together ‘atypically often’ in language
 usage. For example:

            To “run amok”: the verb “run” can occur on its own, but “amok” does not.
            To say “strong tea” is much more natural English than “powerful tea” although the literal
            meanings are much the same.
            Phrasal verbs such as “settle up” or “make do”.
            “heartily sick”, “heated argument”, “commit a crime”, ...

  Both Macmillan and Oxford University Press have specialist dictionaries that provide extensive
  lists of collocations specifically for those learning English. You can also buy collocation lists for
  linguistic research at http://www.collocates.info/.

 The inverted commas around ‘atypically often’ are because we need statistical ideas to make this precise.

Identifying Collocations

 We would like to automatically identify collocations in a large corpus.

  For example, collocations in the Dickens corpus involving the word “tea”.

            The bigram “strong tea” occurs in the corpus. This is a collocation.
            The bigram “powerful tea” does not, in fact, appear in the corpus.
            However, “more tea” and “little tea” do occur in the corpus.
            These are not collocations: they occur no more often than the frequencies of their
            component words would suggest.

 The challenge is: how do we detect when a bigram (or n-gram) is a collocation?

Looking at the Data
 Here are the most common bigrams from the Dickens corpus where the first word is “strong” or
 “powerful”.

            strong   and           31                      powerful   effect        3
                     enough        16                                 sight         3
                     in            15                                 enough        3
                     man           14                                 mind          3
                     emphasis      11                                 for           3
                     desire        10                                 and           3
                     upon          10                                 with          3
                     interest      8                                  enchanter     2
                     a             8                                  displeasure   2
                     as            8                                  motives       2
                     inclination   7                                  impulse       2
                     tide          7                                  struggle      2
                     beer          7                                  grasp         2

Filtering Collocations

 We observe the following from the bigram tables.

            Neither “strong tea” nor “powerful tea” is frequent enough to make it into the top 13.
            Some potential collocations for “strong”: “strong desire”, “strong inclination”, and
            “strong beer”.
            Some potential collocations for “powerful”: “powerful effect”, “powerful motives”, and
            “powerful struggle”.
            A possible problem: bigrams like “strong and”, “strong enough” and “powerful for” have
            high frequency. These do not seem like collocations.
 To distinguish collocations from non-collocations, we need some way to filter out noise.

What We Need is More Maths
  Problem: Words like “for” and “and” are very common anyway: they occur with “strong” by
  chance.
  Solution: Use statistical tests to identify when the frequency of a bigram is atypically high given
  the frequencies of its constituent words.

                                            “beer”       ¬“beer”        Total
                              “strong”           7           618          625
                             ¬“strong”         127       2310422      2310549
                                 Total         134       2311040      2311174

  In general, statistical tools offer powerful methods for the analysis of all types of data. In
  particular, they provide the principal approach to the quantitative (and qualitative) analysis of
  unstructured data.
 We shall return to the problem of finding collocations later in the course, when we have some
 appropriate statistical tools.
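
 As a preview of the statistical idea, with the proper tests left for later: compare the observed
 count of “strong beer” with the count expected if the two words occurred independently. The
 numbers come from the contingency table above.

     N = 2311174        # total number of bigrams in the corpus
     f_strong = 625     # bigrams whose first word is "strong"
     f_beer = 134       # bigrams whose second word is "beer"
     observed = 7       # observed count of the bigram "strong beer"

     # Under independence we would expect (f_strong/N) * (f_beer/N) * N occurrences.
     expected = f_strong * f_beer / N
     print(f"expected {expected:.3f}, observed {observed}, "
           f"ratio {observed / expected:.0f}")
     # expected 0.036, observed 7, ratio 193 -- far above chance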
Coursework

 Written Assignment

 The Inf1-DA assignment will go online by the end of the week. This runs alongside your usual
 tutorial exercises for two weeks; ask your tutor for help with any problems.

 The assignment is based on past examination questions. Your tutor will give you marks and
 feedback on your work in the last tutorial of the semester, and I shall distribute a solution guide.

 These marks will not be part of your final grade for Inf1-DA — this formative assessment is
 entirely for your feedback and learning.

 You are free to look things up, discuss with others, share advice, discuss on Piazza, and do
 whatever helps you learn. Please do.

Outline

  1    Finding Things and Counting Them

  2    Small Application

  3    Large Application

  4    Closing

Engineering Natural-Language Systems

 Two Informatics system-building examples that use corpora extensively:

            Natural Language Processing (NLP): Computer systems that accept or produce readable
            text. For example:

                Summarization: Take a text, or multiple texts, and automatically produce an abstract or
                summary.
                Machine Translation (MT): Take a text in a source language and turn it into a text in the
                target language. For example Google Translate or Microsoft Translator.

            Speech Processing: Systems that accept or produce spoken language.

  Building these draws on probability theory, information theory and machine learning to extract
  and use the language information in large text corpora.

Example: Machine Translation
 The aim of machine translation is to automatically map sentences in one source language to
 corresponding sentences in a different target language, while preserving the meaning of the text.

  Historically, there have been two major approaches:

            Rule-based Translation: Long history including Systran and Babel Fish (AltaVista, then
            Yahoo, now disappeared).

            Statistical Translation: Much recent growth, leading to Google Translate and Microsoft
            Translator.

  Both approaches make use of multilingual corpora.

                                               “The Babel fish,” said The Hitchhiker’s Guide to the Galaxy quietly,
                                  “is small, yellow and leech-like, and probably the oddest thing in the Universe”
Rule-Based Machine Translation
 A typical rule-based machine translation (RBMT) scheme might include:
     1      Automatically assign part-of-speech information to a source sentence.
     2      Build up a syntax tree for the sentence using grammatical rules.
     3      Map this parse tree in the source language into the target language, using a dictionary to
            translate individual words, and rules to find correct inflections and word ordering for the
            translated sentence.
  Some systems use an interlingua between the source and target language.

  In any real implementation each of these steps will be much refined; even so, the central point
  remains: the system translates a sentence by identifying its structure and, to some extent,
  its meaning.

 These systems use corpora to train algorithms that identify part-of-speech information and
 grammatical structures across different languages.
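
 As a toy illustration of the transfer step only (a hypothetical three-word lexicon; real systems
 are vastly more elaborate, and this ignores inflection and gender agreement entirely):

     # Hypothetical lexicon: word -> (part of speech, French translation).
     LEXICON = {"the": ("DET", "la"), "red": ("ADJ", "rouge"), "rose": ("NOUN", "rose")}

     def translate(sentence):
         tagged = [LEXICON[w] for w in sentence.split()]
         out, i = [], 0
         while i < len(tagged):
             # Transfer rule: English ADJ NOUN becomes French NOUN ADJ.
             if i + 1 < len(tagged) and tagged[i][0] == "ADJ" and tagged[i + 1][0] == "NOUN":
                 out += [tagged[i + 1][1], tagged[i][1]]
                 i += 2
             else:
                 out.append(tagged[i][1])
                 i += 1
         return " ".join(out)

     print(translate("the red rose"))   # -> "la rose rouge"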
Examples of Rule-Based Translation

  From http://www.systranet.com/translate

                          The capital city of Scotland is Edinburgh

                                    English −→ German

                        Die Hauptstadt von Schottland ist Edinburgh

                                    German −→ English

                            The capital of Scotland is Edinburgh

Examples of Rule-Based Translation

  From http://www.systranet.com/translate

                   Sales of processed food collapsed across Europe when the
                                          news broke.

                                      English −→ French

                    Les ventes de la nourriture traitée se sont effondrées à
                     travers l’Europe quand les actualités se sont cassées.

                                      French −→ English

                   The sales of treated food crumbled through Europe when
                                        the news broke.
Examples of Rule-Based Translation

  From http://www.systranet.com/translate and Robert Burns

                               My love is like a red, red rose
                               That’s newly sprung in June

                                     English −→ Italian

                          Il mio amore è come un rosso, rosa rossa
                           Quello recentemente è balzato a giugno

                                     Italian −→ English

                             My love is like red, pink a red one
                              That recently is jumped to june
Issues with Rule-Based Translation

 A major difficulty with rule-based translation is gathering enough rules to cover the very many
 special cases and nuances in natural language.

 As a result, rule-based translations often have a very unnatural feel.

 This issue is a serious one, and rule-based translation systems have not yet overcome the
 challenge.

  However, even though the translations seem a little rough to read, they may well be enough to
  successfully communicate meaning.

 (The problem with the example translation on the last slide is of a different nature. The source text is
 poetry, which routinely takes huge liberties with grammar and use of vocabulary. It’s not a surprise that
 this puts it far outside the scope of rule-based translation.)

Statistical Machine Translation
 This uses a corpus of parallel texts, where the same text is given in both source and target
 languages. Translation might go like this:
     1      For each word and phrase from the source sentence find all occurrences of that word or
            phrase in the corpus.
     2      Match these words and phrases with the parallel corpus text, and use statistical methods to
            select preferred translations.
     3      Do some smoothing to find appropriate sizes for phrases and to glue translated phrases
            together to produce the translated sentence.

 Again, real implementations will refine these stages: for example, both source and target
 language corpora can be used to train neural networks that do the actual translation.

 To be effective, statistical translation requires a large and representative corpus of parallel texts.
 This corpus does not need to be heavily annotated.
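
 A toy sketch of step 2: estimate translation probabilities by relative frequency over
 word-aligned pairs. The alignments are invented; real systems work with phrases and far
 larger corpora.

     from collections import Counter, defaultdict

     # Invented word alignments extracted from a parallel corpus.
     aligned = [("house", "maison"), ("house", "maison"), ("house", "domicile"),
                ("the", "la"), ("the", "le"), ("the", "la")]

     counts = defaultdict(Counter)
     for src, tgt in aligned:
         counts[src][tgt] += 1

     def best_translation(src):
         # Most probable target word: argmax of count(src, tgt) / count(src).
         tgt, c = counts[src].most_common(1)[0]
         return tgt, c / sum(counts[src].values())

     print(best_translation("house"))   # ('maison', 0.666...)
     print(best_translation("the"))     # ('la', 0.666...)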
Examples of Statistical Machine Translation
  From http://translate.google.com

                           The capital city of Scotland is Edinburgh

                                     English −→ German

                         Die Hauptstadt von Schottland ist Edinburgh

                                     German −→ English

                             The capital of Scotland is Edinburgh

Examples of Statistical Machine Translation
  From http://translate.google.com

                   Sales of processed food collapsed across Europe when the
                                          news broke.

                                      English −→ French

                    Les ventes d’aliments transformés se sont effondrées en
                          Europe lorsque la nouvelle a été annoncée.

                                      French −→ English

                    Processed food sales collapsed in Europe when the news
                                        was announced.
Examples of Statistical Machine Translation
  From http://translate.google.com and Robert Burns.

                                My love is like a red, red rose
                                That’s newly sprung in June

                                      English −→ Italian

                           Il mio amore è come un rosso, rosa rossa
                              Questo è appena spuntato a giugno

                                      Italian −→ English

                                My love is like a red, red rose
                                That just popped up in June
Features of Statistical Machine Translation

  Statistical machine translation has challenges: it requires a very large corpus of parallel texts,
  and is computationally expensive to carry out.

  In recent years, these problems have diminished, at least for widely-used languages: large
  corpora have become available, and there have been improvements to algorithms and hardware.

  Given a large enough corpus, statistical systems can produce more natural translations than
  rule-based ones.

  Because it is not tied to grammar, statistical translation may work better with less rigid uses of
  language, such as poetry.

Features of Statistical Machine Translation
 At the moment, statistical translation is dominant: machine learning over large corpora is used
 to train neural networks that perform the actual translation.

  However, it has its limitations.

  If statistical translation is applied to a sentence containing uncommon phrases that do not
  appear in the corpus, it can produce nonsense, while rule-based translation may survive.

  Large parallel corpora have often been compiled for reasons of political union: EU, UN, Canada.
  Quality can drop off sharply once we step outside the languages covered by these very large
  historical corpora.

  Some traditional generators of human-translated parallel corpora are now looking to save money
  by using machine translation...

 The future of machine translation looks interesting.
Outline

  1    Finding Things and Counting Them

  2    Small Application

  3    Large Application

  4    Closing

Relevant Courses for Future Years

    Year 2     Inf2A: Processing Formal and Natural Languages

    Year 3     Foundations of Natural Language Processing                           FNLP
               Introductory Applied Machine Learning                                IAML

    Year 4/5   Natural Language Understanding, Generation and Machine Translation   NLU+
               Topics in Natural Language Processing                                TNLP

Homework

  Read This
            Schuster, Johnson, Thorat                                         https://is.gd/zeroshot
            Zero-Shot Translation with Google’s Multilingual Neural Machine Translation System
            Google Research blog, November 2016

  Do This
 Try out the Google Books Ngram Viewer at https://books.google.com/ngrams.
 Compare the relative frequencies over time of the words “computer”, “software” and
 “hardware”; and also the city names “Edinburgh”, “London”, “Paris” and “New York”.
 To find out about the more complex queries available, take a look at
 https://books.google.com/ngrams/info

Automatic Topic Identification

            David Mimno                                                      https://is.gd/topicsnyt
            1000 topics automatically extracted from 20 years of the New York Times.
            October 2012

            Ben Schmidt                                                       https://is.gd/tvtopics
            Typical TV episodes: visualizing topics in screen time
            Sapping Attention blog post, December 2014
