CS 175, Project in Artificial Intelligence Introduction - Padhraic Smyth Department of Computer Science Bren School of Information and Computer ...

Page created by Anne Willis
 
CS 175, Project in Artificial Intelligence

Introduction

Padhraic Smyth
Department of Computer Science
Bren School of Information and Computer Sciences
University of California, Irvine

Today’s Lecture

• Discuss class schedule and organization

• Applications of text analysis in the real world

• Challenges for AI

• Ideas for possible class projects

                                                    Padhraic Smyth, UC Irvine: CS 175, Winter 2018

Course Description for CS 175
Students in this project class will work in 2- to 3-person teams to develop
artificial intelligence and machine learning algorithms and apply them to a
range of different problems related to natural language and text analysis.

These problems can include, for example, document classification and
clustering, sentiment analysis, dialog/chatbot systems, information
extraction, word prediction, text synthesis, question-answering systems,
and so on.

Projects can make use of real-world publicly-available data from sources
such as Twitter, Wikipedia, Reddit, news articles, product and movie
reviews, email data sets, the US patent database, and more.


The Turing Test for AI

                         Human or Computer?


Class Organization

• Class Website: www.ics.uci.edu/~smyth/courses/cs175
    – This is where to find assignments, links to software, project guidelines, etc

• Piazza Website:
    – https://piazza.com/uci/winter2018/compsci175/home
    – Use this to post questions related to assignments, projects, etc
    – Piazza is where we will post announcements, answers to questions, etc

• My Office Hours
    – Fridays 9:30 to 11:30

• Teaching Assistant: Eric Nalisnick
    – Office hours weekly, Thursdays 1 to 3, DBH 4228
    – Discussion section this Friday on Python basics


Class Organization (continued)
• Textbook and Reading Materials
    – No official textbook
    – Several useful online texts: see class Webpage
    – Class Website will contain additional pointers to links and background
      reading that we will refer to in lectures and that will be useful for project
      work

• Discussion Section, Wed, 12 to 12:50, ICS 174
    – Discussion sections for at least the first few weeks
         • First discussion: this Wednesday, on basics of Python
    – Attendance at discussion is not required

• No final exam (but there is a final report due in finals week)


Contacting Instructor: use Piazza
• Use Piazza for all offline questions related to the class
    – Assignments, lectures, projects, data sets, ideas, etc

• Instructor and TA will monitor and answer questions
    – Students should also feel free to answer questions
    – If you wish you can use “private mode” to ask questions that only the
      Professor or TA will see

• Use direct email only if other options do not work for some reason


Academic Integrity (also on the class Web page)

•   Please read the guidelines on academic integrity below. Academic integrity is taken
    seriously in this class. Failure to adhere to the policies below can result in a student
    receiving a failing grade in the class.

•   For assignments you are allowed to discuss the assignments verbally with other class
    members, but you are not allowed to look at or to copy anyone else's written solutions or
    code. All problem solutions and code submitted must be material you have personally
    written during this quarter, except for any standard library or utility functions.

•   For class projects all reports submitted must be written by you or members of your project
    team. Code generated for class projects can be a combination of code written by team
    members and publicly-available code. You should clearly indicate in your reports and in
    your code documentation which parts of your code were written by you or your team and
    which parts of your code were written by others.

•   It is the responsibility of each student to be familiar with UCI's Academic Integrity
    Policies and UCI's definitions and examples of academic misconduct.


How this Course will work
• Early Weeks: Lectures and Assignments
    – Learn general principles of automated text analysis
    – Emphasis on machine learning for text, e.g., classifying a document
    – Combination of lectures, assignments (two), and background reading

• Later Weeks: Team Project
    –   Build a prototype software system for text analysis (weeks 4 to 10)
    –   Propose an idea and plans for your class project (written proposal)
    –   Do background research and reading
    –   Develop ideas, implement algorithms, make use of libraries and packages
    –   Conduct experiments with real data sets
    –   Test and evaluate your system in a systematic manner
    –   Communicate your results (presentations and reports)


Week     Monday                                      Wednesday

Jan 8    Lecture: Introduction and course outline    Lecture: Basic concepts in text analysis

Jan 15   No class (university holiday)               Lecture: Text classification, part 1
                                                     Assignment 1 due, 5pm

Jan 22   Lecture: Text classification, part 2        Lecture: Discussion of class projects
                                                     Assignment 2 due, 5pm

Jan 29   Lecture: Neural networks for text, part 1   Lecture: Neural networks for text, part 2
                                                     Project proposal due, Friday 6pm

Feb 5    Office hours (no lecture)                   Lecture: Algorithm evaluation methods

Feb 12   Office hours (no lecture)                   Lecture: Unsupervised learning algorithms

Feb 19   No class (university holiday)               Lecture: Discussion of progress reports
                                                     Progress report due, Friday 6pm

Feb 26   Office hours (no lecture)                   Office hours (no lecture)

Mar 5    Office hours (no lecture)                   Lecture: Discussion of final reports

Mar 12   Project Presentations (in class)            Project Presentations (in class)
         Upload slides by 4pm                        Upload slides by 4pm

Mar 19   Final project reports due (day/time TBD)


Projects
• 2-person or 3-person teams
    – Project grading will be partly team-based and partly on individual contributions
    – Note that Assignments 1 and 2 are *not* team-based – these will be worked on
      and submitted individually
    – 3-person teams are expected to do about 50% more than 2-person teams

• Each team will propose its own project
    – Suggestions for multiple different projects will be provided
    – Extensive use of libraries (in addition to writing some of your own code)

• Projects will be graded based on
    – Initial proposal
    – Intermediate and final reports
    – In-class presentation
    [We will discuss all of this in more detail in future lectures]


Software Environment for Assignments and Projects
• Python
    – Python will be the primary language we will use in this class
    – Assume that all students have a working knowledge of Python 3

• Packages and Libraries
    – We will make extensive use of additional packages and libraries in Python
    – NLTK: Natural Language Toolkit
    – Scikit-learn: machine learning library
    – Scientific computing/graphs/etc: matplotlib, numpy, scipy, etc
    You should download and install the Anaconda package: it contains many
    packages you need for this class (NLTK, scikit-learn, etc)

• Integrated Development Environment (IDE)
    – You are free to use whatever IDE you prefer
    – The Spyder IDE is a useful option (comes with Anaconda)


Screenshot of the Spyder IDE


Assignment 1
Available on the class Web page

Due Wednesday Jan 17th by 5pm (to dropbox on EEE)

Outline
    – Read selected sections of Chapters 1 to 3 of the online NLTK book
    – Install Anaconda/NLTK/…
    – Write simple functions in Python for text analysis
        • Compute percentage of alphabetic characters in a string
        • Detect the first K words on a Web page
        • Parse text into parts of speech (nouns, verbs, etc)
    – Submit your code as a single python file via EEE
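
The first two functions in the outline can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the assignment's reference solution: the function names are hypothetical, and `first_k_words` assumes the page text has already been downloaded and stripped of HTML. The actual specification is on the class Web page and may differ.

```python
import re

def percent_alphabetic(text):
    """Return the percentage (0-100) of characters in text that are alphabetic."""
    if not text:
        return 0.0  # avoid dividing by zero on an empty string
    letters = sum(1 for ch in text if ch.isalpha())
    return 100.0 * letters / len(text)

def first_k_words(text, k):
    """Return the first k word-like tokens (runs of letters, digits, apostrophes).

    Assumption: `text` is plain text already extracted from the Web page.
    """
    return re.findall(r"[A-Za-z0-9']+", text)[:k]

print(percent_alphabetic("ab12"))
print(first_k_words("Hello, world! How are you?", 3))
```

Part-of-speech tagging is left out of the sketch because it depends on NLTK and its downloadable tagger data.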


APPLICATIONS OF TEXT ANALYSIS

(Thanks to Prof Sameer Singh for several of the slides in the remainder of this presentation)


Automated Text Analysis
• Very large amounts of text now available in digital form
    ……huge increase in automated text analysis techniques and applications

• Examples of large text data sets
    –   Web pages
    –   Emails and text messages
    –   Blogs and microblogs
    –   Product reviews
    –   Search queries
    –   Scientific and medical articles
    –   Legal cases, patents, government documents
    –   News articles about companies and products
    –   Collections of digitized books and historical documents
    –   …and many more….


1985: ~ $100k
per gigabyte

                2015: ~ $0.3 cents
                per gigabyte


From www.internetlivestats.com, 11:50, Mar 9th 2017


Who is interested in analyzing such data?
•   Web companies
     – Google, Facebook, Twitter, Microsoft, Yahoo!, and many more

•   Ecommerce
     – Automated analysis of product reviews + customer text such as emails, search queries, etc
     – eBay, Amazon, plus many “regular” companies that have a Web presence

•   Financial industry
     – Automated tracking of news and online blogs about companies and products

•   Law enforcement and intelligence agencies
     – Text mining of vast amounts of emails, blogs, etc

•   Medical researchers
     – Automated analysis/summarization of publications on diseases, genes, drugs, etc

•   Social scientists and humanities researchers
     – Studying history and social science through analysis of large text collections

Google Search query = “beer”, over time


https://hedonometer.org/index.html


Tweets mentioning Coke (green) and Pepsi (red)

                                      from chimpler.wordpress.com


http://www.mapsandthecity.com/wp-content/uploads/2011/11/Schermafbeelding-2011-11-09-om-10.13.37.png

The Google Books Project
• Google has digitized over 8 million books
    –   Books from 40 university libraries around the world
    –   4.5 million in English, rest in other languages. 6% of all books ever published.
    –   500 billion words
    –   Spans multiple centuries, since the 1500s

• Reading the books manually is impossible
    – Reading only English-language entries since 2000, at the pace of 200
      words/minute, with no sleep/food interruptions, would take 80 years!
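
The reading-time claim is simple arithmetic and can be checked with a short sketch. The 200 words/minute pace, the 80-year figure, and the 500-billion-word corpus size come from the slide; the 365-day year and the function names are assumptions made here for illustration.

```python
WORDS_PER_MINUTE = 200             # reading pace quoted on the slide
MINUTES_PER_YEAR = 365 * 24 * 60   # nonstop reading, ignoring leap years

def years_to_read(num_words, wpm=WORDS_PER_MINUTE):
    """Years of nonstop reading needed to get through num_words."""
    return num_words / (wpm * MINUTES_PER_YEAR)

def words_read_in(years, wpm=WORDS_PER_MINUTE):
    """Number of words covered by reading nonstop for the given number of years."""
    return years * wpm * MINUTES_PER_YEAR

# 80 years of nonstop reading corresponds to about 8.4 billion words.
print(words_read_in(80))

# The full 500-billion-word corpus would take several thousand years.
print(round(years_to_read(500e9)))
```

Under these assumptions, the 80-year figure implies that the English-language entries since 2000 total roughly 8.4 billion words, a small fraction of the whole collection.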


From Michel et al, “Quantitative Analysis of
Culture Using Millions of Digitized Books”,
Science, v331, 2011.


Applications of Text Analysis
• Document classification
    – Spam email classification: email text -> {spam, not spam}
    – Sentiment classification: product review text -> {positive, negative}
    – Web page classification: Web page text -> {sports, finance, entertainment, ….}

•   Machine translation
    – Automated translation of text from one language to another
    – e.g., for Web pages, for mobile phones

• Web search
    – Ranking of Web pages based on matching queries with content

• Web advertising
    – Matching search queries and Web page content to online advertisements


?   ?   ?   ?

Each ? represents an “ad slot”. In a fraction of a second, algorithms predict
which ads you are most likely to click on (from thousands of ads).


The ads that are most likely to lead to a click are selected and displayed.


Applications of Text Analysis (continued)
• Personalization
    – Creating customized Web pages, newspapers, interfaces for individuals

• Autocompletion
    – Predicting words to improve user interfaces on smartphones

• Corpus exploration
    – Developing visualization and search tools for researchers and lawyers exploring
      millions of patents

• Information extraction
    – Extracting mentions of entities (people, places, companies, …) from text
        • e.g., “Mr. Obama traveled to London to meet Mr. Cameron”
    – Extraction of relations
        • e.g., travel_to(Obama, London), meet(Obama, Cameron)
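
To make the entity and relation example concrete, here is a toy sketch that recovers the two relations from the slide's sentence. It uses a single hand-written regular expression that only matches sentences shaped like that example; the pattern and the function name are hypothetical, and real information extraction systems rely on far more general, usually learned, methods.

```python
import re

# Hypothetical pattern, written only for sentences shaped like the slide's example.
TITLE = r"(?:Mr\.|Ms\.|Mrs\.|Dr\.)"
PATTERN = re.compile(
    TITLE + r"\s+(?P<subj>[A-Z]\w+) traveled to (?P<place>[A-Z]\w+)"
    r" to meet " + TITLE + r"\s+(?P<obj>[A-Z]\w+)"
)

def extract_relations(sentence):
    """Return (relation, arg1, arg2) tuples found by the single toy pattern."""
    relations = []
    for m in PATTERN.finditer(sentence):
        relations.append(("travel_to", m.group("subj"), m.group("place")))
        relations.append(("meet", m.group("subj"), m.group("obj")))
    return relations

for rel in extract_relations("Mr. Obama traveled to London to meet Mr. Cameron."):
    print(rel)
```

Any rewording of the sentence (for example, "flew to") defeats the pattern, which is one reason extraction is usually treated as a learning problem.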


Applications of Text Analysis (continued)
• Automated Dialog Agents
    – Bots that can carry on a conversation/dialog with a human via text
    – E.g., applications to answering customer inquiries (e.g., for troubleshooting)

• Text Summarization
    – Automated summaries of text documents
         • In applications such as law, medicine, etc

• Automated Essay Grading
    – E.g., for SAT, AP, GRE exams, or for online courses

• Natural Language Generation (NLG) or Text Synthesis
    – Applications to automated generation of news stories
    – Automatically generating replies to customer emails


Application: Text Synthesis

                              Graphic from: https://automatedinsights.com/examples/


CHALLENGES FROM AN AI PERSPECTIVE


Ambiguity in Human Language

            She saw the man with the telescope.


 Another Example

One morning I shot an elephant in my pajamas.
How he got into my pajamas I'll never know.
                               - Groucho Marx


And many more….

  – Enraged Cow Injures Farmer with Ax

  – Ban on Nude Dancing on Governor’s Desk

  – Teacher Strikes Idle Kids

  – Hospitals Are Sued by 7 Foot Doctors

  – Iraqi Head Seeks Arms

  – Kids Make Nutritious Snacks

  – Local HS Dropouts Cut in Half


Many ways to say the same thing

    She gave the book to Tom vs. She gave Tom the book

       Some kids popped by vs. A few children visited

    Is that window still open? vs. Please close the window


Language understanding is far from a solved problem….


Word Sparsity

    High-frequency words:   the, of, to, and

    Low-frequency words:    cornflakes, mathematicians, fuzziness, jumbling,
                            pseudo-rapporteur, lobby-ridden, perfunctorily,
                            Lycketoft, UNCITRAL, H-0695, policyfor, Commissioneris

    >1/3 occur only once
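
The share of words that occur only once can be measured on any text with a few lines of Python. The sketch below assumes the figure refers to distinct word types (vocabulary entries) rather than running tokens; the tokenization rule and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

def singleton_fraction(text):
    """Fraction of distinct word types that occur exactly once in text."""
    tokens = re.findall(r"[a-z0-9'-]+", text.lower())
    counts = Counter(tokens)
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)

# Made-up sample: 7 distinct words, 5 of which appear only once.
sample = "the cat and the dog and the bird saw the cornflakes"
print(singleton_fraction(sample))
```

On a tiny sample the fraction is inflated; the interesting observation is that it stays large even for very large corpora.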


EXAMPLES OF POSSIBLE CLASS PROJECTS


Examples of Past CS 175 Student Projects

  Description                                                Data Sets

  Sentiment analysis                                         Twitter (text + sentiment labels)
  Star/score prediction from text                            Yelp, Movie Reviews: text + scores
  Predict number of upvotes for a Reddit post                Reddit posts + votes + timestamps
  Predict if a restaurant will close in the next month       Yelp reviews (text, timestamps, metadata)
  Simulate realistic text from an author/speaker/character   Gutenberg books, tweets, movie scripts
  Automated poetry or song lyrics generation                 Text from song lyrics or poetry
  Automated essay grading                                    Text for student essays with human scores



 Possible Projects: Document Classification

Original document
    |   Tokenization, Stemming, Part of Speech Tagging, etc
    |   (Wednesday’s lecture)
    v
Document features (e.g., a “bag of words”)
    [ (NBA, 7), (Lakers, 3), (basket, 2), …. ]
    |   Classification model, e.g.,
    |    - naïve Bayes
    |    - logistic regression
    |    - neural network
    |   (Assignment 2 and next week’s lectures)
    v
Class label
    Label = basketball
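
The pipeline on this slide (document, then bag of words, then classifier, then label) can be sketched end to end with the standard library alone. The code below is an illustrative multinomial naive Bayes with add-one smoothing, not the course's implementation; the class name, the tokenizer, and the four training sentences are all made up for the example. In practice a library such as scikit-learn would normally be used instead.

```python
import math
import re
from collections import Counter, defaultdict

def bag_of_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_docs = Counter(labels)           # number of documents per class
        self.word_counts = defaultdict(Counter)     # class -> word -> count
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(bag_of_words(doc))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        bag = bag_of_words(doc)
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)   # log prior
            for word, n in bag.items():
                if word in self.vocab:              # skip words never seen in training
                    score += n * math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Made-up training data, for illustration only.
train_docs = [
    "The Lakers won the NBA game with a late basket",
    "NBA playoffs: Lakers guard scores again",
    "Stocks fell as the bank raised interest rates",
    "The market rallied after the bank earnings report",
]
train_labels = ["basketball", "basketball", "finance", "finance"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("Lakers basket wins NBA title"))
```

Stemming and part-of-speech tagging are omitted here; they would slot in inside `bag_of_words`.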


Possible Projects using Document Classification
• Use Wikipedia pages and categories as training data and build a
  classification algorithm that can classify news articles

• Build a sentiment classification algorithm that can predict if a product or
  movie review is positive or negative

• Develop an algorithm that can automatically classify emails into an
  appropriate folder (e.g., for Gmail)

• Conduct a systematic study of how document length, sample size, or other
  factors affect the accuracy of document classifiers on standard data sets


Possible Projects using Document Clustering
• Clustering of Documents:
    – Takes a set of documents (each represented as a bag of words) and
      automatically clusters/groups the documents

• Build an algorithm that can cluster news articles so that articles about the
  same news story end up in the same cluster
    – Note that to do this well may require extraction of information about people
      and places and time from the articles

• Develop a tool to download an individual’s email history (e.g., from Gmail)
  and to group emails into clusters on similar topics
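
As a rough illustration of clustering documents represented as bags of words, the sketch below compares documents by cosine similarity and groups them in a single greedy pass. This is a deliberately simple stand-in for standard algorithms such as k-means; the function names, the 0.3 threshold, and the sample headlines are assumptions made for the example.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two bags of words (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def greedy_cluster(docs, threshold=0.3):
    """Put each document in the first cluster whose seed document is similar enough.

    Each cluster is a list of document indices; its first element is the seed.
    The threshold is an arbitrary illustrative value.
    """
    bags = [bag_of_words(d) for d in docs]
    clusters = []
    for i, bag in enumerate(bags):
        for cluster in clusters:
            if cosine(bag, bags[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])   # no similar cluster found: start a new one
    return clusters

# Made-up headlines, for illustration only.
docs = [
    "earthquake hits city overnight",
    "city recovers after earthquake",
    "team wins championship game",
    "championship game draws record crowd",
]
print(greedy_cluster(docs))
```

The result depends on document order and on the threshold, which is one reason iterative methods are usually preferred for real projects.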


Other Ideas for Projects
(this is just a small partial list…there are many other possibilities!)

• Information Extraction:
    – Extract names of products and companies from news articles
    – Extract names of actors and directors from movie reviews

• Change in Language over Time:
    – Develop an algorithm that can automatically identify key topics in US Patent
      data and track how these topics change over time


Examples of large text data sets that could be used for projects
    – Wikipedia: text from 4 million articles
    – PubMed: 20 million abstracts of biomedical research papers
    – Twitter data: large streams of tweets via Twitter API
    – Enron emails: 250,000 company emails

Reddit Statistics (2015)
    total # posts            668M
    total # users posting    8.2M
    # words per post         30.6
    total # words            >20 billion
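
The total word count follows from the other two figures, as this quick check shows. It assumes "# words per post" is an average over all posts; the variable names are illustrative.

```python
total_posts = 668e6      # total # posts in 2015, from the slide
words_per_post = 30.6    # # words per post, assumed to be an average

total_words = total_posts * words_per_post
print(f"{total_words / 1e9:.1f} billion words")
```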


Output from a Neural Network Model Trained on Shakespeare

        KING LEAR:
        O, if you were a feeble sight, the courtesy of your law,
        Your sight and several breath, will wear the gods
        With his heads, and my hands are wonder'd at the deeds,
        So drop upon your lordship's head, and your opinion
        Shall be against your honour.

        Second Senator:
        They are away this miseries, produced upon my soul,
        Breaking and strongly should be buried, when I perish
        The earth and thoughts of many states.

        DUKE VINCENTIO: Well, your wit is in the care of side and that.

Examples from “The Unreasonable Effectiveness of Recurrent Neural Networks”,
Andrej Karpathy, blog, http://karpathy.github.io/2015/05/21/rnn-effectiveness/
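
For a feel of character-by-character text generation without training a neural network, a much simpler technique can be sketched: a character-level Markov chain. This is not the recurrent neural network from the cited blog post; it only memorizes which character followed each short context in the training text. The function names and the placeholder corpus are assumptions, and `seed` is assumed to have exactly `order` characters.

```python
import random
from collections import defaultdict

def train_char_model(text, order=4):
    """Map each length-`order` context to the list of characters that follow it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model

def generate(model, seed, length=200, rng=None):
    """Extend `seed` one character at a time by sampling observed continuations."""
    rng = rng or random.Random(0)   # fixed default seed so runs are repeatable
    order = len(seed)               # assumption: seed length equals the model order
    out = seed
    for _ in range(length):
        followers = model.get(out[-order:])
        if not followers:           # context never seen in training: stop early
            break
        out += rng.choice(followers)
    return out

# Placeholder training text; a real experiment would use a large corpus.
corpus = "to be or not to be that is the question " * 3
model = train_char_model(corpus, order=4)
print(generate(model, seed=corpus[:4], length=60))
```

With a small `order` the output is gibberish that only looks locally plausible; neural models such as the one in the blog post can capture much longer-range structure.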


Output from a Neural Network Model Trained on Cooking Recipes

 From https://gist.github.com/nylki/1efbaa36635956d35bcc


Assignment 1
Available on the class Web page

Due Wednesday Jan 17th by 5pm (to dropbox on EEE)

Outline
    – Read Sections of Chapter 1 and 3 of the online NLTK book
    – Install Anaconda/NLTK/…
    – Write simple functions in Python for text analysis
        • Compute percentage of alphabetic characters in a string
        • Detect the first K words on a Web page
        • Parse text into parts of speech (nouns, verbs, etc)
    – Submit your code as a single python file via EEE
