Tutorial: Experimenting IR/NLP with Terrier


                  Parth Gupta
              pgupta@dsic.upv.es

          Technical University of Valencia, Spain
References

• References used extensively in this tutorial:
  ◦ “Tutorial: Large-scale Information Retrieval Experimentation with
    Terrier”, CIKM 2011.
  ◦ Terrier 3.5 documentation: http://terrier.org/docs/v3.5/

Terrier IR Platform

• Efficient - MapReduce support, very fast indexing and retrieval,
  compressed data structures
• Effective - Many IR models such as TF-IDF, BM25, LM and DFR,
  with many field-based weighting schemes and proximity options
• Flexible - Runs on multiple platforms, including Windows, Linux
  and MacOS
• Multilingual - Supports many languages

Other Search Engine Options

• Non-Academic
  ◦ Lucene/Nutch/Solr (Apache)
          • Java
          • Basic models
  ◦ Xapian (Cambridge)
          • C++ (many bindings available)
          • Very fast
          • Basic models
  ◦ Sphinx (Sphinx Inc.)
          • C++
          • Tightly coupled with DBs
          • Very basic models
          • No relevance feedback
• Academic
  ◦ Terrier (Glasgow)
          • Java
          • Advanced models including DFR, LM etc.
          • Advanced pseudo-RF modules
  ◦ Lemur/Indri (CMU/UMass)
          • C++
          • Advanced models except the DFR family

Content of this Tutorial

• Covered
  ◦ Designing and executing IR/NLP experiments with Terrier
  ◦ Using parts of Terrier in your own application, such as
          • Tokeniser, Stemmer
          • Similarity Scores
          • Relevance Feedback etc.
  ◦ Analysis with Terrier
• Not Covered
  ◦ MapReduce Support
  ◦ Web Support (JSP)

Installation

• Get Terrier
  ◦ Download the latest version (v3.5) free of charge from http://terrier.org/
• Requirements
  ◦ Java JDK 1.6 or greater
  ◦ Eclipse (just for this tutorial!)
• Setup
  ◦ Extract the archive and it is ready to use.

IR Basics

Basic IR Concepts

• Crawling
  ◦ Crawl the necessary part of the Web and prepare a static collection of
    documents

• Indexing
   ◦ Preprocess each document to convert it into plain text (ASCII or UTF-8)
   ◦ Stop-word removal [Term Pipeline]
   ◦ Stemming [Term Pipeline]
   ◦ Store the relevant statistics of terms and documents, such as term
     frequency (TF) [per document and per collection] and document
     length, in the direct and inverted indexes.
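
The term pipeline steps listed above can be sketched in a few lines of plain Java. This is only an illustration of the idea, not Terrier code: the class name TermPipelineSketch, the five-word stop-word list and the naiveStem helper (which merely strips a trailing plural "s") are all assumptions made for the example; a real setup would use a full stop-word list and a proper stemmer such as Porter's.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TermPipelineSketch {
    // Tiny illustrative stop-word list (assumption); real lists are much longer.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("is", "in", "the", "a", "of"));

    // Hypothetical stand-in for a stemmer: only strips a trailing plural "s".
    public static String naiveStem(String term) {
        boolean plural = term.length() > 3 && term.endsWith("s") && !term.endsWith("ss");
        return plural ? term.substring(0, term.length() - 1) : term;
    }

    // Tokenise, lower-case, drop stop-words, then stem each remaining term.
    public static List<String> process(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty() || STOPWORDS.contains(token)) {
                continue;
            }
            terms.add(naiveStem(token));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(process("I2R is in Singapore"));
    }
}
```

Running the same process method over both documents and queries is what keeps their terms comparable at retrieval time.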

• Query Normalisation
   ◦ Pass the query through the same term pipeline as the documents

• Ranking
   ◦ The simplest, yet powerful, model is TF-IDF
       $Score(Q, D) = \sum_{i=1}^{n} tf(q_i, D) \cdot idf(q_i)$

   ◦ $tf(q_i, D)$ = frequency of term $q_i$ in $D$.
   ◦ $idf(q_i) = \log\left(\frac{N}{\text{\# of docs containing } q_i}\right)$, where $N$ is the total number of documents.

• TF-IDF Scoring Example
    ◦ Doc1 = I2R is in Singapore
    ◦ Doc2 = I2R is in SG
    ◦ Doc3 = UPV is in Valencia
    ◦ Q = i2r sg

• Ranking
  ◦ Score(Q, Doc1) = (1+0)*(0.64) = 0.64 (Rank 2)
  ◦ Score(Q, Doc2) = (1+1)*(0.64) = 1.28 (Rank 1)
  ◦ Score(Q, Doc3) = (0+0)*(0.64) = 0.0  (Rank 3)
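
The example can be reproduced with a small self-contained Java program. This is a sketch under stated assumptions: it computes a separate idf per query term with the natural logarithm, lower-cases the text and does no stop-word removal or stemming. The class name TfIdfExample and its helper methods are made up for the illustration. Because the slide applies a single weight of 0.64 to both query terms, the absolute scores differ from the ones listed, but the resulting ranking (Doc2, Doc1, Doc3) is the same.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TfIdfExample {
    // Lower-case and split on whitespace; a stand-in for a real tokeniser.
    public static List<String> tokenise(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    // The three example documents from the slide, tokenised.
    public static List<List<String>> corpus() {
        List<List<String>> docs = new ArrayList<>();
        docs.add(tokenise("I2R is in Singapore"));
        docs.add(tokenise("I2R is in SG"));
        docs.add(tokenise("UPV is in Valencia"));
        return docs;
    }

    // tf(q, D): number of occurrences of the term in the document.
    public static int tf(String term, List<String> doc) {
        return Collections.frequency(doc, term);
    }

    // idf(q) = ln(N / number of documents containing q); 0 for unseen terms.
    public static double idf(String term, List<List<String>> docs) {
        int df = 0;
        for (List<String> doc : docs) {
            if (doc.contains(term)) {
                df++;
            }
        }
        return df == 0 ? 0.0 : Math.log((double) docs.size() / df);
    }

    // Score(Q, D) = sum over query terms of tf * idf.
    public static double score(String query, List<String> doc, List<List<String>> docs) {
        double total = 0.0;
        for (String q : tokenise(query)) {
            total += tf(q, doc) * idf(q, docs);
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> docs = corpus();
        for (int i = 0; i < docs.size(); i++) {
            System.out.printf("Doc%d: %.3f%n", i + 1, score("i2r sg", docs.get(i), docs));
        }
    }
}
```

With these assumptions Doc1 scores ln(3/2), Doc2 scores ln(3/2) + ln(3), and Doc3 scores 0, since "sg" is rarer than "i2r" and therefore carries the larger idf.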

Other Unsupervised Ranking Models

• BM25 - Probabilistic Model
• Language Model for IR

Terrier: Indexing

Indexing

Collection

Document

• UTFTokeniser

TermPipeline

• Stopwords Removal
• Stemmer
  ◦ PorterStemmer, WeakPorterStemmer
  ◦ SnowballStemmer (http://snowball.tartarus.org/)

Indexers

• Indexing
   ◦ Single-pass Indexing - Inverted Index only
   ◦ Two-pass Indexing - Inverted Index + DirectIndex
• Indexing structures
   ◦ InvertedIndex
   ◦ DirectIndex
   ◦ Lexicon
   ◦ DocumentIndex
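
To make the roles of these structures concrete, here is a toy in-memory version in plain Java. It is a sketch only and does not reflect Terrier's compressed on-disk formats: the class IndexStructuresSketch and its fields are hypothetical. The inverted map plays the role of the InvertedIndex (term to documents), the direct map the DirectIndex (document to terms), docLengths the DocumentIndex, and the two frequency methods return what a Lexicon entry would hold.

```java
import java.util.Map;
import java.util.TreeMap;

public class IndexStructuresSketch {
    // Inverted index: term -> (docId -> term frequency in that document).
    public final Map<String, Map<Integer, Integer>> inverted = new TreeMap<>();
    // Direct index: docId -> (term -> term frequency in that document).
    public final Map<Integer, Map<String, Integer>> direct = new TreeMap<>();
    // Document index: docId -> document length in tokens.
    public final Map<Integer, Integer> docLengths = new TreeMap<>();

    // Index one document; tokens are lower-cased and split on whitespace.
    public void addDocument(int docId, String text) {
        int length = 0;
        for (String term : text.toLowerCase().trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            inverted.computeIfAbsent(term, t -> new TreeMap<>()).merge(docId, 1, Integer::sum);
            direct.computeIfAbsent(docId, d -> new TreeMap<>()).merge(term, 1, Integer::sum);
            length++;
        }
        docLengths.put(docId, length);
    }

    // Lexicon-style statistic: number of documents containing the term.
    public int documentFrequency(String term) {
        Map<Integer, Integer> postings = inverted.get(term);
        return postings == null ? 0 : postings.size();
    }

    // Lexicon-style statistic: total occurrences of the term in the collection.
    public int collectionFrequency(String term) {
        Map<Integer, Integer> postings = inverted.get(term);
        int total = 0;
        if (postings != null) {
            for (int tf : postings.values()) {
                total += tf;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        IndexStructuresSketch index = new IndexStructuresSketch();
        index.addDocument(0, "I2R is in Singapore");
        index.addDocument(1, "I2R is in SG");
        System.out.println(index.inverted);
        System.out.println(index.direct);
    }
}
```

The direct index is what makes per-document questions cheap (for example, listing the terms of one document), which is why tasks such as relevance feedback need it in addition to the inverted index.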

Single-pass and Two-pass Indexing

Field based Indexing

Indexing: Hands-on

Installation of Java and Eclipse

Set up

• Download the Java Platform (JDK) 7u17 from
   http://www.oracle.com/technetwork/java/javase/downloads/index.html
    ◦ Linux: select the package for your distro
    ◦ Windows: .exe
    ◦ MacOS: .dmg
• Download Eclipse [Eclipse IDE for Java EE Developers] from
   http://www.eclipse.org/downloads/index-developer.php
• Installing Eclipse: just extract it and it is ready to use.

Terrier Directory Structure

 bin       -   Scripts to run Terrier
 doc       -   Documentation
 etc       -   Configuration files
 lib       -   Required Java libraries (.jar files)
 share     -   Utility files like stopword list
 src       -   Source code
 var       -   Index and results directory

TREC style experiments

• IR evaluation forums such as TREC, CLEF, NTCIR and FIRE usually
   release the data, the query list and the relevance judgments (qrels).
• The task is to submit runs, which the organisers evaluate.
• This is the most conventional kind of IR experiment, usually
   called the ad hoc track; it can be monolingual or cross-lingual.
• Terrier has built-in support for running such experiments painlessly.
• The advantage is that most weighting models, such as TF-IDF, BM25,
   DFR and LM, are already implemented and ready to serve as
   baselines.
• You just need to implement your improvement and compare it with
   these baselines.

Indexing with Terrier

 # Create the list of files that need to be indexed
 > ./bin/trec_setup.sh

 # Modify the indexing properties (see the next two slides)

 # Index the documents listed in collection.spec; the index is written to var/index/
 > ./bin/trec_terrier.sh -i

Default terrier.properties file

#default controls for query expansion
querying.postprocesses.order=QueryExpansion
querying.postprocesses.controls=qe:QueryExpansion

#default and allowed controls
querying.default.controls=
querying.allowed.controls=qe,start,end,qemodel

#document tags specification
#for processing the contents of
#the documents, ignoring DOCHDR
TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
TrecDocTags.skip=DOCHDR

#query tags specification
TrecQueryTags.doctag=TOP
TrecQueryTags.idtag=NUM
TrecQueryTags.process=TOP,NUM,TITLE
TrecQueryTags.skip=DESC,NARR

#stop-words file
stopwords.filename=stopword-list.txt

#the processing stages a term goes through
termpipelines=Stopwords,PorterStemmer
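
The file uses the standard Java properties syntax (key=value pairs, # for comments), so its entries can be inspected with java.util.Properties. The sketch below is independent of Terrier, which loads this file itself; the class PropertiesSketch and its method are hypothetical. It reads the termpipelines entry and splits it into the ordered list of stages a term passes through.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

public class PropertiesSketch {
    // Parse properties text and return the ordered term pipeline stages.
    public static List<String> termPipeline(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            // A StringReader should not fail; rethrow unchecked just in case.
            throw new IllegalStateException(e);
        }
        String value = props.getProperty("termpipelines", "").trim();
        if (value.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(value.split("\\s*,\\s*"));
    }

    public static void main(String[] args) {
        String text = "#the processing stages a term goes through\n"
                + "termpipelines=Stopwords,PorterStemmer\n";
        System.out.println(termPipeline(text));
    }
}
```

The order of the comma-separated names matters: with the default value, stop-words are removed before the remaining terms are stemmed.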

Properties

• Terrier offers many configuration options that you can set without
   even looking at the source code.
• Walk through the terrier.properties.sample file located at
   $terrier_home/etc
• Walk through the properties page
   http://terrier.org/docs/v3.5/properties.html

Printing Index

> ./bin/trec_terrier.sh --printstats

> ./bin/trec_terrier.sh --printlexicon

america,term631 Nt=2 TF=2 @0 55 5
# format: term,termid DF TF @file_number start_offset_in_inv_index start_bit_offset_in_inv_index

> ./bin/trec_terrier.sh --printinverted

901 (0,2) (3,2) (4,3) (6,5) (7,1) (8,3)
902 (4,1) (8,2)

> ./bin/trec_terrier.sh --printdirect

8 (1,3) (5,11) (13,1) (15,1) (20,1) (26,7) (28,1) (30,1) (33,1) (35,1) (38,1) (43,1)...

> ./bin/trec_terrier.sh --printdocid

1: 175 136@0,20,1
# format: id: doc_length entries@pointer info
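
If you want to post-process these dumps, the posting lines are easy to parse. The sketch below is plain Java and makes an assumption the slide does not state explicitly: that each line starts with an id (a term id for --printinverted, a document id for --printdirect) followed by (id, frequency) pairs. The class PostingLineParser and its methods are hypothetical helpers, not part of Terrier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PostingLineParser {
    // Matches one "(number,number)" pair.
    private static final Pattern PAIR = Pattern.compile("\\((\\d+),(\\d+)\\)");

    // The number that starts the line (assumed to be a term id or a doc id).
    public static int leadingId(String line) {
        return Integer.parseInt(line.trim().split("\\s+")[0]);
    }

    // All pairs on the line, each as {first, second}.
    public static List<int[]> parsePairs(String line) {
        List<int[]> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            pairs.add(new int[] {Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2))});
        }
        return pairs;
    }

    // Sum of the second component, i.e. the total frequency under the assumption above.
    public static int totalFrequency(String line) {
        int total = 0;
        for (int[] pair : parsePairs(line)) {
            total += pair[1];
        }
        return total;
    }

    public static void main(String[] args) {
        String line = "901 (0,2) (3,2) (4,3) (6,5) (7,1) (8,3)";
        System.out.println(leadingId(line) + " appears in " + parsePairs(line).size()
                + " entries, total frequency " + totalFrequency(line));
    }
}
```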

Terrier API with Eclipse

Eclipse Welcome Screen

Eclipse Home

Starting Point - Hello World!
• Extract terrier-tut-code.zip
• File → New → Project
• Select “Java Project from an Existing Ant Buildfile” → Next
• Select “Browse” → Select the “build.xml” file from the just
  extracted “terrier-tut-code” directory
• Finish

    package i2r.hlt;

    public class HelloWorld {
      public static void main(String[] args) {
         System.out.println("Hello World!");
      }
    }

Code walk-through

• HelloWorld.java and HelloWorldAdvanced.java
• Eclipse Error Suggestion System
• How to run and debug with Eclipse
• Basic Java details
  ◦ Java Objects
  ◦ Javadoc

Indexing with Eclipse

• Indexing.java
• IndexAnalysis.java

Using the API

• Most of the time we are not running ad hoc experiments; instead we
  need individual components of the search engine API.
• For example,
    ◦ I need the “term frequency” of term X in document Y.
    ◦ I need the top 100 documents similar to “my” document using
      TF-IDF/BM25/LM.
    ◦ I need a tokenised, stop-word-free and stemmed version of “my”
      text.
    ◦ I need the top 10 words of document X based on TF / IDF / TF-IDF.
    ◦ I need the TF of term X and the IDF of term Y.
    ◦ I need to compute the term-document matrix for this collection.
    ◦ ... and many more.
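
As an illustration of one of these needs, the term-document matrix, here is a minimal plain-Java sketch. It does not use the Terrier index; the tutorial's own TDMatrix.java does that. The class TermDocumentMatrixSketch and its methods are assumptions made for this example, and tokenisation is a simple lower-cased whitespace split without stop-word removal or stemming.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class TermDocumentMatrixSketch {
    // Sorted list of the distinct lower-cased terms across all documents.
    public static List<String> vocabulary(List<String> docs) {
        TreeSet<String> vocab = new TreeSet<>();
        for (String doc : docs) {
            for (String term : doc.toLowerCase().trim().split("\\s+")) {
                if (!term.isEmpty()) {
                    vocab.add(term);
                }
            }
        }
        return new ArrayList<>(vocab);
    }

    // matrix[t][d] = frequency of vocabulary term t in document d.
    public static int[][] build(List<String> docs, List<String> vocab) {
        int[][] matrix = new int[vocab.size()][docs.size()];
        for (int d = 0; d < docs.size(); d++) {
            for (String term : docs.get(d).toLowerCase().trim().split("\\s+")) {
                int row = vocab.indexOf(term);
                if (row >= 0) {
                    matrix[row][d]++;
                }
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "I2R is in Singapore", "I2R is in SG", "UPV is in Valencia");
        List<String> vocab = vocabulary(docs);
        int[][] matrix = build(docs, vocab);
        for (int t = 0; t < vocab.size(); t++) {
            System.out.println(vocab.get(t) + " " + Arrays.toString(matrix[t]));
        }
    }
}
```

The lookup with indexOf is linear in the vocabulary size, which is fine for a toy example; a map from term to row index would be the obvious choice for a real collection.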

How to use Terrier in “your” code?
• It is very easy, and showing how is the main goal of this tutorial.
• You only need to use
  $terrier_home/lib/terrier-3.5-core.jar in your Java
  program, and that's it.
• We will see how everything above can be done without hassle.
• Outline
    ◦ Write a simple program to index our simple text files and customise
      indexing.
    ◦ How to retrieve documents from this index and customise retrieval.
    ◦ How to use Terrier for cross-lingual or multilingual applications.
    ◦ How to extract term and document statistics from the index.
    ◦ How to create a term-document matrix of a collection.
    ◦ How to use query expansion modules, such as Rocchio, in your
      applications.
    ◦ A case study: a chat system - IRIS.

Terrier: Retrieval

Retrieval

Retrieval with Terrier

• Retrieve documents from the index using weighting models such as
   TF_IDF, BM25, etc.
    ◦ Retrieval.java
• Create a term-document matrix for the collection using the Terrier
   index.
  ◦ TDMatrix.java
• Fetch term- and document-related statistics of the indexed
   documents.
  ◦ IndexAnalysis.java
• Get the expanded terms using pseudo-relevance feedback.
  ◦ PseudoRelevanceFeedback.java
• Multilingual IR
  ◦ ? :)

Case Study: You have a new weighting scheme like TF-IDF

• Create a Java class implementing your formula and put it
   in the package org.terrier.matching.models
• Repeat the TREC-style procedure (index, retrieve, evaluate) with
   your weighting scheme instead of PL2
• Submit the runs!
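
The shape of such a class can be sketched without Terrier on the classpath. Everything below is hypothetical: SimpleWeightingModel is a stand-in interface invented for this example, not Terrier's base class, and its method signature is an assumption. In a real experiment you would follow the existing classes in org.terrier.matching.models and check the Terrier 3.5 Javadoc for the exact methods to implement.

```java
public class WeightingSketch {
    // Hypothetical stand-in for a weighting model; NOT Terrier's API.
    public interface SimpleWeightingModel {
        // Name used to identify the model in run files and logs.
        String getInfo();

        // Score of one query term for one document.
        double score(double tf, double documentFrequency, double numberOfDocuments);
    }

    // Plain TF-IDF: tf * ln(N / df), with 0 for terms that do not occur.
    public static class SimpleTfIdf implements SimpleWeightingModel {
        @Override
        public String getInfo() {
            return "SimpleTfIdf";
        }

        @Override
        public double score(double tf, double documentFrequency, double numberOfDocuments) {
            if (tf <= 0 || documentFrequency <= 0) {
                return 0.0;
            }
            return tf * Math.log(numberOfDocuments / documentFrequency);
        }
    }

    public static void main(String[] args) {
        SimpleWeightingModel model = new SimpleTfIdf();
        // A term occurring twice in a document, present in 1 of 3 documents.
        System.out.println(model.getInfo() + ": " + model.score(2, 1, 3));
    }
}
```

Keeping the formula in one small class is what makes it easy to swap your scheme in for a baseline model and compare the resulting runs.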

TREC style experiments with Terrier

 # Create the list of files that need to be indexed
 > ./bin/trec_setup.sh

 # Modify the indexing properties (see the terrier.properties slides)

 # Index the documents listed in collection.spec; the index is written to var/index/
 > ./bin/trec_terrier.sh -i

 # Retrieve documents for the queries in the topics file;
 # the .res files are written to var/results/
 > ./bin/trec_terrier.sh -r -Dtrec.model=PL2 -c 10.99 -Dtrec.topics=

 # Evaluate the .res files against the qrels; the results go into .eval files
 > ./bin/trec_terrier.sh -e -Dtrec.qrels=
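
To give a feel for what the evaluation step measures, the sketch below computes average precision for a single query in plain Java; averaging it over all queries gives mean average precision, a standard measure in TREC-style evaluation. This is an illustration only and not Terrier's evaluation code: the class AveragePrecisionSketch and the document ids are made up, and the exact contents of the .eval files are not reproduced here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AveragePrecisionSketch {
    // Average precision of one ranked list against the set of relevant doc ids.
    public static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        double precisionSum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                // Precision at this rank: relevant documents seen so far / rank.
                precisionSum += (double) hits / (i + 1);
            }
        }
        // Relevant documents that were never retrieved contribute zero.
        return precisionSum / relevant.size();
    }

    public static void main(String[] args) {
        List<String> run = Arrays.asList("doc2", "doc1", "doc3");
        Set<String> qrels = new HashSet<>(Arrays.asList("doc2", "doc3"));
        System.out.println(averagePrecision(run, qrels));
    }
}
```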

Summary

• We have learnt how to use Terrier for “our needs” of IR and NLP.

Thank You! :)
