Information Retrieval and Web Mining: Boolean Model - José M. Castaño

Information Retrieval and Web Mining
            Boolean Model
             José M. Castaño
Today
  Models
  Boolean Model

  Inverted Files
Basic Concepts
  Document
  Collection
  Information need
  Query
  A document is described by a set of representative keywords (index
  terms)
  Terms: (binary) weights calculated from statistics of their frequency
  in the text
       Terms vs. words/tokens
  Retrieval: a matching process between document terms and query terms
Models
  A model is an embodiment of the theory in which we define a set of
  objects about which assertions can be made and restrict the ways in
  which classes of objects can interact

  A retrieval model specifies the representations used for documents
  and information needs, and how they are compared. (Turtle and Croft,
  1992)
Formal characterization of an IR model

An information retrieval model is a quadruple $[D, Q, F, R(q_i, d_j)]$ where

  $D$ is a set of representations for the documents in the collection
  $Q$ is a set of representations for the user information needs (queries)
  $F$ is a framework for modelling document representations, queries, and
  their relationships
  $R(q_i, d_j)$ is a ranking function which associates a real number with a
  query $q_i \in Q$ and a document representation $d_j \in D$

(Baeza-Yates and Ribeiro-Neto, 1999)
Formal characterization (diagram)
Implementation vs. Model
  An IR model is a formalization of a way of thinking about
  information retrieval
  Compare to implementation: how to operationalize the model in a
  given environment (e.g. file structures)
IR using the Boolean model
  Queries are Boolean expressions, e.g., Caesar AND Brutus
  The IR system returns all documents that satisfy the Boolean expression
  The model is based on set theory and Boolean algebra

  Commercial IR systems (Dialog, Lexis/Nexis)

  Precise semantics
Boolean Model
  First online systems in the 1960s and 1970s
  Most widely used model in commercial IR

  Advanced feature in most other systems
  AND, OR, NOT, (), precedence + terms

  Usually supplemented with proximity operators
  Requires an exact match

  Based on an inverted file
Drawbacks of the Boolean system
  Retrieval based on binary decision criteria with no notion of partial
  matching
  No ranking of the documents is provided (absence of a grading scale)

  Information need has to be translated into a Boolean expression
  which most users find awkward
  The Boolean queries formulated by the users are most often too
  simplistic
  As a consequence, the Boolean model frequently returns either too
  few (conjunctive) or too many (disjunctive) documents in response to
  a user query
Extensions of the Boolean system
  How to extend the Boolean model (past focus)
      partial matching
      ranking

  Two extensions of the Boolean model:
      Fuzzy Set Model
      Extended Boolean Model
Incidence matrix
Term vectors
  So we have a 0/1 vector for each term.

  To answer the query: take the vectors for Brutus, Caesar and Calpurnia
  (complemented) and bitwise AND them.

  110100 AND 110111 AND 101111 = 100100.
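The bitwise step above can be reproduced with ordinary integers. This is only a sketch: the six-document width and the uncomplemented Calpurnia vector (010000) are assumptions chosen so that its complement is the 101111 shown in the example, and the leftmost bit is taken to be document 1.

```python
# Toy 0/1 incidence vectors over six documents, stored as Python ints.
# Assumption: the leftmost bit stands for document 1.
N_DOCS = 6
MASK = (1 << N_DOCS) - 1  # 0b111111, keeps the complement within six bits

brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000  # assumed, so that its complement is 101111

not_calpurnia = ~calpurnia & MASK
answer = brutus & caesar & not_calpurnia

# Turn the answer vector back into 1-based document numbers.
matching_docs = [i + 1 for i, bit in enumerate(format(answer, "06b")) if bit == "1"]

print(format(not_calpurnia, "06b"))  # 101111
print(format(answer, "06b"))         # 100100
print(matching_docs)                 # [1, 4]
```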
           

                

                             

                                  
A larger corpus?

Consider N = 1M documents, each with about 1K terms.
At an average of 6 bytes/term (including spaces and punctuation), that is
6 GB of data in the documents.
Say there are m = 500K distinct terms among these.
Incidence matrix

A 500K x 1M matrix has half a trillion 0s and 1s.
But it has no more than one billion 1s.
The matrix is extremely sparse. Why?
What's a better representation?
We only record the 1 positions.
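The sizes quoted for the larger corpus and its incidence matrix can be checked with a few lines of arithmetic. The sketch below uses only the slides' estimates; the variable names are illustrative. It also shows where the one-billion bound comes from: each of the 1M documents contributes at most about 1K distinct terms.

```python
# Back-of-the-envelope figures; all inputs are the slides' estimates.
n_docs = 1_000_000          # N = 1M documents
terms_per_doc = 1_000       # about 1K terms per document
bytes_per_term = 6          # average, including spaces and punctuation
n_distinct_terms = 500_000  # m = 500K distinct terms

collection_bytes = n_docs * terms_per_doc * bytes_per_term
matrix_cells = n_distinct_terms * n_docs
max_ones = n_docs * terms_per_doc  # each term occurrence sets at most one cell
density = max_ones / matrix_cells

print(collection_bytes)  # 6000000000 (6 GB)
print(matrix_cells)      # 500000000000 (half a trillion)
print(max_ones)          # 1000000000 (one billion)
print(density)           # 0.002, i.e. at most 0.2% of the cells are 1
```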
Inverted Index

Construction (Inverted Index)
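A minimal sketch of how such an index could be built in memory, recording only the "1 positions" of the incidence matrix. The function name `build_inverted_index`, the whitespace tokenization, and the three toy documents are assumptions for illustration, not the construction procedure shown in the original figures.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of the docIDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenization (an assumption): lowercase, split on whitespace.
        for token in text.lower().split():
            postings[token].add(doc_id)
    # Keep each postings list sorted by docID so lists can be merged linearly.
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical toy collection.
docs = {
    1: "Brutus killed Caesar",
    2: "Caesar was ambitious",
    3: "Calpurnia warned Caesar",
}
index = build_inverted_index(docs)
print(index["caesar"])  # [1, 2, 3]
print(index["brutus"])  # [1]
```

The dictionary keys play the role of the dictionary of terms, and each value is that term's postings list.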

Query processing

Consider processing the query:
Brutus AND Caesar

Locate Brutus in the dictionary;
retrieve its postings.
Locate Caesar in the dictionary;
retrieve its postings.
Merge the two postings lists.
Merge

Walk through the two postings simultaneously, in time linear in the total
number of postings entries

If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings sorted by docID.

Intersecting (merging) two postings lists

MERGE(p1, p2)
  answer ← ⟨ ⟩
  while p1 ≠ NIL and p2 ≠ NIL
  do if docID(p1) = docID(p2)
       then ADD(answer, docID(p1))
            p1 ← next(p1)
            p2 ← next(p2)
       else if docID(p1) < docID(p2)
              then p1 ← next(p1)
              else p2 ← next(p2)
  return answer
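The linear merge can be sketched in Python as follows, assuming postings lists are plain Python lists of docIDs sorted in increasing order. The function name `intersect` and the sample postings are illustrative.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same docID in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


# Hypothetical postings lists.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
print(intersect(brutus, caesar))  # [1, 2, 4]
```

Each loop iteration advances at least one pointer, so the number of iterations is at most x + y for lists of lengths x and y. This only works because both lists are sorted by docID.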
More general queries

How does the algorithm adapt to queries such as these?
Brutus AND NOT Caesar
Brutus OR NOT Caesar

Is the merge still O(x+y), or what is the complexity?

What about a Boolean formula such as:
(Brutus OR Caesar) AND NOT (Antony OR Cleopatra)

Is it always linear?
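One possible answer for the AND NOT case, as a sketch: the same two-pointer walk still runs in O(x+y), because the second pointer only ever moves forward. The function name `and_not` and the sample lists are assumptions. OR NOT is different in kind: its result contains every document missing from the second list, so its cost depends on the size of the collection rather than on x + y alone.

```python
def and_not(p1, p2):
    """Return the docIDs in p1 that do not occur in p2 (both sorted by docID)."""
    answer = []
    i, j = 0, 0
    while i < len(p1):
        # Skip every p2 entry smaller than the current p1 entry.
        while j < len(p2) and p2[j] < p1[i]:
            j += 1
        # Keep the p1 entry unless p2 holds the same docID.
        if j == len(p2) or p2[j] != p1[i]:
            answer.append(p1[i])
        i += 1
    return answer


# Hypothetical postings lists.
print(and_not([1, 2, 4, 11, 31], [2, 31, 54]))  # [1, 4, 11]
```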
Exact Match
  The Boolean retrieval model lets us pose any query that is a
  Boolean expression:
      Boolean queries use AND, OR and NOT to join query terms
         Each document is viewed as a set of words
         Matching is precise: a document either matches the condition or it does not.

  Primary commercial retrieval tool for 3 decades.
  Professional searchers (e.g., lawyers) still like Boolean queries:
       You know exactly what you're getting.
Example: Westlaw
  Largest commercial (paying subscribers) legal search service (started
  1975; ranking added 1992)

  Tens of terabytes of data; 700,000 users
  The majority of users still use Boolean queries

  Example query:
      What is the statute of limitations in cases involving the federal
      tort claims act?
      LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
  /3 = within 3 words, /S = in same sentence
Another example query:
    Requirements for disabled people to be able to access a workplace
    disabl! /p access! /s work-site work-place (employment /3 place)

Note that SPACE is disjunction, not conjunction!
Long, precise queries; proximity operators; incrementally developed;
not like web search

Preference for Boolean search:
    Precision, transparency and control

But that doesn't mean Boolean queries actually work better
Optimization

What is the best order for query processing?
Consider a query that is an AND of t terms.
For each of the t terms, get its postings, then AND them together.
Optimization
   Process in order of increasing frequency:
        start with the smallest set, then keep cutting further.

Execute as (Caesar AND Brutus) AND Calpurnia
Optimization: intersecting a set of terms

MERGE(⟨t_1, ..., t_n⟩)
  terms ← SORTBYFREQ(⟨t_1, ..., t_n⟩)
  result ← postings(first(terms))
  terms ← rest(terms)
  while terms ≠ NIL and result ≠ NIL
  do list ← postings(first(terms))
     terms ← rest(terms)
     result ← MERGE(result, list)
  return result
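A sketch of the frequency-ordered conjunction in Python, assuming the index is a dict from term to a docID-sorted postings list, so that list length stands in for term frequency. The names `intersect` and `intersect_terms` and the toy index are illustrative.

```python
def intersect(p1, p2):
    """Linear merge of two postings lists sorted by docID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_terms(terms, index):
    """AND together the postings of all terms, rarest term first."""
    if not terms:
        return []
    # Shortest postings list first: intermediate results can only shrink.
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    result = index.get(ordered[0], [])
    for term in ordered[1:]:
        if not result:
            break  # nothing left to intersect, stop early
        result = intersect(result, index.get(term, []))
    return result


# Hypothetical index.
index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}
print(intersect_terms(["brutus", "caesar", "calpurnia"], index))  # [2]
```

Here Calpurnia has the shortest list, so it is processed first and the intermediate result never exceeds four entries.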
Generalizing the optimization

(Caesar OR Brutus) AND (Hamlet OR Cordelia)

Get the frequencies for all terms.
Estimate the size of each OR by the sum of its frequencies (conservative).
Process in increasing order of OR sizes.
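The size estimate can be sketched as below. The document frequencies are invented for illustration, and `order_or_groups` is a hypothetical helper name. The sum is an upper bound on the size of each OR, since a document containing both terms is counted twice.

```python
def order_or_groups(groups, doc_freq):
    """Order OR-groups by the sum of their terms' document frequencies."""
    def estimated_size(group):
        # Upper bound on the size of the union of the group's postings.
        return sum(doc_freq.get(term, 0) for term in group)

    return sorted(groups, key=estimated_size)


# Hypothetical document frequencies.
doc_freq = {"caesar": 900, "brutus": 300, "hamlet": 150, "cordelia": 40}
groups = [("caesar", "brutus"), ("hamlet", "cordelia")]
print(order_or_groups(groups, doc_freq))
# [('hamlet', 'cordelia'), ('caesar', 'brutus')]
```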
Example

What is the best order for processing:
(tangerine OR trees) AND
(marmalade OR skies) AND
(kaleidoscope OR eyes)
Beyond terms
  What about phrases?
      Stanford University

  Proximity: Find Gates NEAR Microsoft.
      A constraint on AND
      Need an index that captures position information in docs

  Zones in documents:
      Find documents with (author = Ullman) AND (text contains
      automata).
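To make the "position information" requirement concrete, here is a minimal sketch of a positional index and a NEAR check. The window size `k`, the function names, and the toy documents are all assumptions. The pairwise position comparison is kept deliberately simple and is not how a production system would do it.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {docID: [word positions]} with naive whitespace tokens."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def near(index, term1, term2, k):
    """DocIDs where term1 and term2 occur within k word positions."""
    docs1, docs2 = index.get(term1, {}), index.get(term2, {})
    hits = []
    # NEAR is AND plus a distance constraint, so only shared docs qualify.
    for doc_id in sorted(docs1.keys() & docs2.keys()):
        if any(abs(a - b) <= k for a in docs1[doc_id] for b in docs2[doc_id]):
            hits.append(doc_id)
    return hits


# Hypothetical toy collection.
docs = {
    1: "gates founded microsoft",
    2: "gates spoke at length about many things unrelated to microsoft",
    3: "microsoft released a product",
}
index = build_positional_index(docs)
print(near(index, "gates", "microsoft", 3))  # [1]
```

Document 2 satisfies the plain AND but fails the distance constraint, which is exactly the information a docID-only postings list cannot provide.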
Within-document frequency

1 vs. 0 occurrences of a search term
2 vs. 1 occurrences
3 vs. 2 occurrences, etc.
Usually more seems better
Need term frequency information in docs
Ranking

Boolean queries give inclusion or exclusion of docs.
Often we want to rank/group results
In practice: order chronologically
Need to measure proximity from query to each doc.
Need to decide whether docs presented to user are singletons, or a group
of docs covering various aspects of the query.

Extended Boolean

The Boolean model is simple and elegant.
But it makes no provision for ranking.
Fuzzy model: ranking by relaxing the condition on set membership. (No
evaluation on standard test sets.)
Extend the Boolean model with the notions of partial matching and term
weighting
Combine characteristics of the vector model with properties of
Boolean algebra
The p-norm model is the most famous
Usually impractical to implement
Usually hard for the user to understand
Pseudo-Boolean Queries
  A new notation, from web search
      +cat dog +collar leash
      These are prefix operators

  Does not mean the same thing as AND/OR!
      + means mandatory, must be in document
      - means cannot be in the document
  Phrases:
      “stray cat” AND “frayed collar”
      is equivalent to
      “+stray cat +frayed collar”
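A small sketch of how the prefix operators could be interpreted for single-word terms. It assumes, as many web engines do, that unprefixed terms are optional and do not affect whether a document is included; that treatment is an assumption here, and phrase handling is left out. The function names are hypothetical.

```python
def parse_prefix_query(query):
    """Split a query into required (+), excluded (-) and optional terms."""
    required, excluded, optional = [], [], []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        else:
            optional.append(token)
    return required, excluded, optional


def matches(doc_text, query):
    """True if the document has every + term and none of the - terms."""
    words = set(doc_text.lower().split())
    required, excluded, _optional = parse_prefix_query(query)
    has_required = all(term in words for term in required)
    has_excluded = any(term in words for term in excluded)
    return has_required and not has_excluded


print(parse_prefix_query("+cat dog +collar leash"))
# (['cat', 'collar'], [], ['dog', 'leash'])
print(matches("my cat lost its collar", "+cat dog +collar leash"))  # True
print(matches("my dog has a leash", "+cat dog +collar leash"))      # False
```

The second document contains both unprefixed terms yet is rejected, which illustrates why this notation is not the same as AND/OR.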
Result Sets
  Run a query, get a result set

  Two choices
       Reformulate query, run on entire collection
       Reformulate query, run on result set
  Example: Dialog query
       (Redford AND Newman)
       -> S1 1450 documents
       (S1 AND Sundance)
       ->S2 898 documents
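Running a reformulated query against an earlier result set amounts to intersecting with a stored set. The sketch below mimics the S1/S2 steps with invented docIDs; it does not reproduce the document counts of the Dialog example.

```python
# Hypothetical postings, stored as sets of docIDs.
index = {
    "redford": {1, 2, 3, 5, 8},
    "newman": {2, 3, 5, 9},
    "sundance": {3, 5, 7},
}

s1 = index["redford"] & index["newman"]  # (Redford AND Newman) -> S1
s2 = s1 & index["sundance"]              # (S1 AND Sundance)    -> S2

print(sorted(s1))  # [2, 3, 5]
print(sorted(s2))  # [3, 5]
```

Refining on S1 touches only the documents already retrieved, instead of re-running the longer query on the entire collection.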
Faceted Queries
  Strategy: break query into facets
       conjunction of disjunctions
       each facet expresses a topic
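A faceted query can be evaluated as an intersection of unions. In this sketch each facet is a tuple of alternative terms; the function name `faceted_query`, the terms, and the postings are invented for illustration.

```python
def faceted_query(facets, index):
    """AND together the OR of each facet's postings (sets of docIDs)."""
    # One union per facet: any of the facet's terms may match.
    facet_sets = [
        set().union(*(index.get(term, set()) for term in facet))
        for facet in facets
    ]
    if not facet_sets:
        return set()
    # Every facet must be satisfied, so intersect the unions.
    return set.intersection(*facet_sets)


# Hypothetical postings.
index = {
    "car": {1, 2},
    "automobile": {3},
    "repair": {2, 3, 4},
    "maintenance": {5},
}
facets = [("car", "automobile"), ("repair", "maintenance")]
print(sorted(faceted_query(facets, index)))  # [2, 3]
```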