Exploring Web Archives: Challenges and Solutions - KBS

Exploring Web Archives: Challenges and Solutions

Vaibhav Kasturia (vbh18kas@gmail.com)
Supervisor: Prof. Dr. Wolfgang Nejdl (nejdl@l3s.de)

12 July 2016
Outline

      • Social Media Growth: Twitter
      • Social Media Content Loss
      • Need for Web Archiving
      • Temporal Information
      • Temporal Tagging
      • Applications and Challenges
      • Conclusion

Tremendous Growth of Social Media

What do we Preserve?

How much Social Media Content gets Lost?[1]

        • Culturally Significant Events (June 2009 - March 2012)
              ! H1N1 Virus Outbreak
              ! Syrian Uprising
              ! Egyptian Revolution
              ! Iranian Elections
              ! Michael Jackson’s Death
              ! Obama gets Nobel Peace Prize

[1] Salaheldeen, H., Nelson, M.L.: Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost? JCDL, Washington, USA (2012)
Tweets from Twitter

                 T = Timestamp                 U = Link to user posting the tweet          W = Tweet Content
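
A minimal sketch of reading records in this T/U/W layout. It assumes, purely for illustration, that each field sits on its own line, prefixed by its letter and a tab, and that the timestamp looks like `YYYY-MM-DD HH:MM:SS`; the slide specifies neither. The `Tweet` class, the `parse_tweets` function and the sample record are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, Iterator


@dataclass
class Tweet:
    timestamp: datetime  # T = timestamp
    user: str            # U = link to the user posting the tweet
    content: str         # W = tweet content


def parse_tweets(lines: Iterable[str]) -> Iterator[Tweet]:
    """Yield one Tweet per complete T/U/W group (assumed layout)."""
    fields: Dict[str, str] = {}
    for line in lines:
        key, sep, value = line.rstrip("\n").partition("\t")
        if not sep or key not in ("T", "U", "W"):
            continue  # skip blank or unrelated lines
        fields[key] = value
        # A record is complete once W arrives and all three fields are set.
        if key == "W" and len(fields) == 3:
            yield Tweet(
                datetime.strptime(fields["T"], "%Y-%m-%d %H:%M:%S"),
                fields["U"],
                fields["W"],
            )
            fields = {}


# Made-up sample record, only to exercise the parser.
sample = [
    "T\t2009-06-25 21:30:00",
    "U\thttp://twitter.com/example_user",
    "W\tRIP #michaeljackson",
]
tweets = list(parse_tweets(sample))
```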

Finding Relevant Tweets

                               Swine Flu                                  Common Cold

                                                     #h1n1 Versus #flu

Finding Relevant Tweets

         Michael Jackson’s Death                                                     Paul Walker’s Death

                                  #michaeljackson or #mj Versus #rip

Finding Relevant Tweets

                                                          #obama ?

                                                                                     White House Correspondents’ Dinner

  Getting Nobel Peace Prize

                                                             Visit to Hannover

Finding Relevant Tweets

Table 1: Twitter hashtags generated for filtering and their frequency of occurrence[1]
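
The filtering idea from the preceding slides can be sketched as follows. This is an illustration only, not the procedure used in [1]: it keeps the event-specific hashtags named on the slides (#h1n1, #michaeljackson, #mj) and leaves out the ambiguous ones (#flu, #rip, #obama). The names `EVENT_TAGS`, `match_events` and `tag_frequencies` are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, Set

HASHTAG = re.compile(r"#(\w+)")

# Event-specific hashtags taken from the slides; ambiguous tags such as
# #flu, #rip and #obama are deliberately not listed.
EVENT_TAGS = {
    "h1n1": {"h1n1"},
    "michael_jackson": {"michaeljackson", "mj"},
}


def hashtags(text: str) -> Set[str]:
    """Return the lower-cased hashtags found in a tweet."""
    return {tag.lower() for tag in HASHTAG.findall(text)}


def match_events(text: str) -> Set[str]:
    """Return the events whose specific hashtags appear in the tweet."""
    tags = hashtags(text)
    return {event for event, wanted in EVENT_TAGS.items() if tags & wanted}


def tag_frequencies(texts: Iterable[str]) -> Counter:
    """Count in how many tweets each hashtag occurs."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(hashtags(text))
    return counts
```

A tweet tagged only with #rip matches no event here, which is the intended behaviour: the tag alone does not say whose death is meant.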

Uniqueness Check and Duplicate Elimination

 http://www.formula1.com and http://www.f1.com both resolve to http://www.formula1.com
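
One way to sketch this step is to group shared URIs by the URI they finally resolve to and keep one entry per group. The code below is an assumption-laden illustration, not the method of [1]: `deduplicate` and `resolve_live` are hypothetical names, and the `REDIRECTS` table is an offline stand-in that mirrors the slide's example so that the snippet runs without network access.

```python
from typing import Callable, Dict, Iterable, List
from urllib.request import urlopen


def resolve_live(uri: str, timeout: float = 10.0) -> str:
    """Follow redirects over the network and return the final URI."""
    with urlopen(uri, timeout=timeout) as response:
        return response.url


def deduplicate(
    uris: Iterable[str], resolve: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group the input URIs by their resolved (final) URI."""
    groups: Dict[str, List[str]] = {}
    for uri in uris:
        groups.setdefault(resolve(uri), []).append(uri)
    return groups


# Offline stand-in for real redirect resolution (assumption for the demo).
REDIRECTS = {"http://www.f1.com": "http://www.formula1.com"}

groups = deduplicate(
    ["http://www.formula1.com", "http://www.f1.com"],
    lambda uri: REDIRECTS.get(uri, uri),
)
unique_uris = list(groups)
```

Passing `resolve_live` instead of the lambda would perform the same grouping against live servers.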
Checking for Lost and Archived Resources
    • Success Class
        ! 200 OK

    • Failure Class
        ! 404 Not Found
        ! 403 Forbidden
        ! 410 Gone
        ! 30X Redirect Family
        ! 50X Server Error
        ! Soft 404s

    • Soft 404 Detection[2]
        ! http://www.ibm.com/us and http://www.ibm.com/us/blahblah both end at http://www.ibm.com/us-en/

[2] Bar-Yossef, Z., Broder, A.Z., Kumar, R., Tomkins, A.: Sic Transit Gloria Telae: Towards an Understanding of the Web’s Decay. In: Proceedings of the 13th International Conference on World Wide Web, WWW 2004, pp. 328–337 (2004)
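
The classification above can be sketched as a small function, together with a rough soft-404 probe in the spirit of [2]: request a random, almost certainly nonexistent sibling URL and check whether it lands on the same page as the URL under test. This is a simplified illustration, not the algorithm from [2]. The names `classify`, `random_sibling` and `looks_like_soft_404` are hypothetical, `fetch` is an injected function returning `(status, final_url)`, and `fake_fetch` with its example.org pages is made up for the demo.

```python
import random
import string
from typing import Callable, Tuple
from urllib.parse import urlsplit, urlunsplit

Fetch = Callable[[str], Tuple[int, str]]  # url -> (status, final URL)


def classify(status: int, soft_404: bool = False) -> str:
    """Map a final HTTP status to the success/failure classes of the slide."""
    if soft_404:
        return "failure"
    if status == 200:
        return "success"
    if status in (403, 404, 410) or 300 <= status < 400 or 500 <= status < 600:
        return "failure"
    return "unknown"  # status codes the slide does not mention


def random_sibling(url: str, length: int = 25) -> str:
    """Build a URL in the same directory that almost surely does not exist."""
    parts = urlsplit(url)
    directory = parts.path.rsplit("/", 1)[0]
    junk = "".join(random.choice(string.ascii_lowercase) for _ in range(length))
    return urlunsplit((parts.scheme, parts.netloc, f"{directory}/{junk}", "", ""))


def looks_like_soft_404(url: str, fetch: Fetch) -> bool:
    """True if the URL and a random sibling both return 200 on the same page."""
    status, final = fetch(url)
    probe_status, probe_final = fetch(random_sibling(url))
    return status == 200 and probe_status == 200 and final == probe_final


# Made-up server: one real page; everything else "succeeds" on the home page.
PAGES = {"http://example.org/docs/page": (200, "http://example.org/docs/page")}


def fake_fetch(url: str) -> Tuple[int, str]:
    return PAGES.get(url, (200, "http://example.org/"))
```

A known limitation of this simplified check: a page that is itself the redirect target for unknown URLs would be flagged as well, so a real detector needs extra care for that case.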
Building Model

Fig. 1. URIs shared per day corresponding to each event[1]
Building Model
                       Table 2: The Split Dataset[1]

Fig. 2. Percentage of content missing and archived as a function of time[1]
Observations from Model
• Linear Relationship between
  ! Content Lost Percentage or Content Archived Percentage
  ! Age in Days

         Content Lost Percentage = 0.02(Age in Days) + 4.20

       Content Archived Percentage = 0.04(Age in Days) + 6.74

• A year after content is published on social media, about 11% of it will
  be gone
• After this point, we lose roughly 0.02% of content per day
• Two and three years after publishing, about 19% and 26% of content is lost
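
The two fitted lines can be evaluated directly to check the figures quoted above. A minimal sketch; the function names are hypothetical and a year is taken as 365 days.

```python
def content_lost_pct(age_days: float) -> float:
    """Content Lost Percentage = 0.02 * (Age in Days) + 4.20"""
    return 0.02 * age_days + 4.20


def content_archived_pct(age_days: float) -> float:
    """Content Archived Percentage = 0.04 * (Age in Days) + 6.74"""
    return 0.04 * age_days + 6.74


for years in (1, 2, 3):
    days = 365 * years
    print(
        f"{years} year(s): lost {content_lost_pct(days):.1f}%, "
        f"archived {content_archived_pct(days):.1f}%"
    )
```

Under these assumptions the lost share evaluates to 11.5%, 18.8% and 26.1% after one, two and three years, in line with the rounded figures of about 11%, 19% and 26%.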

Twitter Content Generation[3]

     • 50% of content on Twitter is generated by 0.05% of users

                      Lady Gaga                                Ashton Kutcher                         Oprah Winfrey

     • Content reaching masses through intermediate layer of opinion
       leaders (not celebrities)
[3] Wu, S., Hofman, J.M., Mason, W.A., Watts, D.J.: Who Says What to Whom on Twitter. In: Proceedings of the 20th International Conference on World Wide Web, WWW 2011, pp. 705–714 (2011)
Tweet Lifetimes[3]

     • Media-generated content URIs (e.g. breaking news): short-lived
     • Blog content URIs (e.g. cooking tips, parenting tips): longer-lived
     • Music video URIs: longest-lived

       Merkel visits CeBIT 2016                                       Cooking Tips                       Music Videos

Web Archives[4]

  • Important to archive culturally significant resources
  • Need to develop tools, models and techniques
  • Research at L3S: ALEXANDRIA project
  • Searching: Semantic Based or Time Based or Both
  • Searching along Time dimension: Temporal Information Retrieval
[4] 1st ALEXANDRIA Workshop (http://alexandria-project.eu/1st_alex_ws/)
Characteristics of Temporal Information[5]
       • Clear Relationship between Events
             ! Before: Attack on Charlie Hebdo (7 Jan 2015) comes before Paris Attacks (13 Nov 2015)
             ! Overlap: European Migrant Crisis (Jan 2015-Today) overlaps with Russian Intervention in Syria (Sep 2015-Today)

[5] Alonso, O., Strötgen, J., Baeza-Yates, R., Gertz, M.: Temporal Information Retrieval: Challenges and Opportunities. Temporal Web Analytics Workshop (TWAW), WWW, Hyderabad, India (2011)
Characteristics of Temporal Information[5]
      • Clear Relationship between Events
          ! After: Iran-Saudi Arabia cut diplomatic ties (4 Jan 2016) comes after Execution of Shia Cleric Sheikh al-Nimr (2 Jan 2016)

      • Temporal Information can be Normalized
      • Suitable Granularity can be chosen (Coarse or Fine)
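
These characteristics can be illustrated with a short sketch: events become date intervals, their relationship is read off the interval end points, and a date is normalized to an ISO-style string at a chosen granularity. The function names are hypothetical, and "Today" in the slide examples is replaced by the talk date, 12 July 2016, as an assumption.

```python
from datetime import date

FORMATS = {"year": "%Y", "month": "%Y-%m", "day": "%Y-%m-%d"}


def relation(a_start: date, a_end: date, b_start: date, b_end: date) -> str:
    """Coarse relationship of interval A to interval B."""
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    return "overlap"


def normalize(day: date, granularity: str = "day") -> str:
    """Normalize a date to a string at coarse or fine granularity."""
    return day.strftime(FORMATS[granularity])


charlie_hebdo = (date(2015, 1, 7), date(2015, 1, 7))
paris_attacks = (date(2015, 11, 13), date(2015, 11, 13))
today = date(2016, 7, 12)  # assumption: "Today" taken as the talk date
migrant_crisis = (date(2015, 1, 1), today)
syria_intervention = (date(2015, 9, 1), today)
ties_cut = (date(2016, 1, 4), date(2016, 1, 4))
execution = (date(2016, 1, 2), date(2016, 1, 2))

print(relation(*charlie_hebdo, *paris_attacks))        # before
print(relation(*migrant_crisis, *syria_intervention))  # overlap
print(relation(*ties_cut, *execution))                 # after
print(normalize(paris_attacks[0], "month"))            # 2015-11
```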

Clustering & Exploring Search Results using Timelines[6]
   • TCluster Algorithm

Fig. 3. Timeline cluster for the query [football world cup][6]
Fig. 4. Timeline cluster for [avian flu] tweets[6]

[6] Alonso, O., Gertz, M., Baeza-Yates, R.: Clustering and Exploring Search Results Using Timeline Constructions. In: Proceedings of the 18th ACM International Conference on Information and Knowledge Management, CIKM 2009, pp. 97–106 (2009)
Types of Temporal Information

     • Explicit Temporal Information
          ! December 25, 2015

     • Implicit Temporal Information
          ! New Year 2016
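
The difference between the two types can be sketched with a toy tagger: explicit expressions carry the full date and only need parsing, while implicit ones need background knowledge that maps a named day to a calendar date. This is a hand-rolled illustration, not how any particular temporal tagger works; `tag_dates` and the tiny `IMPLICIT_DAYS` table are hypothetical.

```python
import re
from datetime import date
from typing import List, Optional, Tuple

MONTHS = [
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
]
# Tiny, hypothetical knowledge base for implicit expressions.
IMPLICIT_DAYS = {"new year": (1, 1), "christmas": (12, 25)}

EXPLICIT = re.compile(
    r"\b(" + "|".join(MONTHS) + r")\s+(\d{1,2}),\s*(\d{4})\b", re.IGNORECASE
)
IMPLICIT = re.compile(
    r"\b(" + "|".join(IMPLICIT_DAYS) + r")\s+(\d{4})\b", re.IGNORECASE
)


def _iso(year: int, month: int, day: int) -> Optional[str]:
    """Return an ISO date string, or None for impossible dates."""
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        return None


def tag_dates(text: str) -> List[Tuple[str, str, str]]:
    """Return (matched text, normalized date, type) for each expression."""
    tags = []
    for m in EXPLICIT.finditer(text):
        month = MONTHS.index(m.group(1).lower()) + 1
        value = _iso(int(m.group(3)), month, int(m.group(2)))
        if value:
            tags.append((m.group(0), value, "explicit"))
    for m in IMPLICIT.finditer(text):
        month, day = IMPLICIT_DAYS[m.group(1).lower()]
        value = _iso(int(m.group(2)), month, day)
        if value:
            tags.append((m.group(0), value, "implicit"))
    return tags


tags = tag_dates("Offices close on December 25, 2015 and reopen after New Year 2016.")
```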

Types of Temporal Information
     • Relative Information
          ! “Tear gas was fired at refugees at the Greece border yesterday”
          ! “On Monday, voting was conducted to decide whether UK should
           remain part of the EU”
          ! “Over the past few years, pressure has been rising on Greece to
           pay off its EU debt”

          Migrant Clashes                                  UK’s Future in EU                   Greek Financial Crisis
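
Relative expressions such as "yesterday" or "on Monday" only become dates once a reference date is known, typically the date the document was written. A minimal sketch under two assumptions: a bare weekday name points to the most recent such day before the reference date, and the reference date used below (12 July 2016) is made up for the demo. `resolve_relative` is a hypothetical name.

```python
from datetime import date, timedelta

WEEKDAYS = [
    "monday", "tuesday", "wednesday", "thursday",
    "friday", "saturday", "sunday",
]


def resolve_relative(expression: str, reference: date) -> date:
    """Resolve a simple relative expression against a reference date."""
    expr = expression.lower().strip()
    if expr == "today":
        return reference
    if expr == "yesterday":
        return reference - timedelta(days=1)
    if expr == "tomorrow":
        return reference + timedelta(days=1)
    if expr in WEEKDAYS:
        # Assumption: the weekday refers to the past, never to the same day.
        back = (reference.weekday() - WEEKDAYS.index(expr)) % 7
        return reference - timedelta(days=back or 7)
    raise ValueError(f"unsupported expression: {expression!r}")


reference = date(2016, 7, 12)  # assumed document date, a Tuesday
print(resolve_relative("yesterday", reference))  # 2016-07-11
print(resolve_relative("Monday", reference))     # 2016-07-11
```

Vague expressions like "over the past few years" cannot be mapped to a single day this way; at best they resolve to a coarse interval.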

Temporal Tagging
• TempEval-2 Challenge: HeidelTime Temporal Tagger

Fig. 5. HeidelTime System Architecture[7]

[7] Strötgen, J., Gertz, M.: HeidelTime: High Quality Rule-based Extraction and Normalization of Temporal Expressions. In: Proceedings of the 5th International Workshop on Semantic Evaluation, SemEval 2010, pp. 321–324 (2010)
Temporal Tagger Application
• TimeTrails: HeidelTime used as Temporal Tagger

Fig. 6. TimeTrails System Architecture[8]

[8] Strötgen, J., Gertz, M.: TimeTrails: A System for Exploring Spatio-Temporal Information in Documents. In: Proceedings of the 36th International Conference on Very Large Data Bases, VLDB 2010, pp. 1569–1572 (2010)
Temporal Tagger Application
     • Visualizes information extracted as Document Trajectories
     • Intersection of Trajectories: Documents (may) have same
       Spatio-Temporal Scope

           James Joyce                                                            Samuel Beckett

Fig. 7. TimeTrails: Multiple Document View and Intersection of Trajectories[8]
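
The idea of intersecting trajectories can be sketched very roughly: treat each document trajectory as a sequence of (time, place) pairs and look for pairs that two documents share. This is only an illustration of the concept, not how TimeTrails [8] computes or draws trajectories; `shared_scope` is a hypothetical name and the trajectories below use placeholder data.

```python
from typing import List, Set, Tuple

Point = Tuple[str, str]  # (normalized time, place name)


def shared_scope(a: List[Point], b: List[Point]) -> Set[Point]:
    """Return the (time, place) pairs that occur in both trajectories."""
    return set(a) & set(b)


# Placeholder trajectories, not extracted from any real document.
doc_a = [("1900", "City A"), ("1920", "City B"), ("1940", "City C")]
doc_b = [("1910", "City D"), ("1920", "City B")]

common = shared_scope(doc_a, doc_b)  # {("1920", "City B")}
```

Whether two trajectories intersect depends on the chosen granularity: at year and city level the two placeholder documents meet once, while at day and street level they might not meet at all.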
Further Applications and Challenges
• Enhancing functionality of Temporal Information Retrieval Apps
• Finding trending news on Twitter before it is published as an article
• Temporal Summaries for Search Results
• Perform
  ! Temporal Clustering
  ! Temporal Querying
  ! Temporal Question-Answering
  ! Temporal Similarity between Documents
• Web Archiving: Predicting how often web content changes, for
  efficient web crawling
• Many Open Research Challenges
• Huge Future Scope for Development

References
[1] Salaheldeen, H., Nelson, M.L.: Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost? JCDL, Washington, USA (2012)

[2] Bar-Yossef, Z., Broder, A.Z., Kumar, R., Tomkins, A.: Sic Transit Gloria Telae: Towards an Understanding of the Web’s Decay. In: Proceedings of the 13th International Conference on World Wide Web, WWW 2004, pp. 328–337 (2004)

[3] Wu, S., Hofman, J.M., Mason, W.A., Watts, D.J.: Who Says What to Whom on Twitter. In: Proceedings of the 20th International Conference on World Wide Web, WWW 2011, pp. 705–714 (2011)

[4] 1st ALEXANDRIA Workshop (http://alexandria-project.eu/1st_alex_ws/)

[5] Alonso, O., Strötgen, J., Baeza-Yates, R., Gertz, M.: Temporal Information Retrieval: Challenges and Opportunities. Temporal Web Analytics Workshop (TWAW), WWW, Hyderabad, India (2011)


[6] Alonso, O., Gertz, M., Baeza-Yates, R.: Clustering and Exploring Search Results Using Timeline Constructions. In: Proceedings of the 18th ACM International Conference on Information and Knowledge Management, CIKM 2009, pp. 97–106 (2009)

[7] Strötgen, J., Gertz, M.: HeidelTime: High Quality Rule-based Extraction and Normalization of Temporal Expressions. In: Proceedings of the 5th International Workshop on Semantic Evaluation, SemEval 2010, pp. 321–324 (2010)

[8] Strötgen, J., Gertz, M.: TimeTrails: A System for Exploring Spatio-Temporal Information in Documents. In: Proceedings of the 36th International Conference on Very Large Data Bases, VLDB 2010, pp. 1569–1572 (2010)

[9] BBC News (bbc.com/news)

[10] CNBC: Major Global Events of 2015 (http://www.cnbc.com/2015/12/31/major-global-events-that-shook-2015.html)
Discussion
