Exploring Web Archives: Challenges and Solutions - KBS
Vaibhav Kasturia (vbh18kas@gmail.com)
Supervisor: Prof. Dr. Wolfgang Nejdl (nejdl@l3s.de)
12 July 2016
Outline
• Social Media Growth: Twitter
• Social Media Content Loss
• Need for Web Archiving
• Temporal Information
• Temporal Tagging
• Applications and Challenges
• Conclusion
Tremendous Growth of Social Media
What do we Preserve?
How much Social Media Content gets Lost? [1]
• Culturally Significant Events (June 2009 - March 2012)
! H1N1 Virus Outbreak
! Syrian Uprising
! Egyptian Revolution
! Iranian Elections
! Michael Jackson’s Death
! Obama gets Nobel Peace Prize
[1] Salaheldeen, H.; Nelson, M. L.: Losing My
Revolution: How Many Resources Shared on Social
Media Have Been Lost? JCDL, Washington, USA 2012
Tweets from Twitter
• T = Timestamp
• U = Link to user posting the tweet
• W = Tweet Content
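A minimal sketch of how such T/U/W records could be read. The slide does not specify the file layout, so this assumes one tab-separated line per field, a `YYYY-MM-DD HH:MM:SS` timestamp, and that the W line closes a record. The `Tweet` class, `parse_tweets` and the sample text are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterator


@dataclass
class Tweet:
    timestamp: datetime  # T
    user_url: str        # U
    content: str         # W


def parse_tweets(text: str) -> Iterator[Tweet]:
    """Yield Tweet records from T/U/W-prefixed lines (assumed layout)."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank separator lines
        key, _, value = line.partition("\t")
        fields[key] = value.strip()
        if key == "W":  # assumption: W is the last field of a record
            yield Tweet(
                timestamp=datetime.strptime(fields["T"], "%Y-%m-%d %H:%M:%S"),
                user_url=fields["U"],
                content=fields["W"],
            )
            fields = {}


# Made-up sample record, for illustration only.
sample = (
    "T\t2009-06-11 16:56:42\n"
    "U\thttp://twitter.com/example_user\n"
    "W\tStay safe everyone #h1n1 http://example.org/news\n"
)
tweets = list(parse_tweets(sample))
```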
Finding Relevant Tweets
#h1n1 (Swine Flu) versus #flu (Common Cold)
Finding Relevant Tweets
#michaeljackson or #mj (Michael Jackson’s Death) versus #rip (Paul Walker’s Death)
Finding Relevant Tweets
• #obama?
! White House Correspondents’ Dinner
! Getting Nobel Peace Prize
! Visit to Hannover
Finding Relevant Tweets
Table 1: Twitter hashtags generated for filtering and their frequency of occurrence [1]
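A minimal sketch of hashtag-based filtering, assuming tweets are plain strings. The event-to-hashtag mapping below only reuses the examples from the preceding slides and is hypothetical; it is not the hashtag set from Table 1. Ambiguous tags such as #flu or #rip are deliberately left out.

```python
import re

HASHTAG_RE = re.compile(r"#(\w+)")
URL_RE = re.compile(r"https?://\S+")

# Hypothetical mapping built from the slide examples, not from Table 1.
EVENT_HASHTAGS = {
    "h1n1": {"h1n1"},
    "michael_jackson": {"michaeljackson", "mj"},
}


def hashtags(text):
    """Return the lower-cased hashtags that occur in a tweet."""
    return {tag.lower() for tag in HASHTAG_RE.findall(text)}


def match_event(text):
    """Return the events whose specific hashtags appear in the tweet."""
    tags = hashtags(text)
    return sorted(event for event, wanted in EVENT_HASHTAGS.items() if tags & wanted)


def shared_urls(text):
    """Return the URLs shared in a tweet, the resources to check later."""
    return URL_RE.findall(text)
```

With this mapping, a tweet tagged only with #flu matches no event, which reflects the point of the slides: a generic hashtag does not identify the event reliably.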
Uniqueness Check and Duplicate Elimination
http://www.formula1.com http://www.f1.com http://www.formula1.com
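A minimal sketch of the duplicate elimination step, assuming (as the slide's example suggests) that different shared URLs can lead to the same final resource. The `resolve` callback and the `REDIRECTS` table are hypothetical stand-ins for following real HTTP redirects.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lower-case the host and drop a trailing slash and any fragment."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def unique_resources(urls, resolve):
    """Keep one URL per final target.

    `resolve` is a caller-supplied function that maps a URL to the URL it
    finally redirects to (for example by issuing HTTP requests).
    """
    seen = set()
    unique = []
    for url in urls:
        final = normalize(resolve(normalize(url)))
        if final not in seen:
            seen.add(final)
            unique.append(final)
    return unique


# Assumed redirect table standing in for real HTTP requests.
REDIRECTS = {"http://www.f1.com": "http://www.formula1.com"}

shared = ["http://www.formula1.com", "http://www.f1.com/", "http://WWW.formula1.com"]
result = unique_resources(shared, lambda u: REDIRECTS.get(u, u))
```

Here all three shared links collapse into a single resource, so it is counted only once when checking for loss.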
Checking for Lost and Archived Resources
• Success Class
! 200 OK
• Failure Class
! 404 Not Found
! 403 Forbidden
! 410 Gone
! 30X Redirect Family
! 50X Server Error
! Soft 404s http://www.ibm.com/us http://www.ibm.com/us/blahblah
Soft 404 Detection[2]
[2] Bar-Yossef, Z., Broder, A.Z., Kumar, R., Tomkins, A.: Sic
Transit Gloria Telae: Towards an Understanding of the
Web’s Decay. In: Proceedings of the 13th International
Conference on World Wide Web, WWW 2004, pp. 328–337
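A minimal sketch of the two checks, under stated assumptions. `classify` follows the grouping on the slide (including the 30X family under failure). `is_soft_404` is a simplified random-probe heuristic loosely inspired by the idea in [2], not the published algorithm: it asks whether a random sibling URL that should not exist ends up at the same place as the URL under test. The `fetch` callback and `fake_fetch` are hypothetical.

```python
import uuid
from urllib.parse import urljoin


def classify(status):
    """Map a final HTTP status code to the slide's success/failure classes."""
    if status == 200:
        return "success"
    if status in (403, 404, 410) or 300 <= status < 400 or 500 <= status < 600:
        return "failure"
    return "other"  # codes the slide does not mention


def is_soft_404(url, fetch):
    """Return True if `url` looks like an error page served with 200 OK.

    `fetch(url)` is a caller-supplied function returning
    (status, final_url) after following redirects.
    """
    status, final = fetch(url)
    if status != 200:
        return False  # a hard error, not a soft 404
    probe = urljoin(url, uuid.uuid4().hex)  # random sibling path
    probe_status, probe_final = fetch(probe)
    return probe_status == 200 and probe_final == final


def fake_fetch(url):
    """Hypothetical server: one real page, everything else lands on /home."""
    if url == "http://example.org/docs/real.html":
        return 200, url
    return 200, "http://example.org/home"
```

With `fake_fetch`, `is_soft_404("http://example.org/docs/missing.html", fake_fetch)` is true because the missing page and the random probe both end at `/home`, while the real page is not flagged.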
http://www.ibm.com/us-en/
Building Model
Fig. 1. URIs shared per day corresponding to each event[1]
Building Model
Table 2: The Split Dataset[1]
Fig. 2. Percentage of content missing and archived as a function of time[1]
Observations from Model
• Linear Relationship between
! Content Lost Percentage or Content Archived Percentage
! Age in Days
Content Lost Percentage = 0.02(Age in Days) + 4.20
Content Archived Percentage = 0.04(Age in Days) + 6.74
• A year after content is published on social media, about 11% of it
will be gone
• After this point, we lose roughly 0.02% of content per day
• After two and three years, about 19% and 26% of content is lost,
respectively
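The figures above follow directly from the two fitted lines. A small sketch that evaluates them, assuming a year of 365 days; the function names are illustrative only.

```python
def lost_pct(age_days):
    """Content Lost Percentage = 0.02(Age in Days) + 4.20"""
    return 0.02 * age_days + 4.20


def archived_pct(age_days):
    """Content Archived Percentage = 0.04(Age in Days) + 6.74"""
    return 0.04 * age_days + 6.74


for years in (1, 2, 3):
    days = 365 * years  # assumption: 365 days per year
    print(f"{years} year(s): lost {lost_pct(days):.1f}%, archived {archived_pct(days):.1f}%")
```

The lost percentage evaluates to 11.5, 18.8 and 26.1 for one, two and three years, which matches the rounded values quoted on the slide.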
Twitter Content Generation [3]
• 50% of content on Twitter is generated by 0.05% of users
Lady Gaga, Ashton Kutcher, Oprah Winfrey
• Content reaches the masses through an intermediate layer of opinion
leaders (not celebrities)
[3] Wu, S., Hofman, J.M., Mason, W.A., Watts, D.J.: Who
Says What to Whom on Twitter. In: Proceedings of the
20th International Conference on World Wide Web,
WWW 2011, pp. 705–714 (2011)
Tweet Lifetimes [3]
• Media-generated content URIs (e.g. breaking news such as Merkel visiting CeBIT 2016): short-lived
• Blog content URIs (e.g. cooking tips, parenting tips): longer-lived
• Music video URIs: longest-lived
Web Archives [4]
• Important to archive culturally significant resources
• Need to develop tools, models and techniques
• Research at L3S: ALEXANDRIA project
• Searching: semantic-based, time-based, or both
• Searching along the time dimension: Temporal Information Retrieval
[4] 1st ALEXANDRIA Workshop (http://alexandria-project.eu/1st_alex_ws/)
Characteristics of Temporal Information [5]
• Clear Relationship between Events
! Before: Attack on Charlie Hebdo (7 Jan 2015), before Paris Attacks (13 Nov 2015)
! Overlap: European Migrant Crisis (Jan 2015-Today), overlapping Russian Intervention in Syria (Sep 2015-Today)
[5] Alonso, O.; Strötgen, J.; Baeza-Yates, R.; Gertz, M.:
Temporal information Retrieval: Challenges and
opportunities. Temporal Web Analytics Workshop
(TWAW), WWW, Hyderabad, India, 2011
Characteristics of Temporal Information [5]
• Clear Relationship between Events
! After: Iran-Saudi Arabia cut diplomatic ties (4 Jan 2016), after Execution of Shia Cleric Sheikh al-Nimr (2 Jan 2016)
• Temporal Information can be Normalized
• Suitable Granularity can be chosen (Coarse or Fine)
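The relationships and the granularity choice above can be sketched with plain date intervals. This is an illustrative model only: `Event`, `relation` and `coarsen` are hypothetical names, month-level dates are assumed to start on the first of the month, and "Today" is taken to be the date of the talk.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    name: str
    start: date
    end: date  # inclusive


def relation(a, b):
    """Return how event a relates to event b: 'before', 'after' or 'overlap'."""
    if a.end < b.start:
        return "before"
    if a.start > b.end:
        return "after"
    return "overlap"


def coarsen(d, granularity):
    """Normalize a date to 'year', 'month' or 'day' granularity (ISO-style string)."""
    if granularity == "year":
        return f"{d.year:04d}"
    if granularity == "month":
        return f"{d.year:04d}-{d.month:02d}"
    return d.isoformat()


TODAY = date(2016, 7, 12)  # assumption: date of the talk stands in for "Today"

hebdo = Event("Attack on Charlie Hebdo", date(2015, 1, 7), date(2015, 1, 7))
paris = Event("Paris Attacks", date(2015, 11, 13), date(2015, 11, 13))
migrants = Event("European Migrant Crisis", date(2015, 1, 1), TODAY)
syria = Event("Russian Intervention in Syria", date(2015, 9, 1), TODAY)
```

Because every expression is normalized to a point or interval on one time axis, comparing two events reduces to comparing their end points, and a coarser granularity is just a shorter prefix of the normalized value.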
Clustering & Exploring Search Results using Timelines [6]
• TCluster Algorithm
Fig. 3. Timeline cluster for the query [football world cup] [6]
Fig. 4. Timeline cluster for [avian flu] tweets [6]
[6] O. Alonso, M. Gertz, and R. Baeza-Yates. Clustering and
Exploring Search Results Using Timeline Constructions. In
Proceedings of the 18th ACM International Conference on
Information and Knowledge Management (CIKM ’09), pages
97–106, 2009
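To illustrate the general idea of organising search results along a timeline, the sketch below simply buckets documents by a normalized date. It is not the TCluster algorithm from [6], only a much simpler grouping; the function name and input format are assumptions.

```python
from collections import defaultdict
from datetime import date


def timeline(docs, granularity="month"):
    """Group (title, date) pairs into time buckets, ordered chronologically."""
    buckets = defaultdict(list)
    for title, d in docs:
        if granularity == "year":
            key = f"{d.year:04d}"
        else:
            key = f"{d.year:04d}-{d.month:02d}"
        buckets[key].append(title)
    return dict(sorted(buckets.items()))


# Made-up search results, for illustration only.
results = [
    ("match report", date(2006, 7, 9)),
    ("opening game", date(2006, 6, 9)),
    ("group stage summary", date(2006, 6, 30)),
]
by_month = timeline(results)
```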
Types of Temporal Information
• Explicit Temporal Information
! December 25, 2015
• Implicit Temporal Information
! New Year 2016
Types of Temporal Information
• Relative Information
! “Tear gas was fired at refugees at the Greece border yesterday”
! “On Monday, voting was conducted to decide whether UK should
remain part of the EU”
! “Over the past few years, pressure has been rising on Greece to
pay off its EU debt”
Migrant Clashes, UK’s Future in EU, Greek Financial Crisis
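A toy sketch of how the three kinds of expression could be normalized to calendar dates. It is not how any particular temporal tagger works: the lookup table, the function names and the rule that a bare weekday means the most recent such day before the document date are all assumptions. Vague expressions such as "over the past few years" are not handled.

```python
from datetime import date, datetime, timedelta

IMPLICIT = {"new year": (1, 1), "christmas": (12, 25)}  # hypothetical lookup table
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def normalize_explicit(text):
    """Explicit: 'December 25, 2015' needs no context."""
    return datetime.strptime(text, "%B %d, %Y").date()


def normalize_implicit(name, year):
    """Implicit: 'New Year' plus a year is resolved via the lookup table."""
    month, day = IMPLICIT[name.lower()]
    return date(year, month, day)


def normalize_relative(expr, reference):
    """Relative: resolve 'yesterday' or a weekday name against the document date.

    Assumption: a bare weekday refers to the most recent such day strictly
    before the reference date.
    """
    expr = expr.lower()
    if expr == "yesterday":
        return reference - timedelta(days=1)
    if expr in WEEKDAYS:
        back = (reference.weekday() - WEEKDAYS.index(expr)) % 7 or 7
        return reference - timedelta(days=back)
    raise ValueError(f"unsupported expression: {expr}")


doc_date = date(2016, 7, 12)  # a Tuesday, used as the document creation time
```

The key difference is the amount of context required: explicit expressions need none, implicit ones need background knowledge, and relative ones need a reference time such as the document's publication date.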
Temporal Tagging
• TempEval-2 Challenge : HeidelTime Temporal Tagger
Fig. 5. HeidelTime System Architecture[7]
[7] J. Strötgen and M. Gertz. HeidelTime: High Quality
Rule-based Extraction and Normalization of Temporal
Expressions. In Proceedings of the 5th International
Workshop on Semantic Evaluation (SemEval ’10), pages 321–324, 2010
Temporal Tagger Application
• TimeTrails: HeidelTime used as Temporal Tagger
Fig. 6. TimeTrails System Architecture [8]
[8] J. Strötgen and M. Gertz. TimeTrails: A System for
Exploring Spatio-Temporal Information in Documents. In
Proceedings of the 36th International Conference on Very
Large Data Bases (VLDB ’10), pages 1569–1572, 2010
Temporal Tagger Application
• Visualizes information extracted as Document Trajectories
• Intersection of Trajectories: Documents (may) have same
Spatio-Temporal Scope
James Joyce, Samuel Beckett
Fig. 7. TimeTrails: Multiple Document View and Intersection of Trajectories [8]
Further Applications and Challenges
• Enhancing functionality of Temporal Information Retrieval Apps
• Finding trending news on Twitter before it is published as an article
• Temporal Summaries for Search Results
• Perform
! Temporal Clustering
! Temporal Querying
! Temporal Question-Answering
! Temporal Similarity between Documents
• Web Archiving: predicting how often web content changes, for
efficient web crawling
• Many Open Research Challenges
• Huge Future Scope for Development
References
[1] Salaheldeen, H.; Nelson, M. L.: Losing My Revolution: How Many Resources
Shared on Social Media Have Been Lost? JCDL, Washington, USA, 2012
[2] Bar-Yossef, Z., Broder, A.Z., Kumar, R., Tomkins, A.: Sic Transit Gloria Telae:
Towards an Understanding of the Web’s Decay. In: Proceedings of the 13th
International Conference on World Wide Web, WWW 2004, pp. 328–337 (2004)
[3] Wu, S., Hofman, J.M., Mason, W.A., Watts, D.J.: Who Says What to Whom on
Twitter. In: Proceedings of the 20th International Conference on World Wide Web,
WWW 2011, pp. 705–714 (2011)
[4] 1st ALEXANDRIA Workshop (http://alexandria-project.eu/1st_alex_ws/)
[5] Alonso, O.; Strötgen, J.; Baeza-Yates, R.; Gertz, M.: Temporal information retrieval:
Challenges and opportunities. Temporal Web Analytics Workshop (TWAW), WWW,
Hyderabad, India, 2011
[6] O. Alonso, M. Gertz, and R. Baeza-Yates. Clustering and Exploring Search Results
Using Timeline Constructions. In Proceedings of the 18th ACM International Conference
on Information and Knowledge Management (CIKM ’09), pages 97–106, 2009
[7] J. Strötgen and M. Gertz. HeidelTime: High Quality Rule-based Extraction and
Normalization of Temporal Expressions. In Proceedings of the 5th International Workshop
on Semantic Evaluation (SemEval ’10), pages 321-324, 2010
[8] J. Strötgen and M. Gertz. TimeTrails: A System for Exploring Spatio-Temporal
Information in Documents. In Proceedings of the 36th International Conference on Very
Large Data Bases (VLDB ’10), pages 1569–1572, 2010
[9] BBC News (bbc.com/news)
[10] CNBC: Major Global Events of 2015 (http://www.cnbc.com/2015/12/31/major-global-events-that-shook-2015.html)
Discussion