Preserving 2008 US Presidential Election Videos

Page created by Darryl Bell


Chirag Shah
School of Information and Library Science
University of North Carolina
Chapel Hill NC 27599, USA
chirag@unc.edu

Gary Marchionini
School of Information and Library Science
University of North Carolina
Chapel Hill NC 27599, USA
march@ils.unc.edu

IWAW’07. June 23, 2007, Vancouver, British Columbia, Canada. This work is licensed under an Attribution-NonCommercial-NoDerivs 2.0 France Creative Commons Licence.

ABSTRACT
Online digital video hosting and sharing sources are becoming increasingly popular. People are not only uploading and viewing videos at these sites, but are also discussing them. In addition to commenting, visitors also rate and link to these videos. Statistics about these and other user behaviors on such sites can tell us a lot about social trends and patterns, which in turn can serve as important context to aid in interpreting archival materials. We are interested in analyzing the correlation between this behavior and an actual social event. We chose the 2008 US Presidential Election as the event and YouTube as the video source. In this paper we report a system that facilitates harvesting presidential election videos from YouTube along with corresponding metadata and contextual information. The system is built to revisit the same video pages periodically and collect the contextual information on every visit. From this data we hope to understand various social behaviors that may help us find interesting connections with real-life events and results, and inform digital video curatorial policies.

Categories and Subject Descriptors
H.3.6 [Information Storage and Retrieval]: Library Automation

General Terms
Design, Human Factors, Management

Keywords
Preservation, Video harvesting, YouTube, Presidential elections, Metadata, Social context

1. INTRODUCTION
Corporations and governments have long recognized the importance of preserving records to serve decision-making and legal requirements. Cultural institutions such as archives, libraries and museums also have long aimed to preserve cultural artifacts and information to document human existence and inform future generations. As more information is produced, distributed, and stored in digital form, the problem of long-term preservation has raised new kinds of challenges for archivists and records managers. An active research and development effort is underway to address these challenges.

Our work in this area focuses on preserving digital video, a medium that is taking on increasing importance as tools to create and distribute it become common. Our work is rooted in two beliefs. First, it is the responsibility of scholars to preserve content that is beyond the massively popular content that mainstream media organizations will preserve and that will be copied by multitudes of people and by various cultural institutions (the LOCKSS hypothesis [3]). Thus we aim to identify and preserve video that includes items below the mainstream, possibly ephemeral in nature. Second, preserving the digital video bits is useless without the metadata that makes finding and viewing possible and the contextual information that makes interpretation possible. Determining what context to include is a kind of collection policy, and thus much of our attention has focused on this specific issue.

Our emerging framework for context [5] includes several facets; in this paper, however, we focus on the temporal and social facets: how context evolves over time, and how the digital medium facilitates and propagates social interactions around video objects. We first describe a system that harvests video, accompanying metadata, and sets of temporal and social contextual information. The system is a component of a generalized video curator's toolkit that will allow curators to specify sources as well as harvesting policies for video, metadata, and context. The second part of the paper discusses the metadata and context policies and describes the preliminary decisions we have made to date.

Before we proceed, let us try to understand temporal and social context with respect to digital preservation.

Temporal context. Artifacts change over time through physical deterioration, and information changes over time through use and interpretation. We view a digital information archive as dynamic, requiring ongoing maintenance and updating. This is in stark contrast to physical archives, which grow but depend on maintaining static artifacts.

Social context. Digital access allows many more people to use, interact with, annotate, and share archival materials. With the exponentially increasing participation of web users in blogs, forums, and social networks, studying their online behavior has become an essential part of learning about various social trends and patterns. For many organizations, it has become very important to know what content people are reading or watching, what they are talking about, and how these activities reflect and affect their lives. For archival materials, these social interactions serve as an important kind of context.

In the work reported here, we are interested in analyzing people's online video watching and their participation in discussions relating to the 2008 US Presidential Election. We chose this event because it is one of the most important political events in the US; a large portion of the US population participates in discussions and consumes a large amount of information related to the elections. We chose to analyze video postings and user feedback on YouTube [7] because it is one of the largest online video sharing sources. As of August 2006 [4], YouTube had hosted more than 6 million videos, and the total time people had spent watching these videos since its inception in February 2005 was 9,305 years.

In this paper we describe a system that we have built for harvesting YouTube videos related to the 2008 US Presidential Election. The rest of the paper is organized as follows. In section 2, we provide the overall design schema for the system and some of the implementation details. Specific collection policy issues are discussed in section 3. The paper concludes in section 4 with some pointers to future work.

2. DESIGN
We ultimately aim to provide a curator's toolkit that is easy to use and runs mostly automatically. The system described here is a first pass to demonstrate feasibility and to gather enough data to investigate curatorial policies for context mining and to inform system design parameters. We implemented the system to take a set of seed queries and then use them to retrieve videos on YouTube. Thus, the curator expresses the collection development policy by specifying queries, and the system collects and organizes results. The selection of these seed queries and other policy decisions are described in the next section. The architecture of our system is given in Figure 1.

Figure 1: YouTube video harvesting and monitoring

Following is a brief description of its workflow.

1. The curator provides a set of seed queries to monitor (Figure 2).
2. The system uses these queries to go out and search on YouTube.
3. A set of metadata is extracted from a subset of the results returned from YouTube. We define metadata as the information about a given video that is provided by the author of that video and is usually static in nature, for instance the genre of the video (called ‘channel’ on YouTube).

4. The video downloader component checks the metadata table to see which videos have not been previously downloaded and grabs those videos in Flash format.
5. The video converter component checks which videos are downloaded and not converted, and converts them into MPEG format.
6. The context capturing component goes out to YouTube and captures various contexts about the video items for which the metadata is already collected. Each time context is captured, a time-stamp is recorded. We define social context as the data contributed by visitors to the video page. This would include fields such as ratings and comments. Note that other types of social context in blogs and other sources could also be harvested with different components. The context capturing component runs periodically and updates time-sensitive data such as new comments or video postings, thus capturing temporal context.

Thus, there are four major processes: (1) metadata collection, (2) context collection, (3) video downloader, and (4) video converter. Each of these parts can be run independently, and each checks the functions that overlap with other parts to guarantee the consistency and integrity of the whole system.

Figure 2: Providing seed queries

We implemented the system using PHP 5 and MySQL 5. We also used YouTube APIs [8] to access some of the services from YouTube. The data, in our case, are videos from YouTube. Since these videos are embedded in YouTube web pages and viewed by streaming, we had to use some mechanism that would allow us to download and store those videos. We used a freely available tool called ‘youtube-dl’ [2] for downloading the video in Flash format from a YouTube page. In addition to this, we use the FFmpeg tool [1] to convert downloaded Flash videos to MPEG format. The system is operational and we are using it to collect a test set of data. As of June 1, 2007, 35 crawls (one each day) were conducted for 56 different queries related to the presidential campaign. These crawls yielded 5129 unique videos. The number of viewings per query varied enormously, from 123 for one candidate (Rebecca Rotzler) to 29165 for a person who is not currently a candidate (Al Gore). Likewise, the number of comments posted varied dramatically.

Our next steps will be to systematically investigate patterns over time and the kinds of triggers that change use and social interaction. We postulate two kinds of triggers: predictable (e.g., a debate, a primary election) and unpredictable (e.g., a speech or public appearance snafu, personal event). By examining instances of these two kinds of triggers over time, we hope that we can make recommendations for where curators should put their computational resources and personal analysis time when mining context for their video collections.

3. POLICY DECISIONS
Two main kinds of policies are required in any preservation effort: what to collect (the appraisal process) and how to collect it. In our case, policies must be established for the videos, metadata, and context. In the case of the primary video data, we have identified a small number of classes of video and are in the process of identifying others. Several specific questions related to these two kinds of policy follow.

Specific “what” policy questions:

1. What genre of video?
2. Which metadata to include?
3. Which context to include?

Specific “how” policy questions:

1. Which sources to use to find the video?
2. What queries should be used to find the video, including lexical variants?
3. To what depth in the results lists should video be harvested?
4. How frequently should data be harvested?
5. Should links to other sites be followed and if so, to what depth?
6. How should video responses (videos uploaded by the public as comments on the video) be handled?
7. How should duplication be handled while still acquiring comment and download data?

For this paper, we consider the 2008 US Presidential campaign videos hosted on YouTube as answers to the genre and source questions. In order to search for these videos, we decided to run a set of queries on YouTube. To determine the queries, we decided to use the names of all current presidential candidates listed on Wikipedia [6]. In addition, we chose the following six general queries: election 2008, US election 2008, United States election 2008, presidential election 2008, campaign 2008, decision 2008. This yielded a total of 56 queries. To simplify the process at this early stage, the queries are entered as free-form text (no special delimiters such as quotes) and candidate names are entered exactly as they appear in the Wikipedia listing rather than in all variants.

The question of what metadata to collect was answered as follows. We collect the following metadata: title, description, categories, time-stamp, the query for which a given video was retrieved, username of the user who posted that video, date when the video was uploaded, time when it was uploaded, URL of the video page, URL of the thumbnail image, tags, recording location and country. A snapshot of collected metadata for various videos is given in Figure 3.

The question of what context to collect is complex and lies at the heart of our research effort. For these preliminary investigations, we collect the following temporal and social contextual information: rank of the video for a given query, number of views, number of ratings, average rating, number of comments, all the comments along with the user names, time-stamp, number of times the video was favorited, number of links coming to that video, number of clicks from the sites that link to the video, number of times the video received some honor (e.g., best in comedy, video of the week, etc.), and time when it was last updated. A snapshot of collected context for various videos is given in Figure 4.

There is no experience or literature to guide a decision on how many results to harvest for each query. For convenience, we could simply get the top 20 results for each query. However, we do not know how the results are ordered and what the rank of a result signifies. We chose to conduct a month-long pilot run collecting the top 100 results for each query to see how much the results differ from crawl to crawl. Although it is too soon to tell what kinds of “churn” occur in the top 100 ranks over time, our preliminary estimates are that the top 100 give suitable coverage and do not severely burden the system resources.

The frequency decision is also tricky. Data from some preliminary runs suggested that there are no significant differences between collections made one day apart when only the top 20 results are considered. However, no significant political event happened during those crawls. Once again, in order to understand the effect of different sampling periods, we decided to crawl once every day. Given that the load has not been overly burdensome, we will continue to crawl on a daily basis.

4. CONCLUSIONS
This paper outlines the first steps in an ongoing effort to capture and preserve context for digital video. Both temporal and social contexts are taken to be crucial to future understanding and interpretation of the video, requiring an ongoing data collection effort. Automated systems guided by human-generated and controlled collection policies are the only workable approach to such dynamic data collection. This paper presents such a system and some of the policy decisions behind its operation.

At present we have completed a pilot run for a month and offer the following observations. Harvesting videos from systems such as YouTube that publish an API is quite feasible and does not require enormous system resources (good bandwidth and a few terabytes of disk). Conducting daily crawls to the depth of 100 returns per query is highly feasible. New kinds of analytical tools for analyzing the collected data are required, especially with respect to changes over time. Textual comments in this particular domain yield much chaff, ranging from the profane to the arcane. Query variations make some difference in results and deserve further investigation. Context mining policies will likely be distinct for the two kinds of event “triggers” that cause large changes in ranks of results and interactions with specific videos. Examination of the “churn,” the ebb and flow of temporal and social context over time, is especially interesting for this genre of heavily contemporary video. Comparing this to video that is less event-driven will be important in our sister studies.

In the coming months we will continue collecting the data, metadata, and contexts from YouTube. After this trial phase, we shall be able to make more informed policy decisions about some of the questions that we discussed in the previous section. We plan to keep harvesting and preserving such data, metadata, and context at least until the presidential inauguration in January of 2009. We have also begun similar crawls for several other topics that will provide us with data that will allow more generalizable suggestions for curatorial policies. We believe our collection will serve as a valuable resource for understanding some of the social trends and patterns relating to the elections during this time period and inform the continued development of tools for digital video curators.

5. ACKNOWLEDGMENTS
The authors wish to thank Helen Tibbo, Cal Lee, Paul Jones, Terrell Russell, and Yaxio Song for their valuable input to this project. This work is supported by a grant from the National Science Foundation and the Library of Congress NDIIPP.

6. REFERENCES
[1] FFmpeg. FFmpeg. http://ffmpeg.mplayerhq.hu/ [Accessed: April 27, 2007], 2007.
[2] Ricardo Garcia Gonzalez. youtube-dl: Download videos from YouTube.com. http://www.arrakis.es/~rggi3/youtube-dl/ [Accessed: April 27, 2007], 2006.
[3] Vicky Reich and David S. H. Rosenthal. LOCKSS: A Permanent Web Publishing and Access System. D-Lib Magazine, 7(6), June 2001.
[4] Steve Rubel. Micro Persuasion: YouTube by the Numbers. http://www.micropersuasion.com/2006/08/youtube_by_the_.html [Accessed: March 1, 2007], August 2006.
[5] Helen R. Tibbo, Christopher A. Lee, Gary Marchionini, and Dawne Howard. VidArch: Preserving Meaning of Digital Video over Time through Creating and Capture of Contextual Documentation. In Proceedings of Archiving, pages 210–215, 2006.
[6] Wikipedia. United States presidential election, 2008 - Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/United_States_presidential_election,_2008#Candidates_and_potential_candidates [Accessed: April 27, 2007], 2007.
[7] YouTube. YouTube: Broadcast Yourself. http://www.youtube.com [Accessed: March 1, 2007], 2007.
[8] YouTube. YouTube: Broadcast Yourself. http://www.youtube.com/dev [Accessed: April 27, 2007], 2007.

Figure 3: Sample of collected metadata
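
To make the metadata fields of section 3 and the bookkeeping in workflow steps 4 and 5 more concrete, the following is a minimal Python sketch of one metadata row and of the checks for videos still awaiting download or conversion. It is an illustration only, not the PHP/MySQL implementation described in the paper: the class name `VideoMetadata`, all field names, the `downloaded` and `converted` flags, and both helper functions are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VideoMetadata:
    """One row of a hypothetical metadata table (all names are assumptions)."""
    video_url: str                 # URL of the video page
    title: str
    description: str
    query: str                     # seed query for which the video was retrieved
    username: str                  # user who posted the video
    upload_date: str               # date when the video was uploaded
    upload_time: str               # time when the video was uploaded
    thumbnail_url: str             # URL of the thumbnail image
    harvested_at: str              # time-stamp of the harvest
    categories: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)
    recording_location: Optional[str] = None
    country: Optional[str] = None
    downloaded: bool = False       # assumed flag: Flash file has been stored
    converted: bool = False        # assumed flag: MPEG copy has been produced


def pending_downloads(rows: List[VideoMetadata]) -> List[VideoMetadata]:
    """Rows whose metadata is stored but whose video file is not yet fetched."""
    return [row for row in rows if not row.downloaded]


def pending_conversions(rows: List[VideoMetadata]) -> List[VideoMetadata]:
    """Rows whose video is downloaded but not yet converted."""
    return [row for row in rows if row.downloaded and not row.converted]
```

Keeping the two status flags on the metadata row is one simple way to let the downloader and converter run independently, since each can decide what work remains by reading the table alone.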

Figure 4: Sample of collected contextual information
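
The time-stamped context described in section 3 lends itself to simple comparisons between crawls. The Python sketch below shows one possible shape for a context snapshot and two illustrative measures over it. Everything here is an assumption made for illustration: the paper does not define “churn” formally, so `rank_churn` uses one plausible definition (the fraction of the previous top-ranked results that are missing from the current ones), and the class, field, and function names are hypothetical. Only a subset of the context fields listed in section 3 is shown.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ContextSnapshot:
    """Context for one video at one crawl (all names are assumptions)."""
    video_url: str
    query: str
    captured_at: str       # crawl time-stamp; ISO dates such as "2007-06-01" sort correctly
    rank: int              # rank of the video for the given query
    views: int
    num_ratings: int
    average_rating: float
    num_comments: int
    times_favorited: int


def view_growth(snapshots: Sequence[ContextSnapshot]) -> List[int]:
    """Change in view count between consecutive crawls of the same video."""
    ordered = sorted(snapshots, key=lambda s: s.captured_at)
    return [later.views - earlier.views for earlier, later in zip(ordered, ordered[1:])]


def rank_churn(previous: Sequence[str], current: Sequence[str], depth: int = 100) -> float:
    """Fraction of the previous top-`depth` videos absent from the current top-`depth`.

    `previous` and `current` are ranked lists of video URLs for the same query.
    """
    previous_top = set(previous[:depth])
    current_top = set(current[:depth])
    if not previous_top:
        return 0.0
    return len(previous_top - current_top) / len(previous_top)
```

Under this definition, a churn of 0.0 means the same videos appear in both crawls (possibly reordered), while 1.0 means the result list was replaced entirely; a measure that also accounts for reordering would need a rank-correlation statistic instead.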
