A GENERIC FRAMEWORK FOR ENGAGING ONLINE DATA SOURCES IN INTRODUCTORY PROGRAMMING COURSES - NADEEM ABDUL HAMID - Sinbad.key

Page created by Bradley Carroll
 
CONTINUE READING
A GENERIC FRAMEWORK FOR ENGAGING ONLINE DATA SOURCES IN INTRODUCTORY PROGRAMMING COURSES - NADEEM ABDUL HAMID - Sinbad.key
A GENERIC FRAMEWORK
FOR ENGAGING ONLINE DATA SOURCES
IN INTRODUCTORY PROGRAMMING COURSES
NADEEM ABDUL HAMID
A GENERIC FRAMEWORK FOR ENGAGING ONLINE DATA SOURCES IN INTRODUCTORY PROGRAMMING COURSES - NADEEM ABDUL HAMID - Sinbad.key
A GENERIC FRAMEWORK FOR ENGAGING ONLINE DATA SOURCES                                      2

“LIVE” DEMO
   https://datahub.io/dataset/ubigeo-peru/resource/12c2cc3a-5896-496b-96f6-d95cd1618d61
A GENERIC FRAMEWORK FOR ENGAGING ONLINE DATA SOURCES IN INTRODUCTORY PROGRAMMING COURSES - NADEEM ABDUL HAMID - Sinbad.key
A GENERIC FRAMEWORK FOR ENGAGING ONLINE DATA SOURCES   3

CONNECT - LOAD - FETCH

  import core.data.*;

  public class PeruData1 {
    public static void main(String[] args) {
      DataSource ds = DataSource.connect("https://...
      ds.load();
      String[] names = ds.fetchStringArray("NOMBRE");

          System.out.println(names.length);
          System.out.println(names[367]);
      }
  }
A GENERIC FRAMEWORK FOR ENGAGING ONLINE DATA SOURCES   4

WHAT’S IN THE DATA?
import core.data.*;

      public class PeruData1 {
        public static void main(String[] args) {
          DataSource ds = DataSource.connect("https://..
          ds.load();
          ds.printUsageString();
          String[] names = ds.fetchStringArray("NOMBRE")

            System.out.println(names.length);
            System.out.println(names[367]);
        }
  }
A GENERIC FRAMEWORK FOR ENGAGING ONLINE DATA SOURCES     5

USAGE STRING
-----
Data Source: https://commondatastorage.googleapis.com/.../
Ubigeo2010.csv
URL: https://commondatastorage.googleapis.com/.../
Ubigeo2010.csv

The following data is available:
   A list of:
      structures with fields:
      {
        CODDIST : *
        CODDPTO : *
        CODPROV : *
        NOMBRE : *
      }
-----
A GENERIC FRAMEWORK FOR ENGAGING ONLINE DATA SOURCES   6

USER-DEFINED CLASS
class Geo {
   String name;
   int pop;
   int elev;

     public Geo(String name, int pop, int elev) {
       this.name = name;
       this.pop = pop;
       this.elev = elev;
     }

     public String toString() {
       return String.format("%s (pop. %d): %d m.",
                             name, pop, elev);
     }
 }
A GENERIC FRAMEWORK FOR ENGAGING ONLINE DATA SOURCES            7

DEMO - ADDITIONAL FEATURES
    DataSource ds = DataSource.connectAs("TSV",
          "http://download.geonames.org/export/dump/PE.zip");
    ds.setOption("fileentry", "PE.txt");
    ds.setOption(“header",
       “geoid,name,asciiname,altnames,lat,long,feature-class,
        feature-code,cc,cc2,admin1,admin2,admin3,admin4,ppl,
        elev,dem,tz,mod");
    ds.load();

    Geo g = ds.fetch("Geo", "name", "ppl", "dem");
    System.out.println(g);

    ArrayList places = ds.fetchList("Geo",
                                       "name", "ppl", "dem");
    System.out.println(places.size());
    for (Geo p : places)
      if (p.name.equals("Arequipa"))
        System.out.println(p);
A GENERIC FRAMEWORK FOR ENGAGING ONLINE DATA SOURCES   8

OUTPUT

      Brazo Tigre (pop. 0): 0 m.
      102315
      Arequipa (pop. 1218168): 3351 m.
      Arequipa (pop. 0): 3164 m.
      Arequipa (pop. 841130): 2355 m.
      Arequipa (pop. 0): 106 m.
      Arequipa (pop. 0): 2327 m.
      Arequipa (pop. 0): 404 m.
A GENERIC FRAMEWORK FOR ENGAGING ONLINE DATA SOURCES   10

OUTLINE

▸ Motivation

▸ Goals

▸ Usage & Functionality

▸ Design & Implementation

▸ Related & Future Work

▸ Conclusion
A GENERIC FRAMEWORK FOR ENGAGING ONLINE DATA SOURCES         11

MOTIVATION

▸ The “Age of Big Data”

▸ Incorporate the use of online data sets in introductory
  programming courses

  ▸ Provide a simple interface

  ▸ Hide I/O connection, parsing, extracting, data binding
A GENERIC FRAMEWORK FOR ENGAGING ONLINE DATA SOURCES                 12

GOALS
▸ Minimal syntactic overhead

▸ Direct access via URL (or local file path)

▸ No requirement of pre-supplied data schemas/templates

▸ Bind (instantiate) data objects based on user-defined data
  representations (i.e. student-defined classes)

▸ Other good stuff                    ArrayList places
                                         = ds.fetchList(“Geo”, ...

  ▸ Caching
  ▸ Help/usage
  ▸ Error handling/reporting
A GENERIC FRAMEWORK FOR ENGAGING ONLINE DATA SOURCES              13

USAGE

▸ 3-step approach: • Connect       • Load    • Fetch
▸ Infer data format if possible — XML, CSV, JSON
▸ Display inferred structure of data — printUsageString()
▸ Fetching atomic values              ds.fetch("Geo",
                                               “info/name/std”,
  ▸ provide a path into the data               “metrics/pop",
                                               “phys/elev”);
▸ Structured data:
  ▸ provide name of class and paths of data to be supplied to
    the constructor
▸ Collections: fetchStringArray / fetchArray / fetchList / …
A GENERIC FRAMEWORK FOR ENGAGING ONLINE DATA SOURCES   14

OTHER FUNCTIONALITY

▸ Data source specifications

▸ Query parameters

▸ Iterator-based access

▸ Cache control

▸      Processing support
A GENERIC FRAMEWORK FOR ENGAGING ONLINE DATA SOURCES                                          15

DESIGN & IMPLEMENTATION

▸ Connect
  ▸ prepare URL/path; set parameters, options, data type

                                                                                      code$
▸ Load:
                                                                2&
                                        data$sources$                fetch&
  ▸ get the data
                                                                                     object(s)$
  ▸ infer a schema                         1&                        signature$
                                                load&
                                                                                                  3&

▸ Fetch:                                                field$schema$              instan.ate&

  ▸ build a signature for type requested by user
  ▸ unify schema with signature - instantiate as objects
A GENERIC FRAMEWORK FOR ENGAGING ONLINE DATA SOURCES                                                                16

EXPERIENCE

▸ Limited to date: “Creative Computing”

▸ Tutorial-style labs

▸ Sample data sets used/discovered by students:
Name                                                   Source                              Type           Records
(Asterisk indicates data set discovered by students)
*1000 songs to hear before you die                     opendata.socrata.com                XML            1,000
Abalone data set                                       UCI Machine Learning Repository     CSV            4,177
*Airport Weather Mashup                                NWS + FAA                           XML            fixed
*Chicago life expectancy by community                  data.cityofchicago.org              XML            ˜80
Earthquake feeds                                       US Geological Survey                JSON           variable
*Fuel economy data                                     US EPA                              XML            35,430
*Jeopardy! question archive                            reddit                              JSON           216,930
Live auction data                                      Ebay                                XML            100/page
Magic the Gathering card data                          mtgjson.com                         JSON           variable
Microfinance loan data                                 Kiva                                XML            variable
*SEC Rushing Leaders 2014                              ESPN                                CSV (manual)   variable
                Figure 5. Examples of data sources used in lecture examples and student final projects.
A GENERIC FRAMEWORK FOR ENGAGING ONLINE DATA SOURCES       17

ISSUES

▸ Finding proper links to raw data (students can have trouble)

▸ Sites requiring “developer” registration

▸ Error messages not helpful (yet)

▸ XML as common intermediate format

▸ Better caching (of schemas as well as raw data)

  ▸ Streaming, pagination, sampling…
A GENERIC FRAMEWORK FOR ENGAGING ONLINE DATA SOURCES           18

FUTURE
▸ Redo abstraction layer over data formats
▸ GUI tools
▸ Multiple language support (Python, Racket)
  ▸ Different language mechanisms to achieve dynamic binding
    (reflection, macros)
▸ Additional data formats
  ▸ HTML tables, web scrapers (regexps)
  ▸ Customized for popular APIs (ebay, twitter, etc.)

‣ Curriculum resources
▸ Evaluation of effectiveness
A GENERIC FRAMEWORK FOR ENGAGING ONLINE DATA SOURCES          19

RELATED WORK & ACKNOWLEDGEMENTS

▸ CORGIS Dataset Project - http://think.cs.vt.edu/corgis/

▸ XML Data Access Interfaces

   ▸ JAXB, Castor: schema-based; compile-time setup
      required
   ▸ FasterXML (Jackson): dynamic binding to POJOs;
     emphasis on Java → XML direction; tight coupling
▸ XML schema inference

▸ Contributions by Steven Benzel, Stephen Jones, Alan Young
A GENERIC FRAMEWORK FOR ENGAGING ONLINE DATA SOURCES     20

CONCLUSION

▸ Facilitate incorporation of online data sources into
  programming assignments

  ▸ Painlessly

  ▸ Seamlessly
Use a data set in your next assignment!

         cs.berry.edu/sinbad
A GENERIC FRAMEWORK FOR ENGAGING ONLINE DATA SOURCES

DATA SOURCE                                         DataSource.connectUsing("geospec-pe.spec");
SPECIFICATION FILE                                  {
                                                         "name": "Geographical Data - Peru",
‣ Data source URL and format.
                                                         "format": "TSV",
‣ Human-friendly name and description, along             "path": "http://download.geonames.org/export/dump/PE.zip",
  with URL to a project or informational page            "infourl": "http://www.geonames.org/",
  about the data source.
                                                          "options": [
‣ A specification of pre-supplied and user-
                                                       {      "name": "fileentry",
  supplied (required and optional) query                     "value": "PE.txt" },
  parameters or path parameters. The latter are        {
  user-provided strings that are substituted in                  "name": "header",
  for placeholders in the URL path.                              "value":
‣ Programmatic options specific to the              "geoid,name,asciiname,altnames,lat,long,feature-class,feature-
                                                    code,cc,cc2,admin1,admin2,admin3,admin4,pop,elev,dem,tz,mod"
  particular data source object (such as a                }],
  header for CSV files).                            }
‣ Cache settings, such as cache directory path
  or timeout.
‣ A data schema defining the exposed data
  structures and fields from the source with
  various helpful annotations such as textual
  descriptions of fields that can be displayed by
  printUsageString().
A GENERIC FRAMEWORK FOR ENGAGING ONLINE DATA SOURCES                                  25
                  (base type) B := boolean | int | String | . . .
   SCHEMAS & SIGNATURES
              (constructor) C := any Java class name

                        (schema)          := ⇤ | [p ] | {f0p0 :   0 , . . .}

                           (signature) ⌧ := ⌧B | [⌧ ] | C{f0 :⌧0 ,...}

 ersion) h := parseB(   ) | h( [i])
      Primitive, List, or Structure | h( .p) | new C(h   , . . .) | new list[h
   ▸                                                   0                       0

   The following data is available:
      A structure with fields:
      {
)h      row : A list of:
                A structure with fields:
schema       unifies with signature ⌧ to produce
                {                                a conversion
                                             ds.fetch("Prop",
                                                              expression h.
                  Address_1 : *
                  Electricity_Use_-_Grid_Purchase_kWh : *        "row/Property_Name",
                  Energy_Cost_ : *
                  ...                                            "row/Year_Ending",
                    prim-prim
                  Natural_Gas_Use_therms : *                prim-singleton-comp
                                                                 "row/Energy_Cost_");
                  Property_GFA_-_Self-Reported_ft : *
                  Property_Id : *                                   ⇤k⌧ )h
                  Property_Name : *
                  ...⇤ k ⌧ ) parseB( )
                            B
                  Weather_Normalized_Site_EUI_kBtu-ft : *
                                                            ⇤ k C{f :⌧ } ) new C(h( ))
                  Year_Ending : *
                }
         list-list                                                list-strip
(constructor) C := any Java class name
                          (constructor) C := any Java class name
A GENERIC FRAMEWORK      (schema)     := ⇤ | ONLINE
                            FOR ENGAGING     [p ] | {fDATA
                                                       0p0 : SOURCES
                                                               0 , . . .}                    26
                         (schema) := ⇤ | [p ] | {f0p0 : 0 , . . .}
                           (signature) ⌧ := ⌧B | [⌧ ] | C{f0 :⌧0 ,...}
UNIFICATION                (signature) ⌧ := ⌧B | [⌧ ] | C{f0 :⌧0 ,...}
   (conversion) h := parseB( ) | h( [i]) | h( .p) | new C(h0 , . . .) | new list[h0 , . . .]
   (conversion) h := parseB( ) | h( [i]) | h( .p) | new C(h0 , . . .) | new list[h0 , . . .]

  k⌧ )h
  k ⌧ )schema
 means   h            unifies with signature ⌧ to produce a conversion expression h.
 means schema         unifies with signature ⌧ to produce a conversion expression h.
                              prim-prim                                         prim-singleton-comp
                              prim-prim                                                 ⇤k⌧ )h
                                                                                prim-singleton-comp
                                                                                          ⇤k⌧ )h
                              ⇤ k ⌧B ) parseB( )                                 ⇤ k C{f :⌧ } ) new C(h( ))
                              ⇤ k ⌧B ) parseB( )                                 ⇤ k C{f :⌧ } ) new C(h( ))
                 list-list                                                                 list-strip
                 list-list       k⌧ )h                                                     list-strip
                                                                                                k⌧ )h
                                 k⌧ )h                                                          k⌧ )h
                  [ ] k [⌧ ] ) new list([h( 0 ), . . .])                                   [ ] k ⌧ ) h( 0 )
                  [ ] k [⌧ ] ) new list([h( 0 ), . . .])                                   [ ] k ⌧ ) h( 0 )
            wrap-list                                                                  comp-strip
            wrap-list
              k⌧ )h           is not a list schema                                     comp-strip
                                                                                              k⌧ )h
                k⌧ )h         is not a list schema                                             k⌧ )h
                   k [⌧ ] ) new list([h( )])                                            {fp : } k ⌧ ) h( .p)
                   k [⌧ ] ) new list([h( )])                                            {fp : } k ⌧ ) h( .p)
      comp-comp
      comp-comp                                              i k ⌧i ) hi
                                                            i k ⌧i ) hi
      {f0p0 :   0 , . . . , f n pn   :   n,   g 0g 0 :   n+1 , . . .} k C{f0 :⌧0 ,...,fn :⌧n }   ) new C(h0 ( .p0 ), . . .)
      {f0p0 :   0 , . . . , f n pn   :   n,   g 0g 0 :   n+1 , . . .} k C{f0 :⌧0 ,...,fn :⌧n }   ) new C(h0 ( .p0 ), . . .)
Figure 2: A simple program demonstrating the Java Earthquake library
BART, ET AL.    1
                2
                    import
                    import
                             java . u t i l . List ;
                             j a v a . u t i l . HashSet ;
                3   import   r e a l t i m e w e b . e a r t h q u a k e s e r v i c e . main . E a r t h q u a k e S e r v i c e ;

FIGURE 2        4
                5
                6
                    import   r e a l t i m e w e b . e a r t h q u a k e s e r v i c e . domain . Earthquake ;

                    public c l a s s EarthquakeDemo {
                7
                8       public s t a t i c void main ( S t r i n g [ ] a r g s ) throws E a r t h q u a k e E x c e p ti o n {
                9           // Use t h e E a r t h q u a k e S e r v i c e l i b r a r y
               10           EarthquakeService es = EarthquakeService . getInstance ( ) ;
               11
               12             es . connect ( ) ;           // Remove t o u s e t h e l o c a l c a c h e
               13
               14             // 5 minute d e l a y , b u t i f we u s e t h e c a c h e no d e l a y i s needed !
               15             i n t DELAY = 5 ⇤ 60 ⇤ 1 0 0 0 ;
               16
               17             HashSet seenQuakes = new HashSet ( ) ;
               18
               19             // P o l l s e r v i c e r e g u l a r l y
               20             while ( true ) {
               21                 // Get a l l e a r t h q u a k e s i n t h e p a s t hour
               22                 L i s t  l a t e s t = e s . g e t E a r t h q u a k e s ( H i s t o r y . ALL ) ;
               23                 // Check i f t h i s i s a new e a r t h q u a k e
               24                 f o r ( Earthquake e : l a t e s t ) {
               25                         i f ( ! seenQuakes . c o n t a i n s ( e ) ) {
               26                                // Report new e a r t h q u a k e s
               27                                System . out . p r i n t l n ( ”New quake ! ” ) ;
               28                                seenQuakes . add ( e ) ;
               29                         }
               30                 }
               31                 // Delay t o a v o i d spamming t h e w e a t h e r s e r v i c e
               32                 Thread . s l e e p (DELAY) ;
               33             }
               34       }
               35   }
A.       Earthquake Program Code
         The following is the complete source code of a program duplicating
EQUIVALENT
         the behavior of EarthquakeDemo from Figure 2 of [4].
         import   big.data.*;
         import   java.util.Date;
         import   java.util.HashSet;
         import   java.util.List;

         public class EarthquakeDemo {
             public static void main(String[] args) {
                 int DELAY = 5;   // 5 minute cache delay

                   DataSource ds = DataSource.connectJSON("http://earthquake.usgs.gov/earthquakes/feed/v1
                   ds.setCacheTimeout(DELAY);

                   ds.load();
                   ds.printUsageString();

                   HashSet quakes = new HashSet();

                   while (true) {
                       ds.load();                          // this only actually reloads data when the cac
                       List latest = ds.fetchList("Earthquake",
                               "features/properties/title",
                               "features/properties/time",
                               "features/properties/mag",
                               "features/properties/url");
                       for (Earthquake e : latest) {
                           if (!quakes.contains(e)) {
                               System.out.println("New quake!... " + e.description + " (" + e.date() + ")
                               quakes.add(e);
                           }
                       }
                   }
              }
         }
quakes.add(e);
                 }
            }
PLUS…
}
    }
        }

class Earthquake {                    // this class may be instructor-provided, or left to students to define as an exercise
    String description;
    long timestamp;
    float magnitude;
    String url;

    public Earthquake(String description, long timestamp, float magnitude, String url) {
        this.description = description;
        this.timestamp = timestamp;
        this.magnitude = magnitude;
        this.url = url;
    }

    public Date date() {
        return new Date(timestamp);
    }

    public boolean equals(Object o) {        // introductory CS students would probably implement a simpler version of this
        if (o.getClass() != this.getClass())
            return false;
        Earthquake that = (Earthquake) o;
        return that.description.equals(this.description)
                && that.timestamp == this.timestamp
                && that.magnitude == this.magnitude;
    }

    public int hashCode() {                // technically, hashCode() should be overridden if equals() is
        return (int) (31 * (31 * this.description.hashCode()
                + this.timestamp) + this.magnitude);
    }
}

   Sample output of this program is provided on the next page.
Note that this is a continuously-running program. Once it prints
out the initial current set of reports, it continues to loop, report-
ing new events as they happen. A number of interesting extensions
You can also read