TRANSFORMING SINGLE SPREADSHEETS INTO NORMALIZED TABLES USING EXCEL - Kaylee P. Alexander


Kaylee P. Alexander                          askdata@duke.edu
DESCRIPTION OF WORKSHOP DATA

In this workshop we will be using a sample dataset that contains information
about art sales that took place at the Knoedler Gallery in New York during the
year 1946.

The master datasheet consists of 228 rows and 14 columns (see table below).

  Variable             Data Type     Description
  artist               string        name of the primary creator of the object sold
  artist_nationality   string        the nationality of the artist
  object_title         string        the title of the object sold
  category1            categorical   the primary subject category of the object sold
  category2            categorical   the secondary subject category of the object sold
  type                 categorical   the type of object sold
  dimensions_in        string        the dimensions of the object sold in inches
  sale_year            string        year in which sale took place
  sale_month           string        month in which sale took place
  sale_day             string        day on which sale took place
  sales_total          numerical     price object sold for
  sales_currency       categorical   currency object was sold in
  seller               string        name of the seller
  buyer                string        name of the buyer

Data Source: http://www.getty.edu/research/tools/provenance/
SINGLE DATASHEET

  Knoedler Main – artist, artist_nationality, object_title, category1,
  category2, type, dimensions_in, sale_year, sale_month, sale_day, sales_total,
  sales_currency, seller, buyer

RELATIONAL DATA MODEL

  Artists – artist_id, artist_name, artist_nationality
  Collectors – collector_id, collector_name
  Sales_1946 – sale_id, artist_id, object_title, category1_id, category2_id,
  type_id, dimensions_in, sale_year, sale_month, sale_day, sales_total,
  sales_currency, seller_id, buyer_id
  Subject Categories – category_id, category
  Object Types – type_id, type
GOALS

• Demonstrate simple steps in Excel that can be used to transform a
  single spreadsheet into a series of normalized tables for a relational
  data model
    • Identify data entities and attributes
    • Create unique value lists using UNIQUE( )
    • Assign unique keys to new tables
    • Populate columns in new tables using VLOOKUP( )
    • Relate tables using foreign keys
Transforming Single Spreadsheets into
                 Normalized Tables Using Excel
    Prepared by Kaylee P. Alexander | CDVS Graduate Assistant, Summer 2020

I. PURPOSE AND GOALS
 The purpose of this workshop is to demonstrate simple steps in Excel that you can
 take to transform a single spreadsheet (such as a master copy of your data that you
 used to facilitate the gathering process) into a series of normalized tables that can
 be used to populate a relational database model using, for example, MySQL. In
 order to accomplish this, we first identify entities and corresponding attributes of
 those entities within the master datasheet, and then create separate tables for each
 entity that can be connected to one another by foreign keys (columns that reference,
 by means of an identification code, columns present in other tables). This workshop
 does not require any coding experience, but it is recommended that users are
 familiar with Excel basics.

 Over the course of the workshop we will transform a single datasheet into five
 tables, each representing a set of unique entities (and their related attributes) from
 the original datasheet as well as a unique key that can be used to create
 relationships between the different tables.

II. WHAT IS A RELATIONAL DATA MODEL?
 A relational data model organizes data into a series of tables containing columns
 (‘attributes’) and rows (‘records’) with unique keys identifying each record. Each table
 (or, ‘relation’) represents a single entity type and its attributes. The primary benefits
 of a relational data model include ensuring consistency as well as performing
 combinations of queries to understand various relationships that exist among the
 information contained in the various tables that would be otherwise difficult to
 determine from a single spreadsheet. An additional benefit is the ability to add
 records and edit information without the risk of compromising other information
 contained in the database. For more on relational data models, see:
 https://www.oracle.com/database/what-is-a-relational-database/.

III. WHAT IS DATA NORMALIZATION?
 Data normalization is the process of organizing and restructuring data attributes
 within a relational data model in order to reduce redundancy in the data set,
 increase consistency, and facilitate querying. For more on database normalization,
 see: http://agiledata.org/essays/dataNormalization.html.

IV. EXCEL FUNCTIONS USED IN THIS WORKSHOP
  •   UNIQUE(array, [by_col], [exactly_once]) – returns a list of unique values in a list
      or range
          o array – range or array from which to extract unique values
          o by_col – [optional] FALSE = compare rows against each other and return
             the unique rows (default); TRUE = compare columns and return the
             unique columns
          o exactly_once – [optional] FALSE = all distinct values (default); TRUE =
             only values that occur exactly once

  •   VLOOKUP(value, table, col_index, [range_lookup]) – looks up a value in a table
      by matching on the first column and returns the matched value
         o value – value to look for in the first column of a table
         o table – table from which to retrieve a value
         o col_index – The column in the table from which to retrieve a value.
         o range_lookup – [optional] TRUE = approximate match (default); FALSE =
            exact match

V. DESCRIPTION OF WORKSHOP DATA
 In this workshop we will be using a sample dataset that contains information about
 art sales that took place at the Knoedler Gallery in New York during the year 1946.
 The master datasheet consists of 228 rows and 14 columns (see table below). Data
 Source: Getty Provenance Index.

  Variable             Data Type     Description
  artist               string        name of the primary creator of the object sold
  artist_nationality   string        the nationality of the artist
  object_title         string        the title of the object sold
  category1            categorical   the primary subject category of the object sold
  category2            categorical   the secondary subject category of the object sold
  type                 categorical   the type of object sold
  dimensions_in        string        the dimensions of the object sold in inches
  sale_year            string        year in which sale took place
  sale_month           string        month in which sale took place
  sale_day             string        day on which sale took place
  sales_total          numerical     price object sold for
  sales_currency       categorical   currency object was sold in
  seller               string        name of the seller
  buyer                string        name of the buyer

WORKSHOP INSTRUCTIONS

I. IDENTIFYING ENTITIES & ATTRIBUTES
 The first step in breaking down a master spreadsheet into a series of
 tables is to identify the various entities in your dataset and begin to map out
 your data model. A data entity is an object in a data model. In deciding which
 variables to consider entities, think about columns that may have repeated
 information (e.g. names, object categories, etc.). In considering which variables
 should be attributes of these entities, consider which columns contain information
 about the specific entity. For example, if you have a dataset containing information
 on sales that also contains information about the buyer, such as an address, then
 sales and buyers would represent two different entities and be split into two tables
 (sales and buyers), while buyer_address would become an attribute within the new
 table for buyers. The newly created tables for sales and buyers would then be
 related by a common key column, buyer_id.

 For the purposes of this tutorial, our data entities have been identified as the
 following:

     1.   Sales_1946 – table of all sale events that took place in the year 1946
     2.   Artists – all unique values for artists in the dataset
     3.   Object Types – all unique values for object types
     4.   Subject Categories – all unique values for subject categories
     5.   Collectors – all unique values for sellers and buyers

 These five entities will become the names of the tables in our relational data model.
 The attributes for each of these tables will be as follows:

     1. Sales_1946 – sale_id, artist_id, object_title, category1_id, category2_id,
        type_id, dimensions_in, sale_year, sale_month, sale_day, sales_total,
        sales_currency, seller_id, buyer_id
     2. Artists – artist_id, artist_name, artist_nationality
     3. Object Types – type_id, type
     4. Subject Categories – category_id, category
     5. Collectors – collector_id, collector_name

 N.B. Over the course of the following steps, we will create the ID (key) columns that are listed as
 attributes above.

II. CREATING UNIQUE VALUE LISTS FOR EACH ENTITY
  Now that we have identified our data entities, we are ready to begin splitting up
  our master datasheet.

  To begin, we need to create unique value lists for the following
  columns in the original dataset: artist, category1, category2, type, seller, and
  buyer. These variables will form the basis of our artists, subject categories, object
  types, and collectors entity tables. Since the information in these columns is likely to
  be repeated in our dataset, we want to make sure that we have just one record per
  possible value in each of our entity tables.

  Let’s begin by making the new table for object types, since this will only contain two
  columns—type_id and type—and only contains values from one of the original
  columns.

    1. Open a new sheet in your Excel workbook.

    2. Rename this sheet ‘Object Types.’

    3. Title the first two columns in this new sheet ‘type’ and ‘type_id,’ respectively.

    4. Go to cell A2 and enter the following formula: =UNIQUE('Knoedler Main'!F:F)

  This identifies all of the unique values included in the master datasheet’s column F
  (type) and lists them in your new sheet. You should now have the following listed
  under type in the Object Types sheet:

            •   type
            •   Painting
            •   Sculpture
            •   Pastel
            •   Watercolor
            •   0

  Since we asked for the unique values of the whole of column F, ‘type’ and ‘0’ have
  been included as additional unique values: ‘type’ is the column header, and ‘0’
  stands in for the blank cells in the column. We don’t want to include the
  column header or the zero value in our list, but we can’t just delete the items we
  don’t want, because the list is produced by a formula and doing so would erase all
  of the other values. To fix this:

    5. Highlight the whole column and right click to cut it.

    6. Then, right click on an empty column and select Paste special → values. (This
       will paste only the cell values rather than the formula.)

    7. Now you have all of the unique values from the original spreadsheet listed
       here and can delete the first and last rows containing ‘type’ and ‘0,’
       respectively.

    8. Cut and paste the table to return it to columns A and B of the spreadsheet.
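
  For readers who want to check the logic of steps 4 through 8 outside Excel, here is
  a minimal Python sketch. It is not part of the workshop workflow. The list column_f
  is a hypothetical stand-in for column F of the master datasheet (only the four type
  names come from the workshop data), and the helper unique_values is an
  illustrative name, not an Excel or library function.

```python
# Hypothetical stand-in for column F ("type") of the master datasheet.
# The empty string plays the role of a blank cell, which UNIQUE() shows as 0.
column_f = ["type", "Painting", "Sculpture", "Painting", "Pastel",
            "Watercolor", "Sculpture", ""]


def unique_values(cells, header):
    """Return distinct values in first-seen order, without header or blanks."""
    seen = []
    for cell in cells:
        if cell == header or cell == "":
            continue  # same effect as deleting the 'type' and '0' rows by hand
        if cell not in seen:
            seen.append(cell)
    return seen


object_types = unique_values(column_f, header="type")
print(object_types)
```

  The sketch keeps values in the order they first appear, which is also how the
  list in the Object Types sheet is ordered.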

III. ASSIGNING UNIQUE KEYS
  Now we want to assign primary key values for the column type_id. The ID
  numbers—unique keys—that we generate here will later be used to create
  relationships among our tables.

    9. Enter the value ‘1’ in cell B2.

    10. Double-click on the lower right corner of the cell to fill down, then hover over
        the drop-down list and select Fill Series.

  This will assign values 1 through 4 to your types. Although you can begin your ID
  numbers with 1 for every table, it is advisable to use a different numbering system
  for each entity to avoid confusion. For a small value list such as this, single digits
  work fine, but for lists of hundreds or more values, you may want to begin with
  a larger number, such as 5000. For the purposes of this tutorial you may
  choose any numbering format you’d like for each table.
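
  The numbering idea can be sketched in Python as follows. This is an illustration
  only: assign_keys is a hypothetical helper, the artist names are placeholders
  rather than values from the dataset, and the starting number 5000 is simply the
  example offset mentioned above.

```python
object_types = ["Painting", "Sculpture", "Pastel", "Watercolor"]


def assign_keys(values, start=1):
    """Pair each value with a sequential ID, like Fill Series from `start`."""
    return [(value, start + offset) for offset, value in enumerate(values)]


# Object types are numbered 1 through 4.
type_table = assign_keys(object_types)

# A second table can use its own range so that IDs are easy to tell apart.
artist_table = assign_keys(["Artist A", "Artist B"], start=5000)

print(type_table)
print(artist_table)
```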

IV. TRY IT YOURSELF – Creating an Entity Table
    11. Repeat steps 1 through 10 for the column artist in the master datasheet.
        (Remember to adjust your =UNIQUE( ) function to match the column for artist,
        and try filling down with different starting ID numbers)

  For our Artists entity table we want to include an additional attribute for each
  record, so we will need to add an additional column to our table.

    12. Title the third column in the table Artists ‘artist_nationality.’

  In the next section we will look at how to populate this column with data from our
  master datasheet.

V. USING THE VLOOKUP( ) FUNCTION TO FILL ATTRIBUTE COLUMNS
  Now that we have our Artists table, we want to fill in the attribute artist_nationality
  with information from our master datasheet. Instead of going through this list and
  manually adding the nationalities for each artist, we can use the VLOOKUP( )
  function in Excel to look up an artist’s name in our Artists table, match it to a value
  in the column artist from the master datasheet, and return the matching value from
  the column artist_nationality for that row.

    13. Enter the following formula into cell C2 of the table Artists:
        =VLOOKUP(A2,'Knoedler Main'!A:B,2,FALSE)

  This looks up the artist name in cell A2 of the Artists table, matches the value to the
  same name in column A of the master datasheet, then returns the value in the
  second column of the lookup range (column B), which contains the nationality for the
  matched artist’s name, to cell C2 of the Artists table. The FALSE indicates that
  you only want exact matches rather than approximate matches. (This assumes the
  layout produced by steps 1 through 10, with artist names in column A and IDs in
  column B of the Artists table. If your artist names are in column B instead, look
  up B2.)

    14. Double-click on the lower right corner of the cell to fill down. It should default
        to Fill series, but you can always double check by hovering over the drop-
        down to see which fill down form is selected.

    15. As we did after using the UNIQUE( ) function, select the whole table (all three
        columns), right click to cut the table, and then Paste special → values into
        an empty column to remove the formula and save only the returned values.

    16. Cut and paste the table to return it to columns A, B and C of the spreadsheet.
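
  As a rough illustration of what the exact-match lookup in step 13 does, here is a
  short Python sketch. The rows in master_rows are invented placeholders standing in
  for columns A and B of the master datasheet, and vlookup_exact is a hypothetical
  helper written for this example. One simplification to note: the comparison here
  is case-sensitive, whereas VLOOKUP( ) in Excel ignores case.

```python
# Hypothetical rows standing in for columns A:B of 'Knoedler Main'
# (artist, artist_nationality). The names are placeholders.
master_rows = [
    ("Artist A", "French"),
    ("Artist B", "American"),
    ("Artist A", "French"),
]


def vlookup_exact(value, table, col_index):
    """Return the col_index-th field (1-based) of the first row whose first
    field equals `value`, or None where Excel would show #N/A."""
    for row in table:
        if row[0] == value:
            return row[col_index - 1]
    return None


artist_names = ["Artist A", "Artist B"]
artists = [(name, vlookup_exact(name, master_rows, 2)) for name in artist_names]
print(artists)
```

  Like the Excel function, the sketch stops at the first matching row, so it relies
  on each artist having the same nationality in every row of the master datasheet.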

VI. USING DATA FROM TWO COLUMNS TO CREATE AN ENTITY TABLE
  Since some of the buyers in this dataset might also be sellers and vice versa, it
  would be useful to create just one entity table for all “collectors” present in the
  master datasheet. In order to do this we will create a unique value list for all of the
  names that appear in the buyer and seller columns of our master datasheet.

    17. Open a new sheet in the Excel workbook.

    18. Rename this sheet ‘Collectors.’

    19. Copy and paste all of the data from the column buyer into column A of this
        new sheet.

    20. Then, copy and paste all of the data from the column seller to the end of the
        list in column A of the new sheet.

    21. In column B of the Collectors sheet, enter the following formula in cell B2:
        =UNIQUE(A:A).

  This grabs all of the unique values from the combined list of buyers and sellers to
  create one unique value list for all collectors.

    22. Cut the resulting list and Paste special → values to place only the text
        values in column C, starting at cell C2.

    23. Delete columns A and B, as well as any rows with ‘0.’

    24. Name what is now column A ‘collector_name.’

    25. Name what is now column B ‘collector_id.’

    26. Assign primary key values to column B, as we did in Part III.
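
  The combine-then-deduplicate pattern in steps 19 through 26 can be sketched in
  Python like this. The buyer and seller names are placeholders, not values from the
  dataset, and the starting ID of 100 is an arbitrary choice for the example.

```python
# Placeholder values standing in for the buyer and seller columns;
# empty strings represent blank cells.
buyers = ["Collector A", "Collector B", "", "Collector A"]
sellers = ["Collector C", "Collector B", ""]

# Steps 19-20: stack both columns into one list.
combined = buyers + sellers

# Step 21 and step 23: keep each name once and drop the blanks.
collector_names = []
for name in combined:
    if name and name not in collector_names:
        collector_names.append(name)

# Step 26: assign a key to every collector (starting number is arbitrary).
collectors = [(name, 100 + i) for i, name in enumerate(collector_names)]
print(collectors)
```

  Because a name that appears as both a buyer and a seller is kept only once, it
  receives a single collector_id that both roles can refer to.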

VII. TRY IT YOURSELF – Using Two Columns to Create an Entity Table
    27. Repeat steps 17 through 26 for the columns category1 and category2 in the
        master datasheet to create the entity table, Subject Categories, with a column
        for category and category_id.

VIII. RELATING TABLES USING FOREIGN KEYS
  Now that we have created four of our five entity tables, we need to create our final
  table for our sales entities. This table will have each sale event as a record (row),
  populated with the data pertaining to the specifics of the sale (object_title,
  dimensions_in, sale_year, sale_month, sale_day, sales_total, and sales_currency) as well
  as the foreign keys (ID numbers) for the following variables in our other tables:
  artist_id, category1_id, category2_id, type_id, seller_id, buyer_id. We will also
  create a sale_id column to act as a primary key in this new table, Sales_1946. Here,
  we will again make use of the VLOOKUP( ) function in Excel.

 N.B. For VLOOKUP( ) to work in the following steps, the column containing the entity name (i.e.
 artist_name, type, category, and collector_name) must be the left-most column of each of the
 four new tables, and the primary key column must be the right-most column. If this is not
 already the case, move the key columns to this position.

   28. Copy all of the data from the original file into a new sheet.

   29. Insert a new column between artist and artist_nationality. Name this column
       artist_id.

We will now use the VLOOKUP() function to populate this new column with the keys
for artist names that we created for the Artists table:

   30. Enter the following formula into cell B2 of the new table:
       =VLOOKUP(A2,Artists!A:C,3,FALSE)

   31. Then, fill down (fill series) for all other rows.

   32. Repeat this process to create the columns category1_id, category2_id,
       type_id, seller_id and buyer_id. Remember that category1_id and
       category2_id will both populate from the Subject Categories table, and
       seller_id and buyer_id will both populate from the Collectors table.

 N.B. If no exact matching value can be found (i.e. there is a NULL value for that particular variable
 within that record), the formula will return #N/A to the cell. Leave this for now; we will return to
 it later on.

   33. Copy all of the data from the sheet you’ve just populated and Paste special →
       values into a new sheet.

   34. Title this new sheet Sales_1946. You may now delete the sheet you populated
       in steps 28 through 32.

Now that we’ve used the keys from our other four tables to create relationships to
Sales_1946, we can delete the columns containing data from our other tables.

   35. Delete the following columns from Sales_1946:
           • artist
           • artist_nationality
           • category1
           • category2
           • type
           • seller
           • buyer

    36. Finally, add a sale_id column for Sales_1946, and fill series to create a primary
        key column for the table.
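
  To make the end result of steps 28 through 36 concrete, here is a small Python
  sketch that swaps names for foreign keys and adds a sale_id. All names, titles,
  and ID values are invented placeholders, only a few of the Sales_1946 columns
  are shown, and to_foreign_keys is a hypothetical helper for this example. A
  missing match yields None, which corresponds to the #N/A that Excel returns.

```python
# Placeholder entity tables: name -> key.
artists = {"Artist A": 5000, "Artist B": 5001}
collectors = {"Collector A": 100, "Collector B": 101}

# Two placeholder rows from the master datasheet (subset of columns).
sales = [
    {"artist": "Artist A", "object_title": "Title 1",
     "seller": "Collector A", "buyer": "Collector B"},
    {"artist": "Artist B", "object_title": "Title 2",
     "seller": "", "buyer": "Collector A"},
]


def to_foreign_keys(sale):
    """Replace each name with its key; unmatched names become None."""
    return {
        "artist_id": artists.get(sale["artist"]),
        "object_title": sale["object_title"],
        "seller_id": collectors.get(sale["seller"]),
        "buyer_id": collectors.get(sale["buyer"]),
    }


# Step 36: number the rows to create the primary key sale_id.
sales_1946 = [dict(sale_id=i, **to_foreign_keys(sale))
              for i, sale in enumerate(sales, start=1)]
for row in sales_1946:
    print(row)
```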

IX. FINISHING UP (optional)
  For programs such as MySQL, you will want to make sure that any blank cells read as
  nulls, which means that NULL should appear in each of these null-value cells. You
  can use search and replace in Excel to make sure that this is the case. You will also
  want to do this for any values that appear as #N/A, as these will not read as nulls in
  MySQL.

  Finally, now that you have created these five normalized tables based on the
  original master datasheet, you can save each of these as a .csv file that can be
  uploaded into programs such as phpMyAdmin or MySQL Workbench to populate a
  relational database model with existing data.
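
  If you prefer to script this last clean-up step, the following Python sketch shows
  one way to replace blanks and #N/A values with NULL while writing a .csv file. It
  is a sketch under stated assumptions: the rows are invented placeholders,
  null_if_missing is a hypothetical helper, and the output is written to an
  in-memory buffer rather than to a file. How a literal NULL is interpreted on
  import depends on the tool you use, so check its import settings.

```python
import csv
import io

# Placeholder rows standing in for a few Sales_1946 columns.
rows = [
    {"sale_id": 1, "artist_id": 5000, "seller_id": "#N/A", "buyer_id": 101},
    {"sale_id": 2, "artist_id": 5001, "seller_id": 100, "buyer_id": ""},
]


def null_if_missing(value):
    """Return the text NULL for blanks, #N/A, and None; otherwise the value."""
    return "NULL" if value in ("", "#N/A", None) else value


buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
writer.writeheader()
for row in rows:
    writer.writerow({key: null_if_missing(val) for key, val in row.items()})

csv_text = buffer.getvalue()
print(csv_text)
```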
