Natural Language Processing for Computer Scientists and Data Scientists at a Large State University

Casey Kennington
Department of Computer Science
Boise State University
caseykennington@boisestate.edu

Abstract

The field of Natural Language Processing (NLP) changes rapidly, requiring course offerings to adjust with those changes, and NLP is not just for computer scientists; it is a field that should be accessible to anyone who has a sufficient background. In this paper, I explain how students with Computer Science and Data Science backgrounds can be well prepared for an upper-division NLP course at a large state university. The course covers probability and information theory, elementary linguistics, and machine and deep learning, with an attempt to balance theoretical ideas and concepts with practical applications. I explain the course objectives, topics, and assignments, and reflect on adjustments to the course over the last four years, as well as feedback from students.

1 Introduction

Thanks in part to access to large datasets, increases in compute power, and easy-to-use programming libraries that leverage neural architectures, the field of Natural Language Processing (NLP) has become more popular and has seen more widespread adoption in research and in commercial products. On the research side, the Association for Computational Linguistics (ACL) conference (the flagship NLP conference) and related annual conferences have seen dramatic increases in paper submissions. For example, in 2020 ACL had 3,429 paper submissions, whereas 2019 had 2,905, and this upward trend has been happening for several years. Certainly, lowering barriers to access of NLP methods and tools for researchers and practitioners is a welcome direction for the field, enabling researchers and practitioners from many disciplines to make use of NLP.

It is therefore becoming more important to better equip students with an understanding of NLP to prepare them for careers either directly related to NLP, or which leverage NLP skills. In this paper, I reflect on my experience setting up and maintaining a class in NLP at Boise State University, a large state university, how to prepare students for research and industry careers, and how the class has changed over four years to fit the needs of students.

The next section explains the course objectives. I then explain challenges that are likely common to many university student populations and how the NLP course is designed for students with Data Science and Computer Science backgrounds, then I explain course content, including lecture topics and assignments that are designed to fulfill the course objectives. I then offer a reflection on the three times I taught this course over the past four years, and future plans for the course.

2 Course Objectives

Most students who take an NLP course will not pursue a career in NLP proper; rather, they take the course to learn skills that will help them find employment or do research in areas that make use of language, largely focusing on the medium of text (e.g., anthropology, information retrieval, artificial intelligence, data mining, social media network analysis, provided they have sufficient data science training). My goal for the students is that they can identify aspects of natural language (phonetics, syntax, semantics, etc.) and how each can be processed by a computer, explain the differences between classification models and approaches, be able to map from (basic) formalisms to functional code, and use existing tools, libraries, and datasets for learning, while attempting to strike a balance between theory and practice. In my view, there are several aspects of NLP that anyone needs to grasp, along with how to apply NLP techniques in novel circumstances. Those aspects are illustrated in Figure 1.
No single NLP course can possibly account for a level of depth in all of the aspects in Figure 1, but a student who has taken courses or has experience in at least two of the areas (e.g., they have taken a statistics course and have experience with Python, or they have taken linguistics courses and have used some data science or machine learning libraries) will find success in the course more easily than those with no experience in any aspect.

This introduces a challenge that has been explored in prior work on teaching NLP (Fosler-Lussier, 2008): the diversity of the student population. NLP is a discipline that is not just for computer science students, but it is challenging to prepare students for the technical skills required in an NLP course. Moreover, similar to the student population in Fosler-Lussier (2008), there should be course offerings for both graduate and undergraduate students. In my case, which is fairly common in academia, as the sole NLP researcher at the university I can only offer one course once every four semesters, for both graduate and undergraduate students, and for students with varied backgrounds, not only computer science. As a result, this is not a research methods course; rather, it is more geared towards learning the important concepts and technical skills surrounding recent advances in NLP. Others have attempted to gear the course content and delivery towards research (Freedman, 2008), giving the students the opportunity to have open-ended assignments. I may consider this for future offerings, but for now the final project acts as an open-ended assignment, though I don't require students to read and understand recent research papers.

In the following section, I explain how we prepare students of diverse backgrounds to succeed in an NLP course for upper-division undergraduate and graduate students.

Figure 1: Areas of content that are important for a course in Natural Language Processing.

3 Preparing Students with Diverse Academic and Technical Backgrounds

Boise State University is the largest university in Idaho, situated in the capital of the State of Idaho. The university has a high number of non-traditional students (e.g., students outside the traditional student age range, or second-degree seeking students). Moreover, the university has a high acceptance rate (over 80%) for incoming first-year students. As is the case with many universities and organizations, a greater need for "computational thinking" among students of many disciplines has been an important driver of recent changes in course offerings across many departments. Moreover, certain departments have responded to the need and student interest in machine learning course offerings. In this section, we discuss how we altered the Data Science and Computer Science curricula to meet these needs and the implications these changes have had on the NLP course.[1]

[1] We do not have a Data Science department or major degree program; the Data Science courses are housed in different departments across campus.

Data Science: The Data Science offerings begin with a foundational course (based on Berkeley's data8 content[2]) that has only a very basic math prerequisite. It introduces and allows students to practice Python, Jupyter notebooks, data analysis and visualization, and basic statistics (including the bootstrap method of statistical significance). Several courses follow this course that are more domain specific, giving the students options for gaining practical experience in Data Science skills relative to their abilities and career goals. One path, more geared towards students of STEM-related majors (though not targeting Computer Science majors) as well as some majors in the Humanities, is a certificate program that includes the foundational course, a follow-on course that gives students experience with more data analysis as well as probability and information theory, an introductory machine learning course, and a course on databases. The courses largely use Python as the programming language of choice.

[2] http://data8.org/
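To give a flavor of the statistical content in that foundational course, below is a minimal sketch of the bootstrap approach to significance; the function name and toy data are illustrative, not taken from the course materials. It resamples two groups with replacement and builds a percentile confidence interval around an observed difference of means.

```python
import numpy as np

def bootstrap_diff_of_means(a, b, n_resamples=10_000, seed=0):
    """Bootstrap the difference of means between two samples.

    Returns the observed difference and a 95% percentile
    confidence interval over the resampled differences.
    """
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    observed = a.mean() - b.mean()
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        # Resample each group with replacement, same size as the original
        diffs[i] = (rng.choice(a, a.size).mean()
                    - rng.choice(b, b.size).mean())
    low, high = np.percentile(diffs, [2.5, 97.5])
    return observed, (low, high)

# Example: average sentence lengths from two (toy) text collections
group_a = [12, 15, 9, 20, 14, 11, 17]
group_b = [8, 10, 7, 12, 9, 11, 6]
obs, ci = bootstrap_diff_of_means(group_a, group_b)
print(f"observed diff: {obs:.2f}, 95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

If the interval excludes zero, students have informal evidence that the two groups genuinely differ, without needing any distributional assumptions.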
Figure 2: Two course paths that prepare students to succeed in the NLP course: Data Science and Computer Science. Solid arrows denote prerequisites, dashed arrows denote co-requisites. Solid outlines denote Data Science and Computer Science courses, dashed outlines denote Math courses.

Computer Science: In parallel to the changes in Data Science-related courses, the Department of Computer Science has seen increased enrollment and increased requests for machine learning-related courses. The department offers several such courses, though they focus on upper-division students (e.g., artificial intelligence, applied deep learning, information retrieval and recommender systems). This is a challenge because the main Computer Science curriculum focuses on procedural languages such as Java with little or no exposure to Python (similar to the student population reported in Freedman (2008)), forcing the upper-division machine learning-related courses to spend time teaching Python to the students. This cannot be underestimated: programming languages may not be natural languages, but they are languages in that it takes time to learn them, particularly the relevant Python libraries. To help meet this challenge, the department has recently introduced a Machine Learning Emphasis that makes use of the Data Science courses mentioned above, along with additional requirements for math and upper-division Computer Science courses.

Prerequisites: Unlike Agarwal (2013), which attempted an introductory course for all levels of students, this course is designed for upper-division students or graduate students. Students of either discipline, Computer Science or Data Science, are prepared for this NLP course, albeit in different ways. Computer Scientists can think computationally, have experience with a syntactic formalism, and have experience with statistics. Data Science students likewise have experience with statistics as well as Python, including data analysis skills. Though the backgrounds can be quite diverse, my NLP course allows two prerequisite paths: all students must take a statistics course, but Computer Science students must take a Programming Languages course (which covers context free grammars for parsing computer languages and now covers some Python programming), and the Data Science students must have taken the introductory machine learning course. Figure 2 depicts the two course paths visually.

4 NLP Course Content

In this section, I discuss course content, including topics and assignments that are designed to meet the course objectives listed above. Woven into the topics and assignments are the themes of ambiguity and limitations, explained below.

4.1 Topics & Assignments

Theme of Ambiguity: Figure 3 shows the topics (solid outlines) that roughly translate to a single lecture, though some topics require multiple lectures. One main theme that is repeated throughout the course, but is not a specific lecture topic, is ambiguity. This helps the students understand differences between natural human languages and programming languages. The Introduction to Linguistics topic, for example, gives (very) high-level overviews of phonetics, morphology, syntax, semantics, and pragmatics, with examples of ambiguity for each area of linguistics (e.g., phonetic ambiguity is illustrated by hearing someone say "it's hard to recognize speech", which could be heard as "it's hard to wreck a nice beach", and syntactic ambiguity is illustrated by the sentence "I saw the person with the glasses" having more than one syntactic parse).
Probability and Information Theory: This course does not focus only on deep learning, though many university NLP offerings seem to be moving to deep-learning-only courses. There are several reasons not to focus on deep learning for a university like Boise State. First, students will not have a depth of background in probability and information theory, nor will they have a deep understanding of optimization (both convex and non-convex) or error functions in neural networks (e.g., cross entropy). I take time early in the course to explain discrete and continuous probability, and information theory. Discrete probability theory is straightforward as it requires counting, something that is intuitive when working with language data represented as text strings. Continuous probability theory, I have found, is more difficult for students to grasp as it relates to machine learning or NLP, but building on students' understanding of discrete probability theory seems to work pedagogically. For example, if we use continuous data and try somehow to count values in that data, it's not clear what should be counted (e.g., using binning), highlighting the importance of continuous probability functions that fit around the data, and the importance of estimating parameters for those continuous functions, an important concept for understanding classifiers later in the course. To illustrate both discrete and continuous probability, I show students how to program a discrete Naive Bayes classifier (using ham/spam email classification as a task) and a continuous Gaussian Naive Bayes classifier (using the well-known iris dataset) from scratch. Both classifiers have similarities, but the differences illustrate how continuous classifiers learn parameters.
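As a rough sketch of the contrast these two from-scratch classifiers illustrate (the helper names and toy ham/spam data below are mine, not the actual course code), discrete Naive Bayes gets its parameters purely by counting, while the Gaussian case replaces counting with estimating a mean and variance per feature:

```python
import math
from collections import Counter, defaultdict

# --- Discrete Naive Bayes: parameters come from counting ---
def train_discrete_nb(docs, labels):
    """Estimate P(label) and P(word|label) by counting (add-one smoothed)."""
    label_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        for word in doc.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def predict_discrete_nb(doc, label_counts, word_counts, vocab):
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        logp = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in doc.split():
            logp += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = logp
    return max(scores, key=scores.get)

# --- Continuous (Gaussian) Naive Bayes: parameters are estimated ---
def gaussian_logpdf(x, mean, var):
    """Log density of x under a Gaussian; counting is replaced by
    estimating a mean and variance per feature per class."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

spam = ["win money now", "free money"]
ham = ["meeting at noon", "lunch at noon"]
lc, wc, vb = train_discrete_nb(spam + ham, ["spam"] * 2 + ["ham"] * 2)
print(predict_discrete_nb("free money now", lc, wc, vb))  # -> spam
print(gaussian_logpdf(5.1, mean=5.0, var=0.2))  # e.g., a petal-length feature
```

The discrete classifier's "training" is nothing but tallies over strings; the Gaussian version must commit to a functional form and fit its parameters, which is exactly the conceptual step that prepares students for classifiers later in the course.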
Sequential Thinking: Students experience probability and information theory in a targeted and highly-scaffolded assignment. They then extend their knowledge and program, from scratch, a part-of-speech tagger using counting to estimate probabilities modeled as Hidden Markov Models. These models seem old-fashioned, but they help students gain experience beyond the standard machine learning workflow of mapping many-to-one (i.e., features to a distribution over classes) because this is a many-to-many sequential task (i.e., many words to many parts of speech), an important concept to understand when working with sequential data like language. It also helps students understand that NLP often goes beyond just fitting "models" because it requires things like building a trellis and decoding a trellis (undergraduate students are required to program a greedy decoder; graduate students are required to program a Viterbi decoder). This is a challenging assignment for most students, irrespective of their technical background, but grasping the concepts of this assignment helps them grasp more difficult concepts that follow.
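A minimal sketch of what the assignment builds is below; the function names and the toy corpus are illustrative, and the real assignment uses a proper tagged corpus (and, for graduate students, Viterbi decoding over the full trellis):

```python
from collections import Counter, defaultdict

def train_hmm(tagged_sents):
    """Estimate transition and emission distributions by counting:
    P(tag | previous tag) and P(word | tag)."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sents:
        prev = "<s>"
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit

def prob(counter, key):
    # Crude add-one smoothing so unseen events get non-zero probability
    total = sum(counter.values())
    return (counter[key] + 1) / (total + len(counter) + 1)

def greedy_decode(words, trans, emit):
    """Greedy decoding: commit to the best tag at each step given the
    previous choice. Viterbi instead searches the whole trellis."""
    tags, prev = [], "<s>"
    for word in words:
        best = max(emit, key=lambda t: prob(trans[prev], t) * prob(emit[t], word))
        tags.append(best)
        prev = best
    return tags

corpus = [[("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
          [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]]
trans, emit = train_hmm(corpus)
print(greedy_decode(["the", "cat", "runs"], trans, emit))
# -> ['DET', 'NOUN', 'VERB']
```

The many-to-many structure is visible in the decoder: every input word yields an output tag, and each decision depends on the previous one, which is precisely what the standard features-to-class workflow hides.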
Figure 3: Course Topics and Assignments. Colors cluster topics (green = probability and information theory, blue = NLP libraries, red = neural networks, purple = linguistics, yellow = assignments). Courses have solid lines; topic dependencies are illustrated using arrows. Assignments have dashed lines. Assignments are indexed with the topics on which they depend.

Syntax: The Syntax & Parsing assignment also deserves mention. The students use any parser in NLTK to parse a context free grammar of a fictional language with a limited vocabulary.[3] This helps the students think about the structure of language, and while there are other important ways to think about syntax, such as dependencies (which we discuss in the course), another reason for this assignment is to have the students write grammars for a small vocabulary of words in a language they don't know, but also to create a non-lexicalized version of the grammar based on parts of speech, which helps them understand coverage and syntactic ambiguity more concretely.[4] There is no machine learning or estimating a probabilistic grammar here, just parsing.

[3] I opt for Tamarian: https://en.wikipedia.org/wiki/Darmok
[4] I have received feedback from multiple students that this was their favorite assignment.
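A small sketch of the kind of exercise this enables, using NLTK's stock grammar and chart-parsing APIs (the toy English grammar below is illustrative, not the actual Tamarian grammar from the assignment); it reproduces the prepositional-phrase attachment ambiguity discussed earlier:

```python
import nltk

# A toy lexicalized grammar; the assignment instead uses a small
# fictional-language vocabulary that students must write rules for
grammar = nltk.CFG.fromstring("""
  S   -> NP VP
  NP  -> Det N | NP PP
  VP  -> V NP | VP PP
  PP  -> P NP
  Det -> 'the'
  N   -> 'person' | 'glasses'
  V   -> 'saw'
  P   -> 'with'
""")

parser = nltk.ChartParser(grammar)
sentence = "the person saw the person with the glasses".split()

# Syntactic ambiguity: the PP can attach to the VP or to the object NP,
# so the same token sequence yields more than one parse tree
for tree in parser.parse(sentence):
    print(tree)
```

Replacing the lexical rules (Det, N, V, P) with part-of-speech pre-terminals over a tag sequence gives the non-lexicalized version of the grammar, which makes the coverage and ambiguity questions concrete.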

Semantics: An important aspect of my NLP class is semantics. I introduce the students briefly to formal semantics (e.g., first-order logic), WordNet (Miller, 1995), distributional semantics, and grounded semantics. We discuss the merits of representing language "meaning" as embeddings and the limitations of meaning representations trained only on text and how they might be missing important semantic knowledge (Bender and Koller, 2020). The Topic Modeling assignment uses word-level embeddings (e.g., word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014)) to represent texts and gives them an opportunity to begin using a deep learning library (tensorflow & keras, or pytorch).
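As a hedged sketch of how word-level embeddings can represent whole texts (using gensim's downloadable GloVe vectors; the averaging scheme and example sentences are illustrative, not the assignment's actual code):

```python
import numpy as np
import gensim.downloader as api

# Downloads pre-trained 50-dimensional GloVe vectors on first use
vectors = api.load("glove-wiki-gigaword-50")

def embed_text(text):
    """Represent a text as the average of its word embeddings,
    skipping out-of-vocabulary tokens."""
    words = [w for w in text.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

doc_a = embed_text("the senate passed the budget bill")
doc_b = embed_text("lawmakers approved new government spending")
doc_c = embed_text("the team won the championship game")
print(cosine(doc_a, doc_b))  # two politics texts: relatively similar
print(cosine(doc_a, doc_c))  # politics vs. sports: less similar
```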
We then consider how a semantic representation that has knowledge of modalities beyond text (i.e., images) is part of human knowledge (e.g., what is the meaning of the word red?), and how recent work is moving in this direction. Two assignments give the students a deeper understanding of these ideas. The transfer learning assignment requires the students to use convolutional neural networks pre-trained on image data to represent objects in images and train a classifier to identify simple object types, tying images to words. This is extended in the Grounded Semantics assignment, where a binary classifier (based on the words-as-classifiers model introduced in Kennington and Schlangen (2015), then extended to work with images with "real" objects in Schlangen et al. (2016)) is trained for all words in referring expressions to objects in images in the MSCOCO dataset, which is annotated with referring expressions to objects (Mao et al., 2016). Both assignments require ample scaffolding to help guide the students in using the libraries and datasets, and the MSCOCO dataset is much bigger than they are used to, giving them more real-world experience with a larger dataset.
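A minimal sketch of the words-as-classifiers idea under heavy simplification (the toy color features below stand in for the CNN-derived visual features used in the actual assignment):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy visual features (mean R, G, B of an object crop, scaled 0-1).
# In the assignment these come from a CNN over MSCOCO object regions.
objects = np.array([
    [0.9, 0.1, 0.1],   # a red object
    [0.8, 0.2, 0.1],   # another red object
    [0.1, 0.8, 0.2],   # a green object
    [0.2, 0.2, 0.9],   # a blue object
])
# Positive examples: objects that the word "red" was used to refer to
red_labels = np.array([1, 1, 0, 0])

# One binary classifier per word: its learned weights serve as the
# word's (grounded) meaning representation
word_red = LogisticRegression().fit(objects, red_labels)

# Applying the word to a new object yields a probability of "fit",
# which can be combined across all words of a referring expression
new_object = np.array([[0.85, 0.15, 0.1]])
print(word_red.predict_proba(new_object)[:, 1])  # high: plausibly "red"
```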
Deep Learning: An understanding of deep learning is obviously important for recent NLP researchers and practitioners. One constant challenge is determining at what level of abstraction to present neural networks (should students know what is happening at the level of the underlying linear algebra, or is a conceptual understanding of parameter fitting in the neurons enough?). Furthermore, deep learning as a topic requires an understanding of its limitations and at least to some degree how it works "under the hood" (learning just how to use deep learning libraries without understanding how they work and how they "learn" from the data is akin to giving someone a car to drive without teaching them how to use it safely). This also means explaining some common misconceptions, like how neurons in neural networks "mimic" real neurons in human brains, something that is very far from true, though certainly the idea of neural networks is inspired by human biology. For my students, we progress from linear regression to logistic regression (illustrating how parameters are being fit and how gradient descent is different from directly estimating parameters in continuous probability functions; i.e., maximum likelihood estimation vs. convex and non-convex optimization), building towards small neural architectures and feed-forward networks. We also cover convolutional neural networks (for transfer learning and grounded semantics), attention (Vaswani et al., 2017), and transformers, including transformer-based language models like BERT (Devlin et al., 2018), and how to make use of them; understanding how they are trained, but then only assigning fine-tuning for students to experience directly. I focus on smaller datasets and fine-tuning so students can train and tune models on their own machines.
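A sketch of that progression in keras (the toy data, layer sizes, and training settings are illustrative): logistic regression is literally a one-neuron network, and a feed-forward network only inserts hidden layers, with parameters fit by gradient descent against a cross-entropy loss rather than estimated directly as in Naive Bayes:

```python
import numpy as np
from tensorflow import keras

# Toy binary classification data
X = np.random.rand(200, 4)
y = (X.sum(axis=1) > 2).astype(int)

# Logistic regression as a one-neuron network with a sigmoid output
logistic = keras.Sequential([
    keras.layers.Dense(1, activation="sigmoid", input_shape=(4,))
])

# A feed-forward network adds hidden layers to the same recipe
ffn = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    keras.layers.Dense(1, activation="sigmoid"),
])

for model in (logistic, ffn):
    # Cross-entropy loss; parameters are fit by (stochastic) gradient
    # descent rather than estimated in closed form
    model.compile(optimizer="sgd", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.fit(X, y, epochs=20, verbose=0)
    print(model.evaluate(X, y, verbose=0))
```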
Final Project: There is a "final project" requirement. Students can work solo, or in a group of up to three students. The project can be anything NLP related, but projects generally are realized as using an NLP or machine/deep learning library to train on some specific task; others include methods for data collection (a topic we don't cover in class specifically, but some students have interest in the data collection process for certain settings like second language acquisition), as well as interfaces that they evaluate with some real human users. Scoping the projects is always the biggest challenge, as many students initially envision very ambitious projects (e.g., build an end-to-end chatbot from scratch). I ask students to consider how much effort it would take to do three assignments and use that as a point of comparison. Throughout the first half of the semester students can ask for feedback on project ideas, and at the halfway point in the semester, students are required to submit a short proposal that outlines the scope and timeline for their project. They have the second half of the semester to then work through the project (with a "checkpoint" part way through to inform me of progress and needed adjustments), then they write a project report on their work at the end of the semester with evaluations and analyses of their work. Graduate students must write a longer report than the undergraduate students, and graduate students are required to give a 10-minute presentation on their project. The timeline here is critical: the midway point for beginning the project allows students to have experience with classification and NLP tasks, but have enough time to make adjustments as they work on their project. For example, students attempt to apply BERT fine-tuning after the BERT assignment even though it wasn't in their original project proposal.

Theme of Limitations: As is the case with ambiguity, limitations is a theme in the course: limitations of using probability theory on language phenomena, limitations on datasets, and limitations on machine learning models. The theme of limitations ties into an overarching ethical discussion that happens at intervals throughout the semester about what can reasonably be expected from NLP technology and whom it affects as more practical models are deployed commercially.

The final assignment, critical reading of the popular press, is based on a course under the same title taught by Emily Bender at the University of Washington.[5] The goal of the assignment is to learn to critically read popular articles about NLP. Given an article, students need to summarize the article, then scrutinize the sources using the following as a guide: can they (1) access the primary source, such as the original published paper, (2) assess if the claims in the article relate to what's claimed by the primary source, (3) determine if experimental work was involved or if the article is simply offering conjecture based on current trends, and (4) if the article did not carry out an evaluation, offer ideas on what kind of evaluation would be appropriate to substantiate any claims made by the article's author(s). Then students should relate the headline of the article to the main text and determine if reading the headline provides an abstract understanding of the article's contents, and determine to what extent the author identified limitations to the NLP technology they were reporting on, what someone without training in NLP might take away from the article, and if the authors identified the people who might be affected (negatively or positively) by the NLP technology. This assignment gives students experience in recognizing the gap between the reality of NLP technology, how it is perceived by others, whom it affects, and its limitations.

[5] The original assignment definition appears to no longer be public.

We dedicate an entire lecture to ethics, and students are also asked to consider the implications of their final projects, what their work can and cannot reasonably do, and who might be affected by their work.[6]

[6] Approaching ethics from a framework of limitations, I think, helps students who might otherwise be put off by a discussion on ethics, because it can clearly be demonstrated that NLP models have technical limitations and that those limitations have implications for real people; an important message from this class is that all NLP models have limitations.
Discussion: Striking a balance between content on probability and information theory, linguistics, and machine learning is challenging for a single course, but given the diverse student population at a public state school, this approach seems to work for the students. An NLP class should have at least some content about linguistics, and framing aspects of linguistics in terms of ambiguity gives students the tools to think about how much they experience ambiguity on a daily basis, and the fact that if language were not ambiguous, data-driven NLP would be much easier (or even unnecessary). The discussions about syntax and semantics are especially important, as many have not considered (particularly those who have not learned a foreign language) how much they take for granted when it comes to understanding and producing language, both speech and written text. The discussions on how to represent meaning computationally (symbolic strings? classifiers? embeddings? graphs?) and how a model should arrive at those representations (using speech? text? images?) are rewarding for the students. While most of the assignments and examples focus on English, examples of linguistic phenomena are often shown from other languages (e.g., Japanese morphology and German declension), and the students are encouraged to work on other languages for their final project.

Assignments vary in scope and scaffolding. For the probability and information theory and BERT assignments, I provide a fairly well-scaffolded template that the students fill in, whereas most other assignments are more open-ended, each with a set of reflection and analysis questions.
4.2 Content Delivery

Class sizes vary between 35 and 45 students. Class content is presented largely either as presentation slides or live programming using Jupyter notebooks. Slides introduce concepts and explain things outside of code (e.g., linguistics and ambiguity, or graphical models), but most concepts have concrete examples using working code. The students see code for Naive Bayes (both discrete and continuous) classifiers; I use Python code to explain probability and information theory, classification tasks such as spam filtering, name classification, topic modeling, parsing, loading and preprocessing datasets, linear and logistic regression, sentiment classification, and an implementation of neural networks from scratch as well as popular libraries.

While we use NLTK for much of the instruction, following in some ways what is outlined in Bird et al. (2008), we also look at supported NLP Python libraries including textblob, flair (Akbik et al., 2019), spacy, stanza (Qi et al., 2020), scikit-learn (Pedregosa et al., 2011), tensorflow (Abadi et al., 2016) and keras (Chollet et al., 2015), pytorch (Paszke et al., 2019), and huggingface (Wolf et al., 2020). Others are useful, but most libraries help students use existing tools for standard NLP pre-processing like tokenization, sentence segmentation, stemming or lemmatization, and part-of-speech tagging, and many have existing models for common NLP tasks like sentiment classification and machine translation. The stanza library has models for many languages. All code I write or show in class is accessible to the students throughout the semester so they can refer back to the code examples for assignments and projects. This of course means that students only obtain a fairly shallow experience with any library; the goal is to show them enough examples and give them enough experience in assignments to make sense of public documentation and other code examples that they might encounter.

The course uses two books, both of which are available free online: the NLTK book,[7] and an ongoing draft of Jurafsky and Martin's upcoming 3rd edition.[8] The first assignment (Python & Jupyter in Figure 3) is an easy but important assignment: I ask the students to go through Chapter 1 and parts of Chapter 4 of the NLTK book and, for all code examples, write them by hand into a Jupyter notebook (i.e., no copying and pasting). This ensures that their programming environments are set up, steps them through how NLTK works, gives them immediate exposure to common NLP tasks like concordance and stemming, and gives them a way to practice Python syntax in the context of a Jupyter notebook. Another part of the assignment asks them to look at some Jupyter notebooks that use tokenization, counters, stop words, and n-grams, and asks them questions about best practices for authoring notebooks (including formatted comments).[9]

[7] http://www.nltk.org/book_1ed/
[8] https://web.stanford.edu/~jurafsky/slp3/
[9] I use the notebooks listed here for this part of the assignment: https://github.com/bonzanini/nlp-tutorial
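For readers unfamiliar with the NLTK book, a taste of what students type in for that first assignment might look like the following sketch (the corpus and word choices are illustrative, not the assignment's exact exercises):

```python
import nltk
nltk.download("gutenberg", quiet=True)

from nltk.corpus import gutenberg
from nltk.stem import PorterStemmer

# Concordance: show every occurrence of a word in its context
text = nltk.Text(gutenberg.words("austen-emma.txt"))
text.concordance("surprise", lines=5)

# Stemming: reduce words to a crude root form
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["running", "ran", "runs", "easily"]])
```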
Students can use cloud-based Jupyter servers for doing their assignments (e.g., Google colab), but all must be able to run notebooks on a local machine and spend time learning about Python environments (i.e., anaconda). Assignments are submitted and graded using okpy, which renders notebooks and allows instructors to assign grading to themselves or teaching assistants, and students can see their grades and written feedback for each assignment.[10]

[10] https://okpy.org/
4.3 Adjustments for Remote Learning

This course was relatively straightforward to adjust for remote delivery. The course website and okpy (for assignment submissions) are available to the students at all times. I decided to record lectures live (using Zoom) and then make them available with transcripts to the students. This course has one midterm, a programming assignment that is similar in structure to the regular assignments. During an in-person semester, there would normally be a written final, but I opted to make the final part of their final project grade.

Semester      Python   Ling   ML    DL
Spring 2017   none     none   10%   none
Spring 2019   50%      30%    30%   10%
Spring 2021   80%      30%    60%   20%

Table 1: Impressions of preparedness for Python, Linguistics (Ling), Machine Learning (ML), and Deep Learning (DL).

5 Reflection on Three Offerings over Four Years

Due to department constraints on offering required courses vs. elective courses (NLP is elective), I am only able to offer the NLP course in the Spring semester of odd years, i.e., every four semesters. The course is very popular, as enrollment is always over the standard class size (35 students). Below I reflect on changes that have taken place in the course due to the constant and rapid change in the field of NLP and in our undergraduate curriculum, and the implications those changes had on the course. These reflections are summarized in Table 1. As I am, to my knowledge, the first NLP researcher at Boise State University, I had to largely develop the contents of the course on my own, requiring adjustments over time as I came to better understand student preparedness. At this point, despite the two paths into the course, most students who take the course are still Computer Science students.

Spring 2017: The first time I taught the course, only a small percentage of the students had experience with Python. The only developed Python library for NLP was NLTK, so that and scikit-learn were the focus of practical instruction. I spent the first three weeks of the course helping students gain experience with Python (including Jupyter, numpy, and pandas), then used Python as a means to help them understand probability and information theory. The course focused on generative classification, including statistical n-gram language modeling, with some exposure to discriminative models, but no exposure to neural networks.

Spring 2019: Between 2017 and 2019, several important papers showing how transformer networks can be used for robust language modeling were gaining momentum, resulting in a shift towards altering these models and understanding their limitations (so-called BERTology; see Rogers et al. (2020) for a primer). This, along with the fact that changes in the curriculum gave students better experience with Python, caused me to shift focus from generative models to neural architectures in NLP and to cover word-level embeddings more rigorously. I spent the second half of the semester introducing neural networks (including multi-layer perceptrons, convolutional, and recurrent architectures) and giving students assignments to give them practice in tensorflow and keras. After the 2017 course, I changed the prerequisite structure to require our programming languages course instead of data structures. This led to greater preparedness in at least the syntax aspect of linguistics.

Spring 2021: In this iteration, I shifted focus from recurrent to attention/transformer-based models and assigned a BERT fine-tuning assignment on a novel dataset using huggingface. I also introduced pytorch as another option for a neural network library (I also spend time on tensorflow and keras). This shift reflects a shift in my own research and understanding of the larger field, though exposure to each library is only partial and somewhat abstract. I note that students who have a data science background will likely appreciate tensorflow and keras more, as they are not as object-oriented as pytorch, which seems to be more geared towards students with Computer Science backgrounds. Students can choose which one they will use (if any) for their final projects. More students are gaining interest in machine learning and deep learning and are turning to MOOC courses or online tutorials, which has to some degree led to better preparation for the NLP course, but often students have little understanding about the limitations of machine learning and deep learning after completing those courses and tutorials. Moreover, students from our university have started an Artificial Intelligence Club (the club started in 2019; I am the faculty advisor), which has given the students guidance on courses, topics, and skills that are required for practical machine learning. Many of the NLP class students are already members of the AI Club, and the club has members from many academic disciplines.
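As a hedged sketch of the kind of fine-tuning the BERT assignment asks for, using huggingface's transformers library (the tiny in-memory dataset, hyperparameters, and loop are illustrative; the actual assignment uses a novel dataset with proper splits):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# A tiny illustrative batch; the real assignment loads a full dataset
texts = ["great movie, loved it", "terrible, a waste of time"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True,
                  return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few passes over the (tiny) batch
    outputs = model(**batch, labels=labels)  # loss computed internally
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    logits = model(**tokenizer("what a fun film", return_tensors="pt")).logits
print(logits.argmax(dim=-1))  # predicted class index
```

Because only fine-tuning is assigned, a model like this can be trained on a laptop CPU in reasonable time, which is why I focus on smaller datasets.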
5.1 Student Feedback

I reached out to former students who took the class to ask for feedback on the course. Specifically, I asked if they use the skills and concepts from the NLP class directly for their work, or if the skills and concepts transferred in any way to their work. Student responses varied, but some answered that they use NLP directly (e.g., to analyze customer feedback or error logs), while most responded that they use many of the Python libraries we covered in class for other things that aren't necessarily NLP related, but are more geared towards Data Science. For several students, using NLP tools helped them in research projects that led to publications.

6 Conclusions & Open Questions

With each offering, the NLP course at Boise State University is better suited pedagogically for students with some Data Science or Computer Science training, and the content reflects ongoing changes in the field of NLP to ensure their preparation. The topics and assignments cover a wide range, but as students have become better prepared with Python (by the introduction of new prerequisite courses that cover Python, as well as changing some courses to include assignments in Python), more focus is spent on topics that are more directly related to NLP.

Though I feel it important to stay abreast of the ongoing changes in NLP and help students gain the knowledge and skills needed to be successful in NLP, an open question is what changes need to be made, and a related question is how soon. For example, I think at this point it is clear that neural networks are essential for NLP, though it isn't always clear what architectures should be taught (e.g., should we still cover recurrent neural networks or jump directly to transformers?). It seems important to cover even new topics sooner rather than later, though a course that is focused on research methods might be more concerned with staying up-to-date with the field, whereas a course that is more focused on general concepts and skills should wait for accessible implementations (e.g., huggingface for transformers) before covering those topics.

With recordings and updated content, I hope to flip the classroom in the future by assigning readings and watching lectures before class, then use class time for working on assignments.[11]

[11] This worked well for the Foundations of Data Science course that I introduced to the university; the second time I taught the course, I flipped it and used class time for students to work on assignments, which gave me a better indication of how students were understanding the material, time for one-on-one interactions, and fewer issues with potential cheating.

Much of my course materials, including notebooks, slides, topics, and assignments, can be found on a public Trello board.[12]

[12] https://trello.com/b/mRcVsOvI/boise-state-nlp

Acknowledgements

I am very thankful to the anonymous reviewers for giving excellent feedback and suggestions on how to improve this document. I hope that it followed those suggestions adequately.

References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.

Apoorv Agarwal. 2013. Teaching the basics of NLP and ML in an introductory course to information science. In Proceedings of the Fourth Workshop on Teaching NLP and CL, pages 77–84, Sofia, Bulgaria. Association for Computational Linguistics.

Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. FLAIR: An easy-to-use framework for state-of-the-art NLP. In NAACL 2019, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54–59.

Emily M. Bender and Alexander Koller. 2020. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Association for Computational Linguistics, pages 5185–5198.

Steven Bird, Ewan Klein, Edward Loper, and Jason Baldridge. 2008. Multidisciplinary instruction with the Natural Language Toolkit. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 62–70, Columbus, Ohio. Association for Computational Linguistics.

François Chollet et al. 2015. Keras. https://keras.io.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv.
Eric Fosler-Lussier. 2008. Strategies for teaching "mixed" computational linguistics classes. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 36–44, Columbus, Ohio. Association for Computational Linguistics.

Reva Freedman. 2008. Teaching NLP to computer science majors via applications and experiments. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 114–119, Columbus, Ohio. Association for Computational Linguistics.

C. Kennington and D. Schlangen. 2015. Simple learning and compositional application of perceptually grounded word meanings for incremental reference resolution. In ACL-IJCNLP 2015: 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Proceedings of the Conference, volume 1.

Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. arXiv preprint arXiv:2003.07082.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. arXiv.

David Schlangen, Sina Zarriess, and Casey Kennington. 2016. Resolving references to objects in photographs using the words-as-classifiers model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1213–1223.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
