Virtual Pre-Service Teacher Assessment and Feedback via Conversational Agents

Debajyoti Datta¹*, Maria Phillips¹*, James P. Bywater², Jennifer Chiu³, Ginger S. Watson³, Laura Barnes¹, Donald Brown¹
¹ School of Engineering and Applied Science, University of Virginia
² College of Education, James Madison University
³ School of Education and Human Development, University of Virginia
{dd3ar, mp6kv, jlc4dz, gw2b, lb3dp, deb}@virginia.edu
bywatejx@jmu.edu
* The first two authors contributed equally.

Abstract

Conversational agents and assistants have been used for decades to facilitate learning. There are many examples of conversational agents used for educational and training purposes in K-12, higher education, healthcare, the military, and private industry settings. The most common forms of conversational agents in education are teaching agents that directly teach and support learning, peer agents that serve as knowledgeable learning companions to guide learners in the learning process, and teachable agents that function as a novice or less-knowledgeable student trained and taught by a learner who learns by teaching. The Instructional Quality Assessment (IQA) provides a robust framework to evaluate reading comprehension and mathematics instruction. We developed a system for pre-service teachers, individuals in a teacher preparation program, to evaluate teaching instruction quality based on a modified interpretation of IQA metrics. Our demonstration and approach take advantage of recent advances in Natural Language Processing (NLP) and deep learning for each dialogue system component. We built an open-source conversational agent system to engage pre-service teachers in a specific mathematical scenario focused on scale factor, with the aim of providing feedback on pre-service teachers' questioning strategies. We believe our system is not only practical for teacher education programs but can also enable other researchers to build new educational scenarios with minimal effort.

1 Introduction

In the era of remote teaching due to governmental regulations and stay-at-home orders for COVID-19, remote teaching methodologies have come to the forefront of education and training. Virtual conversational agents have been used for a variety of training scenarios to train teachers and nurses in many different contexts (Datta et al., 2016). A virtual conversational agent refers to an online interactive system with which a user can dialogue and receive responses in turn; all future references to conversational agents refer to this definition. Conversational agents, often characterized as dialogue systems, are common in customer support applications (Yan et al., 2017). In education, they have been used extensively for interpreting the content of the conversation, integrating and assimilating information, and providing feedback (Chhibber and Law, 2019). However, as pointed out by Smutny and Schreiberova (2020), very few conversational agents used in education use recent advances in machine learning and deep learning, instead relying on simple decision trees. This reinforces the need for more research and development on artificial intelligence–based methods to support content-specific conversations. A conversational agent deployed through a web interface, as opposed to standalone software implementations, has multiple benefits for teacher learning. Two of the prominent benefits are that this approach can easily be scaled to reach more in-service and pre-service teachers, and that it provides a cost-effective way to extend the system to new teaching scenarios because multiple dialogue system components can be reused. The dialogue manager and semantic sentence-level components can be used for different mathematical scenarios as long as the assessment component remains unchanged. In this case the term assessment refers to pre-service teacher evaluation as opposed to a student's understanding of a mathematical topic. In this work, we combine advances in NLP and deep learning research with a modified version of the Instructional Quality Assessment (IQA) (Boston, 2012) framework to build a scenario to be used in teacher education settings. Our goal in this project is to assess and give teachers immediate feedback on the quality of their instructional moves using a specific scenario on a given mathematical topic.
Research demonstrates that the Instructional Quality Assessment provides a robust framework for evaluating teachers' instructional practice in mathematics classrooms. Our demonstration uses a modified version of the IQA that focuses on the strategy of teachers' questions during one-on-one or class discussions through a web-based platform that allows pre-service teachers to receive real-time feedback on the quality of their questioning. Although our demonstration centers on an adapted version of the IQA as the assessment component, our modular architecture can be used to incorporate alternative evaluation schemes as well.

A critical challenge of developing dialogue systems in a new domain is the requirement of large data sets for the different components of the dialogue system (intent classification, slot filling for dialogue state tracking, and the dialogue policy). In education, where data collection can itself prove a significant challenge, especially given a need for increased domain-expert annotations, our system is developed with only a small amount of annotated data: around two thousand sentences that are labeled with one of four adapted IQA classes. Note that this system is not necessarily better than existing approaches that use annotated training data for each component of the dialogue system; rather, it is a compromise because dialogue systems are challenging to build in low- or no-data scenarios. What our system demonstrates is a solution to developing a highly-manipulable scenario given minimal domain-expert annotated data that can be used to support virtual feedback for pre-service teachers.

Our contributions are as follows:

• An open-source web platform for assessment of the quality of teachers' mathematical questioning

• A process that allows for scenario development with minimal training data

• Direct feedback to pre-service teachers on the quality of mathematical questioning by relying on state-of-the-art NLP components

• A framework in which new scenarios can be deployed with minimal change in components by relying on transfer learning and weak supervision

2 Related Work

Virtual human-based simulations can provide meaningful, deliberate practice for learning a wide range of teaching skills during teacher preparation as well as extended practice for advanced skills for in-service teachers. Dialogue systems for conversational agents can be built with two different approaches: end-to-end approaches that combine all of the dialogue system stages and require large-scale labeled training data, and component-based systems that require less data but whose components need to be trained separately. Component-based systems have the advantage of allowing certain components to be reused for similar contextual scenarios. Our work utilizes the latter approach, leveraging unstructured data collected from the web and textbooks as knowledge bases. Our dialogue policy is a combination of handcrafted rules by education domain experts, and dialogue states are tracked through a reading comprehension based approach highlighted by Gao et al. (2019). The most common approach for dialogue state tracking is the "slot-value" pairs approach, in which different stages of the dialogue are often framed as a multi-class classification task (Mrkšić et al., 2015). While this approach is well-studied, the decision to use the reading comprehension based approach in this system is based on the ability to incorporate pre-trained models for dialogue state tracking.

The assessment component for the pre-service teachers' questioning strategy is an adapted version of the IQA. The IQA has been developed by the Learning Research and Development Center at the University of Pittsburgh since 2002 (Matsumura et al., 2006). The IQA offers a holistic assessment of mathematical instruction, including the academic rigor of the specific tasks and the student and teacher discussion surrounding the task (Pianta and Hamre, 2009). The IQA has been further validated by its developers in subsequent research (Boston and Candela, 2018). More recent developments suggest that the IQA can be used not only as an assessment tool but also as a feedback tool to help teachers actively improve their instruction (Boston and Candela, 2018).
The questions that teachers ask are essential for promoting students' meaningful mathematical discourse. The academic rigor component (Junker et al., 2005; Boston, 2012) of the IQA builds on earlier classifications of teacher questions (e.g., Boaler and Brodie (2004)), distinguishing between "probing and exploring" questions that ask students to clarify their ideas or the connections between them, and "procedural and factual" questions that elicit facts or yes/no responses. The IQA is intended to be used in contexts where cognitively demanding mathematical tasks are implemented and is well suited for fine-grained teacher professional development such as that which focuses on teacher questioning (Boston et al., 2015). Another advantage of utilizing the IQA is its limited number of categories, as a high number of categories results in very complex dialogue policies (Yan et al., 2017).

Given the limited amount of time domain experts may have for annotating data, several methods to improve label efficiency were explored. Weak supervision techniques, as highlighted by Ratner et al. (2016), provide the benefit of requiring less human labeling than would otherwise be needed for training. An additional benefit of weak supervision is that noisy data and the accuracy of each annotator can be taken into account for classification. Ratner et al. (2016) have shown that weak supervision systems outperform generic majority-vote approaches. Learning from noisy label data has also been studied with deep-learning-based approaches (Guan et al., 2018) and proven effective.

3 Data and Tools

3.1 Classification Model Data

A primary purpose of this conversational agent is its ability to provide feedback for pre-service teachers. This requires the ability to classify each statement or question using the selected assessment rubric, which in this implementation is an adapted IQA rubric. The categories of the adapted IQA measure were set by education domain experts and iterated on over the course of several months for the purposes of classification and feedback for pre-service teachers. The adapted IQA measure includes the following categories of questions:

• Probing and Exploring: Clarifies student thinking; enables students to elaborate their own thinking for their own benefit and the class. Points to underlying mathematical relationships and meanings and makes links among mathematical ideas. (e.g., Explain to me how you got that expression?)

• Procedural and Factual: Elicits a mathematical fact or procedure; requires a yes/no or single-response answer; requires the recall of a memorized fact or procedure. (e.g., What is the square root of 4?)

• Expository and Cueing: Provides mathematical cueing or mathematical information to students. (e.g., To solve this problem you need to double this side, then take that number and multiply it by 3.)

• Other: All other conversation not related to the above categories. (e.g., Close your books; Why didn't you use graph paper?)

Annotators also had the opportunity to flag any data as a "data issue", which would represent a transcript preprocessing error or another issue indicating the data could not be labeled, such as incoherent or blank text.

The data used for the adapted IQA evaluation rubric was developed from transcriptions of audio recordings of teachers in whole-class and teacher-student conversations that took place in elementary mathematics classrooms using different mathematics curricula across the United States. The de-identified dataset was shared from an NSF-sponsored project that had previously collected the recordings to answer separate research questions. Students engaged with a project purposed to help them understand different geometry concepts like scale factor, dimensions, surface area, and volume of rectangular prisms. The students recorded observations from a given visualization and explained the impact of the scale factor. The data collected for the development of this scenario contained 2826 questions. The unique question, along with the context, or speaking turn, in which the question was uttered, were both provided as reference for the annotators to use during labeling.

We had 5799 total labeled data instances. There were five total annotators: three expert teachers as well as two pre-service teachers. The number of annotators fluctuated during different stages of the annotation process, resulting in varied amounts of labels generated by each annotator. The time to label each data point averaged between 5.2 and 6.7 seconds per annotator. The total number of unique labeled sentences was 2826. The total distribution of labels between the four assessment categories ranged from 856 to 2133. We used weak supervision based approaches to combine the labeled data from multiple annotators rather than majority-vote approaches.
Figure 1: Labeling interface for annotation

3.2 Labeling Platforms

Two labeling platforms were used extensively for this project: Labelbox (Labelbox, 2020) and Label Studio (Tkachenko et al., 2020). While both platforms were simple and straightforward to use, Label Studio enabled building custom user interfaces with several improved features, such as keyboard shortcuts, that allowed annotators to onboard more easily and complete labeling tasks more efficiently.

Each individual question was labeled with a context reference that allowed annotators to see the entire speaking turn of the teacher. The decision to include context came after previous iterations of labeling questions resulted in an inter-annotator agreement below 0.50, which subsequently increased to 0.66 after including context. An example of the Label Studio labeling interface is shown in Figure 1.

Our data collection approach relied on weak supervision and learning-with-noisy-labels strategies. In this paradigm, noisy labels acquired either from human labelers or from machine learning models are cost-effective to acquire. In the scenario in which domain-expert annotators are available (in our case, expert teachers), noisy disagreements between annotators can be leveraged to build high-accuracy models (Ratner et al., 2016; Guan et al., 2018). Weak supervision approaches are scalable, enabling easy adaptation to multiple mathematical scenarios, one of the key contributions and focuses of this project.
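To make this label-combination step concrete, the sketch below aggregates the annotators' noisy labels with a generative label model in the spirit of data programming (Ratner et al., 2016). A specific implementation is not prescribed here, so the use of the snorkel library, the toy label matrix, and the abstain convention are illustrative assumptions rather than the exact pipeline used.

```python
import numpy as np
from snorkel.labeling.model import LabelModel

# Adapted IQA classes; -1 marks a question an annotator did not label.
PROBING, PROCEDURAL, EXPOSITORY, OTHER, ABSTAIN = 0, 1, 2, 3, -1

# Toy label matrix: one row per question, one column per annotator.
L = np.array([
    [PROBING,    PROBING,    ABSTAIN,    PROBING,    OTHER],
    [PROCEDURAL, PROCEDURAL, PROCEDURAL, ABSTAIN,    PROCEDURAL],
    [EXPOSITORY, OTHER,      EXPOSITORY, EXPOSITORY, ABSTAIN],
])

# The label model estimates each annotator's accuracy from agreement patterns
# and produces probabilistic labels instead of a simple majority vote.
label_model = LabelModel(cardinality=4, verbose=False)
label_model.fit(L_train=L, n_epochs=500, seed=123)

probs = label_model.predict_proba(L)   # (n_questions, 4) class probabilities
labels = probs.argmax(axis=1)          # hard labels used to train the classifier
```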
3.3 Knowledge Base Data

The knowledge base of a dialogue system can be very complex depending on the scenario for which the dialogue system is being built (Yan et al., 2017). For this initial demonstration scenario the conversational agent represents a student with some level of understanding of the topic scale factor. The knowledge base relies on unstructured knowledge about scale factor collected from the web and textbooks and compiled as plain text. As our system is intended to reflect a student's understanding of a topic, which is reasonably imperfect, contradictory sources of information are not a primary concern. In fact, a knowledge base with contradictory information may be leveraged to support more robust answering. Additionally, the expected level of understanding of a student for a given topic is likely to be documented in instructional materials readily available on the web, so collecting this data is a simple way to develop a knowledge base. Future efforts will address grade-appropriate filtering that may improve interaction by generating a more realistic student profile with a grade-reflective knowledge base.
Figure 2: Conversational Agent Implementation Architecture

The text collected from the web was not cleaned, labeled, or annotated. Basic pre-processing included removing references to figures and hyperlinks to other web pages. Once the plain-text reference base was compiled, we separated the text into sections of no more than 512 words. This processing step was done so that an entire section could be used directly as the input to the response generation discussed further in the Methods section.
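A minimal sketch of this chunking step is shown below; the whitespace tokenization, the greedy packing into 512-word sections, and the file name are assumptions made for illustration.

```python
def split_into_sections(text: str, max_words: int = 512) -> list[str]:
    """Greedily pack whitespace-delimited words into sections of at most max_words."""
    words = text.split()
    return [" ".join(words[start:start + max_words])
            for start in range(0, len(words), max_words)]

# Example: chunk a plain-text reference document about scale factor (hypothetical file).
knowledge_text = open("scale_factor_reference.txt").read()
knowledge_sections = split_into_sections(knowledge_text)
```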
Relying on unstructured knowledge bases is critical to rapidly developing and deploying new conversational agent scenarios. Unstructured texts on varying mathematical topics are readily available from websites and video transcripts. Our framework allows for a simple way to incorporate newly generated external knowledge bases as a way to scale to additional scenarios. The ability to use unstructured knowledge as the key input of our knowledge base is possible due to recent advances in question-answering models, reading comprehension tasks, and readily available libraries such as Huggingface Transformers (Wolf et al., 2020).

3.4 Platform Interface and Hosting

The application was built with Django, a Python web application development framework. Credentials must be generated and are required prior to using the interface. The interface includes an Institutional Review Board (IRB) consent form as well as a description of the scenario, including the topic, the expected student understanding of the topic (Beginner, Intermediate, Advanced), and the student grade level. Screenshots of the application are included in the Appendix.

4 Methods

The overall architecture of our conversational agent is depicted in Figure 2. As discussed, a central component of our demonstration is evaluating and providing feedback on pre-service teacher instruction. For this initial version of the scenario, domain experts provided a specified rubric for pre-service teachers to meet that clarifies the types of IQA categories desired within a session. One sample rubric evaluates whether the teacher asks at least one "probing and exploring" question, one "expository and cueing" question, and one "procedural and factual" question. This rubric can be changed easily in the demonstration through a separate JSON file. In our current evaluation, we do not evaluate the order in which the questions are asked, but we plan to include more sophisticated evaluation protocols in future work.
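A small example of what such a rubric file might contain is given below; the key names and minimum counts are illustrative assumptions rather than the exact schema used by the demonstration.

```python
import json

# Illustrative contents of the separate rubric JSON file: each entry is the
# minimum number of questions required in that adapted IQA category.
SAMPLE_RUBRIC = """
{
  "probing_and_exploring": 1,
  "expository_and_cueing": 1,
  "procedural_and_factual": 1
}
"""
rubric = json.loads(SAMPLE_RUBRIC)
```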
The current implementation incorporates text interactions between a conversational agent, representing a student, and a pre-service teacher. The pre-service teacher can interact with the conversational agent by providing new knowledge, testing understanding, and testing knowledge of the topic "scale factor". The pre-service teacher types a statement, a question, multiple statements, or multiple questions in the text box, and the conversational agent responds by taking into account the conversation context as well as the pre-service teacher's utterance. Each component of the dialogue system is described in detail in the subsequent subsections.
Figure 3: Dialogue System Selection of Knowledge Base Reference Section via Semantic Similarity

4.1 Assessment Metric Classifier

In our demonstration, we utilize the entire speaking turn of the pre-service teacher as the input text. This formulation is very similar to the two-sentence classification task used in the Stanford Natural Language Inference corpus (Bowman et al., 2015), and we frame our input in a similar format for classification and fine-tuning. The input text undergoes basic cleaning and is tokenized prior to being used as the input to our classifier model. We experimented with multiple text classification approaches, including Convolutional Neural Network (CNN) based text classification (Kim, 2014), Long Short-Term Memory (LSTM) based text classification (Liu et al., 2016), and newer approaches that rely on Transformer architectures (Devlin et al., 2018; Liu et al., 2019) and perform well with small amounts of labeled data. Transfer learning models tend to perform well with less labeled data than other models because pretraining on unsupervised text encodes knowledge and the semantic meaning of words and sentences. This demonstration incorporates a fine-tuned Bidirectional Encoder Representations from Transformers (BERT) model for our adapted IQA classification task.

4.2 Semantic Matching

For the conversational agent to respond to input text, the first step is to identify the most relevant section of the knowledge base. To do this, the pre-processed input text is used in combination with the Universal Sentence Encoder (Cer et al., 2018) to find the most relevant, or more specifically, most semantically similar section of the knowledge base. Semantic similarity refers to the degree to which two texts have the same meaning. As discussed in the data treatment section, the knowledge base is split into smaller sections that are a more optimized size for the semantic similarity tool used, the Universal Sentence Encoder. This process is depicted in Figure 3.

The Universal Sentence Encoder is optimized for short phrases or paragraphs and outputs a 512-dimensional vector. Semantic similarity is computed as the inner product between the input and knowledge base embeddings; specifically, we use the normalized cosine similarity of the embeddings. Semantic similarity computation at the sentence level is more accurate than the aggregate of word-level similarities and is therefore preferred in this application. Models trained to understand words in context are often better suited for identifying semantic similarities of phrases and sentences. In application, we may take an input such as "What is scale factor?"; by finding the most semantically similar section in the knowledge base, we can use that section as the input to response generation.
If the semantic similarity between the input text and the knowledge base sections does not achieve a pre-defined threshold, the system responds from the unknown category. For this demonstration the threshold was set to 0.80 after empirical evaluation of semantic coherence. This threshold value will be further tested and empirically evaluated in future iterations of this system.

In the unknown category, one of six pre-defined responses is selected at random to convey to the pre-service teacher that the system does not understand the user input. This pre-defined selection represents a hand-crafted dialogue policy that was determined by domain experts.
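A minimal sketch of this matching-and-fallback step is shown below. The TensorFlow Hub handle points to the publicly released Universal Sentence Encoder; the example sections, the two fallback responses (six are used in the deployed system), and the helper names are illustrative assumptions.

```python
import random
import numpy as np
import tensorflow_hub as hub

encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
THRESHOLD = 0.80  # empirically chosen similarity cut-off

# Illustrative knowledge-base sections (in practice, the 512-word sections from Section 3.3)
# and two of the pre-defined fallback responses.
sections = [
    "A scale factor is the number that the dimensions of a figure are multiplied by "
    "to produce an enlarged or reduced copy of the figure.",
    "Surface area is the total area of all the faces of a three-dimensional figure.",
]
UNKNOWN_RESPONSES = ["I'm not sure what you mean.", "Can you ask that a different way?"]

def best_section(user_text: str, sections: list[str]):
    """Return the most semantically similar section, or None if below the threshold."""
    embeddings = encoder([user_text] + sections).numpy()
    query, section_vecs = embeddings[0], embeddings[1:]
    sims = section_vecs @ query / (
        np.linalg.norm(section_vecs, axis=1) * np.linalg.norm(query))
    best = int(sims.argmax())
    return sections[best] if sims[best] >= THRESHOLD else None

matched = best_section("What is scale factor?", sections)
if matched is None:
    reply = random.choice(UNKNOWN_RESPONSES)
# otherwise `matched` is handed to the question-answering module (Section 4.4)
```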
If the semantic similarity is greater than or equal to the threshold for a given knowledge base section, the system selects this knowledge section. The initial input text along with the selected knowledge base section are used as the inputs to a question-answering module. The question-answering module is a pre-trained BERT model that is fine-tuned on the Stanford Question Answering Dataset (SQuAD). This module is then used to generate a response in the user interface for the subsequent turn of the conversation. The dialogue manager retains the dialogue states of the conversation for record and reference within the conversation.

4.3 Dialogue Manager

4.3.1 Dialogue State Tracking

Dialogue state tracking is a core component of the dialogue system. The goal of dialogue state tracking is to estimate the user's goal at each turn of the conversation. There are multiple formulations of dialogue state tracking systems, such as hand-crafted rules (Wang and Lemon, 2013) and web-style ranking (Williams, 2014). In our approach we use the recent question-answering paradigm for dialogue state tracking (Gao et al., 2019). Unlike Gao et al. (2019), we do not train our reading comprehension based model, but instead use the question-answering paradigm to understand the dialogue states. We append each of the pre-service teacher utterances along with the student responses. Since conversational agent responses can either be mathematical (evaluating an expression) or responses to a pre-service teacher utterance, we evaluate each stage as a question-answering task. The questions prompted at the end of an utterance include: "Did the teacher ask a probing and exploratory question?", "Did the conversational agent answer the question correctly?", and "Did the pre-service teacher acknowledge the answer?". This iterative framework of question answering helps keep track of the dialogue state. Since this is a task-specific dialogue system being evaluated for a specific mathematical scenario, we already know the possible dialogue states of the conversation. At each turn in the conversation, we use all the previous pre-service teacher utterances to determine the current dialogue state.
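A minimal sketch of this question-answering view of state tracking is shown below. The state questions are the ones listed above, but the particular SQuAD-fine-tuned checkpoint and the confidence-threshold heuristic for marking a state as reached are assumptions for illustration, not the exact mechanism of the system.

```python
from transformers import pipeline

# Extractive QA model fine-tuned on SQuAD; the exact checkpoint used is not specified here.
state_qa = pipeline("question-answering",
                    model="distilbert-base-cased-distilled-squad")

STATE_QUESTIONS = {
    "asked_probing_question": "Did the teacher ask a probing and exploratory question?",
    "agent_answered_correctly": "Did the conversational agent answer the question correctly?",
    "teacher_acknowledged_answer": "Did the pre-service teacher acknowledge the answer?",
}

def track_state(history: list[str], min_score: float = 0.5) -> dict[str, bool]:
    """Mark a state as reached when the QA model finds confident evidence for the
    corresponding question in the concatenated dialogue history (a heuristic)."""
    context = " ".join(history)
    return {state: state_qa(question=question, context=context)["score"] >= min_score
            for state, question in STATE_QUESTIONS.items()}

history = ["Teacher: Explain how you got that expression?",
           "Student: I multiplied each side length by the scale factor."]
print(track_state(history))
```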
4.3.2 Dialogue Policy

In our task-specific dialogue system, the dialogue policy is rule-based: the utterance at turn n + 1 depends on the dialogue states accomplished up to turn n. The hand-crafted rules for our dialogue policy also enable direct evaluation based on simple metrics such as the number of questions in each adapted IQA category, or the development of metrics like Initiate Response Evaluate (IRE) (Mehan, 1979).

4.4 Response Generation

The response generation component extracts relevant sections of the knowledge base as a question-answering task. A question-answering task, also referred to as a reading comprehension task, is a supervised learning problem where, given a segment of text of i tokens and a question of j tokens, the model returns an answer segment of k tokens. The answer in question-answering tasks can be cloze-style as in CNN/Daily Mail (Hermann et al., 2015), span prediction as in SQuAD (Rajpurkar et al., 2016), or free-form as in NarrativeQA (Kočiskỳ et al., 2018). We retrieve our knowledge through semantic matching of web-text sections, so our response generation pipeline most closely matches span prediction tasks. We implemented the response generation pipeline using the transformers library (Wolf et al., 2020), where a BERT model (Devlin et al., 2018) was fine-tuned on the SQuAD dataset. We did not further fine-tune the question-answering system for the response generation module, instead relying on semantically-matched unstructured data sections as the inputs for generating answers to questions.
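The answer-extraction step can be sketched with the transformers question-answering pipeline as below; the publicly available SQuAD-fine-tuned BERT checkpoint is named only for illustration, since the exact weights loaded by the system are not specified here.

```python
from transformers import pipeline

# BERT fine-tuned on SQuAD, used as-is with no further task-specific fine-tuning.
answerer = pipeline("question-answering",
                    model="bert-large-uncased-whole-word-masking-finetuned-squad")

# `section` stands in for the knowledge-base section chosen by the semantic matcher (Section 4.2).
section = ("A scale factor is the number that the dimensions of a figure are multiplied "
           "by to produce an enlarged or reduced copy of the figure.")

result = answerer(question="What is scale factor?", context=section)
print(result["answer"], result["score"])
```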
4.5 Session Feedback

All text input by the pre-service teacher is retained, along with the associated adapted IQA category classification. The compilation of the classifications of the pre-service teacher's input texts is provided in an assessment report that is used as feedback for the pre-service teacher at the end of the session.
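One way the retained turns could be classified and tallied into that report is sketched below; the classifier checkpoint path, the label names, and the report structure are illustrative assumptions, and the rubric dictionary is the kind of object loaded from the JSON file described in Section 4.

```python
from collections import Counter
from transformers import pipeline

# Hypothetical path to the fine-tuned adapted-IQA classifier (DistilBERT, see Section 5.1).
iqa_classifier = pipeline("text-classification", model="path/to/iqa-distilbert")

def session_report(teacher_turns: list[str], rubric: dict[str, int]) -> dict:
    """Classify every retained teacher turn and compare per-category counts with the
    rubric minimums to produce the end-of-session feedback."""
    labels = [iqa_classifier(turn)[0]["label"] for turn in teacher_turns]
    counts = Counter(labels)
    return {
        "counts": dict(counts),
        "rubric_met": {category: counts[category] >= minimum
                       for category, minimum in rubric.items()},
        "transcript": list(zip(teacher_turns, labels)),
    }
```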
5 Results

This section discusses the step-by-step interaction with the developed interface. Only one scenario currently exists; it is defined as a 4th-5th grade, average understanding of the mathematics topic scale factor.

5.1 IQA Classifier

To train the IQA classifier, we used BERT (Devlin et al., 2018) and DistilBERT (Sanh et al., 2019) for classification based on the open-source Transformers (Wolf et al., 2020) implementation. We trained the classifiers on 80% of the data and sectioned the remaining data into a 10% validation set and a 10% test set. The validation set accuracies for the BERT and DistilBERT models were 75.8 and 74.3, respectively. Since the performance increase with BERT was minimal, we used DistilBERT for faster inference.

5.2 Walk Through

The platform is currently hosted on PythonAnywhere. Pre-generated credentials for instructors are generated for a given user ID, and each user uses the unique ID to create an associated password. Following the prompts, the user is able to read through the IRB consent form and sign in to access the conversational interface. Figure 4 shows the first conversational agent interaction screen as well as example inputs where the pre-service teacher may ask several questions to assess the conversational agent's current level of understanding.

The pre-service teacher, or user, can then interact with the conversational agent by typing questions or statements. The Appendix contains additional screen captures of the developed interface. Currently the system only supports text-based conversation.

An example of a text snippet that does not meet the set semantic matching threshold, together with the generated conversational agent response, is demonstrated in Figure 4. There are instances when the conversation breaks despite the teacher re-framing their questions or input differently. This is intended to reflect how a student may sometimes fail to answer a properly framed question or statement appropriately.

Figure 4: Conversational Agent Interface Screen. The left image shows the initial screen. The right image shows an interaction example where the conversational agent has identified a statement that does not meet the semantic matching threshold within the knowledge base.

Once the session is completed, the pre-service teacher receives a session report which incorporates the adapted IQA rubric and the number of their interactions classified within each of its sections. They are also provided a report of the entire conversation that is shared with expert teachers as well, in an effort to identify how the conversational agent stages can be designed better.

6 Conclusion and Future Work

Our goal in this paper is to demonstrate the implementation of a conversational agent with very little training data that relies on foundational and well-studied metrics like the IQA. By leveraging state-of-the-art modules for natural language processing and deep learning we could build a functional prototype that is now going to be used as a pilot for training on a specific mathematical scenario, "scale factor". By integrating pre-trained models such as BERT fine-tuned on SQuAD and the Universal Sentence Encoder, as well as using weak supervision approaches in data treatment, we have leveraged minimal amounts of domain-expert-labeled data and knowledge base data in order to create a usable interface. Approaches like this help evaluate pre-service teachers in a scalable fashion and can also be deployed across the web for large-scale participation. There are several areas we are pursuing to improve this interface in order to provide a more robust interface as well as a more useful assessment tool. Some of the current features that we are working on are as follows:

• Improved IQA model: We plan to continue collecting domain-expert labeled data that can then be used to improve the trained classification model. The improved classification model will better reflect a realistic assessment of pre-service teacher sessions using the IQA-developed categories as a rubric.

• Increased Knowledge Base: This first scenario is limited to a specific knowledge base on the topic "scale factor". We plan to incorporate a wider variety of data related to expected knowledge on this topic (from 4th grade to 10th grade) as well as associated topics such as perimeter, volume, and ratio. With these knowledge bases compiled, we will be able to test techniques that can assist in generating responses that range from a limited understanding of the desired topic to a more advanced level of understanding of the topic.
• Response Generation: Currently there are some features within the responses that appear to represent the way a student is more likely to respond. We would like to further develop these features by incorporating more student-like speech features. This would allow for more realistic conversational agent interaction and result in less formal or textbook-like responses. Additionally, current responses generated are most coherent when responding to a question, which is not necessarily reflective of all student-teacher interactions. Future developments are planned to improve the robustness of responses to better account for the different forms of inputs.

Finally, we plan to deploy this tool with a group of pre-service teachers under the direction of expert teachers in order to test the qualitative aspects and realism of this system.

Acknowledgments

This work was funded in part by the Robertson Foundation. The authors wish to acknowledge the use of de-identified classroom dialogue from NSF 1535024.

References

Jo Boaler and Karin Brodie. 2004. The importance, nature and impact of teacher questions. In Proceedings of the twenty-sixth annual meeting of the North American Chapter of the International Group for the Psychology of Mathematics Education, volume 2, pages 774–782.

Melissa Boston. 2012. Assessing instructional quality in mathematics. The Elementary School Journal, 113(1):76–104.

Melissa Boston, Jonathan Bostic, Kristin Lesseig, and Milan Sherman. 2015. A comparison of mathematics classroom observation protocols. Mathematics Teacher Educator, 3(2):154–175.

Melissa D. Boston and Amber G. Candela. 2018. The instructional quality assessment as a tool for reflecting on instructional practice. ZDM, 50(3):427–444.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder.

Nalin Chhibber and Edith Law. 2019. Using conversational agents to support learning by teaching. arXiv preprint arXiv:1909.13443.

Debajyoti Datta, Valentina Brashers, John Owen, Casey White, and Laura E. Barnes. 2016. A deep learning methodology for semantic utterance classification in virtual human dialogue systems. In International Conference on Intelligent Virtual Agents, pages 451–455. Springer.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Shuyang Gao, Abhishek Sethi, Sanchit Agarwal, Tagyoung Chung, and Dilek Hakkani-Tur. 2019. Dialog state tracking: A neural reading comprehension approach. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, pages 264–273.

Melody Guan, Varun Gulshan, Andrew Dai, and Geoffrey Hinton. 2018. Who said what: Modeling individual labelers improves classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. Advances in Neural Information Processing Systems, 28:1693–1701.

Brian William Junker, Yanna Weisberg, Lindsay Clare Matsumura, Amy Crosson, Mikyung Wolf, Allison Levison, and Lauren Resnick. 2005. Overview of the instructional quality assessment. Regents of the University of California.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP.

Tomáš Kočiskỳ, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328.

Labelbox. 2020. Labelbox: Training data platform.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach.

Lindsay Clare Matsumura, Brian Junker, Yanna Weisberg, and Amy Crosson. 2006. Overview of the instructional quality assessment.

Hugh Mehan. 1979. 'What time is it, Denise?': Asking known information questions in classroom discourse. Theory into Practice, 18(4):285–294.

Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2015. Multi-domain dialog state tracking using recurrent neural networks. arXiv preprint arXiv:1506.07190.

Robert C. Pianta and Bridget K. Hamre. 2009. Conceptualization, measurement, and improvement of classroom processes: Standardized observation can leverage capacity. Educational Researcher, 38(2):109–119.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Alexander J. Ratner, Christopher M. De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. 2016. Data programming: Creating large training sets, quickly. In Advances in Neural Information Processing Systems, pages 3567–3575.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Pavel Smutny and Petra Schreiberova. 2020. Chatbots for learning: A review of educational chatbots for the Facebook Messenger. Computers & Education, page 103862.

Maxim Tkachenko, Mikhail Malyuk, Nikita Shevchenko, Andrey Holmanyuk, and Nikolai Liubimov. 2020. Label Studio: Data labeling software. Open source software available from https://github.com/heartexlabs/label-studio.

Zhuoran Wang and Oliver Lemon. 2013. A simple and generic belief tracking mechanism for the dialog state tracking challenge: On the believability of observed information. In Proceedings of the SIGDIAL 2013 Conference, pages 423–432.

Jason D. Williams. 2014. Web-style ranking and SLU combination for dialog state tracking. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 282–291.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Zhao Yan, Nan Duan, Peng Chen, Ming Zhou, Jianshe Zhou, and Zhoujun Li. 2017. Building task-oriented dialogue systems for online shopping. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.

Appendix
Figure 5: Conversational Agent Session Example: Login

Figure 6: Conversational Agent Session Example: Acknowledgement

Figure 7: Conversational Agent Session Example: Institutional Review Board consent

Figure 8: Conversational Agent Session Example: Initial Session Screen

Figure 9: Conversational Agent Session Example: Testing Scale Factor Knowledge

Figure 10: Conversational Agent Session Example: Improperly Phrased Question

Figure 11: Conversational Agent Session Example: Properly Phrased Question

Figure 12: Conversational Agent Session Example: Irrelevant Question