An Item Response Theory Framework for Persuasion

      Anastassia Kornilova        Daniel Argyle        Vlad Eidelman
                         FiscalNote Research
              anastassia, daniel, vlad@fiscalnote.com

    Findings of the Association for Computational Linguistics: NAACL
    2022, pages 77 - 86, July 10-15, 2022. ©2022 Association for
    Computational Linguistics

                              Abstract

    In this paper, we apply Item Response Theory, popular in
    education and political science research, to the analysis of
    argument persuasiveness in language. We empirically evaluate the
    model's performance on three datasets, including a novel dataset
    in the area of political advocacy. We show the advantages of
    separating these components under several style and content
    representations, including evaluating the ability of the speaker
    embeddings generated by the model to parallel real-world
    observations about persuadability.

1   Introduction

Persuasion is the art of instilling in someone a given belief or the
desire to take a given action. The action can be expressing agreement
with the speaker in a debate (Durmus and Cardie, 2019), making a
donation to a crowdfunding campaign (Yang et al., 2019) or non-profit
(Wang et al., 2019), or a Supreme Court ruling
(Danescu-Niculescu-Mizil et al., 2012). Social psychology frameworks
for understanding persuasion, such as the Elaboration Likelihood
Model (ELM), argue that attributes of successful persuasion fall into
three groups: (1) message, the text of the argument; (2) audience;
and (3) speaker, the source of the argument (Petty and Cacioppo,
1986; Lukin et al., 2017; Cialdini, 2009).

Although much attention has been given to studying the text, text in
isolation fails to capture how the audience's prior beliefs and
predispositions can affect their response to the same argument.
Several recent studies have considered all three factors within the
context of specific datasets by creating features to represent the
audience as a whole or by building separate models for different
types of audiences (Lukin et al., 2017; Tan et al., 2016; Durmus and
Cardie, 2019; El Baff et al., 2020). In this paper, we present a
broad framework that can represent individual audience members in
one model across a diverse set of persuasion tasks.

Since implementing the ELM framework requires separate data about
the speaker, audience, and argument, it is difficult to validate
empirically. Often, we only have access to the observed outcome
(e.g. did the person donate money); both the persuadability of the
audience and the persuasiveness of the argument are unobserved.
Motivated by this, we explicitly model a persuasive scenario as a
function of latent variables describing the persuadability of the
audience and the persuasiveness of the text.

Our approach is based on Item Response Theory (IRT), a framework for
modeling the interaction between latent traits and observable
outcomes. While these types of models are well known in the context
of education (Fischer, 1973; Lord, 1980; McCarthy et al., 2021) and
politics (Clinton et al., 2004), to our knowledge this is the first
application of an IRT model to study persuasion. Using this
framework, we explicitly model the interaction between the audience
and the grouped argument and speaker. The argument and speaker are
grouped together because in practice it is hard to separate their
effects, especially in the written tasks covered in this study.

We explore two variations on the IRT framework and apply them to
three different persuasion tasks. In addition to two previously
studied tasks, we introduce a novel setting related to political
advocacy group campaigns, where a recipient is asked by an
organization to take a specific action.

We evaluate these models with different parameterizations, including
style and content features, showing that they are both effective for
predicting persuasion and able to uncover latent characteristics of
the audience that were modeled explicitly in previous works.

Our contributions are as follows: 1) we formalize the use of IRT
model formulations for persuasion and show their advantages over
existing approaches, 2) we introduce a new dataset of
political advocacy emails, 3) we apply the formulations with style
and content features on three persuasion tasks, and 4) we show that
the separate latent audience component is interpretable and
consistent with external information. All code associated with the
paper is available at https://github.com/akornilo/IRT_Persuasion.

2   Item Response Theory

Item Response Theory (IRT) represents a set of models that explain
an observed outcome based on latent traits. These models are
frequently used when an outcome is easily observed, but the factors
predicting that outcome are unobservable. For example, in education
an outcome could be a student's answer to an exam question, and the
latent predictive traits are a student's knowledge and the
difficulty of the question; in politics an outcome could be a vote
on a bill, and the unobservable traits are the legislator's and
bill's ideology. Crucially, an IRT model provides both a prediction
of the outcome and an interpretable measurement of the latent
variables.

In applying IRT to persuasiveness, we can view the audience as
having a response to the item, where the item is an argument
composed of the speaker and message pair.

2.1 Rasch Testing Model

We build on two specific IRT parameterizations. The first, the Rasch
model (Rasch, 1960), is commonly used in education research to model
the difficulty of standardized test questions (Fischer, 1973; Lord,
1980). In it, the probability that an individual i answers test
question j is given by:

         p(yij = 1 | α, β) = σ(αi − βj)                  (1)

where αi represents a respondent (e.g. a student's ability) and βj
represents the item (e.g. the difficulty of a test question).
Intuitively, if the ability is greater than the question difficulty,
then the student will answer the question correctly. Given a series
of exam sessions, one can estimate values of α and β for all of the
students and questions in the dataset. This can be done using a
variety of optimization strategies, such as Expectation Maximization
or Bayesian techniques (Bock and Aitkin, 1981; Natesan et al.,
2016).

However, one limitation of this approach is that it cannot be used
to perform inference on new test questions because all parameters
are estimated simultaneously. To solve this problem, Fischer (1973)
proposed the linear logistic test model, which parameterizes the
difficulty, β, as a weighted linear combination of test features. In
this formulation, the student (α) remains a latent variable, but the
β of an unseen question can be predicted using attributes of the
question itself.

Following Fischer (1973), the parameterization used to predict the
item parameters is a weighted linear sum of features:

         βj = Σ_{k=1..K} wk × ψjk                        (2)

where ψjk is an input feature representing item j, and wk is the
associated weight.

In order to apply this model to persuasion, we propose considering
argumentation as follows: First, arguments can vary in quality,
similar to test questions having different difficulty levels.
Second, we can only measure the quality of an argument based on how
the audience reacted, similar to how a student's ability is measured
via their performance. Third, it is possible that a good argument is
matched with an audience reticent to persuasion, similar to a good
student receiving a particularly hard question. Note that this
requires that an audience member observe multiple arguments, and
that each argument be heard by multiple audience members. Inspired
by the linear logistic model, we model the latent argument parameter
as a function of attributes of the argument itself, thus allowing us
to include attributes of the speaker and text in the model directly.

2.2 Two Parameter IRT

While the simplicity of the Rasch model is powerful, a two parameter
generalization of an IRT model (a two parameter logistic - 2PL)
provides additional benefits for our application (Birnbaum, 1968).
In the simplest version, a two parameter model (so called because
the item is modeled with two parameters) is as follows:

         p(yij = 1 | α, φ, β) = σ(αi · φj − βj)          (3)

where, as before, αi represents the respondent (the student's
ability) and βj is the item's difficulty (as in the Rasch model, the
overall difficulty level of the question).
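The response models above can be written out directly from Equations
1-3. The following is a minimal sketch of our own (the function and
variable names are illustrative and not taken from the paper's
released code), with φj the 2PL's second item parameter:

```python
import math

def sigmoid(x):
    # Logistic function used in both IRT formulations.
    return 1.0 / (1.0 + math.exp(-x))

def rasch_prob(alpha_i, beta_j):
    # Equation 1: P(y_ij = 1) = sigmoid(alpha_i - beta_j).
    return sigmoid(alpha_i - beta_j)

def two_pl_prob(alpha_i, phi_j, beta_j):
    # Equation 3: P(y_ij = 1) = sigmoid(alpha_i * phi_j - beta_j).
    return sigmoid(alpha_i * phi_j - beta_j)

def linear_item_difficulty(weights, psi_j):
    # Equation 2: beta_j as a weighted linear sum of item features.
    return sum(w * psi for w, psi in zip(weights, psi_j))
```

Note that with φj fixed at 1, the 2PL probability reduces to the
Rasch probability.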
The parameter φj now represents the item's discrimination: how well
the question separates students who perform better from those who
perform worse (a high value indicates the question clearly separates
high-scoring from low-scoring students, while a negative value would
indicate that low-performing students are more likely to get the
question right than high-performing ones). We similarly generalize
this model by estimating the two item parameters, βj and φj, as
linear functions of features as in Equation 2.

This framework has commonly been used to explain legislator voting
behavior (Clinton et al., 2004), a useful analogy as many of the
persuasion contexts we consider have political undertones. In this
case, the response yij is a vote by respondent i (a legislator) on
item j (a bill). Clinton et al. (2004) show that the parameter αi
can then be interpreted as the respondent's ideology (e.g. negative
values are more liberal, positive values are more conservative); φj
is referred to as the bill's polarity, i.e. discrimination (large
negative or positive values indicate that a bill is strongly
ideological, while a value close to zero means the vote isn't
strongly driven by ideology); and βj represents the bill's
popularity, i.e. difficulty (large values indicate a bill that is
"difficult" to vote for and is less likely to be supported
regardless of ideology). Persuasion is a generalization of this
framework because popularity can correspond to properties of
arguments that are appealing overall, while polarity represents
techniques or topics that appeal only to a subset of the audience.

2.3 Audience Analysis

Once a Rasch or a 2PL model is fit, the learned α can be interpreted
as a one-dimensional respondent embedding. In the legislator voting
context these values can be interpreted as ideologies: legislators
with very negative or very positive embeddings reflect very liberal
and very conservative stances, respectively, while those with
small-value embeddings map to moderate legislators. While the
interpretation of these values will depend on the task, in general,
similar embeddings will map to similar audience members and can be
grouped together for further analysis.

3   Related Works

Audience Effects  The properties of the audience in relation to
argument persuasiveness have previously been examined in several
predictive studies. Lukin et al. (2017) show that audiences with a
more "open" personality respond better to emotional arguments, while
El Baff et al. (2020) show that liberals are more affected by the
style of a news editorial than conservatives. Wang et al. (2019)
also find that people with different personality types respond
differently to emotional vs. logical appeals. Tan et al. (2016) show
how "malleable" different Reddit users are to new perspectives.
Durmus and Cardie (2018, 2019) show that prior beliefs play a strong
role in how persuadable someone is. Cano-Basave and He (2016) study
the persuasiveness of style in political speeches. In contrast to
these studies, our method is designed to work when we have limited
or no information about the audience of an argument.

Item Response Theory  As described in the previous section, IRT
models have primarily been applied in politics to measure the
ideology of politicians (Clinton et al., 2004; Poole and Rosenthal,
1985). While most IRT implementations in this area rely only on the
responses as data, more recent work augments the models to take
advantage of the text through a simultaneously estimated topic model
(Gerrish and Blei, 2012; Vafa et al., 2020; Lauderdale and Clark,
2014).

IRT has also been applied to large-scale datasets to verify the
validity of standardized tests both in the U.S. and internationally
(AERA et al., 2014; Rutkowski et al., 2014). Recent advances have
focused on polytomous test questions and on creating new questions
(the 'cold-start' problem: Settles et al., 2020; McCarthy et al.,
2021). In this paper, we focus on the simplest form, but this area
of research points to many possible extensions.

Argument Quality  Argument mining has been studied in various
domains (Palau and Moens, 2009). Most relevant here, several studies
have attempted to study argument quality through pairwise ranking as
the outcome (Habernal and Gurevych, 2016; Gleize et al., 2019;
Toledo et al., 2019).

Framing Theory  In the study of framing effects, the expectancy
value model (Chong and Druckman, 2007) represents an attitude as
Σi vi × wi, where vi is the favorability of the object of evaluation
(e.g. a candidate) on dimension i (e.g. foreign affairs or
personality), and wi is the salience weight (Σi wi = 1). Our
parameterization of βj and φj can be seen in this paradigm as
identifying frames in communication, with each feature of the style
and content as a dimension, and learning the framing effect of each.
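The expectancy value attitude Σi vi × wi can be made concrete with a
small sketch of our own; it normalizes the salience weights so that
they sum to 1, as the framework requires:

```python
def expectancy_value_attitude(favorability, salience):
    # Attitude = sum_i v_i * w_i (Chong and Druckman, 2007), where
    # v_i is the favorability on dimension i and w_i its salience.
    total = sum(salience)
    weights = [w / total for w in salience]  # enforce sum(w_i) = 1
    return sum(v * w for v, w in zip(favorability, weights))
```

For instance, an evaluation that is favorable on a heavily weighted
dimension and unfavorable on a lightly weighted one yields a mildly
positive attitude overall.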
4   Datasets

In order to apply the IRT framework, an audience member must respond
to multiple arguments (and arguments must be observed by multiple
audience members); with too few responses, an audience member's
latent value will be driven entirely by one or two arguments. While
few existing argument mining datasets meet these criteria, we are
able to study three diverse settings. Additionally, our advocacy
task is akin to many real-world settings where users on one platform
are asked to complete an arbitrary task (e.g. a retail mailing list
getting users to click on a promotion).

4.1 NYTimes Editorials

The NYTimes Editorial corpus[5] consists of 975 editorials from the
New York Times news portal (El Baff et al., 2018). Each publication
was reviewed by 3 conservatives and 3 liberals from a pool of 12
conservative and 12 liberal reviewers.

Each reviewer rated the editorials as either 'challenging',
'reinforcing' or 'no effect'. These labels must be approached with
care, as 'reinforcing' could imply 'reinforced view against the
article's stance'. El Baff et al. (2020) study this corpus in a
ternary setting by aggregating the liberal and conservative votes
and building separate models for each side. For our study, we
construct a binary task of predicting whether an article had an
effect. While this framing elides whether the speaker succeeded
according to her intent, it does relay whether the argument was
persuasive.

4.2 Debates (DDO) Corpus

DDO is a corpus of 78k debates scraped from debate.org.[6] Each
debate has two speakers, and an audience votes on a winner.[7] In
addition, each audience member can fill out their profile with their
political and religious ideology, and stance on various political
issues (e.g. Abortion or the Border Wall). Originally, it was used
to study how prior beliefs and similarities between the audience and
the speaker affected debate outcomes (Durmus and Cardie, 2018,
2019).

To preprocess the data, we removed all debates that have fewer than
three rounds, end in a forfeit or a tie, have fewer than 100 words
per side, or have fewer than 5 points awarded total. In addition, we
excluded debates not on the following issues: Politics, Religion,
Society, Philosophy, Education and Economics. Since we are
interested in modeling individual audience members, we identify
audience members who have responded on at least 10 debates, then
remove debates where none of those members responded. The final
dataset contains approximately 60k datapoints: 6,320 debates and
1,131 responders.

Each debate has one side with a pro argument and one side with a con
argument, with the winning side being the one assigned more points.
The prediction task consists of whether a responder gave more points
to a given debate side. Since our models only consider one argument
at a time, we treat each side of the debate as a separate item,
concatenating the texts from all rounds from that speaker.[8]

4.3 Advocacy Campaign Corpus

Grassroots advocacy is the process wherein organizations (e.g.
companies, non-profits, coalitions) encourage individual citizens to
influence their government. In the United States, such lobbying
often takes the form of advocacy email campaigns, sent by an
organization to specific audiences, asking them to take an action,
such as contacting their legislators to vote yes or no on a
particular bill.

We construct a dataset containing the text and metadata of these
emails, from a popular advocacy software platform, paired with
whether recipients took the requested action.[9] Organizations will
send different messages to the same audience over time, allowing us
to identify which emails (items) elicited a response from specific
recipients. Thus, it is possible to distinguish messages that did
not generate interest overall (popularity) from messages that did
not resonate with specific groups of recipients (polarity).

The dataset contains 63,795 individual recipients of 7,067 email
campaigns from 328 different organizations, resulting in
approximately 2 million individual data points.

[5] https://webis.de/data/webis-editorial-quality-18.html
[6] https://www.cs.cornell.edu/~esindurmus/ddo.html
[7] While the audience can assign points to various aspects of the
debate, this study will only consider the cumulative sum of the
points.
[8] We are interested in how a single unit of argument affects the
audience, and leave an extension that accounts for both sides
simultaneously to future work.
[9] Due to privacy concerns, this dataset will not be released, but
platform users agreed to terms of service providing for internal
analysis.
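The DDO preprocessing in Section 4.2 amounts to a conjunction of
filters over each debate. A sketch of our own, using hypothetical
field names rather than the actual DDO schema:

```python
# Topics retained after filtering, per Section 4.2.
ALLOWED_TOPICS = {"Politics", "Religion", "Society",
                  "Philosophy", "Education", "Economics"}

def keep_debate(debate):
    # Keep a debate only if it has at least three rounds, did not
    # end in a forfeit or tie, has at least 100 words per side, has
    # at least 5 points awarded, and is on an allowed topic.
    return (debate["rounds"] >= 3
            and not debate["forfeit"]
            and not debate["tie"]
            and min(debate["words_per_side"]) >= 100
            and debate["total_points"] >= 5
            and debate["topic"] in ALLOWED_TOPICS)
```

The remaining step (keeping only audience members with at least 10
debates, then dropping debates none of them voted on) would be a
second pass over the filtered set.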
Each recipient has data for 15 to 100 emails and had an action rate
between 5% and 95%.[10] Each email included in the dataset had at
least 6 responses.

The data is not balanced with respect to organizations; while the
largest organizations sent over 200 emails, the median is 6. One
risk of this imbalance is overfitting to a feature that is only
pertinent to one particularly prevalent organization. To mitigate
such effects, we include an indicator variable to specify the
organization.[11]

5   Model Features

Argument analysis is often separated into style and content features
(Cano-Basave and He, 2016; Longpre et al., 2019; El Baff et al.,
2020), with additional categories included for argument quality and
task-specific properties. Since we group the speaker and the
argument text together, we combine features representing both as
inputs to φ and β.

Lexicon Style Features  Style features represent higher-level
properties of words and rhetorical structures. We chose the
following sets of such features from lexicons commonly used in
previous argumentation literature: the LIWC lexicon of 93 metrics
ranging from parts-of-speech to thinking styles to emotions
(Pennebaker et al., 2015);[12] Valence, Arousal, Dominance (Warriner
et al., 2013) and Concreteness (Brysbaert et al., 2014), which were
shown to be useful for argument quality analysis by Tan et al.
(2016); argument features developed by Somasundaran et al. (2007),
including necessity, emphasizing, desire, contrasting and rhetorical
questions; the NRC Lexicon of word-level associations for emotions
like anger, disgust and fear (Mohammad and Turney, 2013); and
sentiment and subjectivity as implemented in the TextBlob Python
library.[13]

Argument Text  We use TF-IDF unigrams to represent the text directly
(tuned with respect to each task). While we initially explored using
deep, contextual text representations, they did not show benefit,
and the motivation for this paper is to understand the benefits of
the IRT framework, rather than to optimize performance based on the
argument alone.

Debate-Only Speaker Features  On the debate platform, users can
optionally specify a stance - for, against, undecided or no stance -
on 48 issues such as Abortion, Death Penalty or Gay Marriage. These
can be viewed as a proxy for the content, as users often present
arguments that align with their views.

Advocacy-Only Org Indicator  An indicator to account for the large
variation in action rate between organizations. Additional
indicators are used to represent the industry and organization size.

Advocacy-Only Appeals  Using data from Wang et al. (2019), we built
a multi-class classifier to recognize 'emotional', 'logical' and
'credibility' appeals. The classifier was applied at a sentence
level to the emails, and features were created for the average and
the sum of the scores across the sentences.

Advocacy-Only Misc Features  The day of the week and time of day
have a strong effect on email click rate.[14] We include indicator
features for the day of the week and the hour of the day, as well as
an urgency indicator feature based on a custom list of words
indicative of high urgency and timeliness (e.g. "soon", "now",
"hurry").

IBM Quality  Gretz et al. (2019) released a dataset of 30k
sentence-level arguments with 0-1 quality ratings. Unlike our tasks,
where quality is a latent property, these sentences were assessed
for quality directly. We re-implemented the BERT-FT model from this
paper, using the MACE-P score. Since these scores were trained on
short texts, we apply the model to individual sentences in the input
text, then use the min, max, average, range, and 25th, 50th, and
75th percentiles of these scores. As far as we know, this is the
first study to transfer the quality model to longer texts. These
features will be grouped with Style for the analysis.

[10] Those with a lower or higher action rate are unlikely to be
illustrative of persuasion characteristics.
[11] Alternatively, we could construct separate models for each
organization, but refrain from doing so for three reasons. First,
about a quarter of recipients are 'multi-org' - they receive emails
from multiple sources; thus, we would like to model their behavior
across all of them. Second, as many of the organizations are not
well represented, they benefit from patterns that appear across
different organizations. Finally, maintaining a separate model for
every organization is not as efficient or scalable.
[12] We purchased a copy of the software from liwc.wpengine.com to
obtain these labels.
[13] https://textblob.readthedocs.io/
[14] https://sleeknote.com/blog/best-time-to-send-email
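The IBM Quality features above aggregate sentence-level scores into
document-level statistics (min, max, average, range, and quartiles).
A standard-library sketch of our own; the feature names are
illustrative:

```python
import statistics

def quality_features(sentence_scores):
    # Aggregate per-sentence quality scores for one document into
    # the summary statistics used as model features.
    lo, hi = min(sentence_scores), max(sentence_scores)
    # statistics.quantiles with n=4 returns the three quartile cut
    # points (25th, 50th, 75th percentiles).
    q25, q50, q75 = statistics.quantiles(sentence_scores, n=4)
    return {
        "min": lo,
        "max": hi,
        "avg": statistics.mean(sentence_scores),
        "range": hi - lo,
        "p25": q25, "p50": q50, "p75": q75,
    }
```

Note that `statistics.quantiles` defaults to the "exclusive" method,
so the exact quartile values depend on the interpolation method
chosen.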
Model               Accuracy
Audience Prior         0.662
Style                  0.741
Text                   0.754
Style + Text           0.750

Table 1: Results for the Editorials Task (Rasch Model).

6   Models and Results

Since the Editorials corpus is the smallest, we use the simpler
Rasch parameterization, while the 2PL model is used for the Debates
and Advocacy tasks. Each of the models is trained using a
regularized binary cross-entropy loss:

    L(ŷi, yi) = −yi log ŷi − (1 − yi) log(1 − ŷi) + c · ‖α, β, φ‖

where ŷi is the output from Equation 1 or 3, and yi is the binary
label representing whether the persuasion was successful. The final
term applies regularization with strength c. Details on the
experimental parameters can be found in Appendix A. For each task,
an audience prior baseline

Figure 1: Reviewer Embeddings for the Editorial Rasch Model on the
x-axis. Blue represents liberal reviewers, red represents
conservative reviewers.

to the New York Times style; however, the fact that the majority of
reviewers from both sides have similar embeddings suggests that the
pattern is not very strong.

This data also contained information from each reviewer's Big 5
Personality test. We measured the Pearson correlation between the
reviewers' embeddings and found a strong correlation with
extroversion (r=-0.568, p
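The regularized objective from Section 6 can be sketched
per-example as follows. We assume a squared L2 penalty over the
concatenated latent parameters, since the paper writes the penalty
only as c · ‖α, β, φ‖; both that choice and the default value of c
are our assumptions:

```python
import math

def irt_loss(y_hat, y, params, c=0.01):
    # Binary cross-entropy for one (audience member, item) pair,
    # where y_hat comes from Equation 1 (Rasch) or Equation 3 (2PL).
    bce = -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))
    # Regularization over all latent parameters; the squared L2
    # norm and the default c are assumptions on our part.
    reg = c * sum(p * p for p in params)
    return bce + reg
```

In practice this loss would be summed over all observed
(respondent, item) pairs and minimized with a gradient-based
optimizer.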
Model                    Accuracy
         Random                       0.500
         Style                        0.561
         Text                         0.581
         Speaker                      0.611
         Speaker + Style              0.626
         -β (popularity) layer        0.604

   Table 2: Results for Debates Task (2PL Model).
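For reference, the Rasch and 2PL parameterizations used in these experiments take the standard IRT forms below. This is a textbook sketch, not the paper's exact equations 1 and 3; in the paper's reading, the audience embedding plays the ability role and the item parameters describe the argument.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rasch(theta, b):
    """Rasch model: success probability depends only on ability minus difficulty."""
    return sigmoid(theta - b)

def two_pl(theta, a, b):
    """2PL model: adds a discrimination parameter a that scales the ability gap."""
    return sigmoid(a * (theta - b))
```

Setting a = 1 recovers the Rasch model, which is why the simpler form is preferred for the small Editorials corpus.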

Figure 2: Distribution of one-dimensional audience embeddings on the y-axis.

popularity parameter β, the performance decreases, which confirms the theory that both polarity and popularity are necessary to adequately represent the argument and the speaker. The Speaker stance model outperforms just Text; a probable explanation is that the stances are a proxy for the actual opinions expressed in the text that a simple unigram representation cannot capture.

To understand the latent audience embeddings, we compare them to the self-reported political affiliations from their profiles. Figure 2 shows a clear separation between liberals and conservatives (the two largest groups). This finding supports the work of Durmus and Cardie (2019), which showed that similarity on 'Big Issue Stance' between the speaker and the audience member is a good indicator for predicting outcome. As with Editorials, the advantage of our approach is that we were able to infer audience member preferences without using their profiles.

To understand what φ and β tell us about persuasive theory, we focus on the Speaker+Style model:
High Polarity: Abortion, Gay Marriage, Progressive Tax;
Low Polarity: Border Fence, Gun Rights, Homeschooling;
High Popularity: quality max, quality range, liwc differ;
Low Popularity: liwc Exclam, liwc authentic, liwc drives.

For popularity, the significant factors are related to style and quality. The high 'quality max' feature suggests that the quality model transfers better to this context than to Editorials. The low popularity value for 'liwc authentic' is interesting, as El Baff et al. (2020) also found that authenticity generally led to No Effect editorials.

For polarity, the highest weighted features are the stances. 'Polarity High' corresponds to having a Pro stance on those issues, which in this case represents a Liberal viewpoint. This corresponds with the Liberal recipient embeddings in Figure 2 having generally positive embeddings (alignment in weights results in a positive final weight). The opposite is true for the Conservative issues and embeddings. This alignment reinforces the finding that prior beliefs play a strong role in outcomes (Durmus and Cardie, 2018).

Figure 3 plots the weights learned for each feature for the polarity and popularity parameters.16 Notably, the orthogonal pattern extends beyond the top features: features that strongly predict whether the audience responds to an argument do not usually strongly predict whether the argument is popular overall.

Figure 3: Contrast of weights from popularity vs. polarity features.

16 This figure excludes features that had very small weights along both dimensions.
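Reading off the highest- and lowest-weighted features of a learned parameter vector, as done above for φ and β, amounts to sorting by weight. A minimal sketch with hypothetical feature names:

```python
import numpy as np

def extreme_features(names, weights, k=3):
    """Return the k highest- and k lowest-weighted feature names.

    names   : list of feature labels
    weights : learned weight vector (e.g. the polarity or popularity layer)
    """
    order = np.argsort(weights)                  # ascending by weight
    low = [names[i] for i in order[:k]]          # most negative weights
    high = [names[i] for i in order[-k:][::-1]]  # most positive, descending
    return high, low
```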

Overall              Audience Average             Org Average
                                       Acc.     Macro-F1           Acc.       Macro-F1       Acc.    Macro-F1
               Org Prior              0.608           0.514       0.606           0.263    0.630         0.513
               Audience Prior         0.710           0.415       0.716           0.318    0.714         0.472
               Org Only               0.757           0.667       0.759           0.589    0.728         0.573
               Org + Style            0.781           0.708       0.761           0.662    0.771         0.678
               - β (popularity)       0.750           0.653       0.749           0.643    0.756         0.654
               Sep Feat V1            0.725           0.619       0.726           0.571    0.700         0.520
               Sep Feat V2            0.748           0.678       0.750           0.604    0.698         0.654

                                    Table 3: Results For Advocacy Task (2PL Model).

6.3 Advocacy Results

Table 3 shows the results for the Advocacy task.17 The overall accuracy and macro-F1 scores represent results across all data, while the Org and Audience average accuracy represent data for individual organizations and respondents. Due to the variation in action rate and sample size, the macro-F1 results are particularly important.

While the Org Only model performs well,18 the improved performance with the addition of Style suggests that the style of an email still affects the user. The style features may have an advantage for recipients associated with a diverse set of organizations. Without β, the performance is significantly worse, again confirming the need for both parameters.

To better understand the effect of style and org features, two additional models are trained that separate polarity and popularity. In Sep Feat V1, φ receives style features and β receives org indicators. In this setting, (α · φ) represents how individuals are affected by style, while β models the organization's base rate. In Sep Feat V2, the features are reversed. V1 has the worst performance of all five 2PL models, suggesting that modeling the interaction between the recipient and organization (α · φ) is important. Org Only and V2 have mixed performance on accuracy, but V2 performs better on macro-F1, suggesting that style influences the recipients' decisions to act.

Finally, we analyze the features with the lowest and highest magnitudes from β in the Org+Style model. The highest weighted features include concreteness, average-logical-appeal, word count, and quality 75th percentile. The lowest weighted features (unlikely to produce action) include valence, quality mean, arousal, and liwc-we. Similar to the Editorials, the quality features are contradictory, suggesting that the connection between sentence-level and document-level quality needs to be investigated further. The logical appeal feature shows that logical appeals are particularly effective (the corresponding scores for emotional and credibility appeals had smaller, negative weights).

17 Due to computational constraints, we omitted the raw text model from this task.
18 One likely explanation for this performance is that the audience is not independent of the speaker: by virtue of receiving emails from this organization, recipients may also have similar preferences.

7   Conclusion and Future Work

In this paper, we validate the social psychology frameworks for persuasion using the IRT framework to explicitly model the audience and the speaker. Our approach lets us analyze how different audience members respond to the same argument, and we show that our representation implicitly learns latent audience features modeled explicitly by other models.

We empirically showed several additional insights about persuasion. In the Debates and Advocacy tasks, the Popularity parameter improved performance, showing that certain stylistic elements are universally appealing. In the Debates task, the audiences' embeddings aligned with their political affiliation, showing that prior beliefs play a strong role in their argument perception. While the background information about the audiences was available for these tasks, we did not need to model it explicitly; as a result, this setup allows us to make predictions for audiences who do not report their affiliation.

A potential negative side of the models is that they may learn latent characteristics of the speaker or
audience that they may not be aware of or consider private. However, all datasets studied in this paper were either public and anonymous, or private with audiences who consented to analysis.

This study focused on simple representations to show the viability of our method and to provide for explainability. To build on this foundation in future work, we will: expand argument text representations with contextual word embeddings and stance detection models; and include higher-dimensional embeddings for audience and item parameters (the IRT models easily generalize to this set-up). These improvements will allow us to better capture the elements of persuasion, especially in a complex case like Advocacy.

References

AERA, APA, and NCME. 2014. Standards for Educational and Psychological Testing.

A. L. Birnbaum. 1968. Some latent trait models and their use in inferring an examinee's ability. Statistical theories of mental test scores.

R. Darrell Bock and Murray Aitkin. 1981. Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4):443–459.

Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. 2014. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3):904–911.

Amparo Elizabeth Cano-Basave and Yulan He. 2016. A study of the impact of persuasive argumentation in political debates. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1405–1413, San Diego, California. Association for Computational Linguistics.

Dennis Chong and James N. Druckman. 2007. Framing theory. Annual Review of Political Science, 10(1):103–126.

R. B. Cialdini. 2009. Influence: The Psychology of Persuasion. Collins Business Essentials. HarperCollins e-books.

Joshua Clinton, Simon Jackman, and Douglas Rivers. 2004. The statistical analysis of roll call data. American Political Science Review, 98(2):355–370.

Cristian Danescu-Niculescu-Mizil, Lillian Lee, Bo Pang, and Jon Kleinberg. 2012. Echoes of power: Language effects and power differences in social interaction. In Proceedings of WWW, pages 699–708.

Esin Durmus and Claire Cardie. 2018. Exploring the role of prior beliefs for argument persuasion. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).

Esin Durmus and Claire Cardie. 2019. A corpus for modeling user and language effects in argumentation on online debating. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 602–607, Florence, Italy. Association for Computational Linguistics.

Roxanne El Baff, Henning Wachsmuth, Khalid Al-Khatib, and Benno Stein. 2018. Challenge or empower: Revisiting argumentation quality in a news editorial corpus. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 454–464, Brussels, Belgium. Association for Computational Linguistics.

Roxanne El Baff, Henning Wachsmuth, Khalid Al Khatib, and Benno Stein. 2020. Analyzing the persuasive effect of style in news editorial argumentation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3154–3160, Online. Association for Computational Linguistics.

Gerhard H. Fischer. 1973. The linear logistic test model as an instrument in educational research. Acta Psychologica, 37(6):359–374.

Sean M. Gerrish and David M. Blei. 2012. The issue-adjusted ideal point model.

Martin Gleize, Eyal Shnarch, Leshem Choshen, Lena Dankin, Guy Moshkowich, Ranit Aharonov, and Noam Slonim. 2019. Are you convinced? Choosing the more convincing evidence with a Siamese network. CoRR, abs/1907.08971.

Shai Gretz, Roni Friedman, Edo Cohen-Karlik, Assaf Toledo, Dan Lahav, Ranit Aharonov, and Noam Slonim. 2019. A large-scale dataset for argument quality ranking: Construction and analysis. CoRR, abs/1911.11408.

Ivan Habernal and Iryna Gurevych. 2016. Which argument is more convincing? Analyzing and predicting convincingness of web arguments using bidirectional LSTM. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1589–1599, Berlin, Germany. Association for Computational Linguistics.

Benjamin E. Lauderdale and Tom S. Clark. 2014. Scaling politically meaningful dimensions using texts and votes. American Journal of Political Science, 58(3):754–771.
Liane Longpre, Esin Durmus, and Claire Cardie. 2019. Persuasion of the undecided: Language vs. the listener. In Proceedings of the 6th Workshop on Argument Mining, pages 167–176, Florence, Italy. Association for Computational Linguistics.

Frederic M. Lord. 1980. Applications of Item Response Theory to Practical Testing Problems. Routledge.

Stephanie M. Lukin, Pranav Anand, Marilyn Walker, and Steve Whittaker. 2017. Argument strength is in the eye of the beholder: Audience effects in persuasion.

Arya D. McCarthy, Kevin P. Yancey, Geoff T. LaFlair, Jesse Egbert, Manqian Liao, and Burr Settles. 2021. Jump-starting item parameters for adaptive language tests. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 883–899, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Saif Mohammad and Peter D. Turney. 2013. Crowdsourcing a word-emotion association lexicon. CoRR, abs/1308.6297.

Prathiba Natesan, Ratna Nandakumar, Tom Minka, and Jonathan D. Rubright. 2016. Bayesian prior choice in IRT estimation using MCMC and variational Bayes. Frontiers in Psychology, 7:1422.

Raquel Mochales Palau and Marie-Francine Moens. 2009. Argumentation mining: The detection, classification and structure of arguments in text. In Proceedings of the 12th International Conference on Artificial Intelligence and Law, pages 98–107.

James W. Pennebaker, Ryan L. Boyd, Kayla Jordan, and Kate Blackburn. 2015. The development and psychometric properties of LIWC2015. Technical report.

Richard E. Petty and John T. Cacioppo. 1986. The elaboration likelihood model of persuasion. In Communication and Persuasion, pages 1–24. Springer.

Keith T. Poole and Howard Rosenthal. 1985. A spatial model for legislative roll call analysis. American Journal of Political Science, pages 357–384.

G. Rasch. 1960. Probabilistic Models for Some Intelligence and Attainment Tests. Studies in Mathematical Psychology. Danmarks Paedagogiske Institut.

Leslie Rutkowski, Matthias von Davier, and David Rutkowski. 2014. Handbook of International Large-Scale Assessment: Background, Technical Issues, and Methods of Data Analysis.

Burr Settles, Geoffrey T. LaFlair, and Masato Hagiwara. 2020. Machine learning–driven language assessment. Transactions of the Association for Computational Linguistics, 8:247–263.

Swapna Somasundaran, Josef Ruppenhofer, and Janyce Wiebe. 2007. Detecting arguing and sentiment in meetings. In Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue, pages 26–34.

Chenhao Tan, Vlad Niculae, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. 2016. Winning arguments: Interaction dynamics and persuasion strategies in good-faith online discussions. In Proceedings of WWW.

Assaf Toledo, Shai Gretz, Edo Cohen-Karlik, Roni Friedman, Elad Venezian, Dan Lahav, Michal Jacovi, Ranit Aharonov, and Noam Slonim. 2019. Automatic argument quality assessment - new datasets and methods. CoRR, abs/1909.01007.

Keyon Vafa, Suresh Naidu, and David M. Blei. 2020. Text-based ideal points. arXiv preprint arXiv:2005.04232.

Xuewei Wang, Weiyan Shi, Richard Kim, Yoojung Oh, Sijia Yang, Jingwen Zhang, and Zhou Yu. 2019. Persuasion for good: Towards a personalized persuasive dialogue system for social good. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5635–5649, Florence, Italy. Association for Computational Linguistics.

Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert. 2013. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4):1191–1207.

Diyi Yang, Jiaao Chen, Zichao Yang, Dan Jurafsky, and Eduard Hovy. 2019. Let's make your request more persuasive: Modeling persuasive strategies via semi-supervised neural nets on crowdfunding platforms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3620–3630, Minneapolis, Minnesota. Association for Computational Linguistics.

A   Model Training Details

The models described in section 6 were trained as follows. In equation (6), c is set to 1e−4 for all experiments. An L2 penalty is used for the Editorials and Advocacy corpora and for the text model in Debates; L1 is used for the remaining Debates models. Editorial models are trained for 200 epochs; Debates for 25; Advocacy for 5. A learning rate of 0.01 is used for Editorials and Debates; 0.005 is used for Advocacy.

All results are reported over 5-fold cross-validation, with the splits performed at the argument level. All models are fit using the AdamW optimizer. The α embedding initializations are drawn from a uniform distribution over −0.5 to 0.5.
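The argument-level splitting and uniform initialization described in this appendix can be sketched as follows. Only the argument-level splits and the uniform [−0.5, 0.5] draw come from the paper; the seeding and fold-assignment details are illustrative assumptions.

```python
import random

def init_alpha(n_audience, dim=1, seed=0):
    """Draw audience embeddings uniformly from [-0.5, 0.5], as in Appendix A."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n_audience)]

def argument_level_folds(argument_ids, k=5, seed=0):
    """Assign unique argument ids to k cross-validation folds.

    Splitting by argument (rather than by observation) keeps all interactions
    involving the same argument in the same fold.
    """
    ids = sorted(set(argument_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    return [ids[i::k] for i in range(k)]
```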
