An enhanced personality detection system through user's digital footprints


                                                                                                                                                                                    Downloaded from https://academic.oup.com/dsh/advance-article/doi/10.1093/llc/fqaa070/6085985 by guest on 28 January 2021
Mohammad Mobasher and Saeed Farzi
Department of Software Engineering, K. N. Toosi University of Technology, Tehran, Iran

Correspondence: Saeed Farzi, Department of Software Engineering, K. N. Toosi University of Technology, Tehran, Iran. E-mail: saeedfarzi@kntu.ac.ir

Abstract
One of the most important aspects of any person’s life is personality, which affects one’s speech, decisions, well-being, feelings and mental health. Personality detection is usually based on data collected through questionnaires, an approach with critical problems such as the lack of direct access to individuals and to explicit personal information. Nowadays, however, social networks are a valuable resource for such studies: the footprints that users leave on social networks provide valuable information for personality recognition. This research introduces an intelligent personality recognition system based on modeling user behavior using sophisticated features, i.e., Statistical, Emotional, and Linguistic. Furthermore, a dataset called KNTU_Personality, based on the MBTI personality model and containing profile information and tweets, has been collected. The experimental study follows two scenarios with complementary objectives. First, a sensitivity analysis is performed with respect to parameter settings, the introduced features and different learning algorithms. Next, the proposed system is compared with well-known personality detection systems. The results demonstrate the superiority of the proposed system over its counterparts in terms of F-Score, Precision, Recall and Accuracy.

1 Introduction

For a large-scale society, making policies in areas such as education, mass media, and community orientation to eliminate specific anomalies requires a proper perception of the society. This perception can be achieved by identifying people’s personalities in the society. Personality recognition is also used in a variety of other fields such as finance (Kannadhasan et al., 2016; Wang and Lu, 2018), recommendation systems (Tahmasebi and Fotouhi, 2019), mental health, personal or business relationship improvement (Orme, 2016), and determining job path (Ting and Varathan, 2018). Nowadays, applications even use the user’s personality to improve user experience (Mehta et al., 2019).

Currently, personality detection is done by responding to questionnaires prepared by sociological specialists; nonetheless, this method suffers from two critical problems. (1) Respondents often have little desire to answer lots of questions. (2) Preparing suitable implicit questions is a hard task even for sociological specialists, since the questions need to be asked implicitly in order to reveal a variety of aspects of the respondent’s personality.

Here, the main idea for overcoming these problems is tracking users’ activity and following users’ footprints on social networks instead of using long and hard questionnaires. By analyzing this valuable information, identifying the personality of a user becomes an easy, precise and automatic task.

Digital Scholarship in the Humanities © The Author(s) 2021. Published by Oxford University Press on behalf of EADH. All rights reserved. For permissions, please email: journals.permissions@oup.com
doi:10.1093/llc/fqaa070

Due to the widespread use of social networks in recent years (every person spends an average of more than 135 minutes a day on social media), a pristine mine of user data has been created (Kircaburun and Griffiths, 2018). This mine is full of behavioral, contextual, structural and demographic information about different kinds of people, including politicians, artists, athletes and ordinary people. Indeed, mining this data, despite its risks, is valuable and useful.

Social network users commonly share their opinions, either explicitly or implicitly and in a straightforward manner without regard to social interactions, on various issues, e.g., political, social and sports topics, or even their most private emotions and behaviors regarding music, movies, entertainment and so on. Obviously, this has produced a considerable volume of data, making social networks an ideal platform for analyzing users’ personalities (Kumar et al., 2013; Liao et al., n.d.). In the recent decade, Twitter has become one of the most popular social networks as a microblog (Sakaki et al., 2010) for sharing users’ opinions, feelings and thoughts. Figure 1 shows the growth in the number of its users over the years 2014–2019.

Fig. 1 Distribution of Twitter usage (https://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/)

There are various theories of personality in psychology, such as BigFive^1 (McCrae and John, 1992), MBTI^2 (Boyle, 1995), DISC^3 (Renzulli, 1990) and so on. After some consideration and a literature review, the MBTI theory, one of the most common among people concerned with understanding their own personality or society, was chosen for this study. The theory relies on four distinct dipoles of personality (i.e., Introvert-Extrovert, Intuitive-Sensing, Thinking-Feeling, and Judging-Perceiving).

One of the most important challenges for traditional machine learning algorithms is the lack of labeled data. As the MBTI model uses sixteen personality types, each type needs sufficient data from users who have that personality type. The absence of labeled data and the imbalanced distribution of data across personality types are two critical problems that personality recognition systems face. To address these problems, a dataset called KNTU_Personality^4 was collected with 1,357 users. This dataset includes users’ profile information and their tweets, and will be provided free of charge to researchers.

Unlike the content of other social networks, which is typically photos or videos, Twitter content usually consists of short texts with a 280-character length limit.

In this study, we introduce a personality recognition system that uses users’ footprints on the Twitter social network and is based on the MBTI model. Users’ footprints on the social network are modeled through three types of features: linguistic (linguistic modeling of users’ tweets), emotional, and statistical (statistical and descriptive aspects of users’ activities).

All the experiments performed in this study can be described in three general groups: (1) several boosting algorithms, (2) several feature sets, and (3) combinations of feature sets.

The purpose of all these experiments is to develop a smart personality recognition system using the best algorithms and the best feature set, and to compare the results. One of the best algorithms used in this study is CatBoost, which yields an acceptable F-score of 82%.

2 Background and Related Work

Many scholars have focused on identifying users’ personalities in social networks, especially Twitter, because of the importance and increasing usage of these networks and the possibility of effectively identifying one’s personality from user-generated content (Alsadhan and Skillicorn, 2017). This section first introduces personality models and then reviews the related work.

2.1 Brief Theory of Personality modeling
The MBTI theory was developed by Myers and McCaulley based on Carl Jung’s book.^5 In this theory, human behavior is based on four essential personality attributes: Mind (Extroverted (E) or Introverted (I)), Energy (Observant (S) or Intuitive (N)), Nature (Thinking (T) or Feeling (F)), and Tactic (Judging (J) or Perceiving (P)). Each individual belongs to one of the two poles of each of the four attributes. The four attributes are described as follows.

• Mind refers to how one interacts with the world around them. In this dimension, people are divided into Introverted (I) and Extroverted (E). Extroverts usually have more relationships and friendships than introverts, whereas introverts spend most of their time immersed in their thoughts and prefer being alone rather than in groups.
• Energy focuses on how one perceives the world around them. In this dimension, people are divided into Observant (S) and Intuitive (N). For Observant individuals, genuine and documented information takes precedence over intuition and inspiration, so they are cautious about the details of the steps involved. Intuitive individuals, in turn, pay more attention to probabilities than to facts.
• Nature defines the individual’s approach to decision-making when facing difficulties. People are divided into Thinking (T) and Feeling (F). Thinkers have an analytical spirit and usually tell the truth, whereas Feelers pay attention to the outcome and the impact that their decisions may have on others.
• Tactic is based on how people are oriented in life. People are divided into Judging (J) and Perceiving (P). Judging people plan all their work to avoid unforeseen factors and complete all tasks. In contrast, Perceiving people are flexible; they decide too early to do their job and do not plan much ahead.

In the MBTI model, each of these poles is denoted by a single letter, and combining them gives sixteen personality types, i.e., INTJ, INTP, ENTJ, ENTP, INFJ, INFP, ENFJ, ENFP, ISTJ, ISFJ, ESTJ, ESFJ, ISTP, ISFP, ESTP, ESFP.

2.2 Related Work
This section categorizes previous works, describes them in more detail and finally summarizes them in Table 1 in terms of data, learning algorithm and features.

Table 1 Summary of the related works
Research work | Year | Data | Personality model | Algorithm | Features and tools
(Quercia et al., 2012) | 2012 | myPersonality | BigFive | Relation | -
(Tandera et al., 2017) | 2017 | Facebook (250 Facebook users with 10.000 statuses) (150 Facebook users) | BigFive | Deep learning and traditional learning algorithm | LIWC, SPLICE
(Nave et al., 2018) | 2018 | myPersonality (22,252 myPersonality users) | BigFive | Linear regression | Demographic
(Sewwandi et al., 2017) | 2017 | Facebook | BigFive | Naïve Bayesian | LIWC
(Preoţiuc-Pietro et al., 2015) | 2015 | Twitter (1957 Twitter users with average 3400 messages) | BigFive | - | LIWC, Age-Gender
(Bharadwaj et al., 2018) | 2018 | Twitter (8600 records of each class in MBTI) | MBTI | SVM | LIWC, EmoSenticNet, ConceptNet
(Chhabra et al., 2019) | 2019 | Twitter | BigFive | LSTM | Bag of words with unigram, bigram and trigram
(Golbeck et al., 2011) | 2011 | Twitter and Facebook (50 users) | BigFive | Regression in Weka | LIWC, Statistical
(Stankevich et al., 2018) | 2018 | Vkontakte (165 profiles) | BigFive | SVM | Statistical, Age, Gender
(Gatica-Perez et al., 2018) | 2018 | YouTube vlog (99 users) | BigFive | Correlation | Image and Audio
(Zou and Wu, 2019) | 2018 | Sina Weibo | BigFive | Correlation | -
(Barry et al., 2019) | 2019 | Instagram (149 undergraduates from university) | BigFive | Correlation, like PNI^a, NPI^b, FoMOS^c | -
(Yılmaz et al., 2020) | 2017 | James Pennebaker and Laura King’s stream-of-consciousness essay dataset^d | BigFive | Deep learning, CNN | Google’s pretrained word2vec embeddings^e
(Sarwani et al., 2019) | 2019 | Twitter (25 users) | MBTI | Neural network | Bag of words
a Pathological Narcissism Inventory
b Narcissistic Personality Inventory
c Fear of missing out survey
d http://web.archive.org/web/20160519045708/http://mypersonality.org/wiki/doku.php?id=wcpr13
e https://code.google.com/archive/p/word2vec/

2.2.1 Facebook dataset
Quercia et al. (2012) developed a smart system using personality information published by users in the myPersonality web application to measure the relationship between users’ popularity on Facebook and their personality types. The personality types of people in this study were based on the BigFive theory, and the measure of a user’s popularity was the number of friends. Tandera et al. (2017) predicted users’ personality based on the BigFive theory. They used two datasets from Facebook: one consisting of 250 users’ data and their last 1,000 statuses, compiled using the myPersonality web application, and the other with 150 users and their statuses. This research used linguistic features such as LIWC^6 and SPLICE to build traditional classification models as well as deep learning models. Nave et al. (2018) investigated the relationship between personality
types and interests in music. In this study, two datasets consisting of 21,929 myPersonality users, based on the BigFive theory, were used to obtain personality types. Demographic features such as age, gender and number of likes were used to build the system. Sewwandi et al. (2017) designed and implemented an intelligent system to identify individuals’ personalities based on the BigFive theory using linguistic features of user-generated content on the Facebook social network. In this study, the features obtained from the LIWC tool had a good effect on the construction of the system. Sarwani et al. (2019) solved a personality classification problem using a neural network algorithm. They used a set of 25 Facebook users under the BigFive theory. In this research, they first obtained the documents of users on the Facebook social network and then, using TF/IDF methods, built a neural network with a back-propagation approach. The accuracy is 66%. Yılmaz et al. (2020) used the BigFive theory to design and train a model for each personality type in this theory using user-generated sentences. According to the research report, the sentences in this dataset were first converted to Word2Vec vectors and then given as input to the network.

2.2.2 Twitter dataset
Preoţiuc-Pietro et al. (2015) researched the relationship between content generated on the social network Twitter by users with conditions such as depression, anxiety, and PTSD^7, based on the BigFive theory. They collected data from 1957 users through self-reporting, in which users reported suffering from a condition, especially depression. They used logistic regression as a classifier, with features such as age, gender, and LIWC, to construct their system. Bharadwaj et al. (2018) developed a smart personality recognition system using the LIWC and EmoSenticNet tools on texts produced by users on Twitter. In this study, the MBTI theory and the SVM algorithm were used to categorize the users’ personality. Chhabra et al. (2019) designed and implemented a personality recognition system based on the BigFive theory using a dataset collected from Twitter. Their proposed system uses LSTM for the classification. Golbeck et al. (2011) used demographic features related to individual activities on Twitter to discover the relationship between an individual’s lifestyle and their choices. The number of hashtags and the number of followers are two of the features. How an individual communicates with others or how he or she chooses friends was considered an indication of choice.

2.2.3 Vkontakte dataset
Vkontakte is a social network in Russia that is available in several languages. According to statistics from 2018, its number of users has reached 500 million.
Stankevich et al. (2018) used the BigFive theory to develop a system to identify users’ personality in the Vkontakte social network. One of the problems in this study was the lack of labeled data. To solve this problem, the researchers first asked volunteers to fill in a questionnaire and then give their username in the social network. Their final dataset contains 165 users, along with their profile information.

2.2.4 YouTube dataset
Video blogs (vlogs) are videos in which vloggers sit in front of the camera and talk about a variety of topics such as politics, books, movies, or personal matters.
Gatica-Perez et al. (2018) studied the behavioral data of users on YouTube using vlogs that users had posted for at least three years, and were able to correlate the type of videos with the user’s personality using the BigFive theory. In this study, a set of twenty-one variables was obtained for each video using online tools. According to the research, there is a connection between Extraversion and funny videos.

2.2.5 Sina Weibo dataset
Sina Weibo is a social network (microblog) in China that was launched in 2009. Zou and Wu (2019) investigated the role of user loyalty in improving the growth trend of a social network. In this study, they first explored the relationship between users’ personality characteristics, based on the BigFive theory, and their loyalty. According to the results of this study, there is a strong relationship between openness and loyalty.

2.2.6 Instagram dataset
Barry et al. (2019) studied the relationship between users’ selfies and their personality based on the BigFive
theory. The study found that narcissism was generally unrelated to selfie sharing.
Kircaburun and Griffiths (2018) investigated Instagram addiction and its association with the personality types of the BigFive model. The study was conducted on college students, who also reported the extent of their internet usage and their degree of narcissism. According to the results of this study, Instagram addiction has a weak relationship with the agreeableness and self-liking personality traits, as well as with the conscientiousness trait.
Table 1 briefly describes the related works.

3 Problem Definition

Definition (Personality Recognition): Personality recognition is defined as assigning a personality type $P \in \{\mathrm{INTJ}, \mathrm{ESFJ}, \mathrm{ESTP}, \mathrm{ENTJ}, \ldots\}$ to a given user $U \in \mathit{users} = \{u_1, u_2, u_3, u_4, \ldots, u_n\}$ according to behavioral, emotional and sociological characteristics. Therefore, if the user $U$ is vectorized by its features as $\vec{U} = \langle f_1, f_2, f_3, \ldots, f_m \rangle$, then personality recognition is a function approximation, shown by Equation (1).

$$P = F(\vec{U}) \tag{1}$$

where $P \in \{\mathrm{INTJ}, \mathrm{ESFJ}, \mathrm{ESTP}, \mathrm{ENTJ}, \ldots\}$ is a personality type based on the MBTI model. In the machine learning literature, the estimation of the function $F$ is done by a classifier whose classes are the personality types of the MBTI model.

4 Proposed System

The following three critical questions must be answered to design a classifier. (1) How to map a user to a feature space? (2) How to collect labeled training data? (3) What sort of classifier can accurately estimate the function $F$?
To address the first question, three types of features (Linguistic, Statistical and Emotional) are introduced as representatives of the behavioral, emotional and sociological characteristics of a given user. In fact, every part of personality is illuminated through one or more features. Hence, introducing sophisticated features is an essential part of machine learning projects. The proposed features are described in detail in Section 4.2. To address the second question, a research dataset called KNTU_Personality has been collected from the Twitter microblog; it has been produced, refined, and standardized to obtain a high-quality dataset. This dataset is described in Section 4.1. Finally, to answer the third question, given the data imbalance and the properties of the problem, different classifiers are examined, which are described in detail in Section 4.3.
The overall architecture of the proposed system is shown in Fig. 2, which comprises four main parts. (1) Data gathering: gathering complete profile information of users along with their tweets. In this phase, the profile information and tweets of users whose personality types are identified based on the MBTI model are collected from Twitter accounts. (2) Feature engineering: based on studies of the MBTI theory and the nature of the problem, which applies to all avenues of life, we decided to extract three categories of features. (3) Pre-processing: one of the essential components in any problem is the application of pre-processing techniques suited to the type of data available. This operation is essentially an empirical task and should be based on the knowledge gained from the dataset. (4) Model and classification: in this study, the family of Boosting algorithms is used. Several algorithms in this family are also used for comparison and evaluation.

Fig. 2 Proposed system

4.1 Data gathering
The dataset should include the following information. (1) User information (such as the number of posts and the number of followers). (2) User-generated content (such as text and image posts). (3) User feedback on others’ content (such as Likes and Retweets). (4) User personality types.
Building such a dataset is done in three steps: identifying individuals whose personality type is known, removing users whose profile is private, and then gathering profile information and user-generated content. These steps are described as follows.
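
As a minimal sketch of the dataset requirements and the three construction steps of Section 4.1, the Python fragment below may help. It is not the authors' collection code: the field names, the `UserRecord` class, `build_dataset` and the `fetch_content` callable are all hypothetical, and the Twitter client is replaced by a stand-in function. The sixteen MBTI labels are generated by combining one letter from each of the four dipoles, as described in Section 2.1.

```python
from dataclasses import dataclass, field
from itertools import product

# The sixteen MBTI labels: one letter from each of the four dipoles.
MBTI_TYPES = {"".join(letters) for letters in product("IE", "NS", "TF", "JP")}


@dataclass
class UserRecord:
    """One dataset row; every field name here is a hypothetical choice."""
    username: str
    mbti_type: str                               # (4) user personality type
    is_private: bool = False
    num_posts: int = 0                           # (1) user information
    num_followers: int = 0
    tweets: list = field(default_factory=list)   # (2) user-generated content
    num_likes: int = 0                           # (3) feedback on others' content
    num_retweets: int = 0


def build_dataset(candidates, fetch_content):
    """Keep labeled users, drop private profiles, then gather their content."""
    dataset = []
    for user in candidates:
        if user.mbti_type not in MBTI_TYPES:     # step 1: type must be known
            continue
        if user.is_private:                      # step 2: remove private accounts
            continue
        user.tweets = list(fetch_content(user.username))   # step 3: collect data
        dataset.append(user)
    return dataset


if __name__ == "__main__":
    candidates = [
        UserRecord("alice", "INTJ"),
        UserRecord("bob", "ESFP", is_private=True),
        UserRecord("carol", "XXXX"),
    ]
    # Stand-in for a real Twitter client call (assumption, not a real API).
    kept = build_dataset(candidates, lambda name: [f"hello from {name}"])
    print([u.username for u in kept])
```

Passing the content fetcher as a parameter keeps the filtering logic independent of any particular Twitter client, so the same function could be exercised offline with canned data.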

Step 1: Identify and retrieve user profile information
The user information used in this study is part of the Twisty (Verhoeven et al., 2012) data set, and its users are English speaking. This dataset includes usernames, along with the personality types, of 1,500 users on the Twitter social network. The Twython8 library is used to communicate with the Twitter servers and collect the data needed in this study; the profile retrieval function is briefly described in Fig. 3.
   This function includes two steps. First, it requests a connection to the Twitter server, which requires five parameters for which each developer must enter a unique value (lines 2 to 5). Then, it fetches the profile information of anyone in the body (lines 6 to 8).

Step 2: Remove private accounts
At this point, users whose profiles were private were identified and removed from the body. After this step, 1,357 users remained.

Step 3: Data collection
After identifying users, up to 3,000 recent tweets were fetched per user using the User_Tweet function, whose pseudo-code is given in Fig. 4.
   After these steps, slightly more than 3.3 million tweets were collected. Figure 5 shows the distribution of users. In all, 63% of this dataset is for women and 37% for men.

4.2 Feature extraction
In a classification problem, generating and selecting the right features is one of the most critical steps in designing a learning system. This is also addressed in this research. Three kinds of features are extracted, Linguistic, Statistical and Emotional, which are briefly described in the following.

4.2.1 Linguistic
Generally, in a statistical linguistic model, we seek to find the probability function for a sequence of words. In other words, if the sentence W is made up of the words <w_1, w_2, w_3, ..., w_N>, the goal is to find the following probability.

$$P(W) = P(w_1 w_2 w_3 \ldots w_N) \qquad (1)$$

   This joint probability of words is computed using the chain rule as:

$$P(w_1 w_2 w_3 \ldots w_N) = \prod_{i=1}^{N} P(w_i \mid w_1 w_2 \ldots w_{i-1}) \qquad (2)$$

   It is assumed that the writing style and lexicon set that users use to write tweets contain information

Fig. 3 Function user’s profile data
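Steps 1 and 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' function from Fig. 3: `fetch_profiles` and `drop_private` are hypothetical names, the client is assumed to be an already authenticated Twython-style object exposing `show_user`, and the `protected` field is assumed to mark private accounts as in Twitter's v1.1 user objects.

```python
from typing import Any, Dict, Iterable, List


def fetch_profiles(client: Any, usernames: Iterable[str]) -> List[Dict[str, Any]]:
    """Fetch one profile dictionary per username.

    `client` is assumed to be an authenticated Twython-style object,
    created elsewhere with the developer's own credentials.
    """
    return [client.show_user(screen_name=name) for name in usernames]


def drop_private(profiles: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only public accounts (assumes a boolean 'protected' field)."""
    return [p for p in profiles if not p.get("protected", False)]
```

Passing the client in as an argument keeps the credentials out of the function and lets the filtering logic be checked with a stub object before any request is sent.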

Fig. 4 Function user’s Tweet data
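The tweet collection of Step 3 can be sketched as a paginated loop. This is a hedged illustration rather than the User_Tweet pseudo-code itself: `fetch_recent_tweets` is a hypothetical name, the client is assumed to expose a Twython-style `get_user_timeline` call that returns tweets from newest to oldest, and the page size of 200 is an assumption about the per-request limit.

```python
from typing import Any, Dict, List, Optional


def fetch_recent_tweets(client: Any, screen_name: str,
                        limit: int = 3000, page_size: int = 200) -> List[Dict[str, Any]]:
    """Collect up to `limit` of a user's most recent tweets, page by page."""
    tweets: List[Dict[str, Any]] = []
    max_id: Optional[int] = None
    while len(tweets) < limit:
        params = {"screen_name": screen_name, "count": page_size}
        if max_id is not None:
            params["max_id"] = max_id
        page = client.get_user_timeline(**params)
        if not page:
            break  # no older tweets are available
        tweets.extend(page)
        # Ask for tweets strictly older than the last one already seen.
        max_id = page[-1]["id"] - 1
    return tweets[:limit]
```

The `max_id` cursor avoids fetching the same tweet twice, and the final slice enforces the per-user cap of 3,000 tweets stated in the text.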

Fig. 5 Distributions of user

about their personality type. Hence, a linguistic model is trained for each personality type. The linguistic model captures users' writing style in terms of lexicon, grammar, and meaning for each personality type, as represented by Equation (3).

$$score(u, pt) = p_{lm}^{pt}(t) = \sum_{i=1}^{n} \log p_{lm}(t_i \mid t_{i-1}) \qquad (3)$$

where pt represents the personality type and \log p_{lm}(t_i \mid t_{i-1}) is the language model log-probability of term t_i given the previous term t_{i-1}. This feature is one of the features introduced by this study.

4.2.2 Statistical
These features refer to user information on the social network, such as the number of likes, number of posts, age, gender, and so on. The primary purpose of introducing these features is to understand the personality type of the user based on their statistical features on social networks. In this research, we have tried to introduce features that are most strongly related to the different MBTI personality types, which can provide relatively high quality and accuracy in building an intelligent personality recognition system. Table 1 reports the engineered statistical features used and introduced in this study.
    Also, in this feature category, there are several features related to the emoji used by users. Emoji are one of the simplest ways to express emotion and general concepts, and their use is growing rapidly on social media (Lin, 2019). To further analyze and understand each person's emoji, we use the Emoji-Emotion dataset9, in which each emoji is described as expressing a negative, neutral, or positive emotion.
    As we can see in Table 2, for each user, based on Equations (12), (13) and (14), three features, #PositiveEmoji, #NeutralEmoji and #NegativeEmoji, are generated. P_Emoji, Ne_Emoji, and N_Emoji are the functions that return the number of emoji used by a user in each category.

4.2.3 Emotional
People's feelings about a subject have a direct impact on their textual generated content, such as posts and tweets. Past studies have shown that emotional traits can be used in many tasks, such as the classification of sentences using their latent emotions (Luo, 2018). Therefore, this study places particular emphasis on the idea that one's personality can be identified using the feelings contained in his or her textual generated content. For this purpose, a set of features is provided for recognizing the emotions in the textual generated content of each user.
    To do this, several tools such as NRC,10 the ParallelDots API11 and the Sentiment140 API12 have been used in this study to describe each person in terms of emotion; they are described below.

4.2.3.1 NRC. The emotion of each tweet can be understood from the words used in the tweet. One can examine the set of tweets of a person based on the words used in those tweets and, as we know, each


Table 2 Statistical formulas
Abbreviation | Description | Formula | #
No_Hashtag (NH) | # Hashtags | $Score_{u \in users}(u) = \frac{Hashtag(u) - \min_{i \in users}(\lvert Hashtag(i) \rvert)}{\max_{i \in users}(\lvert Hashtag(i) \rvert) - \min_{i \in users}(\lvert Hashtag(i) \rvert)}$ | (5)
No_UniquHashtag | # Hashtags | $Score_{u \in users}(u) = \frac{UniquHashtag(u) - \min_{i \in users}(\lvert UniquHashtag(i) \rvert)}{\max_{i \in users}(\lvert UniquHashtag(i) \rvert) - \min_{i \in users}(\lvert UniquHashtag(i) \rvert)}$ | (6)
No_Mention | # Mentions | $Score_{u \in users}(u) = \frac{Mention(u) - \min_{i \in users}(\lvert Mention(i) \rvert)}{\max_{i \in users}(\lvert Mention(i) \rvert) - \min_{i \in users}(\lvert Mention(i) \rvert)}$ | (7)
No_Image | # Images | $Score_{u \in users}(u) = \frac{Image(u) - \min_{i \in users}(\lvert Image(i) \rvert)}{\max_{i \in users}(\lvert Image(i) \rvert) - \min_{i \in users}(\lvert Image(i) \rvert)}$ | (8)
No_Like | # Like | $Score_{u \in users}(u) = \frac{Like(u) - \min_{i \in users}(\lvert Like(i) \rvert)}{\max_{i \in users}(\lvert Like(i) \rvert) - \min_{i \in users}(\lvert Like(i) \rvert)}$ | (9)
No_Geo | # Geo | $Score_{u \in users}(u) = \frac{Geo(u) - \min_{i \in users}(\lvert Geo(i) \rvert)}{\max_{i \in users}(\lvert Geo(i) \rvert) - \min_{i \in users}(\lvert Geo(i) \rvert)}$ | (10)
No_Whole_Emoji | # Emoji | $Score_{u \in users}(u) = \frac{W_{Emoji}(u) - \min_{i \in users}(\lvert W_{Emoji}(i) \rvert)}{\max_{i \in users}(\lvert W_{Emoji}(i) \rvert) - \min_{i \in users}(\lvert W_{Emoji}(i) \rvert)}$ | (11)
No_Pos_Emoji | # Positive emoji | $Score_{u \in users}(u) = \frac{P_{Emoji}(u) - \min_{i \in users}(\lvert P_{Emoji}(i) \rvert)}{\max_{i \in users}(\lvert P_{Emoji}(i) \rvert) - \min_{i \in users}(\lvert P_{Emoji}(i) \rvert)}$ | (12)
No_Neu_Emoji | # Neutral Emoji | $Score_{u \in users}(u) = \frac{Ne_{Emoji}(u) - \min_{i \in users}(\lvert Ne_{Emoji}(i) \rvert)}{\max_{i \in users}(\lvert Ne_{Emoji}(i) \rvert) - \min_{i \in users}(\lvert Ne_{Emoji}(i) \rvert)}$ | (13)
No_Neg_Emoji | # Negative Emoji | $Score_{u \in users}(u) = \frac{N_{Emoji}(u) - \min_{i \in users}(\lvert N_{Emoji}(i) \rvert)}{\max_{i \in users}(\lvert N_{Emoji}(i) \rvert) - \min_{i \in users}(\lvert N_{Emoji}(i) \rvert)}$ | (14)
No_Question | # Question sentence | $Score_{u \in users}(u) = \frac{Q_{Sen}(u) - \min_{i \in users}(\lvert Q_{Sen}(i) \rvert)}{\max_{i \in users}(\lvert Q_{Sen}(i) \rvert) - \min_{i \in users}(\lvert Q_{Sen}(i) \rvert)}$ | (15)
No_Sentence | # Sentence | $Score_{u \in users}(u) = \frac{Sentence(u) - \min_{i \in users}(\lvert Sentence(i) \rvert)}{\max_{i \in users}(\lvert Sentence(i) \rvert) - \min_{i \in users}(\lvert Sentence(i) \rvert)}$ | (16)
No_Exclamation | # Exclamation | $Score_{u \in users}(u) = \frac{Excl_{Sen}(u) - \min_{i \in users}(\lvert Excl_{Sen}(i) \rvert)}{\max_{i \in users}(\lvert Excl_{Sen}(i) \rvert) - \min_{i \in users}(\lvert Excl_{Sen}(i) \rvert)}$ | (17)
No_Follower | # Follower | $Score_{u \in users}(u) = \frac{Follower(u) - \min_{i \in users}(\lvert Follower(i) \rvert)}{\max_{i \in users}(\lvert Follower(i) \rvert) - \min_{i \in users}(\lvert Follower(i) \rvert)}$ | (18)
No_Following | # Following | $Score_{u \in users}(u) = \frac{Following(u) - \min_{i \in users}(\lvert Following(i) \rvert)}{\max_{i \in users}(\lvert Following(i) \rvert) - \min_{i \in users}(\lvert Following(i) \rvert)}$ | (19)
No_List | # Group | $Score_{u \in users}(u) = \frac{GrMem(u) - \min_{i \in users}(\lvert GrMem(i) \rvert)}{\max_{i \in users}(\lvert GrMem(i) \rvert) - \min_{i \in users}(\lvert GrMem(i) \rvert)}$ | (20)
No_Tweet | # Reply | $Score_{u \in users}(u) = \frac{Tweet(u) - \min_{i \in users}(\lvert Tweet(i) \rvert)}{\max_{i \in users}(\lvert Tweet(i) \rvert) - \min_{i \in users}(\lvert Tweet(i) \rvert)}$ | (21)
No_ReTweet | # Retweet | $Score_{u \in users}(u) = \frac{ReTweet(u) - \min_{i \in users}(\lvert ReTweet(i) \rvert)}{\max_{i \in users}(\lvert ReTweet(i) \rvert) - \min_{i \in users}(\lvert ReTweet(i) \rvert)}$ | (22)
No_Truncated | # Truncated Tweet | $Score_{u \in users}(u) = \frac{Truncated(u) - \min_{i \in users}(\lvert Truncated(i) \rvert)}{\max_{i \in users}(\lvert Truncated(i) \rvert) - \min_{i \in users}(\lvert Truncated(i) \rvert)}$ | (23)
Average_Time interval | Average time for every tweet | $Score_{u \in users}(u) = \frac{Tweet(u) - \min_{i \in users}(\lvert Tweet(i) \rvert)}{\max_{i \in users}(\lvert Tweet(i) \rvert) - \min_{i \in users}(\lvert Tweet(i) \rvert)}$ | (24)
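Every formula in Table 2 has the same min-max form: a user's raw count is shifted by the minimum over all users and divided by the range. The following Python sketch illustrates that computation; the function name `min_max_scores`, the sample counts, and the choice to return 0.0 when all users share the same count (a case the table does not cover) are assumptions made for illustration.

```python
from typing import Dict


def min_max_scores(counts: Dict[str, float]) -> Dict[str, float]:
    """Scale raw per-user counts into [0, 1] using min-max normalization."""
    lowest = min(counts.values())
    highest = max(counts.values())
    span = highest - lowest
    if span == 0:
        # Assumption: with identical counts the formula is undefined,
        # so every user is given a score of 0.0.
        return {user: 0.0 for user in counts}
    return {user: (value - lowest) / span for user, value in counts.items()}


# Hypothetical hashtag counts for three users.
hashtag_counts = {"alice": 12, "bob": 40, "carol": 5}
no_hashtag = min_max_scores(hashtag_counts)
```

The same helper could be applied to each count in the table (mentions, images, likes, and so on), which puts all statistical features on a common scale before they are passed to a classifier.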

word may have multiple different senses. NRC allows us to look up those senses for more than 14,000 words (Mohammad and Turney, 2013). This dataset contains eight emotional features (anger, fear, anticipation, trust, surprise, sadness, joy, disgust) and two psychological features (negative, positive) for each word.

4.2.3.2 ParallelDots. This API13 can be used in four different languages, and it uses a variety of datasets to provide different emotional properties (happy, angry, excited, sarcasm, sad, fear, bored) for each text.
   In order to use this API, all the tweets were collected for each user and, as a result, the following six attributes were obtained for each user.

4.2.3.3 Sentiment140. The basis of this analysis system is a sentiment classification approach designed by Go et al. (2009). The data set used in this system is from the Twitter


social network, and this API14 has been used in many works, such as Heredia et al. (2016).

4.3 Preprocessing
Data preprocessing is a significant step, and choosing the right technique can improve the results even further. All trials go through the preprocessing stage before the modeling phase. This step involves removing URLs, names, hashtags, and extra spaces, converting letters to lowercase, and deleting users whose accounts are private. Python's regex functions are used to remove some entities from the text. To achieve the best results, we used the average value of a feature class for a person in a personality type.

4.4 Prediction model
Machine learning techniques play an essential role in solving many problems in today's world, for example, smart classification of spam emails, intelligent advertising systems, and malware detection systems. There are two crucial factors in achieving good performance when solving each of these problems with machine learning techniques. One is the use of practical models that can identify the complex relationships in the data. The other is that an enormous amount of data is needed to train these models well.
Among all the machine learning algorithms that are used nowadays, ensemble approaches are especially interesting. In this class of methods, a robust classifier is built by taking advantage of multiple weak classifiers, which is why these techniques are popular and effective (Friedman, 2001; Ke et al., 2017).
    In summary, this technique operates by repeatedly retraining a classifier on a dataset selected based on the precision obtained in the previous step. Each of these classifiers is also assigned a weight based on the accuracy obtained in that iteration.

$$H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right) \qquad (25)$$

   where h_t(x) is the output of weak classifier t on the input x, and \alpha_t is the weight of classifier t.
   In this research, we use the AdaBoost, CatBoost, GradientBoost, XGBoost and LightGBM algorithms in order to obtain the best results as well as to compare the results of three different gradient boosting tree algorithms.

5 Experimental Study
As mentioned before, by analyzing the footprint of Twitter users, their personality can be precisely predicted. To this end, the relationships among different personalities and features are investigated. Table 3 reports the results with different features and learning algorithms over the KNTU_Personality dataset.
   The evaluation of the proposed system has been performed in different ways. First, the impact of each feature on the output quality is calculated, and next, the proposed system is compared with other well-known algorithms.
   The proposed system is programmed in Python on a Core i5 CPU (i5-7200U) with 8 GB of memory.

5.1 Evaluation metrics
Evaluating a machine learning algorithm is an essential step. Most of the time, the accuracy metric is used to evaluate the classification model; however, it is not enough to truly judge the model when the data are imbalanced. In this research, we evaluate results based on Accuracy, Precision, Recall, F1-Score and the ROC curve. There are four important terms used in these metrics:

† True Positives (TP): The cases in which we predicted YES and the actual output was also YES.
† True Negatives (TN): The cases in which we predicted NO and the actual output was NO.
† False Positives (FP): The cases in which we predicted YES and the actual output was NO.
† False Negatives (FN): The cases in which we predicted NO and the actual output was YES.

† Accuracy
   Informally, accuracy is the fraction of predictions our model got right. It is the ratio of correct predictions to the total number of input samples (Yin et al., 2019). It can be calculated as in Equation (26):

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (26)$$

† Precision
Precision is a good measure to use when the cost of a False Positive is high. It is also called the Positive Predictive Value (PPV). It can be calculated


Table 3 Comparison of five algorithms on KNTU_Personality (F = F1-Score, R = Recall, P = Precision, A = Accuracy)

Features   Metric   AdaBoost   CatBoost   GradientBoosting   LightGBM   XGBoost
S+E+L      F        0.377      0.822      0.771              0.793      0.782
           R        0.573      0.831      0.786              0.811      0.797
           P        0.318      0.830      0.780              0.794      0.786
           A        0.350      0.822      0.767              0.785      0.774
E+L        F        0.126      0.515      0.443              0.484      0.458
           R        0.324      0.548      0.471              0.513      0.488
           P        0.121      0.556      0.474              0.509      0.479
           A        0.139      0.513      0.458              0.488      0.464
S+L        F        0.416      0.803      0.782              0.782      0.784
           R        0.618      0.811      0.802              0.797      0.797
           P        0.361      0.818      0.782              0.794      0.794
           A        0.358      0.803      0.777              0.781      0.779
S+E        F        0.373      0.746      0.734              0.752      0.738
           R        0.577      0.754      0.748              0.768      0.752
           P        0.317      0.753      0.741              0.758      0.742
           A        0.353      0.743      0.730              0.745      0.736
L          F        0.116      0.485      0.481              0.508      0.474
           R        0.152      0.496      0.494              0.528      0.480
           P        0.155      0.529      0.521              0.532      0.469
           A        0.122      0.487      0.483              0.508      0.469
E          F        0.187      0.286      0.262              0.265      0.282
           R        0.418      0.306      0.282              0.287      0.301
           P        0.206      0.328      0.296              0.289      0.304
           A        0.165      0.290      0.271              0.269      0.286
S          F        0.387      0.720      0.709              0.741      0.717
           R        0.589      0.725      0.723              0.753      0.727
           P        0.343      0.733      0.713              0.748      0.728
           A        0.345      0.716      0.705              0.740      0.716

like Equation (27). A low precision can also indicate a large number of False Positives (Davis and Goadrich, 2006; Yin et al., 2019).

$$\text{Precision} = \frac{TP}{TP + FP} \tag{27}$$

• Recall
Recall can be thought of as a measure of a classifier's completeness. It is also called Sensitivity or the True Positive Rate (TPR) (Davis and Goadrich, 2006).

$$\text{Recall} = \frac{TP}{TP + FN} \tag{28}$$

• F1-Score
F1-Score is the harmonic mean of precision and recall, and its range is [0, 1]. It tells you how precise your classifier is, as well as how robust it is. To calculate F1-Score we can use Equation (29).

$$F_1 = 2 \cdot \frac{1}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}} \tag{29}$$

F1-Score tries to find the balance between precision and recall (Davis and Goadrich, 2006).

• Receiver Operating Characteristic (ROC)
The use of ROC diagrams in machine learning is discussed by Prati and Flach (2005). The diagram plots the TPR (Sensitivity) against the FPR (1 - Specificity). The TPR is calculated by Equation (28) and the FPR by Equation (30).

$$\text{FPR} = \frac{FP}{FP + TN} \tag{30}$$

According to this metric, a suitable model is placed at the top left of the diagram, at the point (TP = 100%, FP = 0), and an unsuitable one is placed at the bottom right, at the point (TP = 0, FP = 100%) (Prati and Flach,


2005). An area of 1.0 represents a model that made all predictions perfectly. An area of 0.5 represents a model as good as random.

5.2 Experiments
Applying each category of features used in this study, or combining them, produces different results. Therefore, in Table 3, all combinations of the feature categories are examined.15 For better results, we also use the gender feature in all models. Each feature category is denoted by the following abbreviations: Linguistic (L), Statistical (S), and Emotional (E). All results are based on 10-fold cross-validation, where folds are randomly sampled from the data.

As we can see in Table 3, on the combined S+E+L features the CatBoost algorithm obtains the highest values of all four metrics (Accuracy, Precision, Recall, and F1-Score), for example an F1-Score of 0.822 against 0.793 for LightGBM. A drawback of these algorithms is the time needed for training. Figure 6 shows the training and testing time of each algorithm with the statistical features. As we can see, the training time of CatBoost is longer than that of the others. To resolve this problem, we should change hyperparameters such as grow_policy, depth, and so on. This can be done manually by the user or by a tuning procedure (Probst et al., 2019).

The best result among the three feature categories is obtained with the statistical features. These features are fully numerical and represent individuals by their statistical activities. This result indicates that a person's statistical activities in social networks are closely related to personality, which helps in examining a person's personality type.

After the statistical features, the next best feature category for predicting a person's personality is the linguistic features. They reach an F1-Score of about 50% with LightGBM and about 48% with CatBoost.

Finally, the emotional features have the lowest F1-Score across the algorithms compared to the other two feature categories.

The results in Fig. 7 are based on the correlation of each feature category with the MBTI classes. There are many metrics for measuring the correlation between a feature and a class label, such as mutual information, chi-square, and correlation coefficient scores such as Pearson. In this work, we use the Pearson correlation coefficient. The values in Fig. 7 are calculated as follows:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{31}$$

where $n$ is the number of samples, $x$ and $y$ are the items that we want to compare, $x_i$ and $y_i$ are the values of element $i$ in the samples, and $\bar{x}$ and $\bar{y}$ denote the mean of each item.

Fig. 6 Learning and testing time with statistical features

Fig. 7 Correlation of feature categories with MBTI classes based on Pearson

Fig. 8 CatBoost ROC curve
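
A curve such as the one in Fig. 8 can be traced by sweeping a decision threshold over the classifier scores and computing the TPR (Equation 28) and the FPR (Equation 30) at each step. The sketch below does this for a single class treated one-vs-rest and integrates the area with the trapezoidal rule. The function names, toy labels, and scores are hypothetical and are not taken from the authors' implementation.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by sweeping a threshold over scores.

    y_true holds 1 for the class of interest and 0 for every other class.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    points = [(0.0, 0.0)]
    # Visit thresholds from the highest score down; samples sharing a score
    # are counted together so that ties yield a single point.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores for six users; label 1 marks the class of interest.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
area = auc(curve)
```

An area of 1.0 corresponds to perfect separation and 0.5 to a random ranking, which matches the interpretation given in Section 5.1.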


Fig. 9 Feature importance with p threshold (p > 0.044)

The value range for this metric is between -1 and 1: -1 denotes perfect negative correlation, 1 denotes perfect positive correlation, and 0 denotes no correlation.

As we can see, the statistical features are more strongly related to the MBTI classes than the other feature categories.

To achieve better results, we use all feature categories together. Table 3 shows the results for the combined feature categories; we can see an improvement for each combination, and the best result is achieved with the combination of all three categories. To better understand these results, we show the ROC curve of the CatBoost algorithm with all feature categories.

As explained in Section 5.1, if the area under the curve is closer to 1, the model performs well for that class. For example, according to Fig. 8, INTP and ISFP have low values, and ISFP has a lower value than INTP. According to Fig. 5, the number of samples for ISFP is smaller than for INTP; the correlation of the features with the ISFP class is also lower than with INTP.

As we can see, some classes such as INTJ have good results in the ROC curve; Fig. 5 shows that this class has more users than the other classes. Looking more closely at Figs 5 and 8, the classes that have more women than men, such as ISTP and ESTP, have lower results. Thus, it can be concluded that gender is an essential feature for this research.

Figure 9 shows a selection of important features for each category with p threshold (p > 0.044). As we can see, the gender feature is important in all of them. In the statistical category, Truncated and Geo are good features. The tendency to write a lot (more than 140 characters), as well as to share location when


Table 4 Correlation of important features with MBTI classes

           Statistical                          Emotional
Class      #Truncated   #Geo     #Photo        Bored     Happy     Sarcasm
ISTP       0.045        0.232    0.382         0.0223    0.3372    0.251
ISTJ       0.099        0.007    0.226         0.4443    0.0196    0.1736
ISFP       0.073        0.391    0.571         0.1105    0.25703   0.0598
ISFJ       0.064        0.189    0.330         0.6245    0.0703    0.1527
INTP       0.090        0.162    0.162         0.0883    0.6437    0.37
INTJ       0.108        0.364    0.137         0.0753    0.1644    0.2536
INFP       0.133        0.124    0.100         0.1696    0.1301    0.0784
INFJ       0.171        0.032    0.236         0.0307    0.3824    0.1129
ESTP       0.040        0.305    0.026         0.1446    0.426     0.0576
ESTJ       0.060        0.231    0.409         0.2884    0.1066    0.3442
ESFP       0.049        0.501    0.044         0.1086    0.0733    0.4349
ESFJ       0.060        0.117    0.115         0.1043    0.1568    0.3413
ENTP       0.087        0.146    0.184         0.0299    0.0142    0.0023
ENTJ       0.948        0.024    0.167         0.0513    0.0722    0.1272
ENFP       0.056        0.367    0.374         0.5351    0.2032    0.3460
ENFJ       0.103        0.097    0.242         0.2350    0.4221    0.0746

publishing a post, seems to be very important in the process of personality recognition.

According to Table 4, every feature has a positive correlation with some MBTI classes and a negative correlation with others. The #Photo feature has a good positive correlation with Introversion and a negative correlation with Extraversion. #Geo has a positive correlation with Extraversion, which means that users who share their location tend to be of the Extraversion type rather than the Introversion type. Among the emotional features, as the sample features in Table 4 show, positive features (Happy) have a good positive correlation with the Extraversion type, and negative features such as Bored have a negative correlation with the Introversion type; therefore, it can be concluded that people of the Extraversion type share positive content.

Almost all features in Table 4 have a positive correlation with the INTJ class. As we can see in the ROC result (Fig. 8), this class has a good result.

6 Conclusion and Future Work
The first contribution of this work is the introduction of a new dataset for personality recognition studies called KNTU_Personality. It contains information from more than 1,200 Twitter accounts, comprising their profile information together with their tweets. The study of this dataset reveals that the user's gender, together with the average interval between tweets, is very useful for recognizing personality. The results of this study confirm that the two dimensions of Extraversion and Introversion can be well predicted from user information in social networks.

The second contribution of this work shows that, with user information on social networks, including linguistic, emotional, and statistical features, boosting algorithms can be developed for personality recognition with strong results (an F1-Score of 0.822 with CatBoost on the combined features). In the past, however, most of the features used were restricted to a limited number of emotional or statistical features or a combination of these two.

Given the shortage of tagged data in this area, our work can be used to fill this gap. The CatBoost algorithm can be used to tag the data, which can then be verified by experts.

One interesting direction for future work is to use models such as deep learning to improve the results of personality recognition studies. To obtain larger and better datasets in this area, the best model from this research (CatBoost) can be used to label new Twitter users, with an expert then evaluating the result.

References
Alsadhan, N., and Skillicorn, D. (2017). Estimating Personality from Social Media Posts. In IEEE International Conference on Data Mining Workshops,


ICDMW, 2017-Novem, pp. 350–6. https://doi.org/10.1109/ICDMW.2017.51
Barry, C. T., McDougall, K. H., Anderson, A. C. et al. (2019). 'Check Your Selfie before You Wreck Your Selfie': Personality ratings of Instagram users as a function of self-image posts. Journal of Research in Personality, 82: 103843. https://doi.org/10.1016/j.jrp.2019.07.001
Bharadwaj, S., Sridhar, S., Choudhary, R., and Srinath, R. (2018). Persona Traits Identification based on Myers-Briggs Type Indicator (MBTI) - A Text Classification Approach. In 2018 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2018, pp. 1076–82. https://doi.org/10.1109/ICACCI.2018.8554828
Boyle, G. J. (1995). Myers-Briggs Type Indicator (MBTI): Some psychometric limitations.
Chhabra, G. S., Sharma, A., and Murali Krishnan, N. (2019). Deep Learning Model for Personality Traits Classification from Text Emphasis on Data Slicing. In IOP Conference Series: Materials Science and Engineering, 495(1). https://doi.org/10.1088/1757-899X/495/1/012007
Davis, J., and Goadrich, M. (2006). The relationship between precision-recall and ROC curves. ACM International Conference Proceeding Series, 148: 233–40. https://doi.org/10.1145/1143844.1143874
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5): 1189–232. https://doi.org/10.2307/2699986
Gatica-Perez, D., Sanchez-Cortes, D., Tri Do, T. M., Jayagopi, D. B., and Otsuka, K. (2018). Vlogging over time: Longitudinal impressions and behavior in YouTube. In ACM International Conference Proceeding Series, pp. 37–47. https://doi.org/10.1145/3282894.3282922
Go, A., Bhayani, R., and Huang, L. (2009). Twitter Sentiment Classification using Distant Supervision. Processing, 1–6.
Golbeck, J., Robles, C., Edmondson, M., and Turner, K. (2011). Predicting personality from twitter. In Proceedings - 2011 IEEE International Conference on Privacy, Security, Risk and Trust and IEEE International Conference on Social Computing, PASSAT/SocialCom 2011, pp. 149–56. https://doi.org/10.1109/PASSAT/SocialCom.2011.33
Heredia, B., Khoshgoftaar, T. M., Prusa, J., and Crawford, M. (2016). Cross-Domain sentiment analysis: An empirical investigation. In Proceedings - 2016 IEEE 17th International Conference on Information Reuse and Integration, IRI 2016, pp. 160–65. https://doi.org/10.1109/IRI.2016.28
Kannadhasan, M., Aramvalarthan, S., Mitra, S. K., and Goyal, V. (2016). Relationship between biopsychosocial factors and financial risk tolerance: an empirical study. Vikalpa, 41(2): 117–31. https://doi.org/10.1177/0256090916642685
Ke, G., Meng, Q., Finley, T., et al. (2017). LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, 2017-Decem(Nips), pp. 3147–55.
Kircaburun, K., and Griffiths, M. D. (2018). Instagram addiction and the Big Five of personality: The mediating role of self-liking. Journal of Behavioral Addictions, 7(1): 158–70. https://doi.org/10.1556/2006.7.2018.15
Kumar, S., Morstatter, F., and Liu, H. (2013). Twitter Data Analytics. Springer, p. 89. https://doi.org/10.1007/978-1-4614-9372-3
Liao, Y., Moshtaghi, M., Han, B. et al. (n.d.). Mining Micro-Blogs: Opportunities and Challenges.
Lin, F. (2019). Positive or negative: emoji usage in online social media. 334(Hsmet), pp. 512–16. https://doi.org/10.2991/hsmet-19.2019.95
Luo, J. (2018). Emotional Analysis Oriented to Short Texts. 166(Amcce), pp. 567–70. https://doi.org/10.2991/amcce-18.2018.98
McCrae, R. R., and John, O. P. (1992). The five-factor model: issues and applications. Journal of Personality, 60(2): 175–532. http://www.ncbi.nlm.nih.gov/pubmed/1635040
Mehta, Y., Majumder, N., Gelbukh, A., and Cambria, E. (2019). Recent trends in deep learning based personality detection. Artificial Intelligence Review. https://doi.org/10.1007/s10462-019-09770-z
Mohammad, S. M., and Turney, P. D. (2013). Crowdsourcing a word-emotion association lexicon. Computational Intelligence, 29(3): 436–65. https://doi.org/10.1111/j.1467-8640.2012.00460.x
Nave, G., Minxha, J., Greenberg, D. M., Kosinski, M., Stillwell, D., and Rentfrow, J. (2018). Musical preferences predict personality: evidence from active listening and Facebook likes. Psychological Science, 29(7): 1145–58. https://doi.org/10.1177/0956797618761659
Orme, J. (2016). Re-examining the Use of Behavioral Assessment Tools for Employee Selection.
Prati, R. C., and Flach, P. A. (2005). ROCCER: An algorithm for rule learning based on ROC analysis. In IJCAI International Joint Conference on Artificial Intelligence, pp. 823–28.
Preoţiuc-Pietro, D., Eichstaedt, J., Park, G. et al. (2015). The Role of Personality, Age, and Gender in Tweeting about


  Mental Illness. pp. 21–30. https://doi.org/10.3115/v1/       Wang, S., and Lu, H. (2018). The effects of personal types and
  w15-1203                                                      decision-making modes on irrational financial behavior of
Probst, P., Boulesteix, A. L., and Bischl, B. (2019).           chinese students under. 10(july), pp. 59–68.
  Tunability: Importance of hyperparameters of machine         Yin, M., Vaughan, J. W, and Wallach, H. (2019).
  learning algorithms. Journal of Machine Learning               Understanding the effect of accuracy on trust in machine
  Research, 20: 1–32.                                            learning models. In Conference on Human Factors in

                                                                                                                                Downloaded from https://academic.oup.com/dsh/advance-article/doi/10.1093/llc/fqaa070/6085985 by guest on 28 January 2021
Quercia, D., Lambiotte, R., Stillwell, D., Kosinski, M., and     Computing Systems - Proceedings, pp. 1–12. https://doi.
 Crowcroft, J. (2012). The personality of popular face-          org/10.1145/3290605.3300509
 book users. In Proceedings of the ACM Conference on                                           _
                                                               Yılmaz, T., Ergil, A., and Ilgen,       B. (2020). Deep
 Computer Supported Cooperative Work, CSCW, pp.                  learning-based document modeling for personality de-
 955–64. https://doi.org/10.1145/2145204.2145346                 tection from Turkish Texts. Advances in Intelligent
  Systems and Computing, 1069: 729–36. https://doi.org/10.1007/978-3-030-32520-6_53
Renzulli, J. S. (1990). A practical system for identifying
  gifted and talented students. Early Child Development
  and Care, 63(1): 9–18. https://doi.org/10.1080/0300443900630103
Sakaki, T., Okazaki, M., and Matsuo, Y. (2010). Earthquake
  shakes Twitter users. 851. https://doi.org/10.1145/1772690.1772777
Sarwani, M. Z., Sani, D. A., and Fakhrini, F. C. (2019).
  Personality classification through social media using
  probabilistic neural network algorithms. International
  Journal of Artificial Intelligence and Robotics (IJAIR),
  1(1): 9. https://doi.org/10.25139/ijair.v1i1.2025
Sewwandi, D., Perera, K., Sandaruwan, S., Lakchani, O.,
  Nugaliyadde, A., and Thelijjagoda, S. (2017). Linguistic
  Features based Personality Recognition using Social Media
  Data.
Stankevich, M., Smirnov, I., Ignatiev, N., Grigoriev, O.,
  and Kiselnikova, N. (2018). Analysis of big five personality
  traits by processing of social media users activity
  features. In CEUR Workshop Proceedings, 2277, pp. 162–6.
Tandera, T., Hendro, Suhartono, D., Wongso, R., and
  Prasetio, Y. L. (2017). Personality prediction system
  from Facebook users. Procedia Computer Science, 116:
  604–11. https://doi.org/10.1016/j.procs.2017.10.016
Tahmasebi, M., Fotouhi, F., and Esmaeili, M. (2019).
  Hybrid adaptive educational hypermedia recommender
  accommodating user’s learning style and web
  page features. Journal of AI and Data Mining, 7(2):
  225–38. https://doi.org/10.22044/jadm.2018.6397.1755
Ting, T. L., and Varathan, K. D. (2018). Job recommendation
  using Facebook personality scores. Malaysian
  Journal of Computer Science, 31(4): 311–31.
  https://doi.org/10.22452/mjcs.vol31no4.5
Verhoeven, B., Daelemans, W., and Plank, B. (2012). A
  Multilingual Twitter Stylometry Corpus for Gender and
  Personality Profiling. pp. 1632–7.
Zou, S., and Wu, K. (2019). Impact of Weibo User’s
  Personality Traits on Loyalty. In Proceedings -
  International Joint Conference on Information, Media
  and Engineering, ICIME 2018, pp. 77–81.
  https://doi.org/10.1109/ICIME.2018.00025

Notes
 1   Also known as the five-factor model and the OCEAN model
 2   Myers–Briggs Type Indicator
 3   Dominance Influence Steadiness Conscientiousness
 4   https://github.com/MohammadMobasher/KNTU_Personality
 5   Psychological Type
 6   Linguistic Inquiry and Word Count
 7   Posttraumatic stress disorder
 8   https://twython.readthedocs.io/en/latest/
 9   https://github.com/words/emoji-emotion
10   The feeling of each tweet depends on the words used. In
     other words, each word itself has one or more different
     senses. Therefore, one can examine the set of tweets of
     each individual from this perspective. This is done by
     using a dataset called the NRC that contains more than
     14,000 words (Mohammad and Turney, 2013).
     Each word in this dataset is described with eight emotional
     attributes (anger, fear, anticipation, trust, surprise,
     sadness, joy, disgust) and two psychological
     attributes (negative, positive). http://www.purl.com/net/lexicons
11   This API can be used in four different languages, and it
     uses a variety of datasets to provide different emotional
     properties (happy, angry, excited, sarcasm, sad, fear,
     bored) for each text. In order to use this API, all the
     tweets were collected for each user and as a result, the
     following six attributes were obtained for each user.
     https://www.paralleldots.com/emotion-detection
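
The lexicon lookup described in Note 10 can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `emotion_profile`, the tokenization rule, and the normalization to shares are assumptions made here for illustration, and `TOY_LEXICON` is a hypothetical three-word stand-in, not actual NRC entries.

```python
import re
from collections import Counter

# The ten attributes listed in Note 10 (eight emotions, two polarities).
ATTRIBUTES = ("anger", "fear", "anticipation", "trust", "surprise",
              "sadness", "joy", "disgust", "negative", "positive")

# Hypothetical toy lexicon for illustration only; not actual NRC entries.
TOY_LEXICON = {
    "happy": {"joy", "positive", "trust"},
    "afraid": {"fear", "negative"},
    "storm": {"fear", "negative", "surprise"},
}


def emotion_profile(tweets, lexicon):
    """Return each attribute's share of all word-attribute matches.

    Normalizing to shares is an assumption; the paper does not state
    how per-word attributes are aggregated per user.
    """
    counts = Counter()
    for tweet in tweets:
        # Assumed tokenization: lowercase runs of letters and apostrophes.
        for word in re.findall(r"[a-z']+", tweet.lower()):
            counts.update(lexicon.get(word, ()))
    total = sum(counts.values())
    return {a: (counts[a] / total if total else 0.0) for a in ATTRIBUTES}


profile = emotion_profile(["Happy today!", "Afraid of the storm"], TOY_LEXICON)
```

Pooling all of a user's tweets before counting mirrors the per-user aggregation the notes describe; with the real lexicon, only the dictionary passed as `lexicon` would change.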
