
Consumer recommendation dynamics in online retail business
under logistic regression and naïve Bayes analyses
Irina GEORGESCU
Bucharest University of Economics, Bucharest, Romania
irina.georgescu@csie.ase.ro

Jani KINNUNEN
Åbo Akademi University, Turku, Finland
jani.kinnunen@abo.fi

Abstract. Competitive businesses need to study the behavior of their current and potential customer
base. Relevant behavioral data can be obtained online, where purchase decisions are
increasingly made, often on the basis of product reviews, ratings and recommendations available in social
media networks. The original data consists of 23486 customer reviews from an online clothing retailer,
with ten variables/features describing the reviewing customers, the products under review and the
feedback to their reviews; about half of the dataset is analyzed after cleaning the data. To find out which
features are the most important factors leading to a recommendation, the naïve Bayes and logistic
regression methods are applied. Earlier research has shown that the sentiment of textual reviews and
the given numerical ratings are key factors for the decision to recommend or not recommend products.
The focus of this paper is to identify and rank-order the most relevant (numerical) factors affecting the
review process leading to a recommendation. After applying the logistic regression classifier, we have
found that rating, positive feedback count and age are statistically significant factors, in that order. The
results support online retailers and manufacturers, as well, in adjusting their product portfolios and
marketing efforts optimally to obtain recommendations for their products, reach potential customers
and expose them to the given recommendations leading to positive purchase decisions. Further, the
results indicate some future research opportunities.

Keywords: consumers, social media, logistic regression, naïve Bayes, ROC curve.

Introduction
In this paper we discuss the role of social media for developing successful businesses based
on consumers’ opinions.
The paper is a continuation of previous research by Androniceanu et al. (2020a),
where we studied the same issue with sentiment analysis and lexicon-based approaches.
Here we use the same dataset, a public dataset called Women’s Clothing E-Commerce Review
collected by Nick Brooks (Brooks, 2018a, 2018b), in order to analyze customers’ reviews on
fashion items. The original dataset contained 23486 customer reviews and 10 variables, both
text and numerical. Previous research on this dataset has also been done by Agarap (2020),
who performed a sentiment analysis and implemented a bidirectional recurrent neural network for
sentiment classification. After cleaning the dataset, we worked with almost half of the data. In this
paper we apply supervised methods, namely binomial logistic regression and the naïve
Bayes classifier, in order to find the best prediction model. We consider RecommendedIND as

10.2478/icas-2021-0012, pp 120-128, ISSN 2668-6309|Proceedings of the 14th International Conference on Applied Statistics 2020|No 1, 2020


the class variable, having two values: 1 if the review is positive and 0 if the review is negative.
The predictors considered in this approach are Rating (an ordinal variable taking values
from 1 to 5), Age of consumers and PositiveFeedbackCount (a numerical variable counting
how many other customers found the review positive). For each classifier mentioned above we divide the
dataset into two subsets: the training set representing 75% of the data and the test set
representing 25% of the data. The classifier is trained on the training set and then run on
the test set to predict the class membership of the customers’ reviews (1-recommended or
0-not recommended). The real class and the predicted class are compared by means of the
confusion matrix. The ROC curve is drawn and the area under the curve (AUC) is determined in
order to assess the classifier’s performance.
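
The evaluation protocol described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not the authors' own code: the helper names `split_train_test` and `confusion_matrix`, the fixed random seed and the toy labels are all assumptions made for the example.

```python
import random

def split_train_test(rows, train_fraction=0.75, seed=42):
    """Shuffle the rows and split them into a training and a test subset."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def confusion_matrix(actual, predicted):
    """Tally true/false negatives and positives for binary labels 0 and 1."""
    counts = {"TN": 0, "FP": 0, "FN": 0, "TP": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1
        elif a == 1 and p == 0:
            counts["FN"] += 1
        elif a == 0 and p == 1:
            counts["FP"] += 1
        else:
            counts["TN"] += 1
    return counts

# Toy data: 100 hypothetical review identifiers split 75% / 25%.
train_rows, test_rows = split_train_test(range(100))
print(len(train_rows), len(test_rows))  # 75 25

# Toy labels: real class versus predicted class for five reviews.
print(confusion_matrix([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

In a real analysis the classifier would be fitted on the training rows only, and the confusion matrix would be computed from its predictions on the held-out test rows.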

Literature review
With the existence of big datasets, consumers have found that social networks can bring flows
of useful information for forming an opinion on the latest products and services. In their turn,
companies can make use of the social media platforms and technologies available worldwide, such
as Facebook, Google, Instagram, Spotify, Twitter, etc., as well as those provided by marketing
companies (Andzulis et al., 2012). Such online platforms influence the behavior of sellers and
buyers, the sales rules and sales practices.
        Vaitkevicius et al. (2019) showed how the ease of shopping and the availability of
product information and reviews, along with pricing, drive the demand growth in online
shopping. Fotiadis and Stylos (2017) investigated online experience factors and the influence
of social platforms on the use of information about parks and on online sales. They built a
questionnaire about visitors’ satisfaction with the E-da World theme park in Taiwan and
applied the theoretical framework of TAM (Technology Acceptance Model). Chen et al.
(2010) studied mobile phone customer satisfaction and experience as a basis for
recommendations; they view a customer not only as a service user but also as a partner
of the service provider, due to the importance of customers’ recommendations to potential customers.
        Product and service recommendations are relevant performance indicators for the
companies whose products are exposed to reviews, and they are also important for other
potential customers reading the reviews and recommendations (cf. Siering et al., 2018).
However, consumers differ in how much weight they give to recommendations; e.g.,
Androniceanu et al. (2020b) found that customers in richer countries gave less weight to
reviews in their purchase decisions, unlike online shoppers. Further, although
major companies use e-commerce, there are regions with a limited volume of
online transactions, such as the Arab world. A study by Chivandi et al. (2018) asserts that 47%
of small businesses do not frequently use social media platforms, while 25% do not use them at all.
Chivandi et al. (2019) discussed how, through brand awareness strategies, social media platforms
shape consumer behavior in online purchases.
        Siering et al. (2018) studied online recommendations, focusing on airline service
quality and on which factors in customer responses, i.e., the textual reviews, can
predict whether customers will recommend the airline or not. They used sentiment analysis.
Similarly, Androniceanu et al. (2020a) studied online clothing business using sentiment
analysis of the textual data of given reviews. The key factor was the positive sentiment score
of each review together with given ratings in predicting the decision to give a
recommendation or not.
        Next, we will extend the study of Androniceanu et al. (2020a) using the same dataset.


Methodology
In this section we briefly present the two classifiers used for this dataset: binomial logistic
regression and naïve Bayes.

Binomial logistic regression
Following Sperandei (2014) and McHugh (2009), we briefly present the binomial logistic
regression model. For simplicity, consider a logistic model with two predictors $x_1, x_2$
and a binary dependent variable $Y$, whose probability of success is denoted by $p = P(Y=1)$ and
whose probability of failure is denoted by $1-p = P(Y=0)$.
       The parameters $b_0, b_1, b_2$ of the model are estimated from the data. The log odds of the event that
the response variable takes the value 1 are $\ln\frac{p}{1-p}$, and we assume a linear relationship between
the log odds and the predictors $x_1, x_2$. The logistic model is represented by this linear
relationship:

$$\ln\frac{p}{1-p} = b_0 + b_1 x_1 + b_2 x_2,$$

equivalent to

$$\frac{p}{1-p} = e^{b_0 + b_1 x_1 + b_2 x_2},$$

from which the probability of success $p$ is derived:

$$p = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + b_2 x_2)}}.$$

Log odds are difficult to interpret, therefore one
computes the exponentials of the coefficients $b_0, b_1, b_2$. For a predictor, the exponential of its
coefficient gives the factor by which the odds are multiplied when that predictor increases by one unit.
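
The relationship between the linear predictor, the probability of success and the odds can be checked numerically. The sketch below is only an illustration: the coefficient values and predictor values are hypothetical, and the function names `success_probability` and `log_odds` are introduced here for the example.

```python
import math

def success_probability(b0, b1, b2, x1, x2):
    """p = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2)))."""
    linear = b0 + b1 * x1 + b2 * x2
    return 1.0 / (1.0 + math.exp(-linear))

def log_odds(p):
    """ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients and predictor values (not estimates from the paper).
b0, b1, b2 = -1.0, 0.8, -0.3
p = success_probability(b0, b1, b2, x1=2.0, x2=1.0)

# The log odds recover the linear predictor: -1.0 + 0.8*2.0 - 0.3*1.0 = 0.3.
print(round(log_odds(p), 6))

# Increasing x1 by one unit multiplies the odds by exp(b1).
p_next = success_probability(b0, b1, b2, x1=3.0, x2=1.0)
odds_ratio = (p_next / (1.0 - p_next)) / (p / (1.0 - p))
print(round(odds_ratio, 6), round(math.exp(b1), 6))
```

The two printed odds ratios agree, which is the reason exponentiated coefficients are read as multiplicative effects on the odds.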

Naïve Bayes classifier
Following Mitchell (2020), a naïve Bayes classifier refers to a joint distribution over a response
variable $Y$ and a set of known random variables $X_1, \ldots, X_n$. The random variables are assumed
conditionally independent given the label $Y$:

$$P(X_1, \ldots, X_n, Y) = P(Y) \prod_i P(X_i \mid Y).$$

          According to Bayes’ rule, the probability that $Y$ takes the value $y_k$ is:

$$P(Y = y_k \mid X_1, \ldots, X_n) = \frac{P(Y = y_k)\, P(X_1, \ldots, X_n \mid Y = y_k)}{\sum_j P(Y = y_j)\, P(X_1, \ldots, X_n \mid Y = y_j)} = \frac{P(Y = y_k) \prod_i P(X_i \mid Y = y_k)}{\sum_j P(Y = y_j) \prod_i P(X_i \mid Y = y_j)} \tag{1}$$

          Given a new observation $X = (X_1, \ldots, X_n)$ and estimating the distributions $P(Y)$ and
$P(X_i \mid Y)$ from the training set, one can compute the probability that $Y$ takes any value $y_k$.
The naïve Bayes classification rule (Mitchell, 2020) is:

$$Y \leftarrow \arg\max_{y_k} \frac{P(Y = y_k) \prod_i P(X_i \mid Y = y_k)}{\sum_j P(Y = y_j) \prod_i P(X_i \mid Y = y_j)}. \tag{2}$$


      The formula above can be simplified (Mitchell, 2020):

$$Y \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k),$$

since the denominator does not depend on the value $y_k$.
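
The simplified classification rule can be sketched as follows. The example assumes Gaussian class-conditional densities for numeric predictors and sums logarithms instead of multiplying probabilities, which leaves the arg max unchanged because the logarithm is monotone. The function names, the priors and the (mean, standard deviation) pairs are hypothetical values chosen for illustration, not estimates from the dataset.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def naive_bayes_predict(x, priors, params):
    """Return arg max over y_k of log P(Y=y_k) + sum_i log P(X_i | Y=y_k)."""
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        score = math.log(prior)
        for value, (mean, std) in zip(x, params[label]):
            score += math.log(gaussian_pdf(value, mean, std))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical parameters for two predictors (Rating, Age) and classes 0 and 1.
priors = {0: 0.2, 1: 0.8}
params = {
    0: [(2.0, 1.0), (40.0, 12.0)],
    1: [(4.5, 0.7), (43.0, 12.0)],
}

print(naive_bayes_predict((5, 35), priors, params))  # 1
print(naive_bayes_predict((1, 35), priors, params))  # 0
```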
        The ROC (receiver operating characteristic) curve and the AUC (area under the curve) measure
the performance of the classifier. The ROC curve summarizes the confusion matrices obtained at all threshold
values. AUC takes values in the unit interval and indicates how well the classifier
separates the positive and negative classes: AUC=1 indicates a perfect classifier, AUC=0.5 corresponds
to random guessing, and AUC=0 indicates that all predictions are wrong.
        Both axes of the ROC curve range between 0 and 1. Specificity measures the
proportion of negative class observations correctly predicted as negative. Sensitivity
measures the proportion of positive class observations correctly predicted as positive. The
Oy axis measures the sensitivity, while the Ox axis measures 1-specificity. The kappa statistic
measures how much better the classifier agrees with the real classes than would be expected by chance,
for example by always selecting the most common class. Kappa can take values between -1 and 1; a value
of 1 indicates perfect agreement, while values at or below 0 indicate agreement no better than chance.
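
A small sketch may help to make the construction of the ROC curve concrete. It sweeps the threshold over the observed scores, records the pair (1-specificity, sensitivity) at each threshold and integrates with the trapezoidal rule. The function names and the toy labels and scores are assumptions made for this example.

```python
def roc_points(labels, scores):
    """Return (1 - specificity, sensitivity) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy example: three positive and three negative reviews with predicted scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auc(roc_points(labels, scores)), 3))  # 0.889
```

In the toy example, 8 of the 9 possible (positive, negative) pairs are ranked correctly by the scores, which gives the same value 8/9.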

Results and discussions
In the data analysis we remove the identifiers Nr and ClothingID, since they carry no predictive
information and would only confuse the machine learning algorithms. The training set contains 5695 rows
(75%), while the test set contains 2847 rows (25%). The formula on which each classifier is built is
RecommendedIND ~ Rating + Age + PositiveFeedbackCount.

Binomial logistic regression
The logistic regression model will be deduced from the following output:

                                       Figure no. 1. Logistic regression output
                                                    Source: own calculations.


We remark that the response variable RecommendedIND is a factor variable with two
levels, 0 and 1. Thus in our case we estimate the probability of having a positive review. The
output in Figure 1 includes measures of fit such as the summary of deviance residuals and the AIC
(Akaike Information Criterion). The coefficients of each predictor and the intercept are also
included. The coefficients of Rating and Age are positive and statistically significant, at the
significance levels of 0.001 and 0.1, respectively. The coefficient of PositiveFeedbackCount is
negative and significant at the level of 0.05. A one-point increase in Rating increases
the log odds of a positive review by 3.44, and a one-year increase in Age increases them by 0.009.
A one-unit increase in PositiveFeedbackCount decreases the log odds by 0.02. Log odds are difficult to
interpret, therefore the exponentials of the coefficients are computed.
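
As a quick check, the exponentials can be recomputed from the rounded coefficients quoted in the paragraph above. This is only a sketch using the rounded values 3.44, 0.009 and -0.02; small differences from the values reported with Figure 2 would presumably come from that rounding.

```python
import math

# Log-odds coefficients as quoted in the text (rounded).
coefficients = {"Rating": 3.44, "Age": 0.009, "PositiveFeedbackCount": -0.02}

# Exponentiating a coefficient gives the multiplicative change in the odds
# for a one-unit increase in the corresponding predictor.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds multiplied by {ratio:.4f}")
```

A ratio above 1 means that the odds of a positive review grow with the predictor, while a ratio below 1, as for PositiveFeedbackCount, means that they shrink.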

Figure no. 2. Example of the logistic regression coefficients
Source: own computation.

If Rating increases by 1, the odds of the review being positive are multiplied by 31.2.
If Age increases by 1 year, the odds of the review being positive are multiplied by 1.0091, while
if PositiveFeedbackCount increases by 1, the odds are multiplied by 0.979, that is, they decrease by about 2.1%.
The confusion matrix for the test set is depicted in Figure 3.

Figure no. 3. Confusion matrix for logistic regression classifier
Source: own computation.

Out of 503 negative reviews, 477 have been correctly predicted as being negative. Out
of 2344 positive reviews, 2188 have been correctly predicted as being positive. The classifier
accuracy is 93.6%. The kappa value of 80.1% is high, indicating agreement well beyond chance.
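
The accuracy and kappa figures follow directly from the four counts of the confusion matrix. The sketch below recomputes them from the counts quoted in the text (26 false positives and 156 false negatives are implied by the class totals); the function name `accuracy_and_kappa` is introduced here for illustration.

```python
def accuracy_and_kappa(tn, fp, fn, tp):
    """Return (accuracy, Cohen's kappa) for a 2x2 confusion matrix."""
    total = tn + fp + fn + tp
    observed = (tn + tp) / total
    # Agreement expected by chance, from the actual and predicted class totals.
    expected = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / total ** 2
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa

# 503 negative reviews (477 correct) and 2344 positive reviews (2188 correct).
acc, kappa = accuracy_and_kappa(tn=477, fp=503 - 477, fn=2344 - 2188, tp=2188)
print(round(acc, 3), round(kappa, 3))  # 0.936 0.801
```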


                              Figure no. 4. ROC and AUC for logistic regression classifier
                                                    Source: own computation.

        In Figure 4, AUC=0.941, meaning that there is a 94.1% chance that the model assigns a
higher probability of being positive to a randomly chosen positive review than to a randomly chosen negative one.

Naïve Bayes classifier
The model computes the a priori class probabilities, which reflect the distribution of the two classes in the
training data. Since the three predictors are numeric, we obtain the means (the first column in Figure 5) and the
standard deviations (the second column in Figure 5) of the conditional Gaussian distributions of each predictor given the class.
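
The quantities summarized by such a model can be sketched as follows: class priors estimated as class proportions, and a mean and standard deviation per predictor and class. The toy rows, the function name `fit_gaussian_naive_bayes` and the use of the sample standard deviation are assumptions of this illustration, not details confirmed by the paper.

```python
import statistics

def fit_gaussian_naive_bayes(rows, labels):
    """Estimate class priors and per-feature (mean, std) for each class."""
    priors, params = {}, {}
    for label in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == label]
        priors[label] = len(members) / len(rows)
        # zip(*members) iterates over the columns (one per predictor).
        params[label] = [
            (statistics.mean(col), statistics.stdev(col))
            for col in zip(*members)
        ]
    return priors, params

# Hypothetical toy rows: (Rating, Age, PositiveFeedbackCount).
rows = [(5, 34, 0), (4, 50, 2), (5, 41, 1), (1, 29, 6), (2, 45, 4)]
labels = [1, 1, 1, 0, 0]

priors, params = fit_gaussian_naive_bayes(rows, labels)
print(priors)        # {0: 0.4, 1: 0.6}
print(params[1][0])  # (mean, std) of Rating among positive reviews
```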

                                Figure no. 5. Summary of the naïve Bayes model
                                                    Source: own computation


       The next step is to predict the class of each review in the test set. As the
threshold for classifying a review as positive or negative, we declare a
review positive when the probability of its being positive is greater than 0.75.
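
The thresholding step can be illustrated with a few hypothetical predicted probabilities. The function name `classify` and the probability values are assumptions; only the threshold of 0.75 comes from the text.

```python
def classify(prob_positive, threshold=0.75):
    """Label a review 1 (recommended) only if its probability exceeds the threshold."""
    return 1 if prob_positive > threshold else 0

# Hypothetical predicted probabilities of a positive review.
probs = [0.95, 0.80, 0.75, 0.60, 0.10]

print([classify(p) for p in probs])                 # [1, 1, 0, 0, 0]
print([classify(p, threshold=0.5) for p in probs])  # [1, 1, 1, 1, 0]
```

Compared with the default cut-off of 0.5, a threshold of 0.75 assigns borderline reviews to the negative class, which tends to raise specificity at the cost of sensitivity.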

                           Figure no. 6. Confusion matrix for naïve Bayes classifier
                                                    Source: own computation.

       Out of 503 negative reviews, 477 have been correctly predicted as being negative. Out
of 2344 positive reviews, 2176 have been correctly predicted as being positive. The classifier
accuracy is 93.2%. The kappa value of 78.9% is high, indicating agreement well beyond chance.

                              Figure no. 7. ROC and AUC for naïve Bayes classifier
                                                    Source: own computation.


In Figure 7, AUC=0.938, meaning that there is a 93.8% chance that the model assigns a
higher probability of being positive to a randomly chosen positive review than to a randomly chosen negative one.

Conclusion
In this paper we analyzed customer reviews on fashion items by employing two supervised learning
methods, logistic regression and naïve Bayes, to classify whether or not a review recommends
the fashion item. Both classifiers showed a very high accuracy, about 93%-94%, with a
kappa statistic of about 79%-80%. These results can give companies indications of
customers’ opinions about their products and services and of how to improve their quality and
range. By means of machine learning, intelligent systems can build profiles of customers and their
purchasing habits, contributing to sales increases and to business strategies that fulfill various
criteria. Some limitations of machine learning technology relate to ethics, namely
customer privacy and lack of transparency.

References
Agarap, A.F.M. (2020). Statistical analysis on e-commerce reviews with sentiment
classification using bidirectional recurrent neural networks. Preprint. Available at
https://arxiv.org/pdf/1805.03687.pdf
Androniceanu, A., Georgescu, I., & Kinnunen, J. (2020a). The key role of social media in
identifying consumer opinions for building sustainable competitive advantages. In
Meiselwitz G. (Eds.) Social Computing and Social Media. Participation, User Experience,
Consumer Experience and Applications of Social Computing. HCII 2020. Lecture Notes
in Computer Science, vol. 12195, Springer, Cham, 261-277.
Androniceanu, A., Kinnunen, J., Georgescu, I., & Androniceanu A.-M. (2020b).
Multidimensional analysis of consumer behaviour on the European digital market. In:
Sroka W. (eds.) Perspectives on Consumer Behaviour. Contributions to Management
Science. Springer, Cham. https://doi.org/10.1007/978-3-030-47380-8_4
Andzulis, J. M., Panagopoulos, N. G., & Rapp, A. (2012). A review of social media and
implications for the sales process. Journal of Personal Selling & Sales Management, 2(3),
305-316.
Brooks, N. (2018a). Guided numeric and text exploration E-commerce, available at:
https://www.kaggle.com/nicapotato/guided-numeric-and-text-exploration-
commerce
Brooks, N. (2018b). Women’s e-commerce clothing review, available at:
https://www.kaggle.com/nicapoto/womens-ecommerce-clothing-views
Chen, W-K, Huang, H-C, & Chou, S-C. (2010). Understanding consumer recommendation
behavior. In Pousttchi, K. & Wiedemann, D.G. (eds.) Handbook of Research on Mobile
Marketing Management, IGI Global, pp. 401-416
Chivandi, A., Vafana, S., Samuel, O.G., & Muchie, M. (2018). Social media innovation
consumption of hair products in South Africa; African female perception. Journal of
Retail and Consumer Services, JJRC2018759
Chivandi, A., Samuel, M. O., & Muchie, M. (2019). Social media, consumer behaviour and
service marketing. Consumer Behaviour and Service Marketing. Matthew Reyes.
IntechOpen, Available at: https://www.intechopen.com/books/consumer-behavior-
andmarketing/socialmediaconsumer-behavior-and-service-marketing
Fotiadis, A. K. & Stylos, N. (2017). The effects of online social networking on retailer consumer
dynamics in the attraction industry: The case of ‘E-da’ theme park,
Taiwan. Technological Forecasting and Social Change, 124, 283-294.
McHugh, M.L. (2009). The odds ratio: calculation, usage, and interpretation. Biochem Med.,
19(2), 120-126.
Mitchell, T. (2020). Machine Learning (2nd ed.) McGraw Hill.

Sperandei, S. (2014). Understanding logistic regression analysis. Biochem Med., 24(1), 12-18.
Siering, M., Deokar, A.V., & Janze, C. (2018). Disentangling consumer recommendations:
Explaining and predicting airline recommendations based on online reviews. Decision
Support Systems, 107, March 2018, 52-63.
Vaitkevicius, S., Mazeikiene, E., Bilan, S., Navickas, V., & Savaneviciene, A. (2019). Economic
demand formation motives in online-shopping. Inzinerine Ekonomika Engineering
Economics, 30(5), 631-640.
