AUTOMATICALLY DETECTING AND ALERTING E-COMMERCE WEB SITE FAKE REVIEWS THROUGH SENTIMENT ANALYSIS

Page created by Troy Chang

Uncategorized

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

Journal of Xi'an University of Architecture & Technology ISSN No : 1006-7930

AUTOMATICALLY DETECTING AND
ALERTING E-COMMERCE WEB SITE
FAKE REVIEWS THROUGH SENTIMENT
ANALYSIS
Paul Jasmine Rani. L1 , Meenakshiammal.R2, Sindhu.S3 Rishii K P4 , Shakthi Saravanaa R.B5 , 1Associate
Professor, Department of Computer Science and Engineering, Rajalakshmi Institute of Technology Chennai ,
2
Associate Professor, Rohini College of Enginering and Technology, 3Assistant Professor, SRN Institute of Science
and Technlogy, 4,5ug scholar, Department of computer Science and Engineering, Rajalakshmi Institute of
Technology pauljasminrani.l@ritchennai.edu.in,saimeenumani@gmail.com,sindhus2@srmist.edu.in,
rishiikpr@gmail.com, bossman22042001@gmail.com.

Abstract— This is an era where the popularity and the The sentiment is formulated usually as a two-class
market for any product is defined by its corresponding reviews. classification problem, positive and negative. Reading,
The demand of the product and thereby it’s stock value reviewing and guessing the polarity of the product or service
increases , whereas the demand towards the product and its is a tiresome process and also consumes a lot of time,
stock value decreases when the reviews are mostly positive and therefore automated techniques are used for sentiment
negative respectively. Therefore the sellers are sometimes
analysis. There are three major classifications in sentiment
benefitted , but always the loss is for the customers (online
purchasers). It is necessary to provide a detailed and true analysis, namely: document level, sentence level, and aspect
review for the product from the verified buyer , so that the level. It is important to distinguish between these three which
others can choose on their own without any bias . But the review will determine different tasks of sentiment analysis. The
systems are often targeted by opinion spamming. This paper document level takes a document as an opinion on its aspect,
classifies reviews and determines whether they are real or fake and it aims to classify an opinion document as a negative or
in the first place itself . If the reviews are positive and actual, positive opinion. The sentence level using SA is used to set
then there is no problem. But if the reviews are fake, it may up an opinion stated in every sentence. The aspect level is
destroy the market of the product. A study conducted on fake based on the idea that an opinion consists of a sentiment
reviews states that there are almost 50 thousand fake reviews in (positive or negative), and its sentiment analysis aims to
72 million reviews. In this paper , sentiment analysis is used in categorize the sentiment based on specific aspects of entities.
identification of these fake reviews , and by applying sentiment
analysis coupled with SVM algorithm we can also remove these
Some challenges are faced by fake review detection, when
biased and fake reviews at once . To determine the fake reviews, opinion spammers use different account names alongside
Support Vector Machine algorithm is trained over a predefined different IP’s , it becomes almost impossible to tag them as
dataset of reviews to extract features for classification. The fake. The dataset we use is from any ecommerce website such
source of the data is from any ecommerce website . The IP as amazon, flipkart etc.. After we collect these datasets, SA
address of the device that reviews the product is also identified, technique is applied on them to classify the documents as
and if the system recognizes multiple reviews from the same IP original positive and original negative reviews or fake
address, then it marks them as fake. positive and fake negative reviews. It is to be noted that not
only reviews that are negative are marked as fake, but also
Keywords— review analysis, sentiment analysis, fake reviews reviews that are positive and being spammed more than one
,support vector machine.
time is also considered a fake one. Spammers use this method
I. INTRODUCTION of opinion spamming to either discredit a product or service,
though the product is worthy, in order to decrease the
Opinions and reviews are increasingly used by company’s worth, or to increase the sales of the product and
individuals and organizations to make purchase decisions, provide the company or seller with fame and popularity,
marketing, product design, etc. . There are N number of thereby increasing their stock and demand value. Detecting
products that are being displayed in these sites whose market such fake reviews is a challenge, for example, fake consumer
is decided by the corresponding reviews. Positive reviews reviews are not only affecting individual consumers but also
provide the businessman and sellers with fame and profit corrupt purchaser’s confidence in online shopping. Machine
which increases their market value. This gives imposters to learning along with sentiment analysis provides a good
post fake reviews to discredit their product or service they chance in the detection of fake reviews in movie review,
provide. They are called fake reviewers or opinion spammers. ecommerce, and other domains. We wish to classify the
Due to fake reviews the users will make wrong decisions and reviews from the dataset taken from ecommerce website and
the product sale will decrease, this will result in a loss for the movie reviews as real ones or fake using sentiment analysis
owners. This paper gives the solution to this problem by algorithms along with supervised learning techniques.
applying sentiment analysis to detect and eliminate fake
reviews.

Volume XII, Issue IV, 2020 Page No: 5507

Journal of Xi'an University of Architecture & Technology ISSN No : 1006-7930

II. RELATED WORK show that the automatically extracted topics match
A. [1] analyze the market of fake review suppliers and the manually extracted ones, while also significantly
their fake reviewing methods and found that decreasing the manual effort.
developers purchase reviews to comparatively I. [6] implemented support-vector network for the
costly costs of a couple of bucks or alter reviews in restricted case where the training data can be
exchange portals. Recently, several techniques and separated without errors. It extends the result to non-
approaches are brought up within the field of fake separable training data.
review detection.
III. BACKGROUND AND SURVEY
B. [2]process and reviewing a brand new mobile
Fake reviews tend to own a lot of red flags that
application. it's used for testing purpose , beta
facilitate AI technology to spot them. , the language
testing, metadata correction , hardware employed in a fake review usually includes formal product
compatibility and software system necessities. It names, model numbers and selling jargon. Also, reviewer
determines the live element and app support. profiles sometimes fail to replicate a standard pattern of
C. [4] presents emotion Text, a toolkit for human’s product or service reviews. and plenty of times, fake reviews
feeling recognition from text. It provides empirical are published too oftentimes in a very short span of your time.
proof of the performance of EmoTxt. To the best of By assessing these aspects, AI technology will establish false
reviews from legitimate ones. With the employment of
our knowledge and data, emotion Text is the 1st
machine learning and AI algorithms, the capability to find
ASCII text file toolkit supporting each feeling of fake reviews can increase. which will facilitate to improve the
human's mind recognition from text and coaching of legitimacy of product and repair reviews so better client
custom emotion classification models. choices will be made. It’s no secret that AI technology offers
D. [5] analyze the adapting data retrieval techniques businesses a bonus in terms of client review analyses.
together with modeling and evaluating them on However, the employment of AI technology is additional
completely different level in the eyes of public unremarkably accustomed examine client complaints,
preferences and requests. Machine learning is employed to
market knowledge. Results show that the capture client information from reviews to make client
mechanically extracted topics match the manually profiles.
extracted ones, but also tend to decrease the manual
effort. The process is reviewed in a brand new mobile app
to see its political campaign as a live element of the app store.
E. [6] enforced support-vector network for the
It includes ratings, exploitation of same emblem, repeating
restricted case wherever the coaching information comments by completely different usernames. the corporate
will be separated without any errors. It extends the may receive both positive and negative review. Positive
result to non-separable training data.[1] analyze the incase to extend sales, supply and development, and in
market of fake review providers and their fake negative may be reporting and protesting . The fake review
reviewing strategies and found that developers buy market is divided into 3 sections.
reviews to relatively expensive prices of a few PROVIDERS : company, salesman, etc
dollars or deal with reviews in exchange portals.
Recently, many techniques and approaches have OFFERS : conditions, consumers, etc
been proposed in the field of fake review detection. POLICIES : detection of fake review
F. [2]process and reviewing a new mobile application.
It is used for testing purpose , beta testing, accurate The characteristics of fake review includes the
application ,category ,detection rate, frequencies.
metadata , hardware compatibility and software
requirements. It determines the live component and
app support.
G. [4] presents Emotion Text, a toolkit for emotion
recognition from text, trained and tested on a gold
standard of about 9K question, answers, and
comments from online interactions. It provides
empirical evidence of the performance of EmoTxt.
To the best of our knowledge, Emotion Text is the
first open-source toolkit supporting both emotion
recognition from text and training of custom
emotion classification models.
H. [5] analyze the adapting information retrieval
techniques including modeling and evaluating them
on different publicly available data sets. Results

Volume XII, Issue IV, 2020 Page No: 5508

Journal of Xi'an University of Architecture & Technology                                                    ISSN No : 1006-7930

                                                                    classification scheme, but instead they create confusion to our
                                                                    classifier. In this study, we used a 820 English waste words
                                                                    list. Waste words removal helps to decrease the memory
                                                                    needs while Classifying the reviews.

                                                                    Algorithm for preprocessing(POS method)

                                                                    1# : take away all blank rows if any.
                                                                    2# : Modify all the word to lower_case letter. This is needed
                                                                    in python interprets 'review' and 'REVIEW' differently
                                                                    dataset1['text']=[ent.low() for entry in dataset1['text']]
                                                                    3# : Tokenization :
                                                                    In this each and every entry in the dataset1 will be spilt into
                                                                    set of words
                                                                    dataset1['text']= [word_token(entry) for entry in
                                                                    dataset1['text']]
                                                                    4# : Remove waste words, Non-Numeric and perform Word
                                                                    (Analyzer requires Pos tagger to know if the word is noun
 Figure 1:data representation about website reviews                 or verb or adjective etc. By default the data set is set to be
                                                                    Noun)
          IV.      DESIGN AND METHODOLOGY                           map_tag= defaultdict(lambda : wn.NOUN)
    A. Data collection                                              tag_map['J'] = wn.ADJ
                                                                    tag_map['V'] = wn.VERB
           Data is collected from the website to find out the
                                                                    tag_map['R'] = wn.ADV
 fake review and data set which is used in this paper is a set of
 product reviews as well as the related data collected from         for index,entry in list(dataset1['text']):
 amazon.com. From October to December 2019, some sample                       # Stating Empty List to retain the words that follow
 has been collected, in total, over 6.2 millions of product         the protocol for this step
 reviews in which the products belong to six main categories                  Final_words = zero
 are identified i.e.. Cooking related items, musical
                                                                    if word not in stopwords.words('english') and
 instruments, beauty, book, electronic, furniture and fashion
 These online reviews were posted by over 2.2 millions of           word.isalpha():
 reviewers towards 10,175 products. Each review includes the        word_Final = word_Lemmatiz.leme(word,tag_map[tag[0]])
 following information: reviewer unique ID, product ID,
 rating ,review time , helpfulness, review text, Posted             # Initializing WordNetLemmr()
 sequence
                                                                              word_Lemmatiz = WordNetLemmatizer()
                                                                              ( pos_tagger procedure will provide the 'tag' i.e if
 B. Sentiment text extraction and data preprocessing                the word is Noun(N) or Verb(V) or something else.)
                                                                    for word, tagger in pos_tagger(entry):
           Each and every word of a sentence has its peculiar                  # Below situation is to check for Stop words and
 role that defines how the word is used. This roles are also        consider only letters and not numbers
 known as the parts of speech. There are eight parts of speech
 in English: verb , noun, the pronoun, the adjective, the           words.append(word_Final)
 adverb, the preposition, the conjunction, and the interjection.
 (POS) taggers have been developed in natural language                       # The final treated set of words for each repetition
 processing, to classify words based on their parts of speech.      will be collected in 'text_final'
 This will be the first step of the preprocessing stage . For       dataset1.loc[index,'text_final'] = str(Final_words)
 sentiment analysis, to extract the text a POS tagger is very
 useful because of the following two reasons: 1) Words like         C. Fake review detection computation
 nouns and pronouns usually do not contain any opinion. It is
 able to filter out such words with the help of a POS tagger; 2)
 A POS tagger can be used to differentiate words that can be             ● Designing the review contents
 used in different parts of speech. After differentiating words       Designing the review contents Let Rw = {rw[l], rw[2], ...,
 data preprocessing has to be done using waste word list ie.,       rw[n]} be a review sequence, and number in index indicates
 Some of the words are usually used (e.g., "a," "the," "of," "I,"   the posted order of review(time in which posted), the review
 "you," "it," "and") but do not give any significant data to our    rw[i] contains multiple information: review unique ID

Volume XII, Issue IV, 2020                                                                                          Page No: 5509

Journal of Xi'an University of Architecture & Technology ISSN No : 1006-7930

rw[i].uid, posted time rw[i].time, review content Random, Score[i] = T
rw[i].content and product ID rw[i].prd.

● detecting content identical (P1) If the score value is less than threshold value then
If the reviewer rw[i].uid posts his/her own assessments
constantly, rw[i].content would have a comparative high E.Support vector machine
similarity with his/her reviews. We keep a review centroid for
each reviewer, which consists of the terms' average incidence In this paper we have a tendency to used Support vector
rates in the reviews posted by rw[i].uid. The database of machine (SVM) to classify the review input training data is
centroid should be updated whenever the frequency is taken from the score and also the trained data set is taken .
increased Thus, we can evaluate the content identical of the SVM could be a methodology for the classification of each
detected review as follows: linear and nonlinear information. If the dataset is linearly
Su = identical(rw[i].content, Cr[i].u) severable, the SVM searches for the linear best separating
hyperplane (the linear kernel), that could be a call boundary
that separates information of 1 category from another. once
● detecting product review identical (p2) the information set is as input it's processed in SVM model
and that they give the output as original review severally and
A fake review might be the duplicate or nearby-duplicate faux review severally .Mathematically, a separating
of an current one on the same product. If the product is more hyperplane is written as:
similar and If so, it would be directly associated with that
product. If there are multiple reviews between the fake review W·X+b=0,
and the normal ones copied, viewers always cannot detect the wherever W could be a weight vector and W=w
fake review because most of the viewer wish is to read the one,w2,...,wn. X could be a training tuple. b could be a scalar.
reviews in the first few pages .So first few page review are so as to optimize the hyperplane, the matter primarily
very important Compared with the normal reviews, fake transforms to the diminution of diminution, that is eventually
reviews would have higher similarity with the "review computed as:
centroid" of the product. Finally to calculate the identity of
the similar product Sp.
Sp = identical(rw[i].c, Cr[i].p)

● The review frequency of a product (p3)

Various fake reviews happening in a short time interval
would make more effect. If a product is very frequently in a
small time window, it might be attacked. To get positive
review or negative review time factor plays a major role.
Suppose if it gets with in short time, this could be caused by
other reasons like promotions. we also consider it as an index
of fake reviewers actions.

D. Final score calculation

Score= (al*Su + a2* Sp + a3*fq) /

where attributes a1,a2,a3….an are the weight parameters
turning the contributions of frequency F1, F2, ... , F6
separately. Since the fake review score is standardized
between o and i. we can determine whether a review is a spam
with a predetermined threshold T:

Normal , Score[i] < T
Figure .3. Proposed System Architecture
Spam ,Score[i] > T

Volume XII, Issue IV, 2020 Page No: 5510

Journal of Xi'an University of Architecture & Technology                                                        ISSN No : 1006-7930

              V. RESULTS AND DISCUSSION

            In this section, we provide experimental results of
         completely different supervised machine learning
         approaches to distinguish sentiment of our datasets
         which is compared with product review dataset V1 and
         product Review dataset V2 Also, we have used similar
         methodologies at the same time to detect fake reviews.
         Our experimental approaches tend to study the accuracy
         of all sentiment classification algorithms, and how to
         determine     which   algorithm    is     more        accurate.
         Furthermore, we were able to find out fake positive
         reviews and fake negative reviews through the proposed
         model.

                                                                           Figure 5. Output of SVM model
 Table-I The result of our Experiment.

  Experiment Fake              Fake         SVM           Threshold            VI. CONCLUSION
             Positive          Negative
             review            review
                                                                                In this paper, we have taken the problem of
                                                                               identifying fake reviews which is from the review
  Dataset-V1         20.5      22.2         83.5          81.4                 sequences using sentiment analysis. Online
  Dataset-V2         26.8      26.3         75.4          79                   product reviews from Amazon.com are selected as
                                                                               our primary dataset. To examine and evaluate the
                                                                               proposed model , we have made an attempt to
                                                                               create a dataset from a real-life review data set.
                                                                               More number of experiments have been
                                                                               conducted to analyze the effectiveness of the
                                                                               proposed model. From the proposed model we
                  Trained data set result                                      were able to easily find out the fake reviews and
                                                                               mark it as fake.
   100
    50
                                                                                VII. REFERENCES
     0
          Fake Positive   Fake        SVM          Threshold                   [1]. 9to5Mac (2017) Bell faces $1.25M fine for
             review     Negative                                               posting fake reviews of its app in the App Store.
                         review                                                https://9to5mac.com/2015/10/14/bell-fake-app-
                       Dataset-V1         Dataset-V2                           store-reviews/

  Figure 4: graph of the experiment                                            [2].Apple (2017) App Store Review Guidelines.
                                                                               https://developer.apple.com/app-
                                                                               store/review/guidelines/

                                                                               [3]. Bird S, Klein E, Loper E (2009) Natural
                                                                               language processing with python, 1st edn.
                                                                               O’Reilly Media, Inc

                                                                               [4]. Calefato F, Lanubile F, Novielli N (2017)
                                                                               Emotxt: a toolkit for emotion recognition from
                                                                               text. arXiv:170803892

Volume XII, Issue IV, 2020                                                                                              Page No: 5511

Journal of Xi'an University of Architecture & Technology                                          ISSN No : 1006-7930

      [5]. Carreno L V G, Winbladh K (2013) “                   [15] S.Pradeep Kumar ; O. Pandithurai,Sixth
      Analysis of user comments: an approach for                sense technology,2013 International Conference
      software requirements evolution. In: 2013 35th            on Information Communication and Embedded
      international conference on software engineering          Systems (ICICES)
      (ICSE).
      https://doi.org/10.1109/ICSE.2013.6606604, pp             [16] K. Sriprasadh ;Saicharansrinivasan ; O.
      582–591                                                   Pandithurai ; A. saravanan,A novel method to
                                                                secure cloud computing through multicast key
      [6]. Cortes C, Vapnik V (1995) Support-vector             management,2013 International Conference on
      networks.      Mach     Learn   20(3):273–297.            Information Communication and Embedded
      https://doi.org/10.1007/BF00994018                        Systems (ICICES)

      [7] Finkelstein A, Harman M, Jia Y, Martin W,             [17] O. Pandithurai ; N. suganya ; S. Naresh ; Jaya
      Sarro F, Zhang Y (2017) Investigating the                 Krishna Deva ; B. Suresh,Accessing the required
      relationship between price, rating, and popularity        information from an applications in cloud through
      in the blackberry world app store. Inf Softw              SMS,2013       International   Conference        on
      Technol87:119–139.                                        Information Communication and Embedded
      https://doi.org/10.1016/j.infsof.2017.03.002.             Systems (ICICES)
      http://www.sciencedirect.com/science/article/pii/
      S095058491730215X                                         [18] O. Pandithurai ; M. Poongodi ; S. Pradeep
                                                                Kumar ; C. Gopala Krishnan,A method to support
      [8] .Fritz C O, Morris P E, Richler J J (2012)            multi-tenant as a service,2011 Third International
      Effect size estimates: current use, calculations,         Conference on Advanced Computing,pp 157-162
      and interpretation. J Exp Psychol Gen 141(1):2

      [9]. Gao C, Xu H, Hu J, Zhou Y (2015) Ar-
                                                              [19] C.P. Indumathi ; B. Galeebathullah ; O.
      tracker: “track the dynamics of mobile apps via
      user review mining” ,IEEE symposium on               Pandithurai,Analysis of test case coverage using data
      service-oriented system engineering (SOSE).
                                                           mining technique,2010International conference on
      IEEE, pp 284–290,2017
                                                           communication control and computing technologies,pp
      [10]. Google (2017) Google play user ratings,        806-809
      reviews, and installs.
                                                           [20] V. Rajesh ; O. Pandithurai ; S. Mageshkumar,Wireless
      https://play.google.com/about/storelisting-
      promotional/ratings-reviews-installs/                sensor node data on cloud,2010International
                                                           conference on communication control and computing
      [11] . Google (2018) Google Play Prohibited and
                                                           technologies,pp 476-481
      RestrictedContent.
      https://support.google.com/contributionpolicy/an     [22] G. Sivasankari and O. Pandithurai,Web User Session
      swer/7400114
                                                           Traffic Tracing Via Clustering Techniques,2010
                                                           International Conference on Computer Applications
      [12] Daniel Martens1 · Walid Maalej1,” Towards
      understanding and detecting fake reviews in app      Database Systems,pp 220-229
      stores”, Empirical Software Engineering (2019)
      24:3316–3355

      [13] Elshrif Elmurngi, Abdelouahed Gherbi,”
      Detecting Fake Reviews through Sentiment
      Analysis Using Machine Learning Techniques”,
      The Sixth International Conference on Data
      Analytics,2017.

      [14] O. Pandithurai, Dr. C. Sureshkumar, High-
      Performance Multipath Routing Algorithm Using
      CPEGASIS Protocol in Wireless Sensor Cloud
      Environment,2016 Scientific research An
      academic publisher,vol.7,PP. 3246-3252

Volume XII, Issue IV, 2020                                                                                Page No: 5512

You can also read