AUTOMATICALLY DETECTING AND ALERTING E-COMMERCE WEB SITE FAKE REVIEWS THROUGH SENTIMENT ANALYSIS

Page created by Troy Chang
 
CONTINUE READING
AUTOMATICALLY DETECTING AND ALERTING E-COMMERCE WEB SITE FAKE REVIEWS THROUGH SENTIMENT ANALYSIS
Journal of Xi'an University of Architecture & Technology                                                       ISSN No : 1006-7930

     AUTOMATICALLY DETECTING AND
     ALERTING E-COMMERCE WEB SITE
    FAKE REVIEWS THROUGH SENTIMENT
               ANALYSIS
       Paul Jasmine Rani. L1 , Meenakshiammal.R2, Sindhu.S3 Rishii K P4 , Shakthi Saravanaa R.B5 , 1Associate
      Professor, Department of Computer Science and Engineering, Rajalakshmi Institute of Technology Chennai ,
  2
    Associate Professor, Rohini College of Enginering and Technology, 3Assistant Professor, SRN Institute of Science
       and Technlogy, 4,5ug scholar, Department of computer Science and Engineering, Rajalakshmi Institute of
          Technology pauljasminrani.l@ritchennai.edu.in,saimeenumani@gmail.com,sindhus2@srmist.edu.in,
                                 rishiikpr@gmail.com, bossman22042001@gmail.com.

     Abstract— This is an era where the popularity and the                        The sentiment is formulated usually as a two-class
 market for any product is defined by its corresponding reviews.       classification problem, positive and negative. Reading,
 The demand of the product and thereby it’s stock value                reviewing and guessing the polarity of the product or service
 increases , whereas the demand towards the product and its            is a tiresome process and also consumes a lot of time,
 stock value decreases when the reviews are mostly positive and        therefore automated techniques are used for sentiment
 negative respectively. Therefore the sellers are sometimes
                                                                       analysis. There are three major classifications in sentiment
 benefitted , but always the loss is for the customers (online
 purchasers). It is necessary to provide a detailed and true           analysis, namely: document level, sentence level, and aspect
 review for the product from the verified buyer , so that the          level. It is important to distinguish between these three which
 others can choose on their own without any bias . But the review      will determine different tasks of sentiment analysis. The
 systems are often targeted by opinion spamming. This paper            document level takes a document as an opinion on its aspect,
 classifies reviews and determines whether they are real or fake       and it aims to classify an opinion document as a negative or
 in the first place itself . If the reviews are positive and actual,   positive opinion. The sentence level using SA is used to set
 then there is no problem. But if the reviews are fake, it may         up an opinion stated in every sentence. The aspect level is
 destroy the market of the product. A study conducted on fake          based on the idea that an opinion consists of a sentiment
 reviews states that there are almost 50 thousand fake reviews in      (positive or negative), and its sentiment analysis aims to
 72 million reviews. In this paper , sentiment analysis is used in     categorize the sentiment based on specific aspects of entities.
 identification of these fake reviews , and by applying sentiment
 analysis coupled with SVM algorithm we can also remove these
                                                                       Some challenges are faced by fake review detection, when
 biased and fake reviews at once . To determine the fake reviews,      opinion spammers use different account names alongside
 Support Vector Machine algorithm is trained over a predefined         different IP’s , it becomes almost impossible to tag them as
 dataset of reviews to extract features for classification. The        fake. The dataset we use is from any ecommerce website such
 source of the data is from any ecommerce website . The IP             as amazon, flipkart etc.. After we collect these datasets, SA
 address of the device that reviews the product is also identified,    technique is applied on them to classify the documents as
 and if the system recognizes multiple reviews from the same IP        original positive and original negative reviews or fake
 address, then it marks them as fake.                                  positive and fake negative reviews. It is to be noted that not
                                                                       only reviews that are negative are marked as fake, but also
     Keywords— review analysis, sentiment analysis, fake reviews       reviews that are positive and being spammed more than one
 ,support vector machine.
                                                                       time is also considered a fake one. Spammers use this method
                       I.      INTRODUCTION                            of opinion spamming to either discredit a product or service,
                                                                       though the product is worthy, in order to decrease the
          Opinions and reviews are increasingly used by                company’s worth, or to increase the sales of the product and
 individuals and organizations to make purchase decisions,             provide the company or seller with fame and popularity,
 marketing, product design, etc. . There are N number of               thereby increasing their stock and demand value. Detecting
 products that are being displayed in these sites whose market         such fake reviews is a challenge, for example, fake consumer
 is decided by the corresponding reviews. Positive reviews             reviews are not only affecting individual consumers but also
 provide the businessman and sellers with fame and profit              corrupt purchaser’s confidence in online shopping. Machine
 which increases their market value. This gives imposters to           learning along with sentiment analysis provides a good
 post fake reviews to discredit their product or service they          chance in the detection of fake reviews in movie review,
 provide. They are called fake reviewers or opinion spammers.          ecommerce, and other domains. We wish to classify the
 Due to fake reviews the users will make wrong decisions and           reviews from the dataset taken from ecommerce website and
 the product sale will decrease, this will result in a loss for the    movie reviews as real ones or fake using sentiment analysis
 owners. This paper gives the solution to this problem by              algorithms along with supervised learning techniques.
 applying sentiment analysis to detect and eliminate fake
 reviews.

Volume XII, Issue IV, 2020                                                                                             Page No: 5507
AUTOMATICALLY DETECTING AND ALERTING E-COMMERCE WEB SITE FAKE REVIEWS THROUGH SENTIMENT ANALYSIS
Journal of Xi'an University of Architecture & Technology                                                   ISSN No : 1006-7930

                  II.     RELATED WORK                                     show that the automatically extracted topics match
     A. [1] analyze the market of fake review suppliers and                the manually extracted ones, while also significantly
        their fake reviewing methods and found that                        decreasing the manual effort.
        developers purchase reviews to comparatively                  I.   [6] implemented support-vector network for the
        costly costs of a couple of bucks or alter reviews in              restricted case where the training data can be
        exchange portals. Recently, several techniques and                 separated without errors. It extends the result to non-
        approaches are brought up within the field of fake                 separable training data.
        review detection.
                                                                            III.    BACKGROUND AND SURVEY
     B. [2]process and reviewing a brand new mobile
                                                                            Fake reviews tend to own a lot of red flags that
        application. it's used for testing purpose , beta
                                                                  facilitate AI technology to spot them. , the language
        testing,    metadata      correction     ,    hardware    employed in a fake review usually includes formal product
        compatibility and software system necessities. It         names, model numbers and selling jargon. Also, reviewer
        determines the live element and app support.              profiles sometimes fail to replicate a standard pattern of
     C. [4] presents emotion Text, a toolkit for human’s          product or service reviews. and plenty of times, fake reviews
        feeling recognition from text. It provides empirical      are published too oftentimes in a very short span of your time.
        proof of the performance of EmoTxt. To the best of        By assessing these aspects, AI technology will establish false
                                                                  reviews from legitimate ones. With the employment of
        our knowledge and data, emotion Text is the 1st
                                                                  machine learning and AI algorithms, the capability to find
        ASCII text file toolkit supporting each feeling of        fake reviews can increase. which will facilitate to improve the
        human's mind recognition from text and coaching of        legitimacy of product and repair reviews so better client
        custom emotion classification models.                     choices will be made. It’s no secret that AI technology offers
     D. [5] analyze the adapting data retrieval techniques        businesses a bonus in terms of client review analyses.
        together with modeling and evaluating them on             However, the employment of AI technology is additional
        completely different level in the eyes of public          unremarkably accustomed examine client complaints,
                                                                  preferences and requests. Machine learning is employed to
        market knowledge. Results show that the                   capture client information from reviews to make client
        mechanically extracted topics match the manually          profiles.
        extracted ones, but also tend to decrease the manual
        effort.                                                              The process is reviewed in a brand new mobile app
                                                                  to see its political campaign as a live element of the app store.
     E. [6] enforced support-vector network for the
                                                                  It includes ratings, exploitation of same emblem, repeating
        restricted case wherever the coaching information         comments by completely different usernames. the corporate
        will be separated without any errors. It extends the      may receive both positive and negative review. Positive
        result to non-separable training data.[1] analyze the     incase to extend sales, supply and development, and in
        market of fake review providers and their fake            negative may be reporting and protesting . The fake review
        reviewing strategies and found that developers buy        market is divided into 3 sections.
        reviews to relatively expensive prices of a few                    PROVIDERS : company, salesman, etc
        dollars or deal with reviews in exchange portals.
        Recently, many techniques and approaches have                      OFFERS : conditions, consumers, etc
        been proposed in the field of fake review detection.               POLICIES : detection of fake review
     F. [2]process and reviewing a new mobile application.
        It is used for testing purpose , beta testing, accurate            The characteristics of fake review includes the
                                                                  application ,category ,detection rate, frequencies.
        metadata , hardware compatibility and software
        requirements. It determines the live component and
        app support.
     G. [4] presents Emotion Text, a toolkit for emotion
        recognition from text, trained and tested on a gold
        standard of about 9K question, answers, and
        comments from online interactions. It provides
        empirical evidence of the performance of EmoTxt.
        To the best of our knowledge, Emotion Text is the
        first open-source toolkit supporting both emotion
        recognition from text and training of custom
        emotion classification models.
     H. [5] analyze the adapting information retrieval
        techniques including modeling and evaluating them
        on different publicly available data sets. Results

Volume XII, Issue IV, 2020                                                                                         Page No: 5508
AUTOMATICALLY DETECTING AND ALERTING E-COMMERCE WEB SITE FAKE REVIEWS THROUGH SENTIMENT ANALYSIS
Journal of Xi'an University of Architecture & Technology                                                    ISSN No : 1006-7930

                                                                    classification scheme, but instead they create confusion to our
                                                                    classifier. In this study, we used a 820 English waste words
                                                                    list. Waste words removal helps to decrease the memory
                                                                    needs while Classifying the reviews.

                                                                    Algorithm for preprocessing(POS method)

                                                                    1# : take away all blank rows if any.
                                                                    2# : Modify all the word to lower_case letter. This is needed
                                                                    in python interprets 'review' and 'REVIEW' differently
                                                                    dataset1['text']=[ent.low() for entry in dataset1['text']]
                                                                    3# : Tokenization :
                                                                    In this each and every entry in the dataset1 will be spilt into
                                                                    set of words
                                                                    dataset1['text']= [word_token(entry) for entry in
                                                                    dataset1['text']]
                                                                    4# : Remove waste words, Non-Numeric and perform Word
                                                                    (Analyzer requires Pos tagger to know if the word is noun
 Figure 1:data representation about website reviews                 or verb or adjective etc. By default the data set is set to be
                                                                    Noun)
          IV.      DESIGN AND METHODOLOGY                           map_tag= defaultdict(lambda : wn.NOUN)
    A. Data collection                                              tag_map['J'] = wn.ADJ
                                                                    tag_map['V'] = wn.VERB
           Data is collected from the website to find out the
                                                                    tag_map['R'] = wn.ADV
 fake review and data set which is used in this paper is a set of
 product reviews as well as the related data collected from         for index,entry in list(dataset1['text']):
 amazon.com. From October to December 2019, some sample                       # Stating Empty List to retain the words that follow
 has been collected, in total, over 6.2 millions of product         the protocol for this step
 reviews in which the products belong to six main categories                  Final_words = zero
 are identified i.e.. Cooking related items, musical
                                                                    if word not in stopwords.words('english') and
 instruments, beauty, book, electronic, furniture and fashion
 These online reviews were posted by over 2.2 millions of           word.isalpha():
 reviewers towards 10,175 products. Each review includes the        word_Final = word_Lemmatiz.leme(word,tag_map[tag[0]])
 following information: reviewer unique ID, product ID,
 rating ,review time , helpfulness, review text, Posted             # Initializing WordNetLemmr()
 sequence
                                                                              word_Lemmatiz = WordNetLemmatizer()
                                                                              ( pos_tagger procedure will provide the 'tag' i.e if
 B. Sentiment text extraction and data preprocessing                the word is Noun(N) or Verb(V) or something else.)
                                                                    for word, tagger in pos_tagger(entry):
           Each and every word of a sentence has its peculiar                  # Below situation is to check for Stop words and
 role that defines how the word is used. This roles are also        consider only letters and not numbers
 known as the parts of speech. There are eight parts of speech
 in English: verb , noun, the pronoun, the adjective, the           words.append(word_Final)
 adverb, the preposition, the conjunction, and the interjection.
 (POS) taggers have been developed in natural language                       # The final treated set of words for each repetition
 processing, to classify words based on their parts of speech.      will be collected in 'text_final'
 This will be the first step of the preprocessing stage . For       dataset1.loc[index,'text_final'] = str(Final_words)
 sentiment analysis, to extract the text a POS tagger is very
 useful because of the following two reasons: 1) Words like         C. Fake review detection computation
 nouns and pronouns usually do not contain any opinion. It is
 able to filter out such words with the help of a POS tagger; 2)
 A POS tagger can be used to differentiate words that can be             ● Designing the review contents
 used in different parts of speech. After differentiating words       Designing the review contents Let Rw = {rw[l], rw[2], ...,
 data preprocessing has to be done using waste word list ie.,       rw[n]} be a review sequence, and number in index indicates
 Some of the words are usually used (e.g., "a," "the," "of," "I,"   the posted order of review(time in which posted), the review
 "you," "it," "and") but do not give any significant data to our    rw[i] contains multiple information: review unique ID

Volume XII, Issue IV, 2020                                                                                          Page No: 5509
Journal of Xi'an University of Architecture & Technology                                                  ISSN No : 1006-7930

 rw[i].uid, posted time rw[i].time,          review    content    Random, Score[i] = T
 rw[i].content and product ID rw[i].prd.

      ● detecting content identical (P1)                          If the score value is less than threshold value then
   If the reviewer rw[i].uid posts his/her own assessments
 constantly, rw[i].content would have a comparative high          E.Support vector machine
 similarity with his/her reviews. We keep a review centroid for
 each reviewer, which consists of the terms' average incidence    In this paper we have a tendency to used Support vector
 rates in the reviews posted by rw[i].uid. The database of        machine (SVM) to classify the review input training data is
 centroid should be updated whenever the frequency is             taken from the score and also the trained data set is taken .
 increased Thus, we can evaluate the content identical of the     SVM could be a methodology for the classification of each
 detected review as follows:                                      linear and nonlinear information. If the dataset is linearly
     Su = identical(rw[i].content, Cr[i].u)                       severable, the SVM searches for the linear best separating
                                                                  hyperplane (the linear kernel), that could be a call boundary
                                                                  that separates information of 1 category from another. once
     ●    detecting product review identical (p2)                 the information set is as input it's processed in SVM model
                                                                  and that they give the output as original review severally and
      A fake review might be the duplicate or nearby-duplicate    faux review severally .Mathematically, a separating
 of an current one on the same product. If the product is more    hyperplane is written as:
 similar and If so, it would be directly associated with that
 product. If there are multiple reviews between the fake review                            W·X+b=0,
 and the normal ones copied, viewers always cannot detect the     wherever W could be a weight vector and W=w
 fake review because most of the viewer wish is to read the       one,w2,...,wn. X could be a training tuple. b could be a scalar.
 reviews in the first few pages .So first few page review are     so as to optimize the hyperplane, the matter primarily
 very important Compared with the normal reviews, fake            transforms to the diminution of diminution, that is eventually
 reviews would have higher similarity with the "review            computed as:
 centroid" of the product. Finally to calculate the identity of
 the similar product Sp.
  Sp = identical(rw[i].c, Cr[i].p)

     ●    The review frequency of a product (p3)

  Various fake reviews happening in a short time interval
 would make more effect. If a product is very frequently in a
 small time window, it might be attacked. To get positive
 review or negative review time factor plays a major role.
 Suppose if it gets with in short time, this could be caused by
 other reasons like promotions. we also consider it as an index
 of fake reviewers actions.

 D. Final score calculation

 Score= (al*Su + a2* Sp + a3*fq) /

 where attributes a1,a2,a3….an are the weight parameters
 turning the contributions of frequency F1, F2, ... , F6
 separately. Since the fake review score is standardized
 between o and i. we can determine whether a review is a spam
 with a predetermined threshold T:

 Normal , Score[i] < T
                                                                           Figure .3. Proposed System Architecture
 Spam ,Score[i] > T

Volume XII, Issue IV, 2020                                                                                        Page No: 5510
Journal of Xi'an University of Architecture & Technology                                                        ISSN No : 1006-7930

              V. RESULTS AND DISCUSSION

            In this section, we provide experimental results of
         completely different supervised machine learning
         approaches to distinguish sentiment of our datasets
         which is compared with product review dataset V1 and
         product Review dataset V2 Also, we have used similar
         methodologies at the same time to detect fake reviews.
         Our experimental approaches tend to study the accuracy
         of all sentiment classification algorithms, and how to
         determine     which   algorithm    is     more        accurate.
         Furthermore, we were able to find out fake positive
         reviews and fake negative reviews through the proposed
         model.

                                                                           Figure 5. Output of SVM model
 Table-I The result of our Experiment.

  Experiment Fake              Fake         SVM           Threshold            VI. CONCLUSION
             Positive          Negative
             review            review
                                                                                In this paper, we have taken the problem of
                                                                               identifying fake reviews which is from the review
  Dataset-V1         20.5      22.2         83.5          81.4                 sequences using sentiment analysis. Online
  Dataset-V2         26.8      26.3         75.4          79                   product reviews from Amazon.com are selected as
                                                                               our primary dataset. To examine and evaluate the
                                                                               proposed model , we have made an attempt to
                                                                               create a dataset from a real-life review data set.
                                                                               More number of experiments have been
                                                                               conducted to analyze the effectiveness of the
                                                                               proposed model. From the proposed model we
                  Trained data set result                                      were able to easily find out the fake reviews and
                                                                               mark it as fake.
   100
    50
                                                                                VII. REFERENCES
     0
          Fake Positive   Fake        SVM          Threshold                   [1]. 9to5Mac (2017) Bell faces $1.25M fine for
             review     Negative                                               posting fake reviews of its app in the App Store.
                         review                                                https://9to5mac.com/2015/10/14/bell-fake-app-
                       Dataset-V1         Dataset-V2                           store-reviews/

  Figure 4: graph of the experiment                                            [2].Apple (2017) App Store Review Guidelines.
                                                                               https://developer.apple.com/app-
                                                                               store/review/guidelines/

                                                                               [3]. Bird S, Klein E, Loper E (2009) Natural
                                                                               language processing with python, 1st edn.
                                                                               O’Reilly Media, Inc

                                                                               [4]. Calefato F, Lanubile F, Novielli N (2017)
                                                                               Emotxt: a toolkit for emotion recognition from
                                                                               text. arXiv:170803892

Volume XII, Issue IV, 2020                                                                                              Page No: 5511
Journal of Xi'an University of Architecture & Technology                                          ISSN No : 1006-7930

      [5]. Carreno L V G, Winbladh K (2013) “                   [15] S.Pradeep Kumar ; O. Pandithurai,Sixth
      Analysis of user comments: an approach for                sense technology,2013 International Conference
      software requirements evolution. In: 2013 35th            on Information Communication and Embedded
      international conference on software engineering          Systems (ICICES)
      (ICSE).
      https://doi.org/10.1109/ICSE.2013.6606604, pp             [16] K. Sriprasadh ;Saicharansrinivasan ; O.
      582–591                                                   Pandithurai ; A. saravanan,A novel method to
                                                                secure cloud computing through multicast key
      [6]. Cortes C, Vapnik V (1995) Support-vector             management,2013 International Conference on
      networks.      Mach     Learn   20(3):273–297.            Information Communication and Embedded
      https://doi.org/10.1007/BF00994018                        Systems (ICICES)

      [7] Finkelstein A, Harman M, Jia Y, Martin W,             [17] O. Pandithurai ; N. suganya ; S. Naresh ; Jaya
      Sarro F, Zhang Y (2017) Investigating the                 Krishna Deva ; B. Suresh,Accessing the required
      relationship between price, rating, and popularity        information from an applications in cloud through
      in the blackberry world app store. Inf Softw              SMS,2013       International   Conference        on
      Technol87:119–139.                                        Information Communication and Embedded
      https://doi.org/10.1016/j.infsof.2017.03.002.             Systems (ICICES)
      http://www.sciencedirect.com/science/article/pii/
      S095058491730215X                                         [18] O. Pandithurai ; M. Poongodi ; S. Pradeep
                                                                Kumar ; C. Gopala Krishnan,A method to support
      [8] .Fritz C O, Morris P E, Richler J J (2012)            multi-tenant as a service,2011 Third International
      Effect size estimates: current use, calculations,         Conference on Advanced Computing,pp 157-162
      and interpretation. J Exp Psychol Gen 141(1):2

      [9]. Gao C, Xu H, Hu J, Zhou Y (2015) Ar-
                                                              [19] C.P. Indumathi ; B. Galeebathullah ; O.
      tracker: “track the dynamics of mobile apps via
      user review mining” ,IEEE symposium on               Pandithurai,Analysis of test case coverage using data
      service-oriented system engineering (SOSE).
                                                           mining technique,2010International conference on
      IEEE, pp 284–290,2017
                                                           communication control and computing technologies,pp
      [10]. Google (2017) Google play user ratings,        806-809
      reviews, and installs.
                                                           [20] V. Rajesh ; O. Pandithurai ; S. Mageshkumar,Wireless
      https://play.google.com/about/storelisting-
      promotional/ratings-reviews-installs/                sensor node data on cloud,2010International
                                                           conference on communication control and computing
      [11] . Google (2018) Google Play Prohibited and
                                                           technologies,pp 476-481
      RestrictedContent.
      https://support.google.com/contributionpolicy/an     [22] G. Sivasankari and O. Pandithurai,Web User Session
      swer/7400114
                                                           Traffic Tracing Via Clustering Techniques,2010
                                                           International Conference on Computer Applications
      [12] Daniel Martens1 · Walid Maalej1,” Towards
      understanding and detecting fake reviews in app      Database Systems,pp 220-229
      stores”, Empirical Software Engineering (2019)
      24:3316–3355

      [13] Elshrif Elmurngi, Abdelouahed Gherbi,”
      Detecting Fake Reviews through Sentiment
      Analysis Using Machine Learning Techniques”,
      The Sixth International Conference on Data
      Analytics,2017.

      [14] O. Pandithurai, Dr. C. Sureshkumar, High-
      Performance Multipath Routing Algorithm Using
      CPEGASIS Protocol in Wireless Sensor Cloud
      Environment,2016 Scientific research An
      academic publisher,vol.7,PP. 3246-3252

Volume XII, Issue IV, 2020                                                                                Page No: 5512
You can also read