EPL682 - ADVANCED SECURITY TOPICS
                                Instructor: Elias Athanasopoulos

                                  CAPTCHA REPORT
                                       Andreas Charalampous
                                            April 2020

                                  1. Captcha Background
1.1. Introduction
Using computers as bots, attackers can attack at scale, for example by automatically registering spam
accounts or posting comments. Clearly, a defense mechanism is needed that guards resources
(e.g. account registration) from automated, large-scale attacks while at the same time not blocking
humans from accessing them.

A defense mechanism for this purpose is the Captcha, which protects web resources from being exploited
at scale. Captcha stands for Completely Automated Public Turing test to tell Computers and Humans
Apart and, as the name says, it is a test used to determine whether a user is human: if the user passes
the test, they are considered human, otherwise they are considered a computer. It is also known as the
"Reverse Turing Test". When a user wants to access a protected web resource, a challenge is shown and,
to continue and access the resource, the user must solve it by giving the correct answer.

1.2. Captcha Challenges
A Captcha challenge must make a bot fail while remaining easy for a human to solve. From 1997,
when the captcha was first introduced, until today, challenges have kept evolving and new types are created.
The first version of the Captcha challenge was the "twisted text" [Picture 1a], where the user was
shown a distorted text and had to type the text shown. Early and widely used challenges are the
math captcha, the audio captcha [Picture 1b] and the image captcha [Picture 1c]. There is a grand variety
of captcha challenges [Picture 1d - 1f].

          (a) Twisted Text                 (b) Math/Audio                       (c) Image
          (d) SlideLock                    (e) Drag n’ Drop                     (f) Trivial

                                     Picture 1: Captcha Challenges

1.3. reCaptcha
In 2007 reCaptcha was developed; in 2009 it was acquired by Google and today it is the most used
Captcha. The three most common reCaptcha challenges are the distorted-text reCaptcha [Picture 2a], the
image reCaptcha [Picture 2b] and the noCaptcha reCaptcha (checkbox) [Picture 2c]. Captchas have been
evolving for more than 20 years and will keep evolving, with new kinds of captchas being created. The
reason is that they keep improving, finding ways to make challenges easier for humans, including
minorities such as health-impaired users, and at the same time harder for bots. Also, captchas keep
being bypassed by automation software or solver services, creating an arms race between solvers and
providers.

           (a) Distorted text                     (b) Image                       (c) Checkbox

                                        Picture 2: reCaptcha Challenges

The distorted-text reCaptcha was used as an aid to digitize "The New York Times" archives. During the
automatic scanning of the archives, many words were not recognized by computers. To transcribe these
words easily, each unknown word is sent as a challenge together with another, known word. If the user
gives a correct answer for the known word, then their guess for the unknown word is taken as the
transcription of the scanned word. For more accuracy, the same unknown word is given in multiple
challenges to different users.

1.4. noCaptcha reCaptcha
During the evolution of captchas, human solving services were founded that sold solutions provided
by human solvers. In 2014 the noCaptcha reCaptcha was developed to distinguish not only bots from
humans, but also good humans from bad humans (fraud solvers). This rather easy challenge consists of
a checkbox the user is simply asked to click. In the background, a behavioral analysis of the user
and their browser is performed to decide whether the user is a bot or a human (good or bad). More
specifically, the Advanced Risk Analysis System (ARAS) acquires user information from the Google
tracking cookies and the browser, analyzes it and, based on that, serves an easy (image) challenge, a
hard (difficult distorted text) challenge, or no challenge at all to the user.

When a site protects a resource, it embeds a reCaptcha widget, which collects the cookie and browser
information. The user is shown a checkbox (as in Picture 2c) and is asked to click it. When the user
clicks it, a request is sent to Google containing the Referrer, SiteKey, Cookie and all the information
gathered by the widget; these are analyzed by the ARAS and an HTML frame with the corresponding
challenge is returned to the user. Also, when the checkbox is clicked, an HTML field is populated with
a token, which must be validated by Google and then submitted to the site containing the resource.
The token becomes valid if the user is deemed legitimate or when the user passes the test given. When
the site receives the token from the user, it sends a verification request to Google and gets a
response indicating whether the token verification succeeded. Finally, the site grants access to the resource.
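
To illustrate the last verification step, a minimal server-side sketch in Python is given below. The
siteverify endpoint and its parameters (secret, response, remoteip) follow Google's publicly documented
verification API; the secret key and token values are placeholders.

# Minimal sketch of the site-side token verification step described above,
# assuming Python with the "requests" library. SECRET_KEY and the token are
# placeholders; the endpoint and parameter names follow Google's public
# siteverify API.
import requests

SECRET_KEY = "site-secret-issued-by-google"   # placeholder

def verify_recaptcha_token(token, user_ip=None):
    """Ask Google whether the token submitted by the browser is valid."""
    payload = {"secret": SECRET_KEY, "response": token}
    if user_ip:
        # Optional field that lets the provider compare the IP that solved the
        # challenge with the IP that submitted the token (see Section 3.5).
        payload["remoteip"] = user_ip
    resp = requests.post("https://www.google.com/recaptcha/api/siteverify",
                         data=payload, timeout=10)
    return resp.json().get("success", False)

# The protected site would call verify_recaptcha_token(token, client_ip)
# and only then grant access to the guarded resource.
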
2. Re: CAPTCHAs – Understanding CAPTCHA-Solving Services in an
                       Economic Context
 Marti Motoyama, Kirill Levchenko, Chris Kanich, Damon McCoy, Geoffrey M. Voelker
              and Stefan Savage, University of California, San Diego

2.1. Introduction
The fact that Captchas were deployed to guard resources does not mean the story ends there; attackers
who were "abusing" those resources are now looking for ways to bypass captchas. A need for
automatically solving captchas appeared, and services are taking advantage of that need by selling
captcha solutions, creating a whole business model. The paper presents the two types of solvers,
automated solvers and human-labor solvers, followed by the economics around them.

2.2. Automated Solvers
The first type of solver is the automated solver, which is mainly software that uses Optical Character
Recognition (OCR) algorithms to read and solve text Captchas. Two solvers investigated were
Xrumer and reCaptchaOCR. The solvers made providers change their Captchas, creating an arms race
between them that favors the defender (the providers). The first reason is that developing new
solvers requires highly skilled labor. Another reason is that these solvers have low accuracy and
most sites blacklist IP addresses after 5-7 failed attempts. Finally, sites keep alternative captchas ready
for swift deployment in case the existing captchas get bypassed. Besides losing the arms race,
automated solvers did not survive in the market because of the human solvers.

2.3. Human Solvers
The second type of solvers is human solvers. The motivation is that Captchas are intended to obstruct
automated solvers and this can be sidestepped by giving captchas to human labor pools. Paid solving
is the core of the Captcha-solving ecosystem. There is a whole business model around paid solving
services; an example is shown in Picture 3, where an automating-spamming software tries to create
multiple Gmail accounts, but is prevented by Captcha.

                              Picture 3 - Captcha-solving market workflow

    1. GYC Automator (Client) tries to create a Gmail account and is challenged with a Captcha.
    2. The Client pays DeCaptcher (Solving Service) to solve the Captcha.
    3. The Solving Service puts the Captcha in a PixProfit (workers' forum) pool.
    4. PixProfit selects a worker from the pool.
    5. The worker responds to PixProfit with the solution.
    6. PixProfit sends the solution back to the Solving Service, which forwards it to the Client.
    7. The Client enters the solution on Gmail, gets validated and the account is created.
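
For illustration only, the client's side of steps 2-7 could look like the following hypothetical sketch in
Python; the solving-service URL, endpoints and JSON fields are invented for this example and do not
correspond to DeCaptcher's real API.

# Hypothetical sketch of the client's side of the workflow above (steps 2-7).
# SOLVER_URL and the JSON fields are invented for illustration; real services
# such as DeCaptcher expose their own, proprietary APIs.
import base64
import time

import requests

SOLVER_URL = "https://solving-service.example/api"   # placeholder endpoint

def solve_captcha(image_bytes, api_key):
    # Step 2: submit (and pay for) the captcha image at the solving service.
    job = requests.post(f"{SOLVER_URL}/submit",
                        json={"key": api_key,
                              "image": base64.b64encode(image_bytes).decode()}).json()
    # Steps 3-6: the service forwards the image to a human worker; the client
    # only sees the final result, so it simply polls until a solution arrives.
    while True:
        result = requests.get(f"{SOLVER_URL}/result/{job['id']}").json()
        if result["status"] == "done":
            return result["text"]   # Step 7: typed into the Gmail signup form
        time.sleep(5)
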

In an attempt to find geolocation details about the workers, the authors created Captchas in different
languages, or asking about the local time, and submitted them to the human-based solving services,
concluding that most workers come from low-cost-labor countries (China, India, etc.).

Because solving is an unskilled activity, and by switching to low-cost labor from Eastern Europe,
Bangladesh, China, India, Vietnam, etc., paid-solving services not only survived, they expanded and
became highly competitive as well. Even though wages started at around $10/1000¹, within a few years
they dropped to about $0.5/1000.

2.4. Conclusion
    •   The quality of captchas made it easy to outsource solving to the global unskilled labor market.
    •   The business of solving captchas is growing and highly competitive.
    •   Do Captchas work?
            o Telling computers and humans apart: succeeded.
            o Preventing automated site access: failed.
            o Limiting automated site access: reduces the attacker's expected profit.

¹ The price of captcha-solving services is quoted in dollars per 1000 solved captchas. For example, $5/1000
means the client pays 5 dollars for 1000 solved captchas.
3. I am Robot: (DEEP) Learning to Break Semantic Image CAPTCHAs
        Suphannee Sivakorn, Iasonas Polakis and Angelos D. Keromytis, Department of
                  Computer Science, Columbia University, New York, USA

3.1. Introduction
Two other Captcha attacks were developed for researching purposes, focusing on solving reCaptcha
Image using online Image Annotation Modules and noCaptcha reCaptcha by influencing the Advanced
Risk Analysis System. To achieve this, a system is developed consisting of two main components.

3.2. System Overview

3.2.1. Cookie Manager
The first component is the Cookie Manager, whose main job is to automatically create and train
cookies so that they appear to belong to real users. After creating each cookie, the system is configured
to perform human-like actions with it; some examples are searching Google for certain terms and
following the links provided, opening videos on YouTube, performing Google Maps searches, etc.
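
A rough sketch of this cookie "training" idea, assuming Selenium WebDriver, is shown below; the search
terms, sites and timings are illustrative and not the authors' exact workload.

# Illustrative sketch of cookie "training": perform human-like browsing so the
# freshly created Google cookie accumulates history. Assumes Selenium WebDriver;
# the terms and timings are examples, not the authors' exact procedure.
import random
import time

from selenium import webdriver

SEARCH_TERMS = ["weather tomorrow", "pasta recipe", "python tutorial", "cheap flights"]

def train_cookie(driver):
    for term in random.sample(SEARCH_TERMS, k=3):
        driver.get("https://www.google.com/search?q=" + term.replace(" ", "+"))
        time.sleep(random.uniform(3, 8))            # dwell like a human reader
    driver.get("https://www.youtube.com/results?search_query=music")
    time.sleep(random.uniform(5, 10))
    driver.get("https://www.google.com/maps/search/coffee")

driver = webdriver.Chrome()   # a fresh browser profile yields a fresh cookie
train_cookie(driver)
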

3.2.2. ReCaptcha Breaker
The second component is the reCaptcha Breaker. It uses the cookies from the Cookie Manager and
visits sites that employ reCaptcha. It locates the reCaptcha iframe that contains the checkbox by looking
for the recaptcha-anchor element, performs a click and extracts the recaptcha-token. If the reCaptcha is
solved at that point, it was a checkbox challenge; otherwise, if a popup is created in goog-bubble-content,
an image challenge is shown. The information of the image challenge, namely the hint and sample image
(rc-imageselect-desc) and the candidate images (rc-imageselect-tile), is extracted and passed to another module.
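
A simplified sketch of the checkbox-clicking step is given below, assuming Selenium; the element ids
follow the names mentioned above, and the actual reCaptcha markup may differ from this.

# Simplified sketch of the breaker's checkbox step, assuming Selenium. Element
# ids (recaptcha-anchor, recaptcha-token) follow the names used in the report;
# the real reCaptcha markup may differ.
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

def click_checkbox_and_get_token(driver, url):
    driver.get(url)
    # Enter the iframe that hosts the checkbox widget.
    anchor_frame = driver.find_element(
        By.CSS_SELECTOR, "iframe[src*='recaptcha/api2/anchor']")
    driver.switch_to.frame(anchor_frame)
    driver.find_element(By.ID, "recaptcha-anchor").click()
    time.sleep(3)                       # give the ARAS time to respond
    # If the ARAS is satisfied, the hidden token field is populated directly;
    # otherwise an image challenge pops up (goog-bubble-content) instead.
    token = driver.find_element(By.ID, "recaptcha-token").get_attribute("value")
    driver.switch_to.default_content()
    return token
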

3.3. Breaking the image reCaptcha
To solve the image reCaptcha, the system uses deep-learning techniques to match the given hint and
sample image with the candidate images. The sample and candidate images are passed to image-annotation
modules, such as Google Reverse Image Search (GRIS), Clarifai and Alchemy, which, given an image, return
10-20 tags describing it. GRIS is also used for searching for better-quality versions of the images, for more
accurate results. If the candidate images' tags do not match the hint, a Tag Classifier is used that models
the tags and the hint as vectors and uses the cosine similarity between them to find the candidate images
that are most likely to be of the same category as the hint sample. Because of repetition in reCaptcha
images, a History Module is used that keeps image-hint pairs in a labelled dataset, so future candidate
images can be looked up there to obtain a hint. This attack managed to score 70.78% accuracy against
2,235 challenges.
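
The tag comparison can be illustrated with a toy cosine-similarity check; the bag-of-words vectors and the
threshold below are a simplification of whatever representation the authors' Tag Classifier actually uses.

# Toy illustration of the Tag Classifier idea: represent the hint and a
# candidate image's tags as word-count vectors and compare them with cosine
# similarity. The bag-of-words representation and the threshold are
# simplifications for illustration only.
import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def matches_hint(hint, candidate_tags, threshold=0.3):
    hint_vec = Counter(hint.lower().split())
    tag_vec = Counter(word for tag in candidate_tags for word in tag.lower().split())
    return cosine(hint_vec, tag_vec) >= threshold

# Example: matches_hint("wine", ["red wine", "glass", "drink"]) -> True (0.5 >= 0.3)
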

The algorithm for breaking an image reCaptcha is the following (a code sketch is given after the list):

    •   Each candidate image will be assigned to one of 3 sets: Select, Discard, Undecided.

    •   Initially, all candidate images are placed in Undecided.

    1. If the hint is not provided, the sample image is searched in the labelled dataset to obtain one.

    2. Information about all images is collected from GRIS.

    3. Every candidate image is searched in the labelled dataset.

            •   If it is found, its tag is compared to the hint and, if they match, the candidate image is
                placed in the Select set.

            •   If the tag does not match the hint, the hint_list is checked and, if a match is found there,
                the candidate image is placed in the Discard set.

    4. Image annotation processes all images and tags are assigned.

            •   If the tags match the hint, the image is added to the Select set.

            •   If they match one of the tags in the hint_list, the image is added to the Discard set.

    5. The system picks images from the Select set and, if there are not enough, picks from the Undecided set.
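
A compact sketch of this selection procedure is given below; annotate(), search_history() and hint_list are
stand-ins for the GRIS/Clarifai lookups and the labelled dataset, and are assumptions rather than the
authors' exact interfaces.

# Compact sketch of the selection procedure listed above. annotate() and
# search_history() are stand-ins for the image-annotation modules and the
# labelled dataset; they are assumptions, not the authors' exact interfaces.
def solve_image_challenge(hint, candidates, annotate, search_history, hint_list, needed=2):
    select, discard, undecided = [], [], list(candidates)   # all start as Undecided
    for image in list(undecided):
        # Step 3: history lookup first; Step 4: fall back to annotation tags.
        label = search_history(image)
        tags = [label] if label is not None else annotate(image)
        if hint in tags:                              # matches the hint -> Select
            select.append(image)
            undecided.remove(image)
        elif any(tag in hint_list for tag in tags):   # matches another hint -> Discard
            discard.append(image)
            undecided.remove(image)
    # Step 5: if there are too few confident picks, fill up from Undecided.
    while len(select) < needed and undecided:
        select.append(undecided.pop(0))
    return select
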

3.4. Influencing the Advanced Risk Analysis System
To influence the ARAS into serving the easiest challenge, a variety of actions on different components
were performed, yielding some surprising conclusions.

    1. Token - Browsing History:
       • Without Account:
               o No matter the network setup (TOR, university, etc) or geolocation, after the 9th
                   day from token creation, even without browsing, ARAS was neutralized and
                   provided a checkbox challenge.
       • With Account:
               o Tried different settings, with or without phone verification, with alternative
                   email from another provider. The result was getting a checkbox challenge after
                   60 days.
               o It is better not to use an account.
        • Token Harvesting:
                o Tested whether creating a large number of cookies from a single IP address is prohibited.
                o 63,000 cookies were created in a single day without getting blocked.
                o Tokens could be sold, enabling a harvesting attack.
    2. Browser Checks:
           a. Automation: the webdriver attribute, which indicates that an automation kit is driving
                the browser, was set to True, but it made no difference.
           b. Canvas Fingerprint² - User-Agent³:
                     i. If they do not match, the fallback (hardest) challenge is provided.
                    ii. If the User-Agent is outdated, the fallback challenge is provided.
                   iii. If the User-Agent is misformatted or does not contain complete information, the
                        fallback challenge is provided.
           c. Screen Resolution: a variety of resolutions was tested, from 1x1 to 4096x2160, but it
                made no difference.
           d. Mouse: movements were automated, multiple clicks were performed in the widget and
                even the getElementById().click() JavaScript function was used to simulate a click
                without hovering (see the sketch after this list), but it made no difference.
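
For completeness, the "click without hovering" experiment can be sketched as follows, using Selenium's
execute_script; the page URL is a placeholder and the element id follows the report's naming.

# Sketch of the "click without hovering" experiment: fire the click purely from
# JavaScript via Selenium's execute_script, so no mouse-move or hover events are
# generated. The page URL is a placeholder; the element id follows the report.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/page-with-recaptcha")   # placeholder URL
frame = driver.find_element(By.CSS_SELECTOR, "iframe[src*='recaptcha/api2/anchor']")
driver.switch_to.frame(frame)
# No hover or mouse-movement events precede this click, yet (per the report)
# the ARAS treated it no differently from a real mouse click.
driver.execute_script("document.getElementById('recaptcha-anchor').click();")
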

²
  HTML Canvas provided alongside the widget, not visible to the user, that collects information about the
user's browser.
³
  Attached to HTTP requests, containing information about the client, like browser version, extensions, etc.
3.5. Conclusions - Countermeasures
Based on the two attacks above, many guidelines and countermeasures were presented.

   •   Token Auctioning: The token verification API has an optional field for comparing the IP
       address of the user that solved the challenge with the one that submitted the token. It
       should be made mandatory, to prevent services from selling tokens obtained from the
       checkbox challenge.
   •   Risk Analysis:
           o Account:
                    • Requests should be considered valid only when they come from logged-in users;
                         users who are not logged in will have to solve the hardest challenge.
                    • Limit the number of tokens per IP address.
            o Cookie Reputation:
                    • Reputation should grow with the amount of browsing conducted.
                    • The number of cookies that can be created within a time period should be
                         regulated.
            o Browser Checks: take a stricter approach and return no challenge at all (i.e., deny the
                attempt) when the signals are overtly suspicious, e.g. a mismatch between the canvas
                fingerprint and the User-Agent.
   •   Image captcha attacks:
           o Solution:
                   • Increase the number of correct images.
                    • Change the range of how many images are correct.
                    • Remove the flexibility in what is accepted as a correct answer.
           o Repetition:
                   • When a challenge is shown, it should be removed from the pool.
                    • The pool of challenges should be larger.
           o Hint and Content:
                   • Hint should be removed.
                    • Providers can run experiments to find image categories that are problematic for
                         image annotation software.
                   • Populate challenges with filler images of the same category as solutions.
           o Advanced Semantic Relations:
                   • Instead of similar objects, the user could be asked to select semantically
                        related objects (tennis ball, racket, tennis court).
           o Adversarial Images:
                    • By altering a small number of pixels, images are misclassified by software while
                         remaining visually the same to humans.