Isolating JavaScript in Dynamic Code Environments


Antonis Krithinakis, Elias Athanasopoulos, Evangelos P. Markatos
Institute of Computer Science,
Foundation for Research and Technology - Hellas
{krithin,elathan,markatos}@ics.forth.gr

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
APLWACA '10 June, Toronto, Canada.
Copyright © 2010 ACM 978-1-60558-913-8/10/06...$10.00

Abstract

We analyze the source code of four well-known large web applications, namely WordPress, phpBB, phpMyAdmin and Drupal. We want to quantify the level of language intermixing in modern web applications and, if possible, to categorize all coding idioms that involve intermixing of JavaScript with a server-side programming language, like PHP. Our analysis processes more than half a million lines of code (LoCs) and identifies about 1,000 scripts. In 163 of these scripts the source code is mixed in a way that makes it hard to isolate JavaScript from PHP. We manually investigate all 163 scripts and derive a classification scheme of five distinct classes. Our analysis can be beneficial for all applications that apply operations in the client-side part of a web application, for various XSS mitigation schemes, as well as for code refactoring and optimization tools.

Categories and Subject Descriptors D.2.3 [Software Engineering]: Coding Tools and Techniques

General Terms Languages, Security

Keywords JavaScript, Web Security

1. Introduction

Cross-site scripting (XSS) is one of the most well-known techniques for compromising web applications, and it is considered a very popular exploitation technique nowadays [13, 14]. Many schemes for mitigating XSS attacks are based on isolating all scripts that are trusted [6–8, 15]. All such schemes require that all trusted scripts (e.g., JavaScript) be isolated from the untrusted ones (i.e., possible code injections). So far, the most prominent techniques for marking code are based on tainting, static analysis and flow tracking methods [11, 12, 16]. However, these strategies often experience dramatic computational overheads. On the other hand, modern web applications are based on frameworks that combine multiple server-side and client-side technologies for producing dynamic content. In these applications, isolating one technology from the other is not a trivial task. For example, consider a script written in PHP (a server-side language) which dynamically produces JavaScript source code. This is a code-mixing case where PHP and JavaScript are intermixed. The structure of the JavaScript source code is not complete prior to the execution of the PHP script.

To the best of our knowledge, there has been no systematic effort to identify how client-side and server-side languages intermix in modern web applications and how hard it is to isolate one from the other. In this paper we try to identify the level of intermixing between different programming languages in web applications that power real-world web sites. More precisely, we analyze the source code of four popular applications, namely phpBB, WordPress, phpMyAdmin and Drupal. Our findings suggest that all these web applications mix PHP and JavaScript. The way each framework generates dynamic content varies and can be considered the result of a series of coding idioms. However, according to our empirical study, all coding idioms can be classified in five major classes. For each of the five classes we present alternative coding styles in order to reduce the mixing. In short, we analyze more than half a million LoCs and identify fewer than 1,000 scripts, 163 of which mix JavaScript and PHP in a way that makes it hard to automatically isolate one from the other.

Benefits and applications. Automatically isolating the client-side part, which is most often expressed in JavaScript, from the rest of the code of the web application is vital for security schemes that a) consider all server-generated JavaScript trusted [7, 8], or b) apply operations to all server-generated JavaScript in order to isolate it from code injections [6]. All the analysis in this paper has been carried out during an attempt to apply Instruction Set Randomization [9, 10] to JavaScript. Furthermore, our analysis can be beneficial for all applications that perform operations on the JavaScript source corpus of a web application, for example refactoring, optimization, and code analysis tools.

Contribution. a) We perform an extensive analysis of the source code of four real-world web applications and identify all cases where JavaScript is intermixed with PHP, and b) we classify all cases in five main categories; for each case we provide an alternative strategy to reduce the mixing.

    Web App       LoCs      Scripts   Non-mixed   Mixed
    WordPress     143,791   187       136         51
    phpBB         213,681   539       512         27
    phpMyAdmin    178,447   263       183         80
    Drupal         44,780     8         3          5

Table 1. Summary of scripts that experience mixing in four real-world web applications.

2. Methodology

In this paper we aim at finding cases in the source of real-world web applications where JavaScript is mixed with a server-side programming language, and at isolating the JavaScript code. We process the source code of four web applications, which happen to use PHP. We assume that JavaScript and PHP are intermixed only in files and not in a database. A web application may store JavaScript in a database and generate content via SQL queries issued by PHP, in order to produce JavaScript code. This case is beyond the scope of this paper.

In order to isolate JavaScript we use the following methodology. For each web application we attempt to randomize all scripts contained in files included in the application's distribution. We have built a custom tool with two processing units. The first processing unit takes a file as input and attempts to identify all possible occurrences of JavaScript source code. It first removes PHP and HTML comments and then searches for JavaScript scripts, that is, all code inside a <script> tag, as well as all code in HTML events such as onclick, onload, etc. Notice that the identified code may contain PHP language elements. Finally, the mixed JavaScript source code is stored for further processing.

The second processing unit is a parser, based on the Mozilla SpiderMonkey JavaScript engine [4]. The original SpiderMonkey parser has been modified to randomize all identifier tokens as they are derived from the JavaScript syntax analysis. The randomization process follows ideas introduced by Keromytis et al. [9, 10], implemented specifically for JavaScript. For an example of the output of the randomization process please refer to Figure 1. For every source code stored in the previous phase, the parser is invoked to produce the randomized output. There are cases where the tool fails to randomize the script and produces a syntax error. This derives from the mixing between JavaScript and PHP. In other words, the parser is confused when it processes PHP language elements and tries to identify them as JavaScript language elements. Apart from this, a syntax error can be produced in other cases where PHP is not involved. This occurs because the input given to the parser may not be complete JavaScript code but a fragment. For example, the syntax analysis fails when a return statement exists outside of a function block. This mostly occurs in code from HTML events. Every event has a default action. When an event handler is also defined (like onclick), the event handler can return a boolean to tell the browser whether or not to take the default action. Another example is when the parser has to deal with escaped string quotes which come from PHP string variable content extractions. In our implementation, such cases are successfully identified without producing any error messages. An example of the parser input is given in Figure 2. Successfully identified cases do not mean that the input is not mixed. In many cases PHP language elements contained in a script pass the syntax checking. This occurs in JavaScript assignment expressions where the right value is a string constant containing PHP code. Whenever the tool fails to randomize a script we record the file and the LoC of the script and then we analyze each case manually. The randomization tool makes a best effort to isolate the JavaScript source. A failure indicates that manual intervention is needed for the isolation of the JavaScript code.

    var s = "Hello World!";
    document.getElementByName("welcome").text = s;

    var s0x78 = "Hello World!";
    document0x78.getElementByName0x78("welcome").text0x78 = s0x78;

Figure 1. A randomization example. All JavaScript identifiers get appended with a random key.

    .... onsubmit="return checkPassword(this)"> ....

    // Start Of Input
    return checkPassword(this)
    // End Of Input

Figure 2. Parser input example from phpMyAdmin.

We identify 163 scripts across the four web applications which the tool fails to randomize. We then investigate each of the 163 mixed scripts manually, create a taxonomy with five different classes and, finally, assign each script to the corresponding category. In the following section we describe each of the five categories. For the overall statistics of all web applications refer to Tables 1 and 2.

    Web Application   Files   With JavaScript   Pass   Fail
    WordPress         279     48                28     20
    phpBB             542     137               117    20
    phpMyAdmin        304     65                28     37
    Drupal            143     5                 1      4

Table 2. Statistics for files contained in each web application. The third column specifies how many files include scripts. The fifth column indicates how many files include at least one mixed case.

2.1 Web Applications

Our analysis includes four popular web applications. We now provide a short description of each one.

WordPress. A popular web application for constructing blogs. It is estimated that WordPress is used by over 202 million web sites worldwide [5].

phpBB. A web application for developing interactive forums. It is written in PHP and it supports multiple database management systems and unlimited levels of sub-forums [2].

phpMyAdmin. This web application is intended to handle the administration of MySQL databases over a web front-end. Through the web interface a user can create, modify, or delete databases, tables, etc. phpMyAdmin contains a significant amount of JavaScript intermixed with PHP [3].

Drupal. Drupal [1] is a content management system (CMS) written in PHP. It allows the system administrator to organize the content, automate administrative tasks and manage site visitors.

3. Classification

The analysis of the 163 scripts where JavaScript is mixed with PHP produces the following five categories. For each category we provide an example from one of the four web applications that take part in the analysis. The example may be slightly simplified for presentation reasons.

    Application   Scripts   Case 1   Case 2   Case 3   Case 4   Case 5
    WordPress     51        3        12       36       0        0
    phpBB         27        1        0        0        26       0
    phpMyAdmin    80        0        43       34       0        3
    Drupal        5         0        2        3        0        0

Table 3. Categorization of all mixed scripts in the four web applications.

Case 1. Partial injection of non-mixed JavaScript source using the PHP built-in function echo(). This is the case where non-mixed JavaScript source code is injected in the web document via PHP, by using the built-in function echo(). In Figure 3 we present an example of a JavaScript snippet which is injected using multiple calls to echo(). In such a case, isolating the JavaScript part is hard, since the parser's input has many extra quotes and echo() occurrences.

    /* Case 1. */
    echo "<script>\n";
    echo "document.write(...);"
    echo "</script>\n";

    /* Parser Input */
    // Start Of Input
    \n";
    echo "document.write(...);"
    echo "
    // End Of Input

Figure 3. Example for Case 1: Partial injection of non-mixed JavaScript source using the PHP built-in function echo().

Case 2. String concatenations. In this case the final JavaScript code is generated by the concatenation of string literals and values of PHP variables. All dynamically generated strings are printed using the PHP built-in function echo(). The difficulty here arises because both quotes (single and double) are used as part of a complex string concatenation operation. Isolating the JavaScript part is hard in this case, since both quoting styles are significant for both programming languages, JavaScript and PHP. In Figure 4 we present a string concatenation example.

    /* Case 2. */
    $actions['quickedit'] =
      'onclick=
    "commentReply.open(\'' . $post->ID . '\', \'edit\');"';

    /* Parser Input */
    // Start Of Input
    commentReply.open(\'' . $post->ID . '\', \'edit\');
    // End Of Input

Figure 4. Example for Case 2: String concatenation.

Case 3. Partial JavaScript code generation by PHP scripting blocks. This is the most frequently occurring case of JavaScript and PHP intermixing. In this case PHP scripting blocks are put inside JavaScript. These blocks contain calls to echo(). After PHP is invoked, these blocks are evaluated and the final JavaScript source is generated. The parser fails the syntax analysis at the time it consumes the PHP scripting block. An example of this case is in Figure 5.

    /* Case 3. */
    tinyMCEPreInit = {
      ...
      mceInit:{},
      ...
    };

Figure 5. Example for Case 3: Partial JavaScript code generation by PHP scripting blocks.

Case 4. JavaScript code generation by using frameworks' meta languages. In this case a framework uses a meta language to build JavaScript dynamically. This case exists only in phpBB. The framework generates dynamic content by transforming static HTML pages that contain the meta language elements. It first loads the page, uses patterns to locate all meta language elements and then substitutes them with PHP code as the meta language semantics dictate. Second, the generated PHP code must be evaluated by using the PHP function eval() to produce the final JavaScript code. In addition, the meta language is supposed to be more expressive than ordinary substitutions. It gives the programmer the opportunity to make easy conditional assignments. In Figure 6 we present an example which is used in order to assign a value to the lang and the index fields. In such a case, isolating the JavaScript code is hard, since the complete source code is not known in advance. The parser fails at the time it consumes the meta language elements.
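The locate-and-substitute step described for Case 4 can be sketched roughly as follows. This is an illustration only and not phpBB's code: the placeholder pattern, the renderTemplate function and the sample values are assumptions made for this sketch, and it substitutes final values directly instead of generating PHP code that is later evaluated with eval().

```javascript
// Hypothetical stand-in for a phpBB-style template pass; this is not phpBB's code.
// Assumption: placeholders look like {NAME}, built from upper-case letters,
// digits and underscores.
const PLACEHOLDER = /\{([A-Z0-9_]+)\}/g;

// Replace every known placeholder with its value and leave unknown ones untouched.
function renderTemplate(template, values) {
  return template.replace(PLACEHOLDER, (match, name) =>
    Object.prototype.hasOwnProperty.call(values, name) ? String(values[name]) : match
  );
}

// Template text modelled on the Case 4 example; the values are made up.
const template =
  "var RecaptchaOptions = { lang: {L_RECAPTCHA_LANG}, index: {CAPTCHA} };";
const values = { L_RECAPTCHA_LANG: "'en'", CAPTCHA: 10 };

const rendered = renderTemplate(template, values);
console.log(rendered);
// prints: var RecaptchaOptions = { lang: 'en', index: 10 };
```

The complete JavaScript text only exists after the substitution has run, which is why a tool that inspects the template file alone cannot see the final script.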

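The syntax failures behind this classification can be reproduced with any stock JavaScript engine. The sketch below is a rough approximation, not the modified SpiderMonkey parser used in the paper: it assumes a Node.js or browser runtime and uses the standard Function constructor purely as a syntax check. The helper name parsesAsJavaScript and the first two fragments are made up for illustration; the third fragment is the parser input shown for Case 2.

```javascript
// The Function constructor compiles its argument as a function body without
// running it, and throws SyntaxError when the text is not valid JavaScript.
function parsesAsJavaScript(source) {
  try {
    new Function(source);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// Hypothetical fragment: a PHP scripting block in expression position (Case 3 style).
const phpBlockInExpression = "var count = <?php echo $n; ?>;";

// Hypothetical fragment: PHP code inside a JavaScript string constant.
const phpInsideString = 'var title = "<?php echo $title; ?>";';

// Parser input of the Case 2 example: residue of a PHP string concatenation.
const concatenationResidue =
  String.raw`commentReply.open(\'' . $post->ID . '\', \'edit\');`;

console.log(parsesAsJavaScript(phpBlockInExpression)); // false
console.log(parsesAsJavaScript(phpInsideString));      // true
console.log(parsesAsJavaScript(concatenationResidue)); // false
```

The second fragment passes the check although it is still mixed, which matches the observation in Section 2 that a successful syntax analysis does not imply that the input is free of PHP.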
    /* Case 4. */
    var RecaptchaOptions = {
      lang : {L_RECAPTCHA_LANG},
      index :

          10
    };

Figure 6. Example for Case 4: JavaScript code generation by using frameworks' meta languages.

    /* Case 5. */
    onsubmit=
      "return
      (emptyFormElements(this, 'table')
      &amp;&amp; checkFormElementInRange(
      this, 'num_fields', ...);"

Figure 7. Example for Case 5: Markup injections.

    /* Case 1 Alternative Idiom. */
    ... ?>
    <script type='text/javascript'>
    document.write(...);
    </script>

    /* Case 3 Alternative Idiom. */
    tinyMCEPreInit = {
      ...
      mceInit: ,

      ...
    };

Figure 10. Alternative approach for case 3.

    /* Case 4 Alternative Idiom. */

    var RecaptchaOptions = {
      lang : {L_RECAPTCHA_LANG}, index : {CAPTCHA}
    };

    var RecaptchaOptions = {
      lang : {L_RECAPTCHA_LANG}, index : 10
    };

Figure 11. Alternative approach for case 4.
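The alternative idioms of Figures 10 and 11 rely on a parser that accepts PHP scripting blocks and meta language elements as JavaScript identifiers and that tolerates HTML entities. A rough way to approximate this idea without modifying a parser is a preprocessing pass, sketched below. This is a swapped-in technique and not the authors' SpiderMonkey extension: the function names, the placeholder identifier scheme and the sample fragments are assumptions made for this sketch, and the entity &amp;amp; is decoded to & rather than consumed and ignored.

```javascript
// Syntax check only: compiles the text as a function body without running it.
function parsesAsJavaScript(source) {
  try {
    new Function(source);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// Replace foreign elements with plain identifiers so that a standard
// JavaScript parser can process the fragment.
function neutralizeForParsing(fragment) {
  let phpBlocks = 0;
  return fragment
    // PHP scripting blocks become numbered placeholder identifiers.
    .replace(/<\?php[\s\S]*?\?>/g, () => `__php_block_${phpBlocks++}__`)
    // {NAME} meta language elements become identifiers derived from NAME.
    .replace(/\{([A-Z0-9_]+)\}/g, (match, name) => `__tpl_${name}__`)
    // The HTML entity for the ampersand is decoded back to "&".
    .replace(/&amp;/g, "&");
}

// Hypothetical fragments modelled on Cases 3, 4 and 5.
const case3 = "tinyMCEPreInit = { mceInit: <?php echo $init; ?> };";
const case4 =
  "var RecaptchaOptions = { lang: {L_RECAPTCHA_LANG}, index: {CAPTCHA} };";
const case5 =
  "return (emptyFormElements(this, 'table') &amp;&amp; " +
  "checkFormElementInRange(this, 'num_fields'));";

console.log(parsesAsJavaScript(case3));                       // false
console.log(parsesAsJavaScript(neutralizeForParsing(case3))); // true
console.log(parsesAsJavaScript(neutralizeForParsing(case4))); // true
console.log(parsesAsJavaScript(case5));                       // false
console.log(parsesAsJavaScript(neutralizeForParsing(case5))); // true
```

A pass of this kind only makes the fragment parseable; the placeholder identifiers would still have to be mapped back to the original PHP blocks and meta language elements after any transformation of the JavaScript.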
the analysis fails again. These cases can be addressed by code rewriting. The alternative code for this case is depicted in Figure 10. Now all the PHP scripting blocks are identified as JavaScript identifiers.

Case 4. During the syntax analysis we use the same patterns phpBB uses to identify the meta language elements. After a successful identification the parser handles them as JavaScript identifiers, as above. This approach works in cases where the meta language elements are substituted directly. In cases where the meta language is quite complicated we propose an alternative way of writing the code that makes it simpler. This strategy makes code maintenance more difficult but helps the parser to successfully isolate the JavaScript. The alternative code is depicted in Figure 11, where the meta language elements, {L_RECAPTCHA_LANG} and {CAPTCHA}, are treated as JavaScript identifiers.

Case 5. In this last case the parser is extended to recognize HTML entities (like &amp;) and simply consume and ignore them in the syntax analysis.

Results. In cases where the failed instances are few, like in Case 1 and in Case 5, we rewrite the code as proposed. In addition we extend the parser semantics to successfully handle Case 3 and Case 4. We then rerun the extended parser with the methodology described in Section 2 and further investigate the cases that still fail. In Figure 12 we plot the overall script occurrences we recorded for each of the five different cases before and after our intervention. In Case 1 and Case 5 all failed cases are eliminated. In Case 3 and Case 4, the parser extensions strongly reduce the failure rates. In Case 2 the failed cases remain because neither code rewriting nor extensions are applied in the second run.

5. Conclusion

In this paper we present a systematic analysis of the source code of four large web applications. The analysis aims at identifying possible cases where JavaScript is intermixed with PHP. During our analysis we process more than half a million LoCs. We identify about 1,000 scripts, 163 of which intermix JavaScript with PHP. We manually investigate all 163 scripts and create a classification scheme of five distinct classes. Moreover, we try to address failed cases and reduce the mixing by proposing alternative code idioms or extensions to the processing tool, and we analyze the results. Our analysis can be beneficial for all applications that apply operations in the client-side part of a web application and for various XSS mitigation schemes.

Acknowledgements. Elias Athanasopoulos, Antonis Krithinakis and Evangelos P. Markatos are also with the University of Crete. Elias Athanasopoulos is also funded by the Microsoft Research PhD Scholarship project, which is provided by Microsoft Research Cambridge.

References

[1] Drupal: An open source content management platform. http://drupal.org/.
[2] phpBB: One of the most widely used Open Source forum solution. http://www.phpbb.com/.
[3] phpMyAdmin: Application that handles the administration of MySQL over the World Wide Web. http://www.phpmyadmin.net/.
[4] SpiderMonkey (JavaScript-C) Engine. http://www.mozilla.org/js/spidermonkey/.
[5] WordPress Usage: 202 Million Worldwide 62.8 Million US. http://andrewapeterson.com/2009/09/wordpress-usage-202-million-worldwide-62-8-million-us/.
[6] E. Athanasopoulos, V. Pappas, A. Krithinakis, S. Ligouras, and E. P. Markatos. xJS: Practical XSS Prevention for Web Application Development. In Proceedings of the 1st USENIX WebApps Conference, Boston, US, June 2010.
[7] M. V. Gundy and H. Chen. Noncespaces: Using Randomization to Enforce Information Flow Tracking and Thwart Cross-Site Scripting Attacks. In Proceedings of the 16th Annual Network and Distributed System Security Symposium (NDSS).
[8] T. Jim, N. Swamy, and M. Hicks. Defeating Script Injection Attacks with Browser-Enforced Embedded Policies. In WWW '07: Proceedings of the 16th international conference on World Wide Web.
[9] G. Kc, A. Keromytis, and V. Prevelakis. Countering Code-Injection Attacks with Instruction-Set Randomization. In Proceedings of the 10th ACM conference on Computer and Communications Security.
[10] A. D. Keromytis. Randomized Instruction Sets and Runtime Environments Past Research and Future Directions. IEEE Security and Privacy, 2009.
[11] L. C. Lam and T.-c. Chiueh. A General Dynamic Information Flow Tracking Framework for Security Applications. In ACSAC '06: Proceedings of the 22nd Annual Computer Security Applications Conference.
[12] J. Newsome and D. Song. Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software. In Proceeding of the 13th Annual Network and Distributed System Security Symposium (NDSS), 2005.
[13] SANS Institute. The Top Cyber Security Risks. September 2009. http://www.sans.org/top-cyber-security-risks/.
[14] Symantec Corp. Symantec Internet Security Threat Report: Trends for July-December 2007 (Executive Summary). April 2008. 1-3. Retrieved on 2008-05-11.
[15] M. Ter Louw, P. Bisht, and V. Venkatakrishnan. Analysis of hypertext isolation techniques for XSS prevention. In Web 2.0 Security and Privacy 2008, May 2008.
[16] P. Vogt, F. Nentwich, N. Jovanovic, E. Kirda, C. Kruegel, and G. Vigna. Cross-Site Scripting Prevention with Dynamic Data Tainting and Static Analysis. In Proceeding of the 14th Annual Network and Distributed System Security Symposium (NDSS), 2007.
