Isolating JavaScript in Dynamic Code Environments
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Isolating JavaScript in Dynamic Code Environments
Antonis Krithinakis Elias Athanasopoulos Evangelos P. Markatos
Institute of Computer Science,
Foundation for Research and Technology - Hellas
{krithin,elathan,markatos}@ics.forth.gr
Abstract modern web applications are based on frameworks that com-
bine multiple server-side and client-side technologies for
We analyze the source code of four well-known large web
producing dynamic content. In these applications, isolating
applications, namely WordPress, phpBB, phpMyAdmin and
one technology from the other is not considered a trivial task.
Drupal. We want to quantify the level of language intermix-
For example, consider a script written in PHP (a server-side
ing in modern web applications and, if possible, we want
language) which dynamically produces JavaScript source
to categorize all coding idioms that involve intermixing of
code. This is a code-mixing case where PHP and JavaScript
JavaScript with a server-side programming language, like
are intermixed. The JavaScript source code structure is not
PHP. Our analysis processes more than half of a million of
complete prior the execution of the PHP script.
LoCs and identifies about 1,000 scripts. These scripts con-
To the best of our knowledge, there has been no system-
tain 163 cases, where the source code is mixed in a way that
atic effort for identifying how client-side and server-side lan-
is hard to isolate JavaScript from PHP. We manually inves-
guages intermix together in modern web applications and
tigate all 163 scripts and proceed in a classification scheme
how hard is to isolate the one from the other. In this paper
of five distinct classes. Our analysis can be beneficial for all
we try to identify the level of intermixing between differ-
applications that apply operations in the client-side part of a
ent programming languages in web applications that enable
web application, various XSS mitigation schemes, as well as
real-world web sites. More precisely, we analyze the source
code refactoring and optimization tools.
code of four popular applications, namely phpBB, Word-
Categories and Subject Descriptors D.2.3 [Software En- Press, phpMyAdmin and Drupal. Our findings suggest that
gineering]: Coding Tools and Techniques all these web applications are experiencing mixing of PHP
General Terms Languages, Security and JavaScript. The way that each framework generates dy-
namic content varies and it can be considered as a result of
Keywords JavaScript, Web Security a series of coding idioms. However, according to our empir-
1. Introduction ical study, all coding idioms can be classified in five major
classes. For each of the five classes we proceed and present
Cross-site scripting (XSS) is one of the most well known
alternative coding styles in order to reduce the mixing. In
techniques for compromising web applications. It is consid-
short, we analyze more than half of a million LoCs, iden-
ered a very popular exploitation technique nowadays [13,
tify less than 1,000 scripts which include about 163 scripts
14]. Many schemes for mitigating XSS attacks are based
where JavaScript and PHP are mixed in a way that is hard to
on isolating all scripts that are trusted [6–8, 15]. All such
automatically isolate one from the other.
schemes require that all trusted scripts (e.g., JavaScript) must
Benefits and applications. Automatically isolating the
be isolated from the untrusted ones (i.e. possible code injec-
client-side part, which is most often expressed in JavaScript,
tions). So far, the most prominent techniques for marking
from the rest of the code of the web application is vital
code are based on tainting, static analysis and flow tracking
for security schemes that a) consider all server-generated
methods [11, 12, 16]. However, these strategies experience
JavaScript trusted [7, 8], or b) apply operations to all server-
often dramatic computational overheads. On the other hand,
generated JavaScript in order to isolate it from code injec-
tions [6]. All the analysis of this paper has been carried out
during an attempt for applying Instruction Set Randomiza-
Permission to make digital or hard copies of all or part of this work for personal or tion [9, 10] to JavaScript. Furthermore, our analysis can be
classroom use is granted without fee provided that copies are not made or distributed beneficial for all applications that perform operations in the
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. To copy otherwise, to republish, to post on servers or to redistribute JavaScript source corpus of a web application. For example,
to lists, requires prior specific permission and/or a fee. consider refactoring, optimization, and code analysis tools.
APLWACA ’10 June, Toronto, Canada.
Copyright c 2010 ACM 978-1-60558-913-8/10/06. . . $10.00
45Contribution. a) We perform an extensive analysis in for JavaScript. For an example of the randomization process
the source code of four real-world web applications and we output please refer to Figure 1. For every stored source code
identify all cases where JavaScript is intermixed with PHP. from the previous phase, the parser is invoked to produce
and b) we classify all cases in five main categories; for each the randomized output. Apparently there are cases the tool
case we provide an alternative strategy to reduce the mixing. fails to randomize the script and produces a syntax error.
This derives from the mixing between JavaScript and PHP.
Web App LoCs Scripts Non-mixed Mixed In other words the parser is confused when processing PHP
WordPress 143,791 187 136 51 language elements and tries to identify them as JavaScript
phpBB 213,681 539 512 27 language elements. Apart from this, a syntax error can be
phpMyAdmin 178,447 263 183 80 produced in other cases where PHP is not involved. This
Drupal 44,780 8 3 5 occurs because the input the parser is given, may not be a
complete JavaScript code but a fragment. For example the
Table 1. Summary of scripts that experience mixing in four syntax analysis fails when a return statement exists out of
real-world web applications. a function block. This mostly occurs in code from HTML
events. Every event has a default action. When an event han-
2. Methodology dler is also defined (like onclick), the event handler can
return a boolean to tell the browser whether or not to take
In this paper we aim on finding cases in the source of real-
the default action. Another example is when the parser has
world web applications where JavaScript is mixed with a
to deal with escaped string quotes which come from PHP
server-side programming language and isolate the JavaScript
string variable content extractions. In our implementation,
code. We process the source code of four web applications,
such cases are successfully identified without producing any
which happen to use PHP. We assume that JavaScript and
error messages. An example of the parser input is given in
PHP are intermixed only in files and not in a database. A web
Figure 2. Successfully identified cases do not mean that the
application may store JavaScript in a database and generate
input is not mixed. In many cases PHP language elements
content via SQL queries issued by PHP, in order to produce
contained in a script pass the syntax checking. This occurs
JavaScript code. This case is beyond the scope of this paper.
in JavaScript assignment expressions where the right value
In order to isolate JavaScript we use the following method-
is a string constant containing PHP code. Whenever the tool
ology. For each web application we attempt to randomize all
fails to randomize a script we record the file and the LoC
scripts contained in files included in the application’s distri-
of the script and then we analyze each case manually. The
bution. We have built a custom tool which has the following
randomization tool performs a best effort for isolating the
functions. The first processing unit takes as input a file and
JavaScript source. A failure indicates that manual interven-
attempts to identify all possible JavaScript source code oc-
tion is needed for the isolation of the JavaScript code.
currences. It first removes PHP and HTML comments and
We manage to identify 163 scripts in all four web appli-
then searches for JavaScript scripts. That is, all code inside
cations in which the tool fails to randomize them. We further
a tag, as well as all code in HTML events such as
proceed and investigate each of the 163 mixed scripts man-
onclick, onload, etc. Notice the identified code may con-
ually. We manage to create a taxonomy with five different
tain PHP language elements. Finally, the mixed JavaScript
classes. We, finally, assign each script to the corresponding
source code is stored for further processing.
category. In the following section we describe each of the
The second processing unit is a parser, based on the
five categories. For the overall statistics of all web applica-
Mozilla SpiderMonkey JavaScript engine [4]. The origi-
tions refer to Table 1 and 2.
nal spider monkey parser has been modified to randomize
all identifier tokens as they derive from the JavaScript syn- 2.1 Web Applications
tax analysis. The randomization process follows ideas intro- Our analysis includes four popular web applications. We
duced by Keromytis et al. [9, 10], implemented specifically now provide a short description for each one of these.
WordPress. A popular web application for constructing
Web Application Files With JavaScript Pass Fail
WordPress 279 48 28 20 1
phpBB 542 137 117 20 2 v a r s = ” H e l l o World ! ” ;
3 d o c u me n t . getElementByName ( ” welcome ” ) . t e x t = s ;
phpMyAdmin 304 65 28 37 4
Drupal 143 5 1 4 5
6 v a r s 0 x 7 8 = ” H e l l o World ! ” ;
7 d o c u me n t 0 x 7 8 . getElementByName0x78 ( ” welcome ” )
Table 2. Statistics for files contained in each web appli- 8 . t e x t 0 x 7 8 = s0x78 ;
cation. The third column specifies how many files include
scripts. The fifth column indicates how many files include at Figure 1. A randomization example. All JavaScript identi-
least one mixed case. fiers get appended with a random key.
46blogs. It is estimated that WordPress is used by over 202 1 / ∗ Case 1 . ∗ /
million web sites worldwide [5]. 2 e c h o ”\n ” ;
3 e c h o ” d o c u me n t . w r i t e ( . . . ) ; ”
phpBB. A web application for developing interactive fo- 4 e c h o ”\n ” ;
rums. It is written in PHP and it supports multiple database 5
6 /∗ Parser In p u t ∗/
management systems and unlimited levels of sub-forums [2]. 7 / / S t a r t Of I n p u t
phpMyAdmin. This web application is intended to handle 8 \n ” ;
9 e c h o ” d o c u me n t . w r i t e ( . . . ) ; ”
the administration of MySQL databases over a web front- 10 echo ”
end. Through the web interface a user can create, modify, or 11 / / End Of I n p u t
delete databases, tables, etc. phpMyAdmin contains a sig-
nificant amount of JavaScript intermixed with PHP [3]. Figure 3. Example for Case 1: Partial injection of non-
Drupal. Drupal [1] is a content management system (CMS) mixed JavaScript source using the PHP built-in function
written in PHP. It allows the system administrator to orga- echo().
nize the content, automate administrative tasks and manage 1 / ∗ Case 2 . ∗ /
site visitors. 2 $actions [ ’ quickedit ’ ] =
3 ’ onclick =
4 ” commentReply . open (\ ’ ’ . $ p o s t−>ID . ’ \ ’ , \ ’ e d i t \ ’ ) ; ” ’ ;
3. Classification 5
6 /∗ Parser In p u t ∗/
The analysis of 163 different scripts, where JavaScript is 7 / / S t a r t Of I n p u t
mixed with PHP, produces the following five categories. For 8 commentReply . open (\ ’ ’ . $ p o s t−>ID . ’ \ ’ , \ ’ e d i t \ ’ ) ;
9 / / End Of I n p u t
each category we provide an example from one of the four
web applications that take part in the analysis. The example Figure 4. Example for Case 2: String concatenation.
may be slightly simplified for presentation reasons.
Case 1. Partial injection of non-mixed JavaScript source us- 1 / ∗ Case 3 . ∗ /
2 tinyMCEPreInit = {
ing the PHP built-in function echo(). This is the case where 3 ...
non-mixed JavaScript source code is injected in the web doc- 4 m c e I n i t :{},
5 ...
ument via PHP, by using the built-in function echo(). We 6 };
present a example of a JavaScript snippet, which is injected
using multiple calls to echo() in Figure 3. In such a case, Figure 5. Example for Case 3: Partial JavaScript code gen-
isolating the JavaScript part is hard, since the parser’s input eration by PHP scripting blocks.
has many extra quotes and echo() occurrences. meta languages. In this case a framework uses a meta lan-
Case 2. String concatenations. In this case the final JavaScript guage to build JavaScript dynamically. This case exists only
code is generated by the concatenation of string literals and in phpBB. The framework generates dynamic content by
values of PHP variables. All dynamic generated strings are transforming static HTML pages containing the meta lan-
printed using the PHP built-in function echo(). The diffi- guages elements. It loads the page, uses patterns to locate
culty here occurs because both quotes (single and double) all meta language elements and then substitutes them with
are used as part of a complex string concatenation opera- PHP code as the meta language semantics order. Secondly,
tion. Isolating the JavaScript part is hard in this case, since the generated PHP code must be evaluated by using PHP
both quoting styles are significant for both programming function eval() to produce the final JavaScript code. In ad-
languages, JavaScript and PHP. In Figure 4 we present a dition, the meta language is supposed to be more expressive
string concatenation example. than ordinary substitutions. It gives the programmer the op-
Case 3. Partial JavaScript code generation by PHP script- portunity to make easy conditional assignments. In Figure 6
ing blocks. This is the most frequently occurring case of we present an example which is used in order to assign a
JavaScript and PHP intermixing. In this case PHP scripting value in the lang and the index field. In such a case, iso-
blocks are put inside JavaScript. These scripts contain calls lating the JavaScript code is hard, since the complete source
of echo(). After PHP is invoked, these blocks are evaluated code is not known in advance. The parser fails at the time it
and the final JavaScript source is generated. The parser fails consumes the meta language elements.
the syntax analysis at the time it consumes the PHP scripting
block. An example of this case is in Figure 5.
Case 4. JavaScript code generation by using frameworks’ Application Scripts C1 C2 C3 C4 C5
WordPress 51 3 12 36 0 0
phpBB 27 1 0 0 26 0
1 phpMyAdmin 80 0 43 34 0 3
2 . . . . o n s u b m i t =” r e t u r n c h e c k P a s s w o r d ( t h i s ) ”> . . . .
3 Drupal 5 0 2 3 0 0
4 / / S t a r t Of I n p u t
5 r e t u r n checkPassword ( t h i s ) Table 3. Categorization of all mixed scripts in the four web
6 / / End Of I n p u t
applications.
Figure 2. Parser input example from phpMyAdmin.
471 / ∗ Case 4 . ∗ / 1 / ∗ Case 3 A l t e r n a t i v e Id i o m . ∗ /
2 var RecaptchaOptions = { 2 tinyMCEPreInit = {
3 l a n g : {L RECAPTCHA LANG} , 3 ...
4 index : 4 m c e I n i t : ,
6 10 6 ...
7 }; 7 };
Figure 6. Example for Case 4: JavaScript code generation Figure 10. Alternative approach for case 3.
by using frameworks’ meta languages.
1 / ∗ Case 4 A l t e r n a t i v e Id i o m . ∗ /
2
1 / ∗ Case 5 . ∗ /
3 var RecaptchaOptions = {
2 onsubmit=
4 l a n g : {L RECAPTCHA LANG} , i n d e x : {CAPTCHA}
3 ” return
5 };
4 ( e mp t y Fo rmE l e me n t s ( t h i s , ’ t a b l e ’ )
6
5 && ; c h e c k Fo rmE l e me n t In R a n g e (
7 var RecaptchaOptions = {
6 this , ’ num fields ’ , . . . ) ; ”
8 l a n g : {L RECAPTCHA LANG} , i n d e x : 10
9 };
Figure 7. Example for Case 5: Markup injections. 10
1 / ∗ Case 1 A l t e r n a t i v e Id i o m . ∗ / Figure 11. Alternative approach for case 4.
2 . . . ?>
3 < s c r i p t t y p e = ’ t e x t / j a v a s c r i p t ’>
4 d o c u me n t . w r i t e ( . . . ) ;
5
6the analysis fails again. These cases can be addressed by //drupal.org/.
code rewriting. The alternative code for this case is depicted [2] phpBB: One of the most widely used Open Source forum
in Figure 10. Now all the PHP scripting blocks are identified solution. http://www.phpbb.com/.
as JavaScript identifiers. [3] phpMyAdmin: Application that handles the administration
Case 4. During the syntax analysis we use the same pat- of MySQL over the World Wide Web. http://www.
terns phpBB uses to identify the meta language elements. phpmyadmin.net/.
After a successful identification the parser handles them as [4] SpiderMonkey (JavaScript-C) Engine. http://www.
JavaScript identifiers, as above. In cases where the meta lan- mozilla.org/js/spidermonkey/.
guage elements are straight substituted this approach works. [5] WordPress Usage: 202 Million Worldwide 62.8 Mil-
In cases where the meta language is quite complicated we lion US. http://andrewapeterson.com/2009/09/
propose an alternative writing to make it more simple. This wordpress-usage-202-million-worldwide-62-8-million-us/.
strategy makes code maintenance more difficult but helps the [6] E. Athanasopoulos, V. Pappas, A. Krithinakis, S. Ligouras,
parser to successfully isolate the JavaScript. The alternative and E. P. Markatos. xJS: Practical XSS Prevention for Web
code is depicted in Figure 11, where the meta language el- Application Development. In Proceedings of the 1st USENIX
ements, {L RECAPTCHA LANG} and {CAPTCHA}, are treated WebApps Conference, Boston, US, June 2010.
as JavaScript identifiers. [7] M. V. Gundy and H. Chen. Noncespaces: Using Randomiza-
Case 5. In this last case the parser is extended to recognize tion to Enforce Information Flow Tracking and Thwart Cross-
HTML entities (like &) and simply consume and ignore Site Scripting Attacks. In Proceedings of the 16th Annual Net-
them in the syntax analysis. work and Distributed System Security Symposium (NDSS).
Results. In cases where the failed instances are few, like in [8] T. Jim, N. Swamy, and M. Hicks. Defeating Script Injec-
Case 1 and in Case 5, we rewrite the code as proposed. In ad- tion Attacks with Browser-Enforced Embedded Policies. In
dition we extend the parser semantics to successfully handle WWW ’07: Proceedings of the 16th international conference
Case 3 and Case 4. We then rerun the extended parser with on World Wide Web.
the methodology described in Section 2. We further investi- [9] G. Kc, A. Keromytis, and V. Prevelakis. Countering Code-
gate the current failed cases. In Figure 12 we plot the overall Injection Attacks with Instruction-Set Randomization. In Pro-
script occurrences we recorded for each of the five differ- ceedings of the 10th ACM conference on Computer and Com-
ent cases before and after our interference. In Case 1 and munications Security.
Case 5 all failed cases are eliminated. In Case 3 and Case 4, [10] A. D. Keromytis. Randomized Instruction Sets and Runtime
the parser extensions managed to strongly reduce the failing Environments Past Research and Future Directions. IEEE
rates. In Case 2 the failed cases remain because neither code Security and Privacy, 2009.
rewriting nor extensions are applied in the second run. [11] L. C. Lam and T.-c. Chiueh. A General Dynamic Information
Flow Tracking Framework for Security Applications. In AC-
5. Conclusion SAC ’06: Proceedings of the 22nd Annual Computer Security
In this paper we present a systematic analysis in the source Applications Conference.
code of four large web applications. The analysis aims on [12] J. Newsome and D. Song. Dynamic Taint Analysis for Au-
identifying possible cases where JavaScript is intermixed tomatic Detection, Analysis, and Signature Generation of Ex-
with PHP. During our analysis we process more than half of ploits on Commodity Software. In Proceeding of the 13th
a million of LoCs. We identify about 1,000 scripts, which Annual Network and Distributed System Security Symposium
contain 163 scripts, where JavaScript is intermixed with (NDSS), 2005.
PHP. We manually investigate all 163 scripts and create a [13] SANS Insitute. The Top Cyber Security Risks.
classification scheme of five distinct classes. Moreover, we September 2009. http://www.sans.org/
try to address failed cases and reduce the mix by proposing top-cyber-security-risks/.
alternative code idioms or extensions to the processing tool. [14] Symantec Corp. April 2008. 1-3. Retrieved on 2008-05-11.
We further analyze the results. Our analysis can be benefi- Symantec Internet Security Threat Report: Trends for July-
December 2007 (Executive Summary).
cial for all applications that apply operations in the client-
side part of a web application and various XSS mitigation [15] M. Ter Louw, P. Bisht, and V. Venkatakrishnan. Analysis of
schemes. hypertext isolation techniques for XSS prevention. In Web 2.0
Security and Privacy 2008, May 2008.
Acknowledgements. Elias Athanasopoulos, Antonis Krithi-
nakis and Evangelos P. Markatos are also with the University of [16] P. Vogt, F. Nentwich, N. Jovanovic, E. Kirda, C. Kruegel, and
Crete. Elias Athanasopoulos is also funded by the Microsoft Re- G. Vigna. Cross-Site Scripting Prevention with Dynamic Data
Tainting and Static Analysis. In Proceeding of the 14th Annual
search PhD Scholarship project, which is provided by Microsoft
Network and Distributed System Security Symposium (NDSS),
Research Cambridge.
2007.
References
[1] Drupal: An open source content management platform. http:
49You can also read