Oracle-Guided Component-Based Program Synthesis

Susmit Jha, UC Berkeley, jha@eecs.berkeley.edu
Sumit Gulwani, Microsoft Research, sumitg@microsoft.com
Sanjit A. Seshia, UC Berkeley, sseshia@eecs.berkeley.edu
Ashish Tiwari, SRI International, tiwari@csl.sri.com

ABSTRACT
We present a novel approach to automatic synthesis of loop-free programs. The approach is based on a combination of oracle-guided learning from examples and constraint-based synthesis from components using satisfiability modulo theories (SMT) solvers. Our approach is suitable for many applications, including as an aid to program understanding tasks such as deobfuscating malware. We demonstrate the efficiency and effectiveness of our approach by synthesizing bit-manipulating programs and by deobfuscating programs.

Categories and Subject Descriptors
D.1.2 [Programming Techniques]: Automatic Programming; I.2.2 [Artificial Intelligence]: Program Synthesis; K.3.2 [Learning]: Concept Learning

Keywords
Program synthesis, Oracle-based learning, SMT, SAT

1. INTRODUCTION
Automatic synthesis of programs has long been one of the holy grails of software engineering. It has found many practical applications: generating optimal code sequences [20, 11], optimizing performance-critical inner loops, generating general-purpose peephole optimizers [2, 3], automating repetitive programming tasks [15], and filling in low-level details after the higher-level intent has been expressed [24]. Two applications of synthesis are of particular interest in this paper. The first is that of automating the discovery of non-intuitive algorithms (e.g., [8]). The second application, as we show in this paper, is program understanding, and more specifically, program deobfuscation. The need for deobfuscation techniques has arisen in recent years, especially due to an increase in the amount of malicious, and mostly obfuscated, code (malware) [28]. Currently, human experts use decompilers and manually deobfuscate the resulting code (see, e.g., [22]). Clearly, this is a tedious task that could benefit from automated tool support.

A traditional view of program synthesis is that of synthesis from complete specifications. One approach is to give a specification as a formula in a suitable logic [19, 26, 10, 8]. Another is to write the specification as a simpler, but possibly far less efficient program [20, 11, 24]. While these approaches have the advantage of completeness of specification, such specifications are often unavailable, difficult to write, or expensive to check against using automated verification techniques. In this paper, we propose a novel oracle-guided approach to program synthesis, where an I/O oracle that maps a given program input to the desired output is used as an alternative to having a complete specification. The key idea of our algorithm is to query the I/O oracle on an input that can distinguish between non-equivalent programs that are consistent with the past interaction with the I/O oracle. The process is repeated until a semantically unique program is obtained. Our experimental results show that only a few rounds of interaction are needed.

We apply the oracle-guided approach to automated synthesis of loop-free programs, those that compute functions of their input and terminate. Such programs arise in a variety of application contexts, such as low-level bit-manipulating code, scientific computing kernels, parts of control software in graphical languages such as LabVIEW, and even applications in high-level scripting languages such as JavaScript and Ruby that are formed by chaining multiple high-level operators. A key characteristic of our method is that it is component-based, meaning that we synthesize a program by performing a circuit-style, loop-free composition of components drawn from a given component library. We can also address the challenge of identifying whether the given set of components is insufficient to synthesize the desired program. For this purpose, we additionally require making only one query to a more expensive validation oracle that checks whether the program is correct or not.

Our synthesis algorithm is based on a novel constraint-based approach that reduces the synthesis problem to that of solving two kinds of constraints: the I/O-behavioral constraint, whose solution yields a candidate program consistent with the interaction with the I/O oracle, and the distinguishing constraint, whose solution provides the input that distinguishes between non-equivalent candidate programs. These constraints can be solved using off-the-shelf SMT (Satisfiability Modulo Theories) solvers. Traditional synthesis algorithms perform an expensive combinatorial search over the space of all possible programs. In contrast, our technique leaves the inherent exponential nature of the problem to the underlying SMT solver, whose engineering advances over the years allow it to deal effectively with problem instances that arise in practice, which are usually not hard, and hence end up not requiring exponential reasoning.
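The interaction loop described above (pick a candidate consistent with the examples seen so far, look for an input on which another consistent candidate disagrees, ask the I/O oracle about that input, repeat) can be illustrated with a small sketch. This is not the SMT-based algorithm of the paper: it substitutes brute-force enumeration over a tiny, hypothetical candidate space of two-operator programs on 4-bit vectors, and every name in it (`UNARY`, `BINARY`, `candidates`, `synthesize`) is an assumption made for illustration.

```python
from itertools import product

WIDTH = 4                      # assumed bit-vector width for the toy example
MASK = (1 << WIDTH) - 1
INPUTS = range(1 << WIDTH)     # small enough to enumerate exhaustively

# Hypothetical component library (not the paper's benchmark library).
UNARY = {
    "dec": lambda a: (a - 1) & MASK,
    "inc": lambda a: (a + 1) & MASK,
    "neg": lambda a: (-a) & MASK,
    "not": lambda a: ~a & MASK,
}
BINARY = {
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
    "xor": lambda a, b: a ^ b,
}

def candidates():
    """Enumerate every program of the shape binary(unary(x), x)."""
    for (u, fu), (b, fb) in product(UNARY.items(), BINARY.items()):
        yield f"{b}({u}(x), x)", (lambda x, fu=fu, fb=fb: fb(fu(x), x))

def synthesize(io_oracle):
    """Oracle-guided loop; enumeration stands in for the two SMT queries."""
    examples = {}              # input -> output pairs obtained from the oracle
    while True:
        # Stand-in for the I/O-behavioral constraint: keep consistent candidates.
        consistent = [(name, f) for name, f in candidates()
                      if all(f(a) == b for a, b in examples.items())]
        if not consistent:
            return None, examples      # the components are insufficient
        name, prog = consistent[0]
        # Stand-in for the distinguishing constraint: find an input on which
        # some other consistent candidate disagrees with the chosen one.
        distinguishing = next(
            (x for x in INPUTS
             for _, other in consistent[1:] if other(x) != prog(x)),
            None,
        )
        if distinguishing is None:
            return name, examples      # all consistent candidates are equivalent
        examples[distinguishing] = io_oracle(distinguishing)

# The oracle here is a plain function that turns off the rightmost 1 bit.
name, examples = synthesize(lambda x: x & (x - 1))
print(name, len(examples))
```

Each round adds an input on which at least two surviving candidates disagree, so at least one candidate is eliminated per oracle query and the loop terminates on this finite space.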
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICSE '10, May 2-8 2010, Cape Town, South Africa
Copyright 2010 ACM 978-1-60558-719-6/10/05 ...$10.00.

Contributions and Organization.
 • We propose a novel oracle-guided approach to synthesis, where an I/O oracle obviates the need for complete specifications. Our approach has interesting connections to results from computational learning theory (Section 5).
 • We present an instantiation of the oracle-guided approach to synthesis of loop-free programs over a given set of components (see problem definition in Section 3). This is enabled by a novel constraint-based technique that involves
   an interaction between SMT solvers and the I/O oracle (Section 4). We also present an interesting optimization that leverages biased sampling (Section 4.6).
 • We demonstrate the utility of our synthesis technique for the discovery of bit-manipulating programs [29], which are often needed for optimizing performance (Section 6.1). These programs are quite unintuitive and can be difficult for even expert programmers to discover. The upcoming 4th volume of the classic series The Art of Computer Programming by Knuth has a chapter on bitwise tricks [13].
 • We propose a novel application of program synthesis to program understanding. We demonstrate this in the context of malware deobfuscation by deobfuscating examples drawn from and inspired by the Conficker and MyDoom viruses using our synthesis technique (Section 6.2).

     genStringObs(int input)
     {
       a1=1, a2=0; b1=1, b2=0; c1=0; c2=0;
       if (input == 0)
          { a1 = 0; a2 = 0; b1 = 0; b2 = 0; }
       else if (input == 1)
          { c1 = 0; c2 = 1; }
       else if (input == 2)
          { a1 = 1; a2 = 0; c1 = 1; c2 = 1; }
       else if (input == 3)
          { b1 = 0; b2 = 0; c1 = 1; c2 = 1; }
       else return NULL;
       c = 2*c1 + c2;
       if (c == 1) { return rot13("EPCG GB", 7); }
       else
       if (c == 2) {
          if (input * (input-1) % 2 == 0)
             return rot13("EPCG GB", 7);
          else
             return rot13("RUYB", 4);
       }
       else {
          if (b1 ⊕ b2) return rot13("ZNVY SEBZ", 9);
          else if ((a1 ⊕ a2) == (b1 ⊕ b2))
               return rot13("RUYB", 4);
          else return rot13("QNGN", 4);
       }
     }
     rot13(char *buf, int sz)
     {
       char *buf1 = malloc((sz+1) * sizeof(char));
       char a;
       while (a = ~*buf)
       {
         *buf1 = (~a-1/(~((a | 32))/13*2-11)*13);
         buf++; buf1++;
       }
       return buf1;
     }
Figure 1: Obfuscated program inspired by MyDoom

2. MOTIVATING EXAMPLES
We present two examples in this section to introduce the synthesis problem and motivate our approach.

2.1 Bit Manipulation
Consider the following programming problem: Given a bit-vector integer x, of finite but arbitrary length, construct a new bit-vector y that corresponds to x with the rightmost string of contiguous 1s turned off, i.e., reset to 0s. Such programming problems often arise while developing low-level embedded code, network applications, or in other domains where bit-level manipulation is needed.

Let us contemplate writing a formal specification for this problem. The most natural and easiest specification involves the use of alternating quantifiers, where n is the length of x:

\[
\begin{aligned}
\exists i, j.\ \{\ & 0 \le i, j < n \;\wedge\; (\forall k.\ j \le k \le i \Rightarrow x[k] = 1) \\
& \wedge\ (\forall k.\ 0 \le k < j \Rightarrow x[k] = 0) \\
& \wedge\ (x[i+1] = 0 \;\vee\; i = n-1) \\
& \wedge\ (\forall k.\ i < k < n \Rightarrow x[k] = y[k]) \\
& \wedge\ (\forall k.\ 0 \le k \le i \Rightarrow y[k] = 0)\ \}
\end{aligned}
\]
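One way to read the quantified specification operationally is to search for the witnesses i and j by brute force. The following is a minimal sketch under stated assumptions: bit 0 is taken to be the rightmost bit, the helper names `bit` and `satisfies_spec` are illustrative and not from the paper, and the conjuncts are checked literally as written in the formula.

```python
def bit(v, k):
    """Return bit k of v, where bit 0 is the rightmost bit (an assumption)."""
    return (v >> k) & 1

def satisfies_spec(x, y, n):
    """Check the quantified specification for n-bit x and y by enumerating i, j."""
    for i in range(n):
        for j in range(n):
            holds = (
                all(bit(x, k) == 1 for k in range(j, i + 1))              # run of 1s
                and all(bit(x, k) == 0 for k in range(0, j))              # 0s below the run
                and (i == n - 1 or bit(x, i + 1) == 0)                    # run ends at i
                and all(bit(x, k) == bit(y, k) for k in range(i + 1, n))  # upper bits kept
                and all(bit(y, k) == 0 for k in range(0, i + 1))          # lower bits cleared
            )
            if holds:
                return True
    return False

print(satisfies_spec(0b0110, 0b0000, 4))
```

Even for this small property, the check nests universal quantifiers inside an existential search over i and j, which gives a feel for why checking a candidate against such a formula is more work than comparing it on a few examples.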
The above specification is not easy to write. Moreover, verifying any candidate implementation against the above specification is challenging due to the presence of quantifiers.

Let us consider writing some sample input-output pairs, or examples, for the problem. For any input x, it is easy to provide the corresponding output y. Some example (x, y) pairs are (0110, 0000), (0101, 0100), (110110, 110000).

Finally, let us contemplate writing a program for the above problem. A straightforward, but inefficient, implementation is a loop that iterates through the bits of x and zeroes out the rightmost contiguous string of 1s. Can we synthesize a shorter and more efficient implementation? It is difficult to answer this, but it is easy to speculate that the elementary operations that may be used inside such an efficient implementation will be the standard bit-vector operators: bit-wise logical operations (|, &, ⊕, ∼), and basic arithmetic operations (+, −, ∗, /, %).

Given a set of possible elementary operations, and an ability to generate outputs for given inputs, our oracle-guided synthesis tool Brahma will synthesize the following nontrivial and tricky procedure for solving the above problem.

     turnOffRightMostOneBitString (x)
     { t1 = x-1; t2 = (x | t1); t3 = t2+1;
       t4 = (t3 & x); return t4; }

A programmer will require considerable familiarity with bit-level manipulations to come up with such an implementation. Hence, automated synthesis of such difficult-to-write programs is of great practical significance.

2.2 Deobfuscation
A major challenge of dealing with malware is simply to understand what the malicious code is doing. We introduce here an example inspired by the obfuscations introduced in variants of MyDoom, an e-mail virus affecting Microsoft Windows that became the fastest-spreading virus when it was first released in 2004 [21].

The example involves the construction of the SMTP header for the e-mail sent by the virus. SMTP requires a prescribed sequence of messages of different types, initially starting with a “hello” message, followed by the “from”, “reply to”, and other similar control fields, followed finally by the data segment of the e-mail. The fragment we have constructed is a program genStringObs, given in Figure 1, which constructs a string corresponding to the message type.

We used two types of obfuscations in this example. The first is a control-flow obfuscation drawn from several obfuscations given by Collberg et al. [5]. The second obfuscation is one that is directly used in MyDoom, involving the modification of each alphanumeric character in the message type string by the rot13 substitution cipher [14]. The reader can appreciate the difficulty of decoding exactly what this program is doing.

Is there a simpler (deobfuscated) program that performs
the same function as genStringObs? It is difficult to answer this question. However, looking at the obfuscated code in Figure 1, it is easier to guess that a deobfuscated program will probably use the following types of components:
 • Conditional if-then-else (ternary) operators,
 • Boolean expressions that occur in genStringObs, and
 • Operators that return strings of size bounded by the size of the largest string in genStringObs.

Given these components and the obfuscated program genStringObs, our oracle-guided synthesis tool Brahma automatically computes the (deobfuscated) program given in Figure 2. The reader can appreciate how much easier this program is to understand.

     genString(int input)
     { if (input == 0) return "EHLO";
       else if (input == 1) return "RCPT TO";
       else if (input == 2) return "MAIL FROM";
       else if (input == 3) return "DATA";
       else return NULL;
     }
Figure 2: Deobfuscated version of genStringObs

3. PROBLEM DEFINITION
The goal is to synthesize a loop-free program using a given set of base components and using input-output examples. We assume the presence of an I/O oracle that can be queried on any input. The I/O oracle, when given an input, returns the output of the desired program (that we wish to synthesize) on that input. We also assume the presence of a validation oracle that validates the correctness of a candidate program. Finally, we assume that we are given a set of (base) components that should be used as building blocks in the synthesized program. Each component is given in the form of its input-output specification, which is written as a logical formula relating the inputs and the outputs of that component. For ease of presentation, we assume that all components have exactly one output. We also assume that all inputs and outputs have the same type. These restrictions are easily removed.

Formally, the synthesis problem in our proposed programming methodology requires the following:
 • A validation oracle V that, given any candidate program (constructed from base components), returns a Boolean answer indicating whether the candidate program is the desired one or not. We discuss how a validation oracle can be implemented for the program classes considered in this paper in Sections 6.1 and 6.2.
 • An I/O oracle I that, given any program input, returns the output of the desired program on that input.
 • A set of specifications $\{\langle \vec{I}_i, O_i, \phi_i(\vec{I}_i, O_i)\rangle \mid i = 1, \ldots, N\}$, called a library, where $\langle \vec{I}_i, O_i, \phi_i(\vec{I}_i, O_i)\rangle$ is the specification for the base component $f_i$, which includes
    – a tuple of input variables $\vec{I}_i$ and an output variable $O_i$
    – an expression $\phi_i(\vec{I}_i, O_i)$ over variables $\vec{I}_i$ and $O_i$ that specifies the input-output relationship of the i-th component.

The symbols $\vec{I}_i$, $O_i$ are assumed to be distinct.

The goal of the synthesis problem is to synthesize a program P that can be validated by the validation oracle V, i.e., V(P) = true. Furthermore, program P should be constructed using only the set of base components in the library, i.e., program P should take $\vec{I}$ as its inputs and use the set $\{O_1, \ldots, O_N\}$ as temporary variables in the following form:

\[
\begin{array}{l}
P(\vec{I}): \\
\quad O_{\pi_1} := f_{\pi_1}(\vec{V}_{\pi_1});\ \ldots;\ O_{\pi_N} := f_{\pi_N}(\vec{V}_{\pi_N}); \\
\quad \text{return } O_{\pi_N};
\end{array}
\]

where
 C1. each variable in $\vec{V}_{\pi_i}$ is either an input variable from $\vec{I}$, or a temporary variable $O_{\pi_j}$ such that $j < i$, and
 C2. $\pi_1, \ldots, \pi_N$ is a permutation of $1, \ldots, N$.

Program P above appears to be a straight-line program, but, in fact, it can be more complex because the base components $f_i$ can be complex. In particular, base components can be “if-then-else” functions, and using these components, program P can describe arbitrary loop-free programs.

We note that the program P above is using all components from the library. We can assume this without any loss of generality. Even when there is a correct program P using fewer components, that program can always be extended to a program that uses all components by adding dead code. Dead code can be easily statically identified and removed in a post-processing step.

We also note that program P above is using each base component only once. We can assume this without any loss of generality. If there is a program P using multiple copies of the same base component, we assume that the user provides multiple copies explicitly in the library. Such a restriction of using each base component only once is interesting in two regards. First, it can be used to enforce efficient or minimal programs. Second, it prunes the search space of possible programs, making the synthesis problem finite and tractable.

Informally, the synthesis problem is to come up with a program, using only the base components in the given library, that is accepted by the validation oracle.

4. ORACLE-BASED SYNTHESIS
In this section, we provide our solution for the program synthesis problem formally described above. Our solution is based on encoding the space of all possible programs by a formula (Section 4.1). Given a set of input-output pairs, we then constrain this formula further so that it encodes only those programs that work correctly on the given input-output pairs (Section 4.2). By solving this constraint, we generate a candidate solution. If the candidate solution is not the desired program, we provide a way to generate a new input-output pair (Section 4.3). The overall procedure that combines these parts to solve the program synthesis problem is presented in Section 4.4. We present enhancements to our basic procedure in Section 4.6.

4.1 Background: Encoding Programs
We present an encoding of the space of well-formed candidate programs, that is, of programs P satisfying constraints C1 and C2, as formulas. This encoding is drawn from recent work [8]. Note, however, that the material in subsequent sections depends only on the existence of such an encoding. Our proposed approach can be used with alternative encodings as well.

Intuitively, the encoding we use involves viewing the space of candidate programs as all ways of connecting components from the library that satisfy syntactic and semantic well-formedness constraints. Each connection is encoded using an integer-valued location variable. Put another way, the value of a location variable determines which component goes on which location (line number), and from which location (line number or circuit input) it gets its input arguments.
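To make the location-variable view concrete, the following sketch assigns a line number to each component output and to each component argument, checks two well-formedness conditions (each line holds exactly one component, and every argument is read from an input or an earlier line), and then runs the resulting straight-line program. It is a simplified illustration, not the paper's encoding: the library, the 8-bit width, and the names `LIBRARY`, `OUT_LOC`, `ARG_LOC`, `well_formed`, and `run` are all assumptions.

```python
MASK = 0xFF          # assumed 8-bit vectors
NUM_INPUTS = 1       # location 0 holds the single program input

# Hypothetical library: component name -> (arity, function).
LIBRARY = {
    "dec": (1, lambda a: (a - 1) & MASK),
    "or":  (2, lambda a, b: a | b),
    "inc": (1, lambda a: (a + 1) & MASK),
    "and": (2, lambda a, b: a & b),
}

# Location variables: the line on which each component sits, and the
# location each of its arguments is read from.
OUT_LOC = {"dec": 1, "or": 2, "inc": 3, "and": 4}
ARG_LOC = {"dec": [0], "or": [0, 1], "inc": [2], "and": [3, 0]}

def well_formed(out_loc, arg_loc):
    """Check consistency (one component per line) and acyclicity."""
    m = NUM_INPUTS + len(LIBRARY)
    if sorted(out_loc.values()) != list(range(NUM_INPUTS, m)):
        return False
    for comp, (arity, _) in LIBRARY.items():
        if len(arg_loc[comp]) != arity:
            return False
        # Arguments must come from an input or a strictly earlier line.
        if not all(0 <= a < out_loc[comp] for a in arg_loc[comp]):
            return False
    return True

def run(out_loc, arg_loc, inputs):
    """Decode the location valuation into a program and execute it."""
    values = list(inputs) + [None] * len(LIBRARY)
    for comp in sorted(LIBRARY, key=out_loc.get):
        _, fn = LIBRARY[comp]
        values[out_loc[comp]] = fn(*(values[a] for a in arg_loc[comp]))
    return values[-1]    # the program returns the value of its last line

print(well_formed(OUT_LOC, ARG_LOC), bin(run(OUT_LOC, ARG_LOC, [0b110110])))
```

With this particular valuation the decoded program is the turnOffRightMostOneBitString procedure from Section 2.1; a different valuation of the same location variables wires the same four components into a different program, which is the search space a solver would explore.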
The main property of the encoding that our approach relies upon is distilled into the following theorem. This theorem states the existence of two formulas (encodings): the first formula $\psi_{\mathrm{wfp}}$ represents the set of all syntactically well-formed programs, whereas the second formula $\phi_{\mathrm{func}}$ represents the set of all semantic input-output behaviors of a well-formed program.

Theorem 1. There exists a set of integer-valued location variables L, a well-formedness constraint $\psi_{\mathrm{wfp}}(L)$ over L, a mapping Lval2Prog, and a functional constraint $\phi_{\mathrm{func}}(L, \vec{I}, O)$ over $L \cup \{\vec{I}, O\}$ such that the following properties hold:
 • Lval2Prog is a bijective mapping from the set of values L that satisfy the constraint $\psi_{\mathrm{wfp}}(L)$ to the set of programs that satisfy constraints C1 and C2.
 • Let $L_0$ be a satisfying assignment to the formula $\psi_{\mathrm{wfp}}$. If α and β are any candidate input and output values, then the formula $\phi_{\mathrm{func}}(L_0, \alpha, \beta)$ is true iff the program Lval2Prog($L_0$) returns the value β on the input α.

The proof of Theorem 1 follows from the results stated in [8].

We now describe the encoding more formally. Let $\mathbf{P}$ and $\mathbf{R}$ denote the union of all formal inputs (parameters) and formal outputs (return variables) of the components respectively, that is,

\[
\mathbf{P} := \bigcup_{i=1}^{N} \vec{I}_i, \qquad \mathbf{R} := \bigcup_{i=1}^{N} \{O_i\} = \{O_1, \ldots, O_N\}
\]

Any straight-line program constructed using N components can be described by a set of location variables L

\[
L := \{\, l_x \mid x \in \mathbf{P} \cup \mathbf{R} \,\}
\]

that contains one new variable $l_x$ for each variable x in $\mathbf{P} \cup \mathbf{R}$ with the following interpretation associated with each of these variables.
 • If x is the output variable $O_i$ of the component $f_i$, then $l_x$ is the line number in the program where the component $f_i$ is used.
 • If x is the j-th input parameter of the component $f_i$, then $l_x$ is the line number “from where component $f_i$ gets its j-th input”.

In the above description, line number refers to either a line of the program, or to some input. For uniformity, each input in $\vec{I}$ is assigned a line number from $0, \ldots, |\vec{I}| - 1$ and the program line numbers then take values from $|\vec{I}|, \ldots, |\vec{I}| + N - 1$. Let $M = |\vec{I}| + N$. The variables L take values in the range $0, \ldots, M - 1$ and these new line numbers have the following interpretation.
 • For $0 \le j < |\vec{I}|$, line number j is blank; it takes the value of the j-th input of the program.
 • For $|\vec{I}| \le j < M$, line number j contains the $(j - |\vec{I}| + 1)$-th assignment statement of the original program P.

The well-formedness constraint $\psi_{\mathrm{wfp}}(L)$, defined below, encodes the interpretation of the location variables $l_x$ along with syntactic well-formedness constraints, such as consistency and acyclicity constraints.

\[
\psi_{\mathrm{wfp}}(L) \;\stackrel{\mathrm{def}}{=}\; \bigwedge_{x \in \mathbf{P}} (0 \le l_x < M) \;\wedge\; \bigwedge_{x \in \mathbf{R}} (|\vec{I}| \le l_x < M)
\]

The consistency constraint $\psi_{\mathrm{cons}}$ encodes that every line in the program should have at most one component, while the acyclicity constraint $\psi_{\mathrm{acyc}}$ encodes that every variable should be initialized before it is used.

The function Lval2Prog returns the program corresponding to a given valuation L as follows: in the i-th line of Lval2Prog(L), we have the assignment $O_j := f_j(O_{\sigma(1)}, \ldots, O_{\sigma(t)})$ if $l_{O_j} = i$ and $l_{I_j^k} = l_{O_{\sigma(k)}}$ for $k = 1, \ldots, t$, where t is the arity of component $f_j$, and $(I_j^1, \ldots, I_j^t)$ is the tuple of input variables $\vec{I}_j$ of $f_j$. The well-formedness constraint describes syntactically correct programs, but it does not describe the semantics of these programs.

The functional constraint $\phi_{\mathrm{func}}(L, \vec{I}, O)$ is obtained by taking $\psi_{\mathrm{wfp}}(L)$ and adding to it constraints capturing the dataflow semantics and semantics of components.

\[
\begin{aligned}
\phi_{\mathrm{func}}(L, \vec{I}, O) \;&\stackrel{\mathrm{def}}{=}\; \exists \mathbf{P}, \mathbf{R}.\ \psi_{\mathrm{wfp}}(L) \wedge \phi_{\mathrm{lib}}(\mathbf{P}, \mathbf{R}) \wedge \psi_{\mathrm{conn}}(L, \vec{I}, O, \mathbf{P}, \mathbf{R}) \\
\phi_{\mathrm{lib}}(\mathbf{P}, \mathbf{R}) \;&\stackrel{\mathrm{def}}{=}\; \bigwedge_{i=1}^{N} \phi_i(\vec{I}_i, O_i) \\
\psi_{\mathrm{conn}}(L, \vec{I}, O, \mathbf{P}, \mathbf{R}) \;&\stackrel{\mathrm{def}}{=}\; \bigwedge_{x, y \,\in\, \mathbf{P} \cup \mathbf{R} \cup \vec{I} \cup \{O\}} (l_x = l_y \Rightarrow x = y)
\end{aligned}
\]

where $\phi_{\mathrm{lib}}$ represents the semantics of the base components (that relates the inputs and outputs of each component), and $\psi_{\mathrm{conn}}$ represents the dataflow semantics (that matches the inputs and outputs of the different components and the inputs and output of the overall program with each other, in accordance with the values of the location variables).

The formula $\phi_{\mathrm{func}}(L, \vec{I}, O)$ represents the class of all syntactically well-formed programs P, constructed using only the N base components, that on input $\vec{I}$ return output O. Hence, we can solve the program synthesis problem by finding appropriate values for the L variables. We need to find values for L such that the input-output behavior of the resulting program matches the input-output behavior specified by the I/O oracle.

A key step in our solution of the program synthesis problem is to synthesize programs that work for finitely many input-output pairs. We discuss this next.

4.2 I/O-behavioral Constraint
In this section, we show how to generate a constraint whose solution provides a candidate program whose input-output behavior matches a given finite set of input-output examples.

Given a set E of input-output examples $\{(\alpha_j, \beta_j)\}_j$, we use the notation $\mathrm{Behave}_E$ to denote the following constraint, which we refer to as the I/O-behavioral constraint.

\[
\mathrm{Behave}_E(L) \;\stackrel{\mathrm{def}}{=}\; \bigwedge_{(\alpha_j, \beta_j) \in E} \phi_{\mathrm{func}}(L, \alpha_j, \beta_j)
\]

Let $L_0$ be a set of values such that $\mathrm{Behave}_E(L_0)$ is true. It follows from the definition of the I/O-behavioral constraint that the program encoded by $L_0$ will give output $\beta_j$, when-
                       ∧ ψcons (L) ∧ ψacyc (L)                       ever it is given an input αj , for all pairs (αj , βj ) in E. This
             def       ^                                             property of the I/O-behavioral constraint is stated below.
     ψcons   =                (lx 6= ly )
                   x,y∈R,x6≡y
                                                                        Theorem 2 (I/O-behavioral Constraint). For any
             def
                   N
                   ^      ^                                          satisfying solution L0 to the I/O-behavioral constraint, the
     ψacyc   =                      lx < l y                         input-output behavior of the program Lval2Prog(L0 ) matches
                   i=1 x∈I
                         ~i ,y≡Oi                                    all the input-output examples in the set E.
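To make the location-variable encoding and Theorem 2 concrete, the following is a minimal Python sketch, not the paper's SMT encoding. It checks one fixed assignment of location values against the well-formedness conditions and against a set of examples. The 5-bit width, the four components, the convention that the last program line holds the output, and every name in the code (`COMPONENTS`, `well_formed`, `run`, `behave`) are assumptions made for illustration.

```python
from typing import Callable, List

WIDTH = 5                       # assumed bit width, as in the 5-bit running example
MASK = (1 << WIDTH) - 1

# Hypothetical base components f1..f4, enough to build (((x - 1) | x) + 1) & x.
COMPONENTS: List[Callable[..., int]] = [
    lambda a: a - 1,            # f1: decrement
    lambda a, b: a | b,         # f2: bitwise or
    lambda a: a + 1,            # f3: increment
    lambda a, b: a & b,         # f4: bitwise and
]


def well_formed(out_loc, in_locs, num_inputs):
    """Check psi_wfp: range, consistency and acyclicity of the location values."""
    m = num_inputs + len(out_loc)                       # M = |I| + N
    in_range = (all(num_inputs <= o < m for o in out_loc)
                and all(0 <= l < m for ls in in_locs for l in ls))
    consistent = len(set(out_loc)) == len(out_loc)      # one component per line
    acyclic = all(l < o for o, ls in zip(out_loc, in_locs) for l in ls)
    return in_range and consistent and acyclic


def run(out_loc, in_locs, inputs):
    """Decode the location values into a straight-line program and execute it."""
    values = dict(enumerate(inputs))                    # lines 0..|I|-1 hold the inputs
    for i in sorted(range(len(COMPONENTS)), key=lambda i: out_loc[i]):
        args = [values[l] for l in in_locs[i]]
        values[out_loc[i]] = COMPONENTS[i](*args) & MASK
    return values[max(out_loc)]                         # assumption: last line is the output


def behave(out_loc, in_locs, examples):
    """Behave_E evaluated on one concrete assignment of the location variables."""
    num_inputs = len(examples[0][0])
    return (well_formed(out_loc, in_locs, num_inputs)
            and all(run(out_loc, in_locs, a) == b for a, b in examples))


E = [((0b01011,), 0b01000), ((0b00110,), 0b00000)]
OUT = [1, 2, 3, 4]                    # line of each component's output
GOOD = [[0], [0, 1], [2], [3, 0]]     # (((x - 1) | x) + 1) & x
OTHER = [[0], [0, 1], [0], [3, 0]]    # (x + 1) & x, lines 1 and 2 are dead code
CYCLIC = [[2], [0, 1], [2], [3, 0]]   # f1 reads line 2 before it is computed

print(behave(OUT, GOOD, E), behave(OUT, OTHER, E), well_formed(OUT, CYCLIC, 1))
# prints: True False False
```

An SMT solver searches over all location values at once; the sketch only evaluates the same conditions for assignments that are handed to it.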
The proof of Theorem 2 is immediate from the definition of an I/O-behavioral constraint and Theorem 1.

   We next check if the program, which is synthesized by considering finitely many input-output pairs, is the desired program. We want to avoid the use of the validation oracle, since it is expensive. Here we use what is perhaps the central idea of our approach: generate a “distinguishing” input that differentiates this program from another candidate program.

4.3 Distinguishing Constraint
   In this section, we show how to generate a constraint whose solution provides an input that distinguishes a given candidate program from another non-equivalent candidate program, both of which have a given set of input-output pairs in their respective input-output behavior.

   Let $E$ be a set of input-output pairs. Let $P$ be a candidate program, defined by values $L$, whose input-output behavior matches the set $E$. Suppose $P$ is not the desired program. Then, there should be some input $\vec{I}$ such that $P$ gives incorrect output on $\vec{I}$. But, how do we find such an $\vec{I}$?

   If $P$ is not the desired program, then let us assume that there is a correct program $P'$. Clearly, for all input-output pairs $(\alpha_j, \beta_j)$ in $E$, the program $P'$ should return $\beta_j$ when it is given input $\alpha_j$. But since $P$ is not the desired program, whereas $P'$ is the desired program, $P$ and $P'$ should give different outputs on some new input.

   We say $\vec{I}$ is a distinguishing input if there is another program $P'$ whose input-output behavior also matches $E$, but $P$ and $P'$ give different outputs on the input $\vec{I}$. The constraint $Distinct_{E,P}(\vec{I})$, defined below, represents the set of all distinguishing inputs $\vec{I}$ and we refer to it as the distinguishing constraint.

\[ Distinct_{E,L}(\vec{I}) \stackrel{def}{=} \exists L', O, O' \;\; Behave_E(L') \wedge \phi_{func}(L, \vec{I}, O) \wedge \phi_{func}(L', \vec{I}, O') \wedge O \ne O' \]

   Theorem 3 (Distinguishing Constraint). If $\alpha$ is a satisfying solution to the distinguishing constraint $Distinct_{E,P}(\vec{I})$, then there exists a program $P'$ such that $P$ and $P'$ have different behaviors on input $\alpha$, but have the same behavior on all the inputs in the set $E$.

   The proof of Theorem 3 follows from the definition of the distinguishing constraint, Theorem 2 and Theorem 1. We now have all the ingredients for describing our overall procedure for solving the synthesis problem.

4.4 Oracle-Guided Synthesis
   In this section, we describe our oracle-guided iterative synthesis procedure. The description uses the I/O-behavioral constraint and the distinguishing constraint described above.

   The procedure works by iteratively synthesizing new programs that work correctly on more and more inputs. It starts with a set containing just one arbitrarily chosen input. In each iteration, the procedure synthesizes a program that works correctly on the current finite set of inputs. If such a program is found, then the procedure attempts to find a distinguishing input. If a distinguishing input is found, then it is added into the set of inputs for subsequent iterations. In all other cases, the procedure terminates. It either returns the correct program, or it notes that the components provided are insufficient for synthesizing the correct program.

   For solving the I/O-behavioral constraint and the distinguishing constraint, the procedure makes use of a function T-SAT. Given a formula $\phi(A)$, the function T-SAT($\phi(A)$) searches for values for $A$ that will make the formula $\phi$ true. If successful, then T-SAT($\phi(A)$) returns one such specific value for $A$. Otherwise, it returns ⊥. The function T-SAT is implemented as a call to a Satisfiability Modulo Theories (SMT) solver. SMT solvers check for satisfiability of a first-order formula with respect to underlying background theories [4].

   IterativeSynthesis():
    1   // Input: Set of base components used in
    2   // construction of Behave_E and Distinct_{E,L}
    3   // Output: Candidate Program
    4   E := {(α0, I(α0))}   // Pick any value α0 for \vec{I}
    5   while (1) {
    6     L := T-SAT(Behave_E(L));
    7     if (L == ⊥) return "Components insufficient";
    8     α := T-SAT(Distinct_{E,L}(\vec{I}));
    9     if (α == ⊥) {
   10       P := Lval2Prog(L);
   11       if (V(P)) return P;
   12       else return "Components insufficient"; }
   13     E := E ∪ {(α, I(α))}; }

   Figure 3: Oracle-guided Synthesis Procedure

   The pseudo-code for the procedure is given in Figure 3. The procedure maintains a set $E$ of input-output examples constructed by querying the I/O oracle I on a new input before entering the while loop (Line 4) and in each iteration of the while loop (Line 13). In each iteration of the while loop, the procedure attempts to synthesize a candidate program $P$ (represented by $L$) that satisfies the set $E$ of input-output examples (Line 6). If it fails, then it returns failure (Line 7). Otherwise, it checks (in Line 9) whether the candidate program $P$ is the semantically unique program that satisfies the given set of input-output examples. A program is semantically unique if any other program that satisfies the given set of input-output examples produces the same output as the program for any other input. If $P$ is the semantically unique program, then the procedure either returns $P$ (Line 11) or failure (Line 12) depending on whether the validation oracle V validates $P$ or not. If the candidate program is not semantically unique, then an input $\alpha$ is obtained that is added to $E$ to help narrow down the choice of candidate programs (Line 13).

   The following theorem states the correctness of Procedure IterativeSynthesis. Note that if the inputs $\vec{I}$ take values from a finite domain, then the number of iterations of the loop in the procedure is bounded by the total number of different inputs; and hence, in such cases the procedure is guaranteed to terminate.

   Theorem 4. If Procedure IterativeSynthesis, given in Figure 3, returns a program $P$, then V($P$) is true. If Procedure IterativeSynthesis returns “Components insufficient”, then there does not exist any program $P$ constructed from the set of base-components such that V($P$) is true. Furthermore, Procedure IterativeSynthesis is guaranteed to terminate when the inputs $\vec{I}$ take values from a finite domain.

   The proof of the correctness theorem follows immediately from the description of the procedure in Figure 3, combined with the properties stated in Theorem 1, Theorem 2 and Theorem 3. We also illustrate in Figure 4 all three cases in which Procedure IterativeSynthesis terminates. The first case corresponds to step 7 and the second and third cases correspond to step 11 and step 12 respectively.

4.5 Illustration on Running Example
   We illustrate the oracle-guided synthesis approach on the running example presented in Section 2.1. The problem was,
given a bit-vector integer x, of finite but arbitrary length, to construct a new bit-vector y that corresponds to x with the rightmost string of contiguous 1s turned off.

   1. No program exists with given components
      a. Step 7: Set E of I/O examples shows components insufficient for synthesizing a valid program.
      b. Step 12: Discovered semantically unique program P is found incorrect by the validator; no synthesis feasible.
   2. Valid program exists with given components
      Step 11: Valid program P is returned.

   Figure 4: Termination cases of Synthesis Procedure. The validation oracle is needed only to ensure correctness in case 1b.

   Our technique starts with a random input 01011 and the I/O oracle I (the user) is used to obtain the corresponding expected output 01000. This step corresponds to Line 4 of the algorithm presented in Figure 3.

   Given the input/output pair (01011, 01000), our technique generates the following candidate program (Line 6): (we give only the expression returned)

                (x + 1) & (x − 1)

Then, it checks whether a semantically different program can be generated in Line 8. In this case, our technique generates the following alternative program and the distinguishing input 00000:

                (x + 1) & x

The I/O oracle is used to obtain the output 00000 for this input (Line 13). This is added to the set of input/output pairs E. Note that the newly added pair rules out one of the candidate programs, namely, (x + 1) & (x − 1).

   In the next iteration, with the updated set E, the technique finds the program

                −(¬x) & x

and the check in Line 8 generates the alternate program

                (((x & −x) | −(x − 1)) & x) ⊕ x

and the input 00101. Hence, we add (00101, 00100) to E. This rules out (((x & −x) | −(x − 1)) & x) ⊕ x.

   Note that at this stage, the program (x + 1) & x remains a candidate, since it was not ruled out in the earlier iterations. In the next four iterations, Brahma generates (01111, 00000), (00110, 00000), (01100, 00000) and (01010, 01000) as input-output examples and adds them to E. The semantically unique program generated from the resulting set E is the desired program:

                (((x − 1) | x) + 1) & x.

4.6 Optimization
   The basic procedure described above can be improved by using alternate ways to generate the inputs that are used by the procedure for synthesis.

   IterativeSynthesis uses an SMT solver in two ways:
(a) First, an SMT solver is used to generate a candidate program that works for the current set of inputs.
(b) Second, an SMT solver is used to generate a new distinguishing input on which the currently synthesized program and the desired program potentially differ.

   Although SMT solvers are fast and capable of handling very large formulas, using them in every iteration compromises efficiency. It is tempting to speculate that the use of SMT solvers for generating a distinguishing input (case (b) above) can be avoided; for example, by replacing it by a function that finds new inputs by sampling the input space in some way. We explore two alternative ways for sampling the input space.

Sampling Uniformly at Random
   Let Inputs be the set of all possible valuations for the input variables. Let sample(Inputs) be a function that returns a particular input from the input space Inputs by sampling the set Inputs uniformly at random. The function sample(Inputs) can be used to find a new input, in place of the call to the SMT solver, in Line 8 of Procedure IterativeSynthesis. We will call this new variant Random.

Sampling With Bias
   The second approach we consider is based on biasing the search for inputs towards a certain part of the input space. Not all inputs in the input space are equally important. For example, a program may take an integer input i, but have the same behavior for all i > 5 and have interesting behaviors only on values 0 ≤ i ≤ 5. For many applications, the user knows a priori which inputs are more crucial in defining the overall program. The idea behind the sampling with bias strategy is to search for distinguishing inputs by biasing the search to this part of the input space.

   In the bitvector benchmarks used in this paper, the input space consists of all (tuples of) bitvectors of a certain bit width. It is well-known that, for a very large class of commonly-used bitvector functions, the rightmost bits influence the output more than the leftmost bits.

   Property 1 (See [29], Chapter 2). A function mapping bitvectors to bitvectors can be implemented with add, subtract, bitwise and, bitwise or, and bitwise not instructions if and only if each bit of the output depends only on bits at and to the right of that bit in each input operand.

This suggests that we should bias the sampling so that we get more variety on the rightmost bits.

   The code ConstrainedRandomInput in Figure 5 uses a constrained random strategy for generating a new input. It starts with an input α that is sampled uniformly at random, but then it sets its rightmost K bits to the (rightmost K bits in the) number cnt. Since cnt is incremented each time, we get a new combination each time. Specifically, if K = 2, then in four calls to the function ConstrainedRandomInput, we will get all four combinations (00, 01, 10 and 11) in the rightmost 2 digits of the input. The code ConstrainedRandomInput finds the first 2^K inputs this way. If more are needed, then it goes back to using the SMT solver. The new variant of IterativeSynthesis, obtained by replacing the call to the SMT solver in Line 8 by the code ConstrainedRandomInput, will be called Constrained Random.

   ConstrainedRandomInput:
     // cnt is a global variable initialized to 0
     // K is a parameter (#rightmost bits to set)
     if (cnt < 2^K) {
       α := sample(Inputs);
       α := Set rightmost K bits of α to cnt;
       cnt := cnt + 1; }
     else α := T-SAT(Distinct_{E,L}(\vec{I}));

   Figure 5: Strategy for generating a new input based on sampling from the input space with an application-dependent bias.

   We will compare the performance of IterativeSynthesis, Random, and Constrained Random in Section 6.

5. DISCUSSION
5.1 Choosing Base Components                                      tween the cheaper queries to the I/O oracle and the more
   It is reasonable to ask how base components are chosen in      expensive queries to the validation oracle, which allows us to
our approach and what happens when the given set of base          optimize our implementation. Finally, our algorithm iterates
components is either insufficient or very large.                  by finding distinguishing inputs, which is not an operation
   The choice of base components is made by the user and is       supported by Angluin’s model.
guided by the application domain. This allows the user to            Two other results from learning theory also shed light on
use his/her knowledge to guide the synthesis and influence        why our oracle-based approach is effective in practice.
success. It is not unreasonable to expect users to provide this      First, note that our focus on loop-free programs that com-
information. In several application domains, there is a nat-      pute functions of finite-precision bit-vector inputs indicates a
ural choice for the set of base components. For example, a        connection to the work on learning Boolean circuits. In par-
natural set of base components for synthesizing bitvector al-     ticular, the classic result on learning constant-depth Boolean
gorithms will contain components that perform bitwise and,        (AC 0 ) circuits from a few test inputs [17] provides a partial
or, not, xor, negation, increment and decrement operations.       explanation for the effectiveness of this strategy. The result
In our experiments on synthesizing bitvector programs (Sec-       relies on a theorem stating that AC 0 circuits can be ap-
tion 6), we started with such a set of base components, re-       proximated well by low-degree polynomials, which in turn
ferred to as the standard library. If the synthesis procedure     are known to be identifiable by their behavior on few inputs.
found that this set of components was insufficient, the stan-        The second relevant result relates to the notion of teach-
dard library was augmented with a set of new components           ing dimension introduced by Goldman and Kearns [7]. In-
suggested by the user and the synthesis procedure was re-run      formally, the teaching dimension of a concept class is the
with this extended library.                                       minimum number of examples a teacher (oracle) must re-
   Nevertheless, choosing a reasonable set of base compo-         veal to uniquely identify any target concept from that class.
nents is crucial for the feasibility of our synthesis approach.   As our experiments show, we need very few examples to syn-
The search space of candidate programs grows exponentially        thesize our target programs in practice, indicating that these
with the number of base components. The strategy of start-        programs form a concept class with a low teaching dimen-
ing with a small set of base components, and then incremen-       sion. Moreover, our algorithm fits closely with a result from
tally adding components, can partly avoid the need to deal        the Goldman-Kearns paper [7], showing that the generation
with very large set of base components. However, it can be        of an optimal teaching sequence of examples is equivalent to
successful only if the synthesis engine not only synthesizes      a minimum set cover problem. In the set cover problem for
correct programs quickly, but also reports infeasibility of the   a given target concept, the universe of elements is the set
synthesis problem quickly. In our experiments, we show that       of all incorrect concepts (programs) and each set Si , corre-
our technique can detect infeasibility efficiently.               sponding to example xi , contains concepts that are differen-
Choosing Base Components for Deobfuscation. When                  tiated from the target concept by this example xi . We can
using our program synthesis approach for performing pro-          see that our Procedure IterativeSynthesis computes such
gram deobfuscation, the base components are picked from           a distinguishing example in each iteration, and terminates
the assignment and conditional statements in the obfuscated       when it has computed a “set cover” that distinguishes the
code. For example, consider the obfuscated program in Fig-        target concept from all other candidate concepts (the “uni-
ure 1. Although it is difficult to understand the obfuscated      verse”). Given this close connection, it does seem that the
program, it is easier to guess that the set of important base     classes of functions corresponding to the bit-manipulating
components will include if-then-else components and equal-        and deobfuscation examples we consider have small teach-
ity comparators. Similarly, for the other deobfuscation ex-       ing dimension, and also Procedure IterativeSynthesis is
amples (reported in Section 6), the base components used          effective at generating a sequence of examples close to the
for synthesis contain only operators (such as left-shift and      optimal teaching sequence.
bitwise-xor) that appear explicitly in the obfuscated code           These connections to learning theory are very intriguing.
(see Figure 7).                                                   We leave a formal exploration of these links to future work.
5.2 Connections to Learning
   Our oracle-guided synthesis framework has close connec-        6. EXPERIMENTAL RESULTS
tions to certain fundamental results in computational learn-        We present experimental evaluation of our technique and
ing theory. We explore these connections in this section.         compare different approaches namely IterativeSynthesis,
   Our oracle-based model is similar to the query-based learn-    Random, and Constrained Random discussed in Section 4.
ing model proposed by Angluin [1], but with some important
distinctions. In Angluin’s model, a learner interacts with        Setup and Benchmarks.
an oracle through the use of membership and equivalence              We have implemented IterativeSynthesis in a tool called
queries in order to learn a target concept. In our setting,       Brahma. It uses Yices 1.0.21 [25] as the underlying SMT
the target concept is the program we seek to synthesize. A        solver. We ran our experiments on 8x Intel(R) Xeon(R)
membership query is similar to the query we make to an            CPU 1.86GHz with 4GB of RAM. Brahma was able to syn-
I/O oracle, except that the former returns a binary answer        thesize the desired circuit for each of the benchmark exam-
whereas the I/O oracle returns an output value. An equiv-         ples. Semi-biased Brahma implements Constrained Ran-
alence query is similar to a query to the validation oracle,      dom with the parameter K = 2. Thus, it differs only in first
except that, in Angluin’s model, if the candidate concept         4 steps from Brahma. As mentioned in Section 4, this is
is not equivalent to the target concept, the oracle returns       specially targetted towards synthesis of bitvector programs.
a counterexample as evidence for this non-equivalence. In            We selected a set of 25 benchmark examples to evaluate
our context, since the validation oracle is called only at the    our technique. 22 benchmarks (P1-P22) are bit-manipulation
end, when we are left with a semantically unique program          programs from the book Hacker’s Delight, commonly re-
consistent with the set of examples, such a counterexample        ferred to as the Bible of bit twiddling hacks [29]. 3 bench-
is not needed. Moreover, Angluin’s model treats both kinds        marks were used as examples to illustrate the use of our tech-
of queries as equally expensive. We make a distinction be-        nique for deobfuscation. These benchmarks reflect obfusca-
14                                     7
                                                                                                   Runtime
P1(x)      :                       P22(x)                                                        Iter Count
Turn-off                           :    Round                                         12                                     6
                P21(x)      :

                                                                                                                                  Ratio of Iteration Counts
rightmost 1                        up to the
bit.            Next higher                                                           10                                     5
                                   next highest

                                                                   Ratio of Runtime
                unsigned
1 o1 =(x - 1) number               power of 2
2 res=(x & o1 )                                                                           8                                  4
                with same            1 o1 =(x - 1)
                number of 1          2 o2 =(o1 >> 1)
P19(x)     :                         3 o3 =(o1 | o2 )                                     6                                  3
                bits
Turn-off the                         4 o4 =(o3 >> 2)
                1 o1 =(- x)                                                               4                                  2
rightmost                            5 o5 =(o3 | o4 )
                2 o2 =(x & o1 )
contiguous                           6 o6 =(o5 >> 4)
string of 1     3 o3 =(x + o2 )                                                           2                                  1
                                     7 o7 =(o5 | o6 )
bits            4   o4 =(x ⊕  o2 )
                                     8 o8 =(o7 >> 8)
                5 o5 =(o4 >> 2)                                                           0                                  0
1 o1 =(x - 1) 6 o6 =(o5 / o2 ) 9 o9 =(o7 | o8 )                                               1 3 5 7 9 11 13 15 17 19 21
2 o2 =(x | o1 ) 7 res=(o6 | o3 ) 10 o10 =(o9 >> 16)
3 o3 =(o2 + 1)                     11 o11 =(o9 | o10 )                                            Bitvector Benchmarks
4 res=(o3 & x)                     12 res=(o10 + 1)              Figure 8: Ratio of Runtime for Random Input Genera-
                                                                 tion to SemiBiased Brahma
     Figure 6: Selected Bit-vector Benchmarks                                         6                                     1.5
                                                                                                Runtime
                                                                                              Iter Count
tion strategies from literature on obfuscation techniques [5] (P23) and
Internet worms: Conficker [22] (P24) and MyDoom [21] (P25). Some of these
examples are presented in Figure 6 and Figure 7. All the benchmarks used in
the experiments are listed in a more detailed version of the paper made
available as a technical report [9].

6.1 Bit-Manipulating Programs
   Recall from Section 5.1 that the bitvector benchmarks were run using a
standard library of base components and, if necessary, an extended library.
In Table 1, we report the runtime when using the standard library (col. 4)
and when using the user-augmented extended library (col. 5), in case the
standard library was not sufficient. Note that our tool quickly terminates
when the given library is insufficient.
   For bitvector benchmarks, the user plays the role of the I/O oracle as
well as the validation oracle. If the user guarantees that the provided set
of base components is sufficient to encode the desired solution, then we do
not require the validation oracle. Otherwise, it is theoretically impossible
to know whether or not the generated solution is the correct one without a
validation oracle. However, in practice, our algorithm detects insufficiency
of the base components by discovering inconsistency, and not by a query to
the validation oracle. This suggests that in the absence of any validation
oracle, we can consider the semantically unique candidate program returned
by our algorithm to be the correct program for all practical purposes.
   We now compare the three approaches on bit-vector benchmarks using two
metrics: the total runtime and the number of iterations. We present the
ratio of the runtimes of random input generation (col. 2 of Table 1) and
semi-biased Brahma (col. 5 of Table 1) in Figure 8. Semi-biased Brahma is
1.5 to 12 times faster than the random technique. For P22, the random
technique did not finish in 1 hour, while semi-biased Brahma synthesized it
in 186 seconds. The number of iterations required to synthesize a program is
also reduced significantly, as shown in Table 1. Brahma and semi-biased
Brahma are compared in Figure 9. While semi-biased Brahma needs more
iterations, it is faster than Brahma on larger benchmarks. It reduces the
runtime for P18 from 140.65 seconds to 25.55 seconds, for P21 from 527.91
seconds to 272.28 seconds, and for P22 from 1108.15 seconds to 187.17
seconds. This validates the optimization proposed in Section 4.

[Figure 9: Ratio of Runtime for Brahma to SemiBiased Brahma. x-axis: Bitvector Benchmarks; y-axes: Ratio of Runtime, Ratio of Iteration Counts; series: Runtime, Iter Count.]

6.2 Deobfuscation
   The I/O oracle involves simply evaluating the obfuscated program on the
given input. The validation oracle can be a program equivalence checking
tool or the user.
   An additional challenge that this domain offers is the presence of
arbitrary string constants. Our synthesis framework can be easily extended
to discover such constants. For this purpose, we introduce a generic base
component f_c that simply outputs some arbitrary constant c. The component
f_c takes no input and returns one output O, and its functional
specification is written as O = c. The only change to the framework is that,
since c is allowed to be arbitrary, we existentially quantify over c in the
functional constraint φ_func described in Section 4.1.
   For the three examples that we used in experiments, observe that Brahma
gives the best performance. The key observation from the experiments is that
random input generation does not work well for examples such as P25, where a
randomly generated integer has only a 1 in 2^32 chance of being an input
that produces any of the first 4 possible outputs. Brahma takes exactly 5
iterations to query the I/O oracle with inputs that generate all the 5
possible outputs.
   The experimental results indicate that adding a distinguishing input is
better than adding a random input to E because it guarantees that at least
one candidate program is definitely removed from the search space. Thus, it
guaran-
P23: Interchange the source and destination addresses.
interchangeObs(IPaddress* src, IPaddress* dest)
{ *src = *src ⊕ *dest;
  if (*src == *src ⊕ *dest) { *src = *src ⊕ *dest;
     if (*src == *src ⊕ *dest) { *dest = *src ⊕ *dest;
        if (*dest == *src ⊕ *dest) { *src = *dest ⊕ *src;
           return; }
        else { *src = *src ⊕ *dest; *dest = *src ⊕ *dest;
          return;} }

P24: Multiply by 45
int multiply45Obs(int y)
{ a=1; b=0; z=1; c=0;
  while(1) { if (a == 0) {
    if (b == 0) { y=z+y; a=¬a;
    b=¬b; c=¬c; if (¬c) break; }
    else {
    z=z+y; a=¬a; b=¬b; c=¬c;
    if (¬c) break; } }
    else if(b == 0) {z=y
Shapiro's Algorithmic Debugging System [23] performs synthesis by repeatedly
using the oracle to identify errors in the current (incorrect) program and
then fixing those errors. We do not fix incorrect programs. We use the
incorrect program to identify a distinguishing input and then re-synthesize
a new program that works on the new input-output pair as well.
   In programming by demonstration [15, 16, 6], the user demonstrates how to
perform a task and the system learns an appropriate representation of the
task procedure. Unlike our method, these approaches do not make active
oracle queries, but rely on the demonstrations the user chooses. Making
active queries is important for efficiency and for terminating quickly (so
that the user is not overwhelmed with queries).
   In programming by sketching [24], implementations are synthesized from
sketches, which are partially specified programs with holes. The SKETCH
system uses SAT solving within a counterexample-guided loop that constantly
interacts with a verifier to check candidate implementations against a
complete specification, where the verifier provides counterexamples until a
correct solution has been found. In contrast, we do not use counterexamples
for synthesis. Further, we require a validation oracle only when the
specification cannot be realized using the provided components. This
verifier is not required to return a counterexample. In practice, we never
require a query to the verifier, and our technique correctly identifies
infeasible scenarios without calling the validation oracle.

8.   CONCLUSION
   We have presented a novel approach to program synthesis based on
oracle-guided learning from examples and SMT solvers. Applications to
synthesis of bit-vector programs and deobfuscation have been demonstrated.
Experiments indicate that our approach can be efficient and effective for
discovering unintuitive code and for program understanding.

Acknowledgments
We are grateful to Rastislav Bodik, George Necula, John Rushby, Natarajan
Shankar, Hassen Saidi, Dawn Song, Richard Waldinger and the anonymous
reviewers for their insightful comments. The UC Berkeley authors were
supported in part by NSF grants CNS-0644436 and CNS-0627734, and by an
Alfred P. Sloan Research Fellowship. The fourth author was supported in part
by NSF grants CNS-0720721 and CSR-0917398.

9.   REFERENCES
 [1] D. Angluin. Queries and concept learning. Machine Learning,
     2(4):319–342, 1987.
 [2] S. Bansal and A. Aiken. Automatic generation of peephole
     superoptimizers. In ASPLOS, 2006.
 [3] S. Bansal and A. Aiken. Binary translation using peephole
     superoptimizers. In OSDI, 2008.
 [4] C. Barrett, R. Sebastiani, S. A. Seshia, and C. Tinelli. Satisfiability
     modulo theories. In Handbook of Satisfiability, volume 4, chapter 8.
     IOS Press, 2009.
 [5] C. Collberg, C. Thomborson, and D. Low. A taxonomy of obfuscating
     transformations. Technical Report 148, Dept. Comp. Sci., The Univ. of
     Auckland, July 1997.
 [6] A. Cypher, editor. Watch what I do: Programming by demonstration. MIT
     Press, 1993.
 [7] S. A. Goldman and M. J. Kearns. On the complexity of teaching. Journal
     of Computer and System Sciences, 50:20–31, 1995.
 [8] S. Gulwani, S. Jha, A. Tiwari, and R. Venkatesan. Component based
     synthesis applied to bitvector circuits. Technical Report
     MSR-TR-2010-12, Microsoft Research, Feb 2010.
 [9] S. Jha, S. Gulwani, S. A. Seshia, and A. Tiwari. Oracle-guided
     component-based program synthesis. Technical Report UCB/EECS-2010-15,
     EECS Department, University of California, Berkeley, Feb 2010.
[10] T. A. Johnson and R. Eigenmann. Context-sensitive domain-independent
     algorithm composition and selection. In PLDI, 2006.
[11] R. Joshi, G. Nelson, and K. H. Randall. Denali: A goal-directed
     superoptimizer. In PLDI, 2002.
[12] E. Kitzelmann and U. Schmid. Inductive synthesis of functional
     programs: An explanation based generalization approach. J. Machine
     Learning Res., 7:429–454, 2006.
[13] D. E. Knuth. The art of computer programming.
     http://www-cs-faculty.stanford.edu/~knuth/taocp.html.
[14] J. Kominek. rot13 implementation.
     http://www.miranda.org/~jkominek/rot13/, Accessed Sep. 2009.
[15] T. Lau, P. Domingos, and D. S. Weld. Version space algebra and its
     application to programming by demonstration. In ICML, pages 527–534,
     2000.
[16] H. Lieberman, editor. Your wish is my command: Giving users the power
     to instruct their software. Morgan Kaufmann, 2001.
[17] N. Linial, Y. Mansour, and N. Nisan. Constant depth circuits, fourier
     transform, and learnability. In FOCS, pages 574–579, 1989.
[18] D. Mandelin, L. Xu, R. Bodík, and D. Kimelman. Jungloid mining: Helping
     to navigate the API jungle. In PLDI, pages 48–61, 2005.
[19] Z. Manna and R. Waldinger. A deductive approach to program synthesis.
     ACM TOPLAS, 2(1):90–121, 1980.
[20] H. Massalin. Superoptimizer - a look at the smallest program. In
     ASPLOS, pages 122–126, 1987.
[21] MyDoom Wikipedia Article. http://en.wikipedia.org/wiki/Mydoom, URL
     accessed Sep. 2009.
[22] P. Porras, H. Saidi, and V. Yegneswaran. An analysis of conficker's
     logic and rendezvous points. Technical report, SRI International,
     March 2009.
[23] E. Y. Shapiro. Algorithmic Program DeBugging. MIT Press, Cambridge, MA,
     USA, 1983.
[24] A. Solar-Lezama, L. Tancau, R. Bodík, S. Seshia, and V. Saraswat.
     Combinatorial sketching for finite programs. In ASPLOS, 2006.
[25] SRI Intl. Yices: An SMT solver. http://yices.csl.sri.com/.
[26] M. Stickel, R. Waldinger, M. Lowry, T. Pressburger, and I. Underwood.
     Deductive composition of astronomical software from subroutine
     libraries. In CADE, 1994.
[27] P. D. Summers. A methodology for lisp program construction from
     examples. J. ACM, 24(1), 1977.
[28] Symantec Corporation. Internet security threat report volume XIV.
     http://www.symantec.com/business/theme.jsp?themeid=threatreport,
     April 2009.
[29] H. S. Warren. Hacker's Delight. Addison-Wesley Longman, Boston, MA,
     USA, 2002.