GPU-based Cloud Computing for Comparing the Structure of Protein Binding Sites

Matthias Leinweber¹, Lars Baumgärtner¹, Marco Mernberger¹, Thomas Fober¹, Eyke Hüllermeier¹, Gerhard Klebe², Bernd Freisleben¹

¹ Department of Mathematics & Computer Science and Center for Synthetic Microbiology, University of Marburg, Hans-Meerwein-Str. 3, D-35032 Marburg, Germany
² Department of Pharmacy and Center for Synthetic Microbiology, University of Marburg, Marbacher Weg 6, D-35037 Marburg, Germany

¹ {leinweberm, lbaumgaertner, mernberger, thomas, eyke, freisleb}@informatik.uni-marburg.de
² klebe@staff.uni-marburg.de

Abstract— In this paper, we present a novel approach for using a GPU-based Cloud computing infrastructure to efficiently perform a structural comparison of protein binding sites. The original CPU-based Java version of a recent graph-based algorithm called SEGA has been rewritten in OpenCL to run on NVIDIA GPUs in parallel on a set of Amazon EC2 Cluster GPU Instances. This new implementation of SEGA has been tested on a subset of protein structure data contained in the CavBase, providing a structural comparison of protein binding sites on a much larger scale than in previous research efforts reported in the literature.

Index Terms— GPU, Cloud computing, protein binding sites, structure comparison, graph alignment, OpenCL.

I. INTRODUCTION
A major goal in synthetic biology is the manipulation of the genetic setup of living cells to introduce novel biochemical pathways and alter existing ones. A prerequisite for the constitution of new biochemical pathways in microorganisms is a working knowledge of the biochemical function of the proteins of interest. Since assessing protein function experimentally is time-consuming and in some cases even infeasible, the prediction of protein function is a central task in bioinformatics.

Typically, the function of a protein is inferred from similar proteins with known functions, most prominently by a sequence comparison, owing to the observation that proteins with an amino acid sequence similarity larger than 40% tend to have similar functions [19]. Accordingly, a plethora of algorithms exists for comparing protein sequences, including the well-known NCBI BLAST algorithm [1]. Yet, below this threshold of 40%, the results of sequence comparisons become more and more uncertain [11].

In cases where a sequence-based inference of protein function remains inconclusive, a structural comparison can provide further insights and uncover more remote similarities [18], especially when focusing on functionally important regions of proteins, such as protein binding sites. Several algorithms are known to compare possible protein binding sites based on structural data [9], [17], [3]. However, such algorithms have much longer runtimes than their sequence-based counterparts, severely limiting their use for large scale comparisons.

In this paper, we present a novel approach to significantly speed up the computation times of a recent graph-based algorithm for performing a structural comparison of protein binding sites, called SEGA [15], by using the digital ecosystem of a GPU-based Cloud computing infrastructure. The original CPU-based Java version of SEGA has been rewritten in OpenCL to run on NVIDIA GPUs in parallel on a set of Amazon EC2 Cluster GPU Instances. This new implementation of SEGA has been tested on protein structure data of the CavBase [16], providing a structural comparison of protein binding sites on a much larger scale than in previous research efforts reported in the literature.

This paper is organized as follows. Section II discusses related work. The SEGA algorithm is described in Section III, and its GPU implementation is presented in Section IV. Experimental results are discussed in Section V. Section VI concludes the paper and outlines areas for future work.

II. RELATED WORK

Several graph-based algorithms for protein structure analysis have been proposed in the literature. For example, a subgraph isomorphism algorithm [20] has been used by Artymiuk et al. [2] to identify amino acid side chain patterns. Furthermore, Xie and Bourne [22] have proposed an approach utilizing weighted subgraph isomorphism, while Jambon et al. [6] employ heuristics to find correspondences. A more recent approach based on fuzzy histograms to find similarities in structural protein data has been presented by Fober and Hüllermeier [5]. Fober et al. [4] have shown that pair-wise or multiple alignments on structural protein information can be achieved using labeled point clouds, i.e., sets of vertices in a three-dimensional coordinate system.

Apparently, algorithms for performing a structural comparison of protein binding sites have not yet been designed to run on modern GPUs. However, there are several sequence-based protein analysis approaches that were ported to GPUs. For example, NCBI BLAST runs on GPUs to achieve significant speedups [21]. Other projects, such as CUDASW++ and CUDA-BLASTP [13], [14], [12], [8], have shown that GPUs can be used as cheap and powerful accelerators for well-known local sequence alignment algorithms, such as the Smith-Waterman algorithm.
III. THE SEGA ALGORITHM

The SEGA algorithm constructs a global graph alignment of complete node-labeled and edge-weighted graphs, i.e., a 1-to-1 correspondence of nodes. In principle, SEGA realizes a divide and conquer strategy by first solving a correspondence problem on a local scale to derive a distance measure on nodes. This local distance measure is used in a second step to solve another correspondence problem on a global scale, by deriving a mutual assignment of nodes to construct a global graph alignment.

To derive a local distance measure, nodes are compared in terms of their immediate surroundings, i.e., the node neighborhood. This node neighborhood is defined by the subgraph formed by the n nearest neighbor nodes. Since SEGA has been developed for graphs representing protein binding sites based on CavBase data [16], nodes represent pseudocenters, i.e., spatial descriptors of physicochemical properties present within a binding site. Edges are weighted with the Euclidean distance between pseudocenters.

[Fig. 1. Decomposition of the neighborhood of node vc with n_neigh = 4. The subgraph defined by the n_neigh nearest nodes is decomposed into triangles containing the center node vc.]
The basic assumption is that the more similar the immediate surroundings of two pseudocenters are, the higher the likelihood that they belong to corresponding protein regions. Comparing the node neighborhood thus corresponds to comparing the spatial constellation of physicochemical properties in close proximity of these pseudocenters. If these are highly similar, a mutual assignment of these nodes should be favored.
Given two input graphs G1 = (V1, E1) and G2 = (V2, E2) with |V1| = m1 and |V2| = m2, a local m1 × m2 distance matrix D = (d_ij), 1 ≤ i ≤ m1, 1 ≤ j ≤ m2, is obtained by extracting the induced neighborhood subgraph for each center node vi ∈ V1 and vj ∈ V2, as given by the set of nodes comprising the center node itself and its closest n neighbor nodes.

To obtain a distance measure between two nodes vi and vj, the corresponding subgraphs are decomposed into the set of all triangles containing the center node vc (see Figure 1). Then, an assignment problem is solved to obtain the number of matching triangles. Triangles are considered to match if a mutual assignment of nodes exists for which the node labels of corresponding neighbor nodes are identical and all corresponding edge weights are within an ε-range of each other. In other words, a superposition preserving node labels (exempting the center node) and edge lengths is obtained. The node labels of the center nodes are not required to match, in order to introduce a certain level of tolerance, which is necessary when dealing with molecular structure data. Likewise, the parameter ε ≥ 0 is a tolerance threshold determining the allowed deviation of edge lengths.
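This matching rule can be made concrete with a small Java sketch. It is an illustrative reconstruction under the definitions above; the Triangle representation and all identifiers are ours and do not stem from the SEGA source code:

// Hypothetical sketch of the triangle matching rule described above.
final class TriangleMatcher {
    /** A triangle around a center pseudocenter: the two neighbor labels and
     *  the three pairwise Euclidean edge lengths. */
    record Triangle(String labelA, String labelB,
                    double edgeCenterA, double edgeCenterB, double edgeAB) {}

    private final double epsilon; // tolerance threshold for edge lengths

    TriangleMatcher(double epsilon) { this.epsilon = epsilon; }

    /** Two triangles match if the neighbor labels can be mapped onto each
     *  other (the center labels are deliberately ignored) and all
     *  corresponding edge lengths differ by at most epsilon. */
    boolean matches(Triangle s, Triangle t) {
        return matchesInOrder(s, t) || matchesSwapped(s, t);
    }

    private boolean matchesInOrder(Triangle s, Triangle t) {
        return s.labelA().equals(t.labelA()) && s.labelB().equals(t.labelB())
            && close(s.edgeCenterA(), t.edgeCenterA())
            && close(s.edgeCenterB(), t.edgeCenterB())
            && close(s.edgeAB(), t.edgeAB());
    }

    private boolean matchesSwapped(Triangle s, Triangle t) {
        return s.labelA().equals(t.labelB()) && s.labelB().equals(t.labelA())
            && close(s.edgeCenterA(), t.edgeCenterB())
            && close(s.edgeCenterB(), t.edgeCenterA())
            && close(s.edgeAB(), t.edgeAB());
    }

    private boolean close(double x, double y) {
        return Math.abs(x - y) <= epsilon;
    }
}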
The obtained distance matrix D can be considered as a cost matrix, indicating the cost of each potential assignment of nodes vi ∈ V1 and vj ∈ V2. In the second step of the algorithm, an optimal assignment of nodes from V1 and V2 is derived incrementally, by first realizing the assignment of nodes that have the smallest distance to each other before assigning the next pair of nodes.

If ambiguities arise, SEGA resorts to global information by selecting assignments for which both nodes preferably show a small deviation with respect to an already obtained partial solution. More precisely, the relative position of the candidate nodes to each node in the partial solution is determined and used to calculate another cost matrix, containing a measure of the geometric deviation for each candidate pair. The actual assignments are then obtained by solving another optimal assignment problem, using the Hungarian algorithm [10]. A more detailed description of the approach can be found in Mernberger et al. [15].
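The incremental step can be pictured with the following minimal Java sketch, which repeatedly selects the cheapest still-unassigned pair from D. The ambiguity handling and the Hungarian algorithm [10] used by SEGA are deliberately omitted here; class and method names are hypothetical:

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the incremental assignment: repeatedly pick the
// remaining node pair with the smallest local distance.
final class GreedyAlignment {
    /** Returns index pairs {i, j} assigning nodes of V1 to nodes of V2;
     *  d is the m1 x m2 distance matrix. */
    static List<int[]> align(double[][] d) {
        int m1 = d.length, m2 = d[0].length;
        boolean[] usedRow = new boolean[m1], usedCol = new boolean[m2];
        List<int[]> assignment = new ArrayList<>();
        for (int k = 0; k < Math.min(m1, m2); k++) {
            int bi = -1, bj = -1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < m1; i++) {
                if (usedRow[i]) continue;
                for (int j = 0; j < m2; j++) {
                    if (!usedCol[j] && d[i][j] < best) {
                        best = d[i][j]; bi = i; bj = j;
                    }
                }
            }
            usedRow[bi] = true;
            usedCol[bj] = true;
            assignment.add(new int[] {bi, bj});
        }
        return assignment;
    }
}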
IV. SEGA IN A GPU CLOUD

In this section, a version of the SEGA algorithm running on GPU hardware and a pipelined computation framework for performing large scale GPU-based structural comparisons of protein binding sites in a Cloud environment are presented.

A. GPU Implementation of SEGA

A common problem when developing applications to run on GPU hardware is that it is not easy to utilize all resources of a computational node efficiently. If the complete algorithm is implemented to run on a GPU, the host CPU's work only consists of controlling the device, which usually is not sufficient to operate the processor at full load.

The SEGA algorithm is well suited for a division into a GPU and a CPU part. The part of the algorithm that solves the correspondence problem has been rewritten to run on GPU hardware using OpenCL. The iterative part constructing the global alignment is computed on the host CPU, supported by intermediate results generated by the GPU part of the implementation.

The creation of the cost matrix D (see Section III) is divided into four OpenCL kernels. The first OpenCL kernel builds the input graphs G = (V, E) from the point cloud information provided by the protein cavity database. The data is stored in an m × m matrix, where m = |V| is the number of points describing the cavity. Based on the data parallelism in this task, this kernel can run with m² threads at once, where each thread computes a pair-wise distance.
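A sequential Java reference of the per-thread logic of this first kernel might look as follows; this is an illustrative sketch (the actual kernel is written in OpenCL, and all names here are hypothetical), with one loop iteration corresponding to one of the m² GPU threads:

// CPU reference of the per-thread logic of the first kernel.
final class DistanceMatrixKernel {
    /** xyz holds m pseudocenter coordinates as {x, y, z}; the result is the
     *  m x m matrix of pairwise Euclidean distances, stored row-major. */
    static double[] run(double[][] xyz) {
        int m = xyz.length;
        double[] d = new double[m * m];
        for (int tid = 0; tid < m * m; tid++) { // one iteration = one GPU thread
            int i = tid / m, j = tid % m;
            double dx = xyz[i][0] - xyz[j][0];
            double dy = xyz[i][1] - xyz[j][1];
            double dz = xyz[i][2] - xyz[j][2];
            d[tid] = Math.sqrt(dx * dx + dy * dy + dz * dz);
        }
        return d;
    }
}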
The second OpenCL kernel constructs an intermediate matrix for a protein cavity. For each node v ∈ V, this matrix contains the indices of the n nearest neighbors. Each line of this matrix is data-independent and holds the indices of the n smallest values from the corresponding row of D. This is calculated by m · (m/2) threads, where m/2 threads determine the n smallest values using parallel reduction and block-shared memory.
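Sequentially, the selection performed by this kernel corresponds to the following sketch; the GPU version replaces the per-row sort with a parallel reduction in shared memory. The code is an assumption-labeled illustration, not the original kernel:

import java.util.Comparator;
import java.util.stream.IntStream;

// Sequential reference of the second kernel: for every node, select the
// indices of its n nearest neighbors from one row of the distance matrix D.
final class NearestNeighborKernel {
    /** d is the m x m distance matrix (row-major); returns an m x n matrix
     *  of neighbor indices. */
    static int[][] run(double[] d, int m, int n) {
        int[][] nearest = new int[m][];
        for (int row = 0; row < m; row++) {
            final int base = row * m, self = row;
            nearest[row] = IntStream.range(0, m)
                    .filter(j -> j != self)   // skip the node itself
                    .boxed()
                    .sorted(Comparator.comparingDouble(j -> d[base + j]))
                    .limit(n)
                    .mapToInt(Integer::intValue)
                    .toArray();
        }
        return nearest;
    }
}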
A neighborhood size of n results in l = n · (n − 1)/2 triangles for each node. These triangles are stored in an m × l matrix Z that is created by the third OpenCL kernel. This kernel is executed with m × l threads in parallel, using a vector of index pairs indicating which of the n nearest neighbors is combined with which other neighbor.
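The index vector mentioned above can be precomputed once on the host; a possible Java sketch (hypothetical names, shown only to make the l = n · (n − 1)/2 enumeration explicit) is:

// Enumerate the l = n*(n-1)/2 neighbor pairs that, together with the center
// node, define the triangles of one neighborhood.
final class TriangleKernel {
    /** Returns the l pairs {a, b} with 0 <= a < b < n; each pair selects two
     *  of the n nearest neighbors, combined with the center node into one
     *  triangle. The GPU version launches m * l threads, one per triangle. */
    static int[][] neighborPairs(int n) {
        int l = n * (n - 1) / 2;
        int[][] pairs = new int[l][2];
        int k = 0;
        for (int a = 0; a < n; a++) {
            for (int b = a + 1; b < n; b++) {
                pairs[k][0] = a;
                pairs[k][1] = b;
                k++;
            }
        }
        return pairs;
    }
}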
The last OpenCL kernel combines two triangle matrices Z1, Z2 into a distance matrix D with m1 × m2 elements. It is executed with m1 · m2 threads, where each thread loops over l · l triangle pairs, computing the cost of a match.
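The per-thread logic of this combination kernel can be pictured with the following Java sketch, with one loop iteration per GPU thread. The match predicate stands in for the label and ε test of Section III, and the step from match counts to the final cost entry is simplified here; all names are hypothetical:

import java.util.function.BiPredicate;

// Illustrative reference of the fourth kernel: each of the m1 * m2 threads
// compares the l triangles of one node from cavity 1 with the l triangles
// of one node from cavity 2.
final class CombineKernel {
    /** z1 is m1 x l, z2 is m2 x l; entry [i][j] counts matching triangle
     *  pairs, which the host then turns into an assignment cost. */
    static <T> int[][] run(T[][] z1, T[][] z2, BiPredicate<T, T> matches) {
        int m1 = z1.length, m2 = z2.length;
        int[][] score = new int[m1][m2];
        for (int tid = 0; tid < m1 * m2; tid++) { // one iteration = one thread
            int i = tid / m2, j = tid % m2;
            int count = 0;
            for (T s : z1[i])                     // l * l triangle comparisons
                for (T t : z2[j])
                    if (matches.test(s, t)) count++;
            score[i][j] = count;
        }
        return score;
    }
}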
The final alignment based on the distance matrix D is computed on the host as described in Section III, supported by the intermediate results generated by the OpenCL part.
B. Management Framework

We have developed a software framework for managing the GPU and CPU computations involved in our implementation. The framework consists of six major components. Three components control the GPU hardware, the fourth component is responsible for selecting objects for comparison, the fifth component offers a service to manage thread pools for workloads on CPUs, and the sixth component provides progress monitoring functionality.

The six components communicate via queues that enable multithreading inside each component and additionally offer a viable way of utilizing multiple GPU devices on a single compute node. Furthermore, this design offers the possibility of repeated execution of a computation on GPU and CPU hardware. This can easily be realized by states inside a calculation object that contains a set of tasks to handle a group of comparisons. A calculation object contains two important pieces of information: (a) a description of the entities to be compared, and (b) a set of instructions that are to be issued when a comparison is performed.

[Fig. 2. SEGA GPU architecture overview.]

Figure 2 shows the orchestration of the six components involved in the comparison of protein binding sites using our GPU-enhanced implementation of the SEGA algorithm. Furthermore, it also illustrates the data flow through the framework.

The Selector component is the entrance point of the framework. It provides both an interconnection to a data store with caching capabilities and the program logic that controls which entities should be compared next. To perform the SEGA comparisons, the Selector combines a set of protein cavity identifiers and loads the point cloud data. This information is passed via a queue to the DataProcessor. Additionally, the Selector stores meta-information, such as the tasks in progress, in the Monitor component. In our case, no further work of the algorithm depends on the CPU at this point, so the next component belongs to the GPU.

The decision to split the GPU part into three components is mainly due to the design of modern GPU hardware. The latest generation of GPU hardware offers independent control flows for memory reads, memory writes, and kernel execution induced by the host system. Therefore, the DataProcessor component, containing an arbitrary number of threads, is responsible for converting (if needed) and transferring data from the host system to the GPU device memory. Moreover, each GPU device is controlled by its own set of GPU components to ensure maximum utilization of the given resources. For SEGA, the point cloud data is copied into OpenCL buffers and transferred to the GPU. At this point, we encountered a possible bottleneck in the management of OpenCL memory objects: handling several thousands of objects dramatically reduced the allocation performance. Thus, we had to introduce an additional component responsible for ensuring a simple and efficient reuse of memory objects. Additionally, this allows a safer use of GPU device memory, because such a pre-allocation guarantees that the GPU device memory is not exceeded during execution and also limits the number of computations currently in progress. After a successful write operation to the GPU, the calculation object containing the meta-information is passed via an additional queue to the Launcher component.
The Launcher executes the corresponding GPU kernels, which in the case of SEGA are responsible for creating the polygon data and combining two distance matrices. After completion, the calculation object is pushed into the next queue.
The last GPU-related component is the Dispatcher. It is responsible for reading back the results of the kernel execution to the host memory and, if necessary, for processing the data further. Afterwards, the results are pushed to the ThreadService. Here, the alignments of the polygons are calculated and the results are stored. After successfully finishing a computation, the Monitor component is informed.

The Monitor fulfills two major tasks. First, it creates an interconnection between the Selector and the ThreadService for storing the results. This is necessary to know whether all combinations have been successfully calculated. Additionally, it records the progress of the computation on persistent storage. If a computation is interrupted for unpredictable reasons, such as system failures or disk I/O errors, the computation can be resumed at the correct position.

The described framework has been implemented in Java using the JogAmp JOCL library [7] for controlling the OpenCL platform.
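The overall shape of the queue-connected pipeline can be illustrated with the following Java sketch. The component names follow the description above, but the stage bodies are placeholders, not the real code; the actual framework additionally manages OpenCL buffers, CPU thread pools, and persistent progress via the Monitor:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Structural sketch of the queue-connected pipeline (hypothetical).
final class PipelineSketch {
    // A calculation object groups several comparisons plus their state.
    record CalcObject(long[] comparisonIds) {}

    // Each stage takes objects from an input queue, processes them, and puts
    // them into the next queue; bounded queues throttle GPU memory usage.
    static Thread stage(String name, BlockingQueue<CalcObject> in,
                        BlockingQueue<CalcObject> out) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    CalcObject c = in.take(); // real stages transform c here
                    if (out != null) out.put(c);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, name);
        t.setDaemon(true);
        return t;
    }

    public static void main(String[] args) {
        BlockingQueue<CalcObject> toDataProcessor = new ArrayBlockingQueue<>(64);
        BlockingQueue<CalcObject> toLauncher      = new ArrayBlockingQueue<>(64);
        BlockingQueue<CalcObject> toDispatcher    = new ArrayBlockingQueue<>(64);
        BlockingQueue<CalcObject> toThreadService = new ArrayBlockingQueue<>(64);

        // The Selector fills toDataProcessor; then the GPU-side stages follow:
        stage("DataProcessor", toDataProcessor, toLauncher).start();      // host -> GPU copy
        stage("Launcher",      toLauncher,      toDispatcher).start();    // enqueue kernels
        stage("Dispatcher",    toDispatcher,    toThreadService).start(); // GPU -> host
        stage("ThreadService", toThreadService, null).start();            // CPU alignment, store
    }
}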
C. Cloud Deployment

A common approach for parallelizing a computational problem is its division into three steps: work partitioning and distribution, task computation, and result collection. In the case of a commutative comparison where a self-comparison is not necessary, an input set of n elements results in a total number of n · (n − 1)/2 computations.
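For illustration, these n · (n − 1)/2 comparisons can be numbered consecutively, so that a node can derive its work from a single identifier, as described in the next paragraph. The following Java helper is a hypothetical sketch; the names and the linear-scan decoding are ours, not part of the framework:

// Hypothetical identifier scheme: comparisons (i, j) with i < j are numbered
// 0 .. n*(n-1)/2 - 1, so a worker only needs the identifier (plus the shared
// input data) to know which pair to compare.
final class PairIndex {
    /** Total number of pair-wise comparisons for n cavities. */
    static long total(long n) { return n * (n - 1) / 2; }

    /** Inverse mapping: identifier -> pair {i, j}. A linear scan over the
     *  rows, entirely sufficient next to millisecond-range comparisons. */
    static long[] pair(long id, long n) {
        long i = 0, rowLength = n - 1;
        while (id >= rowLength) { id -= rowLength; i++; rowLength--; }
        return new long[] {i, i + 1 + id};
    }

    public static void main(String[] args) {
        long n = 144_849; // subset size used in the evaluation (Section V)
        System.out.println(total(n));  // 10490543976 comparisons
        System.out.println(java.util.Arrays.toString(pair(0, n)));            // [0, 1]
        System.out.println(java.util.Arrays.toString(pair(total(n) - 1, n))); // [144847, 144848]
    }
}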
A straight-forward approach is to divide the total number of computations by the available number of Cloud nodes. If every comparison is indexed by a single unique identifier, a node simply needs the identifier to perform a comparison. However, a better approach is to divide the total number of comparisons by an arbitrary number that is larger than the available number of nodes. This allows one to start the result collection phase before the end of the task computation phase and, moreover, enables on-demand scheduling of tasks to other nodes in case a node fails. The work partitioning and distribution phase also includes the distribution of the input data. For this purpose, several approaches are possible, such as data replication, network exports, and cluster file systems. Fortunately, in our case the required data of the cavity database could be reduced to about 140 MB. Consequently, the data has been transferred to and loaded into the main memory of each node. Due to the overall runtimes, this has a negligible impact on the total computation time. After data and task distribution, the nodes can calculate their part(s). When a task has finished, its results can be collected from the Cloud and stored locally.

V. EVALUATION

To assess the performance of our approach, several experiments have been conducted. The evaluation is split into two parts. First, the performance gains of SEGA GPU compared to the original SEGA algorithm are investigated. Second, the results of a large scale comparison of protein binding sites on Amazon's EC2 Cluster GPU Instances are presented. The structural data has been taken from the CavBase [16], maintained by the Cambridge Crystallographic Data Centre.
A. SEGA vs. SEGA GPU

The performance of the original SEGA implementation has been measured on a single core of an Intel Core i7-2600 @ 3.40 GHz with 8 GB RAM, whereas the performance of SEGA GPU has been measured on a single NVIDIA GeForce GTX 580 with 3 GB RAM.

The runtimes depend on the number of pseudocenters present in the protein cavities, and thus both SEGA versions have been benchmarked using a subset of the CavBase covering a large spectrum of pseudocenter counts. In particular, the subset consists of cavities whose numbers of pseudocenters range from 15 to 250. For each comparison, cavities matching certain size requirements were selected and compared several times (100 times for SEGA GPU; 10 times for the original SEGA) to calculate the average runtime for a particular size combination.

[Fig. 3. SEGA benchmarks. Each panel plots the runtime (time [ms]) over the numbers of pseudocenters of the two compared cavities: (a) GPU part of SEGA GPU; (b) CPU part of SEGA GPU; (c) maximum of the GPU and CPU parts of SEGA GPU; (d) original SEGA.]

The plots in Figure 3 show the runtimes depending on the number of pseudocenters of each cavity. Figure 3(a) shows the average runtime of the GPU part of a SEGA GPU run, and Figure 3(b) shows the runtime of the CPU part of a SEGA GPU run. It is evident that the required CPU runtime is often higher than the GPU runtime, but never more than twice as high. One could argue that typical cluster nodes offering GPU hardware provide at least two physical CPU cores per GPU; instead, we decided to consider the worst case, shown in Figure 3(c), which plots the maximum of the two preceding graphs. Finally, Figure 3(d) shows the runtimes of the original SEGA implementation.
[Fig. 4. Comparison of the original SEGA and SEGA GPU benchmarks.]

Figure 4 shows the SEGA GPU and original SEGA runtimes in a single plot. It is important to note that the z-axis has a logarithmic scale. It is evident that the SEGA GPU implementation is 10 to 200 times faster (110 times on average) than the original SEGA implementation, depending on the number of pseudocenters in each cavity.

B. SEGA GPU @ Amazon EC2

The main target platform for SEGA GPU is Amazon's EC2 Cluster GPU Instances. Each node (instance type: cg1.4xlarge) has two Intel Xeon X5570 CPUs, 22 GB RAM, and two NVIDIA Tesla M2050 GPUs with 2 GB RAM. Benchmarks between the Tesla M2050 and the GeForce GTX 580 have shown that the GTX 580 is about two times faster than the Tesla. This matches the theoretical GFLOPS specifications from NVIDIA (single precision floating point). Thus, the GPU runtime measured in the previous section corresponds to a single EC2 node.

The subset of the CavBase used in our experiments has been selected based on the following (pharmaceutically meaningful) criteria: the resolution of a cavity must be larger than 2.5 Å, the volume must be between 350 Å³ and 3500 Å³, and a protein must have at least 11 pseudocenters. This resulted in n = 144,849 protein binding sites, leading to n · (n − 1)/2 = 10,490,543,976 comparisons in total.

[Fig. 5. Pseudocenter distribution among the selected subset of the CavBase.]
Using Amazon's EC2 resources with associated costs makes it important to predict the expected total runtime of a computation, especially if a hard limit on the financial budget must be respected. According to Figure 5, the number of pseudocenters of the proteins in the selected subset of the CavBase is not uniformly distributed. Thus, to predict the total runtime, the runtimes of randomly sampled pairs from the CavBase were visualized with boxplots. The blue box enclosed by the lower and upper quartile contains the middle 50% of the data. The distance between the upper and lower quartile defines the interquartile range (IQR), a measure of the variance of the data. The (lower and upper) whiskers visualize the remaining data not contained in the box defined by the lower and upper quartile; their length is bounded by 1.5 · IQR. Data points outside the whiskers are outliers and are marked by a cross. The 50th percentile (median) is visualized by a red line, the confidence interval (α = 0.05) for the mean by a triangle.

[Fig. 6. Boxplots showing the randomly sampled runtime distributions: (a) runtime distribution for SEGA GPU; (b) runtime distribution for SEGA GPU (OpenCL) compared to the original SEGA implementation (pure Java).]

Figure 6(a) shows the boxplot for the SEGA GPU implementation. Figure 6(b) shows a comparison between the original SEGA implementation and SEGA GPU to exemplify the performance gain.
A runtime per comparison of 1.7 ms was expected based on the boxplot. To use the infrastructure provided by Amazon EC2 efficiently, the entire computation was divided to run on 8 Amazon EC2 Cluster GPU Instances in parallel. The comparisons were grouped into 4096 packages, distributed by assigning 512 packages to each node. Given the runtime of a single comparison and a total number of about 10.5 billion comparisons, a runtime of about 24 days on eight EC2 nodes was expected. In reality, the computation took about 22 days to complete. The cost was about 6,700 US-$ (10,490,543,976 comparisons · 1.7 ms / 3,600,000 ms/h · 1.234 US-$/h = 6,113 US-$ for the computations, the rest for storage and network traffic).

In contrast, performing the 10.5 billion comparisons on a single core of an Intel Core i7-2600 @ 3.40 GHz, at about 300 ms runtime per comparison (see Figure 6(b)), would require about 36,425 days (roughly 100 years); on a quad-core node with the same specifications, about 9,106 days (roughly 25 years) are required. If an Amazon High Quad CPU Instance at a cost of 0.40 US-$ per hour were used, the total cost would amount to about 87,421 US-$.
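As a back-of-the-envelope check, the cost estimate above can be reproduced with the following Java snippet. All constants are the figures quoted in the text (class and variable names are ours):

// Reproduces the runtime and cost arithmetic of the paragraph above.
final class CostEstimate {
    public static void main(String[] args) {
        long comparisons = 10_490_543_976L;
        double msPerComparison = 1.7;   // expected runtime per comparison and node
        double usdPerNodeHour  = 1.234; // effective price per instance hour
        double gpuHours = comparisons * msPerComparison / 3_600_000.0;
        System.out.printf("Node hours: %.0f%n", gpuHours);            // ~4954 h
        System.out.printf("Compute cost: %.0f US-$%n",
                gpuHours * usdPerNodeHour);                           // ~6113 US-$
        System.out.printf("Days on 8 nodes: %.1f%n",
                gpuHours / 8 / 24);   // ~25.8 days, close to the ~24 days expected
    }
}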
VI. CONCLUSIONS

In this paper, we have presented a novel approach to significantly speed up the computation times of the SEGA algorithm for a structural comparison of protein binding sites by using the digital ecosystem of a GPU-based Cloud computing infrastructure. The original CPU-based Java version of SEGA has been rewritten in OpenCL to run on NVIDIA GPUs in parallel on a set of Amazon EC2 Cluster GPU Instances. This new implementation of SEGA has been tested on a subset of the protein structure data of the CavBase, requiring an acceptable computation time of about three weeks. Thus, a structural approach to comparing protein binding sites becomes a viable alternative to sequence-based alignment algorithms.

There are several directions for future work. For example, a comparative analysis could be done for the entire protein space in the CavBase, which would not only allow a classification of the protein space into structurally and functionally similar, homologous, and non-homologous protein groups, but would also support the systematic search for unexpected similarities and functional relationships. Furthermore, other algorithms for a structural comparison of protein binding sites could be rewritten to run on GPU hardware to provide further insights.

ACKNOWLEDGEMENTS

This work is partially supported within the LOEWE program of the State of Hesse, Germany, by the German Research Foundation (DFG), and by a research grant provided by Amazon Web Services (AWS) in Education.

REFERENCES

[1] S. F. Altschul. BLAST Algorithm. John Wiley & Sons, Ltd, 2001.
[2] P. J. Artymiuk, A. R. Poirrette, H. M. Grindley, D. W. Rice, and P. Willett. A Graph-theoretic Approach to the Identification of Three-dimensional Patterns of Amino Acid Side-chains in Protein Structures. Journal of Molecular Biology, 243(2):327–344, 1994.
[3] T. Binkowski and A. Joachimiak. Protein functional surfaces: global shape matching and local spatial alignments of ligand binding sites. BMC Structural Biology, 8(1):45–68, 2008.
[4] T. Fober, G. Glinca, G. Klebe, and E. Hüllermeier. Superposition and Alignment of Labeled Point Clouds. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(6):1653–1666, 2011.
[5] T. Fober and E. Hüllermeier. Similarity Measures for Protein Structures Based on Fuzzy Histogram Comparison. Computational Intelligence, pages 18–23, 2010.
[6] M. Jambon, A. Imberty, G. Delage, and C. Geourjon. A new bioinformatic approach to detect common 3D sites in protein structures. Proteins, 52(2):137–145, 2003.
[7] JogAmp Community. JogAmp JOCL. http://jogamp.org/jocl/www/, 2012.
[8] M. A. Kentie. Biological Sequence Alignment on Graphics Processing Units. Master's thesis, Delft University of Technology, 2010.
[9] K. Kinoshita and H. Nakamura. Identification of protein biochemical functions by similarity search using the molecular surface database eF-site. Protein Science, 12(8):1589–1595, 2003.
[10] H. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics, 52(1):7–21, 2005.
[11] D. Lee, O. Redfern, and C. Orengo. Predicting protein function from sequence and structure. Nature Reviews Molecular Cell Biology, 8(12):995–1005, 2007.
[12] W. Liu, B. Schmidt, and W. Müller-Wittig. CUDA-BLASTP: Accelerating BLASTP on CUDA-enabled graphics hardware. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(6):1678–1684, 2011.
[13] Y. Liu, W. Huang, J. Johnson, and S. Vaidya. GPU accelerated Smith-Waterman. In Proceedings of the 6th International Conference on Computational Science (ICCS'06), Part IV, pages 188–195, Berlin, Heidelberg, 2006. Springer-Verlag.
[14] Y. Liu, D. Maskell, and B. Schmidt. CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units. BMC Research Notes, 2(1):73, 2009.
[15] M. Mernberger, G. Klebe, and E. Hüllermeier. SEGA: Semi-global graph alignment for structure-based protein comparison. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(5):1330–1343, 2011.
[16] S. Schmitt, D. Kuhn, and G. Klebe. A New Method to Detect Related Function Among Proteins Independent of Sequence and Fold Homology. Journal of Molecular Biology, 323(2):387–406, 2002.
[17] A. Stark and R. Russell. Annotation in three dimensions. PINTS: Patterns in Non-homologous Tertiary Structures. Nucleic Acids Research, 31(13):3341–3344, 2003.
[18] J. M. Thornton. From genome to function. Science, 292(5524):2095–2097, 2001.
[19] A. Todd, C. Orengo, and J. Thornton. Evolution of function in protein superfamilies, from a structural perspective. Journal of Molecular Biology, 307(4):1113–1143, 2001.
[20] J. R. Ullmann. An Algorithm for Subgraph Isomorphism. Journal of the ACM, 23(1):31–42, 1976.
[21] P. D. Vouzis and N. V. Sahinidis. GPU-BLAST: using graphics processors to accelerate protein sequence alignment. Bioinformatics, 27(2):182–188, 2011.
[22] L. Xie and P. E. Bourne. Detecting evolutionary relationships across existing fold space, using sequence order-independent profile–profile alignments. Proceedings of the National Academy of Sciences of the United States of America, 105(14):5441–5446, 2008.