DataGrid, Prototype of a Biomedical Grid
Methods MIMST 12 © 2003 Schattauer GmbH
V. Breton¹, R. Medina², J. Montagnat³
¹ Laboratoire de Physique Corpusculaire, CNRS-IN2P3, Campus des Cézeaux, Aubière, France
² Laboratoire d’Informatique, de Modélisation et d’Optimisation des Systèmes, Université Blaise Pascal, Campus des Cézeaux, Aubière, France
³ Creatis, CNRS UMR 5515, INSA, Bât. B. Pascal, Villeurbanne, France
Summary

Background: The availability of large amounts of data in heterogeneous formats and the rapid progress in fields such as computer based drug design, medical imaging and medical simulations have led to a growing demand for large computational power and easy accessibility to heterogeneous data sources.

Objectives: The goal is to address these needs by deploying computing grids. Grids provide both large scale, distributed storage facilities and increased computing power. Moreover, grids are a promising tool to foster the synergy between bio-informatics and computerised medical imaging.

Methods: A first biomedical grid is being deployed within the framework of the DataGrid IST project (http://www.edg.org). The goal of the project is to provide a novel environment to support globally distributed scientific exploration involving up to multi-Petabyte datasets.

Results and Conclusions: The first biomedical applications deployed inside the project demonstrate the relevance of the grid paradigm for genomics and medical image processing. They also highlight the specific requirements of the biomedical community.

Keywords
Genomics, medical image processing, computing grid

Methods Inf Med 2003; 42: ■–■

1. Introduction

Bio-informatics and automated medical image analysis are today identified as high priority research by funding agencies because progress in health care is clearly connected to the analysis of genomics data and the diffusion of information technology in medicine.

But how can multiple laboratories around Europe collect genomics and post-genomics data and analyse them in an up-to-date and competitive environment? Large biology or bio-informatics research laboratories have to maintain their own computing resources, but they are facing a challenging growth in the data they need to manage and process with recent algorithms such as data mining.

The medical image processing community is also facing a growing need for large computations to analyse 2D, 3D and 4D images, to simulate medical treatments or surgeries (radiotherapy, plastic surgery …), and to develop computer aided surgery. An increasing need for large computing resources is appearing in hospitals. Physicians should be able to download and process all their patients’ medical data from their office.

The grid paradigm (1, 12) offers CPU and data handling capabilities to the user. Indeed, grids are designed to share multiple computing and data storage resources, interconnected through high bandwidth networks, among a large user community. This differs from the Internet, where the user has to choose which machine to connect to and which information to retrieve among the tremendous amount of data available. On a grid, the exchange of information between computers is hidden from the user. High level services (resource broker, distributed file system…) hide the underlying infrastructure required to respond to the user requests.

This transparency from the user point of view requires an extra layer of software, called middleware. Besides research projects dedicated to developing new middleware or to enhancing the performance of existing middleware, grids should be deployed to address the needs of the biomedical community using state-of-the-art middleware technology.

The DataGrid European project biomedical work package gathers biologists, computer scientists, physicians and physicists around the common goal of deploying a first biomedical grid.

In this paper, we briefly present the DataGrid project, explain the relevance of the grid concept for genomics and medical imaging and describe the first applications being deployed on DataGrid as a proof of concept of a biomedical grid.

2. The DataGrid Project

The goal of the European DataGrid Project (2) is the development of a novel environment to support globally distributed scientific exploration involving multi-Petabyte datasets. The project designs and develops middleware solutions and testbeds capable of scaling to handle Petabytes of distributed data, tens of thousands of resources (processors, disks, etc.), and thousands of simultaneous users. The scale of
the problem and the distribution of the resources and user community preclude straightforward replication of the data on several sites, while the aim of providing a general purpose application environment precludes distributing the data using static policies. This environment is built by combining and extending newly emerging “Grid” technologies to manage large distributed datasets in addition to computational elements. A consequence of this project will be the emergence of fundamentally new modes of scientific exploration, since access to fundamental scientific data is no longer restricted to the producer of that data. While the project focuses on scientific applications such as High Energy Physics, Earth and Biomedical Sciences, issues of sharing data are common to many applications and thus the project has a potential impact on future industrial and commercial activities.

3. The Grid, a New Tool to Face the Challenges of Biomedical Sciences

Biomedical sciences are facing a growth of the amount of data, as well as a growing need for processing larger data sets in order to tackle emerging challenges (comparative genomics, image guided epidemiology…). These data are produced in many laboratories and hospitals that are generally not equipped to archive or to analyse them. The format of these data is highly dependent on the device used to produce them, whether an imaging or a sequencing device. These data are generally confidential and should not be accessed without careful identification.

3.1 Using the Grid for Bio-informatics

Biologists are facing an exponential growth of their databases. Every time a new genome is sequenced and annotated, the whole database is processed again to find new homologies. Additional data coming from post-genomics (micro arrays, protein structure …) must also be added. This information comes in multiple formats, from many laboratories around the world. A laboratory actively involved in genomics or post-genomics faces three basic needs related to bio-informatics:
● The need to acquire and store the data produced with its own experimental resources (mass spectrometer, sequencer, etc.).
● The need to access the web servers (EBI, NCBI, InfoBiogen, etc.) where it can compare its sequences to the public data banks and run the available algorithms.
● The need to store private databases as a result of previous data acquisition and analysis.

Once these basic needs are met, some research teams may want to develop their own algorithms to analyze their data. Others are eager to make their data available to the rest of the community. As a result, the databases made available by the bio-informatics computing centers are updated weekly.

A grid offers the opportunity to provide CPU and storage resources distributed in the laboratories, rather than concentrated in larger and larger computing centers. Its architecture could be a flat grid made of many “small” clusters (10 to 100 CPUs, 1 to 10 TB of disk) where the public databases would be mirrored weekly. Such mirroring can take full advantage of high-throughput networks. The biology laboratories would access these resources through web portals providing grid-enabled algorithms running on distributed databases.

3.2 Using the Grid for Medical Imaging

Medical images are distributed over their production sites (radiology departments, hospitals…). Although there is no widely established standard for sharing data between sites today, there is an increasing need for remote medical data access and processing. Medical image databases are huge (several TB of data produced per year in each hospital) due to the amount of data composing 3D or 4D images.

Automatic processing of these databases is increasingly needed in clinical practice. Indeed, the recent availability of multiple digital acquisition devices in hospitals (X-ray, CT, MRI, US, PET scanners…) is responsible for the increasing amount of digital images. The need for large scale management of medical images led to defining distributed health care information systems (3). However, physicians today do not have the tools needed to easily access medical databases and make use of automated image processing algorithms that could help with diagnosis.

The grid architecture will be extremely valuable for distributing computational resources over a large community of medical users and for easing data access between different centres. Image production centres do not currently have the computation resources needed to process their data. A grid architecture would allow medical centres to share computation resources and make image processing algorithms accessible to physicians in all centres. The grid would be responsible for optimising access to the computation resources available. Moreover, the grid architecture would facilitate the development of telemedicine (4, 10).

The grid is also expected to bring solutions to current problems that cannot be handled by commonly available resources in medical centres. Some medical applications have huge memory and computation requirements and can be parallelized. The grid is expected to provide a parallel architecture in which these applications could be run. Indeed, parallel and distributed architectures have been successfully used to solve challenging problems related to medical image visualisation (5, 6) and processing (7). Other medical studies involve very large databases of images that are not necessarily available on a single site.
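The remark in Section 3.2 that some medical applications can be parallelized may be illustrated by the simplest case, a voxel-wise operation on a 3D image. The following Python sketch is only an illustration under stated assumptions: the volume is a nested list of 2D slices, the operation is a plain threshold, and local threads stand in for grid nodes. None of these choices comes from the DataGrid project, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_slabs(volume, n_slabs):
    """Split a volume (list of 2D slices) into contiguous slabs of near-equal size."""
    n_slabs = max(1, min(n_slabs, len(volume)))
    base, extra = divmod(len(volume), n_slabs)
    slabs, start = [], 0
    for i in range(n_slabs):
        # The first `extra` slabs receive one additional slice.
        stop = start + base + (1 if i < extra else 0)
        slabs.append(volume[start:stop])
        start = stop
    return slabs


def threshold_slab(slab, level):
    """Voxel-wise threshold: 1 where the intensity is >= level, else 0."""
    return [[[1 if v >= level else 0 for v in row] for row in sl] for sl in slab]


def threshold_volume(volume, level, n_workers=4):
    """Process each slab independently, then reassemble the slices in order."""
    slabs = split_into_slabs(volume, n_workers)
    # Threads are only a local stand-in for remote workers (assumption).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(lambda s: threshold_slab(s, level), slabs))
    return [sl for slab in results for sl in slab]


# Tiny synthetic volume: 5 slices of 2 x 3 voxels (hypothetical data).
volume = [[[z + y + x for x in range(3)] for y in range(2)] for z in range(5)]
segmented = threshold_volume(volume, level=3, n_workers=2)
```

Because a threshold looks at one voxel at a time, slabs need no data from their neighbours. Operations that use a neighbourhood (filters, for example) would additionally require overlapping slices at slab borders, which this sketch does not handle.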
4. Grid-Blast, First Use Case in Genomics Comparative Analysis

The first application deployed on the DataGrid biomedical testbed dealt with genomics comparative analysis. BLAST (Basic Local Alignment Search Tool) (8) is a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA. BLAST is typically used by biologists when they need to compare sequences of nucleic acids or amino acids coming from their own research to the ones stored in public databases. The BLAST programs have been designed for speed, with a minimal sacrifice of sensitivity to distant sequence relationships. The scores assigned in a BLAST search have a well-defined statistical interpretation, making real matches easier to distinguish from random background hits.

Many web portals in the world dedicated to genomics comparative analysis offer biologists the possibility to compare their sequences to databases with BLAST. These portals have to restrict the length and the number of sequences to compare in order to avoid saturating their computing resources. A straightforward impact of executing BLAST comparisons on a distant node on a grid is to reduce the work load on the local computers dedicated to the portal. Moreover, the input file of sequences provided by the biologist can be split into smaller sets of sequences that can be compared to the selected database in parallel on several distant grid nodes. This requires an updated copy of the database to be available on these grid nodes.

The impact of executing BLAST on the grid was demonstrated by measuring the time needed to compare the Swissprot database to itself on one Linux Pentium III processor and on a DataGrid testbed. Computing time was reduced by a factor of 80 on the grid.

Based on the experience with the Visual DataGridBlast (9), several bio-informatics algorithms deployed on DataGrid will be made available from the Protein Sequence Analysis portal of the Institut de Biologie et Chimie des Protéines (http://npsa-pbil.ibcp.fr).

5. Design of a Biomedical Grid

Considering the common requirements of bio-informatics and medical imaging, we proposed an architecture for a biomedical grid. In this section, we describe how we perceive the different levels of software between the local operating system and the user and how the community of biomedical users of the grid could organize its work.

As we stressed earlier, running a grid requires an extra layer of software, called middleware. This middleware makes a set of services available to the grid users. Low level services are useful to grid developers and high level services to the end users. These services are made available to biomedical users, who can work on the grid provided they are identified as authorized users. Users belonging to a given community are authenticated through a so-called virtual organization.

5.1 The Different Layers of a Biomedical Grid

The DataGrid project is developing a middleware based on the Globus toolkit. Led by C. Kesselman and I. Foster of Argonne National Laboratory (ANL) and the University of Chicago, the Globus (www.globus.org) project (11) is developing fundamental technologies needed to build computational grids. It provides basic services on top of which scientists can develop application programs. The most fundamental layer consists of a set of core services, including resource management, security, remote execution, file transfer, and communications that enable the linking and interoperation of distributed computer systems. On top of these core services, the DataGrid middleware work packages have been developing an additional layer of services dealing with workload scheduling and management, data management, grid monitoring services, local fabric management and mass storage management. These are general grid services and an extra layer of so-called biomedical services is needed. Among the services specifically relevant to the needs of the biomedical community are distributed data management, automatic mirroring and updating of databases, visualization and interaction with remote processes…

Fig. 1 Structure of the DataGrid biomedical testbed

These biomedical services are available for different families of applications deployed by different groups of users. Three user groups are experimenting with the DataGrid biomedical testbed today:
● Computer scientists are taking advantage of the grid architecture and services to design new distributed and/or parallel algorithms for bio-medical analysis. Grid-aware algorithms are distributed algorithms that benefit from the grid architecture to optimize and parallelize computations. These algorithms rely on an efficient communication interface for message exchanges between parallel processes. They also usually rely on an efficient data management service to access large amounts of data. Grid-aware algorithm development is an emerging research area and mainly involves the definition of new algorithms.
● Bioinformaticians are creating Grid Services Portals. These are actual service providers wishing to take advantage of the grid’s computational power and data storage capacity. Grid Portals may be used to run the presently existing algorithms as well as new grid-aware ones. Many biomedical service providers rely on web-based technologies to offer access to their databases and computational resources. The applications described in this section intend to take advantage of the grid computational
power and the data storage capacity. Existing or new portals should therefore interface to the grid job submission and data management services.
● Biologists and researchers in image guided diagnosis and therapy use the grid as a cooperative framework: their aim is to take advantage of the grid in order to organize their work in a cooperative manner. A computational grid can help the biomedical community by offering a cooperative framework with shared resources as well as shared databases and data formats. It will allow assembling distributed databases, opening new opportunities for large scale studies such as epidemiological studies. The required components are the data management interface for sharing, replicating, updating, and exchanging data, and for offering access to large CPU resources, etc.

These three families of users/applications have different requirements. The users also do not have the same level of computing awareness. For instance, biologists wishing to use the grid as a collaborative framework are not necessarily as skilled in the use of computers as computer scientists developing distributed algorithms.

Fig. 1 gives a schematic representation of the different layers of the DataGrid biomedical testbed.

5.2 The Biomedical Virtual Organization

A grid is, by definition, shared by different groups of users with different goals. These communities of users use common resources but they do not share their data. Virtual organizations are simply a way to organize the different communities and their access to data and resources. Each grid user has to be recorded in one virtual organization where his roles (access rights, authorizations…) are recorded. The DataGrid testbed is used for applications in High Energy Physics, Earth and Biomedical Sciences. Each research field has its own virtual organization: the one for biomedical sciences is shown in Fig. 2. We divided it into two subgroups: one for genomics and one for medical imaging. The users involved in the different applications are recorded according to their application and their home institute.

Seven applications are presently being deployed on the DataGrid biomedical testbed: 5 in genomics and 2 in medical imaging.

We are going to present three of them briefly:
● Data mining on the grid: Knowledge Discovery in Databases (KDD) stands for the non-trivial process of extracting implicit, previously unknown and potentially useful information contained in stored data. The aim of the Université Blaise Pascal is to prepare the future and to evaluate what would be the benefits and the limits of using the grid to mine very large databases.
● Parallel magnetic resonance image simulator: with the increased interest in computer-aided MRI image analysis methods (segmentation, data fusion, quantization, etc.), there is a greater need for objective methods of algorithm evaluation. In this context, an MRI simulator provides an interesting assessment tool since it generates realistic 3D images (volumes) from virtual medical objects. In order to take into account the MRI artifacts, a 3D simulator is under development at CREATIS. The data grid will be specifically of interest for the parallelization of the isochromats and the MR sequences.
● The Bioinformatics initiative in Padova concentrates on the study of indexing techniques to create and to query large databases of 3D structures (13). Indexing techniques, initially proposed within the area of computer vision, are used in different contexts and differ in the type of invariant properties (either local or global), in the transformation class (rigid body or affine transformations), and in the method used to formulate and verify hypotheses of associations of the query object. Within the proposed scheme, the structural data are stored in separate tables, spread across the grid.
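The bookkeeping that Section 5.2 attributes to a virtual organization (each user recorded once, with roles, a subgroup, an application and a home institute) can be illustrated with a minimal Python sketch. It is not the DataGrid implementation: the class names, role strings, user names and institute labels below are all hypothetical, chosen only to make the structure concrete.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Member:
    """One grid user as recorded in a virtual organization (hypothetical model)."""
    name: str
    institute: str                   # home institute
    application: str                 # application the user works on
    subgroup: str                    # e.g. "genomics" or "medical imaging"
    roles: frozenset = frozenset()   # illustrative role names, e.g. "submit_job"


@dataclass
class VirtualOrganization:
    """Registry of the members of one community and of their roles."""
    name: str
    subgroups: tuple
    members: dict = field(default_factory=dict)

    def register(self, member: Member) -> None:
        # Each user is recorded exactly once, in a known subgroup.
        if member.subgroup not in self.subgroups:
            raise ValueError(f"unknown subgroup: {member.subgroup}")
        if member.name in self.members:
            raise ValueError(f"already registered: {member.name}")
        self.members[member.name] = member

    def is_authorized(self, user: str, role: str) -> bool:
        # Users who are not recorded in the organization are never authorized.
        member = self.members.get(user)
        return member is not None and role in member.roles

    def members_of(self, subgroup: str) -> list:
        return sorted(m.name for m in self.members.values() if m.subgroup == subgroup)


# Hypothetical example data, not taken from the DataGrid testbed.
biomed = VirtualOrganization("biomedical", ("genomics", "medical imaging"))
biomed.register(Member("alice", "Institute A", "sequence comparison", "genomics",
                       frozenset({"submit_job", "read_data"})))
biomed.register(Member("bob", "Institute B", "image simulation", "medical imaging",
                       frozenset({"read_data"})))

print(biomed.is_authorized("alice", "submit_job"))  # True
print(biomed.is_authorized("bob", "submit_job"))    # False
print(biomed.members_of("genomics"))                # ['alice']
```

In this sketch membership and roles are plain in-memory data. On a real grid the same information would be held and checked by the security services of the middleware.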
Fig. 2 Biomedical virtual organization

6. Conclusion

Biomedical sciences are facing an exponential growth of the volume of data they need to process and analyze. On one hand, biologists are sequencing more and more genomes and proteins and want to analyze them with more and more sophisticated algorithms. On the other hand, imaging devices are widely spreading in the hospitals, generating terabytes of data that need to be stored and made available to physicians.

The computing needs of bio-informatics and medical informatics are basically of the same nature:
● handle large volumes of data produced in many centers,
● define common standards for their interoperability,
● provide to a large community of users (biologists, physicians) a secured and efficient access to their content.

We have tried to demonstrate that the grid paradigm is a good response to these needs. The DataGrid biomedical work package is the first attempt to develop the specific grid services that will allow the biomedical community to successfully address its challenges.

Acknowledgment
The authors acknowledge the contributions of all the participants to the biomedical work package of DataGrid. Special thanks are due to Christophe Blanchet, Emmanuel Cornillot, Nicolas Jacq, and Christian Michau.

References
1. Foster I, Kesselman C. The Grid, blueprint for a new computing infrastructure. Morgan Kaufmann, San Francisco, 1999.
2. Segal B. Grid computing: the European data project. IEEE Nuclear Science Symposium and Medical Imaging Conference, Lyon, 15-20 October 2000.
3. Thomson M, Johnson W, Goujun J, Lee J, Tierney B, Terdiman JF. Distributed health care imaging information systems. PACS Design and Evaluation: Engineering and Clinical Issues, volume 3035, SPIE Medical Imaging, 1997.
4. Graves S, Tullio J, Downs JH, Kassel N. Telepresence in neurosurgery: the integrated remote neurosurgical system. Medicine meets virtual reality 5, 1997.
5. von Laszewski G, Su MH, Insley JA, Foster I, Bresnahan J, Kesselman C, Thiebaux M, Rivers ML, Wang S, Tieman B, McNulty I. Real-time analysis, visualization, and steering of microtomography experiments at photon sources. 9th SIAM Conference on Parallel Processing for Scientific Computing, April 1999.
6. Li JJ, Miguet S. Parallel volume rendering of medical images. EWPC’92: From theory to sound Practice, pp 332-343, Barcelone, 1992.
7. Miguet S, Nicod JM. An optimal parallel iso-surface extraction algorithm. 4th International Workshop on Parallel Image Analysis, pp 65-78, Lyon, France, December 1995.
8. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990; 215: 403-10.
9. Legré Y, Météry R, Fougas AS, Joubert M. Visual DataGridBlast. Private communication.
10. Montagnat J, Davila E, Magnin IE. 3D objects visualization for remote interactive medical applications. 3D Data Visualization, Processing, and Transmission, Padova, Italy, June 2002.
11. Foster I, Kesselman C. Globus: A Metacomputing Infrastructure Toolkit. International J Supercomputer Applications 1997; 11 (2): 115-28.
12. Foster I, Kesselman C, Tuecke S. The anatomy of the Grid: enabling scalable virtual organizations. International J Supercomputer Applications 2001; 15 (3).
13. Guerra C, Lonardi S, Zanotti G. Analysis of secondary structures of proteins using indexing techniques. IEEE Proc. First Int. Symposium on 3D Data Processing Visualization and Transmission, 2002.

Correspondence to:
Vincent Breton
Laboratoire de Physique Corpusculaire
Campus des Cézeaux
63177 Aubière Cedex, France
E-mail: breton@clermont.in2p3.fr