What is Bioinformatics? A Proposed Definition and Overview of the Field

Page created by Ida Weaver
 
CONTINUE READING
What is Bioinformatics? A Proposed Definition and Overview of the Field
346
© 2001                                   Schattauer GmbH

What is Bioinformatics?
A Proposed Definition and Overview of the Field
N. M. Luscombe, D. Greenbaum, M. Gerstein
Department of Molecular Biophysics and Biochemistry
Yale University, New Haven, USA

Summary                                                       1. Introduction                                    related to this new field has been surging,
                                                                                                                 and now comprise almost 2% of the
Background: The recent flood of data from genome                                                                 annual total of papers in PubMed.
sequences and functional genomics has given rise to
                                                              Biological data are being produced at a                This unexpected union between the two
new field, bioinformatics, which combines elements
of biology and computer science.                              phenomenal rate [1]. For example as                subjects is attributed to the fact that life
Objectives: Here we propose a definition for this             of April 2001, the GenBank repository of           itself is an information technology; an
new field and review some of the research that is             nucleic acid sequences contained                   organism’s physiology is largely deter-
being pursued, particularly in relation to transcriptional    11,546,000 entries [2] and the SWISS-              mined by its genes, which at its most basic
regulatory systems.                                           PROT database of protein sequences con-            can be viewed as digital information.At the
Methods: Our definition is as follows: Bioinformatics         tained 95,320 [3]. On average, these databa-       same time, there have been major advances
is conceptualizing biology in terms of macromolecules         ses are doubling in size every 15 months [2].      in the technologies that supply the initial
(in the sense of physical-chemistry) and then applying        In addition, since the publication of              data; Anthony Kervalage of Celera recent-
“informatics” techniques (derived from disciplines            the H. influenzae genome [4], complete             ly cited that an experimental laboratory
such as applied maths, computer science, and statis-          sequences for nearly 300 organisms have            can produce over 100 gigabytes of data a
tics) to understand and organize the information
                                                              been released, ranging from 450 genes to           day with ease [5]. This incredible pro-
associated with these molecules, on a large-scale.
Results and Conclusions: Analyses in bioinformatics           over 100,000. Add to this the data from the        cessing power has been matched by devel-
predominantly focus on three types of large datasets          myriad of related projects that study gene         opments in computer technology; the most
available in molecular biology: macromolecular struc-         expression, determine the protein structu-         important areas of improvements have
tures, genome sequences, and the results of function-         res encoded by the genes, and detail how           been in the CPU, disk storage and Internet,
al genomics experiments (eg expression data).                 these products interact with one another,          allowing faster computations, better data
Additional information includes the text of scientific        and we can begin to imagine the enormous           storage and revolutionalised the methods
papers and “relationship data” from metabolic path-           quantity and variety of information that is        for accessing and exchanging data.
ways, taxonomy trees, and protein-protein interaction         being produced.
networks. Bioinformatics employs a wide range                    As a result of this surge in data, compu-
of computational techniques including sequence and            ters have become indispensable to biologi-
structural alignment, database design and data
                                                              cal research. Such an approach is ideal
                                                                                                                 1.1 Aims of Bioinformatics
mining, macromolecular geometry, phylogenetic tree
construction, prediction of protein structure and             because of the ease with which computers           In general, the aims of bioinformatics are
function, gene finding, and expression data clustering.       can handle large quantities of data and            three-fold. First, at its simplest bioinfor-
The emphasis is on approaches integrating a variety of        probe the complex dynamics observed in             matics organises data in a way that allows
computational methods and heterogeneous data                  nature. Bioinformatics, the subject of the         researchers to access existing information
sources. Finally, bioinformatics is a practical discipline.   current review, is often defined as the appli-     and to submit new entries as they are
We survey some representative applications, such as           cation of computational techniques to              produced, e.g. the Protein Data Bank for
finding homologues, designing drugs, and performing           understand and organise the information            3D macromolecular structures [6, 7]. While
large-scale censuses. Additional information pertinent        associated with biological macromolecules.         data-curation is an essential task, the in-
to the review is available over the web at                    Fig. 1 shows that the number of papers             formation stored in these databases is
http://bioinfo.mbb.yale.edu/what-is-it.                                                                          essentially useless until analysed. Thus the
                                                                                                                 purpose of bioinformatics extends much
Keywords                                                                                                         further. The second aim is to develop tools
Bioinformatics, Genomics, Introduction, Transcription         Updated version of an invited review paper         and resources that aid in the analysis of
Regulation                                                    that appeared in Haux, R., Kulikowski, C. (eds.)
                                                              (2001). IMIA Yearbook of Medical Informatics       data. For example, having sequenced a par-
Method Inform Med 2001; 40: 346–58                            2001: Digital Libraries and Medicine, pp. 83–99.   ticular protein, it is of interest to compare it
                                                              Stuttgart: Schattauer.                             with previously characterised sequences.

Method Inform Med 4/2001
347
                                                                                                                                                              What is Bioinformatics?

                                                                                                                                     2001). At the next level are protein sequenc-
                                                                                                                                     es comprising strings of 20 amino acid-
                                                                                                                                     letters. At present there are about 400,000
                                                                                                                                     known protein sequences [3], with a typical
                                                                                                                                     bacterial protein containing approximately
                                                                                                                                     300 amino acids. Macromolecular struc-
                                                                                                                                     tural data represents a more complex form
                                                                                                                                     of information. There are currently 15,000
                                                                                                                                     entries in the Protein Data Bank, PDB
                                                                                                                                     [6, 7], containing atomic structures of pro-
                                                                                                                                     teins, DNA and RNA solved by x-ray
                                                                                                                                     crystallography and NMR. A typical PDB
                                                                                                                                     file for a medium-sized protein contains the
                                                                                                                                     xyz-coordinates of approximately 2,000
                                                                                                                                     atoms.
                                                                                                                                         Scientific euphoria has recently centred
                                                                                                                                     on whole genome sequencing. As with the
                                                                                                                                     raw DNA sequences, genomes consist of
                                                                                                                                     strings of base-letters, ranging from 1.6
                                                                                                                                     million bases in Haemophilus influenzae
Fig. 1 Plot showing the growth of scientific publications in bioinformatics between 1973 and 2000. The histogram bars
(left vertical axis) counts the total number of scientific articles relating to bioinformatics, and the black line (right vertical   [10] to 3 billion in humans [11, 12]. The
axis) gives the percentage of the annual total of articles relating to bioinformatics. The data are taken from PubMed.               Entrez database [13] currently has com-
                                                                                                                                     plete sequences for nearly 300 archaeal,
                                                                                                                                     bacterial and eukaryotic organisms. In
    This needs more than just a simple text-                        finally some of the major practical applica-                     addition to producing the raw nucleotide
based search, and programs such as FASTA                            tions of bioinformatics.                                         sequence, a lot of work is involved in
[8] and PSI-BLAST [9] must consider what                                                                                             processing this data. An important aspect
constitutes a biologically significant match.                                                                                        of complete genomes is the distinction
Development of such resources dictates                                                                                               between coding regions and non-coding
expertise in computational theory, as well
as a thorough understanding of biology.
                                                                    2. “…the INFORMATION                                             regions -‘junk’ repetitive sequences making
                                                                                                                                     up the bulk of base sequences especially in
The third aim is to use these tools to ana-                         associated with these                                            eukaryotes. Within the coding regions,
lyse the data and interpret the results in a
biologically meaningful manner. Traditio-                           Molecules…”                                                      genes are annotated with their translated
                                                                                                                                     protein sequence, and often with their
nally, biological studies examined individu-                                                                                         cellular function.
al systems in detail, and frequently com-                           Table 1 lists the types of data that are
pared them with a few that are related. In                          analysed in bioinformatics and the range of
bioinformatics, we can now conduct global                           topics that we consider to fall within the                        Bioinformatics – a Definition1
analyses of all the available data with the                         field. Here we take a broad view and in-
aim of uncovering common principles that                            clude subjects that may not normally be                           (Molecular) bio – informatics: bioinfor-
apply across many systems and highlight                             listed. We also give approximate values                           matics is conceptualising biology in
novel features.                                                     describing the sizes of data being discussed.                     terms of molecules (in the sense of Phy-
    In this review, we provide a systematic                             We start with an overview of the sources                      sical chemistry) and applying “informa-
definition of bioinformatics as shown in                            of information. Most bioinformatics analy-                        tics techniques” (derived from disci-
Box 1. We focus on the first and third aims                         ses focus on three primary sources of data:                       plines such as applied maths, computer
just described, with particular reference to                        DNA or protein sequences, macromolecu-                            science and statistics) to understand and
the keywords: information, informatics,                             lar structures and the results of functional                      organise the information associated
organisation, understanding, large-scale                            genomics experiments. Raw DNA se-                                 with these molecules, on a large scale. In
and practical applications. Specifically, we                        quences are strings of the four base-letters                      short, bioinformatics is a management
discuss the range of data that are currently                        comprising genes, each typically 1,000 bases                      information system for molecular biolo-
being examined, the databases into which                            long. The GenBank [2] repository of                               gy and has many practical applications.
they are organised, the types of analyses                           nucleic acid sequences currently holds a                          1As   submitted to the Oxford English
that are being conducted using transcrip-                           total of 12.5 billion bases in 11.5 million                       Dictionary.
tion regulatory systems as an example, and                          entries (all database figures as of April

                                                                                                                                                          Method Inform Med 4/2001
348
Luscombe, Greenbaum, Gerstein

Table 1 Sources of data used in bioinformatics, the quantity of each type of data that is currently (April 2001) available,   quences. While more biological informa-
and bioinformatics subject areas that utilize this data.                                                                      tion can be derived from a single structure
                                                                                                                              than a protein sequence, the lack of depth
                                                                                                                              in the latter is compensated by analysing
                                                                                                                              larger quantities of data.

                                                                                                                              3. “… ORGANISE the Infor-
                                                                                                                              mation on a LARGE SCALE…”
                                                                                                                              3.1 Redundancy and Multiplicity
                                                                                                                              of Data
                                                                                                                              A concept that underpins most research
                                                                                                                              methods in bioinformatics is that much of
                                                                                                                              the data can be grouped together based on
                                                                                                                              biologically meaningful similarities. For
                                                                                                                              example, sequence segments are often
                                                                                                                              repeated at different positions of genomic
                                                                                                                              DNA [27]. Genes can be clustered into
                                                                                                                              those with particular functions (eg enzy-
                                                                                                                              matic actions) or according to the meta-
                                                                                                                              bolic pathway to which they belong [28],
                                                                                                                              although here, single genes may actually
                                                                                                                              possess several functions [29]. Going
                                                                                                                              further, distinct proteins frequently have
                                                                                                                              comparable sequences – organisms often
                                                                                                                              have multiple copies of a particular gene
                                                                                                                              through duplication and different species
                                                                                                                              have equivalent or similar proteins that
                                                                                                                              were inherited when they diverged from
                                                                                                                              each other in evolution. At a structural
                                                                                                                              level, we predict there to be a finite number
                                                                                                                              of different tertiary structures – estimates
   More recent sources of data have been                        tities of data when experiments are con-                      range between 1,000 and 10,000 folds
from functional genomics experiments, of                        ducted for larger organisms and at more                       [30, 31] – and proteins adopt equivalent
which the most common are gene expres-                          time-points.                                                  structures even when they differ greatly in
sion studies. We can now determine expres-                          Further genomic-scale data include                        sequence [32]. As a result, although the
sion levels of almost every gene in a given                     biochemical information on metabolic                          number of structures in the PDB has
cell on a whole-genome level, however                           pathways, regulatory networks, protein-                       increased exponentially, the rate of discov-
there is currently no central repository for                    protein interaction data from two-hybrid                      ery of novel folds has actually decreased.
this data and public availability is limited.                   experiments, and systematic knockouts of                          There are common terms to describe the
These experiments measure the amount of                         individual genes to test the viability of an                  relationship between pairs of proteins or
mRNA that is produced by the cell [14-18]                       organism.                                                     the genes from which they are derived:
under different environmental conditions,                           What is apparent from this list is the                    analogous proteins have related folds, but
different stages of the cell cycle and differ-                  diversity in the size and complexity of dif-                  unrelated sequences, while homologous
ent cell types in multi-cellular organisms.                     ferent datasets. There are invariably more                    proteins are both sequentially and structu-
Much of the effort has so far focused on the                    sequence-based data than others because                       rally similar. The two categories can some-
yeast [19-24] and human genomes [25, 26].                       of the relative ease with which they can be                   times be difficult to distinguish especially if
One of the largest dataset for yeast has                        produced.This is partly related to the great-                 the relationship between the two proteins
made approximately 20 time-point meas-                          er complexity and information-content of                      is remote [33, 34]. Among homologues, it is
urements for 6,000 genes [19]. However,                         individual structures or gene expression                      useful to distinguish between orthologues,
there is potential for much greater quan-                       experiments compared to individual se-                        proteins in different species that have evolv-

Method Inform Med 4/2001
349
                                                                                                                           What is Bioinformatics?

ed from a common ancestral gene, and            straightforward to access and cross-            producing multiple sequence alignments
paralogues, proteins that are related by        reference these sources of information be-      [48], and searching for functional domains
gene duplication within a genome [35].          cause of differences in nomenclature and        from conserved sequence motifs in such
Normally, orthologues retain the same           file formats.                                   alignments. Investigations of structural
function while paralogues evolve distinct,          At a basic level, this problem is fre-      data include prediction of secondary and
but related functions [36].                     quently addressed by providing external         tertiary protein structures, producing
    An important concept that arises from       links to other databases. For example in        methods for 3D structural alignments [49,
these observations is that of a finite “parts   PDBsum, web-pages for individual struc-         50], examining protein geometries using
list” for different organisms [37-39]: an       tures direct the user towards corresponding     distance and angular measurements, calcu-
inventory of proteins contained within an       entries in the PDB, NDB, CATH, SCOP             lations of surface and volume shapes and
organism, arranged according to different       and SWISS-PROT databases. At a more             analysis of protein interactions with other
properties such as gene sequence, protein       advanced level, there have been efforts to      subunits, DNA, RNA and smaller mole-
fold or function. Taking protein folds as an    integrate access across several data sources.   cules. These studies have lead to molecular
example, we mentioned that with a few           One is the Sequence Retrieval System, SRS       simulation topics in which structural data
exceptions, the tertiary structures of pro-     [41], which allows flat-file databases to be    are used to calculate the energetics in-
teins adopt one of a limited repertoire         indexed to each other; this allows the user     volved in stabilising macromolecular struc-
of folds. As the number of different fold       to retrieve, link and access entries from       tures, simulating movements within macro-
families is considerably smaller than the       nucleic acid, protein sequence, protein         molecules, and computing the energies
number of genes, categorising the proteins      motif, protein structure and bibliographic      involved in molecular docking. The increa-
by fold provides a substantial simplification   databases. Another is the Entrez facility       sing availability of annotated genomic
of the contents of a genome. Similar sim-       [42], which provides similar gateways to        sequences has resulted in the introduction
plifications can be provided by other attri-    DNA and protein sequences, genome               of computational genomics and proteomics
butes such as protein function. As such, we     mapping data, 3D macromolecular structu-        – large-scale analyses of complete genomes
expect this notion of a finite parts list to    res and the PubMed bibliographic database       and the proteins that they encode. Re-
become increasingly common in future            [43].A search for a particular gene in either   search includes characterisation of protein
genomic analyses.                               database will allow smooth transitions to       content and metabolic pathways between
    Clearly, an essential aspect of managing    the genome it comes from, the protein           different genomes, identification of interac-
this large volume of data lies in developing    sequence it encodes, its structure, biblio-     ting proteins, assignment and prediction of
methods for assessing similarities between      graphic reference and equivalent entries for    gene products, and large-scale analyses of
different biomolecules and identifying          all related genes. In our own group, we have    gene expression levels. Some of these re-
those that are related. There are well-docu-    developed the SPINE [44] and PartsList          search topics will be demonstrated in our
mented classifications for all of the main      [39] web resources; these databases inte-       example analysis of transcription regula-
types of data we described earlier. Al-         grate many types of experimental data and       tory systems.
though detailed descriptions of these clas-     organise them using the concept of the              Other subject areas we have included in
sification systems are beyond the scope of      finite “parts list” we described above.         Table 1 are: development of digital libraries
the current review, they are of great impor-                                                    for automated bibliographical searches,
tance as they ease comparisons between                                                          knowledge bases of biological information
genomes and their products. Links to the                                                        from the literature, DNA analysis methods
major databases are available from our
supplementary website.
                                                4. “…UNDERSTAND and                             in forensics, prediction of nucleic acid struc-
                                                                                                tures, metabolic pathway simulations, and
                                                Organise the Information…”                      linkage analysis – linking specific genes to
                                                                                                different disease traits.
                                                Having examined the data, we can discuss            In addition to finding relationships be-
3.2 Data Integration                            the types of analyses that are conducted.As     tween different proteins, much of bioin-
The most profitable research in bioinfor-       shown in Table 1, the broad subject areas in    formatics involves the analysis of one type
matics often results from integrating mul-      bioinformatics can be separated according       of data to infer and understand the obser-
tiple sources of data [40]. For instance, the   to the type of information that is used. For    vations for another type of data. An exam-
3D coordinates of a protein are more useful     raw DNA sequences, investigations involve       ple is the use of sequence and structural
if combined with data about the protein’s       separating coding and non-coding regions,       data to predict the secondary and tertiary
function, occurrence in different genomes,      and identification of introns, exons and        structures of new protein sequences [51].
and interactions with other molecules. In       promoter regions for annotating genomic         These methods, especially the former, are
this way, individual pieces of information      DNA [45, 46]. For protein sequences, ana-       often based on statistical rules derived
are put in context with respect to other        lyses include developing algorithms for         from structures, such as the propensity for
data. Unfortunately, it is not always           sequence comparisons [47], methods for          certain amino acid sequences to produce

                                                                                                                      Method Inform Med 4/2001
350
Luscombe, Greenbaum, Gerstein

                                                                     Paradigm shifts during the past couple of decades have taken much of biology away from the
                                                                     laboratory bench and have allowed the integration of other scientific disciplines, specifically
                                                                     computing. The result is an expansion of biological research in breadth and depth. The vertical axis
                                                                     demonstrates how bioinformatics can aid rational drug design with minimal work in the wet lab.
                                                                     Starting with a single gene sequence, we can determine with strong certainty, the protein
                                                                     sequence. From there, we can determine the structure using structure prediction techniques. With
                                                                     geometry calculations, we can further resolve the protein’s surface and through molecular
                                                                     simulation determine the force fields surrounding the molecule. Finally docking algorithms can
                                                                     provide predictions of the ligands that will bind on the protein surface, thus paving the way for the
                                                                     design of a drug specific to that molecule. The horizontal axis shows how the influx of biological
                                                                     data and advances in computer technology have broadened the scope of biology. Initially with a pair
                                                                     of proteins, we can make comparisons between the between sequences and structures of
                                                                     evolutionary related proteins. With more data, algorithms for multiple alignments of several
                                                                     proteins become necessary. Using multiple sequences, we can also create phylogenetic trees to
                                                                     trace the evolutionary development of the proteins in question. Finally, with the deluge of data we
                                                                     currently face, we need to construct large databases to store, view and deconstruct the
                                                                     information. Alignments now become more complex, requiring sophisticated scoring schemes and
                                                                     there is enough data to compile a genome census – a genomic equivalent of a population census –
                                                                     providing comprehensive statistical accounting of protein features in genomes.

Fig. 2 Organizing and understanding biological data

different secondary structural elements.              transferred between homologous proteins                   analysis in two dimension, depth and
Another example is the use of structural              [55].                                                     breadth. The first is represented by the
data to understand a protein’s function;                                                                        vertical axis in the figure and outlines a
here studies have investigated the rela-                                                                        possible approach to the rational drug
tionship different protein folds and their                                                                      design process. The aim is to take a single
functions [52, 53] and analysed similarities
                                                      4.1 The Bioinformatics Spectrum                           gene and follow through an analysis that
between different binding sites in the ab-            Fig. 2 summarises the main points we                      maximises our understanding of the
sence of homology [54]. Combined with                 raised in our discussions of organising                   protein it encodes. Starting with a gene
similarity measurements, these studies pro-           and understanding biological data – the                   sequence, we can determine the protein
vide us with an understanding of how much             development of bioinformatics techniques                  sequence with strong certainty. From there,
biological information can be accurately              has allowed an expansion of biological                    prediction algorithms can be used to calcu-

Method Inform Med 4/2001
351
                                                                                                                           What is Bioinformatics?

late the structure adopted by the protein.
Geometry calculations can define the
                                                 son methods such as text search and one-
                                                 dimensional alignment algorithms. Motif
                                                                                                6.1 Structural Studies
shape of the protein’s surface and molecu-       and pattern identification for multiple        As of April 2001, there were 379 structures
lar simulations can determine the force          sequences depend on machine learning,          of protein-DNA complexes in the PDB.
fields surrounding the molecule. Finally,        clustering and data-mining techniques. 3D      Analyses of these structures have provided
using docking algorithms, one could              structural analysis techniques include Eu-     valuable insight into the stereochemical
identify or design ligands that may bind         clidean geometry calculations combined         principles of binding, including how par-
the protein, paving the way for designing a      with basic application of physical chemis-     ticular base sequences are recognized
drug that specifically alters the protein’s      try, graphical representations of surfaces     and how the DNA structure is quite often
function. In practise, the intermediate steps    and volumes, and structural comparison         modified on binding.
are still difficult to achieve accurately, and   and 3D matching methods. For molecular            A structural taxonomy of DNA-binding
they are best combined with experimental         simulations, Newtonian mechanics, quan-        proteins, similar to that presented in SCOP
methods to obtain some of the data, for          tum mechanics, molecular mechanics and         and CATH, was first proposed by Harrison
example characterising the structure of the      electrostatic calculations are applied. In     [56] and periodically updated to accom-
protein of interest.                             many of these areas, the computational         modate new structures as they are solved
    The aim of the second dimension, the         methods must be combined with good             [57]. The classification consists of a two-tier
breadth in biological analysis, is to compare    statistical analyses in order to provide an    system: the first level collects proteins into
a gene or gene product with others. Ini-         objective measure for the significance of      eight groups that share gross structural
tially, simple algorithms can be used to         the results.                                   features for DNA-binding, and the second
compare the sequences and structures of a                                                       comprises 54 families of proteins that are
pair of related proteins. With a larger num-                                                    structurally homologous to each other.
ber of proteins, improved algorithms can be
used to produce multiple alignments, and
                                                 6. Transcription Regulation –                  Assembly of such a system simplifies the
                                                                                                comparison of different binding methods; it
extract sequence patterns or structural          a Case Study in Bioinformatics                 highlights the diversity of protein-DNA
templates that define a family of proteins.                                                     complex geometries found in nature, but
Using this data, it is also possible to con-     DNA-binding proteins have a central role       also underlines the importance of inter-
struct phylogenetic trees to trace the evolu-    in all aspects of genetic activity within an   actions between -helices and the DNA
tionary path of proteins. Finally, with even     organism, participating in processes such as   major groove, the main mode of binding in
more data, the information must be stored        transcription, packaging, rearrangement,       over half the protein families. While the
in large-scale databases. Comparisons            replication and repair. In this section, we    number of structures represented in the
become more complex, requiring multiple          focus on the studies that have contributed     PDB does not necessarily reflect the rela-
scoring schemes, and we are able to con-         to our understanding of transcription          tive importance of the different proteins in
duct genomic scale censuses that provide         regulation in different organisms. Through     the cell, it is clear that helix-turn-helix,
comprehensive statistical accounts of            this example, we demonstrate how bio-          zinc-coordinating and leucine zipper motifs
protein features, such as the abundance of       informatics has been used to increase our      are used repeatedly.These provide compact
particular structures or functions in diffe-     knowledge of biological systems and also       frameworks to present the -helix on the
rent genomes. It also allows us to build         illustrate the practical applications of the   surfaces of structurally diverse proteins. At
phylogenetic trees that trace the evolution      different subject areas that were briefly      a gross level, it is possible to highlight the
of whole organisms.                              outlined earlier. We start by considering      differences between transcription factor
                                                 structural analyses of how DNA-binding         domains that “just” bind DNA and those
                                                 proteins recognise particular base se-         involved in catalysis [58]. Although there
5. “… applying INFORMATICS                       quences. Later, we review several genomic
                                                 studies that have characterised the nature
                                                                                                are exceptions, the former typically
                                                                                                approach the DNA from a single face and
TECHNIQUES…”                                     of transcription factors in different orga-    slot into the grooves to interact with base
                                                 nisms, and the methods that have been used     edges. The latter commonly envelope the
The distinct subject areas we mention            to identify regulatory binding sites in the    substrate, using complex networks of
require different types of informatics tech-     upstream regions. Finally, we provide an       secondary structures and loops.
niques. Briefly, for data organisation, the      overview of gene expression analyses that         Focusing on proteins with -helices, the
first biological databases were simple flat      have been recently conducted and suggest       structures show many variations, both in
files. However with the increasing amount        future uses of transcription regulatory ana-   amino acid sequences and detailed geo-
of information, relational database              lyses to rationalise the observations made     metry. They have clearly evolved indepen-
methods with Web-page interfaces have            in gene expression experiments. All the        dently in accordance with the requirements
become increasingly popular. In sequence         results that we describe have been found       of the context in which they are found.
analysis, techniques include string compari-     through computational studies.                 While achieving a close fit between the

                                                                                                                      Method Inform Med 4/2001
352
Luscombe, Greenbaum, Gerstein

-helix and major groove, there is enough        interactions of arginine or lysine with          6.2 Genomic Studies
flexibility to allow both the protein and        guanine, asparagine or glutamine with
DNA to adopt distinct conformations.             adenine and threonine with thymine. Such         Due to the wealth of biochemical data that
However, several studies that analysed the       preferences were explained through exami-        are available, genomic studies in bioin-
binding geometries of -helices demon-           nation of the stereochemistry of the amino       formatics have concentrated on model
strated that most adopt fairly uniform con-      acid side chains and base edges. Also            organisms, and the analysis of regulatory
formations regardless of protein family.         highlighted were more complex types of           systems has been no exception. Identification
They are commonly inserted in the major          interactions where single amino acids            of transcription factors in genomes invari-
groove sideways, with their lengthwise axis      contact more than one base-step simulta-         ably depends on similarity search strate-
roughly parallel to the slope outlined by        neously, thus recognising a short DNA            gies, which assume a functional and evolu-
the DNA backbone. Most start with the            sequence. These results suggested that           tionary relationship between homologous
N-terminus in the groove and extend out,         universal specificity, one that is observed      proteins. In E. coli, studies have so far
completing two to three turns within             across all protein-DNA complexes, indeed         estimated a total of 300 to 500 transcription
contacting distance of the nucleic acid [59,     exists. However, many interactions that are      regulators [71] and PEDANT [72], a data-
60].                                             normally considered to be non-specific,          base of automatically assigned gene funct-
   Given the similar binding orientations, it    such as those with the DNA backbone, can         ions, shows that typically 2-3% of pro-
is surprising to find that the interactions      also provide specificity depending on the        karyotic and 6-7% of eukaryotic genomes
between each amino acid position along           context in which they are made.                  comprise DNA-binding proteins.As assign-
the -helices and nucleotides on the DNA            Armed with an understanding of                ments were only complete for 40-60% of
vary considerably between different pro-         protein structure, DNA-binding motifs and        genomes as of August 2000, these figures
tein families. However, by classifying the       side chain stereochemistry, a major applica-     most likely underestimate the actual num-
amino acids according to the sizes of their      tion has been the prediction of binding          ber. Nonetheless, they already represent a
side chains, we are able to rationalise the      either by proteins known to contain a parti-     large quantity of proteins and it is clear that
different interactions patterns. The rules of    cular motif, or those with structures solved     there are more transcription regulators
interactions are based on the simple pre-        in the uncomplexed form. Most common             in eukaryotes than other species. This is
mise that for a given residue position on        are predictions for -helix-major groove         unsurprising, considering the organisms
-helices in similar conformations, small        interactions – given the amino acid se-          have developed a relatively sophisticated
amino acids interact with nucleotides that       quence, what DNA sequence would it               transcription mechanism.
are close in distance and large amino acids      recognise [61, 67]. In a different approach,        From the conclusions of the structural
with those that are further [60, 61]. Equi-va-   molecular simulation techniques have been        studies, the best strategy for characterising
lent studies for binding by other structural     used to dock whole proteins and DNAs on          DNA-binding of the putative transcription
motifs, like -hairpins, have also been con-     the basis of force-field calculations around     factors in each genome is to group them
ducted [62]. When considering these              the two molecules [68, 69].                      by homology and to analyse the individual
interactions, it is important to remember           The reason that both methods have             families. Such classifications are provided
that different regions of the protein surface    been met with limited success is because         in the secondary sequence databases
also provide interfaces with the DNA.            even for apparently simple cases like -         described earlier and also those that
   This brings us to look at the atomic level    helix-binding, there are many other factors      specialise in regulatory proteins such as
interactions between individual amino            that must be considered. Comparisons             RegulonDB [73] and TRANSFAC [74].
acid-base pairs. Such analyses are based on      between bound and unbound nucleic acid           Of even greater use is the provision of
the premise that a significant proportion of     structures show that DNA-bending is a            structural assignments to the proteins;
specific DNA-binding could be rationalised       common feature of complexes formed with          given a transcription factor, it is helpful to
by a universal code of recognition between       transcription factors [58, 70].This and other    know the structural motif that it uses for
amino acids and bases, ie whether certain        factors such as electrostatic and cation-        binding, therefore providing us with a
protein residues preferably interact with        mediated interactions assist indirect            better understanding of how it recognises
particular nucleotides regardless of the         recognition of the nucleotide sequence,          the target sequence. Structural genomics
type of protein-DNA complex [63]. Studies        although they are not well understood yet.       through bioinformatics assigns structures
have considered hydrogen bonds, van der          Therefore, it is now clear that detailed rules   to the protein products of genomes by
Waals contacts and water-mediated bonds          for specific DNA-binding will be family          demonstrating similarity to proteins of
[64-66]. Results showed that about 2/3 of all    specific, but with underlying trends such as     known structure [75]. These studies have
interactions are with the DNA back-              the arginine-guanine interactions.               shown that prokaryotic transcription fac-
bone and that their main role is one of                                                           tors most frequently contain helix-turn-
sequence-independent stabilisation. In                                                            helix motifs [71, 76] and eukaryotic factors
contrast, interactions with bases display                                                         contain homeodomain type helix-turn-
some strong preferences, including the                                                            helix, zinc finger or leucine zipper motifs.

Method Inform Med 4/2001
353
                                                                                                                            What is Bioinformatics?

From the protein classifications in each         specific but different members bind distinct    approach, it was found that at least 27% of
genome, it is clear that different types of      base sequences. Here protein residues           known E. coli DNA-regulatory motifs are
regulatory proteins differ in abundance and      undergo frequent mutations, and family          conserved in one or more distantly related
families significantly differ in size. A study   members can be divided into subfamilies         bacteria [84].
by Huynen and van Nimwegen [77] has              according to the amino acid sequences               The detection of regulatory sites in
shown that members of a single family have       at base-contacting positions; those in the      eukaryotes poses a more difficult problem
similar functions, but as the requirements       same subfamily are predicted to bind the        because consensus sequences tend to be
of this function vary over time, so does         same DNA sequence and those of different        much shorter, variable, and dispersed over
the presence of each gene family in the          subfamilies to bind distinct sequences. On      very large distances. However, initial stud-
genome.                                          the whole, the subfamilies corresponded         ies in S. cerevisiae provided an interesting
   Most recently, using a combination of         well with the proteins’ functions and mem-      observation for the GATA protein in nitro-
sequence and structural data, we examined        bers of the same subfamilies were found to      gen metabolism regulation. While the 5
the conservation of amino acid sequences         regulate similar transcription pathways.        base-pair GATA consensus sequence is
between related DNA-binding proteins,            The combined analysis of sequence and           found almost everywhere in the genome, a
and the effect that mutations have on            structural data described by this study pro-    single isolated binding site is insufficient to
DNA sequence recognition. The structural         vided an insight into how homologous            exert the regulatory function [85]. There-
families described above were expanded to        DNA-binding scaffolds achieve different         fore specificity of GATA activity comes
include proteins that are related by sequence    specificities by altering their amino acid      from the repetition of the consensus se-
similarity, but whose structures remain          sequences. In doing so, proteins evolved        quence within the upstream regions of con-
unsolved. Again, members of the same             distinct functions, therefore allowing          trolled genes in multiple copies. An initial
family are homologous, and probably derive       structurally related transcription factors to   study has used this observation to predict
from a common ancestor.                          regulate expression of different genes.         new regulatory sites by searching for over-
   Amino acid conservations were calculat-       Therefore, the relative abundance of tran-      represented oligonucleotides in non-coding
ed for the multiple sequence alignments          scription regulatory families in a genome       regions of yeast and worm genomes [86,
of each family [78]. Generally, alignment        depends, not only on the importance of a        87].
positions that interact with the DNA are         particular protein function, but also in the        Having detected the regulatory binding
better conserved than the rest of the pro-       adaptability of the DNA-binding motifs to       sites, there is the problem of defining the
tein surface, although the detailed patterns     recognise distinct nucleotide sequences.        genes that are actually regulated, commonly
of conservation are quite complex. Residues      This, in turn, appears to be best accommo-      termed regulons. Generally, binding sites
that contact the DNA backbone are highly         dated by simple binding motifs, such as the     are assumed to be located directly upstream
conserved in all protein families, providing     zinc fingers.                                   of the regulons; however there are different
a set of stabilising interactions that are          Given the knowledge of the transcription     problems associated with this assumption
common to all homologous proteins. The           regulators that are contained in each           depending on the organism. For prokary-
conservation of alignment positions that         organism, and an understanding of how           otes, it is complicated by the presence of
contact bases, and recognise the DNA             they recognise DNA sequences, it is of          operons; it is difficult to locate the regulat-
sequence, are more complex and could be          interest to search for their potential bind-    ed gene within an operon since it can lie
rationalised by defining a three-class model     ing sites within genome sequences [79].         several genes downstream of the regulatory
for DNA-binding. First, protein families         For prokaryotes, most analyses have in-         sequence. It is often difficult to predict the
that bind non-specifically usually contain       volved compiling data on experimentally         organisation of operons [88], especially to
several conserved base-contacting residues;      known binding sites for particular proteins     define the gene that is found at the head,
without exception, interactions are made in      and building a consensus sequence that in-      and there is often a lack of long-range con-
the minor groove where there is little           corporates any variations in nucleotides.       servation in gene order between related
discrimination between base types. The           Additional sites are found by conducting        organisms [89]. The problem in eukaryotes
contacts are commonly used to stabilise          word-matching searches over the entire          is even more severe; regulatory sites often
deformations in the nucleic acid structure,      genome and scoring candidate sites by           act in both directions, binding sites are
particularly in widening the DNA minor           similarity [80-83]. Unsurprisingly, most of     usually distant from regulons because of
groove. The second class comprise families       the predicted sites are found in non-coding     large intergenic regions, and transcription
whose members all target the same                regions of the DNA [80] and the results of      regulation is usually a result of combined
nucleotide sequence; here, base-contacting       the studies are often presented in databases    action by multiple transcription factors in a
positions are absolutely or highly conser-       such as RegulonDB [73]. The consensus           combinatorial manner.
ved allowing related proteins to target the      search approach is often complemented by            Despite these problems, these studies
same sequence.                                   comparative genomic studies searching           have succeeded in confirming the transcrip-
   The third, and most interesting, class        upstream regions of orthologous genes in        tion regulatory pathways of well-character-
comprises families in which binding is also      closely related organisms. Through such an      ized systems such as the heat shock response

                                                                                                                       Method Inform Med 4/2001
354
Luscombe, Greenbaum, Gerstein

system [83]. In addition, it is feasible to      relates to the functions associated with         the distinction between different forms of
experimentally verify any predictions, most      these folds; the former are commonly in-         cancer – for example subclasses of acute
notably using gene expression data.              volved in metabolic pathways and the latter      leukaemia – has been well established, it is
                                                 in signalling or transport processes [98].       still not possible to establish a clinical diag-
                                                 This is also reflected in the relationship       nosis on the basis of a single test. In a
                                                 with subcellular localisations of proteins,      recent study, acute myeloid leukaemia and
6.3 Gene Expression Studies                      where expression of cytoplasmic proteins is      acute lymphoblastic leukaemia were suc-
Many expression studies have so far              high, but nuclear and membrane proteins          cessfully distinguished based on the ex-
focused on devising methods to cluster           tend to be low [99, 100].                        pression profiles of these cells [26]. As the
genes by similarities in expression profiles.       More complex relationships have also          approach does not require prior biological
This is in order to determine the proteins       been assessed. Conventional wisdom is that       knowledge of the diseases, it may provide a
that are expressed together under different      gene products that interact with each other      generic strategy for classifying all types of
cellular conditions. Briefly, the most com-      are more likely to have similar expression       cancer.
mon methods are hierarchical clustering,         profiles than if they do not [101, 102]. How-        Clearly, an essential aspect of under-
self-organising maps, and K-means cluster-       ever, a recent study showed that this rela-      standing expression data lies in understand-
ing. Hierarchical methods originally de-         tionship is not so simple [103]. While ex-       ing the basis of transcription regulation.
rived from algorithms to construct phyloge-      pression profiles are similar for gene pro-      However, analysis in this area is still limited
netic trees, and group genes in a “bottom-       ducts that are permanently associated, for       to preliminary analyses of expression levels
up” fashion; genes with the most similar         example in the large ribosomal subunit,          in yeast mutants lacking key components of
expression profiles are clustered first, and     profiles differ significantly for products       the transcription initiation complex [19,
those with more diverse profiles are             that are only associated transiently, includ-    107].
included iteratively [90-92]. In contrast, the   ing those belonging to the same metabolic
self-organising map [93, 94] and K-means         pathway.
methods [95, 96] employ a “top-down”
approach in which the user pre-defines the
                                                    As described below, one of the main
                                                 driving forces behind expression analysis
                                                                                                  7 “… many PRACTICAL
number of clusters for the dataset. The          has been to analyse cancerous cell lines         APPLICATIONS…”
clusters are initially assigned randomly, and    [104]. In general, it has been shown that dif-
the genes are regrouped iteratively until        ferent cell lines (eg epithelial and ovarian     Here, we describe some of the major uses
they are optimally clustered.                    cells) can be distinguished on the basis of      of bioinformatics.
    Given these methods, it is of interest to    their expression profiles, and that these
relate the expression data to other attri-       profiles are maintained when cells are           7.1 Finding Homologues
butes such as structure, function and sub-       transferred from an in vivo to an in vitro
cellular localisation of each gene product.      environment [105]. The basis for their phy-      As described earlier, one of the driving
Mapping these properties provides an             siological differences were apparent in the      forces behind bioinformatics is the search
insight into the characteristics of proteins     expression of specific genes; for example,       for similarities between different biomole-
that are expressed together, and also sug-       expression levels of gene products neces-        cules. Apart from enabling systematic orga-
gest some interesting conclusions about the      sary for progression through the cell cycle,     nisation of data, identification of protein
overall biochemistry of the cell. In yeast,      especially ribosomal genes, correlated well      homologues has some direct practical uses.
shorter proteins tend to be more highly          with variations in cell proliferation rate.      The most obvious is transferring informa-
expressed than longer proteins, probably         Comparative analysis can be extended to          tion between related proteins. For example,
because of the relative ease with which they     tumour cells, in which the underlying            given a poorly characterised protein, it is
are produced [97]. Looking at the amino          causes of cancer can be uncovered by             possible to search for homologues that are
acid content, highly expressed genes are         pinpointing areas of biological variations       better understood and with caution, apply
generally enriched in alanine and glycine,       compared to normal cells. For example in         some of the knowledge of the latter to the
and depleted in asparagine; these are            breast cancer, genes related to cell prolif-     former. Specifically with structural data,
thought to reflect the requirements of           eration and the IFN-regulated signal trans-      theoretical models of proteins are usually
amino acid usage in the organism, where          duction pathway were found to be upregu-         based on experimentally solved structures
synthesis of alanine and glycine are energe-     lated [25, 106]. One of the difficulties in      of close homologues [108]. Similar tech-
tically less expensive than asparagine.Turn-     cancer treatment has been to target specific     niques are used in fold recognition in which
ing to protein structure, expression levels of   therapies to pathogenetically distinct tu-       tertiary structure predictions depend on
the TIM barrel and NTP hydrolase folds           mour types, in order to maximise efficacy        finding structures of remote homologues
are highest, while those for the leucine         and minimise toxicity. Thus, improvements        and checking whether the prediction is
zipper, zinc finger and transmembrane            in cancer classifications have been central      energetically viable [109]. Where biochem-
helix-containing folds are lowest. This          to advances in cancer treatment. Although        ical or structural data are lacking, studies

Method Inform Med 4/2001
355
                                                                                                                                                                   What is Bioinformatics?

Fig. 3 Above is a schematic outlining how scientists can use bioinformatics to aid rational drug discovery. MLH1 is a human gene encoding a mismatch repair protein (mmr) situated on the
short arm of chromosome 3. Through linkage analysis and its similarity to mmr genes in mice, the gene has been implicated in nonpolyposis colorectal cancer. Given the nucleotide sequence,
the probable amino acid sequence of the encoded protein can be determined using translation software. Sequence search techniques can be used to find homologues in model organisms, and
based on sequence similarity, it is possible to model the structure of the human protein on experimentally characterised structures. Finally, docking algorithms could design molecules that
could bind the model structure, leading the way for biochemical assays to test their biological activity on the actual protein.

could be made in low-level organisms like                       simplifies the problem of understanding                          missing in humans. On a smaller scale,
yeast and the results applied to homo-                          complex genomes by analysing simple                              structural differences between similar pro-
logues in higher-level organisms such as                        organisms first and then applying the                            teins may be harnessed to design drug
humans, where experiments are more                              same principles to more complicated ones –                       molecules that specifically bind to one
demanding.                                                      this is one reason why early structural                          structure but not another.
   An equivalent approach is also employed                      genomics projects focused on Mycoplasma
in genomics. Homologue-finding is exten-                        genitalium [75].
sively used to confirm coding regions in                            Ironically, the same idea can be applied
newly sequenced genomes and functional                          in reverse. Potential drug targets are quickly
                                                                                                                                 7.2 Rational Drug Design
data is frequently transferred to annotate                      discovered by checking whether homo-                             One of the earliest medical applications of
individual genes. On a larger scale, it also                    logues of essential microbial proteins are                       bioinformatics has been in aiding rational

                                                                                                                                                             Method Inform Med 4/2001
356
Luscombe, Greenbaum, Gerstein

drug design. Fig. 3 outlines the commonly        protein folds are often related to specific      this way, we are able to examine individual
cited approach, taking the MLH1 gene pro-        biochemical functions [52, 53], these find-      systems in detail and also compare them
duct as an example drug target. MLH1 is a        ings highlight the diversity of metabolic        with those that are related in order to un-
human gene encoding a mismatch repair            pathways in different organisms [36, 89].        cover common principles that apply across
protein (mmr) situated on the short arm of           As we discussed earlier, one of the most     many systems and highlight unusual fea-
chromosome 3 [110]. Through linkage ana-         exciting new sources of genomic information      tures that are unique to some.
lysis and its similarity to mmr genes in mice,   is the expression data. Combining expression
the gene has been implicated in nonpoly-         information with structural and functional       Acknowledgments
                                                                                                  We thank Patrick McGarvey for comments on
posis colorectal cancer [111]. Given the         classifications of proteins we can ask           the manuscript.
nucleotide sequence, the probable amino          whether the high occurrence of a protein
acid sequence of the encoded protein can         fold in a genome is indicative of high ex-
be determined using translation software.        pression levels [97]. Further genomic scale
Sequence search techniques can then be           data that we can consider in large-scale sur-
                                                                                                  References
                                                                                                   1. Reichhardt T. It’s sink or swim as a tidal wave
used to find homologues in model orga-           veys include the subcellular localisations of        of data approaches. Nature 1999. 399 (6736):
nisms, and based on sequence similarity, it      proteins and their interactions with each            517-20.
is possible to model the structure of the        other [114-116]. In conjunction with struc-       2. Benson DA, et al. GenBank. Nucleic Acids
human protein on experimentally character-       tural data, we can then begin to compile a           Res 2000; 28 (1): 15-8.
                                                                                                   3. Bairoch A, Apweiler R. The SWISS-PROT
ized structures. Finally, docking algorithms     map of all protein-protein interactions in           protein sequence database and its supplement
could design molecules that could bind the       an organism.                                         TrEMBL in 2000. Nucleic Acids Res 2000; 28
model structure, leading the way for bio-                                                             (1): 45-8.
chemical assays to test their biological                                                           4. Fleischmann RD, et al. Whole-genome ran-
                                                                                                      dom sequencing and assembly of Haemo-
activity on the actual protein.                                                                       philus influenzae Rd. Science 1995; 269
                                                 8. Conclusions                                       (5223): 496-512.
                                                                                                   5. Drowning in data. The Economist 1999 (26
                                                 With the current deluge of data, compu-              June 1999).
7.3 Large-scale Censuses                         tational methods have become indispens-
                                                                                                   6. Bernstein FC, et al. The Protein Data Bank. A
                                                                                                      computer-based archival file for macromolec-
Although databases can efficiently store all     able to biological investigations. Originally        ular structures. Eur J Biochem 1977; 80 (2):
the information related to genomes, struc-       developed for the analysis of biological se-         319-24.
tures and expression datasets, it is useful to   quences, bioinformatics now encompasses           7. Berman HM, et al. The Protein Data Bank.
                                                                                                      Nucleic Acids Res 2000; 28 (1): 235-42.
condense all this information into under-        a wide range of subject areas including           8. Pearson WR, Lipman DJ. Improved tools for
standable trends and facts that users can        structural biology, genomics and gene ex-            biological sequence comparison. Proc Natl
readily understand. Broad generalisations        pression studies. In this review, we provided        Acad Sci USA 1988; 85 (8): 2444-8.
                                                                                                   9. Altschul SF, et al. Gapped BLAST and PSI-
help identify interesting subject areas for      an introduction and overview of the cur-             BLAST: a new generation of protein database
further detailed analysis, and place new ob-     rent state of field. In particular, we discus-       search programs. Nucleic Acids Res 1997; 25
servations in a proper context. This enables     sed the types of biological information and          (17): 3389-402.
one to see whether they are unusual in any       databases that are commonly used, exa-           10. Fleischmann RD, et al. Whole-genome ran-
                                                                                                      dom sequencing and assembly of Haemo-
way.                                             mined some of the studies that are being con-        philus influenzae Rd. Science 1995; 269
   Through these large-scale censuses, one       ducted – with reference to transcription             (5223): 496-512.
can address a number of evolutionary, bio-       regulatory systems – and finally looked at       11. Lander ES, et al. Initial sequencing and analy-
chemical and biophysical questions. For          several practical applications of the field.         sis of the human genome. Nature 2001; 409:
                                                                                                      860-921.
example, are specific protein folds associat-    Two principal approaches underpin all stud-      12. Venter JC, et al. The sequence of the human
ed with certain phylogenetic groups? How         ies in bioinformatics. First is that of com-         genome. Science 2001; 291 (5507): 1304-51.
common are different folds within partic-        paring and grouping the data according to        13. Tatusova TA, Karsch-Mizrachi I, Ostell JA.
ular organisms? And to what degree are           biologically meaningful similarities and sec-        Complete genomes in WWW Entrez: data
                                                                                                      representation and analysis. Bioinformatics
folds shared between related organisms?          ond, that of analysing one type of data to           1999; 15 (7-8): 536-43.
Does this extent of sharing parallel meas-       infer and understand the observations for        14. Eisen MB, Brown PO. DNA arrays for analy-
ures of relatedness derived from traditional     another type of data. These approaches are           sis of gene expression. Methods Enzymol,
evolutionary trees? Initial studies show         reflected in the main aims of the field,             1999; 303: 179-205.
                                                                                                  15. Cheung VG, et al. Making and reading micro-
that the frequency of folds differs greatly      which are to understand and organise the             arrays. Nat Genet 1999; 21 (1 Suppl): 15-9.
between organisms and that the sharing of        information associated with biological           16. Duggan DJ, et al. Expression profiling using
folds between organisms does in fact follow      molecules on a large scale. As a result,             cDNA microarrays. Nat Genet 1999. 21
                                                                                                      (1 Suppl): 10-4.
traditional phylogenetic classifications         bioinformatics has not only provided great-
                                                                                                  17. Lipshutz RJ, et al. High density synthetic
[37, 112, 113]. We can also integrate data on    er depth to biological investigations, but           oligonucleotide arrays. Nat Genet 1999; 21 (1):
protein functions; given that the particular     added the dimension of breadth as well. In           20-4.

Method Inform Med 4/2001
357
                                                                                                                                           What is Bioinformatics?

18. Velculescu VE, et al. Serial Analysis of Gene      37. Gerstein M, Hegyi H. Comparing genomes in              sequence, structure and function through
    Expression. Detailed Protocol 1999.                    terms of protein structure: surveys of a finite        traditional and probabilistic scores. J Mol Biol
19. Holstege FC, et al. Dissecting the regulatory          parts list. FEMS Microbiol Rev 1998; 22 (4):           2000; 297 (1): 233-49.
    circuitry of a eukaryotic genome. Cell 1998; 95        277-304.                                           56. Harrison SC. A structural taxonomy of DNA-
    (5): 717-28.                                       38. Skolnick J, Fetrow JS. From genes to protein           binding domains. Nature 1991; 353 (6346):
20. Roth FP, Estep PW, Church GM. Finding                  structure and function: novel applications of          715-9.
    DNA regulatory motifs within unaligned non-            computational approaches in the genomic era.       57. Luscombe NM, et al.An overview of the struc-
    coding sequences clustered by whole-genome             Trends Biotech 2000; 18: 34-9.                         tures of protein-DNA complexes. Genome
    mRNA quantitation. Nat Biotech 1998; 16            39. Qian J, et al. PartsList: a web-based system for       Biology 2000; 1 (1): 1-37.
    (10): 939-45.                                          dynamically ranking protein folds based on         58. Jones S, et al. Protein-DNA interactions: A
21. Jelinsky SA, Samson LD. Global response of             disparate attributes, including whole-genome           structural analysis. J Mol Biol 1999; 287 (5):
    Saccharomyces cerevisiae to an alkylating              expression and interaction information.                877-96.
    agent. Proc Natl Acad Sci USA 1999; 96 (4):            Nucleic Acids Res 2001; 29 (8): 1750-64.           59. Suzuki M, Gerstein M. Binding geometry of
    1486-91.                                           40. Gerstein M. Integrative database analysis in           alpha-helices that recognize DNA. Proteins
                                                           structural genomics. Nat Struct Biol 2000; 7           1995; 23 (4): 525-35.
22. Cho RJ, et al. A genome-wide transcriptional           Suppl: 960-3.
    analysis of the mitotic cell cycle. Mol Cell                                                              60. Luscombe NM, Thornton JM. Protein-DNA
                                                       41. Etzold T, Ulyanov A, Argos P. SRS: informa-            interactions: a 3D analysis of alpha-helix-
    1998; 2 (1): 65-73.                                    tion retrieval system for molecular biology data       binding in the major groove. Manuscript in
23. DeRisi JL, Iyer VR, Brown PO. Exploring the            banks. Methods Enzymol 1996; 266: 114-28.              preparation.
    metabolic and genetic control of gene expres-      42. Schuler GD, et al. Entrez: molecular biology
    sion on a genomic scale. Science 1997; 278                                                                61. Suzuki M, et al. DNA recognition code of
                                                           database and retrieval system. Methods                 transcription factors. Protein Eng 1995; 8 (4):
    (5338): 680-6.                                         Enzymol 1996; 266: 141-62.                             319-28.
24. Winzeler EA, et al. Functional characteriza-       43. Wade K. Searching Entrez PubMed and                62. Suzuki M. DNA recognition by a -sheet.
    tion of the S. cerevisiae genome by gene               uncover on the internet. Aviat Space Environ           Protein Eng 1995; 8 (1): 1-4.
    deletion and parallel analysis. Science 1999;          Med 2000; 71 (5): 559.
                                                                                                              63. Seeman NC, Rosenberg JM, Rich A. Sequence
    285 (5429): 901-6.                                 44. Bertone P, et al. SPINE: An integrated
                                                                                                                  specific recognition of double helical nucleic
25. Perou CM, et al. Molecular portraits of human          tracking database and datamining approach
                                                                                                                  acids by proteins. Proc Natl Acad Sci USA
    breast tumours. Nature 2000; 406 (6797):               for high-throughput structural proteomics,
                                                                                                                  1976; 73: 804-8.
    747-52.                                                enabling the determination of the properties
                                                                                                              64. Suzuki M. A framework for the DNA-protein
26. Golub TR, et al. Molecular classification of           of readily characterized proteins. Nucleic
                                                                                                                  recognition code of the probe helix in
    cancer: class discovery and class prediction by        Acids Res. In Press.
                                                                                                                  transcription factors: the chemical and stereo-
    gene expression monitoring. Science 1999; 286      45. Zhang MQ. Promoter analysis of co-regulated
                                                                                                                  chemical rules. Structure 1994; 2 (4): 317-26.
    (5439): 531-7.                                         genes in the yeast genome. Comput Chem
                                                           1999; 23 (3-4): 233-50.                            65. Mandel-Gutfreund Y, Schueler O, Margalit H.
27. Pedersendagger AG, et al. A DNA structural                                                                    Comprehensive analysis of hydrogen bonds in
                                                       46. Boguski MS. Biosequence exegesis. Science
    atlas for Escherichia coli. J Mol Biol 2000; 299                                                              regulatory protein-DNA complexes: in search
                                                           1999; 286 (5439): 453-5.
    (4): 907-30.                                                                                                  of common principles. J Mol Biol 1995; 253
                                                       47. Miller C, Gurd J, Brass A. A RAPID
28. Kanehisa M; Goto S. KEGG: kyoto encyclo-               algorithm for sequence database comparisons:           (2): 370-82.
    pedia of genes and genomes. Nucleic Acids              application to the identification of vector        66. Luscombe NM, Laskowski RA, Thornton JM.
    Res 2000; 28 (1): 27-30.                               contamination in the EMBL databases. Bio-              Protein-DNA interactions: a 3D analysis of
29. Jeffery CJ. Moonlighting proteins. TIBS 1999;          informatics 1999; 15 (2): 111-21.                      amino acid-base interactions. Nucleic Acids
    24 (1): 8-11.                                      48. Gonnet GH, Korostensky C, Brenner S.                   Res. In Press.
30. Chothia, C. Proteins. One thousand families            Evaluation measures of multiple sequence           67. Mandel-Gutfreund Y, Margalit H. Quantita-
    for the molecular biologist. Nature 1992; 357          alignments. J Comput Biol 2000; 7 (1-2):               tive parameters for amino acid-base inter-
    (6379): 543-4.                                         261-76.                                                action: inplications for prediction of protein-
                                                       49. Orengo CA, Taylor WR. SSAP: sequential                 DNA binding sites. Nucleic Acids Res 1998;
31. Orengo CA, Jones DT, Thornton JM. Protein
                                                           structure alignment program for protein struc-         26: 2306-12.
    superfamilies and domain superfolds. Nature
    1994; 372 (6507): 631-4.                               ture comparison. Methods Enzymol 1996; 266:        68. Sternberg MJ, Gabb HA, Jackson RM. Predic-
                                                           617-35.                                                tive docking of protein-protein and protein-
32. Lesk AM, Chothia C. How different amino                                                                       DNA complexes. Curr Opin Struct Biol 1998;
    acid sequences determine similar protein           50. Orengo CA. CORA – topological fingerprints
                                                           for protein structural families. Protein Sci           8 (2): 250-6.
    structures: the structure and evolutionary                                                                69. Aloy P, et al. Modelling repressor proteins
    dynamics of the globins. J Mol Biol 1980; 136          1999; 8 (4): 699-715.
                                                       51. Russell RB, Sternberg MJ. Structure predic-            docking to DNA. Proteins 1998; 33 (4):
    (3): 225-70.                                                                                                  535-49.
                                                           tion. How good are we? Curr Biol 1995; 5 (5):
33. Russell RB, et al. Recognition of analogous            488-90.                                            70. Dickerson RE. DNA-binding: the prevalence
    and homologous protein folds: analysis of          52. Martin AC, et al. Protein folds and functions.         of kinkiness and the virtues of normality.
    sequence and structure conservation. J Mol             Structure 1998; 6 (7): 875-84.                         Nucleic Acids Res 1998; 26 (8): 1906-26.
    Biol 1997; 269 (3): 423-39.                        53. Hegyi H, Gerstein M. The relationship be-          71. Perez-Rueda E, Collado-Vides J. The reper-
34. Russell RB, et al. Recognition of analogous            tween protein structure and function: a com-           toire of DNA-binding transcriptional regula-
    and homologous protein folds – assessment of           prehensive survey with application to the              tors in Escherichia coli K-12. Nucleic Acids
    prediction success and associated alignment            yeast genome. J Mol Biol 1999; 288 (1): 147-64.        Res 2000; 28 (8): 1838-47.
    accuracy using empirical substitution matri-       54. Russell RB, Sasieni PD, Sternberg MJE.             72. Mewes HW, et al. MIPS: a database for geno-
    ces. Protein Eng 1998; 11 (1): 1-9.                    Supersites within superfolds. Binding site             mes and protein sequences. Nucleic Acids Res
35. Fitch WM. Distinguishing homologous from               similarity in the absence of homology. J Mol           2000; 28 (1): 37-40.
    analogous proteins. Syst Zool 1970; 19: 99-110.        Biol 1998; 282 (4): 903-18.                        73. Salgado H, et al. RegulonDB (version 3.0):
36. Tatusov RL, Koonin EV, Lipman DJ. A geno-          55. Wilson CA, Kreychman J, Gerstein M. As-                transcriptional regulation and operon orga-
    mic perspective on protein families. Science           sessing annotation transfer for genomics:              nization in Escherichia coli K-12. Nucleic
    1997; 278 (5338): 631-7.                               quantifying the relations between protein              Acids Res 2000; 28 (1): 65-7.

                                                                                                                                      Method Inform Med 4/2001
You can also read