SQL Multimedia and Application Packages (SQL/MM)

Page created by Lori Tate
 
CONTINUE READING
SQL Multimedia and Application Packages (SQL/MM)
                    Jim Melton                                           Andrew Eisenberg
             Oracle, Sandy, UT 84093                                     IBM, Westford, MA
               jim.melton@acm.org                                   andrew.eisenberg@us.ibm.com

                                                         many incompatible extensions to SQL would be de-
Introduction                                             fined by various data management communities, the
                                                         end result being a situation in which no single prod-
Regular readers of this column will have become          uct could possibly implement all of the extensions
familiar with database language SQL — indeed, most       because of conflicts in keywords (and other related
readers are already familiar with it. We have also       conflicts).
discussed the fact that the SQL standard is being pub-        A summit meeting was held in Tokyo later in
lished in multiple parts and have even discussed one     1992 to seek a solution to the dilemma posed by the
of those parts in some detail[1].                        conflicting demands on SQL extensions. By that
     Another standard, based on SQL and its struc-       time, the SQL standards committees were in the
tured user-defined types[2], has been developed and      process of adding object-oriented extensions to SQL
published by the International Organization for Stan-    and a number of SQL vendors had indicated their
dardization (ISO). This standard, like SQL, is divided   support for what is often called the “object-relational
into multiple parts (more independent than the parts     model”. Based on suggestions from several of those
of SQL, in fact). Some parts of this other standard,     vendors, the Tokyo summit developed the notion of a
known as SQL/MM, have already been published and         second standard that would define several “class li-
are currently in revision, while others are still in     braries” of SQL object types, one for each significant
preparation for initial publication.                     category of complex data.
     In this issue, we introduce SQL/MM and review            The structured types defined in such libraries
each of its parts, necessarily at a high level.          would naturally be first-class SQL types that could be
                     Jim Melton and Andrew Eisenberg     accessed through ordinary SQL:1999 facilities, in-
                                                         cluding expressions that invoke SQL-invoked rou-
SQL Multimedia and Application                           tines associated with such types (that is, methods).
                                                              The proposed standard was immediately known
Packages — SQL/MM                                        as “SQL/MM” (MM for MultiMedia). A number of
In late 1991 or early 1992, a small group of text        candidate data domains were suggested, including
search engine vendors, operating under the auspices      full-text data, spatial data, image data (still and mov-
of the IEEE, released a specification for a language     ing), and others. Responsibility for SQL/MM’s de-
called SFQL (Structured Full-text Query Language).       velopment was given to the same ISO subcommittee
The goal of SFQL was to define extensions to SQL         as SQL (at that time, JTC1/SC21, but now
that would be suitable for applying full-text searches   JTC1/SC32), with the hope that domain experts
to repositories of documents.                            would attend to develop the specifications for each
     The proposal was given significant attention by     data domain.
the full-text community, but was immediately criti-           Like SQL, SQL/MM is a multi-part standard.
cized by several other data management communities       Unlike SQL, the various parts of SQL/MM are quite
on the grounds that SFQL “hijacked” many useful          independent from one another. However, there is one
keywords that were in common use by those other          part that is common to the remainder of the word.
communities. For example, the keyword CONTAINS           Part 1, known as the Framework[3], provides defini-
was proposed by SFQL to mean “the indicated unit of      tions of common concepts use in the other parts and
text contains the supplied word or phrase”, but the      outlines the definitional approach used by those other
spatial data community used the same keyword to          parts. In particular, it describes the manner in which
mean “one spatial entity contains a second spatial       the other parts use SQL’s structured user-defined
entity”. While the high-level semantics of the word      types to define the types required by the subject mat-
may seem to be quite similar in each case, the actual    ter of each part.
code required to implement it is dramatically differ-
ent.
     This controversy was sufficiently generalized
that the SQL standards organizations realized that
Full-Text                                                     document               FULLTEXT )
                                                           in which the docno column contains a value that
The term “full-text” (or, if you prefer, “full text”) is   captures some document identifier and the docu-
normally applied to textual data that differs from or-
                                                           ment column contains a full-text document.
dinary character string data principally in its length,
                                                               We could retrieve from that table the identifier of
but also in database-specific operations that can be
                                                           documents about full-text searching that contain
applied to it. Ordinary character strings are usually
                                                           words closely related to “standard” in the same para-
indexed by their entire values, but special types of
                                                           graph as words that sound like “sequel” by using a
indexes are defined for full-text data; such indexes
                                                           query like this:
might record information about the proximity of
words and phrases to one another or about words that       SELECT docno
appear in a document and related words that do not         FROM information
appear in the same document. Full-text data is subject     WHERE document.CONTAINS
to search operations that are normally not applied to            ('STEMMED FORM OF "standard"
“simple” character strings. It’s worth pointing out               IN SAME PARAGRAPH AS
that “full-text operations” are quite different than the          SOUNDS LIKE "sequel"') = 1
sort of pattern matches (such as regular expressions)
                                                                That query retrieves the docno column from the
with which most computer software people are inti-
mately familiar.                                           information table for every document for which
     The SQL/MM Full-Text standard[4] defines a            the value returned by the CONTAINS method ap-
number of structured user-defined types (henceforth,       plied to the document column is 1, meaning true.
just “UDTs”) to support the storage (presumably in         The parameter passed to that method uses three dif-
an object-relational database) of textual data. One of     ferent full-text operations: STEMMED FORM OF
these types is named FullText and it supports the con-     will find any of several words derived from “stan-
struction of full-text data values, testing whether that   dard”, such as “standards” and “standardization”; IN
data contains specified patterns, and conversion of        SAME PARAGRAPH AS requires that a second
that data to ordinary SQL character strings. The           word (or phrase!) appear in the same paragraph as the
specification of the FullText type includes a number       stemmed word; and SOUNDS LIKE finds words that
of methods that prepare the value associated with an       are pronounced (presumably in English, since we
instance of the type for the application of full-text      didn’t specify a different language) like “sequel” (of
searches, as well as Boolean methods that perform          which “SQL” might be a case).
the searches themselves.
     In addition to the FullText type, a number of ad-     Spatial
ditional types are defined to represent various sorts of   Many enterprises need the ability to store, manage,
patterns that can be used in full-text searches. Search    and retrieve information based on aspects of spatial
patterns can be quite complex, including searching         data, such as geometry, location, and topology. Ap-
for text that includes specific words, words stemmed       plications making use of spatial data include auto-
from (such as the past tense of a verb or the plural of    mated mapping, facilities management, geographic
a noun) specified words, words with similar defini-        systems, graphics, multimedia, and even integrated
tions, and even words that sound like a given word.        circuit design. The SQL/MM Spatial standard[5] de-
     Linguists among our readers will know that            fines SQL:1999 structured user-defined types and
some languages are much more amenable to com-              associated methods to provide the ability to support
puter identification of components of text than others.    such applications.
For example, most Western languages use white                    By its very nature, spatial data often represents
space to separate words from one another and use           2-dimensional and 3-dimensional data. SQL/MM
special punctuation (such as a period, or full stop) to    Spatial currently supports 0-dimensional (point), 1-
separate sentences. Other languages, such as Japa-         dimensional (line), and 2-dimensional (“flat” shape)
nese, do not separate words from one another by            data; future revisions might support 3-dimensional
spaces, depending primarily on context to distinguish      (volumetric shapes) and possibly data of even higher
words. SQL/MM Full-Text is generally acknowl-              dimensions.
edged to have better support for languages for which             There are an astonishingly large number of spa-
automatic distinction of language tokens (such as          tial reference systems in common use, the vast major-
words) is relatively easy.                                 ity of them used to describe geographic entities and
     Consider the following SQL table:                     concepts on the surface of our (relatively) spherical
CREATE TABLE information (                                 planet. Many of those spatial reference systems deal
  docno          INTEGER,                                  with large structures for which the curvature of the
planet is significant; as a result, various systems have        Most Spatial types have accessor methods that
evolved to describe structures in particular regions       permit applications to extract fundamental informa-
(e.g., countries, states and provinces, etc.) for which    tion about instances of the type, such as determining
the impacts of planet curvature vary from the impacts      the values of the X and Y coordinates of a point.
in other regions. (For example, lines of longitude              Consider the following table definition:
converge towards one another as one moves close to
                                                           CREATE TABLE CITY (
the poles—seemingly parallel lines of longitude are
                                                             NAME        VARCHAR(30),
in fact not parallel.)                                       POPULATION INTEGER,
      Support for these spatial reference systems are        CITY_PARKS VARCHAR(30) ARRAY[10],
economically critical to the design of SQL/MM Spa-           LOCATION    ST_GEOMETRY )
tial, because the largest users of spatial data man-
agement systems are often governmental bodies and          We can determine the area of San Francisco by exe-
very large commercial enterprises that have to deal        cuting a query like this:
with geographic data. Such users include local gov-
                                                           SELECT location.area
ernments (city planning, traffic management, acci-         FROM CITY
dent investigation), state and provincial governments      WHERE name = 'San Francisco'
(highway planning, natural resource management),
national governments (defense, border control), ex-             The expression location.area retrieves the
tractive industries (mineral and water location), and      area attribute of the ST_Geometry structured type
farming (plot allocation). Indeed, SQL/MM Spatial’s        value stored in the location column of the row
design seems to more naturally support geospatial          corresponding to San Francisco. (Retrieving the value
data than smaller-scale data such as integrated circuit    of an attribute of a structured type instance is equiva-
design and computer graphics.                              lent to invoking the accessor method on that attrib-
      SQL/MM Spatial defines several type hierar-          ute.)
chies. One of those hierarchies has as its most gener-          SQL/MM Spatial is closely related to, and fun-
alized type (that is, its maximal supertype) a type        damentally aligned with, other spatial standards being
called ST_Geometry. That type is not instantiable          developed by another ISO Technical Committee, TC
(meaning that no instances of it can be created—           211 (Geomatics) and by the Open GIS Consortium
Spatial defined less than a half-dozen such types), but    (“GIS” stands for “Geographic Information Sys-
it has a number of (about a dozen) subtypes that are       tems”). Keeping standards being developed in all
instantiable, such as ST_Point, ST_Curve, and              three forums has proved challenging, but all partici-
ST_MultiPolygon.                                           pants seem committed to doing so.
      A type (not a subtype of ST_Geometry) called
ST_SpatialRefSys is used to describe spatial refer-
ence systems. Every spatial value that participates in     Still Image
a given query must be defined in the same spatial          One of the fastest growing applications of computers
reference system, although a future version of the         is storage and processing of visual images such as
Spatial standard might relax that restriction.             photographs. Many enterprises expend tremendous
      In a future version of SQL/MM Spatial that is        resources on the acquisition, storage, and manage-
currently under development, another pair of types,        ment of collections of images, including graphics,
ST_Angle and ST_Direction, are used to capture in-         paintings, and photographs. Such data has tremen-
formation about various angles and directions that are     dous business value and represents large monetary
needed when storing and managing spatial informa-          outlays. One of the most challenging aspects to han-
tion.                                                      dling image data is that of locating an image already
      There are many operations that can be performed      in your possession.
on Spatial data. Among the most common operations               SQL/MM Still Image[6] represents a part of the
are: construction of a straight line from two points or    solution to those problems. This part of the SQL/MM
from one point, a direction, and a distance; construc-     standard provides structured user-defined types that
tion of a polygon from several lines, from several         allow you to store new images into a database, re-
points, or from a point and a collection of directions     trieve them, modify them in various ways, and—most
and distances. Other important operations are detec-       importantly—to locate them by applying various
tion of whether two lines intersect, whether two areas     “visual” predicates to your collections of images.
overlap or are adjacent to one another, whether a line          In SQL/MM Still Image, images are represented
is tangent to a curve, and whether two polygons share      using an SQL:1999 structured type called
a boundary.                                                SI_StillImage. This type stores collections of picture
                                                           elements (pixels) representing 2-dimensional images.
(Of course, images of 3-dimensional objects are very      SQL/MM Still Image, but it is possible that some
common, but the images themselves are 2-dimen-            future part of SQL/MM will be oriented towards
sional.) Images can be stored in any of several for-      moving images.
mats, depending on what the underlying implemen-
tation supports—for example, formats such as JPEG,        Data Mining
TIFF, and GIF are commonly supported as input and
output formats, as well as formats in which images        The parts of SQL/MM that we’ve presented so far in
are stored and manipulated. The SI_StillImage type        this column are all very reasonably described as ori-
also captures information about each image, such as       ented towards the handling of multimedia data. How-
its format, its dimensions (height and width in pix-      ever, as you saw in the early sections of the column,
els), its color space, and so forth.                      the full name of the SQL/MM standard is SQL Mul-
      Methods applied to SI_StillImage instances in-      timedia and Application Packages. In fact, work was
clude routines to scale an image (change its size pro-    initiated in early 2000 on a new part of SQL/MM that
portionally), to crop an image (remove undesired          does not address multimedia data, but instead defines
parts), rotate an image (such as changing its orienta-    an application package.
tion from horizontal to vertical), and creating a              SQL/MM Data Mining[7] defines SQL struc-
“thumbnail” image (a lower resolution image used          tured user-defined types—including methods on the
for quick display).                                       types—to address an important aspect of modern data
      Another group of data types are used to describe    management: the discovery of previously unknown,
various features of images. The SI_AverageColor           but important, information buried in large quantities
type is used to represent the “average” color of a        of data that might have been collected for other, quite
given image; this value may be used in locating im-       distinct reasons.
ages in collections (imagine wanting to find an image          Data mining is not a new concept; indeed, com-
that is primarily green to be used in advertising out-    panies have long wanted to use data collected in the
door furniture). The SI_ColorHistogram type pro-          ordinary course of business as a source of informa-
vides information about the colors in an image at a       tion about their customers or other resources. A num-
finer level of granularity than the image’s average       ber of relatively small, but important, companies
color; it indicates how much of each color is found in    were founded during the 1990s to provide enterprises
an image. The SI_PositionalColor type represents the      with data mining products, some of them based on
location of specific colors in an image, supporting       relational database systems, but most of them dedi-
queries such as “since sunsets at sea have red and        cated applications that require importing data stored
orange above dark blue, find me images with those         in another repository and reorganizing it into struc-
color characteristic”. Finally, the SI_Texture type       tures unique to a particular data mining approach.
allows the recording of information such as coarse-            SQL/MM Data Mining takes a different view of
ness, contrast, and direction of granularity. An          the problem: It attempts to provide a standardized
SI_FeatureList type permits recording all of the fea-     interface to data mining algorithms that can be lay-
tures described in this paragraph for each image.         ered atop any object-relational database system and
      By combining several features of an image, it is    even deployed as middleware when required.
possible to write queries that can retrieve from a very        In most data management environments, applica-
large image base a much smaller collection of images      tions pose questions to the data repositories that re-
from which you can quickly select the exact image         trieve information based on specific criteria. By
you want. It is also possible to screen collections of    contrast, in a data mining environment, applications
images to find images of potential interest for various   often ask the repository to find out what criteria are
reasons. For example, you might want to determine         most important.
whether a new logo you’ve commissioned might con-              For example, a data mining engine can discover,
flict with other logos that have already been copy-       informing its users of the discovery, that (to use a
righted. An SQL statement like this one:                  famous, if apocryphal, example) about half of the
                                                          customers who buy both disposable diapers and beer
SELECT *                                                  will buy an air freshener product as well. This is not
FROM REGISTERED_LOGOS                                     the sort of question that most users would dream up
WHERE SI_findTexture(newLogo).                            by themselves (it certainly doesn’t come to our minds
        SI_Score(Logo) > 1.2                              very often!), but it is precisely the kind of relation-
would do just what you need.                              ship that a data mining product will discover.
     Of course, not all images are “still”. Additional         A popular question that a data mining product
challenges are posed by moving images, such as digi-      might be asked is “Who are my most important cus-
tized video. That sort of data is not addressed by        tomers and what are the most significant attributes of
those customers and the trends in the values of those          Once a model has been created and trained, it
attributes?” The first part of the question may seem      can be tested by building instances of the
easy—it’s usually straightforward to find out what        DM_MiningData type that holds test data, and in-
customers have bought your products or services           stances of the DM_MiningMapping type that specify
recently. But “most important” may have other mean-       the different columns in a relational table that are to
ings than “recent purchases”—profits are not always       be used as a data source. The result of testing a model
directly related to purchases, since growth rates, ser-   is one or more instances of the DM_*TestResult type
vice demands, and other factors can significantly         (‘*’ can only be ‘Clas’ or ‘Reg’). When running your
affect the meaning of importance.                         model against real data, you get the results in in-
      Data mining tools are also used for predictive      stances of the DM_*Result type (‘*’ can be ‘Clas’,
purposes, such as insurance companies mining data         ‘Clus’, or ‘Reg’…but not ‘Rule’).
on existing customers to help evaluate the risks asso-         In most cases, you also create and use instances
ciated with new customers.                                of DM_*Task types to control the actual testing and
      There are four different data mining techniques     running of your models.
supported by this standard. One technique, the rule            At the time this column went to press, it seemed
model, allows you to search for patterns (“rules”) in     likely that final progression of the SQL/MM Data
the relationships between different parts of your data.   Mining standard might be slowed just a little bit to
A second technique, the clustering model, helps you       ensure that it is fully compatible with a “sister” data
group together data records that share common char-       mining API being developed for Java by the Java
acteristics and identify the most important of those      Community Process.
characteristics. The third technique, the regression
model, helps you predict the ranking of new data          Summary
based on an analysis of existing data. The final tech-
nique, the classification model, is very similar to the   The SQL/MM suite of standards includes a Frame-
regression model, but it is oriented towards predict-     work that describes the conventions used to define
ing which grouping or class new data will best fit        each of the other parts. There are other parts used to
based on its relationship to existing data.               manage full-text data, spatial data, and still images,
      For each of those techniques, as with most data     and to data mining.
mining product, there are three distinct stages                Careful inspection of the references below will
through which you can mine your data. First, you          reveal that there is no part 4 of this multi-part stan-
have to train a model; this means choosing the tech-      dard. That’s because an attempt to develop a set of
nique most appropriate to your goals, then setting a      classes for general mathematical operations was
few parameters to orient the model, and finally train-    eventually determined to satisfy too few users at too
ing the model by applying it to a reasonably-sized        great a cost; development of SQL/MM General Pur-
data set (perhaps several times for improved valid-       pose Facilities was thus abandoned several years ago.
ity). Second, if you’re using the classification or re-        Not all parts of SQL/MM are yet commercially
gression techniques, you can test the model by            successful, but the seems to be growing support at
applying it to known data and comparing the model’s       least for both Full-Text and Spatial by several impor-
predictions with that known data’s classification or      tant players in those fields. Support for Still Image
ranking. Finally, you apply the model to your busi-       seems to be developing more slowly, and it’s far too
ness data and use the results to improve your enter-      soon to say about Data Mining since that part has not
prise.                                                    yet been published. Whether additional data types
      The models are supported through the use of         (such as moving image data) are ever supported de-
several broad categories of new structured user-          pends on many factors, including interest from the
defined types. For each model, a type known as            technical community depending on such data. The
DM_*Model (where the ‘*’ is replaced by ‘Clas’ for        recent surge in consolidation within the database in-
a classification model, ‘Rule’ for a rule model, ‘Clus-   dustry causes some to think that there is a reduction
tering’ for a clustering model, and ‘Regression’ for a    in the need for such standards, but the greater atten-
regression model), is used to define the model that       tion being paid to the Internet and the World Wide
you want to use when mining your data. The models         Web prove that the need for portability of data and of
are parameterized using instances of the                  code continues to increase.
DM_*Settings (‘*’ is ‘Clas’, ‘Rule’, ‘Clus’, or ‘Reg’)         If you’re interested in acquiring copies of the
type and the models are trained using instances of the    SQL/MM standard’s various parts, you can do so at
DM_ClassificationData type. The DM_*Settings type         ANSI’s electronic standards store cited below. Unfor-
allows various parameters of a data mining model,         tunately, even in downloadable (PDF) form, these
such as the depth of a decision tree, to be set.          standards are a bit pricey. We expect that, once they
have been formally adopted as American National           Application Packages — Part 3: Spatial, Interna-
Standards, they will be available at the NCITS web        tional Organization For Standardization, 2000.
store for very reasonable prices.                     [6] ISO/IEC 13249-5:2001, Information technology
                                                          — Database languages — SQL Multimedia and
References                                                Application Packages — Part 5: Still Image, In-
                                                          ternational Organization For Standardization,
[1] Jim Melton, Jan-Eike Michels, Vanja Josifovski,       2001.
    Krishna, Kulkarni, Peter Schwarz, Kathy Zei-      [7] (ISO/IEC) FCD 13249-6, Information technol-
    denstein, SQL and Management of External              ogy — Database languages — SQL Multimedia
    Data, SIGMOD Record, Mar., 2001.                      and Application Packages — Part 6: Data Min-
[2] Jim Melton and Andrew Eisenberg, SQL:1999,            ing. [FCD = Final Committee Draft for ballot]
    formerly known as SQL3, SIGMOD Record,
    Feb. 1999.
[3] ISO/IEC 13249-1:2000, Information technology      Web References
    — Database languages — SQL Multimedia and         [1] National Committee for Information Technology
    Application Packages — Part 1: Framework, In-         Standards (NCITS): http://www.ncits.org
    ternational Organization for Standardization,     [2] NCITS H2 – Database Committee:
    2000.                                                 http://www.ncits.org/tc_home/h2.htm
[4] ISO/IEC 13249-2:2000, Information technology      [3] ISO/IEC JTC 1/SC 32: http://www.jtc1sc32.org
    — Database languages — SQL Multimedia and         [4] ANSI’s Electronic Standards Store:
    Application Packages — Part 2: Full-Text, In-         http://webstore.ansi.org
    ternational Organization For Standardization,     [5] NCITS’ Standards Store:
    2000.                                                 http://www.cssinfo.com/ncits.html
[5] ISO/IEC 13249-3:1999, Information technology      [6] ISO/IEC JTC 1/SC 32 (primarly WG 3, WG 4,
    — Database languages — SQL Multimedia and             and WG 5) archives:
                                                          ftp://www.sqlstandards.org/SC32/
You can also read