XML for the absolute beginner
A guided tour from HTML to processing XML with Java
Summary
In just a few short years, the World Wide Web and HTML have taken the
world by storm. But HTML's limitations and the ever-increasing demand
for more flexibility in Internet systems have XML, the Extensible Markup
Language, brewing on the horizon. Further, Java applications that move
data around need a data representation format as portable as Java
itself. Developers who learn XML now will find it a powerful tool for
data representation, storage, modelling, and
interoperation.
Mark Johnson steps away from his popular JavaBeans
column this month to introduce you to the world of XML: where it came
from, why it's necessary, how it interoperates with existing Internet
technology, and how to use it in your designs. You'll learn about
Cascading Style Sheets and XSL, then follow up with a look at the XML
and Java technology base at a promising Internet startup, with comments
from that company's CEO and technical lead. By the time you've finished
reading Mark's article, you'll understand why so many people are paying
so much attention to this new data representation standard.
(11,000 words)
By Mark Johnson
HTML and the World Wide Web are everywhere. As an example of their
ubiquity, I'm going to Central America for Easter this year, and if I
want to, I'll be able to surf the Web, read my e-mail, and even do
online banking from Internet cafés in Antigua Guatemala and
Belize City. (I don't intend to, however, since doing so would take
time away from a date I have with a palm tree and a rum-filled coconut.)
And yet, despite the omnipresence and popularity of HTML, it is
severely limited in what it can do. It's fine for disseminating
informal documents, but HTML now is being used to do things it was
never designed for. Trying to design heavy-duty, flexible,
interoperable data systems from HTML is like trying to build an
aircraft carrier with hacksaws and soldering irons: the tools
(HTML and HTTP) just aren't up to the job.
The good news is that many of the limitations of HTML have been
overcome in XML, the Extensible Markup Language. XML is easily
comprehensible to anyone who understands HTML, but it is much more
powerful. More than just a markup language, XML is a
metalanguage -- a language used to define new markup
languages. With XML, you can create a language crafted specifically for
your application or domain.
XML will complement, rather than replace, HTML. Whereas HTML is used
for formatting and displaying data, XML represents the contextual
meaning of the data.
This article will present the history of markup languages and how XML
came to be. We'll look at sample data in HTML and move gradually
into XML, demonstrating why it provides a superior way to represent
data. We'll explore the reasons you might need to invent a
custom markup language, and I'll teach you how to do it.
We'll cover the basics of XML notation, and how to
display XML with two different sorts of style languages. Then, we'll
dive into the Document Object Model, a powerful tool for manipulating
documents as objects (or manipulating object structures as documents,
depending upon how you look at it). We'll go over how to write Java
programs that extract information from XML documents, with a pointer to
a free program useful for experimenting with these new concepts.
Finally, we'll take a look at an Internet company that's basing its
core technology strategy on XML and Java.
Is XML for you?
Though this article is written for anyone interested in XML, it has a
special relationship to the JavaWorld series on XML
JavaBeans. (See Resources for links to related articles.) If you've been reading that series and aren't quite "getting it," this article should clarify how to use XML with beans. If you are getting it, this article serves as the perfect companion piece to the XML JavaBeans series, since it covers topics untouched therein.
And, if you're one of the lucky few who still have the XML JavaBeans
articles to look forward to, I recommend that you read the present
article first as introductory material.
A note about Java
There's so much recent XML activity in the computer world that even an
article of this length can only skim the surface. Still, the whole
point of this article is to give you the context you need to use XML in
your Java program designs. This article also covers how XML operates
with existing Web technology, since many Java programmers work in such
an environment.
XML opens the Internet and Java programming to portable, nonbrowser
functionality. XML frees Internet content from the browser in much the
same way Java frees program behavior from the platform. XML makes
Internet content available to real applications.
Java is an excellent platform for using XML, and XML is an outstanding
data representation for Java applications. I'll point out some of
Java's strengths with XML as we go along.
Let's begin with a history lesson.
The origins of markup languages
The HTML we all know and love (well, that we know, anyway) was
originally designed by Tim Berners-Lee at CERN (le Conseil
Européen pour la Recherche Nucléaire, or the
European Laboratory for Particle Physics) in Geneva to allow physics
nerds (and even non-nerds) to communicate with each other. HTML was
released in December 1990 within CERN, and became publicly available in
the summer of 1991 for the rest of us. CERN and Berners-Lee gave away
the specifications for HTML, HTTP, and URLs, in the fine old tradition
of Internet share-and-enjoy.
Berners-Lee defined HTML in SGML, the Standard Generalized Markup
Language. SGML, like XML, is a metalanguage -- a language used for
defining other languages. Each so-defined language is called an
application of SGML. HTML is an application of SGML.
SGML emerged from research done primarily at IBM on text document
representation in the late '60s. IBM created GML ("Generalized Markup
Language"), a predecessor language to SGML, and in 1978 the
American National Standards Institute (ANSI) created its first version
of SGML. An industry standard was released in 1983, with the draft
international standard released in 1985, and the first international
standard published in 1986. Interestingly enough, the first SGML standard was published
using an SGML system developed by Anders Berglund at CERN, the
organization that, as we have seen, gave us HTML and the Web.
SGML is widely used in large industries and governments, for example by
large aerospace, automotive, and telecommunications companies. SGML is
used as a document standard at the United States Department of Defense
and the Internal Revenue Service. (For readers outside of the US, the
IRS are the tax guys.)
Albert Einstein said everything should be made as simple as possible,
and no simpler. The reason SGML isn't found in more places is that
it's extremely sophisticated and complex. And HTML, which you can find
everywhere, is very simple; for a lot of applications, it's too
simple.
HTML: All form and no substance
HTML is a language designed to "talk about" documents:
headings, titles, captions, fonts, and so on. It's heavily document
structure- and presentation-oriented.
Admittedly, artists and hackers have been able to work miracles with
the relatively dull tool called HTML. But HTML has serious drawbacks
that make it a poor fit for designing flexible, powerful, evolutionary
information systems. Here are a few of the major complaints:
- HTML isn't extensible
An extensible markup
language would allow application developers to define custom tags for
application-specific situations. Unless you're a 600-pound gorilla (and
maybe not even then) you can't require all browser manufacturers to
implement all the markup tags necessary for your application. So,
you're stuck with what the big browser makers, or the W3C (World Wide
Web Consortium) will let you have. What we need is a language that
allows us to make up our own markup tags without having to call the
browser manufacturer.
- HTML is very display-centric
HTML is a fine language for display purposes, unless you require a lot
of precise formatting or transformation control (in which case it
stinks). HTML represents a mixture of document logical structure
(titles, paragraphs, and such) with presentation tags (bold, image
alignment, and so on). Since almost all of the HTML tags have
to do with how to display information in a browser, HTML is useless for
other common network applications -- like data replication or
application services. We need a way to unify these common functions
with display, so the same server used to browse data can also, for
example, perform enterprise business functions and interoperate with
legacy systems.
- HTML isn't usually directly reusable
Creating
documents in word-processors and then exporting them as HTML is
somewhat automated but still requires, at the very least, some tweaking
of the output in order to achieve acceptable results. If the data from
which the document was produced change, the entire HTML translation
needs to be redone. Web sites that show the current weather around the
globe, around the clock, usually handle this automatic reformatting
very well. The content and the presentation style of the document are
separated, because the system designers understand that their content
(the temperatures, forecasts, and so on) changes constantly.
What we need is a way to specify data presentation in terms of
structure, so that when data are updated, the formatting can be
"reapplied" consistently and easily.
- HTML only provides one 'view' of data
It's
difficult to write HTML that displays the same data in different ways
based on user requests. Dynamic HTML is a start, but it requires an
enormous amount of scripting and isn't a general solution to this
problem. (Dynamic HTML is discussed in more detail below.) What we need
is a way to get all the information we may want to browse at once, and
look at it in various ways on the client.
- HTML has little or no semantic structure
Most
Web applications would benefit from an ability to represent data by
meaning rather than by layout. For example, it can be very difficult to
find what you're looking for on the Internet, because there's no
indication of the meaning of the data in HTML files (aside from META
tags, which are usually misleading). Type red into a search
engine, and you'll get links to Red Skelton, red herring, red snapper,
the red scare, Red Letter Day, and probably a page or two of
"Books I've Red." HTML has no way to specify what a
particular page item means. A more useful markup language would
represent information in terms of its meaning. What we need is a
language that tells us not how to display information, but
rather, what a given block of information is so we know what
to do with it.
SGML has none of these weaknesses, but in order to be general, it's
hair-tearingly complex (at least in its complete form). The language
used to format SGML (its "style language"), called DSSSL
(Document Style Semantics and Specification Language), is
extremely powerful but difficult to use. How do we get a language
that's roughly as easy to use as HTML but has most of the power of
SGML?
Origins of XML
As the Web exploded in popularity and people all over the world began
learning about HTML, they fairly quickly started running into the
limitations outlined above. Heavy-metal SGML wonks, who had been
working with SGML for years in relative obscurity, suddenly found that
everyday people had some understanding of the concept of markup (that
is, HTML). SGML experts began to consider the possibility of using
SGML on the Web directly, instead of using just one application of it
(again, HTML). At the same time, they knew that SGML, while powerful,
was simply too complex for most people to use.
In the summer of 1996, Jon Bosak (currently online information
technology architect at Sun Microsystems) convinced the W3C to let him
form a committee on using SGML on the Web. He created a high-powered
team of muckety-mucks from the SGML world. By November of that year,
these folks had created the beginnings of a simplified form of SGML
that incorporated tried-and-true features of SGML but with reduced
complexity. This was, and is, XML.
In March 1997, Bosak released his landmark paper, "XML, Java and the
Future of the Web" (see Resources). Now, two
years later (a very long time in the life of the Web), Bosak's short
paper is still a good, if dated, introduction to why using XML is such
an excellent idea.
SGML was created for general document structuring, and HTML was created
as an application of SGML for Web documents. XML is a simplification
of SGML for general Web use.
An XML conceptual example
All this talk of "inventing your own tags" is pretty foggy:
What kind of tags would a developer want to invent and how would the
resulting XML be used? In this section, we'll go over an example that
compares and contrasts information representation in HTML and XML. In
a later section ("XSL: I like your style") we'll go over XML display.
First, we'll take an example of a recipe, and display it as one
possible HTML document. Then, we'll redo the example in XML and discuss
what that buys us.
HTML example
Take a look at the little chunk of HTML in Listing 1:
<!-- The original html recipe -->
<HTML>
<HEAD>
<TITLE>Lime Jello Marshmallow Cottage Cheese Surprise</TITLE>
</HEAD>
<BODY>
<H3>Lime Jello Marshmallow Cottage Cheese Surprise</H3>
My grandma's favorite (may she rest in peace).
<H4>Ingredients</H4>
<TABLE BORDER="1">
<TR BGCOLOR="#308030"><TH>Qty</TH><TH>Units</TH><TH>Item</TH></TR>
<TR><TD>1</TD><TD>box</TD><TD>lime gelatin</TD></TR>
<TR><TD>500</TD><TD>g</TD><TD>multicolored tiny marshmallows</TD></TR>
<TR><TD>500</TD><TD>ml</TD><TD>cottage cheese</TD></TR>
<TR><TD></TD><TD>dash</TD><TD>Tabasco sauce (optional)</TD></TR>
</TABLE>
<P>
<H4>Instructions</H4>
<OL>
<LI>Prepare lime gelatin according to package instructions...</LI>
<!-- and so on -->
</BODY>
</HTML>
Listing 1. Some HTML
(A printable version of this listing can be found at
example.html.)
Looking at the HTML code in Listing 1, it's probably clear to just
about anyone that this is a recipe for something (something awful, but
a recipe nonetheless). In a browser, our HTML produces something like
this:
Lime Jello Marshmallow Cottage Cheese Surprise
My grandma's favorite (may she rest in peace).
Ingredients
| Qty | Units | Item |
| 1 | box | lime gelatin |
| 500 | g | multicolored tiny marshmallows |
| 500 | ml | cottage cheese |
|  | dash | Tabasco sauce (optional) |
Instructions
- Prepare lime gelatin according to package instructions...
Listing 2. What the HTML in Listing 1 looks like in a
browser
Now, there are a number of advantages to representing this recipe in
HTML, as follows:
- It's fairly readable. The markup may be a little cryptic, but if
it's laid out properly it's pretty easy to follow.
- The HTML can be displayed by just about any HTML browser, even one
without graphics capability. That's an important point: The display is
browser-independent. If there were a photo of the results of making
this recipe (and one certainly hopes there isn't), it would show up in
a graphical browser but not in a text browser.
- You could use a cascading style sheet (CSS -- we'll talk a bit
about those below) for general control over formatting.
There's one major problem with HTML as a data format, however. The
meaning of the various pieces of data in the document is lost.
It's really hard to take general HTML and figure out what the data in
the HTML mean. The fact that there's an
<Ingredient> of this recipe with a
<Qty> (quantity) of 500 ml
(<Units>) of <Item> cottage
cheese would be very hard to extract from this document in a way that's
generally meaningful.
Now, the idea of data in an HTML document meaning something
may be a bit hard to grasp. Web pages are fine for the human reader,
but if a program is going to process a document, it requires
unambiguous definitions of what the tags mean. For instance, the
<TITLE> tag in an HTML document encloses the title
of the document. That's what the tag means, and it doesn't mean
anything else. Similarly, an HTML <TR> tag means
"table row," but that's of little use if your program is
trying to read recipes in order to, say, create a shopping list. How
could a program find a list of ingredients from a Web page formatted in
HTML?
Sure, you could write a program that grabs the headers out of the
document, reads the table column headers, figures out the quantities
and units of each ingredient, and so on. The problem is, everyone
formats recipes differently. What if you're trying to get this
information from, say, the Julia Child Web site, and she keeps messing
around with the formatting? If Julia changes the order of the columns
or stops using tables, she'll break your program! (Though it has to be
said: If Julia starts publishing recipes like this, she may want to
think about changing careers.)
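To make that fragility concrete, here is a minimal sketch of such a scraper. The class name `HtmlScraper` and the assumption that every row has exactly three cells in the order Qty, Units, Item are mine, not part of any real recipe site; the point is only that the column order lives in the programmer's head rather than in the HTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlScraper {

    // Matches one three-cell table row and captures the cells in column order.
    private static final Pattern ROW = Pattern.compile(
        "<TR><TD>(.*?)</TD><TD>(.*?)</TD><TD>(.*?)</TD></TR>");

    // Hypothetical helper: assumes the third column is always the item name.
    // Nothing in the HTML itself guarantees that.
    static List<String> items(String html) {
        List<String> items = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            items.add(m.group(3));
        }
        return items;
    }

    public static void main(String[] args) {
        String original = "<TR><TD>500</TD><TD>ml</TD><TD>cottage cheese</TD></TR>";
        String reordered = "<TR><TD>cottage cheese</TD><TD>500</TD><TD>ml</TD></TR>";
        System.out.println(items(original));   // the item, as intended
        System.out.println(items(reordered));  // the unit, silently mistaken for an item
    }
}
```

The second call does not fail; it quietly returns the wrong column, which is arguably worse than an error.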
Now, imagine that this recipe page came from data in a database and
you'd like to be able to ship this data around. Maybe you'd like to
add it to your huge recipe database at home, where you can search and
use it however you like. Unfortunately, your input is HTML, so you'll
need a program that can read this HTML, figure out what all the
"Ingredients," "Instructions," "Units,"
and so forth are, and then import them to your database. That's a lot
of work. Especially since all of that semantic information -- again,
the meaning of the data -- existed in that original database but was
obscured in the process of being transformed into HTML.
Now, imagine you could invent your own custom language for describing
recipes. Instead of describing how the recipe was to be displayed,
you'd describe the information structure in the recipe: how
each piece of information would relate to the other pieces.
XML example
Let's just make up a markup language for describing recipes, and
rewrite our recipe in that language, as in Listing 3.
<?xml version="1.0"?>
<Recipe>
<Name>Lime Jello Marshmallow Cottage Cheese Surprise</Name>
<Description>
My grandma's favorite (may she rest in peace).
</Description>
<Ingredients>
<Ingredient>
<Qty unit="box">1</Qty>
<Item>lime gelatin</Item>
</Ingredient>
<Ingredient>
<Qty unit="g">500</Qty>
<Item>multicolored tiny marshmallows</Item>
</Ingredient>
<Ingredient>
<Qty unit="ml">500</Qty>
<Item>Cottage cheese</Item>
</Ingredient>
<Ingredient>
<Qty unit="dash"/>
<Item optional="1">Tabasco sauce</Item>
</Ingredient>
</Ingredients>
<Instructions>
<Step>
Prepare lime gelatin according to package instructions
</Step>
<!-- And so on... -->
</Instructions>
</Recipe>
Listing 3. A custom markup language for recipes
It will come as little surprise to you, being the astute reader you
are, that this recipe in its new format is actually an XML document.
Maybe the fact that the file started with the odd header
<?xml version="1.0"?>
gave it away; in fact, every XML file should begin with this header.
We've simply invented markup tags that have a particular meaning; for
example, "An <Ingredient> is a
<Qty> (quantity in specified units) of a single
<Item>, which is possibly
optional." Our XML document describes the
information in the recipe in terms of recipes, instead of in
terms of how to display the recipe (as in HTML). The
semantics, or meaning of the information, is maintained in XML because
that's what the tag set was designed to do.
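As a rough illustration of what that preserved meaning buys a program, the sketch below reads a shortened copy of the recipe and lists its ingredients by asking for elements by name. It assumes the JAXP DOM classes that ship with current JDKs (`javax.xml.parsers`, `org.w3c.dom`), which may not be the parser you end up using; the class name `IngredientLister` is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class IngredientLister {

    // A shortened copy of the recipe in Listing 3.
    static final String RECIPE =
        "<?xml version=\"1.0\"?>"
        + "<Recipe>"
        + "<Name>Lime Jello Marshmallow Cottage Cheese Surprise</Name>"
        + "<Ingredients>"
        + "<Ingredient><Qty unit=\"box\">1</Qty><Item>lime gelatin</Item></Ingredient>"
        + "<Ingredient><Qty unit=\"ml\">500</Qty><Item>Cottage cheese</Item></Ingredient>"
        + "<Ingredient><Qty unit=\"dash\"/><Item optional=\"1\">Tabasco sauce</Item></Ingredient>"
        + "</Ingredients>"
        + "</Recipe>";

    // Returns one line per <Ingredient>: quantity (if any), unit, item.
    static List<String> ingredients(String xml) {
        List<String> lines = new ArrayList<>();
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList ingredients = doc.getElementsByTagName("Ingredient");
            for (int i = 0; i < ingredients.getLength(); i++) {
                Element ingredient = (Element) ingredients.item(i);
                Element qty = (Element) ingredient.getElementsByTagName("Qty").item(0);
                Element item = (Element) ingredient.getElementsByTagName("Item").item(0);
                String amount = qty.getTextContent().trim();
                String prefix = amount.isEmpty() ? "" : amount + " ";
                lines.add(prefix + qty.getAttribute("unit") + " " + item.getTextContent().trim());
            }
        } catch (Exception e) {
            throw new IllegalStateException("could not read recipe", e);
        }
        return lines;
    }

    public static void main(String[] args) {
        ingredients(RECIPE).forEach(System.out::println);
    }
}
```

Notice that the code never mentions tables, columns, or fonts. It asks for an `Ingredient`, its `Qty`, and its `Item`, so a change in how the recipe is displayed cannot break it.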
Notes on notation
It's important to get some nomenclature straight. In Figure 1, you see
a start tag, which begins an enclosed area of text, known as
an Item, according to the tag name. As in HTML,
XML tags may include a list of attributes (consisting of an
attribute name and an attribute value). The
Item defined by the tag ends with the end tag.
Figure 1. An XML start tag and its corresponding end tag
Not every tag encloses text. In HTML, the <BR> tag
means "line break" and contains no text. In XML, such
unclosed tags aren't allowed. Instead, XML has empty tags, denoted
by a slash before the final right-angle bracket in the tag. Figure 2
shows an empty tag from our XML recipe. Note that empty tags may have
attributes. This empty tag example is standard XML shorthand for
<Qty unit="dash"></Qty>.
Figure 2. An empty tag
In addition to these notational differences from HTML, the structural
rules of XML are more strict. Every XML document must be
well-formed. What does that mean? Read on!
Ooh-la-la! Well-formed XML
The concept of well-formedness comes from mathematics: It's possible to
write mathematical expressions that don't mean anything. For example, the expression
2 ( + + 5 (=) 9 > 7
looks (sort of) like math, but it isn't math because it doesn't follow the notational and structural rules for a mathematical expression (not on this planet, at least). In other words, the "expression" above isn't well-formed. Mathematical expressions must be well-formed before you can do anything useful with them, because expressions that aren't well-formed are meaningless.
A well-formed XML document is simply one that follows all of the
notational and structural rules for XML. Programs that intend to
process XML should reject any input XML that doesn't follow the rules
for being well-formed. The most important of these rules are as follows:
- No unclosed tags
You can get away with all kinds of wacko stuff in HTML. For example, in
most HTML browsers, you can "open" a list item with
<LI> and never "close" it with
</LI>. The browser just figures out where the
</LI> would be and automatically inserts it for you.
XML doesn't allow this kind of sloppiness. Every start tag must have a
corresponding end tag. This is because part of the information in an
XML file has to do with how different elements of information relate to
one another, and if the structure is ambiguous, so is the information.
So, XML simply doesn't allow ambiguous structure. This nonambiguous
structure also allows XML documents to be processed as data structures
(trees), as I'll explain shortly in the discussion of the Document
Object Model.
- No overlapping tags
A tag that opens inside another tag must close before the containing
tag closes. For example, the sequence
<Tomato> Let's call <Potato>the whole
thing off</Tomato> </Potato>
isn't well-formed because <Potato> opens inside of
<Tomato> but doesn't close inside of
<Tomato>. The correct sequence must be
<Tomato> Let's call <Potato>the whole thing
off</Potato> </Tomato>
In other words, the structure of the document must be strictly hierarchical.
- Attribute values must be enclosed in quotes
Unlike HTML, XML doesn't allow "naked" attribute values
(i.e., HTML tags like <TABLE BORDER=1>, where there are no
quotes around the attribute value). Every attribute value must have
quotes (<TABLE BORDER="1">).
- The text characters (<) and (&) must always be represented by
'character entities'
To represent the left-angle bracket or the ampersand in the text part of
the XML (not in the markup), you must use the special character entities
(&lt;) and (&amp;), respectively. These characters are special
characters for XML. An XML file using, say, a raw left-angle bracket in the text enclosed in tags isn't well-formed, and
correctly designed XML parsers will produce an error for such input.
The right-angle bracket and the double quote have entities of their own,
(&gt;) and (&quot;). Using them in text is optional but customary, and a
double quote must be written as (&quot;) inside an attribute value that
is itself enclosed in double quotes.
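The rules above are easy to try out. The following sketch hands a few fragments to a parser and reports whether each one is accepted; it assumes the JAXP parser bundled with current JDKs, and `WellFormedCheck` is a made-up name. The `DefaultHandler` is installed only to keep the parser from printing its own messages to the console.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {

    // Returns true if a (nonvalidating) parser accepts the input as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler()); // suppress console output
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("parser could not be set up", e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<Tomato>Let's call <Potato>the whole thing off</Potato></Tomato>",
            "<Tomato>Let's call <Potato>the whole thing off</Tomato></Potato>",
            "<OL><LI>never closed</OL>",
            "<TABLE BORDER=1></TABLE>",
            "<Item>1 < 2</Item>",
            "<Item>1 &lt; 2</Item>"
        };
        for (String sample : samples) {
            System.out.println(isWellFormed(sample) + "  " + sample);
        }
    }
}
```

Only the first and last samples are accepted. The others break, in order, the nesting rule, the closing rule, the quoting rule, and the character entity rule.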
'Well-formed' means 'parsable'
A generic XML parser is a program or class that can read any
well-formed XML at its input. Many vendors now offer XML parsers in
Java for free (you'll find links to these packages in Resources at the bottom of this article). XML
parsers recognize well-formed documents and produce error
messages (much like a compiler would) when they receive input that isn't well-formed. As we'll see, this functionality is very handy for the programmer: You simply call the parser you've selected and it takes care of the error detection and so on. While all XML parsers check the well-formedness of documents (meaning, as we've seen, that all the tags make sense, are nested properly, and so on), validating XML parsers go one step further. Validating parsers also confirm whether the document is valid; that is, that the structure and number of tags make sense.
For example, most browsers will display a document that (nonsensically) has two <TITLE> elements, but how can this be? Only one title or no title makes sense.
For another example, imagine that in Listing 3 the
"cottage cheese" ingredient looked like this:
<Ingredient>
<Qty unit="ml">500</Qty>
<Qty unit="g">9</Qty>
<Item>Cottage cheese</Item>
</Ingredient>
This XML document is certainly well-formed, but it doesn't make sense. It isn't structurally valid. It is nonsense for an
<Ingredient> to contain two <Qty> elements. What's the <Qty> of this <Ingredient>?
The problem is, we have a document that's well-formed, but it isn't
very useful because the XML doesn't make sense. We need a way to
specify what makes an XML document valid. For example, how can we
specify that an <Ingredient> may contain exactly one <Qty>, and that a <Qty> tag may contain only text (and not any other elements), and report as errors any other case?
The answer to this question lies in something called the document
type definition, which we'll look at next.
Make up a markup
While a well-formed document is well-formed because it follows rules
defined by the XML spec, a valid document is valid because it matches
its document type definition (DTD). The DTD is the grammar for a markup language, defined by the designer of the markup language. For my little XML recipe in Listing 3, for example, that designer would be me. The DTD specifies what elements may exist, what attributes the elements may have, what elements may or must be found inside other elements, and in what order.
Nonvalidating parsers read the XML and, if it's well-formed,
give you back the document structure as a tree of objects. We'll
discuss the document structure you get from a parser in the section below entitled "The Document Object Model." If the
document is well-formed but the elements are nonsensical (as was the case with the two <Qty> elements in the
<Ingredient> above), that's your problem.
This is, in fact, how HTML browsers work. Generally, HTML parsers are nonvalidating. The various "HTML checking" parsers, which report syntax errors in HTML, are essentially validating HTML parsers (with additional functionality, like link checking).
Validating parsers read XML, verify that it's well-formed
(just as nonvalidating parsers do), and then go on to determine whether the document's element tags are legal, whether the attribute names make sense, whether every element nested inside another element belongs there, and so on.
The DTD defines the document type. It accounts for the
Extensible in XML. The DTD is how you actually define a
new markup language -- what I often call a dialect of XML.
DTDs currently are being written for an enormous number of different
problem domains, and each DTD defines a new markup language. New
markup languages now exist, or are being designed, to mark up the plays of Shakespeare; to define general data resources (RDF); to model information in the health care industry (HL7 SGML/XML); to typeset, display, and actively use mathematical equations (MathML); and to perform electronic data interchange (XML/EDI). There's even a proposal for a markup language for business data in the footwear industry (FDX). (No, I'm not joking.)
Central to each of these new languages is a DTD that describes what
tags the markup language has, what those tags' attributes may be, and
how they may be combined. A DTD specifies very clearly what
information may or may not be included in a markup language. For instance, the DTD for HTML does not allow for markup tags to select paper size for printing.
Let's take a look at a DTD for the recipe XML in Listing 3. I'm going to call it JWSRML (JavaWorld Scary Recipe Markup Language). Apologies to anyone already using that acronym.
<!-- This is the example DTD for the example XML -->
<!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Description (#PCDATA)>
<!ELEMENT Ingredients (Ingredient)*>
<!ELEMENT Ingredient (Qty, Item)>
<!ELEMENT Qty (#PCDATA)>
<!ATTLIST Qty unit CDATA #REQUIRED>
<!ELEMENT Item (#PCDATA)>
<!ATTLIST Item optional CDATA "0"
isVegetarian CDATA "true">
<!ELEMENT Instructions (Step)+>
<!ELEMENT Step (#PCDATA)>
Listing 4. The DTD for JWSRML
The document type definition in Listing 4 defines a language for a
validating parser to accept -- meaning, the parser will produce errors if the rules listed in the DTD aren't followed. To get a general idea of how a DTD works, let's look at what a few of the lines in this file mean.
<!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>
The <!ELEMENT...> statement defines a tag in the
document. This statement defines a <Recipe> tag, stating
that it must contain a <Name>, followed by an optional
<Description> (the question mark [?]
denotes optionality), an optional <Ingredients> tag, and an optional <Instructions> tag.
<!ELEMENT Name (#PCDATA)>
This simply states that a <Name> tag can contain
character data and nothing else.
<!ATTLIST Item optional CDATA "0" isVegetarian
CDATA "true">
This section states that the <Item> tag has two
possible attributes: optional, whose default value is
0; and isVegetarian, whose default value is
true. Notice that attribute values aren't limited to
numbers; they can be any text.
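One way to see those defaults at work is to parse an <Item> that doesn't mention either attribute and then ask for them anyway. This is a sketch under two assumptions: it uses the JAXP DOM parser bundled with current JDKs, and it places the two relevant declarations inline in the document rather than in a separate file. The class name `AttributeDefaults` is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AttributeDefaults {

    // The Item declarations from Listing 4, placed inline ahead of the element.
    static final String DTD =
        "<!DOCTYPE Item [\n"
        + "<!ELEMENT Item (#PCDATA)>\n"
        + "<!ATTLIST Item optional CDATA \"0\" isVegetarian CDATA \"true\">\n"
        + "]>\n";

    // Parses a single <Item> element and returns the named attribute's value.
    static String attribute(String itemXml, String name) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(DTD + itemXml)));
            return doc.getDocumentElement().getAttribute(name);
        } catch (Exception e) {
            throw new IllegalStateException("could not read item", e);
        }
    }

    public static void main(String[] args) {
        // Neither attribute is written in the element, so both come from the DTD.
        System.out.println(attribute("<Item>lime gelatin</Item>", "optional"));
        System.out.println(attribute("<Item>lime gelatin</Item>", "isVegetarian"));
        // An explicit value in the document overrides the default.
        System.out.println(attribute("<Item optional=\"1\">Tabasco sauce</Item>", "optional"));
    }
}
```

The program that reads the document never has to check whether an attribute is missing and substitute a fallback; the DTD has already said what the fallback is.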
A DTD is associated with an XML document by way of a document
type declaration, which appears at the top of the XML file (after
the <?xml...?> line). The document type declaration
may contain either an inline copy of the document type definition or
a reference to that definition as a system filename or URI
(Uniform Resource Identifier). For example,
<!DOCTYPE Recipe SYSTEM "example.dtd">
tells the parser to start looking for a <Recipe> tag as the top-level tag of the document. It also declares that the DTD is in the system file example.dtd.
There are other characters and notations in the DTD, but writing DTDs
is a topic unto itself. If you're interested in learning more, check
out the DTD-related links in Resources.
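If you'd like to watch a validating parser enforce a DTD, the sketch below may help. It assumes the JAXP parser bundled with current JDKs and uses the inline form of the document type declaration, so no `example.dtd` file is needed. The DTD string repeats the declarations of Listing 4 and declares <Step> as plain character data, which is my assumption about what a step contains. `RecipeValidator` is a made-up class name.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class RecipeValidator {

    // Inline document type declaration for JWSRML.
    static final String DTD =
        "<!DOCTYPE Recipe [\n"
        + "<!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>\n"
        + "<!ELEMENT Name (#PCDATA)>\n"
        + "<!ELEMENT Description (#PCDATA)>\n"
        + "<!ELEMENT Ingredients (Ingredient)*>\n"
        + "<!ELEMENT Ingredient (Qty, Item)>\n"
        + "<!ELEMENT Qty (#PCDATA)>\n"
        + "<!ATTLIST Qty unit CDATA #REQUIRED>\n"
        + "<!ELEMENT Item (#PCDATA)>\n"
        + "<!ATTLIST Item optional CDATA \"0\" isVegetarian CDATA \"true\">\n"
        + "<!ELEMENT Instructions (Step)+>\n"
        + "<!ELEMENT Step (#PCDATA)>\n"
        + "]>\n";

    // Returns true if the <Recipe> element is both well-formed and valid.
    static boolean isValid(String recipeXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // ask for a validating parser
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Validity problems are reported as recoverable errors, so turn
            // them into exceptions; otherwise parsing would simply continue.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DTD + recipeXml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("parser could not be set up", e);
        }
    }

    public static void main(String[] args) {
        String good = "<Recipe><Name>Surprise</Name><Ingredients><Ingredient>"
            + "<Qty unit=\"ml\">500</Qty><Item>Cottage cheese</Item>"
            + "</Ingredient></Ingredients></Recipe>";
        String twoQuantities = "<Recipe><Name>Surprise</Name><Ingredients><Ingredient>"
            + "<Qty unit=\"ml\">500</Qty><Qty unit=\"g\">9</Qty><Item>Cottage cheese</Item>"
            + "</Ingredient></Ingredients></Recipe>";
        System.out.println(isValid(good));
        System.out.println(isValid(twoQuantities));
    }
}
```

Both documents are well-formed, but only the first matches the DTD. The second is the cottage cheese ingredient with two quantities, and the parser rejects it without any recipe-specific code on your part.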
You now know a lot about how XML is structured and controlled, but you haven't heard what it's good for. Why are people so excited about this technology?
So, what good is made-up markup?
Here are some benefits of representing information in XML:
- XML is at least as readable as HTML and probably more so
Anyone who understands, more or less, what HTML is probably understands just about everything in Listing 3. They're also likely to have a good idea what the markup means, since the markup uses fairly intuitive terms (<Ingredient> rather than, say, <OBJECT
CLASSID="">).
- The tags don't have anything to do with how the
document is displayed
Listing 3 is pure content: It's information. The markup
indicates what the information means, not how to
display it. The formatting information for an XML file (if there is any need for formatting) is usually written in a style language and stored separately from the XML. (See
the sections on CSS and XSL below for more on formatting XML.)
Separation of content and presentation is a key concept
inherited from SGML.
- A lot of the programming is already done for you
If you write a DTD and use a validating parser, much of the error
checking for the validity of your input is done by the parser. There's no need to write the parser yourself, since there are so many
high-quality parsers available for free. If you want to change the language, you simply change the DTD; the parser then obeys your new rules. Moreover, if your system needs to interoperate with other systems, you can choose a standard DTD (like XML/EDI, for example),
so that other systems will automatically understand your system's
vocabulary, and vice versa.
- XML is more versatile than HTML
Let's think about all the ways a document like Listing 3 could be used:
- You could display this recipe in an online recipe database, with a page style easily modifiable across all recipes
- The recipes are automatically scalable, convenient if you're planning a dinner party for 200
- The recipe is already in a standard recipe format for transmission
to the database
- Online recipe servers could exchange recipes using this format, or recipe applications could share data
- Such recipes would be much easier to search accurately (for
example, "all recipes with lime Jello and Tabasco sauce") than HTML would be
- It would be easy, based on the contents of your "legacy" pantry inventory database, to produce a shopping list
In Listing 3 above, you've seen what may be your first XML document.
You've got a problem with that document, though: It's going to be
pretty difficult to convince the browser manufacturers (not to mention the W3C) to add the <Ingredient> tag to their browsers. What if there were a way to turn this XML into a text file, a PostScript document, a photo-typesetting file, or input to a
text-to-speech system for the visually impaired? Or what if the XML
could somehow be transformed into HTML and displayed in a browser?
In fact, CSS (Cascading Style Sheets) and XSL (the
Extensible Stylesheet Language) do precisely that: They're the style
languages for XML. Let's take a quick look at these two technologies.
The members of the appropriate committees at the W3C have addressed
these concerns with two specifications: CSS and XSL. While both are
declarative languages (meaning that there are no instructions in the first-do-this, then-do-that sense), they serve different functions. CSS exists as a current recommendation from the W3C, usable with HTML or XML, is simpler to use and less powerful than XSL, and is supported by most current-generation browsers (to varying degrees). XSL is used exclusively to format XML or SGML and is more complex and powerful than CSS.
Great strides have been made with XSL in the past year. While XSL is
still just a "working draft" (meaning its design isn't yet
complete), you can experiment today with working implementations of the
draft. Just this month (March 18, 1999), Microsoft
released Internet Explorer 5.0, which includes support for part of the
XSL specification. And Mozilla (the open source project based on the
Netscape source code) can display XML using CSS. At the XTech '99
conference in San Jose, CA, in early March, Sun Microsystems
"pre-announced" a request for proposals (for a grant) and a
contest relating to the implementation of an XSL batch-processor
and the addition of full XSL to Mozilla. (See Resources.)
Again, the purpose of creating these new standards is to make most
things very simple for most people, just as HTML has made hypertext
and structured documents accessible to your grandma (or your
nine-year-old).
Cascading Style Sheets: not just for HTML anymore
You probably already know that HTML documents have a common tree-like
structure wherein elements are nested inside other elements. Nonetheless,
take a look at Listing 5 below.
<HTML><HEAD></HEAD>
<BODY>
<H1>A Theory About the Brontosaurus</H1>
My theory about the brontosaurus is...
</BODY>
</HTML>
Listing 5. <HTML> contains <BODY> contains <H1>
contains text
As the caption says, the <H1> element is contained
inside the <BODY> element, which itself is contained
inside the <HTML> element. And, of course, the title
itself is inside the <H1> element.
The whole idea of a style sheet is to use these structural
relationships to indicate where changes in text style, spacing, and
so on should occur. Then, a style sheet can be "applied"
to a document, to change its overall look. For example,
Listing 6 shows a tiny style sheet that sets the font size, color,
and underlining for the <H1> heading in Listing 5.
<STYLE TYPE="text/css">
<!--
H1 { color: red; font-size: 16pt; text-decoration: underline }
-->
</STYLE>
Listing 6. A style sheet that sets the style for <H1> in
Listing 5
If this style sheet were to appear at the top of the document, most
HTML browsers these days would use the settings in the style sheet (or simply "style"),
and change all <H1> headings to 16-point, red-underlined type. Styled
with our style sheet, our little document would look something like this:
<SPAN STYLE="color: red; font-size: 16pt; text-decoration: underline">
A Theory About the Brontosaurus
</SPAN>
My theory about the brontosaurus is...
(If this example doesn't show up properly, you either have styles
turned off in your browser or you're using an old browser that doesn't
support styles.) A document can reference its style sheet with a
hyperlink, and some browsers allow you to switch style sheets for the
document you're viewing, effectively changing how the document looks
on the fly.
These style sheets are called cascading style sheets, because styles (like fonts, colors, and so on) for one markup element "cascade" down, and apply to all of the element's contents. For example, if a paragraph tag (<P>) is set to show its text in red,
all text and any other elements inside that paragraph will be
displayed in red, unless one of the paragraph's sub elements
specifies a color for its contents.
The example we just looked at was for HTML, but what about XML? CSS can
be used to style XML, too, and in precisely the same way. You simply
specify the style for, say, an <Ingredient>, and all
the ingredients look the same. And, interestingly enough, if you
change the style sheet, the formatting of all ingredients
changes. It's really quite powerful.
Most browsers these days (Netscape 4 and above, Internet Explorer 3 and
above, Opera 3.5 and above) implement CSS pretty consistently for HTML.
You'll be reading a lot in the next few months about CSS and XML
availability in browsers. Also, keep in mind that CSS could be used to
apply style to documents on the server and serve "straight
HTML" without the CSS markup.
As powerful as CSS is, it has one major limitation: It can't
"transform" the data it's styling. CSS can make an HTML or
XML document look different, and even hide elements, but it can't
reshuffle, cross-reference, or restructure them. For example, say you
wanted to transform the XML recipe in Listing 3 to the HTML in Listing
1. Notice that you want the title to appear both in the browser's
title bar (in an HTML <TITLE> element), and as a
heading on the page (in a <H3> element), as is shown
in Listing 1. CSS can't do that; all it can do is apply style to an
existing structure.
To take an existing XML structure and produce a new structure
of something else (in this case, HTML), you need XSL: the Extensible
Style Language.
XSL: I like your style
People who work in SGML and need to format it generally use DSSSL
(Document Style Semantics and Specification Language) to do the job.
DSSSL is a dialect of Scheme, itself a venerable and popular form of
LISP (which stands either for "List Processing" or "Lots of
Irritating, Superfluous Parentheses," depending on who you
ask). Of course, if you're using DSSSL, you're already an SGML god and veteran LISP hacker, and therefore should not be reading this
article.
Fortunately, the W3C committees discussing style, HTML, and XML have
included in their design the Extensible Style Language, or XSL. XSL is
based on DSSSL (and DSSSL-O, the online version of DSSSL), and also
uses some of the style elements of CSS. It's simpler than DSSSL, while
retaining much of its power (much like the relationship between XML and
SGML). XSL's notation, however, may be surprising: it's XML. The
simplest way to say it is: XSL is an XML document that specifies how to
transform another XML document. Say, what?
Why XSL is so useful
XSL is immensely powerful. It can be used to add style to a
document (as in CSS), and it can also completely rearrange the
input elements for a particular purpose. For example, XSL
can transform XML of one structure into HTML of a different structure.
(We'll see an example of this below.) XSL can also restructure XML
into other document formats: TeX, RTF, and PostScript.
XSL can even transform XML into a different dialect of XML! This may
sound crazy, but it's actually a pretty cool idea. For example,
multiple presentations of the same information could be produced by
several different XSL files applied to the same XML input. Or, let's
say two systems speak different "dialects" of XML but have
similar information requirements. XSL could be used to translate the
output of the first system into something compatible with the input of the second system.
These last few reasons are of special interest to Java programmers,
since XSL can be used to translate between different languages in a
distributed network of subsystems, as well as to format documents.
Understanding how to use XSL in simple applications, like transforming
XML to HTML, will help a Java developer understand XSL in general.
Let's look at an example of how to transform XML to HTML with an XSL
style sheet.
Formatting XML as HTML: An example
An XSL file is a series of rules, called templates, that are
applied to an input XML file. Each time a template matches something in
the input, the template produces a new structure in the output (often
HTML, as in the example we're about to see). The new structure is
the XML's content, with the appropriate style applied and arranged as the XSL specifies. The templates in the XSL file are written in XML, using specific tags with defined meanings.
The example below refers again to the XML recipe example in Listing 3.
We're going to look at an XSL file that transforms the XML in Listing 3
into the HTML in Listing 1.
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/TR/WD-xsl">
  <xsl:template match="/Recipe">
    <HTML>
      <HEAD>
        <TITLE>
          <xsl:value-of select="Name"/>
        </TITLE>
      </HEAD>
      <BODY>
        <H3>
          <xsl:value-of select="Name"/>
        </H3>
        <STRONG>
          <xsl:value-of select="Description"/>
        </STRONG>
        <xsl:apply-templates/>
      </BODY>
    </HTML>
  </xsl:template>
  <!-- Format ingredients -->
  <xsl:template match="Ingredients">
    <H4>Ingredients</H4>
    <TABLE BORDER="1">
      <TR BGCOLOR="#308030"><TH>Qty</TH><TH>Units</TH><TH>Item</TH></TR>
      <xsl:for-each select="Ingredient">
        <TR>
          <!-- handle empty Qty elements separately -->
          <xsl:if test='Qty[not(.="")]' >
            <TD><xsl:value-of select="Qty"/></TD>
          </xsl:if>
          <xsl:if test='Qty[.=""]' >
            <TD BGCOLOR="#404040"> </TD>
          </xsl:if>
          <TD><xsl:value-of select="Qty/@unit"/></TD>
          <TD><xsl:value-of select="Item"/>
            <xsl:if test='Item/@optional="1"'>
              <SPAN> -- <em><STRONG>optional</STRONG></em></SPAN>
            </xsl:if>
          </TD>
        </TR>
      </xsl:for-each>
    </TABLE>
  </xsl:template>
  <!-- Format instructions -->
  <xsl:template match="Instructions">
    <H4>Instructions</H4>
    <OL>
      <xsl:apply-templates select="Step"/>
    </OL>
  </xsl:template>
  <xsl:template match="Step">
    <LI><xsl:value-of select="."/></LI>
  </xsl:template>
  <!-- ignore all not matched -->
  <xsl:template match="*" priority="-1"/>
</xsl:stylesheet>
Listing 7. XSL used as an XML language that transforms XML into
something else
(A printable version of this file is in example.xsl).
Looking at this code you'll notice, first of all, that the file starts
with the <?xml...?> declaration, indicating that this file
is XML (even though it's also XSL). Each template is bounded by the
tags <xsl:template ...> and </xsl:template>. Every tag that begins with <xsl:
is an XSL command.
While we won't go over all the templates in the XSL file (since this
isn't an XSL tutorial), Listing 8 provides a quick look at the
first template in the file, just to get the general idea.
<xsl:template match="/Recipe">
  <HTML>
    <HEAD>
      <TITLE>
        <xsl:value-of select="Name"/>
      </TITLE>
    </HEAD>
    <BODY>
      <H3>
        <xsl:value-of select="Name"/>
      </H3>
      <STRONG>
        <xsl:value-of select="Description"/>
      </STRONG>
      <xsl:apply-templates/>
    </BODY>
  </HTML>
</xsl:template>
Listing 8. The first template from the XSL style sheet in Listing 7
Notice the <xsl:template> tag: It has an attribute
match="/Recipe". This indicates that this template is to
be applied when a <Recipe> element is encountered at
the input. Everything enclosed within this
<xsl:template> element will be placed in the
output.
The XSL processor sees a <Recipe> element, so it
begins building its output by using the contents of the
<xsl:template> element in the XSL file. It adds an
<HTML> element, then a <HEAD>
element inside of that, and then a <TITLE> element.
It's actually building a new HTML document by creating HTML
from the template, based on what it sees. The
<xsl:value-of> tag instructs the XSL processor to go
get the text contained in some other element -- in this case, the sub
element <Name>. Moving a few lines down, you can
see the same thing happening, as the XSL processor again fetches and
uses the same string within the <H3> tag, and the
<Description> tag after it. (Note that we're using
the same text in more than one place in a document, something CSS
simply can't do.) Finally, we come to the
<xsl:apply-templates> command, which tells the XSL
processor to apply the other templates in the file to the child elements of <Recipe>.
The resulting HTML is very similar to the HTML we saw in Listing 1. If
you want to study the XML, XSL, and resulting HTML, and want to learn
how to use XSL to format XML yourself, see the links on XSL in the Resources section of this article.
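For Java readers who'd like to run a transformation from code, here is a minimal sketch. It makes several assumptions that go beyond this article: it relies on the javax.xml.transform classes bundled with later JDKs, it uses the XSLT 1.0 namespace rather than the working-draft namespace of Listing 7, and the class name RecipeToHtml and the cut-down stylesheet are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class RecipeToHtml {

    // A cut-down stylesheet in the spirit of Listing 8. Assumption: XSLT 1.0
    // namespace, not the working-draft namespace used in Listing 7.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0'"
        + "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "  <xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "  <xsl:template match='/Recipe'>"
        + "    <HTML><HEAD><TITLE><xsl:value-of select='Name'/></TITLE></HEAD>"
        + "    <BODY><H3><xsl:value-of select='Name'/></H3></BODY></HTML>"
        + "  </xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies STYLESHEET to the given recipe XML and returns the result as text. */
    static String transform(String recipeXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(recipeXml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
            transform("<Recipe><Name>Lime Jello Surprise</Name></Recipe>"));
    }
}
```

Note that the recipe name lands in both the TITLE and the H3 elements, which is exactly the reuse of content that CSS can't manage.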
Additional XSL capabilities
XSL isn't limited to just producing HTML. XSL also has complete
support for "native" formatting, which doesn't rely on
translation to some other format. Nobody has yet implemented this part
of XSL, though, primarily because page formatting and layout is a very tough problem to wrangle. (There is, however, a contest to implement
all of XSL. See Resources.)
XSL's design also includes embedded scripting. Currently, IBM's
LotusXSL package (written in Java) provides the functionality of almost
all of the current draft specification of XSL, including the ability to
call embedded ECMAScript (the European standard JavaScript) from XSL
templates.
Of course, as always, with power comes complexity. Learning to write
XSL isn't a piece of cake. But the power's there if you want it.
XML is more than just content management
XSL, like CSS, can be used on either the client or the server in a client/server
system. This fact provides immense flexibility and organization to Web
site designers and managers. So much so, in fact, that many people
think of XML, CSS, and XSL as another set of technologies for
"content management" for their Web sites. It makes styling
Web documents easier and more consistent, facilitates version control
of the site, simplifies multibrowser management (think of using a style
sheet to overcome the many differences between browsers), and so
forth. CSS is also useful for Dynamic HTML (which we'll
discuss a bit below), where much of the user interaction occurs on the
client side, where it belongs. From the point of view of people
managing Web sites, XML, CSS, and XSL are indeed big wins. And yet,
there's a whole world of applications that have nothing to do with
browsers and Web pages. The map of that world is called the
Document Object Model.
Modeling information structure in XML
So far, we've looked at XML as a way of representing data as
human-readable documents, and we've spent some time discussing
formatting. But XML's real power is in its ability to represent
information structure -- how various pieces of information relate to
one another -- in much the same way a database might.
Structured documents of the type we've been looking at have the
property that all of their elements nest inside one another, as in
Listing 5 above. Instead of looking at a document as a file, though,
consider what happens if we look at the structure of the tags as a
tree:
Figure 3. The recipe represented as a tree structure
The figure above shows the recipe as a tree of document tags. The child
nodes of a document nest within the parent node. What if there were a
way to automagically convert an XML document into a tree of
objects in a programming language -- like, oh, say, Java
maybe? And what if these objects all had properties that could be set
and retrieved -- such as the list of each element's children, the text
each object contained, and so on? Wouldn't that be interesting?
The Document Object Model (DOM) Level 1 Recommendation (see Resources), created by a W3C committee, describes
a set of language-neutral interfaces capable of representing any
well-formed XML or HTML document.
With the DOM, HTML and XML documents can be manipulated as objects,
instead of just as streams of text. In fact, from the DOM point of
view, the document is the object tree, and the XML, HTML, or
what have you is simply a persistent representation of that tree.
The availability of the DOM makes it much simpler to read and write
structured document files, since standard HTML and XML parsers are
written to produce DOM trees. If these objects have GUI
representations, it's easy to see how to create an application that
reads structured document files (XML or HTML), lets the user edit the
structure visually, and then saves it in its original format. Programs
that interface with existing Web sites become much easier to write,
because once the document is parsed, you're working with objects native
to your programming language.
One of the earliest popular uses for the Document Object Model is
Dynamic HTML, where client-side scripts manipulate and display (and
redisplay) an HTML document in response to user actions. Dynamic HTML
manipulates the client-side document in terms of the scripting
language's binding to the DOM structure of the document being
displayed. For instance, a <BUTTON> object might,
when clicked, reorder a table on the same page by sorting the
<TR> (table row) nodes on a particular column.
But aside from all this browser-document-Web technology, the DOM
provides a common way of accessing general data structures from
structured documents. Any language that has a binding (that
is, a specific set of interfaces that implement the DOM in that
language) can use XML as an interface for storing, retrieving, and
processing generic hierarchical (and even nonhierarchical) object
structures.
How DOM and XML work together
The DOM opens the door to using XML as the lingua franca of
data interchange on the Internet, and even within applications. Tim
Berners-Lee, discussed earlier and commonly known as the "inventor
of the World Wide Web," says that, these days, it's important to
understand that if a system you're designing survives, it will someday
be used as a module in another system. So it's best to design it that
way from the start. The DOM is completely described in IDL, the
Interface Definition Language used in CORBA, so it's connected to
existing software interoperation standards.
Let's think a moment about how DOM with XML would be useful in
programming a database system. First, represent your database schema
as a set of DOM objects. Want a document that describes that schema? No
problem: write it out as XML. Use XSL to format the XML as HTML and
you've got a complete, browseable schema reference that's always up to
date. Want to automatically construct SQL for updating your relational
database from a record set coming into your system? Just traverse your
database's (DOM) schema tree, matching up the names of the columns from
the record set with those of the schema, and build an SQL UPDATE
statement as you go. What's that you say? The schema has changed, and
the record set you've received doesn't match up with the new schema?
You can write code to handle that, or present the user with error
messages that state exactly what's wrong. You even might be able to use
XSL to refactor the DOM tree of your record set into something matching
the new schema.
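Jumping ahead a little to the Java DOM bindings, the schema-walking idea might be sketched like this. Everything specific here is an assumption made for illustration: the <Table> and <Column> vocabulary, the key="1" convention for the primary key, the class name SchemaSqlDemo, and the use of the javax.xml.parsers API from later JDKs to obtain the DOM tree.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class SchemaSqlDemo {

    /**
     * Walks the schema tree and builds a parameterized UPDATE for those
     * record columns that the schema also knows about.
     */
    static String buildUpdate(String schemaXml, Collection<String> recordColumns) {
        Document schema;
        try {
            schema = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(schemaXml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse schema", e);
        }
        Element table = schema.getDocumentElement();
        NodeList columns = table.getElementsByTagName("Column");
        List<String> assignments = new ArrayList<>();
        String keyColumn = null;
        for (int i = 0; i < columns.getLength(); i++) {
            Element column = (Element) columns.item(i);
            String name = column.getAttribute("name");
            if ("1".equals(column.getAttribute("key"))) {
                keyColumn = name;                 // goes in the WHERE clause
            } else if (recordColumns.contains(name)) {
                assignments.add(name + " = ?");   // value is bound later
            }
        }
        if (keyColumn == null || assignments.isEmpty()) {
            throw new IllegalArgumentException("record does not fit the schema");
        }
        return "UPDATE " + table.getAttribute("name")
             + " SET " + String.join(", ", assignments)
             + " WHERE " + keyColumn + " = ?";
    }

    public static void main(String[] args) {
        String schema = "<Table name='recipes'>"
                      + "<Column name='id' key='1'/>"
                      + "<Column name='title'/><Column name='servings'/>"
                      + "</Table>";
        System.out.println(buildUpdate(schema, Arrays.asList("title", "servings")));
    }
}
```

Record columns the schema doesn't mention are silently skipped in this sketch; a real system would more likely report them, as described above.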
Finally, it's time to start programming in Java! In the next section,
we're going to examine the Java bindings of the DOM and see how to use
the DOM in a Java program.
XML and Java
Up to this point I've been laying out general information about XML,
without a lot of reference to Java. Now that you understand XML, it's
time to look at how to process XML in Java. Java's a great language for
XML, as you'll see. It provides a portable data format that nicely
complements Java's portable code.
SAX appeal
The easiest way to process an XML file in Java is by using the
Simple API for XML, or SAX. SAX is a simple Java interface
that many Java parsers can use. A SAX parser is a class that
implements the interface org.xml.sax.Parser. This parser
"walks" the tree of document nodes in an XML file, calling
the methods of user-defined handler classes.
To process an XML document, the programmer creates a class that
implements interface org.xml.sax.DocumentHandler. The
Parser object (that is, the object that implements
org.xml.sax.Parser) reads the XML from its input source,
calling the methods of the DocumentHandler when tags,
input strings, and so on are recognized at the input.
The methods of the DocumentHandler interface are as shown
in Listing 9.
public interface DocumentHandler {
    public abstract void setDocumentLocator (Locator locator);
    public abstract void startDocument () throws SAXException;
    public abstract void endDocument () throws SAXException;
    public abstract void startElement (String name, AttributeList atts)
        throws SAXException;
    public abstract void endElement (String name) throws SAXException;
    public abstract void characters (char ch[], int start, int length)
        throws SAXException;
    public abstract void ignorableWhitespace (char ch[], int start, int length)
        throws SAXException;
    public abstract void processingInstruction (String target, String data)
        throws SAXException;
}
Listing 9. interface org.xml.sax.DocumentHandler
Package org.xml.sax includes a utility class called
HandlerBase, which implements the interface in Listing 9
(as well as some other interfaces in the SAX package) with methods that do nothing. Programmers can create a subclass of
HandlerBase that overrides only the methods they want to
use.
For example, say we want a class that counts the elements in an XML
document. We could write a class as follows:
import org.xml.sax.*;
public class ElementCounter extends HandlerBase {
    protected int _iElements = 0;
    public ElementCounter() { }
    // Each time the SAX parser encounters an element, it
    // will call this method
    public void startElement (String name, AttributeList atts)
            throws SAXException {
        _iElements++;
    }
    public void endDocument() {
        System.out.println("Document contains " + _iElements +
                           " elements.");
    }
}
Listing 10. A class that counts the elements in an XML document
To create a Java program that counts elements in an XML file, you'd
simply create a SAX parser (how you do that depends on your particular
parser package), then create an instance of your
ElementCounter class. You then call the parser's
setDocumentHandler method with the new
ElementCounter as an argument. The parser keeps a
reference to the DocumentHandler you passed to it. When
you call the parser's parse() method, the parser reads its
input source. Each time it encounters an element (that is, a tag) in
the XML file, it calls the startElement() method of your
ElementCounter object, passing the name of the tag and a
list of attributes the tag may have had.
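The steps just described can be sketched as follows. This is one possible arrangement, not the sample package's code: it assumes the javax.xml.parsers factory found in later JDKs as the way to get a SAX parser (the step that otherwise depends on your parser package), and the names CountDemo and Counter are invented. The counter here returns its total instead of printing it, so the result is easy to check.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;

@SuppressWarnings("deprecation") // the SAX 1 types are deprecated in current JDKs
final class CountDemo {

    /** Same idea as Listing 10, but the count is kept in a readable field. */
    static final class Counter extends HandlerBase {
        int elements = 0;

        @Override
        public void startElement(String name, AttributeList atts) {
            elements++;
        }
    }

    /** Parses the XML text and returns the number of start tags seen. */
    static int countElements(String xml) {
        try {
            // Assumption: JAXP supplies the org.xml.sax.Parser implementation.
            Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
            Counter counter = new Counter();
            parser.setDocumentHandler(counter);   // the parser keeps this reference
            parser.parse(new InputSource(new StringReader(xml)));
            return counter.elements;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Recipe><Name>Lime Jello Surprise</Name>"
                   + "<Ingredients><Ingredient/><Ingredient/></Ingredients></Recipe>";
        System.out.println("Document contains " + countElements(xml) + " elements.");
    }
}
```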
Experimenting with SAX
An example package, com.javaworld.JavaBeans.XMLApr99, can
be downloaded for free (see Resources). The
sample main() program lets you specify (in this order):
- An XML file to parse
- The fully specified class name of the parser (optional)
- The fully specified class name of a document handler
The package includes two document handlers: the
ElementCounter from Listing 10, and a handler called
SimplePrinter, which (naturally) simply prints the XML
with an easy-to-read indentation. You can try writing your own document
handler and passing it to the main method (called
com.javaworld.JavaBeans.XMLApr99.ParseDemo.main()).
You'll need the JAR file called "XMLApr99.jar," and you'll need to
download the JAR file for IBM's excellent "XML for Java"
package (version 2). Place both JAR files in your CLASSPATH, and type
java com.javaworld.JavaBeans.XMLApr99.ParseDemo
for instructions. The XML for Java package includes excellent
documentation, a programmer's guide, and several example programs to
get you started.
The source code is also available in zip and tar.gz formats. As an
exercise, try downloading one of the other vendors' XML parsers from
the Resources section, and then overriding the
method ParseDemo.createParser() in the sample code to
create a parser from the new package.
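If you'd like a starting point for a document handler of your own, here is a rough sketch of an indenting printer. It is not the SimplePrinter from the sample package, whose source isn't shown in this article; the names IndentPrinterDemo and IndentPrinter are hypothetical, and the parser again comes from the javax.xml.parsers factory in later JDKs, which is an assumption on my part.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;

@SuppressWarnings("deprecation") // the SAX 1 types are deprecated in current JDKs
final class IndentPrinterDemo {

    /** Hypothetical handler: collects the document as indented text. */
    static final class IndentPrinter extends HandlerBase {
        final StringBuilder out = new StringBuilder();
        private int depth = 0;

        private void indent() {
            for (int i = 0; i < depth; i++) out.append("  ");
        }

        @Override
        public void startElement(String name, AttributeList atts) {
            indent();
            out.append('<').append(name);
            for (int i = 0; i < atts.getLength(); i++) {
                out.append(' ').append(atts.getName(i))
                   .append("=\"").append(atts.getValue(i)).append('"');
            }
            out.append(">\n");
            depth++;
        }

        @Override
        public void endElement(String name) {
            depth--;
            indent();
            out.append("</").append(name).append(">\n");
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Simplification: no escaping of & or <, and text delivered in
            // several callbacks would end up on several lines.
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
                indent();
                out.append(text).append('\n');
            }
        }
    }

    /** Parses the XML text and returns it with one tag or text run per line. */
    static String print(String xml) {
        try {
            Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
            IndentPrinter printer = new IndentPrinter();
            parser.setDocumentHandler(printer);
            parser.parse(new InputSource(new StringReader(xml)));
            return printer.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(print(
            "<Ingredient><Qty unit=\"cup\">1</Qty><Item>Water</Item></Ingredient>"));
    }
}
```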
Become a tree surgeon!
One final, somewhat more advanced topic, before we close. The SAX
interface allows you to parse an XML file and execute particular
actions whenever certain structures (like tags) appear in the input.
That's great for a lot of applications. There are times, though, when
you want to be able to cut and paste whole sections of XML documents,
restructure them, or maybe even build from scratch an object structure
like the one in Figure 3, and then save the whole structure as an XML
file. For that, you need access to the DOM API.
The DOM API allows you to represent your XML document as a tree of
nodes in your Java (or other language) program. While a SAX parser
reads an XML file, doing callbacks to a user-defined class, a DOM
parser reads an XML file and returns a representation of the file
as a tree of objects, most of which are of type
org.w3c.dom.Node. This gives you immense power in
manipulating structured documents. Figure 4 is an example of what I'm
talking about.
Figure 4. A DOM document transformation system
The Document Object Model, in the package org.w3c.dom,
defines interfaces for document elements (that is, tags), DTD elements,
text nodes (where the actual text inside the tags is kept), and quite a
few other things we haven't even discussed. Figure 4 is a schematic of
a general system that can transform one XML document to some other form
programmatically. Your program uses a DOM parser to parse an XML file,
and the parser returns a tree that is an exact representation of the
XML in the file. Note that, at this point, you've read an input file,
checked it for formatting and semantic validity, and built a complex
hierarchical object structure, all in just a few lines of code. You can
then traverse the document tree in software, doing whatever you like to
the tree structure. Add nodes, delete them, update their values, read
or set their attributes -- basically anything you like. When your tree
has the new structure you desire, tell the top node to print itself to
another XML file, and the new document is created.
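Here is a small sketch of that parse, modify, and write cycle. It leans on assumptions not made in this article: the javax.xml.parsers and javax.xml.transform APIs from later JDKs (an identity Transformer does the "print itself" step, since the DOM interfaces define no print method), and a made-up class name, DomEditDemo. The element names follow the recipe example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class DomEditDemo {

    /**
     * Parses the recipe, appends a new Ingredient/Item pair to the first
     * Ingredients element, and returns the modified document as XML text.
     */
    static String addIngredient(String recipeXml, String item) {
        try {
            // 1. Parse: the parser hands back the whole document as a tree.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(recipeXml)));

            // 2. Modify: build a small subtree and graft it onto the tree.
            Element itemNode = doc.createElement("Item");
            itemNode.appendChild(doc.createTextNode(item));
            Element ingredient = doc.createElement("Ingredient");
            ingredient.appendChild(itemNode);
            Node ingredients = doc.getElementsByTagName("Ingredients").item(0);
            ingredients.appendChild(ingredient);

            // 3. Write: an identity transform serializes the tree back to XML.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not edit document", e);
        }
    }

    public static void main(String[] args) {
        String recipe = "<Recipe><Ingredients>"
                      + "<Ingredient><Item>Lime gelatin</Item></Ingredient>"
                      + "</Ingredients></Recipe>";
        System.out.println(addIngredient(recipe, "Tabasco sauce"));
    }
}
```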
XML-Java synergy
One of the reasons Java and XML are so well-suited for one another is
that Java and XML are both extensible: Java through its class loaders,
XML through its DTD. Imagine a server, reading and writing XML, where
the DTD for the system input can change. When a new element is added to
the input language, a running server (written in Java) could
automatically load new Java classes to handle the new tags. You would
not only have an extensible application server -- you wouldn't even
have to take the server down to add the extensions!
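To make the idea concrete, here's a toy sketch: a table maps tag names to handler class names, and each class is loaded by name only when its tag turns up. The TagHandler interface, the two handler classes, and the TagDispatchDemo class are all invented for this illustration; a real server would also need to decide where new class files come from and when the table gets refreshed.

```java
import java.util.HashMap;
import java.util.Map;

final class TagDispatchDemo {

    /** Hypothetical contract for code that processes one kind of element. */
    interface TagHandler {
        String handle(String text);
    }

    static final class NameHandler implements TagHandler {
        public String handle(String text) { return "name=" + text; }
    }

    static final class StepHandler implements TagHandler {
        public String handle(String text) { return "step=" + text; }
    }

    // Tag name -> handler class name. Entries can be added while running.
    private final Map<String, String> classNames = new HashMap<>();

    void register(String tag, String className) {
        classNames.put(tag, className);
    }

    /** Loads the handler class for the tag by name and lets it process the text. */
    String dispatch(String tag, String text) {
        String className = classNames.get(tag);
        if (className == null) {
            return "unhandled tag " + tag;
        }
        try {
            TagHandler handler = (TagHandler) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
            return handler.handle(text);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot load handler " + className, e);
        }
    }

    public static void main(String[] args) {
        TagDispatchDemo server = new TagDispatchDemo();
        server.register("Name", NameHandler.class.getName());
        System.out.println(server.dispatch("Name", "Lime Jello Surprise"));
        System.out.println(server.dispatch("Step", "Stir"));   // not yet known
        server.register("Step", StepHandler.class.getName());  // "new tag" arrives
        System.out.println(server.dispatch("Step", "Stir"));
    }
}
```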
This one small idea hints at what's possible when XML and Java are used
together. The next section is about a company whose combination of XML
and Java is its core technology.
XML with Java in the real world
You now have a handle on XML technology, including how it's implemented
in Java. You understand that a document can be viewed as a tree of
objects and manipulated using SAX or DOM. Let's have a look at a real
company that is using all of these technologies to provide solutions
for its clients.
DOM interfaces exist not only for XML, but for HTML, as well. This
means that the leftmost document in Figure 4 could be a Web page from
which you wish to extract information for manipulation in Java.
In fact, Epicentric, an Internet startup in San Francisco, does just that. Epicentric uses Java and XML in its turnkey systems to allow creation of custom portal sites. Portal sites, like the front pages of Netscape Netcenter and Excite!, are integrated aggregations of information from
various Internet sources. In a corporate Internet environment, a
portal may contain information gleaned from external Web pages (for
example, weather reports), alongside internal enterprise data. Portals
are also often customizable by each user.
Epicentric's systems read HTML from the Internet as DOM documents,
extract information from those documents, and store that information in
a standard XML format. Other information sources are also converted
into this same XML format and stored on Epicentric's server. The company
then uses the XML with XSL and Java Server Pages to create custom
portals for its clients.
"A lot of good work has been done on the basics ... like parsers
and XSL processors," says Ed Anuff, CEO of Epicentric. One benefit
of using XML is that it makes designers think through the system
structure in a very structured way, Anuff says.
When asked about concerns with XML, Anuff states that many of the
problems he runs into are architectural, such as which DTD to use, and
designating the appropriate places in the system to use XML. Systems
designers are still working out how to use this new technology most
effectively in an enterprise environment.
Also, since the technology is so new, it's often hard to know what
pieces of the system to build in-house. For example, quite a few
companies built their own XML parsers but now have little return on
investment because larger companies are developing superior XML
technology and giving it away for free. "The biggest challenge
today is figuring out when you're reinventing the wheel, and when
you're adding value," says Anuff.
Despite these challenges, the future looks bright for Epicentric, which
has several "pretty decent-sized customers" using the
company's software in beta. With clients and advertisers that include
the likes of Eastman Kodak Company, Sun Microsystems, Chase Bank, and
LIFE Magazine, Epicentric is using XML to aggregate and redistribute
information in novel ways.
Conclusion
XML is a powerful data representation technology for which Java is
uniquely well-suited. You're going to be hearing a lot about XML in the
coming months and years. Anyone working with information systems that
communicate with other systems (and what systems don't, these days?)
has a lot to gain by understanding XML technology and using it to its
full advantage.
Using XML with XSL or CSS, you can manage your Web site's content and
style, and change style in one place (the style sheet) instead of
editing piles of HTML files or, worse, editing the scripts that produce
HTML dynamically. Using SAX or DOM, you can treat Web documents as
object structures and process them in a general and clean way. Or, you
can leave browsers behind entirely and write pure-Java clients and
servers that talk to each other -- and other systems -- in XML, the new
lingua franca of the Internet. Sun Microsystems, the creator
of Java, has perhaps best described the power of XML and Java together
in its slogan: Portable Code -- Portable Data. Start experimenting with
XML in Java, and you'll soon wonder how you ever lived without it.
Thanks to Dave Orchard for his comments on drafts of this article,
and to the many helpful people I met in San Jose, CA.
About the author
Mark Johnson lives in Fort Collins, CO, and is a C++ programmer
by day and Java columnist by night. Very late night.
(c) Copyright 1999 ITworld.com, Inc., an IDG Communications company
Resources
There are so many XML resources on the Web, I've had to categorize. The first section here is the most useful, since the documents are either high-level summaries or excellent link sites. Apologies to anyone who was omitted.
XML and Java: General XML resources
- "XML, Java and the Future of the Web." by Jon Bosak. The paper that started it all, at least from a Java programmer's point of view. Definitely worth a read, even if it's a bit dated. Jon is commonly considered to be the father of XML. Funny how all of these technologies seem to have paternity
http://metalab.unc.edu/pub/sun-info/standards/xml/why/xmlapps.html
- "Media-Independent Publishing: Four Myths about XML," by Jon Bosak
http://metalab.unc.edu/pub/sun-info/standards/xml/why/4myths.htm
- Robin Cover's XML-SGML site is, according to my SGML buddies, the bible of XML resources
http://www.oasis-open.org/cover/
- The W3C's XML resource page lets you cheer from the sidelines as XML technology proposals develop into recommendations, or join in the fray on their active mailing lists
http://www.w3.org/XML/
- OASIS, the Web site of the Organization for the Advancement of Structured Information Standards, offers general news and information about XML
http://www.oasis-open.org
- The Graphic Communications Association, host of the XTech '99 conference (March 11 to 13, 1999, San Jose, CA) and the upcoming XML Europe '99 conference in Granada, Spain (April 26 to 30, 1999), has a Web site packed with XML information
http://www.gca.org/
- XML.com is great for watching trends and digging up XML news
http://www.xml.com
- Textuality hosts Tim Bray's site. Check it out for a look at the "big picture" of how XML fits into the structured document universe -- and for a look at Lark, Tim's nonvalidating XML processor
http://www.textuality.com/
- The XML FAQ
http://www.ucc.ie/xml/
- IBM's XML Web site is an outstanding supplement to alphaWorks
http://www.software.ibm.com/xml/index.html
XML and Java
Tutorials and training
Cascading Style Sheets
Extensible Style Language (XSL)
Upcoming XSL contest
Though the details aren't yet worked out, Sun Microsystems will soon announce a call for proposals for a $30,000 grant to develop a client-side processor that fully implements XSL in Mozilla. It will also announce, in conjunction with Adobe, a contest (first prize $40,000, second prize $20,000) to develop a pure-Java, server-side processor for the entire XSL language that formats XML to PDF (Adobe's document format). Keep watching the Java Developer Connection (requires free registration) and Mozilla sites for the eventual announcements.
Simple API for XML (SAX)
Document Object Model (DOM)
Dynamic HTML
Software
- Epicentric, Inc.
http://www.epicentric.com
- More XML (and other Java) technology than you can shake a stick at is available at IBM's alphaWorks
http://alphaworks.ibm.com
- Version 2 of IBM's excellent XML parser package, xml4j, is available for download. This package includes several parsers, both validating and nonvalidating
http://www.alphaworks.ibm.com/tech/xml4j
- See also IBM's exciting Bean Markup Language project, which uses XML to represent and manipulate JavaBeans
http://www.alphaworks.ibm.com/tech/bml
- Another free Java XML parser was written by the indefatigable James Clark; download it at
http://www.jclark.com/xml/xp/index.html
- XEENA is IBM alphaWorks's DTD-guided XML editor. You want it, you need it, you gotta have it
http://www.alphaworks.ibm.com/tech/xeena
- Mozilla.org is the open source community's effort to extend the Netscape source code. Find out about it at
http://www.mozilla.org
- Information about XML and CSS in Mozilla appears at
http://www.mozilla.org/rdf/doc/xml.html
- You can read about Sun's XML and Java initiatives at
http://www.sun.com/990310/java_xml.jhtml
- In addition, Java Project X includes source code downloadable from
http://developer.java.sun.com/developer/earlyAccess/xml/index.html
- ArborText has a suite of sophisticated tools for editing SGML, XML, and XSL
http://www.arbortext.com/Products/products.html
- Oracle8i from Oracle Corporation uses XML inside the Oracle core
http://www.oracle.com/xml/
- Download Oracle's free XML for Java parser
http://technet.oracle.com/direct/3xml.htm
- Microsoft's Internet Explorer 5.0, released this month, implements part of the XSL spec. You can find it on Microsoft's Web site -- and also just about anywhere else
http://www.microsoft.com/windows/ie/default.htm
- You can also download a beta release of Microsoft's XML Notepad editor (limited to running only on Microsoft Windows)
http://www.microsoft.com/xml/notepad/download.asp
- Vervet Logic of Bloomington, IN, has announced XML <PRO>, a commercial XML editor
http://www.vervet.com/
- Majix, to transform XML to HTML via XSL, is available at
http://www.tetrasix.com/
- If your French is rusty, you might want to try the English-language site at
http://www.tetrasix.com/english/default.htm
History
Miscellaneous links
URL: http://www.javaworld.com/jw-04-1999/jw-04-xml.html
Last modified: Wednesday, February 23, 2000