Language is the medium of communication. When two people share a language, they can express sophisticated concepts, great oratory, works of the storyteller's art and the poet's, intimate conversations, and the deepest philosophy. When they don't, conversation reverts to a more primitive level, snatches of potentially shared words, wild gesticulations, grunts, and mimes, with a high chance that what is a request for directions to the nearest restaurant for one person may be the gravest of insults for the other person.
The languages that computers use to communicate with one another run into many of the same limitations as the languages that people use. Moreover, even within the same computer, different programs may use different representations to indicate the same thing. Even if all the pieces of a business card entry -- a person's name, address, primary phone number, and e-mail address -- are the same from one application to the next, differences in the programming languages used to construct these objects may mean that what is built for one application is completely useless for a second one, especially if the second was written with a different programming language from the first.
One of the main problems in developing a common format for computer communication comes from the way that most computers store information. Typically, if a programmer creates a piece of code representing an address book, for instance, that code gets converted into a compressed and optimized form that can be understood only if you know exactly what format was used to create it; otherwise, the code is gibberish -- not only to a human reader, but also to any other application that wants to work with it. In other words, you can use the information from the address book in your own program only if you have an exact description of the source code.
The alternative is to use plain text. Suppose that you wanted to set up a business card in the address book. Think of a typical business card. It has a number of different pieces of information, which might look something like this:
Sarah Tremaine
1125 NE. Oak St.
Kirkland, WA 98033
(425)555-1212
sarah.tremaine@midnightrain.com
This is a little bit better, but still not very useful. You may be able to guess based upon your own cultural background that the first line is the person's name, the second an address, and so forth, but to a computer, each of these is simply a line of text with absolutely no context whatsoever.
You might think that you could use the position of each line, but that only works if every business card looks the same. If, on the other hand, someone had two phone numbers, the line count would be off and the computer might end up thinking that Sarah's e-mail address was a phone number.
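The weakness of relying on line position can be sketched in a few lines of Python. This is only an illustration: the parse_by_position function, the FIELDS list, and the second phone number are made up for the example and are not part of any real address book format.

```python
# Hypothetical positional parser: it assumes every card has exactly five
# lines in the order name, street, city, phone, e-mail.
FIELDS = ["name", "street", "city", "phone", "email"]

def parse_by_position(card_text):
    """Pair each line with a field name purely by its position."""
    lines = card_text.strip().splitlines()
    return dict(zip(FIELDS, lines))

one_phone = "\n".join([
    "Sarah Tremaine",
    "1125 NE. Oak St.",
    "Kirkland, WA 98033",
    "(425)555-1212",
    "sarah.tremaine@midnightrain.com",
])

# Same card with an invented second phone number added.
two_phones = "\n".join([
    "Sarah Tremaine",
    "1125 NE. Oak St.",
    "Kirkland, WA 98033",
    "(425)555-1212",
    "(425)555-1213",
    "sarah.tremaine@midnightrain.com",
])

good = parse_by_position(one_phone)
bad = parse_by_position(two_phones)

print(good["email"])  # the real e-mail address
print(bad["email"])   # the second phone number, misfiled as an e-mail
```

With the extra line, every field after the first phone number shifts by one, and the program has no way of noticing.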
Another possible solution would be to use labels to describe what each line of information is. A labeled business card might look something like this:
Name:Sarah Tremaine
Street:1125 NE. Oak St.
City:Kirkland, WA 98033
Phone:(425)555-1212
Email:sarah.tremaine@midnightrain.com
This is considerably clearer, but things could get complicated if you have more than one business card in your list:
Name:Sarah Tremaine
Street:1125 NE. Oak St.
City:Kirkland, WA 98033
Phone:(425)555-1212
Email:sarah.tremaine@midnightrain.com
Name:Henry Troughton
Street:1255 NE. Oak St.
. . .
In this case, there's no clear boundary between different cards -- you have to know that the name is always the first entry of a card. You could use other labels and indentations to make this relationship a little clearer:
AddressBook:
    BusinessCard:
        Name:Sarah Tremaine
        Street:1125 NE. Oak St.
        City:Kirkland, WA 98033
        Phone:(425)555-1212
        Email:sarah.tremaine@midnightrain.com
    BusinessCard:
        Name:Henry Troughton
        Street:1255 NE. Oak St.
        City:Kirkland, WA 98033
        Phone:(425)555-1111
        Email:henry.troughton@midnightrain.com
At this point, you have structured the information into an easier to understand format. However, this now introduces other problems. For starters, you become very dependent upon tabs and carriage returns, which means that this format is not very good for storing documents that consist of paragraphs of text. Moreover, editing that text in certain programs may convert the tabs into spaces or may break the text along arbitrary boundaries.
Consequently, the tags that you create should be in a format that obviously identifies them as being tags, whatever happens to the white space around them. For instance:
<AddressBook>
    <BusinessCard>
        <Name>Sarah Tremaine
        <Street>1125 NE. Oak St.
        <City>Kirkland, WA 98033
        <Phone>(425)555-1212
        <Email>sarah.tremaine@midnightrain.com
    <BusinessCard>
        <Name>Henry Troughton
        <Street>1255 NE. Oak St.
        <City>Kirkland, WA 98033
        <Phone>(425)555-1111
        <Email>henry.troughton@midnightrain.com
Note that a relationship has been established in this information. An address book is a container -- it holds business cards. A business card, in turn, is a container that holds a name, address, phone number, and an e-mail address. Although this relationship is pretty obvious, this may not be the case if you take the indentations out:
<AddressBook> <BusinessCard> <Name>Sarah Tremaine <Street>1125 NE. Oak St. <City>Kirkland, WA 98033 <Phone>(425)555-1212 <Email>sarah.tremaine@midnightrain.com <BusinessCard> <Name>Henry Troughton <Street>1255 NE. Oak St. <City>Kirkland, WA 98033 <Phone>(425)555-1111 <Email>henry.troughton@midnightrain.com
On the other hand, if you created a closing tag that marked the end of a particular block, you develop a set of relationships where one tag contains other tags:
<AddressBook>
    <BusinessCard>
        <Name>Sarah Tremaine</Name>
        <Street>1125 NE. Oak St.</Street>
        <City>Kirkland</City>
        <State>WA</State>
        <PostalCode>98033</PostalCode>
        <Phone>(425)555-1212</Phone>
        <Email>sarah.tremaine@midnightrain.com</Email>
    </BusinessCard>
    <BusinessCard>
        <Name>Henry Troughton</Name>
        <Street>1255 NE. Oak St.</Street>
        <City>Kirkland</City>
        <State>WA</State>
        <PostalCode>98033</PostalCode>
        <Phone>(425)555-1111</Phone>
        <Email>henry.troughton@midnightrain.com</Email>
    </BusinessCard>
</AddressBook>
The closing tag, indicated by </, tells both you and the computer that this is the end of a particular block of meaning. The relationship between a container and what it contains becomes obvious. Perhaps more significantly, the computer now has enough information from this text file to define these relationships without needing to know anything about the meaning of the labels or tags.
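To see what it means for a program to recover these relationships without understanding the labels, here is a minimal sketch. It assumes Python's standard xml.etree.ElementTree module as the parser (the lesson does not prescribe one), and the outline function is a made-up helper. Nothing in it mentions names, streets, or phones; it simply follows the nesting.

```python
import xml.etree.ElementTree as ET

ADDRESS_BOOK = """<AddressBook>
    <BusinessCard>
        <Name>Sarah Tremaine</Name>
        <Street>1125 NE. Oak St.</Street>
        <City>Kirkland</City>
        <State>WA</State>
        <PostalCode>98033</PostalCode>
        <Phone>(425)555-1212</Phone>
        <Email>sarah.tremaine@midnightrain.com</Email>
    </BusinessCard>
    <BusinessCard>
        <Name>Henry Troughton</Name>
        <Street>1255 NE. Oak St.</Street>
        <City>Kirkland</City>
        <State>WA</State>
        <PostalCode>98033</PostalCode>
        <Phone>(425)555-1111</Phone>
        <Email>henry.troughton@midnightrain.com</Email>
    </BusinessCard>
</AddressBook>"""

def outline(element, depth=0):
    """Return one indented line per element, based only on the nesting."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(ADDRESS_BOOK)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

The same function would work unchanged on a recipe collection or a parts catalog, because it depends on the opening and closing tags rather than on what the tags are called.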
Why is this important? If you can express information in this manner, you can send it to a computer program that can read it, test to make sure that the information is valid (a process called, not surprisingly, validation), and if it is valid, start working with it with very generalized tools available on almost every computer operating system, from supercomputers to hand-held devices.
If the software company that creates an application you use has a compressed (and proprietary) format, these files become unreadable if the software company goes out of business and the software is no longer available or if it upgrades its product without supporting your format. With the simple text format described above, on the other hand, your address book can live on no matter how many companies go under.
This format has a name -- the Extensible Markup Language, or XML. XML is not, despite what most people may think, a collection of tags that describe something. It is instead a language for describing what notation to use to indicate that a name is in fact a tag, rather than content. In essence, it's a set of rules that establishes how you group things, how you indicate tags (known properly as elements), how you indicate modifiers to these elements (known as attributes), how you handle problematic text that may include special characters, and so on.
XML is consequently what's known as a meta-language -- a language for describing how to create other languages. The example provided above is an example of our own customized Address Book Language.
Unfortunately, as with any language, if you're the only person who can speak it, the actual utility of the language is somewhat limited. Although it is easier to decipher common terminology if the document is in XML format, that doesn't necessarily make it easy -- especially when specialized jargon or abbreviations are used. Consequently, much of the role of an XML designer is to create consensus in getting a language accepted by all parties involved in the "conversation." You can, in fact, think of XML as being something like a contract: It establishes the terms and the definitions of those terms so that everyone can be sure that when one person talks about a "name," any other person (or computer program) can use the term without ambiguity.
Although the example above illustrates how you could derive an XML-esque language, it helps to know exactly what rules determine whether a document is well formed or not. To be well formed implies that the XML that you've created or received completely follows a well-defined set of rules. Note that even well-formed XML may be meaningless or have erroneous content; well-formedness just indicates that an XML parser program can read it and do something with it.
The rules themselves are straightforward, though there are enough exceptions (especially for material compatible with older standards) to keep things interesting. The following list isn't exhaustive, but it's enough to ensure that your XML is generally well formed.
Rule 1: root elements
There must always be one and only one root element in the XML document. If you envision XML as a tree, the root "node" is the junction where the first major branch of the tree splits off from the trunk. To put it another way, the root is the single ancestor from which everyone else descends in a family tree.
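A quick way to watch this rule being enforced is to hand a parser a document with two top-level elements. The sketch below assumes Python's standard xml.etree.ElementTree parser; is_well_formed is a made-up helper name, and the two sample strings are invented for the illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# One root element that contains two cards.
one_root = "<AddressBook><BusinessCard/><BusinessCard/></AddressBook>"

# Two cards side by side with nothing containing them: two roots.
two_roots = "<BusinessCard/><BusinessCard/>"

print(is_well_formed(one_root))
print(is_well_formed(two_roots))
```

The second string is rejected even though each card is fine on its own, because nothing ties the two together into a single tree.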
Rule 2: names in XML
Names in XML are case sensitive. Put another way, <address>, <Address>, <ADDRESS>, and <adDresS> are all distinct elements. Additionally, element names cannot start with a number, a dash (-), or a period (.), cannot have any white space (spaces, carriage returns, or tabs), and cannot have any punctuation other than a dash (-), underscore (_), period (.), or colon (:). The colon is best avoided in ordinary names because it is reserved for namespaces.
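Case sensitivity has two practical consequences, sketched here with Python's standard xml.etree.ElementTree parser (an assumption; any conforming parser should behave the same way). The card element and its contents are invented for the example.

```python
import xml.etree.ElementTree as ET

# Two children whose names differ only by case are two different elements.
doc = ET.fromstring("<card><address>a</address><Address>b</Address></card>")
tags = [child.tag for child in doc]
lower_only = doc.findall("address")
print(tags)
print(len(lower_only))

# An opening tag cannot be closed by a tag that differs in case.
try:
    ET.fromstring("<Address>1125 NE. Oak St.</address>")
    mismatched_ok = True
except ET.ParseError:
    mismatched_ok = False
print(mismatched_ok)
```

Searching for address finds only the lowercase element, and the document whose closing tag differs in case is not well formed at all.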
Rule 3: elements
An XML element consists of the name of an element and any attributes on that element, and it may have child elements or text blocks (or both).
Rule 4: attributes
An attribute is a name/value pair where the name appears on the left of an equal sign, and the value appears in quotation marks on the right of the same sign. For instance, the following is a p element (perhaps a paragraph in HTML) with the attributes width and height, having values of 100 and 50, respectively.
<p width="100" height="50">This is a test.</p>
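Once parsed, attributes are typically handed to a program as name/value pairs. The following sketch assumes Python's standard xml.etree.ElementTree module, which exposes them as a dictionary of strings; converting a value such as the width to a number is left to the program.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring('<p width="100" height="50">This is a test.</p>')

print(p.attrib)          # attribute names mapped to string values
print(p.text)            # the text content of the element

# Attribute values are always text; convert them when a number is needed.
width = int(p.get("width"))
height = int(p.get("height"))
print(width * height)
```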
Rule 5: empty elements
In certain cases, an element may not have any children or contained text. In this case, you can use one tag rather than two to indicate the element, with the closing slash indicator at the end of the tag, like this:
<horizontal_rule width="100%" height="8"/>
This is the same as
<horizontal_rule width="100%" height="8"></horizontal_rule>
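You can confirm that the two spellings are equivalent by parsing both and comparing the results. This is a small sketch that assumes Python's standard xml.etree.ElementTree module; the variable names are made up.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring('<horizontal_rule width="100%" height="8"/>')
long_form = ET.fromstring('<horizontal_rule width="100%" height="8"></horizontal_rule>')

# Same name, same attributes, and no children in either case.
same_name = short_form.tag == long_form.tag
same_attributes = short_form.attrib == long_form.attrib
both_childless = len(short_form) == 0 and len(long_form) == 0
print(same_name, same_attributes, both_childless)

# Writing both back out produces identical text.
print(ET.tostring(short_form) == ET.tostring(long_form))
```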
Rule 6: comments
You can create a comment within the XML code by using the comment delimiters <!-- and -->. This makes it possible to add explanatory notes such as:
<address>
    <!-- This is the street address -->
    <street>?</street>
</address>
You can also use comments to remove a section of XML from being parsed. For instance, to remove the address block, you'd surround it with comment start and end tags:
<!--
<address>
    <street>?</street>
</address>
-->
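Both uses of comments can be observed from a program. The sketch below assumes Python's standard xml.etree.ElementTree parser, which skips comments by default. Because a document must still have a root element, the commented-out address block is placed inside a hypothetical contact element invented for this example.

```python
import xml.etree.ElementTree as ET

# A comment used as a note: the parser skips it and keeps the street.
noted = ET.fromstring(
    "<address><!-- This is the street address --><street>?</street></address>"
)
noted_children = [child.tag for child in noted]
print(noted_children)

# A comment used to switch off a block: the address never reaches the program.
switched_off = ET.fromstring(
    "<contact><name>Sarah Tremaine</name>"
    "<!-- <address><street>?</street></address> -->"
    "</contact>"
)
remaining_children = [child.tag for child in switched_off]
print(remaining_children)
```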
Rule 7: white space
XML applications in general do not treat white space as significant. A parser normalizes line endings, and most applications make no distinction between a carriage return and a space when processing the content. There are ways of getting around this, such as setting the xml:space attribute to "preserve" on an element or placing white space sensitive text within a CDATA block, as follows:
<test><![CDATA[White Space Does Matter Here]]></test>
In this example, the expressions <![CDATA[ and ]]> indicate the start and end of a block whose contents the parser passes through exactly as written, white space included.
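How white space is ultimately treated depends on the parser and the application, so take the following as one data point rather than a general guarantee. It assumes Python's standard xml.etree.ElementTree module, and the sample text (with its extra spaces, line break, and markup-like characters) is invented for the illustration.

```python
import xml.etree.ElementTree as ET

source = "<test><![CDATA[White   Space\nDoes <b>Matter</b> Here]]></test>"
element = ET.fromstring(source)

# The text comes back exactly as typed, including the run of spaces,
# the line break, and the characters that would otherwise start a tag.
print(repr(element.text))

# Nothing inside the CDATA block was treated as a child element.
print(len(element))
```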
Rule 8: processing instructions
A processing instruction is a specialized tag that starts with <? and ends with ?>, which serves as information to the XML processor. The most common such processing instructions include the XML declaration that appears at the top of most well-formed XML:
<?xml version="1.0"?>
and the command used to provide styling information to XML documents:
<?xml-stylesheet type="text/css" href="myStyleSheet.css"?>
Processing instructions were important in SGML (Standard Generalized Markup Language), the predecessor to XML, but in general their use has been deprecated (that is, downplayed with the intention of eventually making it obsolete), primarily because processing instructions place too much intent into the XML, rather than just description.
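A program can read processing instructions separately from the elements of the document. The sketch below assumes Python's standard xml.dom.minidom module, which keeps processing instructions as nodes of their own; the empty AddressBook element is there only so that the document has a root.

```python
from xml.dom import minidom

DOCUMENT = (
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/css" href="myStyleSheet.css"?>'
    '<AddressBook/>'
)

dom = minidom.parseString(DOCUMENT)

# Collect the top-level nodes that are processing instructions.
instructions = [
    node for node in dom.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]

pi = instructions[0]
print(pi.target)  # the name right after "<?"
print(pi.data)    # everything else, left for the application to interpret
```

Note that the XML declaration on the first line does not show up in the list; in this module only the stylesheet instruction is reported as a node.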
Rule 9: entities
Certain characters cause problems when encountered in XML, including the less than (<) and greater than (>) characters that mark tags. Because these characters do occasionally pop up, it's sometimes useful to express these in alternative ways. One such way is the use of entities. An entity in XML always starts with an ampersand (&) and ends with a semicolon (;). XML itself defines three primary entities by name: &lt; for the < symbol, &gt; for the > symbol, and &amp; for the ampersand character (&) itself. For instance, a company name element with the content Troughton & Sons would be written like this:
<companyName>Troughton &amp; Sons</companyName>
Certain other characters can also be rendered by name in this way, but these named entities require a document type definition (discussed in Lesson 2). Normally, such characters use a numeric equivalent instead (corresponding to the character's code in the Unicode standard, which defines how letters, numbers, and symbols map to numeric values). Thus, &#32; is an entity representing a single white space, whereas &#13; represents a carriage return.
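In practice you rarely type entities by hand; a library does the escaping when text is written and undoes it when the text is read. The following sketch assumes Python's standard library, using xml.sax.saxutils.escape for writing and xml.etree.ElementTree for reading. The variable names are made up.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "Troughton & Sons"

# Writing: the ampersand becomes a named entity.
escaped = escape(raw)
xml_text = "<companyName>" + escaped + "</companyName>"
print(xml_text)

# Reading: the parser turns the entity back into the original character.
parsed = ET.fromstring(xml_text)
print(parsed.text)

# A bare ampersand, by contrast, makes the document unreadable.
try:
    ET.fromstring("<companyName>" + raw + "</companyName>")
    bare_ampersand_ok = True
except ET.ParseError:
    bare_ampersand_ok = False
print(bare_ampersand_ok)

# Numeric entities work without any extra definitions.
spaced = ET.fromstring("<t>a&#32;b</t>").text
print(repr(spaced))
```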
At first, XML used many of the same conventions as SGML, including using entities to represent not just single characters but entire blocks of text (paragraphs or even whole documents). However, for a number of reasons, this use of entities has also been deprecated and is rapidly becoming obsolete as alternative ways to do the same thing emerge.
Rule 10: container/contained relationships
Many people come to XML from HTML (Hypertext Markup Language) and carry with them certain bad habits that the somewhat laxer rules of HTML make possible. One of these comes from overlapping markup. For instance, you can create an HTML passage where a bold element starts (making the text within appear bold), then an italic element starts after a few words (making the text appear both bold and italic), and then the bold gets turned off (italic only), until finally the ending-italic tag is reached so that the text returns to its previous state. That code would be written like this:
This <b>is a <i>test. This</b> is only a </i> test.
That code would render with the words "is a" in bold, "test. This" in bold italic, and "is only a" in italic:
This is a test. This is only a test.
The problem with this is that the bold element no longer contains the italic element. This is illegal in XML (and HTML). Instead, the same expression would have to be achieved as:
This <b>is a <i>test. This</i></b><i> is only a </i> test.
In this case, the bold (<b>) element contains one set of italic (<i>) elements, and a second set of italic elements carries on the rest of the expression.
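The difference between the two versions is easy to check with a parser. The sketch below assumes Python's standard xml.etree.ElementTree module. Both passages are wrapped in a p element, added here only because a document needs a single root.

```python
import xml.etree.ElementTree as ET

overlapping = "<p>This <b>is a <i>test. This</b> is only a </i> test.</p>"
nested = "<p>This <b>is a <i>test. This</i></b><i> is only a </i> test.</p>"

# The overlapping version is rejected outright.
try:
    ET.fromstring(overlapping)
    overlapping_ok = True
except ET.ParseError:
    overlapping_ok = False
print(overlapping_ok)

# The properly nested version yields a clean tree: the paragraph holds a
# bold element and an italic element, and the bold holds its own italic.
tree = ET.fromstring(nested)
top_level = [child.tag for child in tree]
inside_bold = [child.tag for child in tree[0]]
print(top_level)
print(inside_bold)
```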
This container/contained relationship may seem more complicated than necessary, but it actually serves a number of very vital purposes:
- The container/contained relationship makes it much easier for an application to parse the XML.
- It makes it easier to determine whether elements have been ended correctly or if an error has occurred.
- It means that the parser doesn't need to know any specialized rules about certain types of elements.
These rules are generally easy to apply, and most XML parsers can tell you when you create XML whether you've created well-formed code (and where the errors are if you didn't). This represents another advantage to using XML; even without understanding the intent of the XML, the tools used to parse and process XML can still work remarkably well. This usually isn't the case with more traditional programming data-structures.
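As an illustration of a parser reporting where a problem lies, the sketch below feeds a deliberately broken address book to Python's standard xml.etree.ElementTree module (an assumed choice of parser). The BusinessCard element is never closed, and the ParseError that results carries a line and column position.

```python
import xml.etree.ElementTree as ET

broken = "\n".join([
    "<AddressBook>",
    "  <BusinessCard>",
    "    <Name>Sarah Tremaine</Name>",
    "</AddressBook>",   # BusinessCard is still open at this point
])

try:
    ET.fromstring(broken)
    where = None
except ET.ParseError as err:
    where = err.position   # a (line, column) pair
    print("Not well formed:", err)

print(where)
```

The parser knows nothing about business cards; it only notices that the closing tag on the fourth line does not match the element that is currently open.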
A short history lesson is in order. In the late 1960s, Charles Goldfarb of IBM was tasked with a fairly complex problem. Computers were beginning to reach a point of complexity where they could handle more tasks than just simply adding up rows of numbers; indeed, one of the things that made the explosion in computers so exciting was the realization that you could encode text using numbers, save that information to a storage medium such as magnetic tape, and then reproduce (or process) that text through programs.
IBM had begun to produce a large amount of documentation -- manuals, specifications, case studies, research proposals, and so on. In fact, it had produced so much documentation that it was difficult to find anything useful in all of this. Dr. Goldfarb was (with others) tasked to find a way of storing and accessing this documentation electronically. The solution that he came up with was a markup language, a notation for breaking a document into pieces, such as headers, subheads, paragraphs, lists, and so on. However, rather than just creating one markup language to handle everything, he created a language for writing various kinds of markup language. This language was dubbed GML (Generalized Markup Language).
Like so much of what IBM did, GML made its way into research material, which then fueled a number of other GMLs from different companies and governments. Eventually, to maintain some sense of order on a rapidly proliferating standard, the ISO (International Organization for Standardization) convened a group to establish a single standardized GML, which became known as SGML (Standard Generalized Markup Language) in the early 1980s. It was adopted by the United Nations at that point as the language for documentation.
However, the legacy of all those proliferating GMLs took its toll. No one likes to have his "standard" not be standard, and consequently, the number of ways that you could create "standardized" markup was actually pretty astonishing. Because there were so many different exceptions that had to be considered, SGML parsers were pretty hefty (and pricey) pieces of code.
In the late 1980s, a programmer by the name of Tim Berners-Lee was tasked with another fairly complex problem. Berners-Lee worked for CERN, the European high energy physics laboratory in Switzerland. The physicists there routinely wrote a large number of research papers that were published in prestigious scientific journals. However, as with Charles Goldfarb's documentation problem, too many of these papers remained unknown because no one knew that they had been published. Berners-Lee's solution to this problem was to create, using SGML, a basic language for coding physics abstracts.
In the process of doing so, he combined three basic concepts in a unique way. The first was to set up something called a protocol, which was simply a way of communicating between two machines. His particular protocol addressed a computer on his network by way of a label, perhaps with directories to the particular file that he wanted to access. This by itself was not especially novel -- there were dozens of similar protocols in existence. Where Tim Berners-Lee differed from everyone else was in creating a language in which these particular labels could be embedded, such that selecting a link in a document written in that language would cause the new page to be displayed. This principle had been floating around for a while and was called hypertext, but by creating a hypertext link to some other point on the network, Berners-Lee made it possible to create an entire "space" of such links.
The protocol was called HTTP (Hypertext Transfer Protocol), and the language was named HTML (Hypertext Markup Language). Having done this, Tim Berners-Lee then wrote two pieces of software: a client piece that could display the HTML files as text on a terminal (the first Web browser) and a server piece that would retrieve the HTML pages when requested via HTTP (the first Web server). Most of this would have stayed in obscurity had he not also done one other crucial thing: He made the source code for the client, the server, and the language itself available for free.
The effect of this simple (and gracious) act was profound. In a couple of years, Web browsers and servers had proliferated across university campuses all over the world, and the HTML language had evolved into a general language for posting everything from syllabuses to biography pages to games and much, much more. At the University of Illinois, the Mosaic Web browser, produced by Marc Andreessen and others, first married HTML pages with graphics, profoundly reshaping the direction of the Internet. Entrepreneur James Clark hired these young programmers to work at Netscape, setting off the tumultuous dot-com decade.
As HTML proliferated (especially once it entered the commercial realm), it also became increasingly heavy and fractured. In 1994, a number of people, including Tim Berners-Lee and Charles Goldfarb, helped set up the W3C (World Wide Web Consortium), which rapidly became an industry group oriented toward developing the standards of the Web, starting with HTML.
The problem with HTML at that point was that it had long since ceased being SGML. Some truly silly tags were being pushed by various browser vendors trying to differentiate themselves by features, and this nonconformance of HTML made it difficult to provide a cohesive definition for the language. Thus, the first task that the W3C faced was to fix HTML so that it was valid SGML.
That done, the W3C members faced another issue. As they wrestled with the baroque form that SGML had taken, it quickly became obvious that it was too complex for use on the Web. Consequently, one of the next tasks at hand was to create a "stripped-down" version of SGML that could be more readily used on the Web. It took two years, starting in 1996, for XML (Extensible Markup Language) to mature to the point of being useful. Under Tim Bray's skillful management of the specification, XML was born in February of 1998.
It's worthwhile to take a brief diversion here to explain some terminology used by the W3C, because it comes up a great deal when discussing the various standards promoted by the W3C. If a W3C member organization or corporation wishes to push an idea to the W3C, typically, the member publishes a Note, an initial description of a technology that may be worth developing further.
When the W3C meets (which it usually does quarterly), these Notes are considered and any that look like they may be worth developing are promoted to Working Drafts, with a Working Group being created to develop that standard. Over the course of several months (and often years), the Working Draft is shaped, implementations are built to test the draft, and the best ideas that promote the technology get pushed into a new version of the draft.
Eventually, the Working Draft reaches a point where it's stable enough to be promoted. At this stage, it becomes a Candidate Recommendation. It stays in this form typically for three months, or long enough for other people outside the W3C to work with the technology and provide feedback. This feedback in turn gets pushed into the draft, and if the changes are sufficiently minor, the draft becomes a Proposed Recommendation. At this stage, it's pretty much guaranteed that the draft will be accepted as the official version, unless someone can show that it fails in some critical way. If no one raises sufficiently serious objections, it becomes a full W3C Recommendation.
This terminology can be a little confusing to the outsider. The W3C is not a legal body. It cannot legislate any action, and it's made up principally of a mixture of companies, universities, scholars, and governmental agencies. Consequently, the W3C can merely recommend that the proposed standard be adopted. However, because of the care and meticulous attention to detail that goes into these standards, as well as the role that the W3C has played in establishing the Web in the first place, a Recommendation is a guarantee of the stability, reliability, and independence conferred upon the standard.
XML reached a milestone just a couple of years ago. In 2001, XML was formally adopted by the ISO (International Organization for Standardization), in effect ratifying the language as the foundation for Web technologies with the United Nations.
XML by itself is only so interesting. Once the language itself was defined, the W3C started to address the issues about how best to make XML a language that could be used for a variety of activities. Obviously, moving HTML to an XML format was a primary concern, but there were other, more basic activities that needed to be accomplished first. These activities resulted in the following Core standards, based upon what they were intended to do:
- A schema language (XSD): In 1998, the only way that you could specify the structure of an XML document was to use a DTD (Document Type Definition), a complex document written in a dialect of SGML that wasn't XML based. It took three years to create a standard, XML-based language for describing the schema (or structure) of an XML document.
- A presentation styling language (CSS): Most XML doesn't have any specific visual representation -- there's no way of saying in XML itself that the name of a person in a business card should be rendered as 24-point bold Helvetica text, for instance. A language intended to solve this problem does exist -- CSS (Cascading Style Sheets) -- designed originally for HTML.
- A transformation language (XSLT): CSS can't change the order of entries or otherwise manipulate XML; however, a much more powerful language, XSLT (Extensible Stylesheet Language Transformations), can. It's an XML-based language that can transform XML to HTML. It can also transform XML to other XML, and it is this latter capability that makes it a powerhouse in the XML world.
- A navigation language (XPath): The structure that XML introduces is complex enough that getting from one element to another element (something very necessary in most programming) can prove difficult. The XPath (XML Path) language makes it possible to indicate not only the location to one element, but also the location to a large number of specific elements at the same time. XPath is utilized in a number of other standards.
- A linking language (XLink): If XPath deals with navigation within a single XML document, XLink (XML Linking Language) focuses on linking between documents. XLink is an extension of the simple HTML hypertext links.
- A query language (XQuery): XQuery (XML Query) is an attempt to build a language for querying databases and other data repositories using XML. It incorporates XPath, along with a basic presentation layer for generating simple output.
- A document object model (XML DOM): To work with XML, other computer languages need to have some way of representing XML in that language. Over time, a programming model for XML has emerged and become standardized. This means that regardless of whether a programmer is working with Java, C#, Perl, JavaScript, or anything else, the way that XML is accessed remains constant. This dramatically simplifies computer code across platforms.
- A message protocol language (SOAP): By providing a standard "envelope" that can be used to transfer content from one point to another on the Web, SOAP (Simple Object Access Protocol) hopes to create a friendlier XML layer on top of the existing HTTP layer of the Web.
- A language of meaning (OWL): In establishing a means of defining languages, XML has also opened up the possibility of providing a common way of defining meaning in general. OWL (Web Ontology Language) combines other relational and semantic standards such as RDF (Resource Description Framework), DAML (DARPA Agent Markup Language), OIL (Ontology Interchange Language), and Topic Maps into a general language for semantics and meaning.
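To give a feel for what a navigation language such as XPath does, here is a small sketch that assumes Python's standard xml.etree.ElementTree module. That module supports only a limited subset of XPath, so treat this as a taste of the idea rather than a tour of the full language. The shortened address book is based on the one built earlier in this lesson.

```python
import xml.etree.ElementTree as ET

book = ET.fromstring(
    "<AddressBook>"
    "<BusinessCard><Name>Sarah Tremaine</Name><Phone>(425)555-1212</Phone></BusinessCard>"
    "<BusinessCard><Name>Henry Troughton</Name><Phone>(425)555-1111</Phone></BusinessCard>"
    "</AddressBook>"
)

# One path expression selects every Name element on every card at once.
names = [name.text for name in book.findall("./BusinessCard/Name")]
print(names)

# A condition in square brackets narrows the selection to one card.
henry_phone = book.findtext("./BusinessCard[Name='Henry Troughton']/Phone")
print(henry_phone)
```

The path reads much like a directory path on a file system, which is why the same notation turns up inside several of the other standards in this list.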
Most of the standards discussed in this section are covered in a little more detail throughout this course. If you really want to know the details of all these standards, visit the W3C Web site at www.w3.org/
These languages are all either W3C Recommendations or W3C Working Drafts far enough advanced in the process as to be considered likely to be Recommendations before the end of 2003.
The languages in this list are core languages; they describe the functionality of the Web. In addition to these languages, a number of presentation languages are endorsed (or created) by the W3C. These deal predominantly (though not exclusively) with GUIs (Graphical User Interfaces), which are ways of providing a visual or aural representation of Web content. These are just a few of them:
- Web page display (XHTML): This converts the venerable HTML language into an XML-based language. What's perhaps more noteworthy about XHTML is its modular nature; the language is designed to be used in pieces, as well as single cohesive units.
- Vector graphics (SVG): Vector graphics use equations rather than pixel dots to render images -- such applications as Macromedia Shockwave use vector graphics quite effectively. SVG (Scalable Vector Graphics) uses XML to describe the shapes that make up pictures, and includes interactivity and animation capabilities.
- Multimedia (SMIL): SMIL (Synchronized Multimedia Integration Language) is used to provide both information about a video, audio clip, or animation and to define and control interactivity within such media.
- Forms content (XForms): HTML Forms were first introduced in 1993 and are not even remotely XML friendly. XForms is the next generation of forms technology and has been modularized so that it can be used in a much broader cross-section of applications.
- Speech recognition and generation (VoiceXML): Voice is rapidly becoming the next major interface language. VoiceXML combines a number of emerging XML languages to create a comprehensive and sophisticated tool for building telephony applications, voice recognition systems, text synthesis applications, and more.
This lesson focused on the core XML standards as set by the W3C. Lesson 6 contains a much more exhaustive look at XML standards -- everything from e-commerce to writing poetry.
In this lesson, you found out that XML is a language designed to accommodate building rich data and document structures, universally across platform and programming language. In the half decade since XML became a standard, it has revolutionized the way that programs and operating systems are built. The W3C, the body responsible for the XML specification, has also considerably enhanced the technologies around the core XML language itself.
In the next lesson, you examine the role of schema definition languages, including XSD and DTD, and find out how to create an XML schema.



