Identifying, locating, and naming things on the Web

This article was converted, per the IBM developerWorks author's instructions, to uuu.dw using html2dw.xsl and a Makefile.

It was published 21 Jun 2005 as Untangle URIs, URLs, and URNs.

First published by IBM developerWorks at http://www.ibm.com/developerWorks/.
$Revision: 1.28 $ of $Date: 2005/06/27 20:16:07 $
Dan Connolly, Technical Staff, W3C/MIT
connolly@w3.org dwxed@us.ibm.com
The World Wide Web combines three kinds of technologies: data formats, protocols, and the identifiers that tie them together. The relationship between data formats such as XML and HTML is relatively clear, as is the relationship between protocols such as HTTP and FTP. But identifiers seem to be a bit trickier to pin down. In information management, persistence and availability are in constant tension. This tension suggests separate technologies for Uniform Resource Names (URNs) and Uniform Resource Locators (URLs). Meanwhile, Uniform Resource Identifiers (URIs) are designed to serve as both persistent names and available locations. This article explains how to use the current URI standards with XML technologies, gives a history of URNs and URLs, and gives a perspective on the tension between persistence and availability.

Web addresses were relatively obscure a dozen years ago, but now they appear not just in web browsers but also on business cards and brochures, on billboards and buses and t-shirts. They're commonly known as URLs, which stands for Uniform Resource Locators. A typical example is http://www.cisco.com/en/US/partners/index.html. But sometimes, they'll appear in a shorter form, such as www.yahoo.com/sports. Is that a URL too? What about ../noarch/config.xsd? Or guide/glossary#octothorpe?

To make good use of URLs in XML Namespaces, XML Schemas, and Extensible Stylesheet Language Transformations (XSLT), it's important to know the rules. But the XML family of specifications refers to Uniform Resource Identifiers (URIs) and Uniform Resource Names (URNs). What's the difference? That question has a long history.

My role in that history goes back at least as far as the Hypertext conference in 1991, where I met both Douglas Engelbart, inventor of the mouse and a pioneer of networked computers and hypertext, and Tim Berners-Lee, inventor of the World Wide Web. In a 1990 summary of his 20-plus years of research (see Resources), Engelbart listed among the requirements for an Open Hyperdocument System that, "in principle, every object that someone might validly want or need to cite should have an unambiguous address." In his 1991 design document on naming, Berners-Lee wrote:

This is probably the most crucial aspect of design and standardization in an open hypertext system. It concerns the syntax of a name by which a document or part of a document (an anchor) is referenced from anywhere else in the world.

This article discusses the current state of the art in naming technology and standardization for the World Wide Web as well as some of the history and evolution of the terminology. It concludes with a perspective on naming in information management.

The URI standard

RFC3986: Uniform Resource Identifier (URI): Generic Syntax is an Internet Standard. The Request For Comments (RFC) series is the famous archival document series that is the backbone of the Internet Engineering Task Force (IETF) standards process. Only a few of the thousands of RFCs, such as TCP (RFC793) and the Internet Mail format (RFC822) and protocol (RFC821), have advanced to full Internet Standard status. RFC3986 advanced to this status in January 2005.

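According to the URI standard, the first example above -- http://www.cisco.com/en/US/partners/index.html -- is indeed a URI, and it has several component parts: the scheme (http), the authority, which here is a host name (www.cisco.com), and the path (/en/US/partners/index.html).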

The IETF consensus process manages the schemes. The Official IANA Registry of URI Schemes (see Resources) includes familiar schemes like http, https, and mailto, plus many others that you may or may not be familiar with.
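The component breakdown of the first example can be checked with a short script. This is only a sketch using Python's standard urllib.parse module; the describe helper is a name made up for this illustration, not part of any standard.

```python
from urllib.parse import urlsplit


def describe(uri_reference):
    """Split a URI reference into the generic components named by RFC3986.

    `describe` is a hypothetical helper for this illustration.
    """
    parts = urlsplit(uri_reference)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


cisco = describe("http://www.cisco.com/en/US/partners/index.html")
print(cisco["scheme"])     # http
print(cisco["authority"])  # www.cisco.com
print(cisco["path"])       # /en/US/partners/index.html

# A reference with no scheme still has a path and, here, a fragment.
glossary = describe("guide/glossary#octothorpe")
print(glossary["scheme"] == "")  # True
print(glossary["fragment"])      # octothorpe
```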

A URI path is like a typical file pathname. URIs adopted forward slashes (a/b/c) from the UNIX® tradition, because when URIs were designed in the late 1980s, UNIX culture was more prevalent on the Internet than PC culture. At that time, there were several popular notations for accessing remote files. One of them was Ange-ftp, an extension to emacs for editing remote files. It combined host names and user names with pathnames to get something like /jbrown@freddie.ucla.edu:~mblack/ . The URI syntax that was developed for the Web used the double-slash notation for cross-machine naming (following the Apollo Domain UNIX dialect), but it also introduced the scheme syntax so that naming conventions from any number of different protocols could be unified. Some examples:

The second example in the introduction, www.yahoo.com/sports, is not really a URI. It's a convenient shorthand for http://www.yahoo.com/sports, a format that is supported by popular Web browser user interfaces. Don't make the mistake of leaving out the scheme in XSLT like this, though:

<xsl:include href="exslt.org/math/min/math.min.template.xsl" />

because it won't work as you expect, unless you really intend to refer to a file in a directory called exslt.org next to your XSLT stylesheet. The href attribute in XSLT takes a URI reference, which may be absolute or relative. A URI reference that starts with a scheme and a colon is absolute; otherwise, it's relative. A relative URI reference is much like a file path. The example ../noarch/config.xsd is also a relative URI reference.
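To see why the scheme matters, the following sketch tests whether a reference is absolute and then resolves several references against a stylesheet location. It assumes Python's urllib.parse as the resolver; the stylesheet address http://example.org/styles/main.xsl and the is_absolute helper are invented for the illustration.

```python
import re
from urllib.parse import urljoin

# A scheme is a letter followed by letters, digits, "+", "-", or ".",
# and it ends at the first colon (RFC3986, section 3.1).
SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")

# Hypothetical location of the stylesheet that contains the href values.
STYLESHEET = "http://example.org/styles/main.xsl"


def is_absolute(reference):
    """Return True if the reference starts with a scheme and a colon."""
    return bool(SCHEME_RE.match(reference))


for ref in (
    "http://exslt.org/math/min/math.min.template.xsl",
    "exslt.org/math/min/math.min.template.xsl",
    "../noarch/config.xsd",
):
    print(is_absolute(ref), urljoin(STYLESHEET, ref))
```

The scheme-less reference is treated as a path, so it resolves to http://example.org/styles/exslt.org/math/min/math.min.template.xsl, a location next to the stylesheet rather than on exslt.org.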

Internationalized Resource Identifiers

In fact, it's a slight oversimplification to say that the href attribute in XSLT takes a URI reference. URIs and URI references are taken from a limited set of ASCII characters, and XML is more internationalized than that. The very next Request for Comments, RFC3987, is Internationalized Resource Identifiers (IRIs) (see Resources). This specification is not as far along in the IETF standards process as its predecessor, but the technology itself is quite mature and widely deployed. IRIs are just like URIs except that they can use the whole range of Unicode characters, not just ASCII. For each IRI, there is a corresponding encoding as a URI, in case an IRI needs to be used in a protocol (such as HTTP) that accepts only URIs.
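The IRI-to-URI mapping can be sketched roughly as follows. This is a simplified illustration, not a complete implementation of RFC3987: it percent-encodes the UTF-8 bytes of non-ASCII characters in the path, query, and fragment, and it assumes the host name is already ASCII (internationalized host names need separate handling). The iri_to_uri name and the example.org address are assumptions made for the example.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters that may already appear literally in these components;
# "%" is kept so existing percent-escapes are not encoded twice.
SAFE = "/%:@!$&'()*+,;=?"


def iri_to_uri(iri):
    """Percent-encode non-ASCII characters as UTF-8 bytes (simplified).

    Hypothetical helper: the host is assumed to be ASCII already.
    """
    parts = urlsplit(iri)
    return urlunsplit((
        parts.scheme,
        parts.netloc,
        quote(parts.path, safe=SAFE),
        quote(parts.query, safe=SAFE),
        quote(parts.fragment, safe=SAFE),
    ))


print(iri_to_uri("http://example.org/résumé"))
# http://example.org/r%C3%A9sum%C3%A9
```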

Overriding the base URI with xml:base

Typically, a URI reference is relative to whatever document you find it in. If you're looking in a document with base URI http://exslt.org/math/min/math.min.template.xsl and you see a URI reference ../../random/random.xml, that reference would expand to http://exslt.org/random/random.xml. In HTML, you can put a base element at the top of the document to override the base URI. The XML Base specification (see Resources) provides the equivalent in XML.
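The expansion described above can be reproduced with a general-purpose resolver. The sketch below uses Python's urllib.parse.urljoin, which is one implementation of relative reference resolution; the variable names are chosen for the illustration.

```python
from urllib.parse import urljoin

# Base URI: the document in which the reference was found.
base = "http://exslt.org/math/min/math.min.template.xsl"

# Each ".." segment removes one directory level from the base path.
expanded = urljoin(base, "../../random/random.xml")
print(expanded)  # http://exslt.org/random/random.xml
```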

Consider a document that you can access either as file:/my/doc or as http://my.domain/doc. Typically, when you access the document via the file system, you want references like #part2 to expand to file:/my/doc#part2; when you access it via HTTP, you want #part2 to expand to http://my.domain/doc#part2. But in a Resource Description Framework (RDF) schema, the expanded form needs to stay the same for some things to work. XML Base makes this easy. For example:

<rdf:RDF
  xmlns="&owl;"
  xmlns:owl="&owl;"
  xml:base="http://www.w3.org/2002/07/owl"
  xmlns:rdf="&rdf;"
  xmlns:rdfs="&rdfs;"
>

...
    <Class rdf:about="#Nothing"/>

In this case, the #Nothing reference expands to http://www.w3.org/2002/07/owl#Nothing no matter where you find that document.
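The effect of xml:base can be sketched with Python's standard library. The following is a simplified illustration: it reads xml:base from the root element only (the XML Base specification also allows nested and relative xml:base values), it writes the namespace names out literally instead of using the entity references from the excerpt above, and the expand_about helper is a made-up name.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XML_NS = "http://www.w3.org/XML/1998/namespace"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Cut-down version of the excerpt, with entities expanded by hand.
DOC = """<rdf:RDF
  xmlns="http://www.w3.org/2002/07/owl#"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xml:base="http://www.w3.org/2002/07/owl">
  <Class rdf:about="#Nothing"/>
</rdf:RDF>"""


def expand_about(xml_text, retrieval_uri):
    """Expand every rdf:about value against the document's base URI.

    The root xml:base wins if present; otherwise the URI used to
    retrieve the document is the base. Hypothetical helper.
    """
    root = ET.fromstring(xml_text)
    base = root.get(f"{{{XML_NS}}}base", retrieval_uri)
    about_key = f"{{{RDF_NS}}}about"
    return [
        urljoin(base, element.get(about_key))
        for element in root.iter()
        if element.get(about_key) is not None
    ]


# The result is the same whichever way the document was retrieved.
print(expand_about(DOC, "file:/my/doc"))
print(expand_about(DOC, "http://my.domain/doc"))
```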

Okay, so much for URIs, IRIs, and URI references -- what about URLs and URNs?

URNs and URLs

URIs are designed to serve as both names and locators. When they were brought to the IETF for standardization, they became known as Uniform Resource Locators (URLs), and a separate effort on Uniform Resource Names (URNs) was started.

For Internet hosts, names and locations have separate standards. Host names have the same syntax as domain names (for example, zork1.example.edu). These host names are connected to addresses like 192.168.30.21 by the Domain Name System (DNS) protocol. This indirection allows the names to remain stable when hosts are moved around in the network and renumbered.

The occasional broken link in the Web made Web addresses look and feel more like locations than names, and different perspectives emerged in the IETF community:

RFC1737 was followed in 1997 by Proposed Standard RFC2141, URN Syntax, which specified another scheme, urn: , to join http:, ftp:, and the rest.
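A URN under the urn: scheme has the form urn:NID:NSS, a namespace identifier followed by a namespace-specific string. The sketch below splits a URN into those two parts. It is an illustration only: it does not validate the characters that RFC2141 allows, the parse_urn name is invented, and urn:isbn:0451450523 is simply a sample value.

```python
def parse_urn(urn):
    """Split a URN into (namespace identifier, namespace-specific string).

    Simplified sketch: checks only the overall urn:NID:NSS shape.
    """
    scheme, _, rest = urn.partition(":")
    nid, separator, nss = rest.partition(":")
    if scheme.lower() != "urn" or not separator or not nid or not nss:
        raise ValueError(f"not a urn:NID:NSS identifier: {urn!r}")
    # The "urn" prefix and the NID are case-insensitive; the NSS is not.
    return nid.lower(), nss


print(parse_urn("urn:isbn:0451450523"))  # ('isbn', '0451450523')
```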

The eventual URI Standard (RFC3986) clarifies the distinction in section 1.1.3, "URI, URL, and URN":

A URI can be further classified as a locator, a name, or both. The term "Uniform Resource Locator" (URL) refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network "location"). The term "Uniform Resource Name" (URN) has been used historically to refer to both URIs under the "urn" scheme [RFC2141], which are required to remain globally unique and persistent even when the resource ceases to exist or becomes unavailable, and to any other URI with the properties of a name.

An individual scheme does not have to be classified as being just one of "name" or "locator". Instances of URIs from any given scheme may have the characteristics of names or locators or both, often depending on the persistence and care in the assignment of identifiers by the naming authority, rather than on any quality of the scheme. Future specifications and related documentation should use the general term "URI" rather than the more restrictive terms "URL" and "URN" [RFC3305].

Practical persistence

There is a natural tension between persistence and availability. If I have a file on a host that's connected to the Internet, the simplest way to make it available to you is to run a Web server on that host and hand you a URI that consists of whatever name the host happens to have, along with the filename (for example, http://dhcp324.coolISP.net/drafts/freeLunch.wsdl). That works fine until my Dynamic Host Configuration Protocol (DHCP) lease expires, I change ISPs, or I move the file from /drafts/ to /keepers/. And what if the service becomes popular and I decide to charge for it? The more inessential information in a name, the less likely it is to persist across changes.

But a nice persistent name like http://xyzpdq.org/2005/ls434 is not as simple to manage: I have to register a domain, maintain the mapping from the domain name to the host address, and either remember that ls434 is the file where I keep my lunch service description or set up a file mapping table on my Web server.

The PURL project and the DOI system (see Resources) represent different approaches to the persistence problem. A Persistent URL (PURL) is an ordinary HTTP URI in a domain backed by a strong persistence policy. For example, purl.org is run by the Online Computer Library Center (OCLC), a worldwide library cooperative. Anyone can apply for an account and administer his or her own set of PURLs. You publish your content on an ordinary Web server, and then connect it to your PURL with HTTP redirection. The indirection from PURLs to less-persistent HTTP URIs is much like the indirection provided by DNS, except that the source and the destination of the redirection are in the same category. When you have set up a PURL, such as http://purl.org/net/dajobe/ , you can use it like any other HTTP URI. More importantly, the people you want to communicate with can use it just like any other HTTP URI; no plug-ins or add-ons are needed.
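The redirection idea can be sketched as a small lookup table that maps a persistent path to wherever the content currently lives. This is not how purl.org is implemented; the table contents, the /net/example/lunch path, the choice of status code 302, and the resolve helper are all assumptions made for the illustration.

```python
# Hypothetical mapping from persistent paths to current locations.
REDIRECTS = {
    "/net/example/lunch": "http://dhcp324.coolISP.net/drafts/freeLunch.wsdl",
}


def resolve(path):
    """Return an (HTTP status, Location) pair for a persistent path."""
    target = REDIRECTS.get(path)
    if target is None:
        return 404, None
    return 302, target


print(resolve("/net/example/lunch"))

# When the file moves, only the table changes; the persistent name
# that was handed out stays the same.
REDIRECTS["/net/example/lunch"] = "http://dhcp324.coolISP.net/keepers/freeLunch.wsdl"
print(resolve("/net/example/lunch"))
```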

The Digital Object Identifier (DOI) system uses its own scheme -- for example, doi:10.123/456. Web browsers can be adapted to support this scheme with a plug-in. The DOI foundation provides policies, registration services, and HTTP redirection services similar to PURL providers like OCLC. While the DOI foundation supports an alias for each DOI of the form http://dx.doi.org/10.123/456, the DOI Handbook (see Resources) states that this has significant disadvantages when compared with the resolver plug-in. Managing two different names for each object seems like a more significant disadvantage to me.

Creative tensions in information management

Despite this tension between persistence and availability, a good URI has both; it works as both a persistent name and an available location. So a URL is really just a URI with practical utility.

Proponents of the urn: scheme argue that this tension is irreconcilable within the framework of HTTP and DNS. I acknowledge that there are areas of concern, but every Webmaster faces the same issues, and the community is learning information management principles to address them. The fundamental issue is that the world changes continuously, and keeping things in sync takes effort.

Most of the time, the hierarchical nature of DNS naming is convenient, but it concentrates a lot of power in one place and raises challenging governance issues. Peer-to-peer designs such as distributed hash tables may eliminate some of the centralization issues with DNS, but who knows what governance issues they will bring with them? Various leading-edge developments show how new protocols can be used to service existing http://... names, adding value to the existing hypermedia network. This seems more likely to succeed than the deployment of new schemes for anything remotely similar to HTTP's GET/PUT/POST/DELETE operations. I expect that present-day best practices in information management and future protocol enhancements will make carefully chosen URIs built on HTTP and DNS last quite a long time.

References

About the Author


Dan Connolly is the W3C URI Activity Lead, a member of the W3C Technical Architecture Group, and chair of the RDF Data Access Working Group. He joined the W3C staff in 1995. He edited the HTML 2.0 specification in 1995, chaired the Working Group that produced HTML 3.2 and 4.0 and CSS 1.0, led the XML Activity through the release of XML 1.0 in 1998 and XML Schema in 2001, and participated in the development of the Web Ontology Language, which became a W3C Recommendation in 2004. Dan is also a research scientist in the Decentralized Information Group at the MIT Computer Science and Artificial Intelligence Laboratory.