In moving to XML syntax, nine issues were identified in HTML 4.0 that would no longer be acceptable in XML. Since it is best to produce only valid code in your pages, we'll highlight the requirements that aren't immediately taken care of by the processes of validation and the use of external style sheet and script files.
- Element and attribute names must be in lowercase. Of all the decisions that came with XHTML 1.0, this one generated the most groans and teeth-gnashing in the developer community. The problem faced was that XML is a case-sensitive framework. <XML> is not the same as <xml>, <XmL>, <Xml>, or any other combination. Each of those elements would be considered different and unique. To avoid confusion between elements in XHTML, a decision on case became mandatory. Quite a few authors lobbied that using uppercase letters in elements made the HTML tags stand out in source code, therefore making them easier to locate and edit or troubleshoot. Others argued just as vehemently that uppercase letters were more difficult to type quickly when writing HTML by hand.
Narrowly avoiding a coin-toss solution, the W3C chose to go with lowercase for all element and attribute names to match the HTML Document Object Model (DOM). The DOM begins all names in lowercase, switching to a method known as camel case when combining words. For instance, DOM attributes pertaining to input fields include tabIndex, where the second word is marked with an initial capital letter. This approach results in an up-and-down style of case management, hence the label camel case.
- For nonempty elements, end tags are required. First, we need to define a nonempty element. Nonempty means that something, either text data or other elements, is contained between the opening and closing tags of a given element. For instance, a table is a nonempty element because it contains row and data elements. An empty element, by contrast, is one that never had an end tag in HTML, such as an image (<img>), a line break (<br>), or a horizontal rule (<hr>).
Quite a few of us learned to write HTML without ever closing some
nonempty elements, such as paragraphs with the <p> tag. The
paragraph element and many others were explicitly defined with optional
closing tags in the HTML Recommendation. SGML, the parent of HTML,
allowed end tags to be optional. XML, however, requires every document
to be well formed and does not permit authors to omit end tags for any
nonempty element. From here on the closing </p> is required,
as are the closing tags of all other container elements.
- Attribute values must always be quoted. This rule is self-explanatory
and actually eases the job of document authors by removing the question
of whether or not an attribute value had to be quoted. In HTML, single-word
or numerical attribute values didn't require quotes. However, values
containing spaces or characters such as slashes, for example a meta
element's keyword list or an href's URI,
did need them. Now authors simply quote every attribute="value" to
get in line with this new requirement.
- Attribute minimization. An attribute is minimized when only the attribute name is written, omitting a value. Some HTML references call these Boolean attributes since they have an off/on behavior (being on when the attribute is present and off when it is not). For example, the checked attribute on an input element is minimized in HTML 4.0:
<input type="checkbox" name="mybox" value="myvalue" checked>
XML requires that every attribute have a value, so this minimized treatment
is no longer allowed. To correct for this, any attribute that was minimized
in HTML 4.0 is written in XHTML 1.0 with a value that mimics the attribute
name. Taking our input example, an author would now write:
<input type="checkbox" name="mybox" value="myvalue" checked="checked" />
- Empty elements. Previously, we discussed the fact that end tags can no longer be omitted from nonempty elements. Empty elements now require an end tag as well. Where we used to write <br>, the full syntax becomes <br></br>. Recognizing the redundancy in an otherwise empty element, XML provides a shorthand syntax that combines the opening and closing tags into a single structure. The element opens as normal, with the left-angle bracket and the element name (for instance <br), but then includes a forward slash before the final right-angle bracket (<br/>).
- ID and name attributes. Anyone who has written HTML
will be familiar with the name attribute, as it has been used on
the <img> and <input> tags. The name attribute provides
an identifier that is accessible by scripting languages and other
programmatic processes to manipulate the value or content of those
elements. The id attribute behaves in the same way but goes a step
further in requiring the value of the attribute to be unique within
the document. This uniqueness is required for manipulation in XML
with the Document Object Model (DOM). Where
the name attribute is used, an id attribute must
also appear in order to maintain the compliance required
of XHTML as an XML vocabulary.
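The syntax rules in the list above can be tried out with any XML parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample fragments are illustrative and not part of XHTML itself. Note that a plain XML parser only checks well-formedness: it shows that XML is case sensitive, but it will not enforce the XHTML-specific rule that names be lowercase, which is a matter for validation against the DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the fragment, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Each entry pairs an HTML 4.0 habit with its XHTML 1.0 form.
# The fragments are hypothetical examples chosen for illustration.
examples = {
    # XML is case sensitive, so <P> and </p> do not match.
    "mismatched case": ("<P>text</p>", "<p>text</p>"),
    # End tags may not be omitted on nonempty elements.
    "omitted end tag": ("<div><p>one<p>two</div>",
                        "<div><p>one</p><p>two</p></div>"),
    # Every attribute value must be quoted.
    "unquoted value": ("<td colspan=2>x</td>", '<td colspan="2">x</td>'),
    # Minimized attributes must be given a value.
    "minimized attribute": ('<input type="checkbox" checked />',
                            '<input type="checkbox" checked="checked" />'),
    # Empty elements must be closed, here with the shorthand syntax.
    "unclosed empty element": ("<p>a<br>b</p>", "<p>a<br />b</p>"),
}

for rule, (html_habit, xhtml_form) in examples.items():
    print(f"{rule}: HTML habit well formed? {is_well_formed(html_habit)}; "
          f"XHTML form well formed? {is_well_formed(xhtml_form)}")
```

Every HTML 4.0 habit in the table is rejected by the parser, while each XHTML 1.0 form is accepted. The parser does not check that id values are unique, since that is also a validity constraint rather than a well-formedness one.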
Backward compatibility with HTML-based browsers
On its own, XHTML is more than 95 percent backward compatible with browsers designed to handle HTML 4.0. Only a few steps need to be taken to ensure the seamless presentation of XHTML documents in these programs.
- Empty elements need an extra space. Browsers have been programmed to ignore tags that they don't understand. Furthermore, an opening or empty HTML tag has never included the slash character (/). Therefore, when a browser encounters <hr/>, it doesn't read this as an empty element using the XML shorthand for <hr></hr>. Instead, the browser treats hr/ as the name of an unknown element and won't display the horizontal rule. An easy workaround is to insert a space between the end of the element name and the slash, for example <hr />. The browser will correctly read hr as the element name, ignore the slash as something it doesn't understand, and render the rule as desired.
- Do not minimize nonempty "container" elements. A
paragraph without any content is acceptable but it does nothing.
However, it is not an empty element (remember that empty elements
are those that never had an end tag in HTML). An empty
paragraph, then, would be written as <p></p> and not <p />.
- Ampersands in attribute values. One of the most common gotchas of validating Web pages with links is the use of ampersands in URIs found in href attributes. An ampersand is a perfectly legal character in a URI, but in markup it is also interpreted as the beginning of a character entity (such as the ampersand found in &nbsp; for a nonbreaking space). To keep the ampersands in a URI from being interpreted this way, they must be escaped, that is, written in their own character entity form: &amp;. When each ampersand in a URI is replaced with &amp;, the parser no longer mistakes it for the start of an entity, and the browser still requests the correct URI when the user follows the link.
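The ampersand rule can be illustrated in the same way. This is a minimal sketch, assuming Python's standard library (xml.sax.saxutils.quoteattr to escape and quote the value, xml.etree.ElementTree to parse it); the URL and the helper name parses are made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Hypothetical URI with two query parameters joined by an ampersand.
url = "http://example.com/search?q=xhtml&lang=en"


def parses(markup: str) -> bool:
    """Return True if an XML parser accepts the fragment."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Raw ampersand: the parser reads "&lang" as the start of an entity.
raw_link = f'<a href="{url}">results</a>'

# quoteattr() escapes the ampersand as &amp; and adds the surrounding quotes.
escaped_link = f"<a href={quoteattr(url)}>results</a>"

print(escaped_link)
print("raw link parses:", parses(raw_link))
print("escaped link parses:", parses(escaped_link))

# After parsing, the attribute value is the original URI again.
print("round trip ok:", ET.fromstring(escaped_link).get("href") == url)
```

The escaped form exists only in the markup. Once the document is parsed, the href value contains a plain ampersand again, which is why the link still points at the intended URI.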