XHTML Content Negotiation and Conversion PHP Script

Status: Version 1.2; Publicly Released

Ever get pissed because you want to use XHTML to build a site, but Internet Explorer doesn’t support XHTML’s MIME type? You could send it with HTML’s MIME type, but then you don’t get any of the XHTML features, and many pages will break, e.g. anything involving empty elements. Well, now you can get the best of both worlds! I’ve written a PHP script that you can add to any XHTML page. It checks the browser’s Accept header and automatically chooses the correct MIME type to send the page with. Even better, if it sends the page as HTML, it automatically converts the whole document to HTML! This isn’t one of those crappy remove-slashes-and-hope-for-the-best hacks that you’ve seen floating around the Net. It uses 14 preg_replaces and one str_replace to solve the most common problems in converting XHTML to HTML. These are:

  • xml PI.
  • xml-stylesheet PI (alternate="yes").
  • xml-stylesheet PI (alternate="no" or not specified).
  • Automatically replaces DOCTYPE.
  • xmlns attribute on <html> tag.
  • CDATA inside style or script.
  • CDATA outside of style or script.
  • Empty elements using trailing slash, e.g. <br/>.
  • Non-empty elements using trailing slash, e.g. <p/>.
  • Empty elements not using trailing slash, e.g. <br> </br>.
  • Non-minimized boolean attributes, e.g. checked="checked".
  • The xml:lang attribute.
  • The &apos; entity reference.
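To give a feel for what a few of these conversions involve, here is a minimal sketch. It is not the script's actual code: the function name sketch_xhtml_to_html, the element and attribute lists, and the regular expressions are all assumptions made for illustration, and it covers only four of the cases above.

```php
<?php
// Hypothetical helper, NOT the code from xhtml_mime.php.
// It illustrates four of the listed conversions on simple markup.
function sketch_xhtml_to_html(string $xhtml): string
{
    // Empty elements using a trailing slash: <br/> or <br /> becomes <br>.
    $html = preg_replace(
        '~<(br|hr|img|input|meta|link)\b([^<>]*?)\s*/>~i',
        '<$1$2>',
        $xhtml
    );

    // Non-empty elements using a trailing slash: <p/> becomes <p></p>.
    $html = preg_replace(
        '~<(p|div|span|td|textarea)\b([^<>]*?)\s*/>~i',
        '<$1$2></$1>',
        $html
    );

    // Non-minimized boolean attributes: checked="checked" becomes checked.
    $html = preg_replace(
        '~\b(checked|selected|disabled|readonly|multiple)="\1"~i',
        '$1',
        $html
    );

    // &apos; is not defined in HTML 4, so use the numeric reference instead.
    return str_replace('&apos;', '&#39;', $html);
}

echo sketch_xhtml_to_html('<p/><input type="checkbox" checked="checked" /><br/>It&apos;s');
// prints: <p></p><input type="checkbox" checked><br>It&#39;s
```

A real converter also has to deal with the PIs, the DOCTYPE, and CDATA sections, which is why the script needs many more replacements than this.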

Why is this necessary?
XHTML has been the W3C’s preferred general-purpose markup language (succeeding HTML) since 2000. It has various features that make authoring in XHTML preferable to authoring in HTML for many applications. However, some browsers (most notably Microsoft Internet Explorer) do not support XHTML. As a result, many authors write XHTML code but send it using the HTML MIME type. This is incorrect (see “Sending XHTML as text/html Considered Harmful” for details on why this is a bad idea). A much better solution is to author your content in XHTML, but send it as either XHTML or HTML (using the appropriate MIME type) depending on whether the browser supports XHTML. Some authors always send the same XHTML source code and only vary the MIME type according to the browser’s support for XHTML. This is also a bad idea, because it requires the XHTML source code to be valid HTML as well, which is usually not the case. For example, if your XHTML source code closes an empty element with the <br/> shorthand and you send it with the HTML MIME type, a strictly compliant HTML parser would treat the slash as the end of the tag and render the following > as text, so you would see >s all over the place. The server therefore has to convert the XHTML source code to HTML whenever it sends a page with the HTML MIME type. Hence this script.
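The negotiation step can be sketched roughly as follows. This is a simplified illustration, not the script's actual logic: the function name sketch_choose_mime is made up, and the sketch only checks whether the Accept header names application/xhtml+xml at all, ignoring q-values that a fuller implementation would take into account.

```php
<?php
// Hypothetical helper; the name and the simplified matching are assumptions.
function sketch_choose_mime(string $accept): string
{
    // Send XHTML only when the browser lists its MIME type explicitly.
    // A browser that sends just "*/*" gets plain HTML.
    return stripos($accept, 'application/xhtml+xml') !== false
        ? 'application/xhtml+xml'
        : 'text/html';
}

$mime = sketch_choose_mime($_SERVER['HTTP_ACCEPT'] ?? '');

if (!headers_sent()) {
    header('Content-Type: ' . $mime);
    // Tell caches that the response depends on the Accept header.
    header('Vary: Accept');
}
```

Whenever the result is text/html, the page body then needs the XHTML-to-HTML conversion described above before it goes out.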

Wishlist:

  • Auto-detect encoding based on XML declaration; add to HTTP header (probably not for a while)
  • Support for XML Namespaces; it currently only recognizes XHTML elements in the default namespace (not happening anytime soon)
  • Add <meta/> specifying encoding when XML declaration specifies it (probably wouldn’t be too hard)
  • Add stylesheet <link/>s at end of <head> element rather than beginning (also probably wouldn’t be too hard) (Implemented in 1.2)
  • Preserve order of xml-stylesheet PI’s with mixed alternate values (moderate effort involved, and not very necessary IMHO)

Usage:
Just use a PHP include to pull xhtml_mime.php in at the very beginning of your existing .php document. Do not remove the XML declaration, DOCTYPE, <html> tag, or any xml-stylesheet Processing Instructions. If your page was not originally a .php file, remember that the XML declaration and any xml-stylesheet Processing Instructions will have to be echoed using PHP code. Otherwise, on servers where PHP’s short_open_tag setting is enabled, their leading <? will be taken as the start of a PHP block. Note that you can also force XHTML or HTML output (ignoring the content negotiation results) by adding “?xml=yes” or “?xml=no” to the URL of your page. So if your page is http://www.example.com/index.php, you can force XHTML output by using http://www.example.com/index.php?xml=yes. That’s all!
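As a rough illustration, the top of a page might look something like this. The file location, the encoding, and the style.css stylesheet are assumptions for the example; only the xhtml_mime.php file name comes from the script itself.

```php
<?php
// Assumed location: xhtml_mime.php sits next to this page.
$script = __DIR__ . '/xhtml_mime.php';
if (is_file($script)) {
    include $script;
}

// Echo the XML declaration and the stylesheet PI as strings, so that PHP
// never meets a raw processing instruction in the page template.
$prolog = '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
        . '<?xml-stylesheet type="text/css" href="style.css"?>' . "\n";
echo $prolog;
```

The DOCTYPE, the <html> tag, and the rest of the page follow as ordinary markup after the closing PHP tag.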

Notes:
This software has not undergone extensive testing. The only real testing so far consists of a set of non-exhaustive test cases that I created, plus use on a production WordPress site. No bugs are known, but given the lack of testing, please understand that there may be unknown ones. Feel free to use it on a production site, but only if you are willing to make sure that it’s working properly with your XHTML code. If it screws up, I’m not responsible (but I’d love to know why it failed, if you can send me a test case). Everyone is encouraged to test it as much as possible, so that I can identify and deal with any remaining bugs.

More notes:
In some cases this script sets the content type correctly but fails to convert the markup. This happens when the PHP application you add it to already uses ob_start itself, and that use of ob_start keeps the ob_start in this script from ever taking effect. I observed this problem in PunBB. To fix it there, I simply added a line of code to a PunBB file that sends its output through xml2html before outputting it. The exact fix will vary from application to application.
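The workaround looks roughly like the sketch below. It assumes that xml2html takes the page markup as a string and returns the converted string, which is a guess at its signature; the stand-in definition is there only so that the example runs on its own, and it is not PunBB's actual code.

```php
<?php
// Stand-in so the sketch runs by itself. The real xml2html comes from
// xhtml_mime.php; its string-in, string-out signature is an assumption.
if (!function_exists('xml2html')) {
    function xml2html(string $markup): string
    {
        return str_replace('<br/>', '<br>', $markup);
    }
}

// The host application buffers its own output...
ob_start();
echo '<p>Hello<br/>world</p>';
$page = ob_get_clean();

// ...so convert explicitly just before the application prints the page.
$converted = xml2html($page);
echo $converted;
```

In a real application the conversion should only run when the page is actually being sent as HTML.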

Change Log:

1.2
<link> elements now are added at end of <head>.
1.1
DOCTYPE and PI’s are now processed by xml2html; this means that the original document can have them, as opposed to them being inserted by the script.
First public release.
1.0
Works acceptably in my test cases and an installation of WordPress.
First private release.

Download version 1.2!
Download version 1.1!