HTML & XML Parsers
Any web browser must be able to parse HTML, and XML is becoming
increasingly important. This component would provide a pair of
parsers (or a single generic SGML parser that can handle both
subsets) that output a consistent parse tree. This would be the
foundation for the DOM work (see below).
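To make "consistent parse tree" concrete, here is one possible node
layout both parsers could share. Every name in this sketch is
hypothetical; nothing here is an existing API.

```elisp
;; Hypothetical shared parse-tree node: a plain list of tag name,
;; attribute alist, and child list.  Names are illustrative only.
(defun parser-make-node (tag attributes children)
  "Build a parse-tree node with TAG, an alist of ATTRIBUTES, and CHILDREN."
  (list tag attributes children))

(defun parser-node-tag (node) (nth 0 node))
(defun parser-node-attributes (node) (nth 1 node))
(defun parser-node-children (node) (nth 2 node))

;; Under this scheme, <p align="center">Hello</p> would become:
;; (p ((align . "center")) ("Hello"))
```

Whatever the final representation, the point is that the DOM layer
should not care whether the tree came from the HTML or the XML side.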
How should we deal with old-style HTML, or HTML that does not
conform to the DTD? Should we key off the presence of a valid
DOCTYPE and use a strict parser, falling back to something based
on the current w3-parse.el code for DOCTYPE-less documents? Or
should we always use the same heuristics to guess what the author
really meant?
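The DOCTYPE-keyed option above might dispatch like this. Both parser
entry points are placeholders for code that does not exist yet:

```elisp
;; One possible dispatch: strict parsing only when the buffer starts
;; with a DOCTYPE declaration.  `parser-parse-strict' and
;; `parser-parse-lenient' are hypothetical entry points.
(defun parser-choose-and-parse ()
  "Parse the current buffer, picking strict or lenient mode by DOCTYPE."
  (save-excursion
    (goto-char (point-min))
    (skip-chars-forward " \t\n")
    (if (looking-at "<!DOCTYPE")
        (parser-parse-strict)       ; validating, DTD-driven parse
      (parser-parse-lenient))))     ; heuristics a la w3-parse.el
```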
Do we really need two separate parsers for HTML and XML?
PSGML can parse well-formed HTML or any XML document (which is
by definition well-formed; otherwise the parser may gleefully
choke on it).
Can PSGML be persuaded to do what we want? It seems that
building on the existing API (sgml-top-element, sgml-element-next,
sgml-element-content) would be feasible. On the plus side,
this would allow the DOM to work on arbitrary SGML documents
(LinuxDoc or DocBook, anyone?).
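If PSGML's accessors behave as their names suggest, a recursive walk
over its element structure might look something like the untested
sketch below. It assumes sgml-element-content returns an element's
first child, sgml-element-next the following sibling, and
sgml-element-gi the element's name; those assumptions would need to
be checked against PSGML itself.

```elisp
;; Untested sketch: convert PSGML's element structure into a simple
;; list tree of the form (NAME CHILD...).
(defun dom-from-psgml (element)
  "Recursively convert a PSGML ELEMENT into a list tree."
  (let ((children '())
        (child (sgml-element-content element)))
    (while child
      (push (dom-from-psgml child) children)
      (setq child (sgml-element-next child)))
    (cons (sgml-element-gi element) (nreverse children))))

;; Presumed starting point: (dom-from-psgml (sgml-top-element))
```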
Should the parsers be able to deal with streaming data? It
would be theoretically possible to parse the document as it
comes in off the network. Do we really care?
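If we did care, streaming would probably mean hanging an incremental
parser off a process filter, roughly as below. The `parser-feed'
function is hypothetical; no incremental parser exists yet.

```elisp
;; Rough sketch of streaming parsing: hand each network chunk to an
;; incremental parser as it arrives, instead of waiting for the
;; whole document.  `parser-feed' is a placeholder.
(defun parser-install-filter (process)
  "Parse data from PROCESS incrementally as it comes off the network."
  (set-process-filter
   process
   (lambda (_proc chunk)
     (parser-feed chunk))))   ; consume one chunk of input
```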