2.1.5 The html Element

html :lang=(en) => TEXT

The element contains an HTML document as text (or, in practice, as CDATA). In some cases, the document starts with <html> and ends with </html>; in others the html element is implied. Generally the HTML includes a head element with a CSS stylesheet. The HTML body often begins with <BR>.

The HTML document uses only the following elements:

html

Sometimes the document is enclosed in <html> and </html> tags.

br

The HTML body often begins with <BR>, and <BR> may also appear elsewhere in the body.

b
i
u

Styling: bold, italic, and underline, respectively.

font

The attributes face, color, and size are observed. The value of color takes one of the forms #rrggbb or rgb (r, g, b). The value of size is a number between 1 and 7, inclusive.
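As an illustration of the font attribute forms described above, here is a minimal sketch in Python. The function names parse_font_color and parse_font_size are hypothetical, and the sketch assumes that r, g, and b are decimal integers in the range 0 to 255, which this section does not state explicitly.

```python
import re


def parse_font_color(value):
    """Return (r, g, b) for '#rrggbb' or 'rgb (r, g, b)', else None."""
    value = value.strip()
    # Form 1: '#' followed by six hexadecimal digits.
    m = re.fullmatch(r"#([0-9a-fA-F]{6})", value)
    if m:
        digits = m.group(1)
        return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))
    # Form 2: rgb (r, g, b), tolerating optional white space.
    # Assumption: components are decimal values in 0..255.
    m = re.fullmatch(r"rgb\s*\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\)", value)
    if m:
        rgb = tuple(int(g) for g in m.groups())
        if all(0 <= c <= 255 for c in rgb):
            return rgb
    return None


def parse_font_size(value):
    """Return the size as an int if it lies in 1..7, else None."""
    try:
        size = int(value.strip())
    except ValueError:
        return None
    return size if 1 <= size <= 7 else None
```

For example, parse_font_color("#ff0000") and parse_font_color("rgb (255, 0, 0)") both yield (255, 0, 0), while parse_font_size("8") yields None because 8 is outside the observed range.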

The CSS in the corpus is simple. To understand it, a parser only needs to be able to skip white space, <!--, and -->, and parse style only for p elements. Only the following properties matter:

color

In the form rrggbb, e.g. 000000, with no leading ‘#’.

font-weight

Either bold or normal.

font-style

Either italic or normal.

text-decoration

Either underline or normal.

font-family

A font name, commonly Monospaced or SansSerif.

font-size

Values claim to be in points, e.g. 14pt, but the values are actually in “device-independent pixels” (px), at 96/inch.
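The stylesheet handling described above can be sketched as follows. This is only an illustration under stated assumptions: the function names parse_p_style and font_size_to_points are hypothetical, the sketch assumes a single rule of the form p { name: value; ... }, and it treats the nominal pt value as pixels at 96 per inch, converting to true points at 72 per inch.

```python
import re


def parse_p_style(css):
    """Return a dict of the declarations in the first 'p { ... }' rule."""
    # Skip the HTML comment markers that may wrap the stylesheet.
    css = css.replace("<!--", " ").replace("-->", " ")
    # Find a rule whose selector is exactly 'p' (assumption: one such rule).
    m = re.search(r"(?:^|[\s}])p\s*\{([^}]*)\}", css)
    if m is None:
        return {}
    props = {}
    for decl in m.group(1).split(";"):
        name, sep, value = decl.partition(":")
        if sep:  # ignore empty or malformed declarations
            props[name.strip().lower()] = value.strip()
    return props


def font_size_to_points(value):
    """Convert a nominal 'Npt' value, really px at 96/inch, to true points."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)pt", value.strip())
    if m is None:
        return None
    # 96 px per inch and 72 points per inch, so scale by 72/96.
    return float(m.group(1)) * 72 / 96
```

With this sketch, a font-size of 14pt in the stylesheet corresponds to 14 pixels, that is, 10.5 true points. The other properties are returned as strings, so a color of 000000 stays in its rrggbb form with no leading '#'.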

This element has the following attribute.

Attribute: lang

This always contains en in the corpus.