Lightweight HTML Scanner: Known Issues

  • Character escapes are not handled exactly like in MS Internet Explorer/Netscape Navigator
  • The two major browsers themselves differ in how they handle character escapes. For instance, MSIE limits numerical character escapes to 5 digits, whereas Netscape does not. Netscape leaves certain undefined numerical character escapes unreplaced, and substitutes the Unicode character with that number for others. We chose to follow Netscape in most places.

  • Color numbers are not handled exactly like in MS Internet Explorer/Netscape Navigator
  • There is indeed a problem with interpreting ill-formatted color numbers, especially numbers longer than 6 hex digits or shorter than 4 hex digits, and numbers containing non-hex digits. MSIE and Netscape handle them differently, and we could not work out which rules either of them follows. For long numbers in particular, which digits are omitted depends in a strange way on the total number of hex digits. We chose to replace non-hex digits with '0' digits, and then truncate long numbers to the first 6 hex digits.
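
    The rule chosen above can be sketched roughly as follows. This is only an illustration in Python, not the scanner's actual code: the function name normalize_color_number is hypothetical, stripping a leading '#' is an assumption, and short numbers are returned unpadded because the text above does not say how they are treated.

```python
import string

HEX_DIGITS = set(string.hexdigits)  # 0-9, a-f, A-F


def normalize_color_number(value: str) -> str:
    """Hypothetical helper: apply the two steps described in the text."""
    # Assumption: a leading '#' is not part of the number itself.
    digits = value[1:] if value.startswith("#") else value
    # Step 1: replace every non-hex digit with '0'.
    cleaned = "".join(c if c in HEX_DIGITS else "0" for c in digits)
    # Step 2: truncate long numbers to the first 6 hex digits.
    return cleaned[:6]
```

    With this sketch, "#ff00zz" becomes "ff0000" and "#1234567890" becomes "123456".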

  • Lightweight HTML Scanner does not allow incremental scanning
  • Indeed, Lightweight HTML Scanner has no mode for scanning partial HTML documents as they come in. This behaviour is intentional: several important HTML tags are handled differently depending on whether their closing counterpart is encountered.

    One example is the comment tag opening (<!--), which is not treated as a comment tag if it is never closed. So as long as the comment tag closing is missing from the partial HTML document, the rest of the document is parsed as tags and text content. But once the comment tag closing has been read into the partial HTML document, those tags and text have to be unparsed and become part of the comment tag.

    If you really need to parse an HTML document while it is being read in, you can construct HTMLScanner objects for the intermediate partial documents, and scan these as they are.
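
    The comment example and the workaround can be illustrated with a small Python sketch. It does not use the real HTMLScanner interface: scan is a hypothetical toy scanner that only tells comments from everything else, and scan_incrementally is a hypothetical wrapper that re-scans everything received so far after each chunk, in the same spirit as constructing a new scanner per partial document.

```python
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, str]  # (kind, text), kind is "text" or "comment"


def scan(document: str) -> List[Token]:
    """Toy scanner: '<!--' starts a comment only if '-->' follows later."""
    tokens: List[Token] = []
    pos = 0
    while pos < len(document):
        start = document.find("<!--", pos)
        end = document.find("-->", start + 4) if start != -1 else -1
        if start == -1 or end == -1:
            # No closed comment ahead: the rest is ordinary content,
            # including any unclosed '<!--'.
            tokens.append(("text", document[pos:]))
            break
        if start > pos:
            tokens.append(("text", document[pos:start]))
        tokens.append(("comment", document[start + 4:end]))
        pos = end + 3
    return tokens


def scan_incrementally(chunks: Iterable[str]) -> Iterator[List[Token]]:
    """Re-scan the whole partial document after each incoming chunk."""
    received = ""
    for chunk in chunks:
        received += chunk
        # Every result is a fresh scan; earlier results are discarded.
        yield scan(received)


results = list(scan_incrementally(["<p>a</p><!-- note", " <b>x</b> -->tail"]))
```

    After the first chunk the unclosed comment opening is plain content; after the second chunk the same characters, together with the <b> tag, end up inside one comment. This is why results from an earlier partial scan cannot simply be extended.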