Lightweight HTML Scanner 2.00

Lightweight HTML Scanner Overview
The Lightweight HTML Scanner performs fast and flexible scanning of HTML documents.

Packages
be.arci.html    Core package for the Lightweight HTML Scanner.
be.arci.pub     In the be.arci.pub package we supply add-on classes and examples for our Java libraries, together with their source code.

 


The Lightweight HTML Scanner performs fast and flexible scanning of HTML documents. It mimics the parsing behaviour common to Netscape Navigator and Microsoft Internet Explorer, but scans only for tags that the application program is interested in.

Overview Contents

  1. Introduction
  2. Examples
  3. Packaging Lightweight HTML Scanner

Related Documents

  1. Introduction
    The Lightweight HTML Scanner is based on two classes: be.arci.html.HTMLScanner, a fast HTML scanner and parser, and be.arci.html.HTMLTag, whose objects represent HTML syntax elements and are retrieved through the HTMLScanner. An HTMLScanner can be constructed from a String containing an HTML document, or from a File, a URL, or in general any InputStream that gives access to such a document.

    The HTMLScanner scans the document for HTML tags when its method getTags() is invoked. As a parameter to this method, the application programmer supplies an array of the tag names of interest, or possibly the complete set of possible HTML tag names. A single HTMLScanner object can scan the same HTML document repeatedly for different sets of tag names, in successive calls to getTags().

    HTMLScanner.getTags() returns an array of HTMLTag objects. You can regard these HTMLTags as substrings of the HTML document, with an index (ID) into the array of tag names to identify the type of tag (e.g. <IMG> tag or <BODY> tag). Some of these HTMLTag objects will represent HTML text content; they have an ID of 0 (zero). Closing tags (e.g. </BODY> tag) are given the negative of the ID of the opening tag.
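    The ID convention described above can be illustrated with a small, self-contained sketch. It does not use or reproduce the be.arci.html classes: TagIdSketch, Piece and scan() are hypothetical names, and the scanning logic is deliberately naive (no special handling of comments or scripts). Using the 1-based position in the tag name array as the ID, so that 0 stays free for text content, is an assumption made for this sketch only.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in only; this is NOT the be.arci.html API. */
public class TagIdSketch {

    /** One scanned piece: id 0 = text, +n = opening tag, -n = closing tag. */
    public static final class Piece {
        public final int id;
        public final String source;   // the substring of the document
        Piece(int id, String source) { this.id = id; this.source = source; }
    }

    /**
     * Scans html for the given tag names only.
     * Assumption: a tag's ID is its 1-based position in names.
     */
    public static List<Piece> scan(String html, String[] names) {
        List<Piece> out = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            int lt = html.indexOf('<', pos);
            int gt = lt < 0 ? -1 : html.indexOf('>', lt);
            if (gt < 0) {                       // no complete tag left: rest is text
                out.add(new Piece(0, html.substring(pos)));
                break;
            }
            if (lt > pos) {                     // text before the tag
                out.add(new Piece(0, html.substring(pos, lt)));
            }
            String inner = html.substring(lt + 1, gt).trim();
            boolean closing = inner.startsWith("/");
            if (closing) inner = inner.substring(1).trim();
            String name = inner.split("\\s+", 2)[0];   // name ends at first whitespace
            for (int i = 0; i < names.length; i++) {
                if (names[i].equalsIgnoreCase(name)) {
                    int id = i + 1;
                    out.add(new Piece(closing ? -id : id, html.substring(lt, gt + 1)));
                    break;
                }
            }
            pos = gt + 1;                       // tags not asked for are skipped
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<BODY>Hi <IMG SRC=a.gif></BODY>";
        // The same document can be scanned again with a different set of names.
        for (Piece p : scan(html, new String[] {"BODY", "IMG"})) {
            System.out.println(p.id + "\t" + p.source);
        }
    }
}
```

    In this sketch a consumer would branch on the sign of the ID: zero for text, positive for an opening tag, negative for the matching closing tag.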

    The HTMLTag class has two methods of interest:

    accumulateContent()
        Operates on HTMLTags that represent HTML text content. This method appends the HTMLTag's content to a StringBuffer, following the HTML rules for replacing character escapes and collapsing consecutive whitespace.
    getAttribute()
        Operates on 'real' HTML tags. This method returns the value of a named attribute of the HTML tag.
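
    To make the two behaviours concrete, here is a minimal sketch written as free-standing static helpers. It is not the HTMLTag implementation, and the signatures are hypothetical: the real methods belong to HTMLTag objects, and their exact parameters are not given in this overview. The sketch decodes only a handful of character escapes, drops whitespace at the start of an empty buffer, and uses a simple regular expression for attributes. All of these are simplifying assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative stand-in only; this is NOT be.arci.html.HTMLTag. */
public class TagMethodsSketch {

    /**
     * Appends text to buf, decoding a few character escapes and
     * collapsing each run of whitespace into a single space.
     */
    public static void accumulateContent(String text, StringBuffer buf) {
        String decoded = text.replace("&lt;", "<")
                             .replace("&gt;", ">")
                             .replace("&quot;", "\"")
                             .replace("&nbsp;", "\u00A0")
                             .replace("&amp;", "&");   // last, so "&amp;lt;" yields "&lt;"
        for (int i = 0; i < decoded.length(); i++) {
            char c = decoded.charAt(i);
            if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
                // One space per run, also across successive calls on the same buffer.
                if (buf.length() > 0 && buf.charAt(buf.length() - 1) != ' ') {
                    buf.append(' ');
                }
            } else {
                buf.append(c);
            }
        }
    }

    // name = "double quoted" | 'single quoted' | unquoted
    private static final Pattern ATTR = Pattern.compile(
        "([A-Za-z][A-Za-z0-9-]*)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))");

    /** Returns the value of the named attribute (case-insensitive), or null if absent. */
    public static String getAttribute(String tagSource, String name) {
        Matcher m = ATTR.matcher(tagSource);
        while (m.find()) {
            if (m.group(1).equalsIgnoreCase(name)) {
                if (m.group(2) != null) return m.group(2);
                if (m.group(3) != null) return m.group(3);
                return m.group(4);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        StringBuffer buf = new StringBuffer();
        accumulateContent("Tom &amp;\n   Jerry ", buf);
        accumulateContent("  &lt;b&gt;", buf);
        System.out.println(buf);
        System.out.println(getAttribute("<IMG SRC=\"logo.gif\" ALT=Logo>", "src"));
    }
}
```

    Passing the same buffer to successive calls is what lets text from several consecutive content pieces be joined without doubled spaces at the seams.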

  2. Examples
    1. Retrieving all files referred to from an HTML document
    2. Retrieving the text content of an HTML document
    3. Listing all tags of an HTML document
    4. Listing the formatted contents of an HTML document

  3. Packaging Lightweight HTML Scanner
    You need to distribute only two classes with your applet or application: be/arci/html/HTMLScanner.class and be/arci/html/HTMLTag.class, which together total only 4kB jarred and 7kB unjarred (production version). The class be/arci/html/HFile.class is not referenced at runtime; however, it might be needed by some Java Virtual Machines (e.g. Oracle's JVM) that perform extensive class file verification. The class be.arci.html.HTMLColors is a utility class provided for your convenience only; it is not referenced by the two core classes. If your application uses it, you must of course package it as well.

