Lightweight HTML Scanner 2.00

be.arci.html
Class HTMLScanner

java.lang.Object
  |
  +--be.arci.html.HTMLScanner

public class HTMLScanner
extends java.lang.Object

Immutable class that encapsulates a complete HTML document to scan it for tags and content. The document is scanned for HTML tags on invoking getTags(). This method can be invoked multiple times on the same HTMLScanner, for instance for retrieving different tag sets.


Field Summary
static java.lang.String sCopy
          Copyright notice; none of the Lightweight HTML Scanner license types allows you to change this.
 java.lang.String sHTMLDoc
          The HTML document accessed through this HTMLScanner.
 
Constructor Summary
HTMLScanner(java.io.File flHTMLDoc)
          Convenience constructor for obtaining an HTMLScanner from a File.
HTMLScanner(java.io.InputStream isHTMLDoc)
          Reads in an HTML document from the specified InputStream, for scanning.
HTMLScanner(java.lang.String sHTMLDoc)
          Constructs an HTMLScanner for the specified document.
HTMLScanner(java.net.URL urlHTMLDoc)
          Convenience constructor for obtaining an HTMLScanner from a URL.
 
Method Summary
 HTMLTag[] getTags(java.lang.String[] asTagNames, boolean swDiscardOtherTags)
          Scans an HTML document for the requested tags.
 java.lang.String toString()
          Returns the scanned document.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

sCopy

public static final java.lang.String sCopy
Copyright notice; none of the Lightweight HTML Scanner license types allows you to change this.

sHTMLDoc

public final java.lang.String sHTMLDoc
The HTML document accessed through this HTMLScanner.
Constructor Detail

HTMLScanner

public HTMLScanner(java.lang.String sHTMLDoc)
Constructs an HTMLScanner for the specified document.

Parameters:
sHTMLDoc - the document with HTML syntax to scan.

HTMLScanner

public HTMLScanner(java.io.InputStream isHTMLDoc)
            throws java.io.IOException
Reads in an HTML document from the specified InputStream, for scanning. This constructor reads in the complete document at once, using "iso-8859-1" (also known as "Latin1"; "US-ASCII" is its 7-bit subset), the standard character encoding for HTML documents. The InputStream is read to EOF, but it is not closed by this constructor; closing it is the responsibility of the code that opened it.

Note: As there is no standard among JVMs for specifying encoding names, the JVM's default encoding is used as a fallback if "iso-8859-1" is not known to the JVM.
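
The reading behaviour described above can be sketched with the standard library alone. This is only an illustration, not the library's actual implementation: the class Latin1StreamReader and its method readAll are hypothetical names. The sketch reads the stream to EOF, decodes the bytes as "iso-8859-1", falls back to the JVM's default encoding if that name is rejected, and leaves the stream open for the caller to close.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper for illustration; not part of be.arci.html.
class Latin1StreamReader {

    // Reads the stream to EOF and decodes the bytes as ISO-8859-1.
    // The stream is deliberately left open: closing it is the caller's job.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            bytes.write(buffer, 0, n);
        }
        try {
            return bytes.toString("iso-8859-1");
        } catch (UnsupportedEncodingException e) {
            // Fallback: the encoding name is unknown to this JVM,
            // so decode with the JVM's default encoding instead.
            return bytes.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        // Four bytes; 0xE9 is a single character (e with acute accent) in Latin-1.
        InputStream in = new ByteArrayInputStream(new byte[] {'c', 'a', 'f', (byte) 0xE9});
        System.out.println(readAll(in).length()); // prints 4
    }
}
```

The resulting String could then be handed to the HTMLScanner(String sHTMLDoc) constructor.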

Parameters:
isHTMLDoc - the InputStream to read the HTML document from.
See Also:
HTMLScanner(String sHTMLDoc)

HTMLScanner

public HTMLScanner(java.net.URL urlHTMLDoc)
            throws java.io.IOException
Convenience constructor for obtaining an HTMLScanner from a URL. Equivalent to new HTMLScanner(urlHTMLDoc.openStream()), followed by a close() on the InputStream.

Parameters:
urlHTMLDoc - the URL to read the HTML document from.
See Also:
HTMLScanner(InputStream isHTMLDoc)

HTMLScanner

public HTMLScanner(java.io.File flHTMLDoc)
            throws java.io.IOException
Convenience constructor for obtaining an HTMLScanner from a File. Equivalent to new HTMLScanner(new FileInputStream(flHTMLDoc)), followed by a close() on the InputStream.
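
The open, read, close sequence that this constructor is described as wrapping might look roughly like the sketch below. It uses only the standard library; FileDocumentLoader and load are hypothetical names, and the try-with-resources form is an assumption about how one would write this today rather than how the library does it.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of be.arci.html.
class FileDocumentLoader {

    // Opens the file, reads it completely as ISO-8859-1, and closes the
    // stream again, whether or not reading succeeded.
    static String load(File flHTMLDoc) throws IOException {
        try (InputStream in = new FileInputStream(flHTMLDoc);
             Reader reader = new InputStreamReader(in, StandardCharsets.ISO_8859_1)) {
            StringBuilder sb = new StringBuilder();
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
            return sb.toString();
        }
    }
}
```

Because the stream is opened inside the helper, the helper is also the place that closes it, which mirrors the ownership rule stated for the InputStream constructor.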

Parameters:
flHTMLDoc - the file to read the HTML document from.
See Also:
HTMLScanner(InputStream isHTMLDoc)
Method Detail

toString

public java.lang.String toString()
Returns the scanned document.
Overrides:
toString in class java.lang.Object
Returns:
sHTMLDoc
See Also:
sHTMLDoc

getTags

public HTMLTag[] getTags(java.lang.String[] asTagNames,
                         boolean swDiscardOtherTags)
                  throws java.lang.IllegalArgumentException
Scans an HTML document for the requested tags.

This method disassembles an HTML document into a succession of HTML tag and content substrings, according to parsing behaviour common to Microsoft Internet Explorer and Netscape Navigator.

The parameter asTagNames is an array of the tag names the application is interested in. The HTMLTag.iID field in each element of the returned HTMLTag[] array is an index into this array of tag names. Beware, however: block closing tags have a negative iID.

The first array element (asTagNames[0]) corresponds to tag ID 0 (zero), which is reserved for content text. This element is not referred to by HTMLScanner, and will not be matched to any tag. But if asTagNames[0] == null, no HTMLTag objects will be constructed for text content, only for 'real' HTML tags. Other elements of asTagNames should not be null.
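
As an illustration of how an application might interpret these IDs, the sketch below maps an iID value back to a readable description. TagIdDescriber and describe are hypothetical names, not part of this package. It also assumes that a closing tag's iID is the negated index of its tag name, which the text above suggests but does not state outright, and it uses the 'unknown tag' value asTagNames.length described under the swDiscardOtherTags parameter.

```java
// Hypothetical helper for illustration; not part of be.arci.html.
class TagIdDescriber {

    // Translates an iID value into a description, given the same
    // asTagNames array that was passed to getTags().
    static String describe(int iID, String[] asTagNames) {
        if (asTagNames == null || asTagNames.length == 0) {
            throw new IllegalArgumentException("asTagNames is null or empty");
        }
        if (iID == 0) {
            return "text content";          // ID 0 is reserved for content text
        }
        if (iID == asTagNames.length) {
            return "unknown tag";           // tag not listed in asTagNames
        }
        int index = Math.abs(iID);
        if (index >= asTagNames.length) {
            // The documentation does not define any other values.
            throw new IllegalArgumentException("iID out of range: " + iID);
        }
        // Assumption: a negative iID is the negated index of the tag name.
        return (iID < 0 ? "</" : "<") + asTagNames[index] + ">";
    }

    public static void main(String[] args) {
        // asTagNames[0] is null: no HTMLTag objects wanted for text content.
        String[] names = {null, "A", "IMG"};
        System.out.println(describe(1, names));   // prints <A>
        System.out.println(describe(-1, names));  // prints </A>
        System.out.println(describe(3, names));   // prints unknown tag
    }
}
```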

Invoking this method does not alter the state of an HTMLScanner, so it can be called multiple times on the same object.

Note: Regardless of the tags named in the asTagNames parameter, getTags() internally always scans the HTML document for <SCRIPT>, <PRE>, <LISTING>, <XMP>, <PLAINTEXT>, <TEXTAREA>, <!-- --> and <TITLE> elements, because they influence the handling of consecutive whitespace and the recognition of enclosed tags.

Parameters:
asTagNames - Recognized HTML tags. Recognition is fastest when most-used tag names come first.
swDiscardOtherTags - if true, no HTMLTag object is constructed for HTML tags not in asTagNames; if false, an HTMLTag object with HTMLTag.iID == asTagNames.length ('unknown tag') is constructed and included in the returned array for each tag that does not match one of asTagNames.
Throws:
java.lang.IllegalArgumentException - if asTagNames is null or empty.
See Also:
HTMLTag.accumulateContent(StringBuffer sb), HTMLTag.getAttribute(String sAttribute), HTMLTag.iID, Examples
