Lightweight HTML Scanner 2.00
java.lang.Object
  |
  +--be.arci.html.HTMLScanner
Immutable class that encapsulates a complete HTML document so that it can be scanned for tags and content. The document is scanned for HTML tags each time getTags() is invoked. That method can be invoked multiple times on the same HTMLScanner, for instance to retrieve different tag sets.
| Field Summary | |
| static java.lang.String | sCopy | Copyright notice; none of the Lightweight HTML Scanner license types allows you to change this. |
| java.lang.String | sHTMLDoc | The HTML document accessed through this HTMLScanner. |
| Constructor Summary | |
| HTMLScanner(java.io.File flHTMLDoc) | Convenience constructor for obtaining an HTMLScanner from a File. |
| HTMLScanner(java.io.InputStream isHTMLDoc) | Reads in an HTML document from the specified InputStream, for scanning. |
| HTMLScanner(java.lang.String sHTMLDoc) | Constructs an HTMLScanner for the specified document. |
| HTMLScanner(java.net.URL urlHTMLDoc) | Convenience constructor for obtaining an HTMLScanner from a URL. |
| Method Summary | |
| HTMLTag[] | getTags(java.lang.String[] asTagNames, boolean swDiscardOtherTags) | Scans an HTML document for the requested tags. |
| java.lang.String | toString() | Returns the scanned document. |
| Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
| Field Detail |
public static final java.lang.String sCopy
public final java.lang.String sHTMLDoc
| Constructor Detail |
public HTMLScanner(java.lang.String sHTMLDoc)
sHTMLDoc - the document with HTML syntax to scan.
public HTMLScanner(java.io.InputStream isHTMLDoc)
throws java.io.IOException
Note: Because JVMs do not share a standard for encoding names, the JVM's default encoding is used as a fallback if "iso-8859-1" is not known to the JVM.
isHTMLDoc - the InputStream to read the HTML document from.
See Also: HTMLScanner(String sHTMLDoc)
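The encoding fallback described in the note for this constructor can be sketched with the standard java.nio.charset API. This is a minimal illustration under stated assumptions, not the library's actual implementation: the class Latin1Reader and its methods pickCharset and readAll are hypothetical names introduced here. On current Java platforms ISO-8859-1 is a required charset, so the fallback branch would rarely be taken.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper for illustration; not part of be.arci.html.
final class Latin1Reader {

    // Returns the preferred charset if the JVM knows it,
    // otherwise the JVM's default charset.
    static Charset pickCharset(String preferred) {
        try {
            if (Charset.isSupported(preferred)) {
                return Charset.forName(preferred);
            }
        } catch (IllegalArgumentException e) {
            // The name itself is illegal or null: fall through to the default.
        }
        return Charset.defaultCharset();
    }

    // Reads the whole stream into a String, decoding as iso-8859-1 when possible.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, pickCharset("iso-8859-1"))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // 0xE9 is a single byte in ISO-8859-1 and decodes to one character.
        byte[] bytes = {'c', 'a', 'f', (byte) 0xE9};
        String doc = readAll(new ByteArrayInputStream(bytes));
        System.out.println(doc.length()); // prints 4
    }
}
```

The resulting String could then be handed to the HTMLScanner(String sHTMLDoc) constructor, which is what the See Also reference on this constructor suggests happens internally.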
public HTMLScanner(java.net.URL urlHTMLDoc)
throws java.io.IOException
urlHTMLDoc - the URL to read the HTML document from.
See Also: HTMLScanner(InputStream isHTMLDoc)
public HTMLScanner(java.io.File flHTMLDoc)
throws java.io.IOException
flHTMLDoc - the file to read the HTML document from.
See Also: HTMLScanner(InputStream isHTMLDoc)
| Method Detail |
public java.lang.String toString()
Overrides: toString in class java.lang.Object
See Also: sHTMLDoc
public HTMLTag[] getTags(java.lang.String[] asTagNames,
boolean swDiscardOtherTags)
throws java.lang.IllegalArgumentException
This method disassembles an HTML document into a succession of HTML tag and content substrings, following the parsing behaviour common to Microsoft Internet Explorer and Netscape Navigator.
The parameter asTagNames is an array of the tag names the application is interested in. The HTMLTag.iID field in each element of the returned HTMLTag[] array is an index into this array of tag names. Beware, however: block closing tags have a negative iID.
The first array element (asTagNames[0]) corresponds to tag ID 0 (zero), which is reserved for content text. HTMLScanner never matches this element against any tag. However, if asTagNames[0] == null, no HTMLTag objects are constructed for text content, only for 'real' HTML tags. The other elements of asTagNames should not be null.
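The ID convention can be made concrete with a small decoding helper. The sketch below is hypothetical and not part of be.arci.html: TagIds and describe are illustrative names, and it assumes that a closing tag carries the negated index of its tag name, which the text implies but does not state outright. The 'unknown tag' case follows the swDiscardOtherTags parameter description, where such tags get iID == asTagNames.length.

```java
// Hypothetical helper for illustration; not part of be.arci.html.
final class TagIds {

    // Translates an iID value back into a readable label.
    // Assumption: a closing tag's iID is the negated index of its tag name.
    static String describe(int iID, String[] asTagNames) {
        if (iID == 0) {
            return "content text";        // ID 0 is reserved for text content
        }
        int index = Math.abs(iID);
        if (index < 0 || index > asTagNames.length) {
            // index < 0 only happens for Integer.MIN_VALUE
            throw new IllegalArgumentException("iID out of range: " + iID);
        }
        if (index == asTagNames.length) {
            return "unknown tag";         // tag not listed in asTagNames
        }
        String name = asTagNames[index];
        return iID > 0 ? "<" + name + ">" : "</" + name + ">";
    }

    public static void main(String[] args) {
        // asTagNames[0] is the slot reserved for content text.
        String[] names = {null, "A", "IMG"};
        System.out.println(describe(1, names));   // prints <A>
        System.out.println(describe(-1, names));  // prints </A>
        System.out.println(describe(0, names));   // prints content text
        System.out.println(describe(3, names));   // prints unknown tag
    }
}
```

An application would typically switch on iID in the same way while walking the array returned by getTags().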
Invoking this method does not alter the state of an HTMLScanner, so it can be called multiple times on the same object.
Note: Regardless of the tags named in the asTagNames parameter, getTags() internally always scans the HTML document for <SCRIPT>, <PRE>, <LISTING>, <XMP>, <PLAINTEXT>, <TEXTAREA>, <!-- --> and <TITLE> elements, because they influence the handling of consecutive whitespace and the recognition of enclosed tags.
asTagNames - Recognized HTML tags. Recognition is fastest when most-used tag names come first.
swDiscardOtherTags - if true, no HTMLTag object is constructed for HTML tags not in asTagNames; if false, an HTMLTag object with HTMLTag.iID == asTagNames.length ('unknown tag') is constructed and included in the returned array for each tag that does not match one of asTagNames.
java.lang.IllegalArgumentException - if asTagNames is null or empty.
See Also: HTMLTag.accumulateContent(StringBuffer sb), HTMLTag.getAttribute(String sAttribute), HTMLTag.iID, Examples