The Lightweight HTML Scanner is a set of fast Java classes for scanning or parsing HTML documents. It gives applets and applications an easy-to-handle list of the syntax elements of an HTML document. Both HTML tags and content text can be extracted and handled in whatever way you need.
Benefits
The Lightweight HTML Scanner enables you to scan an HTML document for only the syntax elements you need. The benefits of the Lightweight HTML Scanner approach are:
Compatibility
The Lightweight HTML Scanner closely follows the HTML parsing behaviour common to Netscape Navigator and Microsoft Internet Explorer, both based on Mosaic. Even malformed HTML is handled as it is in these browsers.
Distribution size
The essential classes of the Lightweight HTML Scanner are only 4 kB in size (jarred, production version). The set of API methods is equally small, enabling you to keep your own classes light as well.
Speed
Because the scanner looks only for the HTML syntax elements you need, no time is wasted on the rest.
The Lightweight HTML Scanner does not build a Document Object Model, because:
Most HTML documents on the web are not well-formed and do not really fit a Document Object Model.
Building one adds weight to your applets and applications that many uses do not need.
Navigating the returned Document Object Model will probably be more complicated for the application programmer than running over a list of HTML tags and content.
There is no established standard Document Object Model.
Free classes that build one already exist (e.g. in the standard Java 2 libraries).
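To illustrate what "running over a list of HTML tags and content" can look like, here is a minimal sketch. It is not the Lightweight HTML Scanner's API: the class `FlatScanSketch`, its `Element` type and the `scan` method are hypothetical names invented for this example, and the scanning logic is deliberately naive (it does not handle comments, scripts or a `>` inside a quoted attribute value). It only shows the idea of asking for the tags you care about and getting back a flat list instead of a tree.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; these names are NOT part of the Lightweight HTML Scanner.
public class FlatScanSketch {

    /** One syntax element: either a tag or a run of content text. */
    static final class Element {
        final boolean isTag;
        final String value; // lower-cased tag name ("a", "/a"), or the trimmed text

        Element(boolean isTag, String value) {
            this.isTag = isTag;
            this.value = value;
        }

        @Override
        public String toString() {
            return isTag ? "<" + value + ">" : value;
        }
    }

    /** Returns a flat list holding only the wanted tags and, optionally, the content text. */
    static List<Element> scan(String html, Set<String> wantedTags, boolean wantText) {
        List<Element> out = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            int lt = html.indexOf('<', pos);
            int gt = (lt < 0) ? -1 : html.indexOf('>', lt + 1);
            if (gt < 0) {
                // No complete tag left (this covers an unclosed '<'): keep the rest as text.
                addText(out, html.substring(pos), wantText);
                break;
            }
            addText(out, html.substring(pos, lt), wantText);
            String name = tagName(html.substring(lt + 1, gt));
            // Match end tags ("/a") against the same wanted name as start tags ("a").
            String bare = name.startsWith("/") ? name.substring(1) : name;
            if (wantedTags.contains(bare)) {
                out.add(new Element(true, name));
            }
            pos = gt + 1;
        }
        return out;
    }

    private static void addText(List<Element> out, String text, boolean wantText) {
        String trimmed = text.trim();
        if (wantText && !trimmed.isEmpty()) {
            out.add(new Element(false, trimmed));
        }
    }

    /** Extracts the lower-cased tag name from the text between '<' and '>'. */
    private static String tagName(String inner) {
        String s = inner.trim();
        int end = 0;
        while (end < s.length() && !Character.isWhitespace(s.charAt(end))) {
            end++;
        }
        String name = s.substring(0, end).toLowerCase(Locale.ROOT);
        if (name.length() > 1 && name.endsWith("/")) {
            name = name.substring(0, name.length() - 1); // "br/" becomes "br"
        }
        return name;
    }

    public static void main(String[] args) {
        String html = "<p>See <A HREF=\"a.html\">first</a> and <a href=b.html>second</p>";
        Set<String> wanted = new HashSet<>(Arrays.asList("a"));
        // A single loop over the flat list replaces any tree navigation.
        for (Element e : scan(html, wanted, true)) {
            System.out.println(e);
        }
    }
}
```

Note that the unclosed second `<a>` in the sample input causes no trouble here: with no tree to build, there is nothing that has to be balanced.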
The Lightweight HTML Scanner is compiled with Java 2, but has been thoroughly tested with Java 1.1.8.
How much will the Lightweight HTML Scanner cost me?
Most of you will have to pay nothing (Niente! Nada! Nichts! Nullo! Rien de knots! Nil! Nougabollen!), as we have free licenses for developers, private users and evaluators. You will find more details in the license options summary and on the download page.