Saturday, May 15, 2004

Visual Basic.NET Html Parser (VB.NET)

Many of the projects and tools I've written have needed to extract a list of certain tags and their properties from an HTML document: for example, extracting all anchor (A) tags from a document and getting their HREF property values.

When this became a need in a set of search engine optimization tools that I've written, I wrote a class that parses an HTML document and returns an ArrayList of HtmlTag objects. Here's an example of how it works:

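' html is a String holding the source of the document to parse.
Dim parser As New HtmlParser
Dim tags As ArrayList
Dim tag As HtmlTag

' Get every anchor tag in the document.
tags = parser.GetTags(html, "A")

For Each tag In tags
    MsgBox(tag.Properties("href"))
Next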

The HtmlTag class requires some explanation:

Public Class HtmlTag
    Public TagName As String = ""
    Public Properties As New Hashtable
    Public InnerHtml As String = ""
    Public InnerText As String = ""
    Public InnerTerms As String = ""

    Public Sub New(ByVal TagName As String, _
                   ByVal Properties As Hashtable, _
                   ByVal InnerHtml As String, _
                   ByVal InnerText As String)
        Me.TagName = TagName
        Me.Properties = Properties
        Me.InnerHtml = InnerHtml
        Me.InnerText = InnerText
    End Sub

    Public Sub New()
    End Sub
End Class

The TagName is set to the tag name (e.g. "A", "FONT", or "H1").

The Properties property is a hashtable of all properties found in that tag (e.g. the "href" and "target" properties of an "A" tag). They're referenced easily with tag.Properties("[propertyname]") as shown in the example above.
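Not every tag carries every property, and a Hashtable returns Nothing for a key it doesn't contain, so a small guard can be handy. The following is only a sketch: GetProperty is a hypothetical helper, not part of the class, and it assumes property names are stored in the same case you look them up with (lowercase "href" in the example above), since a default Hashtable is case-sensitive.

```vbnet
Imports System.Collections

Module PropertyLookup
    ' Hypothetical helper (not part of HtmlParser): returns the named
    ' property's value, or the fallback when the tag has no such property.
    Public Function GetProperty(ByVal properties As Hashtable, _
                                ByVal name As String, _
                                ByVal fallback As String) As String
        If properties Is Nothing OrElse Not properties.ContainsKey(name) Then
            Return fallback
        End If
        Return CStr(properties(name))
    End Function
End Module
```

With that in place, GetProperty(tag.Properties, "target", "_self") gives back "_self" for an anchor that has no target property.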

The InnerHtml property contains all of the HTML code found between the opening and closing tags. So, for example, if I had this tag in the HtmlTag object:

<a href="http://www.somesite.com">This is a<BR>link</a>

The InnerHtml property would contain This is a<BR>link

The InnerText property contains all of the text between the opening and closing tags with the HTML code stripped. So, using the example for InnerHtml, the InnerText property would contain This is a link.
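To make the relationship between InnerHtml and InnerText concrete, here is one way the stripping could be done with a regular expression. This is a rough sketch and not necessarily how the class does it: StripTags is a hypothetical name, and it assumes each tag is replaced by a space, which matches the "This is a link" result above. It does not decode entities such as &amp;nbsp;.

```vbnet
Imports System.Text.RegularExpressions

Module InnerTextSketch
    ' Hypothetical helper: removes tags from a fragment of HTML.
    Public Function StripTags(ByVal innerHtml As String) As String
        ' Replace every tag with a space so the words on either side stay apart.
        Dim text As String = Regex.Replace(innerHtml, "<[^>]+>", " ")
        ' Collapse runs of whitespace into one space and trim the ends.
        Return Regex.Replace(text, "\s+", " ").Trim()
    End Function
End Module
```

Calling StripTags("This is a<BR>link") returns "This is a link".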

The InnerTerms property may not be very useful to most people. It is a list of all of the "words" found in the InnerText property, separated by spaces. I used this when measuring keyword density in my search engine optimization tools.
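As an illustration of the keyword density idea, a space-separated string like InnerTerms can be split and counted. This is a simplified sketch rather than the code from my tools: Density is a hypothetical helper, it matches whole terms only, and it ignores case.

```vbnet
Module KeywordDensity
    ' Hypothetical helper: the fraction of terms in a space-separated
    ' list that equal the keyword (case-insensitive).
    Public Function Density(ByVal innerTerms As String, _
                            ByVal keyword As String) As Double
        Dim total As Integer = 0
        Dim matches As Integer = 0
        For Each term As String In innerTerms.Split(" "c)
            ' Skip the empty entries that consecutive spaces produce.
            If term.Length > 0 Then
                total += 1
                If String.Compare(term, keyword, True) = 0 Then
                    matches += 1
                End If
            End If
        Next
        If total = 0 Then Return 0.0
        Return matches / total
    End Function
End Module
```

For example, Density("cheap flights cheap hotels", "cheap") returns 0.5, since two of the four terms match.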

The class code can be downloaded here. Just rename the file from ".txt" to ".vb" and add it to your project.