When I needed this for a set of search engine optimization tools I had written, I wrote a class that parses an HTML document and returns an ArrayList of HtmlTag objects. Here's an example of how it works:
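' html is a String holding the document's markup
Dim parser As New HtmlParser
Dim tags As ArrayList
Dim tag As HtmlTag
' Collect every "A" (anchor) tag in the document
tags = parser.GetTags(html, "A")
For Each tag In tags
    ' Loop body reconstructed for illustration: print each link's href
    Console.WriteLine(tag.Properties("href"))
Next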
The HtmlTag class requires some explanation:
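Imports System.Collections

' Holds one parsed tag: its name, its properties, and its contents
Public Class HtmlTag
    Public TagName As String = ""
    Public Properties As New Hashtable
    Public InnerHtml As String = ""
    Public InnerText As String = ""
    Public InnerTerms As String = ""

    ' InnerTerms is not a constructor argument, so it keeps its default here
    Public Sub New(ByVal TagName As String, _
                   ByVal Properties As Hashtable, _
                   ByVal InnerHtml As String, _
                   ByVal InnerText As String)
        Me.TagName = TagName
        Me.Properties = Properties
        Me.InnerHtml = InnerHtml
        Me.InnerText = InnerText
    End Sub

    ' Parameterless constructor leaves every field at its default
    Public Sub New()
    End Sub
End Class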
TagName holds the name of the tag (e.g. "A", "FONT", "H1").
The Properties property is a Hashtable of all properties found in that tag (e.g. the "href" and "target" properties of an "A" tag). They're easily referenced with tag.Properties("[propertyname]"), as shown in the example above.
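As a small sketch of working with that hashtable: a Hashtable returns Nothing for a key it doesn't contain, so an optional property such as "target" can be checked before it is used. The GetProperty helper and the sample values are hypothetical, and the sketch assumes the keys are stored in lower case, which may not match what the class actually does.

```vbnet
Imports System
Imports System.Collections

Module PropertiesDemo
    ' Hypothetical helper: returns the named property, or a fallback
    ' when the tag did not define it (Hashtable yields Nothing then).
    Function GetProperty(ByVal props As Hashtable, ByVal name As String, _
                         ByVal fallback As String) As String
        Dim value As Object = props(name)
        If value Is Nothing Then Return fallback
        Return value.ToString()
    End Function

    Sub Main()
        ' Stand-in for tag.Properties on an "A" tag (assumed lower-case keys)
        Dim props As New Hashtable
        props("href") = "http://www.somesite.com"

        Console.WriteLine(GetProperty(props, "href", ""))
        Console.WriteLine(GetProperty(props, "target", "(none)"))
    End Sub
End Module
```

The first call prints the link's URL; the second falls back to "(none)" because no "target" entry was stored.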
The InnerHtml property contains all of the HTML code found between the opening and closing tags. So, for example, if I had this tag in the HtmlTag object:
<a href="http://www.somesite.com">This is a<BR>link</a>
The InnerHtml property would contain This is a<BR>link
The InnerText property contains all of the text between the opening and closing tags with the HTML code stripped. So, using the example for InnerHtml, the InnerText property would contain This is a link.
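To make the relationship between InnerHtml and InnerText concrete, here is a rough sketch of one way the stripping could be done. It is not the code from the class: StripTags is a hypothetical helper that swaps each tag for a space with a regular expression and then collapses the whitespace. It ignores edge cases such as comments, entities, or a ">" inside a property value.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module InnerTextDemo
    ' Illustrative only: replace each tag with a space, then collapse
    ' runs of whitespace and trim the ends.
    Function StripTags(ByVal innerHtml As String) As String
        Dim text As String = Regex.Replace(innerHtml, "<[^>]+>", " ")
        Return Regex.Replace(text, "\s+", " ").Trim()
    End Function

    Sub Main()
        ' Prints: This is a link
        Console.WriteLine(StripTags("This is a<BR>link"))
    End Sub
End Module
```

Replacing the tag with a space rather than nothing is what keeps "a" and "link" from running together in this example.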
The InnerTerms property may not be very useful to most people. It is a list of all of the "words" found in the InnerText property, separated by spaces. I used this when measuring keyword density in my search engine optimization tools.
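For anyone curious how a space-separated term list feeds a keyword density figure, here is a minimal sketch. It assumes one common definition (occurrences of the keyword divided by the total number of terms). The KeywordDensity helper and the sample string are made up for illustration and are not taken from my tools.

```vbnet
Imports System

Module KeywordDensityDemo
    ' Hypothetical helper: fraction of the terms that match the keyword,
    ' ignoring case. Assumes the terms are separated by spaces.
    Function KeywordDensity(ByVal innerTerms As String, _
                            ByVal keyword As String) As Double
        Dim terms() As String = innerTerms.Split(" "c)
        Dim total As Integer = 0
        Dim hits As Integer = 0
        For Each term As String In terms
            ' Skip empty entries left by repeated spaces
            If term.Length > 0 Then
                total += 1
                If String.Compare(term, keyword, True) = 0 Then hits += 1
            End If
        Next
        If total = 0 Then Return 0.0
        ' The / operator yields a Double even for Integer operands
        Return hits / total
    End Function

    Sub Main()
        ' Sample InnerTerms value: 2 of the 6 terms are "cheap"
        Dim sample As String = "cheap flights to paris cheap hotels"
        Console.WriteLine(KeywordDensity(sample, "cheap"))
    End Sub
End Module
```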
The class code can be downloaded here. Just rename the file from ".txt" to ".vb" and add it to your project.