Saturday, May 15, 2004

Visual Basic.NET Html Parser (VB.NET)

In numerous projects and tools that I've written, I've needed to extract a list of certain tags and their properties from an HTML document. For example, extracting all anchor (A) tags from a document and getting their HREF property values.

When this became a need in a set of search engine optimization tools that I've written, I wrote a class that handles the parsing of an HTML document and the returning of an ArrayList of HtmlTag class objects. Here's an example of how it works:

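' "html" is assumed to already hold the HTML source to be parsed
Dim parser As New HtmlParser
Dim tags As ArrayList
Dim tag As HtmlTag

' Get every anchor (A) tag in the document
tags = parser.GetTags(html, "A")

' Show the HREF value of each anchor
For Each tag In tags
    MsgBox(tag.Properties("href"))
Next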

The HtmlTag class requires some explanation:

Public Class HtmlTag
    Public TagName As String = ""
    Public Properties As New Hashtable
    Public InnerHtml As String = ""
    Public InnerText As String = ""
    Public InnerTerms As String = ""

    Public Sub New(ByVal TagName As String, _
                   ByVal Properties As Hashtable, _
                   ByVal InnerHtml As String, _
                   ByVal InnerText As String)
        Me.TagName = TagName
        Me.Properties = Properties
        Me.InnerHtml = InnerHtml
        Me.InnerText = InnerText
    End Sub

    Public Sub New()
    End Sub
End Class

The TagName is set to the tag name (e.g. "A", "FONT", "H1", etc.).

The Properties property is a hashtable of all properties found in that tag (e.g. the "href" and "target" properties of an "A" tag). They're referenced easily with tag.Properties("[propertyname]") as shown in the example above.

The InnerHtml property contains all of the HTML code found between the opening and closing tags. So, for example, if I had this tag in the HtmlTag object:

<a href="http://www.somesite.com">This is a<BR>link</a>

The InnerHtml property would contain This is a<BR>link

The InnerText property contains all of the text between the opening and closing tags with the HTML code stripped. So, using the example for InnerHtml, the InnerText property would contain This is a link.

The InnerTerms property may not be very useful to most people. It is a list of all of the "words" found in the InnerText property, separated by spaces. I used this when measuring keyword density in my search engine optimization tools.
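For readers who want to see the idea outside of VB.NET, here is a minimal Python sketch of the same kind of extraction, built on the standard library's html.parser module. It is not the code from the downloadable class; the names get_tags, HtmlTag and _TagCollector are made up for this illustration, and replacing nested tags with a space in the inner text is an assumption chosen to match the "This is a link" example above. It also does not try to cope with badly broken markup.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
import re


@dataclass
class HtmlTag:
    """Rough Python stand-in for the VB.NET HtmlTag class (illustrative only)."""
    tag_name: str
    properties: dict = field(default_factory=dict)
    inner_html: str = ""
    inner_text: str = ""
    inner_terms: str = ""


class _TagCollector(HTMLParser):
    """Collects every occurrence of one tag name while the document is parsed."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted.lower()
        self.open_tags = []  # wanted tags that are still open (allows nesting)
        self.found = []      # all wanted tags, in document order

    def _open(self, tag, attrs):
        # Attribute names arrive lowercased; valueless attributes become "".
        record = HtmlTag(tag.upper(), {name: value or "" for name, value in attrs})
        self.found.append(record)
        return record

    def _add_markup(self, markup):
        # Nested markup is kept in inner_html and becomes a space in inner_text.
        for record in self.open_tags:
            record.inner_html += markup
            record.inner_text += " "

    def handle_starttag(self, tag, attrs):
        self._add_markup(self.get_starttag_text())
        if tag == self.wanted:
            self.open_tags.append(self._open(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no inner content of their own.
        self._add_markup(self.get_starttag_text())
        if tag == self.wanted:
            self._open(tag, attrs)

    def handle_endtag(self, tag):
        if tag == self.wanted and self.open_tags:
            self.open_tags.pop()
        self._add_markup("</%s>" % tag)

    def handle_data(self, data):
        for record in self.open_tags:
            record.inner_html += data
            record.inner_text += data


def get_tags(html, tag_name):
    """Return an HtmlTag for every tag_name element found in html."""
    collector = _TagCollector(tag_name)
    collector.feed(html)
    collector.close()
    for record in collector.found:
        # Collapse whitespace, then list the lowercase "words" separated by spaces.
        record.inner_text = " ".join(record.inner_text.split())
        record.inner_terms = " ".join(re.findall(r"\w+", record.inner_text.lower()))
    return collector.found
```

Calling get_tags(html, "A") on a page containing the anchor from the example above gives one HtmlTag whose properties["href"] is the link address, whose inner_html is "This is a<BR>link" and whose inner_text is "This is a link".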

The class code can be downloaded here. Just rename the file from ".txt" to ".vb" and add it to your project.

Free Chart and Graph Control for ASP.NET

I've used WebChart, a free ASP.NET charting and graphing control, in one of my ASP.NET projects. It works very well, outputs the images in PNG format, and even cleans up the storage directory by itself. Very nice.

It's not really full featured like some of the professional controls, but you can't beat the price tag. ;)

Check it out.

Ten Steps to a Secure ASP.NET Web Application

Here is a list of steps to take to ensure that your data-driven ASP.NET web application is reasonably secure on Windows 2000 (there may be some changes for Windows Server 2003, as Microsoft has tightened security considerably).

Windows

1. Create a new user account. Do not grant any additional permission beyond putting it into the ‘Users’ group (which is done by default).

NOTE: If SQL Server and IIS are running on different servers, the account you create must be a domain account. If SQL Server and IIS are on the same server, it can be a local account (unless ASP.NET will need to access any network resources).

2. Grant the account that you create read/write access to the folder: C:\Documents and Settings\<machine name>\ASPNET\Local Settings\Temp. This is needed for ASP.NET to be able to do its work.

IIS

3. Change the impersonated user for anonymous access to your Web Application in IIS to the newly created user. The default user is IUSR_<machine name>. You change this by right-clicking on your Web Application in IIS and going to the properties. Under the Directory Security tab, edit the account used for anonymous access.

ASP.NET

4. In your Web Application’s Web.config file, add the following line to the system.web section:

<identity impersonate="true"/>

This will cause ASP.NET to run under the credentials of the anonymous IIS user (which in step 3 was set to the user you created).

SQL Server

5. In SQL Server, grant access to the user you’ve created only for the database(s) used in your Web Application. Give as little access as is necessary. I recommend that all of your SQL Server functionality take place in stored procedures, and that access be granted only to execute those stored procedures. Do not enable any direct table access at all if you can help it.

6. With the above done, remove any credentials passed in the connection strings to SQL Server and replace them with “Trusted_Connection=Yes”.

7. Finally, since ASP.NET is impersonating the user you have created, Windows authentication is all that is necessary for SQL Server, so if SQL Server is running in mixed mode, change it to Windows-only mode. This is done in Enterprise Manager. Right-click on the server and go to Properties. The setting is found under the Security tab.

NOTE: Follow this step only if you are sure that all applications accessing SQL Server have made the necessary changes to use Windows authentication! Any application not switched to using trusted connections will be denied access once the change is made.

8. If you must operate SQL Server in a mixed mode environment, make sure the sa password is not blank and is set to an alphanumeric value of at least 8 characters. When in Windows authentication mode, the sa user cannot be used, so this is not an issue.
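To make step 6 concrete, here is a small Python sketch of what the connection string change amounts to: drop the credential entries and add Trusted_Connection=Yes. The function name use_trusted_connection and the server, database and login names are invented for the example, and the parsing is deliberately naive (it assumes no value contains a semicolon), so treat it as an illustration of the before and after rather than a tool to run against production config.

```python
def use_trusted_connection(connection_string):
    """Strip login credentials and switch to Windows authentication.

    Naive sketch: splits on ';' and assumes no value contains a semicolon.
    """
    # Keys that carry SQL Server login credentials, plus any existing
    # authentication setting that the new entry replaces.
    dropped_keys = {
        "user id", "uid", "user", "password", "pwd",
        "trusted_connection", "integrated security",
    }
    kept = []
    for part in connection_string.split(";"):
        part = part.strip()
        if not part:
            continue
        key = part.split("=", 1)[0].strip().lower()
        if key in dropped_keys:
            continue
        kept.append(part)
    kept.append("Trusted_Connection=Yes")
    return ";".join(kept)


# Hypothetical server, database and login names
before = "Server=DBSERVER;Database=Sales;User ID=webuser;Password=secret"
after = use_trusted_connection(before)
```

With the hypothetical string above, after ends up as "Server=DBSERVER;Database=Sales;Trusted_Connection=Yes", so no password is left sitting in Web.config.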

MS Access

9. If you are connecting to an Access database from your ASP.NET application, you will need to grant the new user Read and/or Write privileges to the MDB file. Only grant Write privileges if the web application will be inserting into or otherwise modifying the database.

Network Resources

10. If accessing any network resources from your web application, do so only through objects installed in Component Services running under accounts restricted to accessing only the necessary resources. As always, if write access is not needed, don’t grant it. Never grant the ASP.NET impersonated account access to network resources directly. If your COM objects are created in .NET, remember to register them using regsvcs.exe—lazy registration will fail if ASP.NET tries to create the COM object.

A while back I wrote a blog post detailing how to securely access network resources from an ASP.NET web application.

Batch Scanning for your applications using Kofax Ascent Capture

In one of my projects at work I've recently introduced a batch scanning module that uses the Kofax Ascent Capture product. The project is a VB6 application.

In the past we had attempted to handle the batch scanning on our own, with, quite frankly, limited success. Kofax is the largest manufacturer of imaging products in the world, and as such their software is incredible. The move to a third party batch scanning solution was most definitely a good idea for us.

Kofax offers two distinct benefits: speed and flexibility.

Speed


Ascent Capture is very fast. We paired the software with Fujitsu 4097D scanners that support Virtual Rescan (VRS). VRS offers many advantages, the best of which is increased speed. The performance is increased by converting a grayscale image to black and white at the hardware level instead of having to do the conversion on the PC. In my tests I found that the 4097D scanned almost twice as fast with VRS enabled as with VRS disabled. Since Kofax created VRS, their software knows how to handle it.

Flexibility


Ascent Capture allows you to write VB scripts for almost everything! I wrote a custom OCX that allowed me to call the Scan module from within our application seamlessly, as if it were a part of our own app. That was great.

Even without writing a line of code, though, Ascent Capture is very flexible. You can have the batch information written to any DSN data source, using the fields that you supply. It supports bar code scanning as well. We use it to pull the document number off of the documents we scan, tying them back to a record in our database and updating the image accordingly--no user input necessary.

Very cool stuff. If you're looking to implement batch scanning in any of your projects, I'd definitely look into Kofax and Ascent Capture.

Oh, by the way, did I mention that it's cheap? A 5,000 scans-per-month license is a little under $1,000. A 25,000 scans-per-month license is available, as well as an unlimited license. All very reasonably priced. Our salespeople were jumping for joy when they found out how much money we were going to make off of our new batch scanning module! :)

PHP versus ASP.NET (or) How PHP compares with ASP.NET

This issue has been hashed and rehashed around the web, but I wanted to throw my two cents in regarding the issue. That's because, unlike the authors of the other articles I've read on this subject, I've actually coded two almost identical projects using both PHP with MySQL and ASP.NET with SQL Server, so I feel like I can give a fair and unbiased point of view.

Why did I create the same project using both technologies? Quite simply, money. Some people wanted to buy the project I'd written in ASP.NET, but they needed it in PHP.

All that said, I'll compare the technologies from three vantage points: Availability, Performance and Development Speed.

Availability


PHP wins hands-down in the availability department. That's because most web servers still run some form of UNIX platform (a popular one being Linux), and PHP's birthplace is the UNIX world. ASP.NET, however, is only available for web servers running Microsoft's Internet Information Services (IIS).

So if you want to sell the project you're creating to as many webmasters as possible, and if it's possible to create the project in either ASP.NET -or- PHP, I would suggest using PHP. A search for PHP at Google.com as of today returns 319,000,000 results compared to 5,520,000 for ASP.NET.

Performance


You PHP coders out there will probably scoff at this, but guess what--ASP.NET is faster. ASP.NET's caching and native SQL Server database access can't be beat if the bottom line is performance. But don't take my word for it, check out the benchmarks at eWeek.com. At first glance MySQL seems faster than SQL Server, but don't stop at the first graph! Click 'next' twice and look at the ASP.NET / IIS / SQL Server performance graph. The Microsoft trio was actually able to serve up 33% more page views per second than PHP/MySQL.

I've found this to be the case in my own tests as well. But I will say this: ASP.NET and SQL Server may perform better, but it's no "tortoise and the hare" comparison. PHP with MySQL still performs very well.

Development Speed


In the arena of development speed, it's a toss up. On the one hand, ASP.NET's code-behind makes handling the events surrounding form input very easy and fast. So that's great. ASP.NET also provides a large number of controls that make form validation and database connectivity a breeze.

However, in my opinion, the syntax of the PHP language is more powerful, including its (oh-my-God-I-love-this) ability to do this:
print "Writing a variable's value ($val1) to the output stream.";

Instead of this:
Response.Write("Writing a variable's value (" & val1 & ") to the output stream.")

You would be amazed at how much time and coding effort that one little syntax difference saves, not to mention how much more readable it makes the code. PHP is a full featured, mature language now (and an object-oriented one at that if you care to use classes). Also, PHP has a large number of very useful little functions that ASP.NET lacks, and other functions that are just easier to use. For example, the fopen() function has the ability to open a URL and read from the web page as if it were a normal file. To do the same thing with ASP.NET takes about 5 or 6 lines of code.

PHP also has the ability to overlook various errors and trudge on despite problems, whereas ASP.NET will just die when a problem occurs and is not properly handled. On the one hand it can be argued that this is a bad thing that allows badly written code, but on the other hand if you're just trying to slam out some code to test an idea, it's much faster. I've written a web crawler for my little mini search engines (such as the weight loss search engine) in PHP, and frankly you just can't plan for everything when parsing and indexing vastly different web page designs. PHP gracefully slips past the problems and continues indexing. I can look at the warnings and modify the crawler accordingly, but I didn't have to lose hours of work because of it.

Is that your final answer?


In conclusion, I say that both PHP and ASP.NET are great development platforms. If you need wide availability, use PHP. If you want absolutely the best performance possible, use ASP.NET/SQL Server/IIS. If you need to get your project done in a hurry, either one is fine.

MySQL Full Text Indexing

I've been toying with the idea of writing my own "mini search engines" on targeted topics for some time now, primarily so that I can put advertisements on them and maybe make a few extra bucks. I initially thought I would use ASP.NET and MS SQL Server, but then I found out something wonderful:

MySQL 4.0 and later supports full text indexing!

And boy, talk about a breeze to implement. Using MySQL Control Center, you simply edit the table and create a new index, selecting the FULL TEXT radio button. Tada! Instant full text indexing. In fact, MySQL offers some significant advantages over SQL Server in the arena of full text indexing.


  1. In MySQL, full text is activated automatically (you have to turn it on for SQL Server, which is a pain if your site is hosted by somebody else).
  2. In MySQL, the full text indexes are updated automatically as records are inserted (in SQL Server you have to rebuild the indexes every time--yuck).


Querying your full text indexes is also simple. Let's say I've created a full text index on a table called "documents", with the full text index on two fields "title" and "body". You can query it like this:

SELECT title, body FROM documents WHERE MATCH(title, body) AGAINST ('keyword phrase')


The results are automatically sorted by relevancy. For an example of one of my mini search engines in action, click here.

FINAL NOTE: By default, MySQL treats any term that shows up in 50% or more of the indexed records as a stop term. What that means is that a search for such a term will return zero results. For an example, try searching for "diet" at my weight loss search engine. No results, because every page in the index is about weight loss.

Back to the Blog

After a dozen posts this blog stalled, but my interest has been rekindled, and so I've decided to pick it back up. I've actually been getting a few hundred visitors a month to the blog without doing anything, and it's hurting my feelings. If the masses want some information, well darnit, I'll give it to them. :)

I've revamped the look of the blog, too, and I like it much better this way.

I plan on creating some new posts for today so my visitors can have some extra code to chew on as well.

As always,
Jonathan