Empty URLs

June 22nd, 2010

I briefly discussed the problem of word noise in a previous post.  Word noise is when a text-based classification system is stymied by too much content.  An overabundance of content, especially content spanning unrelated topics, makes accurate classification impossible.  If a site carries business-related, video-game-related, gardening-related, and sports-related content, how in the world can you accurately classify it?

Well, empty URLs create the opposite (but equal) problem for URL classification.  How could a URL classification program correctly classify a site such as wbj.org or wsj.com as a business site based wholly on the URL?  It can’t.  The reason is that the URL gives no evidence of the site’s true content.  This is what is called an empty URL.
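To make the idea concrete, here is a minimal sketch of a keyword-based URL classifier that reports when a URL is empty.  The keyword lists and the function names (url_tokens, classify_url) are assumptions made up for illustration; they are not taken from any particular classification product.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword lists; a real classifier would use a far larger vocabulary.
TOPIC_KEYWORDS = {
    "business": {"business", "finance", "market"},
    "sports": {"sports", "football", "soccer"},
}

def url_tokens(url):
    """Split the host and path of a URL into lowercase word tokens."""
    # Prefix "//" when no scheme is given so urlparse treats the start as a host.
    parsed = urlparse(url if "//" in url else "//" + url)
    text = parsed.netloc + " " + parsed.path
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def classify_url(url):
    """Return the best-matching topic, or None when the URL is empty."""
    tokens = set(url_tokens(url))
    scores = {topic: len(tokens & words) for topic, words in TOPIC_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # No keyword hits at all means the URL carries no evidence of its content.
    return best if scores[best] > 0 else None
```

With these assumed keywords, classify_url("wsj.com") returns None because neither "wsj" nor "com" says anything about the topic, while a URL containing the word "business" would be labeled as business.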

However, most sites with empty URLs will still have other indexed pages whose URLs are classifiable.  wbj.org/business-in-washington is an example.  So even empty URLs can be classified eventually, once other pages within the site have been classified.
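One way to picture this fallback is a simple majority vote over the labels of a site's other indexed pages.  The sketch below is only an assumption about how that could work: classify_site, host_of, and toy_classifier are hypothetical names, the toy classifier is a stand-in for whatever per-URL classifier is actually used, and the wbj.org/about page is invented for the example.

```python
from collections import Counter
from urllib.parse import urlparse

def host_of(url):
    """Return the lowercase host of a URL, with or without a scheme."""
    parsed = urlparse(url if "//" in url else "//" + url)
    return parsed.netloc.lower()

def classify_site(root_url, indexed_urls, classify_page):
    """Label a site by majority vote over its classifiable pages."""
    host = host_of(root_url)
    votes = Counter()
    for url in indexed_urls:
        if host_of(url) != host:
            continue  # ignore pages that belong to a different site
        label = classify_page(url)
        if label is not None:  # empty URLs contribute no vote
            votes[label] += 1
    return votes.most_common(1)[0][0] if votes else None

def toy_classifier(url):
    """Stand-in per-URL classifier (an assumption for this example)."""
    return "business" if "business" in url.lower() else None

pages = ["wbj.org/business-in-washington", "wbj.org/about"]  # second URL is hypothetical
site_label = classify_site("wbj.org", pages, toy_classifier)
```

Here the empty root URL wbj.org picks up the business label from its one classifiable page.  If no indexed page is classifiable, the function returns None and the URL stays empty.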

Still, it is worth mentioning that empty URLs are a potential limitation of the URL classification method.