Word Noise and Site Classification

June 12th, 2010

In a previous post I discussed the problems that traditional website categorization methods have with sites that lack text, links, or anchor text.  Here is the other extreme:  word noise.

Some websites simply have too much content, and that content is all over the spectrum.  There is so much text covering so many topics that the categorization of the website becomes muddled.  This phenomenon is known as word noise:  too many words for a text-based classifier to find a clear signal.

It is true that text summarization techniques have proven useful in these situations; however, an equally effective (if not more so) and less expensive method of classification here is URL categorization, which works from the words in the address rather than the words on the page.

You may wonder whether an incredibly long URL might itself be a form of word noise when employing this form of classification.  Yet even large and ungainly URLs are generally broken up into a root followed by some combination of post names, categories, archives, pages, blog titles, and tags.  Rather than creating word noise, these URLs practically categorize themselves, due in large part to modern SEO techniques.
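
As a rough illustration of how a URL breaks into a root plus category, tag, and post-name segments, here is a minimal Python sketch.  It is only a sketch under stated assumptions:  the keyword lists, the stop-token list, the function names, and the example.com address are all made up for this example, and a real URL classifier would use a trained model rather than hand-written keyword sets.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword sets, invented purely for illustration.
CATEGORY_KEYWORDS = {
    "sports": {"football", "soccer", "nba", "sports"},
    "cooking": {"recipe", "recipes", "cooking", "baking"},
    "technology": {"software", "programming", "tech", "gadgets"},
}

# Tokens that appear in most URLs and say nothing about the topic (assumed list).
STOP_TOKENS = {"www", "com", "org", "net", "html", "htm", "php", "index"}


def url_tokens(url):
    """Split the host and path of a URL into lowercase word tokens."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path.lower()
    # Anything that is not a letter (dots, slashes, hyphens, digits) is a separator.
    raw = re.split(r"[^a-z]+", host + "/" + path)
    return [t for t in raw if t and t not in STOP_TOKENS]


def categorize_url(url):
    """Return the category whose keywords match the most URL tokens, or None."""
    tokens = url_tokens(url)
    scores = {
        category: sum(1 for t in tokens if t in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


example = "http://www.example.com/category/recipes/2010/06/easy-bread-baking.html"
print(url_tokens(example))
print(categorize_url(example))
```

Even a long address like the one above reduces to a handful of tokens, which is why the length of the URL does not hurt:  the category and post-name segments carry most of the signal.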