Blog Identification

The program Blog Identification tries to identify a blog (or any other webpage) by comparing it with a set of selected blogs: some of my personal favorites, plus blogs from the articles Världens bästa bloggar ("The world's greatest blogs") and Bästa svenska bloggarna ("The best Swedish bloggers/blogs") in the Swedish magazine Internetworld. (Some of the listed blogs were not included because I had problems accessing them. Sorry.) The only parameter right now is the URL of the webpage. Later versions may add more bells and whistles, such as different weighting schemes.

The program uses n-gram distance/similarity as the similarity metric, quite close to the method described in the paper N-Gram Based Text Categorization by William B. Cavnar and John M. Trenkle. For a simple cross-validation of all the selected blogs, i.e. where each blog is compared with all the others, see crossvalidation.txt. The "total closeness" of the blogs is listed at the end of that file.
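
As a rough illustration of this kind of metric, here is a minimal Python sketch of the rank-based "out-of-place" distance from the Cavnar and Trenkle paper. It is not the program's actual code: the function names, the underscore padding, the 1 to 5 character n-gram range and the 300-entry profile size are all assumptions chosen for the example.

```python
import re
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first."""
    counts = Counter()
    # Split into lowercase words made of letters only (digits and punctuation dropped).
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = "_" + word + "_"  # mark word boundaries (assumed padding scheme)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; ties are broken alphabetically so the order is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, ref_profile):
    """Sum of rank differences; n-grams missing from ref_profile get a fixed penalty."""
    ref_rank = {gram: rank for rank, gram in enumerate(ref_profile)}
    max_penalty = len(ref_profile)
    return sum(
        abs(rank - ref_rank[gram]) if gram in ref_rank else max_penalty
        for rank, gram in enumerate(doc_profile)
    )


def closest(doc_text, references):
    """Return the name of the reference text with the smallest distance to doc_text."""
    doc_profile = ngram_profile(doc_text)
    return min(
        references,
        key=lambda name: out_of_place(doc_profile, ngram_profile(references[name])),
    )


if __name__ == "__main__":
    # Hypothetical stand-ins for the text of two known blogs.
    blogs = {
        "blog_a": "the cat sat on the mat",
        "blog_b": "zzz qqq xxx zzz",
    }
    print(closest("the cat sat", blogs))
```

A smaller distance means the two texts have more similar n-gram rankings, so the "identified" blog is simply the reference with the minimum distance. In practice the reference texts would be the downloaded contents of the selected blogs.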

Blog Identification was presented and explained in some detail in my Swedish blog post Bloggidentifiering (Blog Identification). That post may also have some further comments on the program/subject.



URL to compare: (default hakank.blogg)

Created by Hakan Kjellerstrand hakank@bonetmail.com
Last modified: Tue Aug 21 18:49:45 CEST 2012