January 13, 2004
Latest SIGKDD (Special Interest Group on Knowledge Discovery and Data Mining) Explorations
SIGKDD (ACM Special Interest Group on Knowledge Discovery and Data Mining) publishes the newsletter Explorations a few times a year. The latest issue (Vol 5, Issue 2) is a special issue on Microarray Data Mining.
It also contains other interesting papers (all PDF files), e.g.
Tom Fawcett: 
"In vivo" Spam Filtering: A Challenge Problem for KDD
Abstract: 
Spam, also known as Unsolicited Commercial Email (UCE), is the bane of email communication. Many data mining researchers have addressed the problem of detecting spam, generally by treating it as a static text classification problem. True in vivo spam filtering has characteristics that make it a rich and challenging domain for data mining. Indeed, real-world datasets with these characteristics are typically difficult to acquire and to share. This paper demonstrates some of these characteristics and argues that researchers should pursue in vivo spam filtering as an accessible domain for investigating them.
Tom Fawcett is an old favorite. See, e.g., Bibliography on Fraud Detection, What does music look like?, and publications.
S. Sarawagi, S. Srinivasan, V. G. Vinod Vydiswaran, K. Bhudhia:
Resolving citations in a paper repository
Abstract:
In this paper, we describe our process of creating a citation graph from a given repository of physics publications in LaTeX format. The task involved a series of information extraction, data cleaning, matching and ranking steps. This paper describes the challenges we faced along the way and the issues involved in resolving them.
Shawndra Hill, Foster Provost:
The Myth of the Double-Blind Review?  Author Identification Using Only Citations
Abstract: 
Prior studies have questioned the degree of anonymity of the double-blind review process for scholarly research articles. For example, one study based on a survey of reviewers concluded that authors often could be identified by reviewers using a combination of the author's reference list and the referee's personal background knowledge. For the KDD Cup 2003 competition's "Open Task," we examined how well various automatic matching techniques could identify authors within the competition's very large archive of research papers. This paper describes the issues surrounding author identification, how these issues motivated our study, and the results we obtained. The best method, based on discriminative self-citations, identified authors correctly 40-45% of the time. One main motivation for double-blind review is to eliminate bias in favor of well-known authors. However, identification accuracy for authors with substantial publication history is even better (60% accuracy for the top-10% most prolific authors, 85% for authors with 100 or more prior papers).
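To make the idea of identifying authors from citations alone a bit more concrete, here is a minimal Python sketch. It is not the method from the paper, just one plausible reading of "discriminative self-citations": each cited paper votes for its own authors, and rarely cited papers count more, since a citation to an obscure paper is more likely to be a self-citation. The data layout, the function names (`citation_counts`, `guess_author`) and the weighting formula are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def citation_counts(papers):
    """Count how many papers in the archive cite each paper.

    `papers` maps paper_id -> {"authors": [...], "cites": [...]}
    (a hypothetical layout, not the KDD Cup 2003 data format).
    """
    counts = Counter()
    for paper in papers.values():
        for cited in set(paper["cites"]):
            counts[cited] += 1
    return counts

def guess_author(reference_list, papers, counts):
    """Guess an author of an anonymous paper from its reference list.

    Every cited paper votes for each of its authors. The vote weight
    (an assumption, not taken from the paper) shrinks as the cited
    paper becomes more widely cited.
    """
    scores = defaultdict(float)
    for ref in sorted(set(reference_list)):
        if ref not in papers:
            continue  # citation could not be resolved to an archive paper
        weight = 1.0 / math.log(2 + counts[ref])
        for author in papers[ref]["authors"]:
            scores[author] += weight
    if not scores:
        return None
    return max(scores, key=scores.get)

# Tiny made-up archive, for illustration only.
papers = {
    "p1": {"authors": ["alice"], "cites": []},
    "p2": {"authors": ["alice", "bob"], "cites": ["p1"]},
    "p3": {"authors": ["carol"], "cites": ["p1"]},
    "p4": {"authors": ["dave"], "cites": ["p3"]},
}
counts = citation_counts(papers)
print(guess_author(["p1", "p2", "p3"], papers, counts))
```

The sketch returns a single best guess. A ranked list of candidate authors would be a natural extension when the top scores are close.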
Posted by hakank at January 13, 2004 11:57 AM. Posted to Machine learning/data mining