Badge problem

Author: Hakan Kjellerstrand, hakank@bonetmail.com)

One of the best collections of machine learning data sets (ics.uci.edu/pub/machine-learning-databases/) includes this "recreational" data set called "Badge Problem". I haven't seen any solutions to the problem anywhere, so here is my take.
(The full URL to the badge files is ics.uci.edu/pub/machine-learning-databases/badges.)

In this problem I used the open source machine learning/data mining tool Weka, (written in Java). Go to http://www.cs.waikato.ac.nz/~ml/weka/ to read more. Also read the very inspiring book "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations" (written by Ian H. Witten and Eibe Frank) which describes machine learning and the Weka tool: http://www.cs.waikato.ac.nz/~ml/weka/book.html.

Presentation of the problem

There was not much information about the problem (which makes it fun!). From the informations file badges.info:

1. Title: ML94/COLT94 Badge Problem

2. Source Information
   -- Creator: Haym Hirsh, after an idea by Rob Schapire
   -- Donor: Haym Hirsh (hirsh@cs.rutgers.edu)
   -- Date: September, 1994
 
3. Past Usage:
    Every pre-registered attendee at the 1994 Machine Learning
    Conference and 1994 Computational Learning Theory Conference received
    a badge labeled with a "+" or "-".  The labeling was due to some
    function known only to the badge generator (Haym Hirsh), and it
    depended only on the attendee's name.  The goal for conference
    attendees was to identify the unknown function used to generate the
    +/- labeling.

4. Relevant Information:
    Part of the problem in using an automated program to discover the
    unknown target function is to decide how to encode names such that
    the program can be used.  The data below are presented in the form
    of a +/- label followed by the person's name.  It is up to the
    learning-system user to decide how to convert this data into something
    usable by the system (e.g., what attributes to use if your favorite
    learner requires feature-vector data).

The problem data:

as plain text: badges.data
as ARFF file badges_plain.arff (the file format often used in Weka)

Now, please take some minutes and try it yourself!

My solution

The first thing to realize is that just the data in the original file will not get us any long way in our machine learning project. It must be completed with other attributes such as different lengths, counts etc. But what lengths or counts? Of course, we could sit down for a time and try to solve it in our head (it is very well possible), but my personal object was to use machine learning tools.

Note: I assumed that the solution was to be found by treating the names as string, i.e. it could not be a very far fetched solution such as information not given in the data set, such as the person's length, age or time they talked at the conference. (Later note: The description above states this clearly: it depended only on the attendee's name.)

It took some minutes to think of and generates a couple of possible solutions. As you may notice by the number of attributes I tried, I didn't get the solution directly.

The attributes I tried was:

Attribute name, and type	Explanation
name {...}	all the names (given in the original)
length numeric	length of name
even_odd {0,1}	length of name even or odd?
first_char_vowel {0,1}	is first character a vowel?
second_char_vowel {0,1}	is second character a vowel?
vowels numeric	number of vowels in the name
consonants numeric	number of consonants
vowel_consonant_ratio numeric	the ratio of vowels / consonant (this is not farfetched???)
spaces numeric	number of spaces
dots numeric	number of "." in the name, i.e. name initials
words numeric	number of words, i.e number of names, including name initials..
class {+,-}	the badge labels (given in the original)

I used this Perl program to generate the ARFF data file. Perl is very handy for this kind of stuff!


#!/usr/bin/perl -w
# 
# Sun Oct 14 08:55:57 2001/hakank@bonetmail.com
# 
# Generate attributes for the badges problem
# 
$|=1;
use strict;

my $file = "badges.data";

open FILE, $file or die "Cannot open $file: $!";

my %data = ();
while (<FILE>) {
  next if /^\s*$/;
  chomp;
  $data{$2} = $1 if m/([+-])\s+(.+)$/;
}

my @names = join ",", map {"\"$_\""} sort keys %data;

print<<EOF;
\@relation badges
\@attribute name {@names}
\@attribute length numeric
\@attribute even_odd {0,1}
\@attribute first_char_vowel {0,1}
\@attribute second_char_vowel {0,1}
\@attribute vowels numeric
\@attribute consonants numeric
\@attribute vowel_consonant_ratio numeric
\@attribute spaces numeric
\@attribute dots numeric
\@attribute words numeric
\@attribute class {-,+}
\@data
EOF

foreach my $name (sort keys %data) {
  my @name              = split //, $name;

  my $length            = length $name;
  my $even_odd_len      = ($length % 2 == 0) ? 1 : 0;
  my ($vowels, $consonants, $spaces, $ratio, $dots)
                        = vowels_consonants_spaces($name);
  my $first_char_vowel  = (is_vowel($name[0]) == 1) ? 1 : 0;
  my $second_char_vowel = (is_vowel($name[1]) == 1) ? 1 : 0;
  my @words             = split /\s+/, $name;
  my $words             = scalar @words;

  print "\"$name\", $length, $even_odd_len, $first_char_vowel, $second_char_vowel, $vowels, $consonants, $ratio, $spaces, $dots, $words, $data{$name}\n";
}


sub is_vowel {
  my ($char) = @_;

  return 1 if lc $char =~ /^[aeiou]$/i;

  return 0;
}


sub vowels_consonants_spaces {
  my ($name)     = @_;

  my $vowels     = 0;
  my $consonants = 0;
  my $spaces     = 0;
  my $dots       = 0;

  for (split //, $name) {
    $_ = lc $_;
    if (/[aeiou]/i)     {
      $vowels++;
    } elsif ($_ eq " ") {
      $spaces++;
    } elsif ($_ eq ".") {
      $dots++;
    } else              {
      $consonants++
    }

  }

  my $ratio = sprintf "%0.2f", $vowels/$consonants;
  return ($vowels, $consonants, $spaces, $ratio, $dots);

}

For all the different trials I made (i.e. versions of the Perl program), I then tested the generated ARFF file (mainly) with Weka's J48 decision tree algorithm, to see if there was some sort of indication of patterns or regularities.
(The reason I used J48 was that it's one of my favorite in Weka since it's very fast and it give a good overview of the solution, as a decision tree. Luckily the "badge function" was of the right kind.)

And then the following Weka command [I used the command line commands of Weka; there is also a very nice GUI]:

java weka.classifiers.trees.j48.J48 -t badges2.arff

produced the this (somewhat stripped) result:

J48 pruned tree
---------------

second_char_vowel = 0: - (84.0)
second_char_vowel = 1: + (210.0)

....
=== Error on training data ===

Correctly Classified Instances         294              100      %
.....

=== Confusion Matrix ===

   a   b   <-- classified as
  84   0 |   a = -
   0 210 |   b = +
....

Ah, so the labels "+" and "-" corresponds exactly with the occurence of a vowel in the second position in the name!

Here is the final generated ARFF file badges2.arff.

Final note

This was a fun little problem! And I actually learn a thing or two by twiddling with it.

How long did the complete task took? I didn't clock the time exactly, but it probably took about the same time as it did for me to write (and re-edit) this HTML report, somewhere between 30-45 minutes.

I did get astray in my pursuit for a while since I temporarily forgot that in English the letter "Y" is not considered a vowel (which it is in Swedish, my native language). And once I remember that, it was then an easy way forward.

Back to Data Mining, Machine Learning etc
Back to the Weka page
Back to my homepage
Created by Hakan Kjellerstrand hakank@bonetmail.com