
Examples with Weka

Weka

The tool I currently use to understand different techniques and to experiment.
Written in Java and GNU-licensed.

The emphasis here is on understanding the data set.

ARFF format

The ARFF format is an ordinary flat-file format.
This is a problem for normalized data, which has to be flattened into a single table first.
It is possible to connect to databases.

Many other data mining tools also require either flat files or
denormalized tables.

The examples

There are lots of standard example data sets collected at UCI.
Many of them are used in papers, benchmarks, etc.

The example data sets that ship with Weka are the ones most commonly used to illustrate algorithms:
- weather (golf)
- iris
- labor
- contact lenses
- CPU
- soybean
- zoo

Weather

The textbook example in machine learning.
The idea is that one wants to carry out some activity and,
depending on certain factors

outlook
temperature
humidity
windy

the system should decide whether to carry out the activity.

Start Weka
Load data/weather_nominal.arff

@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
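
The file above has a simple layout: a @relation line, one @attribute line per column listing its allowed values, and comma-separated rows after @data. As a rough illustration only, the sketch below reads such a nominal-only file in Python. The function name parse_arff and the cut-down SAMPLE text are made up for this example; the sketch ignores numeric attributes, quoting and missing values, and it is not how Weka itself loads files.

```python
def parse_arff(text):
    """Parse a nominal-only ARFF text into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            values = [value.strip() for value in spec.strip("{}").split(",")]
            attributes.append((name, values))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


# Hypothetical, shortened version of the weather file (two attributes, two rows).
SAMPLE = """\
@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute play {yes, no}

@data
sunny,no
overcast,yes
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation, attributes, rows)
```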

1R (One Rule)

Always start with the simple!
1R builds a one-level decision tree on the single attribute that gives
the fewest misclassifications.
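
To make the idea concrete, here is a minimal sketch of 1R in Python for nominal attributes. It is not Weka's OneR implementation; the function name one_r and the tie-breaking choice (keep the earlier attribute) are assumptions made for this example. The data is the weather file shown earlier.

```python
from collections import Counter, defaultdict

# The 14 weather instances; the last column is the class (play).
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
DATA = """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
"""
ROWS = [line.split(",") for line in DATA.strip().splitlines()]


def one_r(rows, attributes):
    """Return (attribute, rule, correct) for the attribute with the fewest errors."""
    best = None
    for index, name in enumerate(attributes):
        # Count the class labels seen for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[index]][row[-1]] += 1
        # Each attribute value predicts its majority class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        correct = sum(c[rule[value]] for value, c in counts.items())
        # Strict comparison: on a tie the earlier attribute is kept.
        if best is None or correct > best[2]:
            best = (name, rule, correct)
    return best


attribute, rule, correct = one_r(ROWS, ATTRIBUTES)
print(attribute, rule, f"{correct}/{len(ROWS)}")
```

On this data outlook and humidity both classify 10 of 14 instances correctly, so the tie-breaking choice decides which rule is reported; the sketch keeps outlook.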

Choose Classify, OneR, Use Training Set

Result:

  outlook:
	sunny	> no
	overcast	> yes
	rainy	> yes
(10/14 instances correct)

Correctly Classified Instances          10               71.4286 %
Incorrectly Classified Instances         4               28.5714 %

 a b   <- classified as
 7 2 | a = yes
 2 3 | b = no

That is, this rule (a one-level decision tree) classifies 10 of 14 instances correctly, i.e. about 71 %.
Cross-validation gives 28.5%.
Cross-validation gives a much more realistic picture of how the model behaves on unseen data. Let us compare with a more advanced classifier:

C4.5 (J48 in Weka)

Choose j48.J48, do not change any parameters

Result:


outlook = sunny
|   humidity = high: no (3.0)
|   humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)


Test on the full data set:
  Correctly Classified Instances          14              100      %
  Incorrectly Classified Instances         0                0      %

=== Confusion Matrix ===

 a b   <- classified as
 9 0 | a = yes
 0 5 | b = no

100% correct!
Cross-validation, however, gives only 42.8% correct.

Apriori (association rules)

Associate, Apriori (do not change any parameters)

The 10 best association rules are shown below.
Note that there can be either 1 or 2 attributes to the right of the implication arrow.

The number after an attribute is how many instances are covered by that specific
attribute-value pair.
The confidence is the fraction of the instances covered by the
premise that are also covered by the consequence.
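
The definition of confidence can be checked by hand with a few lines of Python. This is only a sketch of the counting, not of the Apriori search itself; the helper names covers and confidence are made up for the example, and the data is the weather file shown earlier.

```python
NAMES = ["outlook", "temperature", "humidity", "windy", "play"]
DATA = """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
"""
INSTANCES = [dict(zip(NAMES, line.split(","))) for line in DATA.strip().splitlines()]


def covers(instance, items):
    """True if the instance has every attribute-value pair in items."""
    return all(instance[name] == value for name, value in items.items())


def confidence(instances, premise, consequence):
    """Return (premise count, premise-and-consequence count, confidence)."""
    premise_count = sum(covers(i, premise) for i in instances)
    both_count = sum(covers(i, premise) and covers(i, consequence) for i in instances)
    ratio = both_count / premise_count if premise_count else 0.0
    return premise_count, both_count, ratio


# A rule where every instance covered by the premise also plays.
print(confidence(INSTANCES, {"humidity": "normal", "windy": "FALSE"}, {"play": "yes"}))
# A weaker rule: only 3 of the 5 sunny days are "no".
print(confidence(INSTANCES, {"outlook": "sunny"}, {"play": "no"}))
```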

Best rules found:

 1. humidity=normal windy=FALSE 4 ==> play=yes 4    conf:(1)
 2. temperature=cool 4 ==> humidity=normal 4    conf:(1)
 3. outlook=overcast 4 ==> play=yes 4    conf:(1)
 4. temperature=cool play=yes 3 ==> humidity=normal 3    conf:(1)
 5. outlook=rainy windy=FALSE 3 ==> play=yes 3    conf:(1)
 6. outlook=rainy play=yes 3 ==> windy=FALSE 3    conf:(1)
 7. outlook=sunny humidity=high 3 ==> play=no 3    conf:(1)
 8. outlook=sunny play=no 3 ==> humidity=high 3    conf:(1)
 9. temperature=cool windy=FALSE 2 ==> humidity=normal play=yes 2    conf:(1)
10. temperature=cool humidity=normal windy=FALSE 2 ==> play=yes 2    conf:(1)

Iris

Another textbook example of data analysis.
Introduced by the statistician Fisher in the mid-1930s.

There are 50 examples of each of 3 different types (classes) of plants:

Iris setosa
Iris versicolor
Iris virginica

And four attributes:

sepal length
sepal width
petal length
petal width

All in centimetres, i.e. they are numeric, except for the classes, which are nominal.

1R

Let us do the same as for weather, i.e. start simple.

Rule:

petallength:
	< 2.45	> Iris-setosa
	< 4.75	> Iris-versicolor
	>= 4.75	> Iris-virginica
(143/150 instances correct)

Correctly Classified Instances         143               95.3333 %

=== Confusion Matrix ===

  a  b  c   <- classified as
 50  0  0 |  a = Iris-setosa
  0 44  6 |  b = Iris-versicolor
  0  1 49 |  c = Iris-virginica

Cross-validation gives 92.6%

J48

The decision tree

petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)


Correctly Classified Instances         147               98      %
Incorrectly Classified Instances         3                2      %

=== Confusion Matrix ===

  a  b  c   <- classified as
 50  0  0 |  a = Iris-setosa
  0 49  1 |  b = Iris-versicolor
  0  2 48 |  c = Iris-virginica

Cross-validation gives 95.3%.
Not a big difference between J48 (cross-validation: 95.3%)
and 1R (cross-validation: 92.6%)!
J48's tree is considerably more complicated than the 1R rule, which
matters for efficiency, understanding, etc.

Occam's razor!

Rules, PRISM

Load contact-lenses.arff

Contains all combinations of various parameters for deciding whether
a person should be fitted with contact lenses (soft or hard) or not.

Run Classifier, PRISM
No cross-validation.

Rules:

If astigmatism = no
   and tear-prod-rate = normal
   and spectacle-prescrip = hypermetrope then soft
If astigmatism = no
   and tear-prod-rate = normal
   and age = young then soft
If age = pre-presbyopic
   and astigmatism = no
   and tear-prod-rate = normal then soft
If astigmatism = yes
   and tear-prod-rate = normal
   and spectacle-prescrip = myope then hard
If age = young
   and astigmatism = yes
   and tear-prod-rate = normal then hard
If tear-prod-rate = reduced then none
If age = presbyopic
   and tear-prod-rate = normal
   and spectacle-prescrip = myope
   and astigmatism = no then none
If spectacle-prescrip = hypermetrope
   and astigmatism = yes
   and age = pre-presbyopic then none
If age = presbyopic
   and spectacle-prescrip = hypermetrope
   and astigmatism = yes then none

Correctly Classified Instances          24              100      %
Incorrectly Classified Instances         0                0      %

=== Confusion Matrix ===

  a  b  c   <- classified as
  5  0  0 |  a = soft
  0  4  0 |  b = hard
  0  0 15 |  c = none

Apriori

Apriori cannot handle numeric attributes/classes.
The data set must be discretized first.
This can be done via Preprocess, Filter.
A small sample of a discretized file:


@relation iris-weka.filters.DiscretizeFilter-O-Rfirst-last

@attribute sepallength {'\'(-inf-5.5]\'','\'(5.5-6.1]\'','\'(6.1-inf)\''}
@attribute sepalwidth {'\'(-inf-2.9]\'','\'(2.9-3.3]\'','\'(3.3-inf)\''}
@attribute petallength {'\'(-inf-1.9]\'','\'(1.9-4.7]\'','\'(4.7-inf)\''}
@attribute petalwidth {'\'(-inf-0.6]\'','\'(0.6-1.7]\'','\'(1.7-inf)\''}
@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}

@data

'\'(-inf-5.5]\'','\'(3.3-inf)\'','\'(-inf-1.9]\'','\'(-inf-0.6]\'',Iris-setosa
'\'(-inf-5.5]\'','\'(2.9-3.3]\'','\'(-inf-1.9]\'','\'(-inf-0.6]\'',Iris-setosa
'\'(-inf-5.5]\'','\'(2.9-3.3]\'','\'(-inf-1.9]\'','\'(-inf-0.6]\'',Iris-setosa
'\'(-inf-5.5]\'','\'(2.9-3.3]\'','\'(-inf-1.9]\'','\'(-inf-0.6]\'',Iris-setosa
'\'(-inf-5.5]\'','\'(3.3-inf)\'','\'(-inf-1.9]\'','\'(-inf-0.6]\'',Iris-setosa
'\'(-inf-5.5]\'','\'(3.3-inf)\'','\'(-inf-1.9]\'','\'(-inf-0.6]\'',Iris-setosa
.......

So there are three "buckets" for each attribute.
The filter parameters can be tweaked.
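
The bucket labels in the file follow a regular pattern: intervals that are open on the left and closed on the right. As a small sketch of how a numeric value ends up in a bucket, the Python function below maps a value to such a label given the cut points. The name bin_label is made up for this example, and the sketch does not show how the filter chooses the cut points; they are simply read off the sepallength header above.

```python
from bisect import bisect_left


def bin_label(value, cuts):
    """Map a numeric value to an interval label such as '(5.5-6.1]'."""
    # Intervals are closed on the right, so a value equal to a cut point
    # belongs to the lower bucket; bisect_left returns exactly that index.
    index = bisect_left(cuts, value)
    lower = "-inf" if index == 0 else str(cuts[index - 1])
    if index == len(cuts):
        return f"({lower}-inf)"
    return f"({lower}-{cuts[index]}]"


SEPAL_LENGTH_CUTS = [5.5, 6.1]  # cut points taken from the sepallength attribute

for value in (5.1, 5.5, 5.8, 7.0):
    print(value, bin_label(value, SEPAL_LENGTH_CUTS))
```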

Load the file
Run Apriori (as is):

Result:

Best rules found:

 1. petallength='(-inf-1.9]' 50 ==> petalwidth='(-inf-0.6]' class=Iris-setosa 50    conf:(1)
 2. petalwidth='(-inf-0.6]' 50 ==> petallength='(-inf-1.9]' class=Iris-setosa 50    conf:(1)
 3. class=Iris-setosa 50 ==> petallength='(-inf-1.9]' petalwidth='(-inf-0.6]' 50    conf:(1)
 4. petallength='(-inf-1.9]' petalwidth='(-inf-0.6]' 50 ==> class=Iris-setosa 50    conf:(1)
 5. petallength='(-inf-1.9]' class=Iris-setosa 50 ==> petalwidth='(-inf-0.6]' 50    conf:(1)
 6. petalwidth='(-inf-0.6]' class=Iris-setosa 50 ==> petallength='(-inf-1.9]' 50    conf:(1)
 7. petalwidth='(-inf-0.6]' 50 ==> class=Iris-setosa 50    conf:(1)
 8. class=Iris-setosa 50 ==> petalwidth='(-inf-0.6]' 50    conf:(1)
 9. petallength='(-inf-1.9]' 50 ==> class=Iris-setosa 50    conf:(1)
10. class=Iris-setosa 50 ==> petallength='(-inf-1.9]' 50    conf:(1)

EM (clustering)

Cluster handling in Weka's GUI is not particularly good compared with,
e.g., statistics programs such as R.
We give it a try anyway.

Load the discretized iris file
Cluster, EM, click "Classes to clusters evaluation"
Right-click the result and choose to visualize the clusters

There is a lot of information here, and it is not entirely easy to interpret.
The interesting settings are:

  X: Cluster
  Y: Class

The horizontal axis (and the colours) then represents the clusters and
the vertical axis the classes.
Adding a little jitter shows the number of instances more clearly.

Add a little jitter

Statistics for the discretized file:

Clusters:
0       52 ( 35%)
1       48 ( 32%)
2       50 ( 33%)

Class attribute: class
Classes to Clusters:

  0  1  2  <- assigned to cluster
  0  0 50 | Iris-setosa
  5 45  0 | Iris-versicolor
 47  3  0 | Iris-virginica

This is the EM algorithm's conclusion:

Cluster 0 <- Iris-virginica
Cluster 1 <- Iris-versicolor
Cluster 2 <- Iris-setosa

Incorrectly clustered instances :	8.0	  5.3333 %
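
The error figure can be reproduced from the classes-to-clusters matrix. The Python sketch below simply gives each cluster its majority class and counts everything else as an error. That is a simplifying assumption made for this example (the function name classes_to_clusters is also made up), and it may differ from Weka's own procedure when two clusters share the same majority class.

```python
# Rows are classes, columns are clusters 0..2, copied from the output above.
CLASSES = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
MATRIX = [
    [0, 0, 50],
    [5, 45, 0],
    [47, 3, 0],
]


def classes_to_clusters(matrix, classes):
    """Return (cluster -> class mapping, error count, error percentage)."""
    mapping = {}
    correct = 0
    for cluster in range(len(matrix[0])):
        column = [row[cluster] for row in matrix]
        # Index of the class with the most instances in this cluster.
        best = max(range(len(column)), key=column.__getitem__)
        mapping[cluster] = classes[best]
        correct += column[best]
    total = sum(sum(row) for row in matrix)
    errors = total - correct
    return mapping, errors, 100.0 * errors / total


mapping, errors, percent = classes_to_clusters(MATRIX, CLASSES)
print(mapping, errors, round(percent, 4))
```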

5.3% is quite good.
Compare this with the cross-validated errors above: 4.7% for J48 and about 7.3% for 1R.

The drawback is that we do not get an explicit model.
But if the problem is about prediction, that matters less.

Rules with numeric data (M5')

Load cpu.arff
The attributes are numeric (apart from vendor), and so is the class,
which J48 has problems with (unless one discretizes).

There are a number of different techniques

regression ("ordinary statistics")
combination of rules and regressions -> M5

Choose the method M5' as classifier

Pruned training model tree:

MMAX <= 14000 : LM1 (141/4.18%)
MMAX >  14000 : LM2 (68/51.8%)

Models at the leaves:

  Smoothed (complex):

    LM1:  class = 4.15
                  - 2.05vendor=honeywell,ipl,ibm,cdc,ncr,basf,gould,siemens,nas,adviser,sperry,amdahl
                  + 5.43vendor=adviser,sperry,amdahl - 5.78vendor=amdahl
                  + 0.00638MYCT + 0.00158MMIN + 0.00345MMAX + 0.552CACH
                  + 1.14CHMIN + 0.0945CHMAX
    LM2:  class = -113
                  - 56.1vendor=honeywell,ipl,ibm,cdc,ncr,basf,gould,siemens,nas,adviser,sperry,amdahl
                  + 10.2vendor=adviser,sperry,amdahl - 10.9vendor=amdahl
                  + 0.012MYCT + 0.0145MMIN + 0.0089MMAX + 0.808CACH + 1.29CHMAX


Correlation coefficient                  0.9763

This is a strong correlation (+1 is the maximum).
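
For reference, a correlation coefficient of this kind can be computed as the Pearson correlation between the actual and the predicted class values; that this is exactly what the figure above measures is an assumption here. The sketch below uses made-up numbers purely for illustration (they are not taken from cpu.arff).

```python
from math import sqrt


def correlation(actual, predicted):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    covariance = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return covariance / sqrt(var_a * var_p)


# Hypothetical values, made up for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
perfect = [2 * a + 1 for a in actual]   # exact linear relation
opposite = [-a for a in actual]         # exact inverse relation

print(correlation(actual, perfect), correlation(actual, opposite))
```

A perfect linear relation gives +1, a perfect inverse relation gives -1, and values near 0 indicate no linear relation.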

Feature Selection

labor.arff
This is a real example (with missing data) in which a number of
experts assessed various wage negotiations.
'good' denotes negotiations that both parties accepted (or should have accepted).

There are 16 attributes (which I will not go into in detail).

J48 gives the following with cross-validation:

wage-increase-first-year <= 2.5: bad (15.27/2.27)
wage-increase-first-year > 2.5
|   statutory-holidays <= 10: bad (10.77/4.77)
|   statutory-holidays > 10: good (30.96/1.0)


Correctly Classified Instances          42               73.6842 %
Incorrectly Classified Instances        15               26.3158 %

=== Confusion Matrix ===

  a  b   <- classified as
 12  8 |  a = bad
  7 30 |  b = good

With very many attributes, the program takes a very long time to run.
Then it can be a good idea to do a "feature selection".

There are a number of (traditionally statistical) techniques for this:

factor analysis
principal components etc

which can of course be used.

Here one of the data mining variants is shown:

Go to Feature Selection
Attribute Evaluator: InfoGainAttributeEval
(in principle what ID3 and C4.5 use)
Search Method: Ranker
Cross-validation: 10 folds

Result:

average merit      average rank  attribute
 0.301 +- 0.019     1   +- 0       2 wage-increase-first-year
 0.19  +- 0.017     2.3 +- 0.46    3 wage-increase-second-year
 0.164 +- 0.025     2.9 +- 0.7    11 statutory-holidays
 0.135 +- 0.012     4.2 +- 0.6    14 contribution-to-dental-plan
 0.117 +- 0.012     5.4 +- 0.8    16 contribution-to-health-plan
 0.112 +- 0.015     5.8 +- 0.98   12 vacation
 0.086 +- 0.016     7.2 +- 0.6    13 longterm-disability-assistance
 0.056 +- 0.012     9.2 +- 0.98    7 pension
 0.061 +- 0.032     9.4 +- 3.1     9 shift-differential
 0.05  +- 0.011     9.5 +- 0.67    5 cost-of-living-adjustment
 0.033 +- 0.009    11.5 +- 1.12   15 bereavement-assistance
 0.032 +- 0.005    11.9 +- 0.54    4 wage-increase-third-year
 0.025 +- 0.012    12.7 +- 1.27   10 education-allowance
 0.019 +- 0.003    13.4 +- 0.66    8 standby-pay
 0.013 +- 0.04     14.7 +- 3.9     6 working-hours
 0     +- 0        14.9 +- 0.3     1 duration

This is a list of the attributes, sorted by average rank, showing how "good"
each attribute is according to the chosen criterion.
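
As the evaluator's name suggests, the criterion is information gain: how much the entropy of the class drops once the attribute's value is known. The Python sketch below computes it for nominal attributes and ranks them. It uses the small weather data set from earlier rather than labor.arff (whose data is not listed here), ignores missing values and cross-validation, and the function names are made up for this example.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
DATA = """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
"""
ROWS = [line.split(",") for line in DATA.strip().splitlines()]


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def info_gain(rows, index):
    """Class entropy minus the weighted entropy after splitting on one attribute."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in set(row[index] for row in rows):
        subset = [row[-1] for row in rows if row[index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


# Highest gain first, like the Ranker search method.
ranking = sorted(
    ((info_gain(ROWS, i), name) for i, name in enumerate(ATTRIBUTES)),
    reverse=True,
)
for gain, name in ranking:
    print(f"{gain:.3f}  {name}")
```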

Now we reduce the data set to only 3 attributes (plus the class) and test again.

Go to Classify
Choose Meta, AttributeSelectedClassifier
Choose classifier: j48
Choose evaluator: InfoGainAttributeEval
Choose search: Ranker, Num to select: 3
Click OK

Test the model with cross-validation. The tree:

wage-increase-first-year <= 2.5: bad (15.27/2.27)
wage-increase-first-year > 2.5
|   statutory-holidays <= 10
|   |   wage-increase-first-year <= 4: bad (5.52/0.52)
|   |   wage-increase-first-year > 4: good (5.26/1.0)
|   statutory-holidays > 10: good (30.96/1.0)

The result: 77.2% correct.
Compare with 73.7% for J48 with all attributes.
But it is by no means certain that feature selection improves the result.
The big gains from feature selection are:

less time and computing power
an understanding of which attributes matter most
reduced complexity of the model (faster to run afterwards)

Java

All the techniques are (relatively) well documented and come with source code (GNU-licensed).
They can be used directly in your own applications.


created by hakank