
A little about the algorithms

Don't go into the algorithms in detail; rather, hint at a few different principles.

1R

(Book: p. 78)

Pseudo-code for 1R

For each attribute,
  For each value of the attribute, make a rule as follows:
    count how often each class appears
    find the most frequent class
    make the rule assign that class to this attribute-value
  Calculate the error rate of the rules
Choose the rules with the smallest error rate

Note: "missing" is always treated as a separate
 attribute value
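
A minimal sketch of 1R in R (assuming a data frame of factors with the class
in column class.col; the name one.r is made up for this example):

one.r <- function(data, class.col) {
  best <- list(attribute = NULL, rules = NULL, errors = Inf)
  for (a in setdiff(names(data), class.col)) {
    # value x class counts; useNA treats "missing" as its own value
    tab <- table(data[[a]], data[[class.col]], useNA = "ifany")
    rules <- colnames(tab)[apply(tab, 1, which.max)]  # majority class per value
    names(rules) <- rownames(tab)
    errors <- sum(tab) - sum(apply(tab, 1, max))      # non-majority instances
    if (errors < best$errors)
      best <- list(attribute = a, rules = rules, errors = errors)
  }
  best
}

A call such as one.r(weather.data, "play") would return the single best
attribute together with its value-to-class rules and error count.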

Dealing with numeric attributes

Numeric attributes are discretized: the range of the
   attribute is divided into a set of intervals
   Instances are sorted according to the attribute's values
   Breakpoints are placed where the (majority) class
       changes (so that the total error is minimized)
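
A minimal R sketch of the breakpoint placement (the helper name breakpoints is
made up; the full 1R additionally enforces a minimum number of majority-class
instances per interval to avoid overfitting):

# Sort the instances on the attribute and place a candidate breakpoint
# midway between adjacent values where the class changes.
breakpoints <- function(x, class) {
  ord <- order(x)
  x <- x[ord]; class <- class[ord]
  change <- which(class[-1] != class[-length(class)])  # class i != class i+1
  unique((x[change] + x[change + 1]) / 2)
}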

Decision trees (ID3/C4.5)

(Book: p. 89)

Constructing decision trees, Pseudo-algorithm

Normal procedure: top down in recursive divide-and-conquer fashion
    First: attribute is selected for root node and branch
           is created for each possible attribute value
    Then: the instances are split into subsets (one for
          each branch extending from the node)
    Finally: procedure is repeated recursively for each
             branch, using only instances that reach the branch
    Process stops if all instances have the same class
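
A compact R sketch of this recursion (the tree is returned as nested lists
with class labels at the leaves; select.attribute is a placeholder for the
selection heuristic discussed next):

build.tree <- function(data, class.col) {
  classes <- data[[class.col]]
  if (length(unique(classes)) == 1)        # all instances in one class: leaf
    return(as.character(classes[1]))
  if (ncol(data) == 1)                     # no attributes left: majority leaf
    return(names(which.max(table(classes))))
  a <- select.attribute(data, class.col)   # the heuristic discussed next
  subsets <- split(data[, names(data) != a, drop = FALSE],
                   data[[a]], drop = TRUE) # one subset per attribute value
  lapply(subsets, build.tree, class.col = class.col)
}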

A criterion for attribute selection

  Which is the best attribute?
     The one that results in the smallest tree
     Heuristic: choose the attribute that produces the "purest" nodes
  Popular impurity criterion: information gain
     Information gain increases with the average purity
     of the subsets that an attribute produces
  Strategy: choose the attribute that results in the greatest
    information gain
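
In R, entropy and information gain for nominal attributes can be sketched as
follows (a minimal version that ignores numeric attributes and missing values):

# Entropy of a class distribution, in bits
entropy <- function(class) {
  p <- table(class) / length(class)   # class proportions
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of splitting on attribute a: the entropy before
# the split minus the weighted entropy of the resulting subsets
info.gain <- function(data, a, class.col) {
  subsets <- split(data[[class.col]], data[[a]])
  weights <- sapply(subsets, length) / nrow(data)
  entropy(data[[class.col]]) - sum(weights * sapply(subsets, entropy))
}

The select.attribute placeholder in the sketch above could then simply return
the attribute with the largest info.gain.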

Rule-based/Covering algorithms (PRISM)

(Book: p. 103)
Decision trees can be converted into a set of rules.
Simple operation: read the rules straight off the tree!
For example, the path outlook=sunny -> humidity=high -> no becomes the rule
IF outlook=sunny AND humidity=high THEN play=no.
More efficient conversions are not trivial.

Pseudo-code for PRISM


For each class C
  Initialize E to the instance set
  While E contains instances in class C
    Create a rule R with an empty left-hand side that predicts class C
    Until R is perfect (or there are no more attributes to use) do
      For each attribute A not mentioned in R, and each value v,
        Consider adding the condition A = v to the left-hand side of R
        Select A and v to maximize the accuracy p/t
          (break ties by choosing the condition with the largest p)
      Add A = v to R
    Remove the instances covered by R from E

t = the total number of instances covered by the rule
p = the number of those that are positive (i.e. in class C)
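
A small R helper showing how p and t would be computed for one candidate
condition (all names are illustrative):

# Accuracy p/t of adding the condition A = v to a rule for class C,
# evaluated on the instances E still under consideration
condition.accuracy <- function(E, A, v, C, class.col) {
  covered <- E[[A]] == v                   # instances satisfying A = v
  t <- sum(covered)
  p <- sum(covered & E[[class.col]] == C)  # ... that are also in class C
  c(p = p, t = t, accuracy = if (t > 0) p / t else 0)
}

PRISM evaluates this for every attribute-value pair not yet in the rule and
adds the condition with the highest p/t.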

Association Rules (Apriori)

(Book: p. 108)
Naïve method for finding association rules:
    Using the standard separate-and-conquer method,
    treating every possible combination of attribute
    values as a separate class
Two problems:
    Computational complexity
    Resulting number of rules (which would have to be
    pruned on the basis of support and confidence)
But: we can look for high support rules directly!

Item sets

  Support: the number of instances correctly covered by the association rule
        This is the same as the number of instances covered by
        all the tests in the rule (LHS and RHS!)
  Item: one test/attribute-value pair

  Item set: all items occurring in a rule

  Goal: only rules that exceed a pre-defined minimum support
        We can achieve this by finding all item sets with the given
        minimum support and generating rules from them!
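
Counting support in R might look like this (a sketch; the item set is given as
a named vector of attribute = value tests):

# Support of an item set: the number of instances that satisfy
# every test (attribute = value pair) in the set
support <- function(data, itemset) {
  sum(apply(data[, names(itemset), drop = FALSE], 1,
            function(row) all(row == itemset)))
}

For instance, support(weather.data, c(humidity = "normal", windy = "FALSE"))
should give 4, matching rule 1 in the example below.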

Idea of the algorithm

  Finding one-item sets is easy!

  Idea: use one-item sets to generate two-item sets,
    two-item sets to generate three-item sets, ...
        If (A B) is a frequent item set, then (A) and (B) have
         to be frequent item sets as well!
        In general: if X is a frequent k-item set, then all (k-1)-
         item subsets of X are also frequent
        Compute k-item sets by merging (k-1)-item sets
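
A sketch of the merge step in R, with item sets represented as sorted
character vectors such as c("humidity=normal", "windy=FALSE") (the name
merge.itemsets is made up):

# Merge two frequent (k-1)-item sets that agree on their first k-2 items
# into a k-item candidate; candidates with an infrequent (k-1)-subset
# are pruned before their support is counted.
merge.itemsets <- function(s1, s2) {
  k1 <- length(s1)
  if (k1 > 1 && !identical(s1[1:(k1 - 1)], s2[1:(k1 - 1)])) return(NULL)
  candidate <- sort(unique(c(s1, s2)))
  if (length(candidate) == k1 + 1) candidate else NULL
}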
Example (rules found in the weather data):
 1. humidity=normal windy=FALSE 4 ==> play=yes 4    conf:(1)
 2. temperature=cool 4 ==> humidity=normal 4    conf:(1)
 3. outlook=overcast 4 ==> play=yes 4    conf:(1)
 4. temperature=cool play=yes 3 ==> humidity=normal 3    conf:(1)
 5. outlook=rainy windy=FALSE 3 ==> play=yes 3    conf:(1)
 6. outlook=rainy play=yes 3 ==> windy=FALSE 3    conf:(1)
 7. outlook=sunny humidity=high 3 ==> play=no 3    conf:(1)
 8. outlook=sunny play=no 3 ==> humidity=high 3    conf:(1)
 9. temperature=cool windy=FALSE 2 ==> humidity=normal play=yes 2    conf:(1)
10. temperature=cool humidity=normal windy=FALSE 2 ==> play=yes 2    conf:(1)

Latent Semantic Analysis

Also called "Latent Semantic Indexing".
Used for text extraction, text search, market basket analysis,
recommender systems, etc.
(It is, unfortunately, patented!)
Example: a simple database of some words in 17 documents.
               B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 B13 B14 B15 B16 B17
algorithms      0  0  1  0  1  0  1  0  0   0   0   0   0   0   0   0   0
application     0  0  1  0  0  0  0  0  0   0   0   0   0   0   0   0   1
delay           0  0  0  0  0  0  0  0  0   0   1   1   0   0   0   0   0
differential    0  0  0  1  0  0  0  1  0   1   1   1   1   1   1   0   0
equations       1  1  0  1  0  0  0  1  0   1   1   1   1   1   1   0   0
implementation  0  0  1  0  0  0  1  0  0   0   0   0   0   0   0   0   0
integral        1  0  0  0  0  0  0  0  0   0   0   0   0   0   0   1   1
introduction    0  0  0  0  1  1  0  0  0   0   0   0   0   0   0   0   0
methods         0  0  0  0  0  0  0  1  0   0   0   0   0   1   0   0   0
nonlinear       0  0  0  0  0  0  0  0  1   0   0   0   1   0   0   0   0
ordinary        0  0  0  0  0  0  0  1  0   1   0   0   0   0   0   0   0
oscillation     0  0  0  0  0  0  0  0  0   0   1   1   0   0   0   0   0
partial         0  0  0  1  0  0  0  0  0   0   0   0   1   0   0   0   0
problem         0  0  0  0  0  1  1  0  0   0   0   0   0   0   0   1   0
systems         0  0  0  0  0  1  0  1  1   0   0   0   0   0   0   0   0
theory          0  0  1  0  0  0  0  0  0   0   1   1   0   0   0   0   1
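
For convenience, the same matrix entered as an R object, so that the code
further down can be run against it:

terms <- c("algorithms", "application", "delay", "differential",
           "equations", "implementation", "integral", "introduction",
           "methods", "nonlinear", "ordinary", "oscillation",
           "partial", "problem", "systems", "theory")
data <- matrix(c(
  0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,  # algorithms
  0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,  # application
  0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,  # delay
  0,0,0,1,0,0,0,1,0,1,1,1,1,1,1,0,0,  # differential
  1,1,0,1,0,0,0,1,0,1,1,1,1,1,1,0,0,  # equations
  0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,  # implementation
  1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,  # integral
  0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,  # introduction
  0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,  # methods
  0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,  # nonlinear
  0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,  # ordinary
  0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,  # oscillation
  0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,  # partial
  0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,  # problem
  0,0,0,0,0,1,0,1,1,0,0,0,0,0,0,0,0,  # systems
  0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,1   # theory
), nrow = 16, byrow = TRUE,
   dimnames = list(terms, paste0("B", 1:17)))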

We can now do a simple SVD to obtain "latent connections" across the different documents (shopping baskets, recommendations, etc.).
If we search for "applications" and "theory", we get the following significance
values for each word.
Normally only the relevant ones are shown, i.e. those that exceed a certain threshold.
1 0.9933647 algorithms
2 0.9966932 application
3 0.7899557 delay
4 -0.0428252 differential
5 -0.02678696 equations
6 0.993845 implementation
7 0.9971754 integral
8 0.995632 introduction
9 -0.4715554 methods
10 -0.5130393 nonlinear
11 -0.4715554 ordinary
12 0.7899557 oscillation
13 -0.4359513 partial
14 0.9942236 problem
15 -0.2422558 systems
16 0.9784504 theory   

The idea is to use the well-known SVD (Singular Value Decomposition)
technique, which yields singular values and singular vectors.
The code below is simplified S-plus/R syntax.
# set the query words ("application" and "theory")
# to 1 in an otherwise all-zero vector: the query vector
query <- rep(0, nrow(data))
for (w in words)
{
  query[match(w, rownames(data))] <- 1
}

# do the SVD: data = U %*% diag(D) %*% t(V)
s <- svd(as.matrix(data))
U <- s$u; D <- s$d; V <- s$v

# Project the query into the k-dimensional latent ("concept") space
query.vec <-
   t(query) %*% U[, 1:k] %*% solve(diag(D[1:k]))

#
# The query vector is then compared to the representations in the
# latent space using cosine similarity, accepting only cosines over
# a certain threshold. Here each term (row of U) is compared, which
# gives the word significances listed above; comparing to documents
# would use the rows of V instead.
#
for (i in 1:nrow(U))
{
   x <- cos.vec(query.vec, U[i, 1:k])
   if (x > threshold)
   {
      cat(i, x, rownames(data)[i], "\n")
   }
}

# definition of cos.vec: the cosine of the angle between A and B,
#   cos(A, B) = (A . B) / (|A| |B|)
# (in a script, define this before the loop above)
cos.vec <- function(A, B) sum(A * B) / (sqrt(sum(A^2)) * sqrt(sum(B^2)))
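
The code above assumes that words, k and threshold are already set. A possible
setup (the values of k and threshold are illustrative guesses; the source does
not state which ones produced the numbers above, and note that the matrix row
is "application", singular):

words <- c("application", "theory")  # the query terms (rows of data)
k <- 2            # number of latent dimensions kept (illustrative)
threshold <- 0.9  # minimum cosine for a term to be reported (illustrative)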



created by hakank