1R
(Book: p. 78)
Pseudo-code for 1R
For each attribute,
For each value of the attribute, make a rule as follows:
count how often each class appears
find the most frequent class
make the rule assign that class to this attribute-value
Calculate the error rate of the rules
Choose the rules with the smallest error rate
Note: "missing" is always treated as a separate
attribute value
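A minimal R sketch of this procedure, assuming the data is a data frame of nominal attributes with the class in a named column (the function and argument names are my own):

# 1R: one rule set per attribute (majority class per attribute value);
# pick the attribute whose rule set makes the fewest errors.
one.r <- function(df, class) {
  attrs <- setdiff(names(df), class)
  errors <- sapply(attrs, function(a) {
    tab <- table(df[[a]], df[[class]], useNA = "ifany")  # "missing" is its own value
    sum(tab) - sum(apply(tab, 1, max))                   # errors = non-majority counts
  })
  best <- names(which.min(errors))
  tab <- table(df[[best]], df[[class]], useNA = "ifany")
  rules <- colnames(tab)[apply(tab, 1, which.max)]       # majority class per value
  names(rules) <- rownames(tab)
  list(attribute = best, rules = rules, errors = errors)
}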
Dealing with numeric attributes
Numeric attributes are discretized: the range of the
attribute is divided into a set of intervals
Instances are sorted according to attribute's values
Breakpoints are placed where the (majority) class
changes (so that the total error is minimized)
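A rough sketch of that discretization step in R, assuming a numeric attribute x and a class vector cls (placeholder names):

# Sort on the attribute and put a candidate breakpoint midway between
# consecutive values where the class label changes.
ord <- order(x)
xs <- x[ord]
cs <- cls[ord]
change <- which(cs[-1] != cs[-length(cs)])       # positions where the class changes
breaks <- (xs[change] + xs[change + 1]) / 2      # breakpoints between those values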
Decision trees (ID3/C4.5)
(Book: p. 89)
Constructing decision trees, Pseudo-algorithm
Normal procedure: top down in recursive divide-and-conquer fashion
First: attribute is selected for root node and branch
is created for each possible attribute value
Then: the instances are split into subsets (one for
each branch extending from the node)
Finally: procedure is repeated recursively for each
branch, using only instances that reach the branch
Process stops if all instances have the same class
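A compact R sketch of this recursive procedure for nominal attributes, where df is a data frame, class names the class column, and select is an attribute-selection function such as the information-gain selector sketched further below (all names are my own):

# Top-down, recursive divide-and-conquer tree construction.
build.tree <- function(df, class, select) {
  if (nrow(df) == 0) return(NA)                            # empty branch
  cls <- df[[class]]
  if (length(unique(cls)) == 1 || ncol(df) == 1)           # pure node, or no attributes left
    return(names(which.max(table(cls))))                   # leaf: (majority) class
  a <- select(df, class)                                   # attribute for this node
  subsets <- split(df[setdiff(names(df), a)], df[[a]])     # one subset per attribute value
  list(attribute = a,
       branches = lapply(subsets, build.tree, class = class, select = select))
}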
A criterion for attribute selection
Which is the best attribute?
The one which will result in the smallest tree
Heuristic: choose the attribute that produces the "purest" nodes
Popular impurity criterion: information gain
Information gain increases with the average purity
of the subsets that an attribute produces
Strategy: choose attribute that results in greatest
information gain
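Entropy and information gain can be sketched in R like this (usable as the select function in the tree sketch above; names are my own):

# Entropy (in bits) of a class vector.
entropy <- function(cls) {
  p <- table(cls) / length(cls)
  -sum(ifelse(p > 0, p * log2(p), 0))
}

# gain(S, A) = entropy(S) - sum over values v of |Sv|/|S| * entropy(Sv)
info.gain <- function(df, class, attr) {
  cls <- df[[class]]
  subsets <- split(cls, df[[attr]])
  entropy(cls) -
    sum(sapply(subsets, function(s) length(s) / length(cls) * entropy(s)))
}

# Attribute selection: the attribute with the greatest information gain.
best.attribute <- function(df, class) {
  attrs <- setdiff(names(df), class)
  attrs[which.max(sapply(attrs, function(a) info.gain(df, class, a)))]
}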
Association Rules (Apriori)
(Book: p. 108)
Naïve method for finding association rules:
Using the standard separate-and-conquer method,
treating every possible combination of attribute
values as a separate class
Two problems:
Computational complexity
Resulting number of rules (which would have to be
pruned on the basis of support and confidence)
But: we can look for high support rules directly!
Item sets
Support: number of instances correctly covered by association rule
The same as the number of instances covered by
all tests in the rule (LHS and RHS!)
Item: one test/attribute-value pair
Item set: all items occurring in a rule
Goal: only rules that exceed pre-defined support
We can do it by finding all item sets with the given
minimum support and generating rules from them!
Idea of the algorithm
Finding one-item sets easy!
Idea: use one-item sets to generate two-item sets,
two-item sets to generate three-item sets, ...
If (A B) is a frequent item set, then (A) and (B)
have to be frequent item sets as well!
In general: if X is a frequent k-item set, then all
(k-1)-item subsets of X are also frequent
Compute k-item set by merging (k-1)-item sets
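A small R sketch of this level-wise search, assuming transactions are given as a list of character vectors of items and minsup is a minimum support count (the representation and names are my own):

# Level-wise frequent item set mining: frequent (k-1)-item sets are
# merged pairwise into candidate k-item sets, which are then
# filtered on minimum support.
apriori.itemsets <- function(transactions, minsup) {
  support <- function(s) sum(sapply(transactions, function(t) all(s %in% t)))

  freq <- Filter(function(s) support(s) >= minsup,
                 as.list(sort(unique(unlist(transactions)))))   # frequent 1-item sets
  result <- freq
  while (length(freq) > 1) {
    k <- length(freq[[1]]) + 1
    cand <- unique(lapply(combn(length(freq), 2, simplify = FALSE),
                          function(p) sort(union(freq[[p[1]]], freq[[p[2]]]))))
    cand <- Filter(function(s) length(s) == k, cand)            # keep proper k-item merges only
    freq <- Filter(function(s) support(s) >= minsup, cand)
    result <- c(result, freq)
  }
  result
}

The subset property above could also be used to prune candidates whose (k-1)-subsets are not all frequent; here the support filter alone does that job for small data.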
Example (rules for the weather data; each line shows support counts and confidence):
1. humidity=normal windy=FALSE 4 ==> play=yes 4 conf:(1)
2. temperature=cool 4 ==> humidity=normal 4 conf:(1)
3. outlook=overcast 4 ==> play=yes 4 conf:(1)
4. temperature=cool play=yes 3 ==> humidity=normal 3 conf:(1)
5. outlook=rainy windy=FALSE 3 ==> play=yes 3 conf:(1)
6. outlook=rainy play=yes 3 ==> windy=FALSE 3 conf:(1)
7. outlook=sunny humidity=high 3 ==> play=no 3 conf:(1)
8. outlook=sunny play=no 3 ==> humidity=high 3 conf:(1)
9. temperature=cool windy=FALSE 2 ==> humidity=normal play=yes 2 conf:(1)
10. temperature=cool humidity=normal windy=FALSE 2 ==> play=yes 2 conf:(1)
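The rule-generation step behind output like the above can be sketched as follows, with the same list-of-transactions representation and a minimum confidence threshold (names are my own):

# For every non-empty proper subset LHS of a frequent item set,
# form LHS ==> (item set \ LHS); confidence = support(item set) / support(LHS).
rules.from.itemset <- function(itemset, transactions, minconf) {
  if (length(itemset) < 2) return(list())
  support <- function(s) sum(sapply(transactions, function(t) all(s %in% t)))
  rules <- list()
  for (k in 1:(length(itemset) - 1)) {
    for (lhs in combn(itemset, k, simplify = FALSE)) {
      conf <- support(itemset) / support(lhs)
      if (conf >= minconf)
        rules[[length(rules) + 1]] <-
          list(lhs = lhs, rhs = setdiff(itemset, lhs), conf = conf)
    }
  }
  rules
}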
Latent Semantic Analysis
Also called "Latent Semantic Indexing".
Used for text extraction, searching in text, market basket analysis,
recommender systems, etc.
(It is, unfortunately, patented!)
Example: a simple database of some words in 17 documents.
B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 B13 B14 B15 B16 B17
algorithms 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0
application 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
delay 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0
differential 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 0
equations 1 1 0 1 0 0 0 1 0 1 1 1 1 1 1 0 0
implementation 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0
integral 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
introduction 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
methods 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0
nonlinear 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0
ordinary 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0
oscillation 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0
partial 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
problem 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0
systems 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0
theory 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 1
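For reference, the term-document matrix above entered as an R matrix (the variable names data, terms and docs are my own); the query code further down can then be run against it:

terms <- c("algorithms", "application", "delay", "differential", "equations",
           "implementation", "integral", "introduction", "methods", "nonlinear",
           "ordinary", "oscillation", "partial", "problem", "systems", "theory")
docs <- paste0("B", 1:17)
data <- matrix(c(
  0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,   # algorithms
  0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,   # application
  0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,   # delay
  0,0,0,1,0,0,0,1,0,1,1,1,1,1,1,0,0,   # differential
  1,1,0,1,0,0,0,1,0,1,1,1,1,1,1,0,0,   # equations
  0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,   # implementation
  1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,   # integral
  0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,   # introduction
  0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,   # methods
  0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,   # nonlinear
  0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,   # ordinary
  0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,   # oscillation
  0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,   # partial
  0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,   # problem
  0,0,0,0,0,1,0,1,1,0,0,0,0,0,0,0,0,   # systems
  0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,1    # theory
), nrow = length(terms), byrow = TRUE, dimnames = list(terms, docs))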
We can now do a simple SVD to obtain "latent connections"
across the documents (shopping baskets, recommendations, etc.).
If we query on "application" and "theory", we get the following
significance values for each word.
Normally only the relevant ones are shown, i.e. those that exceed a certain threshold.
1 0.9933647 algorithms
2 0.9966932 application
3 0.7899557 delay
4 -0.0428252 differential
5 -0.02678696 equations
6 0.993845 implementation
7 0.9971754 integral
8 0.995632 introduction
9 -0.4715554 methods
10 -0.5130393 nonlinear
11 -0.4715554 ordinary
12 0.7899557 oscillation
13 -0.4359513 partial
14 0.9942236 problem
15 -0.2422558 systems
16 0.9784504 theory
The idea is to use the well-known SVD (Singular Value
Decomposition) technique, which gives the singular values and singular
vectors (closely related to eigenvalues and eigenvectors).
The code below is (simplified) R/S-Plus code.
# Set the query words ("application" and "theory") to 1 in an
# otherwise all-zero query vector over the terms.
words <- c("application", "theory")
query <- rep(0, nrow(data))
query[match(words, rownames(data))] <- 1

# Do the SVD.
s <- svd(data)        # s$u (terms), s$d (singular values), s$v (documents)
k <- 2                # number of latent dimensions to keep

# Create the query vector in the k-dimensional latent space.
query.vec <- t(query) %*% s$u[, 1:k] %*% solve(diag(s$d[1:k]))

# The query vector is then compared with each term (the rows of U),
# accepting only cosines over a certain threshold.
cos.vec <- function(a, b)                        # cos(A,B) = A . B / (|A| |B|)
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

threshold <- 0.9                                 # some chosen threshold
for (i in 1:nrow(s$u)) {
  x <- cos.vec(query.vec, s$u[i, 1:k])
  if (x > threshold)
    cat(i, x, rownames(data)[i], "\n")
}