New Markov words II - (experimental)
This is just an experiment in Markov generation of words.
Given some words chosen (perhaps at random) from a dictionary or list, this
program generates new words (a "new word" is a word not in the dictionary).
It has many "word distance" parameters that can be used.
Please note that this program is highly experimental, with many (undocumented)
features and parameters. There are two different "modes", depending on whether
a "source word" is used: if you type in a source word, the system tries to
generate words that are as close to this word as possible. This could be used for
generating funny words ("puns"), rhyming words, etc. If there is no source word,
the system just generates words and does not use any distance function at
all. Note also that some of the examples are slightly biased toward the Swedish
language.
The parameter n sets how many letters there are in each n-gram. The default
is 3, and n must be a number less than or equal to 6.
The program generates a list of words. The exact number depends on all the
parameters, but at least about 10 words will be generated.
The larger the value of n, the smaller the chance of generating a word that is
not in the dictionary, i.e. a new word. (Generated words that were already in
the dictionary are shown as an HTML comment in the result page; view the HTML
source of that page to see them.)
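As an illustration of the idea described above, here is a minimal Python sketch of letter-level n-gram Markov generation. It is not the code behind this page: the function names (build_model, generate_word, new_words), the "^" and "$" padding markers, and the seed and max_tries parameters are all assumptions made for the example.

```python
import random
from collections import defaultdict


def build_model(words, n=3):
    """Map each (n-1)-letter context to the letters seen right after it."""
    model = defaultdict(list)
    for word in words:
        # "^" pads the start and "$" marks the end (assumed markers,
        # so the input words must not contain these characters).
        padded = "^" * (n - 1) + word + "$"
        for i in range(len(padded) - n + 1):
            context = padded[i:i + n - 1]
            model[context].append(padded[i + n - 1])
    return model


def generate_word(model, n=3, rng=random, max_len=30):
    """Walk the chain from the start context until the end marker."""
    context = "^" * (n - 1)
    letters = []
    while len(letters) < max_len:
        nxt = rng.choice(model[context])
        if nxt == "$":
            break
        letters.append(nxt)
        context = (context + nxt)[1:]  # slide the window one letter
    return "".join(letters)


def new_words(words, n=3, count=10, seed=0, max_tries=1000):
    """Collect generated words that are not in the input list."""
    rng = random.Random(seed)
    model = build_model(words, n)
    known = set(words)
    found = set()
    for _ in range(max_tries):
        if len(found) >= count:
            break
        candidate = generate_word(model, n, rng)
        if candidate and candidate not in known:
            found.add(candidate)
    return sorted(found)
```

With a small word list and n close to the word length, every context tends to belong to a single dictionary word, so the chain can only reproduce existing words. That is the effect of a large n described above.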
Feel free to mail me (hakank@gmail.com) if you have questions, comments, recommendations, or maybe an
interesting collection of related words.
Some comments about metrics etc
Markov chains, n-grams
For an overall explanation of Markov chains, see
google search.
Edit distance, Levenshtein
For an overall explanation of Edit distance see
google search.
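For readers who want something concrete, the following Python sketch computes the standard Levenshtein distance (the minimum number of single-letter insertions, deletions and substitutions). This page does not say how the program itself computes or weights the distance, so treat this as the textbook version only; the function name is an assumption.

```python
def levenshtein(a, b):
    """Minimum number of single-letter edits turning a into b."""
    # prev[j] holds the distance between the processed prefix of a
    # and the first j letters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

Candidate words could then be ranked by their distance to a source word, for example with sorted(candidates, key=lambda w: levenshtein(w, source)).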
LCS (Longest Common Subsequence)
The LCS metric is the length of the longest common subsequence (possibly with
other letters in between) of two strings, divided by the length of the longer
string. This gives a score between 0 (no match) and 1 (perfect match, same
word).
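The definition above can be sketched in Python as follows. The function names are assumptions, and so is the choice to score two empty strings as 1, since the text does not cover that case.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_score(a, b):
    """LCS length divided by the length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return lcs_length(a, b) / longest
```

For example, "abcde" and "ace" share the subsequence "ace", so the score is 3/5 = 0.6.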
For LCS and Dice scores (below), see e.g.
Linguistics isn't always the answer: Word comparison in
computational linguistics by Lars Borin, for more context and info.
Dice scores
Quote from Borin (op. cit.):
"The Dice score for an n-gram comparison of two strings is calculated as:
C/(A+B) where C is the number of unique n-grams common to the two strings,
and A and B the total number of unique n-grams in each string."
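A small Python sketch of the formula as quoted, using sets of letter n-grams. The function names and the scale parameter are assumptions added for the example. Note that the Dice coefficient is commonly stated as 2C/(A+B), which makes identical strings score 1; passing scale=2.0 gives that form, while the default follows the quote as written here.

```python
def ngrams(word, n=2):
    """Set of unique letter n-grams in word (empty if word is too short)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice_score(a, b, n=2, scale=1.0):
    """scale * C / (A + B), with C, A, B counted over unique n-grams."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    total = len(grams_a) + len(grams_b)
    if total == 0:
        return 0.0  # assumption: no n-grams at all means no match
    common = len(grams_a & grams_b)
    return scale * common / total
```

For "night" and "nacht" with bigrams, each word has 4 unique bigrams and they share only "ht", so the quoted formula gives 1/8 = 0.125 and the scaled form gives 0.25.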
Common letters metric
This is the metric for the Common letters weight.
Let L, C and V be the numbers of letters, consonants and vowels, respectively,
that the two words have in common. The metric is then the sum L + C + V,
weighted by the length of the words. 1 is the best score (same word), 0 is the worst.
(Maybe the components should be split up?).
Created by Hakan Kjellerstrand hakank@gmail.com