New Markov words II - (experimental)

This is just an experiment in Markov generation of words. Given some (maybe randomly) chosen words from a dictionary or list, this program generates new words (a "new word" is a word not in the dictionary). It has a lot of "word distance" parameters which can be tuned.

Please note that this program is highly experimental, with a lot of (undocumented) features and parameters. There are two different "modes", depending on whether a "source word" is given: if you type in a source word, the system tries to generate words that are as close to this word as possible. This could be used for generating funny words ("puns"), rhyme words etc. If there is no source word, the system will just generate words and will not use any distance function at all. Note also that some of the examples are slightly biased toward the Swedish language.

The parameter n in n-gram sets how many letters there will be in each n-gram. The default is 3 (it must be a number less than or equal to 6). The program generates a list of words; the exact number depends on all the parameters, but at least about 10 words will be generated. The larger the value of n, the smaller the chance of generating a word not in the dictionary, i.e. a new word. (The generated words which were already in the dictionary are shown as an HTML comment in the result page, i.e. see the HTML source of that page.)

Feel free to mail me (hakank@bonetmail.com) if you have questions, comments, recommendations, or maybe an interesting collection of related words.

Parameters

Source word: (default "")    (if no source word: just generate words, no distance function is used)
Repeat source word times (max 1000, default 20)
Support word: (default "") (highly experimental)
Support word weight: (max 5, default 1 if any support word, else 0) (also highly experimental)

Select language/dictionary:
(Instead of using a wordlist, you can enter words in the list below)
Swedish words longer than 20 chars
English words longer than 13 chars
Swedish 2000 (fixed) random words
Swedish work titles
Programming languages
Philosophers
Authors
My friends
Countries (english names)
Swedish 'Första Moseboken' (Genesis, of Bible fame)
Just a test....
Sindarin (Tolkien)
Instead of one of the word lists above you can use your own word list. Type in (or copy) the words you want to use. Use space or new line as word separator.
(See e.g. http://www.ruf.rice.edu/~pound/#datasets for some small word lists.)



Select metric:
Edit distance (+ all weights)
Reverse edit distance
Ngram similarity
Just prefix/suffix metric
Just length metric
Just LCS metric
Just Dice score metric
Just Common letters
No distance (just random)

Set n: (max 6, default 3)
Num ngrams to use/generate: (max 2000, default 1000)
Prefix weight: (max 5, default 0)
Suffix weight: (max 5, default 0)
LCS weight: (max 5, default 0) (Longest Common Subsequence)
Dice score weight: (max 5, default 0)
Common letters weight: (max 5, default 0)
Length weight: (max 5, default 0)
n-gram metric n: (min 1, max 10, default 3)

Number of sample words from dictionary: (max 2000, default 1000)
Minimum words forced to generate: (max 6, default 1) (Usually many more words will be generated.)
Sample method: Random sample from dictionary (recommended) Use word melded words (just for Swedish and English dictionaries)
Show only new words: Yes No
Add subsets of source word to word list: (i.e. give some "hints"): Yes No
Show distances: Yes No
Show info in HTML comment: Yes No
Generate word list link: Yes No    Repeat best words: (max 30, default 0)
Prune long words: (size, ratio of source word) (min 2, max 100, default 5; 100 means approximately no pruning at all)



Some comments about metrics etc

Markov chains, n-grams

For an overall explanation of Markov chains, see a Google search.
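To give a feeling for the idea, here is a minimal Python sketch of letter n-gram Markov generation. It is not the code behind this page; the function names, the "^" and "$" padding markers (assumed not to occur in the words themselves) and the max_len and max_tries limits are all made up for illustration. Each (n-1)-letter context is mapped to the letters seen after it in the word list, and a word is built by repeatedly picking one of those letters at random.

```python
import random
from collections import defaultdict

def build_model(words, n=3):
    """Map each (n-1)-letter context to the letters that follow it."""
    model = defaultdict(list)
    for word in words:
        # "^" pads the start, "$" marks the end (assumed absent from the words)
        padded = "^" * (n - 1) + word + "$"
        for i in range(len(padded) - n + 1):
            context = padded[i:i + n - 1]
            model[context].append(padded[i + n - 1])
    return model

def generate_word(model, n=3, rng=random, max_len=30):
    """Walk the chain from the start context until the end marker is drawn."""
    context = "^" * (n - 1)
    letters = []
    while len(letters) < max_len:
        nxt = rng.choice(model[context])
        if nxt == "$":
            break
        letters.append(nxt)
        context = (context + nxt)[1:]  # slide the context window one letter
    return "".join(letters)

def new_words(words, n=3, count=10, max_tries=10000, seed=None):
    """Collect up to `count` generated words that are not in the word list."""
    rng = random.Random(seed)
    known = set(words)
    model = build_model(words, n)
    found = []
    for _ in range(max_tries):
        word = generate_word(model, n, rng)
        if word and word not in known and word not in found:
            found.append(word)
            if len(found) == count:
                break
    return found
```

Calling new_words(["banana", "bandana", "cabana", "ananas", "savanna"], n=3, count=5) returns up to five words built from the trigrams of the list. With a larger n the contexts get longer, so fewer continuations are possible and more of the output is just words from the list again.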

Edit distance, Levenshtein

For an overall explanation of edit distance, see a Google search.
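As a quick illustration, the standard dynamic programming version of the Levenshtein distance can be sketched in Python as follows. The function name is my own, and this is the plain textbook algorithm with unit costs, not necessarily how the program combines it with the weights above.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-letter insertions,
    deletions and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

For example, edit_distance("kitten", "sitting") is 3. Generated words could then be ordered by closeness to a source word with sorted(candidates, key=lambda w: edit_distance(source, w)), where candidates and source are hypothetical names.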

LCS (Longest Common Subsequence)

The LCS metric is the length of the longest common subsequence (maybe with other letters in between) of two strings, divided by the length of the longer string. This gives a score between 0 (no match) and 1 (perfect match, same word). For LCS and Dice scores (below), see e.g. "Linguistics isn't always the answer: Word comparison in computational linguistics" by Lars Borin, for more context and info.
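The definition above can be sketched in Python like this. The function names are made up for the example, and treating two empty strings as a perfect match is my own assumption, since the description does not cover that case.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)  # extend the common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a letter in a or b
        prev = curr
    return prev[-1]

def lcs_score(a, b):
    """LCS length divided by the length of the longer string (0 to 1)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as the same word
    return lcs_length(a, b) / longest
```

For "markov" and "marov" the longest common subsequence is "marov" (5 letters) and the longer string has 6 letters, so the score is 5/6.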

Dice scores

Quote from Borin (op. cit.): "The Dice score for an n-gram comparison of two strings is calculated as: C/(A+B) where C is the number of unique n-grams common to the two strings, and A and B the total number of unique n-grams in each string."
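A small Python sketch of the quoted formula follows. The function names and the scale argument are my own, added for illustration. With scale=1 it computes C/(A+B) exactly as quoted, which means two identical words score 0.5; the usual Dice coefficient is 2C/(A+B), which scale=2 gives, and then identical words score 1. Which of the two the program uses is not stated here.

```python
def ngrams(word, n=2):
    """Set of unique n-grams (substrings of n letters) in word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def dice_score(a, b, n=2, scale=1):
    """scale * C / (A + B), with C the number of unique n-grams common to
    both strings and A, B the number of unique n-grams in each string."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    total = len(grams_a) + len(grams_b)
    if total == 0:
        return 0.0  # assumption: words shorter than n have nothing to compare
    return scale * len(grams_a & grams_b) / total
```

For "night" and "nacht" with n=2, each word has four unique bigrams and only "ht" is shared, so dice_score("night", "nacht") is 1/8 = 0.125, or 0.25 with scale=2.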

Common letters metric

This is the metric for the Common letters weight. Let L, C, V be the numbers of common letters, consonants and vowels, respectively, of the two words. The metric (distance) is then the sum L + C + V, weighted by the length of the words. 1 is the best score (same word), 0 is the worst. (Maybe the components should be split up?)


Created by Hakan Kjellerstrand hakank@bonetmail.com