General Information and Overview of Operation
Wuggy is a pseudoword generator whose approach combines the best of existing methods.
- Traditionally, lists of pseudowords have been available that are based on combining subsyllabic elements that are legal in the language of choice. For instance, by combining the legal onset b (as in bat) with a legal nucleus u (as in fun) and a legal coda p (as in ship), we get the pseudoword bup, which is legal (pronounceable) in English. The problem with this approach is that it leads to a combinatorial explosion. For monosyllabic words, the list is still tractable (hundreds of thousands of pseudowords), but combining elements into polysyllabic strings quickly leads to billions of possibilities. Choosing an appropriate pseudoword for your needs becomes an impossibility because there are too many options to match.
- Other approaches are based on guessing good pseudowords by combining high frequency letter sequences. These approaches make it possible to produce longer sequences, but these sequences are not necessarily legal in the language, and, by design, the generated sequences do not contain low frequency letters.
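The combinatorial explosion described in the first approach can be illustrated with a minimal sketch. The segment inventories below are tiny toy lists chosen for illustration (they are assumptions, not Wuggy data); a real language has far more legal onsets, nuclei, and codas, so the counts grow much faster.

```python
from itertools import product

# Toy inventories (assumed for illustration only).
ONSETS = ["b", "m", "s", "tr"]
NUCLEI = ["a", "i", "u"]
CODAS = ["p", "lk", "n"]

def syllables():
    """All onset + nucleus + coda combinations from the toy inventories."""
    return ["".join(parts) for parts in product(ONSETS, NUCLEI, CODAS)]

def count_candidates(n_syllables):
    """Number of strings made by concatenating n_syllables toy syllables."""
    return len(syllables()) ** n_syllables

print("bup" in syllables())   # True: b + u + p
print(count_candidates(1))    # 36
print(count_candidates(3))    # 46656
```

Even with only four onsets, three nuclei, and three codas, three-syllable strings already number in the tens of thousands.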
The core algorithm in Wuggy is able to generate all possible pseudowords in the language (depending on the quality of its input, it may make a few impossible ones). However, by employing some smart tricks, Wuggy doesn’t have to generate all these pseudowords before it knows which pseudoword is good for you. In fact, the tougher the restrictions you give Wuggy, the faster it will find the solution. Wuggy does this by restricting the model of the language before it starts generating candidates instead of searching through the list afterwards.
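The idea of restricting the model before generating, rather than filtering a full list afterwards, can be sketched as follows. This is not Wuggy's actual algorithm: the inventories, the function names, and the choice of segment length as the restriction are all assumptions made for illustration.

```python
from itertools import product

# Toy inventories (assumed for illustration only).
ONSETS = ["b", "m", "s", "tr", "str"]
NUCLEI = ["a", "i", "u", "ea", "ou"]
CODAS = ["p", "lk", "n", "st", "t"]

def filter_afterwards(shape):
    """Generate every combination, then keep those whose segment lengths match."""
    hits, examined = [], 0
    for parts in product(ONSETS, NUCLEI, CODAS):
        examined += 1
        if tuple(len(p) for p in parts) == shape:
            hits.append("".join(parts))
    return hits, examined

def restrict_first(shape):
    """Prune each inventory to the required segment length, then generate."""
    pruned = [
        [seg for seg in inventory if len(seg) == n]
        for inventory, n in zip((ONSETS, NUCLEI, CODAS), shape)
    ]
    hits = ["".join(parts) for parts in product(*pruned)]
    return hits, len(hits)

shape = (1, 1, 2)  # segment lengths of "milk": m + i + lk
slow_hits, slow_examined = filter_afterwards(shape)
fast_hits, fast_examined = restrict_first(shape)
print(slow_examined, fast_examined)  # 125 18
```

Both functions return the same candidates, but the restricted version only ever builds strings that satisfy the constraint, which is why tighter restrictions mean less work.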
Overview of operation
Wuggy has a native look-and-feel on the different platforms (Mac OS X, Windows, Linux). The figure below shows Wuggy’s main window on OS X. After starting the program, a language module should be chosen from the ‘General Settings’ on the right. This loads a syllabified language lexicon, which allows the program to compute the model for the language. The lexicon is also used to syllabify input and to test the lexicality of generated forms. Loading a language module may take a few minutes on older computers.
Then, reference words can be input by typing them in the appropriate column or reading them from a file. In the figure above, the words ‘milk’ and ‘sentence’ have been input and then syllabified by choosing ‘Tools>Syllabify’ from the main menu.
When input is given, the program is ready to generate candidates. The default values for pseudoword generation are the ones we found most appropriate for our own research. By default, Wuggy outputs only pseudowords and searches for up to 10 seconds, or until 10 candidates are generated. Additionally, the candidates are required to match the subsyllabic structure of the input word, to have the same length (in letters) as the input word, to have the smallest possible deviations in transition probabilities from the input word, and to match two out of three subsyllabic segments.
Choosing ‘Generate>Run’ from the main menu opens the Results window. The figure below shows the output for the words ‘milk’ and ‘sentence’ using the default output restrictions and with all output options checked.
Overview of Options
First column (Word)
Reference words can be entered manually or read from a text file by choosing ‘File>Open Input Sequences’ from the main menu; the input file needs to be in tab-delimited format. To ensure maximal flexibility and compatibility, Wuggy reads Unicode (UTF-8) encoded files.
Second Column (Syllables)
Wuggy will automatically syllabify all words it finds in its lexicon. Choosing ‘Tools>Syllabify’ from the main menu fills the second column with the syllabified versions of the input in the first column. For input words that are not found in the lexicon, a syllabified version should be entered manually.
Third column (Matching Expression)
Typing a regular expression here will require all generated pseudowords to match that regular expression. For instance, if only pseudowords ending in –ing are required, one would type .+ing$ in this column. Information about regular expressions is widely available online (e.g., http://en.wikipedia.org/wiki/Regular_expression).
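To check what an expression such as .+ing$ accepts before entering it, it can be tried out in Python's re module. The candidate strings below are made up, and applying the pattern with re.match is an assumption about how the expression is used, not a statement about Wuggy's internals.

```python
import re

# The expression from the text: at least one character, then "ing" at the end.
pattern = re.compile(r".+ing$")

# Made-up candidate strings for illustration.
candidates = ["bliking", "ing", "stinge", "trumping"]
matching = [c for c in candidates if pattern.match(c)]

print(matching)  # ['bliking', 'trumping']
```

Note that "ing" on its own is rejected, because .+ requires at least one character before the ending.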
Language module
Currently, there are language modules available for Basque, Dutch, English, French, Serbian, and Spanish.
Output type
This option determines whether Wuggy outputs only pseudowords, only words, or both. Choosing ‘word’ makes Wuggy find the closest word neighbors of a target word.
Maximal number of candidates
The maximum number of candidates that will be generated for each word.
Maximal search time per word
The maximal time that will be spent trying to find candidates for each word.
Match length of subsyllabic segments
Checking this option will output only candidates with the same subsyllabic structure as the input word. This option speeds up the output because there are fewer candidates to consider.
Match letter length
Checking this option will generate candidates with the same number of letters as the input word. This option is redundant if the option above is checked.
Match transition frequencies (concentric search)
This option operates the concentric search algorithm as described above. First, the algorithm will try to generate candidates that exactly match the transition frequencies of the reference word. Then the maximal allowed deviation in transition frequencies will increase by powers of 2 (i.e., +/-2, +/-4, +/-8, etc.). Not checking this option will generate pseudowords without consideration for transition frequencies. However, because the problem space is less well defined in that case, it may take longer.
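The widening of the allowed deviation can be sketched as a loop over bands of 0, 2, 4, 8, and so on. This is a simplified illustration that filters a fixed pool of candidates; Wuggy itself restricts its language model rather than filtering a list. The function names, the max_band cutoff, and all frequency values are hypothetical.

```python
def max_deviation(candidate_freqs, reference_freqs):
    """Largest absolute difference between paired transition frequencies."""
    return max(abs(c - r) for c, r in zip(candidate_freqs, reference_freqs))

def concentric_search(reference_freqs, pool, wanted, max_band=1024):
    """Collect candidates in widening bands: exact match, then +/-2, +/-4, ..."""
    found = []
    band = 0
    while len(found) < wanted and band <= max_band:
        for name, freqs in pool.items():
            if name not in found and max_deviation(freqs, reference_freqs) <= band:
                found.append(name)
                if len(found) == wanted:
                    break
        band = 2 if band == 0 else band * 2
    return found

# Hypothetical transition frequencies for a reference word and four candidates.
reference = [120, 45, 80]
pool = {
    "misk": [118, 45, 83],   # largest deviation 3
    "mife": [120, 45, 80],   # exact match
    "melk": [122, 44, 80],   # largest deviation 2
    "molp": [150, 45, 80],   # largest deviation 30
}
print(concentric_search(reference, pool, wanted=3))  # ['mife', 'melk', 'misk']
```

Candidates closest to the reference are found first, so stopping after a fixed number of candidates yields the ones with the smallest deviations.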
Match subsyllabic segments
Here, a particular ratio of overlapping segments can be specified. The default value (2/3) generates candidates that are very word-like but not easily identifiable as related to an existing word.
This will give syllabified output. Unchecking this option will give plain
Indicates whether the generated form is a word [W] or a nonword [N]. This is particularly useful with the ‘Output Type>Both’ option in General Settings.
OLD20
Checking this option will compute the average Orthographic Levenshtein Distance between the generated candidate and its twenty most similar words in the lexicon. This gives a good indication of the neighborhood size and density of the nonword. A small value of OLD20 indicates that many words can be made by changing a single letter (either by substitution, deletion, or insertion). The difference in OLD20 between the generated nonword and the reference word is also shown. Lower values indicate that the candidate has a denser neighborhood. Setting this option considerably slows down Wuggy.
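The measure can be sketched in a few lines of Python. This is a minimal illustration, not Wuggy's implementation: the function names are hypothetical, the tiny lexicon is made up, and excluding the string itself from its own neighbors is an assumption. The lexicon is assumed to contain at least one other word.

```python
def levenshtein(a, b):
    """Edit distance counting substitutions, deletions, and insertions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]

def old_n(candidate, lexicon, n=20):
    """Mean edit distance from candidate to its n closest words in lexicon."""
    distances = sorted(levenshtein(candidate, w) for w in lexicon if w != candidate)
    closest = distances[:n]
    return sum(closest) / len(closest)

# Toy lexicon; n is lowered to 3 because the list is so short.
toy_lexicon = ["mist", "mask", "risk", "milk", "sentence"]
print(old_n("misk", toy_lexicon, n=3))  # 1.0
```

Computing a distance to every word in the lexicon for every candidate is what makes this option slow.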
Neighbors at edit distance 1
This option outputs the number of orthographic neighbors at edit distance 1. This is the number of words that can be made from the candidate by substituting, deleting, or inserting a single letter. Setting this option slows down the algorithm considerably.
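A neighbor count of this kind can be sketched as below. The function names and the toy lexicon are hypothetical, and the sketch treats duplicate lexicon entries as one word; it is an illustration of the definition, not Wuggy's code.

```python
def is_neighbor(a, b):
    """True if b differs from a by one substitution, deletion, or insertion."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # Same length: exactly one position may differ (substitution).
        return sum(x != y for x, y in zip(a, b)) == 1
    shorter, longer = sorted((a, b), key=len)
    # Lengths differ by one: removing one letter from the longer string
    # must give the shorter one (deletion or insertion).
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))

def count_neighbors(candidate, lexicon):
    """Number of distinct lexicon words at edit distance 1 from candidate."""
    return sum(is_neighbor(candidate, w) for w in set(lexicon))

toy_lexicon = ["silk", "mill", "mild", "ilk", "milky", "mile", "sentence", "milk"]
print(count_neighbors("milk", toy_lexicon))  # 6
```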
The figure above shows the output when both OLD20 and Neighbors at edit distance 1 have been selected for the target word milk. This output clearly shows that all but one of the proposed nonwords have fewer neighbors than the target word milk. For instance, misk has eight neighbors at edit distance 1, which is three fewer than milk. Similarly, its average edit distance to the 20 closest neighbors is 1.6, which is 0.25 more than milk. Mife looks like a better choice than misk, as it has one neighbor more at edit distance 1 than milk, rather than three fewer. Given that OLD20 is an important variable in lexical decision RTs, researchers may prefer to keep this as close to the word value as possible, as long as it does not change the difference in transition frequency too much. This shows the advantage of having more than one candidate proposed by Wuggy.
Number of overlapping segments
With this option checked, the number of segments that overlap in the generated sequence and the reference sequence will be shown, expressed as a fraction.
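As a rough illustration of such a fraction, the sketch below compares subsyllabic segments position by position. The position-wise comparison, the function name, and the segmentation of the example strings are assumptions; Wuggy's own counting may differ.

```python
def overlap(candidate_segments, reference_segments):
    """Return (shared, total): segments identical at the same position."""
    shared = sum(c == r for c, r in zip(candidate_segments, reference_segments))
    return shared, len(reference_segments)

# Assumed segmentation into onset, nucleus, coda.
shared, total = overlap(["m", "i", "sk"], ["m", "i", "lk"])
print(f"{shared}/{total}")  # 2/3
```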
This option shows the largest difference in transition frequencies between the subsyllabic segments in the generated sequence and those in the reference sequence. For instance, if this measure is 14, the generated sequence contains a transition that occurs in 14 more words than the equivalent transition in the reference sequence. The frequencies of all other transitions are closer to the frequencies of the transitions in the reference sequence. Checking this option will also output the sum of all transition frequency deviations (absolute values) and a column showing where in the string the maximally deviating transition is situated.
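The three values described here can be sketched as follows. The function name, the dictionary keys, and the frequency values are hypothetical; the example is built so that the largest deviation is 14, as in the text.

```python
def deviation_statistics(candidate_freqs, reference_freqs):
    """Largest signed deviation, summed absolute deviation, and its position."""
    deviations = [c - r for c, r in zip(candidate_freqs, reference_freqs)]
    # Index of the transition whose deviation is largest in absolute value.
    position = max(range(len(deviations)), key=lambda i: abs(deviations[i]))
    return {
        "max_deviation": deviations[position],
        "summed_deviation": sum(abs(d) for d in deviations),
        "max_position": position,
    }

# Hypothetical transition frequencies (number of words containing each transition).
stats = deviation_statistics([134, 45, 77], [120, 45, 80])
print(stats)  # {'max_deviation': 14, 'summed_deviation': 17, 'max_position': 0}
```

Here the first transition of the candidate occurs in 14 more words than the first transition of the reference, and the other transitions deviate less.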