Like practically every conlanger who knows how to code a little, I have written my own word generator, Lexifer. It is written in Python 2.7 (it will not work with Python 3), and assumes you have some basic familiarity with using command line tools. It is known to work under Linux and OSX, but the Python is written in a way that I'd expect it to work under Windows. Someone will need to try sometime and let me know.
Lexifer differs from other such tools in a few ways:
- It is all Unicode, all the time.
- By default, it chooses phonemes randomly according to the Gusein-Zade distribution rather than a flat one, since a flat distribution is highly unnatural.
- It has basic phonological knowledge, and can do basic assimilations and coronal metathesis if you want it to.
- If you want it to, it can handle whatever sort order you prefer, including the correct sorting of digraphs.
The software package is distributed with an example phonology in test.def, so you can play around starting with that file to get a feel for how it behaves.
Command Line Arguments
By default, the command lexifer spits out a fake paragraph of text, with the words generated according to the rules in the definition file. I shamelessly lifted this feature from Mark Rosenfelder's gen word generator. It's a great way to get a quick impression of how the phonology you're developing is coming along.
~/lexifer $ ./lexifer test.def
Ten mád mam lona rene haraku, nak medutin lár had keta tenatá. Ya sata, ʔar dedi bat lan ʔet keti binnane saga ra kuni. Ka tama keʔ táta mami ma, titale nar. Nan mate wat neti rinu nil. Náʔila met rasati mak. Natriki kak hida lon ʔawa tana ras samaki nená, nas. Loli ki ʔi nil ket han. Kor het denelep mema. Tad bil ketáká ta mir le. Kánáta ti ku weki mel, sa nake si deya. Taki ta maʔ liheni ta sasi ti, keni táʔ. Sut redati siti wana lir nise ni. Lam tápnak hina táse. Kel kin tuk ta nat tek lin dina tantáki tili, nek. Kok mal lak nenumá kiʔera, ka nala lak hiko. Mo ʔeret yad wáhi tálkisa ʔa. Háka dan reʔ deteme kod, natánu min man wád ka sa lihá. Wes lam ʔam diʔ káʔ nan mireli, liʔa nika ma. Numa sati laʔá, sána tál renka tan tar ʔenna. Nika tam lan ka, te sim put wád ná ma sina. Riliʔten hiʔataʔ, mal saʔ nilá kanate ta tatine táte. Naná karo náye, sek lim reti hedi te nin. Ra ná nita wir waki sanomi ʔin ták, ʔá. Ko hur raná ʔán wadi, ke rár káʔ ʔa. Gáne lase ta ʔin, nur raʔisu bir dát misi ki lal lik.
Note that I'm calling the command with ./lexifer (rather than just lexifer), which may be required if the current directory is not in your PATH.
If you just want a sorted list of words, use the command line argument -n NUMBER to say how many words you want:
~/lexifer $ ./lexifer -n 5 test.def
gam mita náb nim tat
By default, the tool uses the sort order specified by the letters directive in the definition file. If for some reason you don't want that, the argument -u (for unsorted) will inhibit sorting:
~/lexifer $ ./lexifer -n 5 -u test.def
táʔ mik nápira momi nekena
Finally, you can ask for one word per line with the option -o:
~/lexifer $ ./lexifer -n 5 -o test.def
ʔusi
his
meya
ne
tát
The Phonology Definition File
Have a look at the file test.def to see nearly all these options at work:
with: std-ipa-features std-assimilations coronal-metathesis
letters: ʔ a á b d e g h i k l m n o p r s t u w y
C = t n k m l ʔ s r d h w b y p g
D = n l ʔ t k r p
V = a i e á u o
words: CVCV? CVD?CVD? CVD? CVD?CVD?CVD?
reject: wu yi w$ y$ h$ ʔʔ (p|t|k|ʔ)h
filter: nr > tr; mr > pr; ŋ > n
The first line in test.def is a with directive which sets up Lexifer's more advanced options. First, std-ipa-features loads up the phonology database with features keyed on IPA. The other option is std-digraph-features which uses my crazy di- and trigraphs for when I can't (or don't want to) use IPA. See the file SmartClusters.py to see what those are. Sensible people will want to stick with IPA.
Note that you can mix IPA with other, non-standard stuff. For example, in general I tend to use the graph y for IPA /j/. Since the phonetics engine does no work on vowels or the glides at all, this is no problem. I also tend to use diacritics freely on vowels, for the same reason. But for the assimilations and metathesis engine to work, your consonants need to be represented more strictly in IPA.
The next option is std-assimilations, which tells the system to filter all generated words through both nasal assimilation and voicing assimilation for stops. That is, Lexifer will automatically convert the generated word anpa to ampa and akda to agda.
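To make the two assimilations concrete, here is a minimal standalone sketch in Python 3. It is not Lexifer's code: the feature tables and the function name `assimilate` are assumptions made up for illustration, and it only knows about plain stops and nasals.

```python
# Hypothetical feature tables for illustration only; Lexifer keeps its
# real ones in its phonology database.
PLACE = {"p": "labial", "b": "labial",
         "t": "coronal", "d": "coronal",
         "k": "velar", "g": "velar"}
NASAL_AT = {"labial": "m", "coronal": "n", "velar": "ŋ"}
NASALS = set(NASAL_AT.values())
VOICED = {"p": "b", "t": "d", "k": "g"}          # voiceless -> voiced
VOICELESS = {v: k for k, v in VOICED.items()}    # voiced -> voiceless


def assimilate(word):
    """Make a nasal or stop agree with the stop that follows it."""
    chars = list(word)
    # Walk right to left so a change can feed the cluster before it.
    for i in range(len(chars) - 2, -1, -1):
        cur, nxt = chars[i], chars[i + 1]
        if nxt not in PLACE:
            continue  # only stops trigger assimilation in this sketch
        if cur in NASALS:
            # Nasal takes the place of articulation of the next stop.
            chars[i] = NASAL_AT[PLACE[nxt]]
        elif cur in PLACE:
            # Stop takes the voicing of the next stop.
            if nxt in VOICELESS:      # next stop is voiced
                chars[i] = VOICED.get(cur, cur)
            else:                     # next stop is voiceless
                chars[i] = VOICELESS.get(cur, cur)
    return "".join(chars)


print(assimilate("anpa"), assimilate("akda"))
```

Note that a nasal before a velar stop comes out as ŋ in this sketch, so "anka" becomes "aŋka".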
Finally, the option coronal-metathesis takes a word shape like atka or atpa and turns it into akta and apta. In plenty of natural languages coronals are dispreferred as the first part of a consonant cluster, and in general I personally don't care for it. This filter fixes such clusters of stops, but will also fix nasal clusters, from nm to mn.
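The effect of this option can be approximated with two substitutions. The following Python 3 sketch is an illustration under assumptions, not Lexifer's implementation: the function name is hypothetical, and it only handles the plain stops t d p b k g and the nasal pair nm.

```python
import re


def coronal_metathesis(word):
    """Move a coronal out of first position in a stop or nasal cluster."""
    # A coronal stop followed by a non-coronal stop: swap the two.
    word = re.sub(r"([td])([pbkg])", r"\2\1", word)
    # The nasal cluster nm becomes mn.
    return word.replace("nm", "mn")


for w in ("atka", "atpa", "anma"):
    print(w, "->", coronal_metathesis(w))
```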
The letters directive must be used if you want assimilations or coronal metathesis to work. It contains a list of all characters and digraphs (or ngraphs) used in the phonology.
In addition, the letters directive determines the sort order. It can also sort digraphs correctly.
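One way to sort by a custom letter list, digraphs included, is to split each word into the longest matching entries from the list and compare their positions. The sketch below (Python 3) shows that idea; `make_sort_key` is a hypothetical helper, and Lexifer may do this differently internally.

```python
def make_sort_key(letters):
    """Build a sort key from an ordered list of letters and digraphs."""
    rank = {graph: i for i, graph in enumerate(letters)}
    # Try longer graphs first so "ch" wins over "c" followed by "h".
    longest_first = sorted(letters, key=len, reverse=True)

    def key(word):
        ranks, i = [], 0
        while i < len(word):
            for graph in longest_first:
                if word.startswith(graph, i):
                    ranks.append(rank[graph])
                    i += len(graph)
                    break
            else:
                # Unknown character: sort it after every known letter.
                ranks.append(len(letters))
                i += 1
        return ranks

    return key


letters = "ʔ a á b d e g h i k l m n o p r s t u w y".split()
words = ["tat", "nim", "gam", "náb", "mita"]
print(sorted(words, key=make_sort_key(letters)))

# A toy alphabet in which the digraph "ch" sorts before "c".
print(sorted(["ca", "cha", "da"], key=make_sort_key(["a", "ch", "c", "d"])))
```

With the letters line from test.def, the first call reproduces the order of the -n 5 sample output shown earlier.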
Warning! Make sure that all letters and digraphs that occur in phoneme classes occur in the letters directive, or you will get some very odd output. You will be warned if there is a mismatch.
Now we get to the heart of the matter: the phoneme classes. Class names should be single uppercase letters, followed by an equals sign, and then the phonemes in that class, separated by spaces. Each class has to fit on one line. The classes are then used in the words directive to build up your words.
C = t n k m l ʔ s r d h w b y p g
D = n l ʔ t k r p
V = a i e á u o
By default, phonemes are selected according to the Gusein-Zade distribution, so that the first phoneme will be picked most often, the last one least often, according to a fairly natural pattern. Shuffling the order of these is one way to change the general feel of your generated words pretty substantially without resorting to odd things in the words directive.
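For a feel of the numbers, the Gusein-Zade rank-frequency formula is commonly cited as p(r) = (ln(n + 1) - ln(r)) / n for the phoneme at rank r out of n. The Python 3 sketch below computes weights from that form and renormalizes them so they sum to 1. Whether Lexifer uses exactly this expression is an assumption, and the function names are hypothetical.

```python
import math
import random


def gusein_zade_weights(n):
    """Weights for ranks 1..n, highest first, normalized to sum to 1."""
    raw = [math.log(n + 1) - math.log(rank) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def pick_phoneme(phonemes, rng=random):
    """Pick one phoneme; earlier entries in the list are more likely."""
    weights = gusein_zade_weights(len(phonemes))
    return rng.choices(phonemes, weights=weights, k=1)[0]


vowels = "a i e á u o".split()
for v, w in zip(vowels, gusein_zade_weights(len(vowels))):
    print(v, round(w, 3))
```

The weights fall off steadily from the first phoneme to the last, which is why reordering a class changes the feel of the output.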
Instead of setting the weights identically every time, each run of Lexifer will jitter the automatically assigned Gusein-Zade weights by up to 10%. This probably doesn't have much obvious impact, but it was easy to implement.
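As an illustration of what such jitter could look like, the Python 3 sketch below scales each weight by an independent random factor between 0.9 and 1.1. The uniform distribution and the per-weight independence are assumptions; only the 10% bound comes from the description above.

```python
import random


def jitter(weights, amount=0.10, rng=random):
    """Scale each weight by a random factor in [1 - amount, 1 + amount]."""
    return [w * (1 + rng.uniform(-amount, amount)) for w in weights]


base = [0.4, 0.3, 0.2, 0.1]
print(jitter(base))
```

Because the result is only used as relative weights for random selection, it does not need to be renormalized afterwards.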
If you really want to set the phoneme frequencies by hand, you must do so for all the phonemes in a single phoneme class definition. You do that by suffixing a colon and then a number after the phoneme (no spaces), such as V = a:33 i:15 u:7. I designed Lexifer specifically to address the question of natural phoneme distributions. It's easier to use if you don't fight against it.
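The weighted syntax is simple enough to parse by hand. The following Python 3 sketch reads one class line and enforces the all-or-nothing rule described above. The function name and the choice to raise an error on a mixed line are assumptions; Lexifer's own behavior on a mixed line is not documented here.

```python
def parse_class(line):
    """Parse 'V = a:33 i:15 u:7' into ('V', [('a', 33.0), ...]).

    Phonemes without explicit weights get None, meaning the automatic
    distribution should be used.
    """
    name, _, body = line.partition("=")
    items = body.split()
    has_weight = [":" in item for item in items]
    if any(has_weight) and not all(has_weight):
        raise ValueError("weights must be given for every phoneme in a class")
    if not any(has_weight):
        return name.strip(), [(item, None) for item in items]
    pairs = []
    for item in items:
        phoneme, _, number = item.rpartition(":")
        pairs.append((phoneme, float(number)))
    return name.strip(), pairs


print(parse_class("V = a:33 i:15 u:7"))
```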
The words directive is where you craft how the phonemes defined in the phoneme classes are collected into word shapes. As in the phoneme classes, the word shapes that come first are selected more often than the last ones, so your very shortest words and very longest words should be closer to the end than the front. (For those who care, this uses a Zipf distribution.)
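A Zipf-style weighting gives the shape at rank r a weight proportional to 1/r^s. The Python 3 sketch below uses s = 1, which is an assumption, as are the function names; the exponent Lexifer actually uses is not stated here.

```python
import random


def zipf_weights(n, s=1.0):
    """Weights proportional to 1 / rank**s, normalized to sum to 1."""
    raw = [1.0 / (rank ** s) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def pick_shape(shapes, rng=random):
    """Pick a word shape; shapes listed first are picked more often."""
    return rng.choices(shapes, weights=zipf_weights(len(shapes)), k=1)[0]


shapes = "CVCV? CVD?CVD? CVD? CVD?CVD?CVD?".split()
for shape, w in zip(shapes, zipf_weights(len(shapes))):
    print(shape, round(w, 3))
```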
Within word definitions you can use a question mark after a phoneme class to indicate that it may or may not occur. So, CVCV? from test.def means that this word shape sometimes ends in a vowel, sometimes not. By default, a class marked with a question mark will occur about 10% of the time. You can change that with the random-rate directive, with random-rate: 25 upping the rate to 25%.
Also within word definitions the exclamation mark, '!', is used to prevent repeats. For example, if your language allows all consonants as syllable codas as well as onsets, but you don't allow double consonants, use CVCC!V to indicate that the second C in the cluster cannot be identical to the first.
There is probably a better character to use to prevent repeats than the exclamation point, but I haven't decided on one yet. I just wanted the feature working quickly. If you have a suggestion, let me know. (Apr 12 2015)
Finally, you can use normal letters within the definition of a word, such as CwVCV. For single letters used like this, the question mark will work to randomize the appearance, but that will not work for digraphs.
I have found that overuse of the last two features (explicit phonemes in word definitions, the question mark random operator) leads to unsatisfying results more often than not. It may be better to add another phoneme class or two, and define more word shapes.
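The rules for '?', '!' and literal letters can be summarized in a short interpreter. This Python 3 sketch is a simplified illustration, not Lexifer's code: `build_word` is a hypothetical name, phonemes are picked uniformly rather than by weight, macros are not handled, and classes are passed in as a plain dictionary.

```python
import random


def build_word(shape, classes, random_rate=10, rng=random):
    """Expand a word shape such as 'CVCC!V' or 'CwVCV?' into a word."""
    out = []
    i = 0
    while i < len(shape):
        symbol = shape[i]
        i += 1
        optional = no_repeat = False
        # Collect any '?' or '!' modifiers that follow the symbol.
        while i < len(shape) and shape[i] in "?!":
            if shape[i] == "?":
                optional = True
            else:
                no_repeat = True
            i += 1
        # An optional element appears about random_rate percent of the time.
        if optional and rng.random() * 100 >= random_rate:
            continue
        if symbol in classes:
            choices = classes[symbol]
            if no_repeat and out:
                # Exclude the previous phoneme; fall back to the full
                # class if that would leave nothing to choose from.
                choices = [p for p in choices if p != out[-1]] or choices
            out.append(rng.choice(choices))
        else:
            out.append(symbol)  # a literal letter such as 'w'
    return "".join(out)


classes = {"C": "t n k m l".split(), "V": "a i e".split()}
print(build_word("CVCC!V", classes), build_word("CwVCV?", classes, random_rate=25))
```

Since the repeat check looks at the last phoneme actually emitted, a shape like V?V! still behaves sensibly when the optional element is skipped.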
If you find that certain syllable shapes keep coming up, you can use macros. These are defined like phoneme classes, except that instead of being a single upper case letter, they are a single upper case letter with a dollar sign in front, $S, for example.
The words section from test.def could be simplified by using a single macro:
$S = CVD?
words: CVCV? $S$S $S $S$S$S
Macros must be defined before the words they are used in.
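Macro expansion amounts to textual substitution, applied with whatever macros have been seen so far. The Python 3 sketch below reads definition lines in order to show why the ordering matters. It is an illustration with a hypothetical function name, not how Lexifer necessarily parses its files.

```python
def expand_words(lines):
    """Return the word shapes from a 'words:' line with macros expanded.

    Only macros defined on earlier lines are known when the words line
    is reached, which is why they must come first.
    """
    macros = {}
    for line in lines:
        line = line.strip()
        if line.startswith("$") and "=" in line:
            name, _, body = line.partition("=")
            macros[name.strip()] = body.strip()
        elif line.startswith("words:"):
            shapes = line[len("words:"):].split()
            for name, body in macros.items():
                shapes = [shape.replace(name, body) for shape in shapes]
            return shapes
    return []


print(expand_words(["$S = CVD?", "words: CVCV? $S$S $S $S$S$S"]))
```

The result is the same list of word shapes that test.def spells out by hand.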
If there are certain combinations you don't want to occur, you can remove them with the reject directive. You can have as many reject directives as you like, or you can have multiple patterns on a single line separated by a space.
Both the reject directive and the filter directive explained below use regular expressions to do their work. Regular expressions are an entire field of study, but I'll mention a few relevant things:
- The end of a word is indicated by a dollar sign symbol, '$'.
- The start of a word is indicated by a caret, '^'.
- A set of alternatives is written as items separated by a pipe symbol, '|'. For reasons I won't go into here, these should always have parentheses around them.
So, among the things I reject in test.def are the combinations /wu/ and /yi/. Then I say that I don't want /w y h/ at the end of a word, w$ y$ h$. After saying I don't want double glottal stops, I forbid any voiceless stop or the glottal stop from occurring before /h/, (p|t|k|ʔ)h.
Here are some rejections of things often dispreferred in natural languages:
- reject: (t|d)l
- reject: (p|b|f|v|m)w
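Since the patterns are ordinary regular expressions, you can try them out in Python before putting them in a definition file. This Python 3 sketch checks words against the reject line from test.def; `is_rejected` is a hypothetical helper, and it assumes a pattern may match anywhere in the word unless anchored with '^' or '$'.

```python
import re

# The reject patterns from test.def, split on spaces.
REJECTS = "wu yi w$ y$ h$ ʔʔ (p|t|k|ʔ)h".split()


def is_rejected(word, patterns=REJECTS):
    """True if any reject pattern matches somewhere in the word."""
    return any(re.search(pattern, word) for pattern in patterns)


for word in ("wuta", "kaw", "tapha", "tana"):
    print(word, is_rejected(word))
```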
You can transform some patterns into others with the filter directive. Again, you can have as many filter lines as you like, or you can have several on a line, separated by semicolons. The pattern you want to change occurs first, followed by a greater than sign, '>', and then the replacement. For example, I could automate palatalization with filter: ki > tʃi.
Warning! If you introduce new symbols with filters, you must make sure they occur in the letters directive (if you're using it or the assimilation engine), or you will get nonsense results.
Filters are applied in the order they are defined.
If the target of the filter is the exclamation point, !, then the pattern is deleted. For example, filter: s > ! would delete the letter s from every output word. Using this to delete letters doesn't make much sense, but you could use a special character to mark syllable boundaries, for example, do particular filters based on that, and then remove the syllable character at the end. I recommend a single quote or an at-sign for this syllable marking purpose — many other characters have special meanings to regular expressions and would confuse the rest of the filter and reject rules. Such a special character also needs to be in the letters directive.
If you use the assimilation engine you are very likely to get the velar nasal /ŋ/ in your output. Most people do not, however, have that as a separate symbol in their conlangs, so the directive filter: ŋ > n will undo this change. I imagine it will be a regular feature of most people's definition files.
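The filter syntax maps directly onto regular expression substitution. The Python 3 sketch below parses the right-hand side of a filter line and applies the rules in order, treating '!' as deletion. The function names are hypothetical and this is an approximation of the behavior described above, not Lexifer's source.

```python
import re


def parse_filters(text):
    """Parse 'nr > tr; mr > pr; ŋ > n' into (pattern, replacement) pairs."""
    rules = []
    for part in text.split(";"):
        if not part.strip():
            continue
        pattern, _, replacement = part.partition(">")
        replacement = replacement.strip()
        # A replacement of '!' means the match is deleted.
        rules.append((pattern.strip(), "" if replacement == "!" else replacement))
    return rules


def apply_filters(word, rules):
    """Apply each substitution in the order the rules were defined."""
    for pattern, replacement in rules:
        word = re.sub(pattern, replacement, word)
    return word


rules = parse_filters("nr > tr; mr > pr; ŋ > n")
print(apply_filters("anra", rules), apply_filters("aŋka", rules))
```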
Finally, there are cluster tables. These combine filters and rejections in a clear and concise way based on a layout sometimes seen in descriptions of languages' phonologies. For example:
% a i u
a + + o
i - + uu
u - - +
This table defines what to do about various vowel combinations. The start of a cluster table must have a percent sign, '%', as the very first character of the line. The phonemes following that represent the second element in a cluster. The first phoneme in each following row represents the first element in a cluster. Within the table, '+' means the combination is fine, '-' means the combination should be rejected, and anything else is a substitution. So 'a' + 'i' is fine, but 'a' + 'u' becomes an 'o', while 'i' + 'a' is forbidden.
A blank line marks the end of a cluster table. These are just a notational convenience, to lay out a bunch of rejects and filters. Regular expressions can be used. As with normal rejects and filters, these are processed in order when a word is created. The simple rejects and filters can come before or after a cluster table as makes sense for what you're trying to accomplish.
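To see how a table unpacks into plain rejects and filters, here is a Python 3 sketch that turns a cluster table into an ordered list of rules. The function name and the tuple format are assumptions made for illustration, not Lexifer's internals.

```python
def parse_cluster_table(lines):
    """Turn a cluster table into ordered ('reject' | 'filter') rules."""
    if not lines or not lines[0].startswith("%"):
        raise ValueError("a cluster table must start with '%'")
    seconds = lines[0][1:].split()   # second elements, from the header
    rules = []
    for row in lines[1:]:
        cells = row.split()
        if not cells:
            break                    # a blank line ends the table
        first, actions = cells[0], cells[1:]
        for second, action in zip(seconds, actions):
            if action == "+":
                continue             # combination is fine as it is
            if action == "-":
                rules.append(("reject", first + second, None))
            else:
                rules.append(("filter", first + second, action))
    return rules


table = ["% a i u", "a + + o", "i - + uu", "u - - +"]
for rule in parse_cluster_table(table):
    print(rule)
```

For the vowel table above this yields two filters (au to o, iu to uu) and three rejects (ia, ua, ui), in row order.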
See examples/hungarian.def for an extended example using cluster tables.
- July 27 2016: Version 2.0, with cluster tables.
- July 12 2016:
  - Fixed a bug with random-rate, as well as '!' in word definitions (thanks to Amanda Babcock Furrow for bringing these to my attention).
  - Added a subtlety to '!' to make it work properly after an optional class (something like V?V! works as expected now).
  - Added macros.
  - Added another example definition to the distribution.
- April 11 2015: fixed a bug with affricates; added the '!' special character to the syllable shape rules
- April 1 2015: added some sanity checking and warnings about mismatches between graphs appearing in phoneme classes and the contents of the 'letters' directive
- March 31 2015: minor bugfixes
- March 30 2015: initial release