Vulgar: Language generator

Vulgar is capable of generating over 100 quadrillion unique and usable conlangs using pseudo-randomness. However, the languages it produces are far from random strings of letters; a great deal of research has gone into ensuring the languages are as naturalistic as possible. Fantasy languages may be created for worlds where anything is possible, but naturalism is often a goal for language creators. If you want your characters’ culture to feel real, their language should feel real too.

The vocabulary

The Pro version of Vulgar generates about 4000 unique words and matches them to a list of the 4000 most common English words. This data comes from an English word frequency list by linguist Mark Davies at Wordfrequency.info. Davies’ research groups inflected words with their non-inflected forms (for example, ‘dogs’ is counted as ‘dog’), so there are no double-ups in the basic meanings of words.

Because this research comes from contemporary American English, a certain level of artistic licence has been taken to tailor the vocabulary towards the fantasy fiction genre. Certain highly culturally specific words have been removed (‘Catholic’, ‘Republican’), as well as most technological terms (‘internet’, ‘e-mail’).

Another finding is that after about the 2000th word, most English words start to be derived from other words, such as ‘investigation’, which is just the noun form of ‘investigate’, or ‘sleepy’, an adjective form of ‘sleep’. These kinds of words don't really add exciting new content to the overall vocabulary, so Vulgar reaches much further into the frequency list to give you more in the way of unique concepts. For instance, the 4000th word in Vulgar is ‘cellar’, which is actually the 7805th word in English.

Vulgar also simulates derived words with its own affix system. If the word for ‘investigate’ is generated as kalar, Vulgar will generate an affix that turns verbs into nouns so that ‘investigation’ still resembles its verb form. For example, if the affix it comes up with for verb-to-noun is -at, ‘investigation’ would become kalarat. This affix is then applied to other words, so amir ‘nominate’ would turn into amirat ‘nomination’. Vulgar creates nine different affixes for different word changes, and more can be added and edited in the custom settings.
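The affix step above can be sketched in a few lines of Python. This is only an illustration of the idea: the `derive` function, the hyphen notation for prefixes and suffixes, and the small lexicon are assumptions made for the example, not Vulgar's actual code.

```python
# Hypothetical sketch: names and data are illustrative, not Vulgar's internals.

def derive(root: str, affix: str) -> str:
    """Attach a suffix written as '-at' or a prefix written as 'at-'."""
    if affix.startswith("-"):
        return root + affix[1:]
    if affix.endswith("-"):
        return affix[:-1] + root
    raise ValueError(f"affix must start or end with a hyphen: {affix!r}")

VERB_TO_NOUN = "-at"  # the example affix from the text

# Generated root words, keyed by their English meaning.
lexicon = {"investigate": "kalar", "nominate": "amir"}

# English verb/noun pairs that should share a root in the conlang.
derivations = [("investigate", "investigation"), ("nominate", "nomination")]

for verb, noun in derivations:
    lexicon[noun] = derive(lexicon[verb], VERB_TO_NOUN)

print(lexicon["investigation"], lexicon["nomination"])
```

Because every derived noun is built from its verb with the same affix, related words stay recognisably related across the whole vocabulary.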

Not all languages divide concepts up into the same words as English. Vulgar simulates this by not doing a one-to-one mapping of conlang word to English word. Instead, it draws from cross-linguistic research about concepts that languages often use the same word for. Here are just a few of the meanings that Vulgar may combine:

- In every generated language, there is a 50% chance that the word for ‘tongue’ also means ‘language’
- In every generated language, there is a 60% chance that the word for ‘white’ is also the word for ‘blank’
- In every generated language, there is a 30% chance that the word for ‘girl’ is also the word for ‘girlfriend’
- In every generated language, there is a 10% chance that the word for ‘air’ is also the word for ‘wind’
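One way to picture these chances is as an independent dice roll per pair each time a language is generated. The sketch below uses the four probabilities from the list; the `MERGE_CHANCES` table and the `choose_merges` function are hypothetical names, and treating each roll as independent is an assumption.

```python
import random

# Probabilities come from the list above; the structure is an assumption.
MERGE_CHANCES = [
    ("tongue", "language", 0.50),
    ("white", "blank", 0.60),
    ("girl", "girlfriend", 0.30),
    ("air", "wind", 0.10),
]

def choose_merges(rng: random.Random) -> list:
    """Return the concept pairs that will share one word in this language."""
    return [(a, b) for a, b, chance in MERGE_CHANCES if rng.random() < chance]

rng = random.Random(42)  # fixed seed so a given language is reproducible
merged = choose_merges(rng)
print(merged)
```

Seeding the generator means the same seed always gives the same set of merged meanings, while different seeds give different languages.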

Both the 2000- and 4000-word versions of Vulgar give you enough vocabulary to be able to talk about just about anything, and if the vocab doesn't have a word that you need, you can add it in the custom settings. But you might not have to add as much as you think. According to Zipf’s law, the 10 most common English words make up 25% of spoken language, and the most common 50,000 make up 95%. The remaining 5% is made up by another million words, according to some estimates. Using this calculation, the 2000-word version covers 82% of language, and the 4000-word version covers 86%. And there are always ways to talk about things even if you don't have a dedicated word for them. Vulgar doesn’t generate a word for ‘backpack’, but it does have words for ‘bag you wear on your back’.

The sounds

Vulgar allows you to have as much control as you want over the sounds of each language, or be as lazy as you want. If you let Vulgar choose everything for you, it chooses a plausible phoneme inventory. Consonants like m, p, t, k and n are common across all languages, so you are likely to find most of them in most randomly chosen inventories. However, there is also a decent degree of randomness baked into the algorithm, as we shouldn't expect all phoneme inventories to be the same.

Next, Vulgar chooses which consonants are allowed to go next to each other. Again, this is based on real-world data. Most languages do not like consonant combinations such as tp or kp, as in atpa or akpa. You are more likely to find apta and apka. All the consonant clusters it chooses are observed in real languages.
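A cluster rule like this can be thought of as a whitelist check on candidate words. The following is a minimal sketch under simplifying assumptions: each letter stands for one phoneme, and the consonant set and the list of allowed clusters are made up for the example rather than taken from Vulgar's data.

```python
# Illustrative data only; not Vulgar's actual inventory or cluster list.
CONSONANTS = set("ptkmn")
ALLOWED_CLUSTERS = {"pt", "pk"}

def clusters(word):
    """Yield every pair of adjacent consonants in the word."""
    for first, second in zip(word, word[1:]):
        if first in CONSONANTS and second in CONSONANTS:
            yield first + second

def is_well_formed(word):
    """A word passes if all of its consonant clusters are on the whitelist."""
    return all(cluster in ALLOWED_CLUSTERS for cluster in clusters(word))

candidates = ["apta", "apka", "atpa", "akpa", "amana"]
accepted = [word for word in candidates if is_well_formed(word)]
print(accepted)
```

Here atpa and akpa are rejected because tp and kp are not on the whitelist, while amana passes because it contains no clusters at all.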

Phonemes are never distributed evenly in a language. Vulgar also accounts for this. Typically, phonemes that are more common across languages, like the aforementioned m, p, t, k and n, also appear more frequently within a language. In fact, research shows that the distribution of ranked phonemes follows what's known as a Yule distribution. This formula tells you that if /m/ is the most frequent phoneme in your language, then /m/ should appear X% of the time in the vocabulary (with X changing depending on how many phonemes there are in total). It also tells you the percentages of all other phonemes based on their rank. The finding is, if you have a lot of phonemes in your language, the least common ones are really quite uncommon, maybe 40 times less common, while smaller inventories are a little bit more evenly spread.
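To make the rank-to-frequency idea concrete, here is a rough sketch that assumes the Yule equation takes the form f(r) ∝ r^(-b) · c^r, where r is the phoneme's rank. The function name and the values of `b` and `c` are placeholders chosen for illustration; they are not the parameters Vulgar uses.

```python
def yule_frequencies(n_phonemes, b=0.5, c=0.95):
    """Relative frequency for each rank 1..n, normalised to sum to 1.

    Assumes the form f(r) proportional to r**-b * c**r; b and c are
    illustrative placeholder values, not fitted parameters.
    """
    weights = [r ** -b * c ** r for r in range(1, n_phonemes + 1)]
    total = sum(weights)
    return [w / total for w in weights]

small = yule_frequencies(15)   # a small inventory
large = yule_frequencies(40)   # a large inventory

# How many times more common is the top phoneme than the rarest one?
print(round(small[0] / small[-1], 1), round(large[0] / large[-1], 1))
```

With these placeholder values, the gap between the most and least common phoneme is much wider in the 40-phoneme inventory than in the 15-phoneme one, which matches the pattern described above.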

Vulgar will apply a Yule distribution to your phonemes if you select the "naturalistic" phoneme frequency in the options. The Yule formula also has some random variables (read: wiggle room) to account for real-world data. In other words, the highest ranked phoneme of Language A might make up 15% of the vocab, while in Language B it makes up 16%, despite both languages having the same number of overall phonemes. You won't actually notice this wiggle room in Vulgar at all, but it's fun to know it's there.

Grammar

The grammar output of Vulgar draws on statistics from real-world languages. For example, about 70% of the world’s languages put the adjective after the noun, so Vulgar chooses this option 70% of the time. Much of this data comes from the excellent research at the World Atlas of Language Structures.
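Choosing a grammatical feature in proportion to real-world statistics amounts to a weighted random pick. In the sketch below, only the 70% figure comes from the text; giving the remaining 30% entirely to adjective-before-noun order is a simplifying assumption, and `ADJECTIVE_ORDER` and `pick` are hypothetical names.

```python
import random

# 0.70 is the figure from the text; the 0.30 remainder is an assumption.
ADJECTIVE_ORDER = {"noun-adjective": 0.70, "adjective-noun": 0.30}

def pick(options, rng):
    """Choose one key from a {name: weight} mapping, weighted by its value."""
    names = list(options)
    weights = [options[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(7)  # seeded so the same language can be regenerated
print(pick(ADJECTIVE_ORDER, rng))
```

The same helper could be reused for any other feature by passing in a different table of options and weights.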

Vulgar doesn't yet generate all the possible kinds of things that can occur in real languages, as the possibilities are extremely vast. However, we are improving with every update.