Vulgar: Language generator

Vulgar is capable of generating over 100 quadrillion unique and usable conlangs using pseudo-randomness. However, the languages it produces are far from random strings of letters; a great deal of research has gone into making them as naturalistic as possible. Even though fantasy languages are ostensibly created for fantasy settings, naturalism is often a goal that language creators keep at the forefront of their minds; ultimately our characters are human -- or at least humanoid. Unnatural language is occasionally a goal when aiming for a particularly alien feel -- as was the case for Star Trek's Klingon language, which has very unnatural throaty sequences of sounds and unusual grammar features. But more often than not, audiences have an appetite for something that could actually be real.

The vocabulary

The Pro version of Vulgar generates about 4000 unique words and matches them to a list of the 4000 most common words in English. This data comes from an English word frequency list by linguist Mark Davies at Wordfrequency.info. Davies’ research groups inflected English words under their non-inflected dictionary forms (for example, ‘dogs’ is counted as ‘dog’), so there are no double-ups in the basic meanings of words.

Because this research comes from a corpus of contemporary American English, a certain level of artistic licence has been taken to tailor the vocabulary towards the fantasy fiction genre. Certain highly culture-specific words have been removed (‘Catholic’, ‘Republican’), as well as most technological terms (‘internet’, ‘e-mail’).

Another finding is that after about the 2000th word, most English words start to be derived from other words. For example, you start to see a lot of words like ‘investigation’, which is just the noun form of ‘investigate’, or ‘sleepy’, an adjective form of ‘sleep’. As these kinds of words don't add much new content to the overall vocabulary, Vulgar reaches much further into the frequency list to give you unique senses. For example, the 4000th word in Vulgar is ‘cellar’, which is actually the 7805th word in English.

On top of this, Vulgar simulates derived words with its own affix system. So if the word for ‘investigate’ is generated as kalar, Vulgar will generate an affix that turns verbs into nouns, so that ‘investigation’ still resembles its verb form. For example, if the affix it comes up with for verb-to-noun is -at, ‘investigation’ would become kalarat. This affix is then applied to other words, so ‘nominate’ amir would turn into ‘nomination’ amirat. Vulgar comes up with nine different affixes for different word changes, and more can be added and edited in the custom settings.
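As a rough illustration of the affix idea, here is a minimal Python sketch. It is not Vulgar's actual code: the function names (make_suffix, derive_noun) and the letter pools are assumptions made up for this example. The point it shows is that one suffix is chosen per language and then reused for every derived noun.

```python
import random

# Hypothetical letter pools; a real generator would draw on the
# language's own phoneme inventory.
VOWELS = "aeiou"
CONSONANTS = "ptkmnslr"

def make_suffix(rng):
    """Invent one vowel+consonant suffix, to be reused across the language."""
    return rng.choice(VOWELS) + rng.choice(CONSONANTS)

def derive_noun(verb_root, suffix):
    """Turn a verb root into its noun form by attaching the shared suffix."""
    return verb_root + suffix

# The roots and the suffix -at are the examples used in the text.
roots = {"investigate": "kalar", "nominate": "amir"}
suffix = "at"  # make_suffix(random.Random()) would pick one at random instead
nouns = {gloss: derive_noun(root, suffix) for gloss, root in roots.items()}
print(nouns)
```

Because the suffix is generated once and stored, every verb-to-noun pair in the language stays visibly related, which is what makes the derived vocabulary feel systematic.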

Not all languages divide concepts up into the same words as English. Vulgar attempts to simulate this by not doing a one-to-one mapping of conlang word to English word. Instead, it draws from cross-linguistic research about concepts that languages often use the same word for. Here is just a fraction of words that Vulgar may combine:

- There is a 50% chance in every generated language the word for ‘tongue’ also means ‘language’
- There is a 60% chance in every generated language the word for ‘white’ is also the word for ‘blank’
- There is a 30% chance in every generated language the word for ‘girl’ is also the word for ‘girlfriend’
- There is a 10% chance in every generated language the word for ‘air’ is also the word for ‘wind’

Both the 2000-word and 4000-word versions of Vulgar give you enough vocabulary to talk about just about anything, and if the vocab doesn't have a word that you need, you can always add it in the custom settings. But you might not have to add as much as you think. According to Zipf's law, the 10 most common English words make up 25% of spoken language, and the most common 50,000 make up 95%. The remaining 5% is made up of about another million words. Using this calculation, the 2000-word version covers 82% of language, and the 4000-word version covers 86%. And there are always ways to talk about things even if you don't have a dedicated word for them. Vulgar doesn’t generate a word for ‘backpack’, but it does have words for ‘bag you wear on your back’.

The sounds

Vulgar allows you to be as lazy as you want about choosing the sounds of the language, or to have as much control as physically possible. If you let Vulgar choose everything for you, it chooses a "sensible" phoneme inventory. Consonants like m, p, t, k, n are very common across all languages, so you are likely to find most of them in most randomly chosen inventories. However, there is also a decent degree of randomness, as we shouldn't expect all phoneme inventories to be the same.

Next, it chooses which consonants are allowed to go next to each other. Again, this is based on real-world data. For example, most languages do not like consonant combinations such as tp or kp, as in atpa or akpa. You are more likely to find apta and apka. All the consonant clusters it chooses are observed in real languages.
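One simple way to picture this is an allow-list of clusters that candidate words are checked against. The sketch below assumes exactly that; the cluster set, the consonant set, and the function names are hypothetical and chosen only to match the examples in the text.

```python
# Hypothetical allow-list; a real one would be built from attested clusters.
ALLOWED_CLUSTERS = {"pt", "pk", "mp", "nt", "st"}
CONSONANTS = set("ptkmns")

def clusters_in(word):
    """Return every pair of adjacent consonants found in the word."""
    return [
        word[i:i + 2]
        for i in range(len(word) - 1)
        if word[i] in CONSONANTS and word[i + 1] in CONSONANTS
    ]

def is_allowed(word):
    """A word passes only if all of its consonant pairs are on the allow-list."""
    return all(cluster in ALLOWED_CLUSTERS for cluster in clusters_in(word))

for candidate in ["apta", "apka", "atpa", "akpa"]:
    print(candidate, is_allowed(candidate))
```

With this allow-list, apta and apka pass while atpa and akpa are rejected, mirroring the preference described above.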

Vulgar also thinks about the frequency of your phonemes. Phonemes are never distributed evenly in a language. Typically, phonemes that are more common across languages, like m, p, t, k, n, also appear more frequently within the language.

Research shows that the distribution of phonemes follows a Yule distribution based on the rank of each phoneme. Basically, the formula tells you that if /m/ is the most frequent phoneme in your language, then /m/ should appear X% of the time in the vocabulary (with X changing depending on how many phonemes there are in total). It also tells you the percentages of all other phonemes based on their rank. The finding is, if you have a lot of phonemes in your language, the least common ones are really quite uncommon, maybe 40 times less common, while smaller inventories are a little bit more evenly spread.

Vulgar will apply a Yule distribution to your phonemes if you select the "naturalistic" phoneme frequency in the options. The Yule formula also has some random variables (read: wiggle room) to account for real-world data, and Vulgar simulates this too. In other words, the highest-ranked phoneme of Language A might make up 15% of the vocab, while in Language B it makes up 16%, despite both languages having the same number of phonemes overall. You won't actually notice this wiggle room in Vulgar at all, but it is fun to know it's there.
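To make the rank-based idea concrete, here is a hedged Python sketch. It assumes a commonly cited Yule-type form in which the weight of the phoneme at rank r is proportional to r^(-b) * c^r. The parameter values b and c, the jitter mechanism, and the function name are illustrative assumptions, not the values or method Vulgar actually uses.

```python
import random

def yule_frequencies(n_phonemes, b=0.5, c=0.96, jitter=0.0, rng=None):
    """Return relative frequencies by rank, most frequent phoneme first.

    Assumed form: weight(rank) = rank**-b * c**rank, then normalised.
    `jitter` scales each weight by a random factor in [1 - jitter, 1 + jitter]
    to imitate the wiggle room described above.
    """
    rng = rng or random.Random()
    weights = []
    for rank in range(1, n_phonemes + 1):
        weight = rank ** -b * c ** rank
        weight *= 1 + rng.uniform(-jitter, jitter)
        weights.append(weight)
    weights.sort(reverse=True)  # keep rank order after jitter
    total = sum(weights)
    return [w / total for w in weights]

large = yule_frequencies(40)
small = yule_frequencies(15)
print(f"40 phonemes: top/bottom ratio = {large[0] / large[-1]:.1f}")
print(f"15 phonemes: top/bottom ratio = {small[0] / small[-1]:.1f}")
```

Under these assumed parameters, the ratio between the most and least frequent phoneme grows with inventory size, which matches the observation that large inventories have a long tail of rare sounds while small ones are spread more evenly.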

Grammar

The grammar output of Vulgar draws on statistics from real-world languages. For example, about 70% of the world's languages put the adjective after the noun, so Vulgar chooses this option 70% of the time. Much of this data comes from the excellent research at the World Atlas of Language Structures.
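Choosing a grammar feature in proportion to its real-world frequency amounts to a weighted random pick. The sketch below uses the 70% figure quoted above; the option names, the dictionary layout, and pick_feature are assumptions for illustration only.

```python
import random

# The 70/30 split is the figure quoted in the text; the labels are made up.
ADJECTIVE_ORDER = {"noun-adjective": 0.70, "adjective-noun": 0.30}

def pick_feature(options, rng):
    """Pick one option at random, weighted by its cross-linguistic share."""
    names = list(options)
    weights = [options[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random()
print(pick_feature(ADJECTIVE_ORDER, rng))
```

The same helper could be reused for any other feature that has a table of options and shares, such as basic word order.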

Vulgar doesn't yet generate every kind of feature that can occur in real languages, as the possibilities are extremely vast. This part of the generator is still under development.