What is RegEx?

RegEx (regular expression) is a notation for describing search patterns in text, and is widely used in programming languages. In Vulgar, RegEx can be used in the custom spelling option to mimic the spelling idiosyncrasies of natural languages. It can also be used in the illegal combinations field to prohibit combinations of phonemes, and in custom affix rules to create more complex sound changes.

This page covers the basics of RegEx. Also try our RegEx Builder tool.

Matching at the beginning or end of a word

# signifies a word boundary. The pattern #dʒ will match at the beginning of a word, but not the middle or end.

This also works at the end of a word: the pattern dʒ# matches only a word-final dʒ.
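As a rough illustration outside Vulgar, the word-boundary behaviour can be mimicked with Python's re module. This sketch assumes each word is processed on its own, so a leading or trailing # is translated to the anchors ^ and $. The helper name to_python and the replacement letter j are made up for the example, and this is not Vulgar's actual implementation.

```python
import re

def to_python(pattern: str) -> str:
    """Translate a leading or trailing # into ^ or $ (an assumption for this sketch)."""
    if pattern.startswith("#"):
        pattern = "^" + pattern[1:]
    if pattern.endswith("#"):
        pattern = pattern[:-1] + "$"
    return pattern

word = "dʒadʒ"
# "#dʒ" only touches the word-initial dʒ
start = re.sub(to_python("#dʒ"), "j", word)   # "jadʒ"
# "dʒ#" only touches the word-final dʒ
end = re.sub(to_python("dʒ#"), "j", word)     # "dʒaj"
print(start, end)
```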

Match A or B

The bar symbol | matches the pattern on either side of it. For example, if you want both əʊ and ɔ to turn into o, use əʊ|ɔ > o. Multiple bar symbols can be used: əʊ|ɔ|ɒ > o. If you want to match əʊ or ɔ only at the end of a word, you need to group the OR section in round brackets: (əʊ|ɔ)#. Without the brackets, the pattern matches ɔ at the end of a word or əʊ anywhere in the word.

Alternatively, square brackets [ ] match any single character inside them. For instance, [ɔəʊ] > o changes any of those characters to o. The disadvantage of this method is that it treats everything inside the brackets as individual characters, so [ɔəʊ] > o turns əʊ into oo, whereas əʊ|ɔ > o turns it into a single o.

Take-home point: if any of your patterns is more than one symbol long, don't use square brackets! Some people mistakenly think that [əʊ ɔ] works; however, this matches ə or ʊ or a space or ɔ.
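The difference between the bar and square brackets can be checked with a short Python sketch. It uses Python's re module purely as a stand-in, with $ playing the role of a word-final #; the sample words are invented for the illustration.

```python
import re

word = "əʊtɔ"
# Alternation treats əʊ as one unit
alt = re.sub("əʊ|ɔ", "o", word)            # "oto"
# A character class treats ə and ʊ as separate characters
cls = re.sub("[ɔəʊ]", "o", word)           # "ooto"

# Grouping keeps the end-of-word anchor attached to both alternatives
grouped = re.sub("(əʊ|ɔ)$", "o", "əʊtəʊ")  # "əʊto"
# Without the group, $ binds only to ɔ, so əʊ matches anywhere
ungrouped = re.sub("əʊ|ɔ$", "o", "əʊtəʊ")  # "oto"
print(alt, cls, grouped, ungrouped)
```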

A caret symbol ^ at the start of the square brackets matches any character not listed inside them. For example, [^aeiou] matches any character that is not one of those five vowels.
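A one-line Python check of a negated character class, using an invented sample word and x as an arbitrary replacement:

```python
import re

# [^aeiou] matches every character that is NOT one of the listed vowels
result = re.sub("[^aeiou]", "x", "banana")   # "xaxaxa"
print(result)
```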

Lookahead

Lookahead allows you to match a pattern but only replace it if it comes before another pattern. Example: you want to change k to c, but only if there is an a after it. The lookahead pattern is placed inside brackets with ?= at the beginning, like this: k(?=a) > c.

Notice that if the a is not placed in a lookahead, as in ka > c, it becomes part of the match and gets replaced along with the k. We don't want this!

Negative lookahead follows the same principle, but the rule is applied only if there is no match ahead of it. It uses ?! inside the brackets. For example, k(?!a) > c changes k to c except where an a follows.
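The following Python sketch contrasts the three cases on a made-up word. Python's re module uses the same (?=...) and (?!...) syntax, so it serves here as a convenient stand-in for trying the patterns out.

```python
import re

word = "kaki"
# Lookahead: the a is checked but not consumed, so only k is replaced
with_lookahead = re.sub("k(?=a)", "c", word)   # "caki"
# No lookahead: the a is part of the match and is replaced too
without = re.sub("ka", "c", word)              # "cki"
# Negative lookahead: replace k only when no a follows
negative = re.sub("k(?!a)", "c", word)         # "kaci"
print(with_lookahead, without, negative)
```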

Lookbehind

Lookbehind follows the same principle as lookahead, but checks for a pattern behind the main pattern. It uses ?<= inside the brackets. For example, (?<=[ptk])[aeiou] > ə replaces a vowel with ə if it comes directly after p, t or k.

Negative lookbehinds use ?<! inside the brackets. (Note: this may not work in some older browsers. Try the latest version of Firefox/Chrome/Edge.)
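Both kinds of lookbehind can be tried in Python, which shares the (?<=...) and (?<!...) syntax. The sample word and the replacement o are invented for this sketch.

```python
import re

word = "apata"
# Positive lookbehind: replace a only when p or t comes directly before it
after = re.sub("(?<=[pt])a", "o", word)       # "apoto"
# Negative lookbehind: replace a only when p or t does NOT come before it
not_after = re.sub("(?<![pt])a", "o", word)   # "opata"
print(after, not_after)
```

Note that the word-initial a counts as "not after p or t", because nothing precedes it.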

Shorthand symbols

Vulgar uses various shorthand abbreviations for classes of phonemes, such as C for "any consonant" or V for "any vowel". This allows us to simplify some of the previous examples. For instance, (?<=C)V matches any vowel that comes directly after any consonant.

Here is a complete list:

Shorthand code          Category
A                       Affricates
B                       Back vowels
C                       Consonants
D                       Any IPA letter (does not match diacritics)
ᴰ (superscript D)       Any diacritic symbol
E                       Front vowels
F                       Fricatives
H                       Laryngeals
K                       Velars
L                       Liquids
ʟ (small capital L)     Any IPA letter (does not match diacritics)
M                       Diphthongs
N                       Nasal consonants
O                       Obstruent
P                       Labials
Q                       Uvulars
R                       Sonorant/resonant
S                       Stops
U or σ                  Syllable
V                       Vowels, including diphthongs
W                       Semivowels
X                       Any phoneme
Z                       Continuant
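One way to picture a shorthand code is as an abbreviation for a character class. The Python sketch below is only an illustration of that idea: the SHORTHAND table, its deliberately tiny inventories and the expand helper are all hypothetical, and Vulgar's real classes are larger (V, for instance, also covers diphthongs).

```python
import re

# Hypothetical, simplified inventories chosen for this example only
SHORTHAND = {
    "C": "[ptkmnsl]",
    "V": "[aeiou]",
}

def expand(pattern: str) -> str:
    """Naively replace each shorthand capital with its character class."""
    return "".join(SHORTHAND.get(ch, ch) for ch in pattern)

expanded = expand("(?<=C)V")               # "(?<=[ptkmnsl])[aeiou]"
result = re.sub(expanded, "ə", "pata")     # "pətə"
print(expanded, result)
```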

Backreferences

Numbers refer back to whatever was captured inside brackets. The number 1 refers to whatever was matched in the first pair of brackets, 2 to the second pair, and so on. For example, a pattern that captures a word-final vowel in brackets can double it by writing 1 twice on the right side of the rule.

Similarly, a pattern that captures two consecutive consonants in separate brackets can swap them by writing 2 before 1 on the right side.

Zero refers to the entire match.
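The same three ideas can be sketched in Python. Be aware that Python spells backreferences differently in the replacement: \1 and \2 for groups and \g<0> for the whole match. The vowel and consonant classes and the sample words are assumptions made for this example.

```python
import re

vowels = "[aeiou]"        # simplified stand-in for V
consonants = "[ptkmnsl]"  # simplified stand-in for C

# Capture a word-final vowel and write it twice
doubled = re.sub(f"({vowels})$", r"\1\1", "pata")                    # "pataa"
# Capture two consonants in a row and write them in reverse order
swapped = re.sub(f"({consonants})({consonants})", r"\2\1", "aksa")   # "aska"
# Group 0 is the entire match, here repeated
whole = re.sub(f"{consonants}{vowels}", r"\g<0>\g<0>", "pa")         # "papa"
print(doubled, swapped, whole)
```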

Replace with nothing

A rule with nothing on the right side of the > symbol deletes whatever the left side matches: [aeiou] > replaces every vowel listed in the brackets with nothing. This can mimic scripts such as Arabic and Hebrew, which usually leave most vowels unwritten.
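A minimal Python sketch of a deletion rule follows. The apply_rule helper is hypothetical: it assumes a rule is a single line split at the first > symbol, which is a simplification and not necessarily how Vulgar parses rules.

```python
import re

def apply_rule(rule: str, word: str) -> str:
    """Split 'pattern > replacement' at the first > and apply it to one word."""
    pattern, _, replacement = rule.partition(">")
    return re.sub(pattern.strip(), replacement.strip(), word)

# An empty right-hand side deletes every match
result = apply_rule("[aeiou] >", "kitab")   # "ktb"
print(result)
```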

Replace any character

The dot symbol . matches any character. The rule . > x would change every character in the word to an x. While this is probably not useful in isolation, it can be useful as part of larger patterns.

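Here is a small Python illustration of the dot, first alone and then inside a larger pattern. The sample words are invented, and the second pattern is just one possible use.

```python
import re

# On its own, the dot replaces every character
all_x = re.sub(".", "x", "pata")        # "xxxx"
# Inside a larger pattern: t, then any one character, then t
part = re.sub("t.t", "tt", "patata")    # "patta"
print(all_x, part)
```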

Dealing with stress symbols

If you want to make spelling rules that are sensitive to stress, you first need to check the Make spelling rules sensitive to stress symbol option. (By default, RegEx patterns are applied with the stress symbol already removed, so that you don't have to worry about it making your patterns more complicated.)

Let's say you want stressed a to turn into á, as in Spanish spelling. The stress symbol could come right before an a, as in ˈama; however, it could also come before any number of consonants and then an a, as in ˈdrama. To capture any number of consonants, put all the consonants in square brackets and follow them with the star symbol: [mdr]*. The star symbol means match any number of whatever is before it, including zero instances.

The consonants need to be wrapped inside a lookbehind group (?<=) so that you don't replace them, and the a goes outside the lookbehind so that you do replace it. Don't forget to include the stress symbol in the lookbehind too: (?<=ˈ[mdr]*)a > á.

Finally, you will need a second rule to replace stress symbols with nothing.
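The two-step process can be sketched in Python, with one caveat: Python's built-in re module only accepts fixed-width lookbehinds, so a lookbehind containing [mdr]* is rejected there. The sketch therefore swaps in a different technique, capturing the stress symbol and consonants in a group and writing them back with \1. The function name spell is made up for the example.

```python
import re

def spell(word: str) -> str:
    # Capture the stress symbol plus any consonants, and put them back before á
    word = re.sub("(ˈ[mdr]*)a", r"\1á", word)
    # Second rule: replace the stress symbol with nothing
    return re.sub("ˈ", "", word)

print(spell("ˈama"))     # "áma"
print(spell("ˈdrama"))   # "dráma"
print(spell("maˈma"))    # "mamá"
```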

Non-Latin alphabets

Custom orthography also supports all Unicode alphabets and scripts, such as Japanese, Chinese, Cyrillic, Georgian and even Unicode Emojis.

Order of rules

The order of your custom spelling rules matters. Vulgar applies the first spelling rule to a word, then applies the next rule to the result. This can be a problem if an IPA symbol that has already been replaced appears again, as part of a consonant cluster, in a later rule. For instance, the following rules are problematic:

ʃ > sh
tʃ > ch

The intent here is for /tʃ/ to change to ch. However, in a word such as /tʃar/, the first rule will find /ʃ/ and change the orthography to tshar. When Vulgar then moves to the second rule, it will fail to find /tʃ/. The easiest solution is to reverse the order of the rules:

tʃ > ch
ʃ > sh

Another solution is to use the single-character Unicode version if one exists, such as ʧ. Lookaround patterns, such as the negative lookbehind in (?<!t)ʃ > sh, may be another option.
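The effect of rule order can be reproduced with a short Python sketch that applies each rule to the output of the previous one. The apply_rules helper and the rule lists are illustrative assumptions, not Vulgar's own code.

```python
import re

def apply_rules(rules, word):
    """Apply each (pattern, replacement) pair in order to the running result."""
    for pattern, replacement in rules:
        word = re.sub(pattern, replacement, word)
    return word

problematic = [("ʃ", "sh"), ("tʃ", "ch")]
reordered = [("tʃ", "ch"), ("ʃ", "sh")]
# A negative lookbehind keeps the first rule from touching ʃ after t
guarded = [("(?<!t)ʃ", "sh"), ("tʃ", "ch")]

print(apply_rules(problematic, "tʃar"))   # "tshar"
print(apply_rules(reordered, "tʃar"))     # "char"
print(apply_rules(guarded, "tʃar"))       # "char"
print(apply_rules(guarded, "ʃar"))        # "shar"
```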