Searching with regular expressions

A regular expression is a text string that describes the pattern a word in the corpus must match in order to be returned by a search. Regular expressions are used in two places on this site: in simple search and in advanced search (CQL). This article explains everything you need to get the most out of regular expressions in those two places.

EXAMPLE

In both cases, this will find words starting with ceist, gceist or cheist.

Each regular expression is a combination of regular characters (a, b, c...) and special characters that have a specific function (*, ?, |...). In the rest of this article, each of these special characters and their function will be explained.

Any character: .

The dot stands for any single character. Use it when you don't care which character occupies a given position in the word you are looking for.

EXAMPLES

  • ma.
    Will find mac, mar, mag and others.

  • .éar
    Will find béar, géar, féar and others.

  • ma..acht
    Will find mallacht, maslacht, marfacht and others.
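
If you want to try the dot outside the site, the following Python sketch may help. It assumes that the search engine matches a pattern against a whole word, which re.fullmatch imitates; the helper name find_words and the word list are made up for illustration, and the site's own engine may differ in details.

```python
import re

def find_words(pattern, words):
    """Return the words that match the pattern in full (hypothetical helper)."""
    return [w for w in words if re.fullmatch(pattern, w)]

# Illustrative word list, not taken from the corpus.
words = ["mac", "mar", "mag", "ma", "béar", "géar", "féar", "éar",
         "mallacht", "maslacht", "marfacht"]

print(find_words("ma.", words))       # ['mac', 'mar', 'mag']
print(find_words(".éar", words))      # ['béar', 'géar', 'féar']
print(find_words("ma..acht", words))  # ['mallacht', 'maslacht', 'marfacht']
```

Note that ma and éar are not found: each dot must be filled by exactly one character.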

Choice of characters: [...]

If you put a list of characters in square brackets, that's the same as saying: I want one of these characters in this position.

EXAMPLES

  • ma[cr]
    Will find mac and mar but not mag.

  • [bg]éar
    Will find béar and géar but not féar.

  • ma[ls].acht
    Will find mallacht and maslacht but not marfacht.
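
The same three patterns can be checked with a small Python sketch. It assumes whole-word matching (imitated here with re.fullmatch); find_words and the word list are hypothetical, chosen only to mirror the examples above.

```python
import re

def find_words(pattern, words):
    """Return the words that match the pattern in full (hypothetical helper)."""
    return [w for w in words if re.fullmatch(pattern, w)]

# Illustrative word list, not taken from the corpus.
words = ["mac", "mar", "mag", "béar", "géar", "féar",
         "mallacht", "maslacht", "marfacht"]

print(find_words("ma[cr]", words))       # ['mac', 'mar']
print(find_words("[bg]éar", words))      # ['béar', 'géar']
print(find_words("ma[ls].acht", words))  # ['mallacht', 'maslacht']
```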

Negative choice: [^...]

If you put the ^ symbol at the beginning of the square brackets, that means: I want any character here except these ones.

EXAMPLES

  • ma[^cr]
    Will find mag and others, but not mac or mar.

  • [^bg]éar
    Will find féar and others, but not béar or géar.

  • ma[^ls].acht
    Will find marfacht and others, but not mallacht or maslacht.

The square brackets often contain just one character after the ^ symbol. In that case the group can be read as: I want any character here other than this one.

EXAMPLE

  • [^f]éar
    Will find béar and géar but not féar.
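
A short Python sketch of the negative choice, under the assumption that patterns are matched against whole words (re.fullmatch). The helper find_words and the word list are invented for this illustration.

```python
import re

def find_words(pattern, words):
    """Return the words that match the pattern in full (hypothetical helper)."""
    return [w for w in words if re.fullmatch(pattern, w)]

# Illustrative word list, not taken from the corpus.
words = ["mac", "mar", "mag", "béar", "géar", "féar",
         "mallacht", "maslacht", "marfacht"]

print(find_words("ma[^cr]", words))       # ['mag']
print(find_words("[^bg]éar", words))      # ['féar']
print(find_words("ma[^ls].acht", words))  # ['marfacht']
print(find_words("[^f]éar", words))       # ['béar', 'géar']
```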

Choice between two strings: |

If this symbol is placed between two words, it is equivalent to saying: I am looking for either of these words.

EXAMPLE

This is useful in multi-word searches.

EXAMPLE

The same symbol can also be placed between two parts of a word (rather than between two whole words). In that case, the alternatives must be enclosed in parentheses.

EXAMPLE
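
As an illustration, this Python sketch shows both uses of the | symbol. The patterns, the word list and the helper find_words are made up for this sketch and are not the site's own examples; whole-word matching is assumed and imitated with re.fullmatch.

```python
import re

def find_words(pattern, words):
    """Return the words that match the pattern in full (hypothetical helper)."""
    return [w for w in words if re.fullmatch(pattern, w)]

# Illustrative word list, not taken from the corpus.
words = ["béar", "géar", "féar", "mall", "slacht",
         "mallacht", "maslacht", "marfacht"]

# Choice between two whole words.
print(find_words("béar|géar", words))      # ['béar', 'géar']

# Choice between two parts of a word: the alternatives sit inside parentheses.
print(find_words("ma(ll|sl)acht", words))  # ['mallacht', 'maslacht']

# Without the parentheses, | splits the whole pattern in two.
print(find_words("mall|slacht", words))    # ['mall', 'slacht']
```

The last line shows why the grouping matters: without it, the pattern means "mall or slacht" rather than "ma, then ll or sl, then acht".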

Optional character: ?

If you put a question mark ? after a character, it means that that character is optional.

EXAMPLES

Enclosing a sequence of characters in parentheses followed by a question mark means that the entire sequence is optional.

EXAMPLE

Inside the sequence of characters enclosed in parentheses, the symbol | can be used to indicate there is a choice.

EXAMPLE
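
The three uses of the question mark can be sketched in Python as follows. The patterns and words are invented for illustration and are not the site's own examples; find_words is a hypothetical helper, and whole-word matching is assumed (re.fullmatch).

```python
import re

def find_words(pattern, words):
    """Return the words that match the pattern in full (hypothetical helper)."""
    return [w for w in words if re.fullmatch(pattern, w)]

# Illustrative word list, not taken from the corpus.
words = ["ceist", "gceist", "cheist", "ad", "abd", "acd", "abcd", "abcbcd"]

# Optional single character: the g may or may not be there.
print(find_words("g?ceist", words))   # ['ceist', 'gceist']

# Optional sequence: bc appears once or not at all.
print(find_words("a(bc)?d", words))   # ['ad', 'abcd']

# Optional choice: b, c, or nothing.
print(find_words("a(b|c)?d", words))  # ['ad', 'abd', 'acd']
```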

Any number of characters: *

If you put an asterisk * after a character, it means that there can be any number (including zero) of that character.

EXAMPLE

  • ab*c
    Would find ac, abc, abbc, abbbc...

If you put the asterisk after a sequence of characters enclosed in parentheses, you are saying that there can be any number (including zero) of that sequence.

EXAMPLE

  • a(bc)*d
    Would find ad, abcd, abcbcd, abcbcbcd...

The asterisk is often placed after a group in square brackets [...]. In that case, it means: any sequence of these characters.

EXAMPLE

  • a[bcd]*e
    Would find ae, abe, acde, accbbbcbe...

The asterisk can also be placed after a [^...] group. In that case, it means: a sequence of any characters other than these.

EXAMPLE

  • a[^bcd]*e
    Would find ae, afe, aghe, ajhgtggtfge...

The asterisk is often placed after the dot (.). In that case, it means: any number of any character.

EXAMPLE

  • a.*b
    Would find ab, ahb, afgb, adhfsjb...
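
All five asterisk patterns from this section can be tried together in a Python sketch. It assumes whole-word matching, imitated with re.fullmatch; find_words is a hypothetical helper and the candidate words are simply the ones listed in the examples above.

```python
import re

def find_words(pattern, words):
    """Return the words that match the pattern in full (hypothetical helper)."""
    return [w for w in words if re.fullmatch(pattern, w)]

# Candidate words taken from the examples in this section.
words = ["ac", "abc", "abbc", "ad", "abcd", "abcbcd", "ae", "abe",
         "acde", "accbbbcbe", "afe", "aghe", "ab", "ahb", "afgb"]

print(find_words("ab*c", words))       # ['ac', 'abc', 'abbc']
print(find_words("a(bc)*d", words))    # ['ad', 'abcd', 'abcbcd']
print(find_words("a[bcd]*e", words))   # ['ae', 'abe', 'acde', 'accbbbcbe']
print(find_words("a[^bcd]*e", words))  # ['ae', 'afe', 'aghe']
print(find_words("a.*b", words))       # ['ab', 'ahb', 'afgb']
```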

One or more characters: +

If you replace the asterisk with the plus sign +, it will have a slightly different meaning. Where the asterisk means "any number of characters, including zero", the plus means "any number, at least one".

EXAMPLES

  • ab+c
    Would find abc, abbc, abbbc... but not ac.

  • a(bc)+d
    Would find abcd, abcbcd, abcbcbcd... but not ad.

  • a[bcd]+e
    Would find abe, acde, accbbbcbe... but not ae.

  • a[^bcd]+e
    Would find afe, aghe, ajhgtggtfge... but not ae.

  • a.+b
    Would find ahb, afgb, adhfsjb... but not ab.
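
The difference between the asterisk and the plus shows up only in the "zero occurrences" case. The Python sketch below compares each pair of patterns on the shortest possible word; matches is a hypothetical helper, and whole-word matching is assumed (re.fullmatch).

```python
import re

def matches(pattern, word):
    """True if the whole word fits the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, word) is not None

# (asterisk pattern, plus pattern, word with zero repetitions)
pairs = [
    ("ab*c", "ab+c", "ac"),
    ("a(bc)*d", "a(bc)+d", "ad"),
    ("a[bcd]*e", "a[bcd]+e", "ae"),
    ("a[^bcd]*e", "a[^bcd]+e", "ae"),
    ("a.*b", "a.+b", "ab"),
]

for star, plus, word in pairs:
    # Every row prints True for the asterisk and False for the plus.
    print(word, matches(star, word), matches(plus, word))

# With at least one repetition, both versions match.
print(matches("ab*c", "abbc"), matches("ab+c", "abbc"))  # True True
```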

Character escaping

In the corpora on this site, punctuation marks (full stops, question marks and others) are treated as if they were words, making them searchable. They have grammar tags starting with F, and the punctuation symbol itself is the value of its word attribute.

EXAMPLE

All forms of the word maith followed by an exclamation mark can be found with this CQL query:

[lemma="maith"] [word="!"]

If you want to search for one of the punctuation symbols that have a special function in regular expressions – that is: . ? and others – you have to escape the symbols with a backslash \ first.

EXAMPLE

If you're trying to find every form of the word maith followed by a full stop, don't do it this way, because this query means: the word maith followed by any word consisting of one character:

[lemma="maith"] [word="."]

To search for the full stop, it must be escaped with a backslash. The backslash tells the search engine that it should understand whatever character follows it literally rather than as a special symbol:

[lemma="maith"] [word="\."]
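
The effect of the backslash can be checked with a Python sketch. It assumes that the site's engine treats the dot and the backslash the way Python's re module does and that patterns are matched against whole words; matches is a hypothetical helper. The sketch also uses re.escape to build the escaped pattern, which is a convenience of Python and not a feature of the site.

```python
import re

def matches(pattern, word):
    """True if the whole word fits the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, word) is not None

# An unescaped dot accepts any one-character word, including other punctuation.
print(matches(".", "!"))    # True
print(matches(".", "."))    # True

# An escaped dot accepts only the full stop itself.
print(matches(r"\.", "!"))  # False
print(matches(r"\.", "."))  # True

# re.escape adds the backslash for you when building a query string.
escaped = re.escape(".")
query = f'[lemma="maith"] [word="{escaped}"]'
print(query)                # [lemma="maith"] [word="\."]
```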