A regular expression is a text string that describes the pattern a word in the corpus must match in order to be returned by a search. Regular expressions are used in two places on this site: in simple search and in advanced search (CQL). This article tells you all you need to get the most out of regular expressions in those two places.
EXAMPLE
A regular expression in simple search:
(c|gc|ch)eist.*
A regular expression in advanced search (inside quotation marks):
[word="(c|gc|ch)eist.*"]
In both cases, this will find words starting with ceist, gceist or cheist.
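To try this pattern outside the site, here is a minimal Python sketch. It assumes that Python's re module is a reasonable stand-in for the site's search engine, which may use a different regex dialect. The word list is made up for illustration, and fullmatch() is used because the pattern has to cover the whole word.

```python
import re

# Pattern from the example above; fullmatch() requires the whole word to match.
pattern = re.compile(r"(c|gc|ch)eist.*")

# Hypothetical word list, not taken from any corpus.
words = ["ceist", "gceist", "cheist", "ceisteanna", "éist", "aceist"]
matches = [w for w in words if pattern.fullmatch(w)]
print(matches)  # ['ceist', 'gceist', 'cheist', 'ceisteanna']
```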
Each regular expression is a combination of regular characters (a, b, c...) and special characters that have a specific function (*, ?, |...). The rest of this article explains each of these special characters and its function.
.
The dot stands for any single character. Use it when you don't care which character appears at that position in the word you are looking for.
EXAMPLES
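A small Python sketch of the dot, using a hypothetical pattern and word list and assuming Python's re module behaves like the site's search engine for this case. The dot accepts exactly one character, so a word with nothing or with two characters in that position is rejected.

```python
import re

# Hypothetical pattern and words, chosen only to illustrate the dot.
pattern = re.compile(r"c.t")

words = ["cat", "cot", "c-t", "ct", "cart"]
results = {w: bool(pattern.fullmatch(w)) for w in words}
for word, ok in results.items():
    print(word, ok)
```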
[...]
If you put a list of characters in square brackets, that's the same as saying: I want one of these characters in this position.
EXAMPLES
ma[cr]
Will find mac and mar but not mag.
[bg]éar
Will find béar and géar but not féar.
ma[ls].acht
Will find mallacht and maslacht but not marfacht.
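The three examples above can be checked with a short Python sketch. The helper name found is an illustrative assumption, and Python's re module only approximates whatever engine the site uses.

```python
import re

def found(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

print(found(r"ma[cr]", ["mac", "mar", "mag"]))        # ['mac', 'mar']
print(found(r"[bg]éar", ["béar", "géar", "féar"]))    # ['béar', 'géar']
print(found(r"ma[ls].acht", ["mallacht", "maslacht", "marfacht"]))
```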
[^...]
If you put the ^ symbol at the beginning of the square brackets, that means: I want any character here except these ones.
EXAMPLES
ma[^cr]
Will find mag and others, but not mac or mar.
[^bg]éar
Will find féar and others, but not béar or géar.
ma[^ls].acht
Will find marfacht and others, but not mallacht or maslacht.
The square brackets often contain just one character. In that case it can be understood as: I want any character here other than this one.
EXAMPLE
[^f]éar
Will find béar and géar but not féar.
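One detail worth checking in code: a negated group still stands for exactly one character, so it does not match when the position is empty. The sketch below uses Python's re module as an assumed stand-in for the site's engine.

```python
import re

negated = re.compile(r"ma[^cr]")

# [^cr] must consume one character, so the bare "ma" is rejected as well.
checks = {w: bool(negated.fullmatch(w)) for w in ["mag", "mac", "mar", "ma"]}
print(checks)  # {'mag': True, 'mac': False, 'mar': False, 'ma': False}
```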
|
If this symbol is placed between two words, it is equivalent to saying: I am looking for either of these words.
EXAMPLE
fear|buachaill|duine
Will find fear, buachaill, duine.
This is useful in multi-word searches.
EXAMPLE
fear|buachaill|duine cróga
Will find fear cróga, buachaill cróga, duine cróga.
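The multi-word behaviour can be sketched in Python under one loud assumption: that the search applies each space-separated part of the query to one word in a row. The function find_phrases and the sample sentence are hypothetical, and the site's real query handling is not described here.

```python
import re

def find_phrases(query, tokens):
    """Match each space-separated part of the query against consecutive tokens."""
    parts = [re.compile(p) for p in query.split()]
    hits = []
    for i in range(len(tokens) - len(parts) + 1):
        window = tokens[i:i + len(parts)]
        # Every part must match its token in full.
        if all(p.fullmatch(t) for p, t in zip(parts, window)):
            hits.append(" ".join(window))
    return hits

# Hypothetical sentence, already split into words.
tokens = "bhí fear cróga agus duine cróga ann".split()
print(find_phrases("fear|buachaill|duine cróga", tokens))
# ['fear cróga', 'duine cróga']
```

Note that a plain regex engine given the whole query as one pattern would read it differently, as a choice between fear, buachaill and "duine cróga", which is why the sketch splits the query first.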
The same symbol can also be placed between two parts of a word (rather than between two whole words). The whole thing must be enclosed in parentheses.
EXAMPLE
ma(sl|ll|rf)acht
Will find maslacht, mallacht, marfacht.
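Why the parentheses matter can be seen by comparing the pattern with and without them. This is a sketch in Python's re module, assumed here to treat | and parentheses the same way the site does. The extra words in the list are made up.

```python
import re

grouped = r"ma(sl|ll|rf)acht"   # the choice applies only inside the parentheses
ungrouped = r"masl|ll|rfacht"   # the choice splits the whole pattern in three

words = ["maslacht", "mallacht", "marfacht", "masl", "ll", "rfacht"]
with_group = [w for w in words if re.fullmatch(grouped, w)]
without_group = [w for w in words if re.fullmatch(ungrouped, w)]
print(with_group)     # ['maslacht', 'mallacht', 'marfacht']
print(without_group)  # ['masl', 'll', 'rfacht']
```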
?
If you put a question mark ? after a character, it means that that character is optional.
EXAMPLES
Enclosing a sequence of characters in parentheses followed by a question mark means that the entire sequence is optional.
EXAMPLE
(ph)?éist
Will find phéist, éist.
Inside the sequence of characters enclosed in parentheses, the symbol | can be used to indicate there is a choice.
EXAMPLE
(p|ph|bp)?éist
Will find péist, phéist, bpéist, éist.
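Both uses of the question mark can be tried in a short Python sketch. The pattern ab?c and the word géist are invented for illustration, and Python's re module is only assumed to agree with the site's engine.

```python
import re

def found(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

# Optional single character (hypothetical pattern): zero or one b, never two.
print(found(r"ab?c", ["ac", "abc", "abbc"]))  # ['ac', 'abc']

# Optional group with a choice inside, as in the example above.
print(found(r"(p|ph|bp)?éist", ["péist", "phéist", "bpéist", "éist", "géist"]))
```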
*
If you put an asterisk * after a character, it means that there can be any number (including zero) of that character.
EXAMPLE
ab*c
Would find ac, abc, abbc, abbbc...
If you put the asterisk after a sequence of characters enclosed in parentheses, you are saying that there can be any number (including zero) of that sequence.
EXAMPLE
a(bc)*d
Would find ad, abcd, abcbcd, abcbcbcd...
The asterisk is often placed after a group in square brackets [...]. In that case, it means: any sequence of these characters.
EXAMPLE
a[bcd]*e
Would find ae, abe, acde, accbbbcbe...
The asterisk can also be placed after a [^...] group. In that case, it means: a sequence of any characters other than these.
EXAMPLE
a[^bcd]*e
Would find ae, afe, aghe, ajhgtggtfge...
The asterisk is often placed after the dot. In that case, it means: any number of any character.
EXAMPLE
a.*b
Would find ab, ahb, afgb, adhfsjb...
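As a quick self-check, the asterisk patterns from this section can be run against the words they are said to find. The sketch assumes Python's re module handles * the way the site's engine does. The dictionary name examples is illustrative.

```python
import re

# Each asterisk pattern from this section with the words it is said to find.
examples = {
    r"ab*c": ["ac", "abc", "abbc", "abbbc"],
    r"a(bc)*d": ["ad", "abcd", "abcbcd", "abcbcbcd"],
    r"a[bcd]*e": ["ae", "abe", "acde", "accbbbcbe"],
    r"a[^bcd]*e": ["ae", "afe", "aghe", "ajhgtggtfge"],
    r"a.*b": ["ab", "ahb", "afgb", "adhfsjb"],
}

# Collect any listed word that its pattern fails to match in full.
missed = {
    pattern: [w for w in words if not re.fullmatch(pattern, w)]
    for pattern, words in examples.items()
}
for pattern, words in missed.items():
    print(pattern, "misses:", words)
```

Every list printed should be empty, meaning each pattern finds all of its listed words.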
+
If you replace the asterisk with the plus sign +, it will have a slightly different meaning. Where the asterisk means "any number of characters, including zero", the plus means "any number, at least one".
EXAMPLES
ab+c
Would find abc, abbc, abbbc... but not ac.
a(bc)+d
Would find abcd, abcbcd, abcbcbcd... but not ad.
a[bcd]+e
Would find abe, acde, accbbbcbe... but not ae.
a[^bcd]+e
Would find afe, aghe, ajhgtggtfge... but not ae.
a.+b
Would find ahb, afgb, adhfsjb... but not ab.
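The only difference between the two symbols is the zero-repetition case, which the following Python sketch isolates. It pairs each asterisk pattern with its plus counterpart and tests the one word that has no repetition at all. Python's re module is assumed to behave like the site's engine here.

```python
import re

# (asterisk pattern, plus pattern, word with zero repetitions)
pairs = [
    (r"ab*c", r"ab+c", "ac"),
    (r"a(bc)*d", r"a(bc)+d", "ad"),
    (r"a.*b", r"a.+b", "ab"),
]

outcomes = [
    (word, bool(re.fullmatch(star, word)), bool(re.fullmatch(plus, word)))
    for star, plus, word in pairs
]
for word, star_ok, plus_ok in outcomes:
    print(word, "asterisk:", star_ok, "plus:", plus_ok)
```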
In the corpora on this site, punctuation marks (full stops, question marks and others) are treated as if they were words, making them searchable. They have grammar tags starting with F, and the punctuation symbol itself is the value of its word attribute.
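What "treated as if they were words" means can be pictured with a toy tokenizer in Python. This is only a sketch: the function tokenize and its splitting rule are assumptions for illustration, not the tokenizer actually used to build the corpora.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ grabs a run of letters or digits; [^\w\s] grabs one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("go maith!"))  # ['go', 'maith', '!']
```

In a token list like this, the exclamation mark sits in its own position, just as a word would, so a query can ask for it directly.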
EXAMPLE
All forms of the word maith followed by an exclamation mark can be found with this CQL query:
If you want to search for one of the punctuation symbols that have a special function in regular expressions – that is: ., ? and others – you have to escape the symbols with a backslash \ first.
EXAMPLE
If you're trying to find every form of the word maith followed by a full stop, don't do it this way, because this means the word maith followed by any word consisting of one character:
To search for the full stop, it must be escaped with a backslash. The backslash tells the search engine that it should understand whatever character follows it literally rather than as a special symbol:
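As a rough illustration in Python rather than CQL, the difference between the unescaped and the escaped dot looks like this. Python's re module is used as an assumed stand-in for the site's engine. re.escape() is a Python convenience and has no counterpart mentioned in this article.

```python
import re

# Unescaped dot: any single character, so a one-letter word matches.
unescaped_on_letter = bool(re.fullmatch(r".", "a"))
# Escaped dot: only a literal full stop matches.
escaped_on_letter = bool(re.fullmatch(r"\.", "a"))
escaped_on_stop = bool(re.fullmatch(r"\.", "."))
print(unescaped_on_letter, escaped_on_letter, escaped_on_stop)  # True False True

# re.escape() adds the backslash for you.
print(re.escape("?"))  # \?
```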