About this website

National Corpus of Irish

Corpas Náisiúnta na Gaeilge (CNG) is a balanced corpus of 100 million words, both written and spoken. All texts compiled for the corpus date from the 2000-2024 period, with the intention of being representative of contemporary Irish. It contains a large variety of genres, sources, and dialects, weighted so that no one person, genre, or particular work has an excessive influence on the corpus as a whole. CNG can be used to investigate general questions of language, for example how often a particular word or phrase is used, or which preposition is most commonly used with a particular verb. Such questions often occur to us while learning, writing or translating, and a search in CNG will provide reliable evidence that may lead to a satisfactory answer. In addition, computer scientists can process the corpus data in various ways to generate language models or make frequency lists.
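As a rough illustration of what a frequency list is, the sketch below counts word forms in a short made-up sample. It does not use CNG data or any CNG interface; the function name frequency_list, the sample sentence and the very simple tokenisation rule are all assumptions made for the example.

```python
import re
from collections import Counter


def frequency_list(text):
    """Count how often each lower-cased word form occurs in `text`.

    Assumption: a word is any run of letters or digits. This splits forms
    such as "d'ól" or "t-uisce" into two tokens, so a real Irish tokeniser
    would need more careful rules.
    """
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


# Made-up sample text, not taken from the corpus.
sample = "Tá an lá go maith. Tá an aimsir go breá."
freq = frequency_list(sample)

# Print the three most frequent word forms with their counts.
for word, count in freq.most_common(3):
    print(word, count)
```

The same idea scales to a full corpus by feeding in the text of every document, although a production frequency list would normally also group inflected forms under a single headword.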

Balance

The corpus was balanced in two main ways: 1) the original material was collected from as many sources and genres as possible and 2) the word count of certain text types (e.g. legislative content) was adjusted to avoid the overrepresentation of certain sources and genres in the search results. However, it is worth noting that balancing a corpus is not a precise mathematical process but rather a measured effort to create a corpus that is representative of the state of the language. There are several reasons for this: certain types of text are hidden from the researcher because they are personal or sensitive (e.g. conversations among friends and family, certain religious rites), other types of text are costly to process (speech transcription) and, in some cases, copyright holders are not willing to share certain content. In addition to these factors, which apply to every language, a minority language faces further issues surrounding domain loss: there are many spheres of activity that are rarely conducted through the medium of Irish.
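The second balancing step can be pictured with a small sketch. The text above does not say how the word counts were adjusted, so the approach here (keeping documents of one genre only until a word ceiling is reached) is an assumption chosen for illustration, as are the names Doc and cap_genre, the titles and all the figures.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    title: str
    genre: str
    words: int


def cap_genre(docs, genre, max_words):
    """Return `docs` with the given genre limited to `max_words` in total.

    Documents of other genres are always kept. A document of the capped
    genre is kept only if adding it keeps the genre's running total at or
    below `max_words`; otherwise it is skipped.
    """
    kept, total = [], 0
    for doc in docs:
        if doc.genre != genre:
            kept.append(doc)
        elif total + doc.words <= max_words:
            kept.append(doc)
            total += doc.words
    return kept


# Hypothetical documents and word counts, for illustration only.
docs = [
    Doc("Act A", "Law", 60_000),
    Doc("Act B", "Law", 50_000),
    Doc("Novel", "Literature", 80_000),
    Doc("Act C", "Law", 30_000),
]
balanced = cap_genre(docs, "Law", 100_000)
print([d.title for d in balanced])
```

In this toy run the second Act is left out because it would push the Law total past the ceiling, while the shorter third Act still fits. A real project might instead sample documents at random or trim long texts.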

Structure

The pie chart below shows a breakdown of CNG by medium (written/spoken):

[Figure: Mediums (pie chart of CNG by medium, written/spoken)]

It is clear from the pie chart that there is much less spoken material in CNG than written material. The simple reason for this is that there was a limited amount of transcribed spoken material available. The project was able to draw on a collection of transcribed texts created by Foras na Gaeilge’s lexicography department as part of the New English-Irish Dictionary project, and project researchers newly transcribed a number of TV programmes and other recordings. However, manual transcription is a slow and costly process and the project could only afford to do a limited amount. Machine learning is expected to solve this problem in the coming years, but speech-to-text technology for Irish is not yet accurate enough to be a feasible option for the project.

A breakdown of CNG according to genre can be seen below:

[Figure: Genres (chart of CNG by genre)]

The graph above shows that two genres, News and Opinion, are significantly larger than the others. News includes the main Irish-language news media (Tuairisc, Foinse, Gaelscéal...), while Opinion contains both magazines (Comhar, Feasta, An tUltach...) and blogs. Both genres cover a wide range of topics (sport, politics, fashion, arts etc.) and are therefore less monolithic than they appear. Further 'topic' metadata is due to be added to these genres at the document level during the next phase of the project in 2025.

Examples of the different genres are given in the table below:

Genre          Example(s)
Academic       Journals, dissertations
Weather        Aimsir TG4
Folklore       Bailiúchán Béaloidis Árann, Béaloideas Beo
Corporate      Annual reports from State bodies, language plans
Law            Acts of the Oireachtas, EU Directives
Information    Wikipedia, https://www.citizensinformation.ie/
Forum          Gaeilge Amháin on Facebook, web forums
Literature     Publications from Cló Iar-Chonnacht, Cois Life, Leabhar Breac etc.
Lifestyle      Material from NÓS, various blogs
News           Material from Tuairisc, Foinse, Gaelscéal etc.
Education      Exam papers, textbooks, educational resources
Children       Cartoons from Cúla4
Parliament     Transcripts of Dáil debates
Religion       Prayers, religious publications
Entertainment  Ros na Rún TG4, Iris Aniar RnaG
Technical      Directions for installation of software etc.
Opinion        Material from Comhar, Feasta etc.