Corpas Náisiúnta na Gaeilge (CNG) is a balanced corpus of 100 million words of written and spoken Irish. All texts compiled for the corpus date from the 2000-2024 period, with the intention of being representative of contemporary Irish. It contains a large variety of genres, sources, and dialects, weighted so that no one person, genre, or particular work has an excessive influence on the corpus as a whole. CNG is intended for investigating general questions of language, for example how often a particular word or phrase is used, or which preposition is most commonly used with a particular verb. Such questions often occur to us while learning, writing or translating, and a search in CNG will provide reliable evidence that may lead to a satisfactory answer. In addition, computer scientists can process the corpus data in various ways, for example to generate language models or to make frequency lists.
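As an illustration of the frequency lists mentioned above, the following is a minimal Python sketch that counts word forms in plain text. The function name `frequency_list` and the sample sentences are hypothetical and are not part of any CNG tooling. The tokenisation is deliberately naive: it would split hyphenated or apostrophised forms such as "t-am" or "d'ith" into separate tokens.

```python
import re
from collections import Counter

def frequency_list(texts, top_n=None):
    """Count lowercased word tokens across an iterable of strings.

    Returns (word, count) pairs, most frequent first.
    """
    counts = Counter()
    for text in texts:
        # \w+ matches runs of Unicode word characters, so accented
        # vowels (á, é, í, ó, ú) stay inside a token.
        counts.update(re.findall(r"\w+", text.lower()))
    return counts.most_common(top_n)

# Hypothetical sample input, for illustration only.
sample = ["Tá an aimsir go maith inniu.", "Tá an lá go breá."]

for word, count in frequency_list(sample, top_n=5):
    print(word, count)
```

A real frequency list for Irish would probably also need to handle initial mutations and lemmatisation, which this sketch ignores.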
Balance
The corpus was balanced in two main ways: 1) the original material was collected from as many sources and genres as possible and 2) the word count of certain text types (e.g. legislative content) was adjusted to avoid the overrepresentation of certain sources and genres in the search results. However, it is worth noting that balancing a corpus is not a precise mathematical process but rather a measured effort to create a corpus that is representative of the state of the language. There are several reasons for this: certain types of text are hidden from the researcher because they are personal or sensitive (e.g. conversations among friends and family, certain religious rites), other types of text are costly to process (speech transcription) and, in some cases, copyright holders are not willing to share certain content. In addition to these factors, which apply to any language, there are further issues surrounding domain loss in the context of a minority language: there are many spheres of activity that are rarely conducted through the medium of Irish.
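The second balancing step, adjusting the word count of an overrepresented text type, could be approached in several ways. The sketch below shows one possibility under stated assumptions: documents are dictionaries with hypothetical `genre` and `text` keys, and a genre is capped by randomly keeping whole documents until a word limit is reached. The source does not describe how CNG's adjustment was actually carried out, so this should not be read as the project's method.

```python
import random

def cap_genre(documents, genre, max_words, seed=0):
    """Downsample one genre to at most `max_words` words.

    Whole documents are kept or dropped; documents of other
    genres are returned unchanged.
    """
    rng = random.Random(seed)  # fixed seed for a repeatable selection
    capped = [d for d in documents if d["genre"] == genre]
    others = [d for d in documents if d["genre"] != genre]
    rng.shuffle(capped)

    kept, total = [], 0
    for doc in capped:
        n_words = len(doc["text"].split())  # whitespace word count
        if total + n_words <= max_words:
            kept.append(doc)
            total += n_words
    return others + kept

# Hypothetical toy data, for illustration only.
docs = [
    {"genre": "Law", "text": "a b c"},
    {"genre": "Law", "text": "d e f"},
    {"genre": "News", "text": "g h"},
]
balanced = cap_genre(docs, "Law", max_words=4)
```

Keeping whole documents rather than truncating them preserves complete texts for searching, at the cost of landing slightly under the cap.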
Structure
The pie chart below shows a breakdown of CNG by medium (written/spoken):
It is clear from the pie chart that there is much less spoken material in CNG than written material. The simple reason for this is that only a limited amount of transcribed spoken material was available. The project was able to draw on a collection of transcribed texts created by Foras na Gaeilge’s lexicography department as part of the New English-Irish Dictionary project, and project researchers newly transcribed a number of TV programmes and other recordings. However, manual transcription is a slow and costly process, and the project could only afford to do a limited amount. Machine learning is expected to solve this problem in the coming years, but speech-to-text technology for Irish has not yet reached a level of accuracy that would make it a feasible option for the project.
A breakdown of CNG according to genre can be seen below:
Two genres on the above graph, News and Opinion, are clearly much larger than the others. News includes the main Irish-language news media (Tuairisc, Foinse, Gaelscéal...) while Opinion contains both magazines (Comhar, Feasta, An tUltach...) and blogs. Both genres cover a multitude of different topics (sport, politics, fashion, arts etc.) and are therefore not as monolithic as they seem. It is planned to add further 'topic' metadata to these genres at the document level during the next phase of the project in 2025.
Examples of the different genres are given in the table below:
Genre | Example(s) |
---|---|
Academic | Journals, dissertations |
Weather | Aimsir TG4 |
Folklore | Bailiúchán Béaloidis Árann, Béaloideas Beo |
Corporate | Annual reports from State bodies, language plans |
Law | Acts of the Oireachtas, EU Directives |
Information | Wikipedia, https://www.citizensinformation.ie/ |
Forum | Gaeilge Amháin on Facebook, web forums |
Literature | Publications from Cló Iar-Chonnacht, Cois Life, Leabhar Breac etc. |
Lifestyle | Material from NÓS, various blogs |
News | Material from Tuairisc, Foinse, Gaelscéal etc. |
Education | Exam papers, textbooks, educational resources |
Children | Cartoons from Cúla4 |
Parliament | Transcripts of Dáil debates |
Religion | Prayers, religious publications |
Entertainment | Ros na Rún TG4, Iris Aniar RnaG |
Technical | Directions for installation of software etc. |
Opinion | Material from Comhar, Feasta etc. |