This is a project to develop a major national corpus of contemporary Irish that combines written and spoken material. The site also hosts subsidiary and related specialised corpora and will be a central resource for corpus-based research on the Irish language.
Compilation of this corpus is being undertaken by the Gaois research group, Fiontar & Scoil na Gaeilge, DCU. The project is being funded for the period 2022-2025 by the Department of Tourism, Culture, Arts, Gaeltacht, Sport and Media, with support from the National Lottery.
What is a corpus?
A corpus is a large collection of texts used for linguistic research. A corpus can include different types of text such as books, news articles, internet texts (e.g. social media posts), transcripts of speech, etc. A corpus can be balanced or unbalanced. An unbalanced corpus is a collection of texts that have been gathered without attempting to be representative of the language as a whole. That is not necessarily a bad thing (Corpas na Gaeilge Labhartha and Corpas na Gaeilge Scríofa are unbalanced corpora), but it can lead to the over-representation of certain types of text and the under-representation of others. A balanced corpus, on the other hand, is an attempt to provide an accurate and measured representation of the target language as a whole, and encompasses a wide range of media, genres and sources in proportions that are broadly consistent with their distribution in the language itself. The creation of such a corpus requires the inclusion of representative examples of different types of text (spoken, written, internet) spread across a wide range of genres (e.g. news, literature, religion, law). This has been the approach in the case of the National Corpus of Irish, the main corpus of this project.
Project stages
More than 150 million words of Irish were collected from over 170,000 documents over the course of the project. A copyright agreement was established with the owners of the various texts, where necessary, on the understanding that only short extracts of any text would appear in the search results. All of this data then had to be cleaned and formatted before it could be included in the corpus.
Each word of the corpus was then tagged for part of speech. Tagging is an automated process in which corpus material is analysed and relevant information (in this case grammatical information) is added to certain elements. As a result of this process, a search can distinguish between ‘leis’ in the sense ‘also’ and ‘leis’ as a preposition, for example, not to mention the other meanings of that word. More complex searches can also take advantage of this feature, such as ‘give me examples of the verb ‘cuir’ followed by a preposition’. This allows a teacher or linguist, for example, to compile a comprehensive list of examples.
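The two kinds of search described above can be illustrated with a minimal Python sketch. It is not the query syntax or the code used by this site's search engine. The `Token` structure, the tag labels (`VERB`, `ADP`, `ADV`), the two sample sentences and both function names are assumptions made purely for illustration.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str   # surface form as it appears in the text
    lemma: str  # dictionary headword
    pos: str    # part-of-speech tag (illustrative labels, not the project's tag set)


# Two hand-tagged sample sentences (hypothetical data).
corpus: List[List[Token]] = [
    # "Cuir leis an liosta." -> 'leis' is a preposition here
    [Token("Cuir", "cuir", "VERB"), Token("leis", "le", "ADP"),
     Token("an", "an", "DET"), Token("liosta", "liosta", "NOUN")],
    # "Tháinig sí leis." -> 'leis' means 'also' here
    [Token("Tháinig", "tar", "VERB"), Token("sí", "sí", "PRON"),
     Token("leis", "leis", "ADV")],
]


def find_word(corpus, word, pos):
    """Return every token with the given surface form and part of speech."""
    return [t for sentence in corpus for t in sentence
            if t.word.lower() == word and t.pos == pos]


def verb_then_preposition(corpus, verb_lemma):
    """Return (verb, preposition) pairs where the verb is immediately followed by a preposition."""
    hits = []
    for sentence in corpus:
        # Pair each token with the one directly after it.
        for current, following in zip(sentence, sentence[1:]):
            if (current.lemma == verb_lemma and current.pos == "VERB"
                    and following.pos == "ADP"):
                hits.append((current.word, following.word))
    return hits
```

Calling `find_word(corpus, "leis", "ADV")` returns only the ‘also’ reading, while `verb_then_preposition(corpus, "cuir")` matches every inflected form of ‘cuir’ because the comparison is made on the lemma rather than on the surface form.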
Finally, it was necessary to decide on the best way to present all this data to the public. The main goal of the project was Corpas Náisiúnta na Gaeilge, a balanced corpus representative of contemporary Irish. However, it was also understood that some users would prefer to focus only on spoken Irish, and Corpas na Gaeilge Labhartha was designed to cater for that cohort. Corpas Monatóireachta na Gaeilge is a resource for identifying patterns and trends in the language over time, and will be added to on an annual basis in the future to keep it up to date. Lastly, it was recognised that Corpas na Gaeilge Comhaimseartha has a loyal base of users who use it to find examples of the ‘correct’ or ‘normal’ use of written Irish. Corpas na Gaeilge Scríofa is a successor to that resource and offers more content and much more powerful search functionality than its predecessor. More information on the four corpora can be found elsewhere in these information pages.
The corpus resources on this site are designed to provide a clean and easy-to-understand interface for the average user looking for a simple search, along with a more complex interface for the expert that enables more specific queries. The interface is fully bilingual and available to all, without registration or subscription. The technology that enables the user to search and query the corpus is based on NoSketchEngine and NoSketchEngine Docker.
Every effort has been made to design and build this site so that it is accessible to all visitors regardless of ability. The site has been built to comply with the WCAG 2.0 AAA standard, an international accessibility standard issued by the World Wide Web Consortium (W3C).
In order to benefit from the best possible user experience, it is recommended that you use a modern, recently updated browser. corpas.ie can be used without JavaScript being enabled in the browser. It should be noted, however, that the interactive maps used on logainm.ie cannot be displayed when JavaScript is disabled.
If you believe a portion of this website is not fully accessible, or that certain aspects of the user experience could be improved, we would welcome your feedback via e-mail to gaois@dcu.ie.
This information applies to the corpas.ie public website.
What usage data we collect
We collect certain usage data with the aid of services such as Plausible while users browse the website. We use Plausible to record information about where our users come from and what they do while on this site. Plausible collects and stores technical details about the browser and the computer used to visit the site. We use a cookie to store the user’s preferred language. The cookie is not linked to any personal data.
How we use this usage data
We use the data collected by Plausible to construct aggregate reports on user numbers, location and behaviour. These reports inform our future development plans. This data is not personal data, as defined by EU GDPR, and it does not allow us to identify individual users.
We reserve the right to report and/or publish aggregated usage metrics.
How long we store this usage data for
As this data is anonymised, we store it indefinitely. This allows us to chart the development of our user base over the lifetime of the project.
What search data we collect
Every time you search the site, we store the following data:
How we use this search data
Counting the number of searches carried out by users of our website constitutes the best measure of the success of our project. In addition, we use 1–4 to help us monitor our performance and improve our service. For example, if we notice that a particular search term is causing problems for our system, we will endeavour to apply a fix. Or if we notice that all searches are taking too long, we will investigate if something needs to be optimised or if our services need more computing power.
We use 5 (i.e. ‘Your device IP address’) to verify that the search is being carried out by a person and not a computer program such as a web scraper. For example, if we notice that large numbers of searches are being carried out by a particular IP address, we can exclude those searches from our search statistics, if deemed necessary. We never use this data to identify individual users.
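As a rough illustration of the exclusion described above, the following Python sketch counts searches while leaving out any IP address whose volume exceeds a threshold. The function name, the threshold value and the sample addresses are hypothetical, and the policy itself does not say how or at what volume searches are excluded.

```python
from collections import Counter
from typing import Iterable


def count_searches_excluding_heavy_ips(search_ips: Iterable[str],
                                       max_per_ip: int = 1000) -> int:
    """Count searches, ignoring every IP address with more than max_per_ip searches.

    search_ips holds one IP address per recorded search.
    max_per_ip is an assumed cut-off chosen only for this example.
    """
    per_ip = Counter(search_ips)
    return sum(n for n in per_ip.values() if n <= max_per_ip)


# Sample data using addresses reserved for documentation.
sample = ["192.0.2.1"] * 3 + ["192.0.2.2"] * 10
total = count_searches_excluding_heavy_ips(sample, max_per_ip=5)
```

Only the aggregate count is produced here, which is in keeping with the statement that the data is never used to identify individual users.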
We reserve the right to report and/or publish aggregated search data.
How long we store this search data for
As 1–4 is not personal data, we store it indefinitely. We delete IP addresses associated with searches more than 1 year old on the 1st day of every calendar month.
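The retention rule above could be sketched in Python as follows. This assumes a hypothetical `SearchRecord` structure and a job that runs on the 1st day of each month. Both are illustrative and do not describe the actual storage or scheduling used by the site.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class SearchRecord:
    searched_at: datetime
    ip_address: Optional[str]  # None once the address has been deleted


def one_year_before(moment: datetime) -> datetime:
    """Return the same date and time one calendar year earlier."""
    try:
        return moment.replace(year=moment.year - 1)
    except ValueError:
        # 29 February has no counterpart in the previous year; fall back to the 28th.
        return moment.replace(year=moment.year - 1, day=28)


def purge_old_ips(records: List[SearchRecord], now: datetime) -> List[SearchRecord]:
    """Blank the IP address of every search more than one year old, keeping the rest of the record."""
    cutoff = one_year_before(now)
    return [replace(r, ip_address=None) if r.searched_at < cutoff else r
            for r in records]


# Example run for a job executed on 1 March 2025 (sample data only).
records = [
    SearchRecord(datetime(2024, 2, 15), "192.0.2.1"),  # more than a year old
    SearchRecord(datetime(2024, 6, 1), "192.0.2.2"),   # less than a year old
]
purged = purge_old_ips(records, now=datetime(2025, 3, 1))
```

Only the IP address field is cleared. The remaining search data is kept, matching the statement that the non-personal data is stored indefinitely.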
Consent
By accessing our website you agree to abide by all policies and practices outlined on this page.
Contact
If you have any queries in relation to this policy you can contact us at logainm@dcu.ie.