Greek Corpus 20

The project involves the compilation and analysis of a diachronic Corpus of Greek of the 20th Century (Greek Corpus 20). Unlike other languages, Greek has not so far benefited extensively from the major advances in the wider research field of linguistic corpus development and exploitation. The project is intended to fill this gap by developing a 20-million-word corpus of Greek for the first nine decades of the 20th century, to be integrated with the existing 30-million-word Corpus of Greek Texts (CGT), which includes texts from the 1990s onwards.

Research area and objectives:

Corpus linguistics has considerably improved the description of language by allowing access to large bodies of authentic texts. It has also had a broad range of applications in lexicography, the writing of grammars, translation, language teaching, the study of language and ideology, lexical semantics, media studies etc. (see, among others, Hunston 2002: 13-14, Meyer 2002: 1-29, Baker et al. 2006). Unlike other languages, Greek has not benefited as much as might be expected from the development of corpus linguistics. Only two large Greek corpora have been created so far: the Hellenic National Corpus (HNC), currently of 47 million words, which contains texts published from 1976 to 2007, and the Corpus of Greek Texts (CGT) of 30 million words, which contains texts from two decades (1990 to 2010). (For more details, see Hatzigeorgiu et al. 2001 for HNC and Goutsos 2010 for CGT.) Both are synchronic corpora, in the sense that they offer a view of one specific period of the Greek language rather than of its development over time.

Greek Corpus 20 is intended to fill this gap by developing a 20 million word corpus of Greek for the first nine decades of the 20th century, to be integrated with the existing 30 million word Corpus of Greek Texts (CGT) that includes texts from the 1990s onwards. The corpus will be designed with the purpose of studying areas of recent grammatical and lexical change in the Greek language through the analysis of authentic texts. (For the narrow definition of recent change used here, see Mair 2009: 1120, Davies 2011, 2012).

Historical or diachronic corpora have been compiled or are under preparation for other languages or language varieties. Examples include the Helsinki Corpus of English Texts, which covers Old, Middle and Early Modern English; A Representative Corpus of Historical English Registers (ARCHER), which contains British and American English texts from 1650 to the present; the four corpora Brown, Frown, LOB and FLOB, which can together supply evidence for change in the two varieties of English between 1961 and 1991-1992; COHA and COCA for American English; and DiaCoris for Italian. (For more details on existing diachronic corpora, see Onelli et al. 2006, Beal et al. 2007, Mair 2009, Baker 2010: 57 ff., Partington 2010, Aarts et al. 2013.) Greek has not had a similar diachronic corpus for a number of reasons relating to the collection of data. Extra-linguistic factors, such as the socio-historical background in Greece of the 20th century, can account for the lack of data or the occurrence of minimal data for several periods. In addition, linguistic factors such as the persisting diglossia, which is connected with important socio-historical events throughout the 20th century, complicate issues of data collection and analysis.

The project will initially explore the issues relating to the availability of data, since these must be resolved before a pilot diachronic corpus of Greek of the 20th century can be designed. This pilot corpus will form the basis for Greek Corpus 20, which will be used in conjunction with the Corpus of Greek Texts in order to explore linguistic change at various levels. Following the compilation of the corpus, the research will focus on particular aspects of linguistic change, such as semantic, morphological and orthographic neologisms (cf. Fischer 1998: 10), productive morphology across decades (cf. Baayen & Renouf 1996), and syntactic and vocabulary change. The research team will include a post-doctoral researcher and postgraduate and undergraduate students, and will co-operate with universities in Greece and abroad so as to enhance the visibility of the Greek language and of research on it. Our intention is to produce high-quality, innovative research, which will put Greek on the map of current corpus linguistics through both the exemplary compilation of a diachronic corpus and the promotion of international co-operation in its design and evaluation.

Aims of the project:

The main aims of the project are the following:

1. to examine the issues involved in the compilation of a diachronic corpus of Greek of the 20th century. Issues which have to be explored and resolved include:

  • the availability of data across decades,
  • the availability of text types (e.g. spoken data such as conversations may be difficult to find) and the continuity of text types (e.g. several text types may only be found in certain decades),
  • the issue of representativeness, in particular dimensions such as the ratio of data included in each time period,
  • wider sociolinguistic issues such as the role of diglossia, the formal and demotic registers of Greek, different spelling systems (“polytonic”-“monotonic”) etc.
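The spelling-system issue in the last point can be made concrete with a minimal Python sketch of one possible normalization step: mapping polytonic diacritics onto the single monotonic accent so that word forms from different decades can be compared. This is an illustrative assumption, not the project's stated method. The function name `to_monotonic` is hypothetical, and the sketch deliberately ignores the monotonic rule that most monosyllables carry no accent at all.

```python
import unicodedata

# Combining marks used only in polytonic spelling; they are dropped entirely.
DROP = {"\u0313", "\u0314", "\u0345"}  # psili, dasia, iota subscript
# Polytonic accents that collapse into the single monotonic accent (tonos).
TO_TONOS = {"\u0300", "\u0342"}        # varia (grave), perispomeni
TONOS = "\u0301"                       # combining acute, used as the tonos


def to_monotonic(text: str) -> str:
    """Hypothetical helper: reduce polytonic diacritics to monotonic ones.

    Limitation: accents on monosyllables are kept, although standard
    monotonic orthography would usually omit them.
    """
    # Decompose each letter into base character + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = []
    for ch in decomposed:
        if ch in DROP:
            continue
        kept.append(TONOS if ch in TO_TONOS else ch)
    # Recompose into precomposed monotonic letters where they exist.
    return unicodedata.normalize("NFC", "".join(kept))
```

Keeping the original spelling alongside the normalized form would probably be preferable in practice, since the choice of spelling system is itself evidence for the sociolinguistic questions listed above.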

The design of the corpus is aimed at including both spoken and written texts. The analysis of spoken texts is crucial for the study of language change, because spoken language tends to be less conservative and more open to innovation than written language. However, as is often stated in the relevant literature (see e.g. Baker 2010: 57), spoken data are more difficult to access and more time-consuming to manage. Sources of spoken data could include the archive of the Greek state radio and television, the archives of parliamentary speeches or archives of academic speeches. Written texts can be found more easily and include journalistic texts from newspapers and magazines, literature (books and literary magazines), academic texts (books, dissertations, articles in journals) and other official texts (e.g. legal texts). For reasons of compatibility it is preferable to include text types that can also be found in the two synchronic corpora of Greek.

2. on the basis of the exploration of data sources, to collect data for a diachronic corpus of Greek of the 20th century that will contribute to the largest available resource for the language. This will involve the compilation of a pilot corpus in a first stage and the construction of Greek Corpus 20 in a subsequent stage. The compilation will involve the following:

  • collection of data
  • transcription of spoken data (if any)
  • digitization or conversion between formats (e.g. pdf to txt)
  • cleaning data, and
  • organizing the metadata.
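The last two steps above, together with the per-decade balance raised under the first aim, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline. The `TextRecord` fields and the names `clean_text` and `words_per_decade` are hypothetical, and the cleaning rules shown (re-joining words split across line breaks, collapsing whitespace) are only two of many that real data would need.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class TextRecord:
    """Hypothetical metadata record for one corpus text."""
    text_id: str
    year: int
    text_type: str  # e.g. "newspaper", "literature", "parliament"
    mode: str       # "written" or "spoken"
    text: str

    @property
    def decade(self) -> int:
        # 1923 -> 1920, 1954 -> 1950
        return self.year // 10 * 10


def clean_text(raw: str) -> str:
    # Re-join words hyphenated across a line break. Caveat: this also
    # merges genuine hyphenated compounds that happen to break at a line end.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Collapse all runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def words_per_decade(records):
    """Count whitespace-separated words per decade after cleaning."""
    totals = Counter()
    for rec in records:
        totals[rec.decade] += len(clean_text(rec.text).split())
    return totals
```

A tally of this kind gives a quick view of how evenly the collected material is spread across time periods, which bears directly on the question of representativeness.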

3. to analyze the corpus with a view to drawing basic conclusions on linguistic change and particularly neologisms across the decades of the 20th century.
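One simple way to approach neologisms across decades is to record the first decade in which each word form is attested in the corpus. The Python sketch below assumes the corpus has already been tokenized into one token list per decade. The function names are hypothetical, and this is only one possible heuristic rather than the project's method. A first attestation in the corpus is not the same as the date a word was coined, so the output should be read as a list of candidates for manual inspection.

```python
from collections import defaultdict


def first_attestations(tokens_by_decade):
    """Map each lower-cased word form to the earliest decade it occurs in."""
    first_seen = {}
    for decade in sorted(tokens_by_decade):
        for token in tokens_by_decade[decade]:
            # setdefault keeps the earliest decade, as decades are sorted.
            first_seen.setdefault(token.lower(), decade)
    return first_seen


def neologism_candidates(tokens_by_decade):
    """Group word forms by the decade of their first attestation.

    Forms already present in the earliest decade are excluded, because
    the corpus offers no earlier evidence against which to call them new.
    """
    if not tokens_by_decade:
        return {}
    earliest = min(tokens_by_decade)
    grouped = defaultdict(list)
    for word, decade in first_attestations(tokens_by_decade).items():
        if decade > earliest:
            grouped[decade].append(word)
    return {decade: sorted(words) for decade, words in grouped.items()}
```

In practice the comparison would more plausibly run over lemmas than raw word forms and would apply a minimum frequency threshold, since rare forms can be absent from earlier decades purely by chance.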

Greek Corpus 20 will be designed to be freely accessible to researchers in the field in order to promote research on recent grammatical and lexical change in the Greek language through the analysis of authentic texts.

Structure of the project:

The basic principles underlying project organization are recursive, continuous evaluation and international co-operation. For this reason, a pilot corpus is to be completed before the full corpus is compiled. In addition, two individual work packages are devoted to the evaluation of the pilot and the final corpus, and a network of researchers working on similar issues, both in Greece and abroad, will be set up through the organization of focused workshops with a specific agenda.

In particular, the project work packages are the following:

WP1: Design of the project, including review of the literature on compiling a diachronic corpus.

WP2: Organization of an international workshop on diachronic corpora, involving researchers who have already compiled such corpora in their respective languages with the purpose of discussing principles and best practices.

WP3: Organization of a workshop on Greek language resources.

WP4: Design and compilation of the pilot corpus.

WP5: Evaluation of the pilot corpus.

WP6: Design and compilation of Greek Corpus 20.

WP7: Evaluation of Greek Corpus 20.

WP8: Analysis of the main lexical and grammatical aspects of Greek on the basis of the diachronic corpus.

WP9: Design of the website for the diachronic corpus and the basic search tools for on-line search.

WP10: Dissemination of results, including the upload of the corpus on the website.

Project deliverables will include the organization of two workshops with respective technical reports, Greek Corpus 20 itself, to be accessed through a dedicated site, and research articles and notes by the research team on recent change in Greek, as well as on issues of language data resources and evaluation.



References:

Aarts, B., Close, J., Leech, G. & Wallis, S. (eds) (2013). The Verb Phrase in English: Investigating Recent Language Change with Corpora. Cambridge: Cambridge University Press.

Baayen, H. R. & Renouf, A. (1996). Chronicling the Times: Productive lexical innovations in an English newspaper. Language 72 (1), 69-96.

Baker, P. (2010). Sociolinguistics and Corpus Linguistics. Edinburgh: Edinburgh University Press.

Baker, P., Hardie, A. & McEnery, T. (2006). A Glossary of Corpus Linguistics. Edinburgh: Edinburgh University Press.

Beal, J., Corrigan, K. & Moisl, H. (eds) (2007). Creating and Digitizing Language Corpora. Volume 2: Diachronic Databases. Basingstoke: Palgrave Macmillan.

Davies, M. (2011). Synchronic and diachronic uses of corpora. In V. Viana, S. Zyngier & G. Barnbrook (eds) Perspectives on Corpus Linguistics. Amsterdam/Philadelphia: John Benjamins, 63-80.

Davies, M. (2012). Examining recent changes in English: Some methodological issues. In T. Nevalainen & E. Closs Traugott (eds) The Oxford Handbook of the History of English. Oxford: Oxford University Press, 263-287.

Fischer, R. (1998). Lexical Change in Present-Day English. A Corpus Study of the Motivation, Institutionalization, and Productivity of Creative Neologisms. Tübingen: Gunter Narr.

Goutsos, D. (2010). The Corpus of Greek Texts: a reference corpus for Modern Greek. Corpora 5 (1), 29-44.

Hatzigeorgiu, N., Spiliotopoulou, S., Vakalopoulou, A., Papakostopoulou, A., Piperidis, S., Gavriilidou, M. & Carayannis, G. (2001). National Thesaurus of Greek Texts: a corpus of Modern Greek on the internet. Studies in Greek Linguistics 21, 812-821. [In Greek].

Hunston, S. (2002). Corpora in Applied Linguistics. Cambridge: Cambridge University Press.

Mair, C. (2009). Corpora and the study of recent change in language. In A. Lüdeling & M. Kytö (eds) Corpus Linguistics. An International Handbook. Volume 2. Berlin/New York: Walter de Gruyter, 1109-1125.

Meyer, C. F. (2002). English Corpus Linguistics: An Introduction. Cambridge: Cambridge University Press.

Onelli, C., Proietti, D., Seidenari, C. & Tamburini, F. (2006). The DiaCORIS project: A diachronic corpus of written Italian. Proceedings of the 5th International Conference on Language Resources and Evaluation, LREC 2006, Genoa. Available at: http://hnk.ffzg.hr/bibl/lrec2006/pdf/611_pdf.pdf.

Partington, A. (2010). Modern Diachronic Corpus-Assisted Discourse Studies (MD-CADS) on UK newspapers: An overview of the project. Corpora 5 (2), 83-108.