2016 - Institute of Informatics and Telecommunications | National Centre for Scientific Research "Demokritos"
The IIT Newspaper Corpus set is a collection of roughly 75,200 newspaper articles, crawled with permission from the publishers.
Articles are associated with the editors' own classification and its mapping to the top-level domains of the IPTC taxonomy.
Newspaper | Site | Crawled Articles | License |
---|---|---|---|
Anagnostis | http://www.anagnostis.org | 2723 | Permission |
Avgi | http://www.avgi.gr | 38255 | Creative Commons |
Lefkaditika Nea | http://www.kolivas.de | 5305 | Permission |
Machitis | maxitisthrakis.blogspot.com | 11893 | Permission |
Methorios | http://www.methorios.gr | 1961 | Permission |
Samiakos Tupos | http://www.samiakostypos.gr | 2551 | Permission |
Tharros Vioton | http://www.tharrosvioton.gr | 1179 | Permission |
Thraki News | http://www.e-thrakinews.gr | 11365 | Permission |
To Vima tis Aigialias | Not available yet | Not available yet | Permission |
Foni tou Nestou | Not available yet | Not available yet | Permission |
Konstantinos Pechlivanis (kpechlivanis21@gmail.com) in the context of his MSc thesis at the Technical University of Crete and NCSR "Demokritos", on Corpus Based Methods for Learning Models of Metaphor in Modern Greek.
Eirini Florou (eirini.florou@gmail.com) in the context of her PhD at the University of Athens, on Metaphor Detection in Greek.
Both under the supervision of Stasinos Konstantopoulos.
Two kinds of data are available for download. The first option contains the whole corpus, available either as plain text or as HTML. The text files are parsed and include a header with key information such as the "Newspaper Name", "Domain", "Date" and "Title" of the article. The HTML files are the original files as downloaded from the web.
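The exact layout of the text-file header is not specified here. As a minimal sketch, assuming each file starts with `Key: value` lines (for example `Newspaper Name: ...`) followed by a blank line and then the article body, the header could be separated from the text as follows. The function name `parse_article` and the assumed layout are hypothetical and should be adjusted to the actual files.

```python
from typing import Dict, Tuple


def parse_article(raw: str) -> Tuple[Dict[str, str], str]:
    """Split one article file into (header fields, body text).

    ASSUMPTION: the header is a run of 'Key: value' lines that ends at
    the first blank line. The real corpus files may use another layout.
    """
    header: Dict[str, str] = {}
    lines = raw.splitlines()
    body_start = len(lines)  # default: no body found
    for i, line in enumerate(lines):
        if not line.strip():
            # First blank line closes the header.
            body_start = i + 1
            break
        key, sep, value = line.partition(":")
        if not sep:
            # Not a 'Key: value' line, so the body starts here.
            body_start = i
            break
        header[key.strip()] = value.strip()
    body = "\n".join(lines[body_start:]).strip()
    return header, body
```

The fields named above would then be read as `header["Newspaper Name"]`, `header["Domain"]`, `header["Date"]` and `header["Title"]`, provided the files really use those keys.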
The second option contains the annotated data produced as part of the thesis on Metaphor Detection. The texts were selected from the whole corpus and annotated by native Greek speakers with expertise in linguistics. The annotated files contain not only the phrases that have a non-literal meaning in a given text but also further annotations, such as the domain of the document, the domain of the metaphorical phrase, the type of metaphor, and delexical phrases. Annotation was carried out using Ellogon, a multi-lingual, cross-platform, general-purpose language engineering environment developed to aid both researchers in computational linguistics and companies that produce and deliver language engineering systems. The annotation guidelines can be downloaded here.
In the context of the Metaphor Detection system, we created a Part of Speech (PoS) annotated corpus from the newspapers Avgi, Lefkaditika Nea and Thraki News. This corpus contains the part of speech of each word, as assigned by the Ellogon Part of Speech tagger. The PoS corpus is available here.
In the context of the Metaphor Detection system, we estimated a vocabulary from the newspapers Avgi, Lefkaditika Nea and Thraki News that contains 43,812 single words. The vocabulary is available here.
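The procedure used to estimate the vocabulary is not described here. As an illustration only, a single-word vocabulary can be collected from article texts by counting lowercased runs of letters. The tokenisation rule and the name `build_vocabulary` are assumptions, not the method used for the released vocabulary.

```python
import re
import unicodedata
from collections import Counter
from typing import Iterable

# Runs of Unicode letters (no digits, no underscore); covers Greek and Latin.
WORD_RE = re.compile(r"[^\W\d_]+")


def build_vocabulary(texts: Iterable[str]) -> Counter:
    """Count lowercased single-word tokens over a collection of texts.

    ASSUMPTION: words are maximal runs of letters. The released
    vocabulary may have been built with different rules.
    """
    counts: Counter = Counter()
    for text in texts:
        # NFC keeps accented Greek letters as single code points.
        normalized = unicodedata.normalize("NFC", text)
        counts.update(w.lower() for w in WORD_RE.findall(normalized))
    return counts
```

The number of distinct entries is then `len(build_vocabulary(texts))`. A count obtained this way will not necessarily match the 43,812 words reported above.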
The crawler is also available. It consists of two parts. The first part is a bash script that downloads the HTML files of the whole site for each available newspaper and separates the HTML files. The second part extracts the content, such as the plain text and the title, from the HTML files. The whole implementation, with more details, is available on GitHub.
In the context of the Metaphor Detection system, we implemented a Greek stemmer. The whole implementation, with more details, is available on Bitbucket.
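To illustrate what the crawler's second part does (extracting the title and plain text from a downloaded HTML file), here is a minimal sketch using only Python's standard `html.parser` module. It is not the crawler's actual code, which is in the repository mentioned above. The class name `ArticleExtractor` and the helper `extract` are hypothetical, and real newspaper pages would likely need site-specific rules to drop menus and other boilerplate.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class ArticleExtractor(HTMLParser):
    """Collect the <title> text and the visible text of an HTML page."""

    SKIP = {"script", "style"}  # elements whose content is not article text

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.title_parts: List[str] = []
        self.text_parts: List[str] = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html: str) -> Tuple[str, str]:
    """Return (title, plain text) for one HTML document."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.title_parts).strip(), "\n".join(parser.text_parts)
```

Calling `extract(open(path, encoding="utf-8").read())` on a downloaded page would return its title and text. The encoding is an assumption and may differ from site to site.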