Newspaper Data Collection




THE IIT NEWSPAPER DATA

The IIT Newspaper Corpus is a collection of roughly 75,200 newspaper articles, crawled with permission from the publishers.

Articles are associated with the editors' own classification and its mapping to the top-level domains of the IPTC taxonomy.


Newspaper               Site                           Crawled Articles    License
Anagnostis              http://www.anagnostis.org      2723                Permission
Avgi                    http://www.avgi.gr             38255               Creative Commons
Lefkaditika Nea         http://www.kolivas.de          5305                Permission
Machitis                maxitisthrakis.blogspot.com    11893               Permission
Methorios               http://www.methorios.gr        1961                Permission
Samiakos Tupos          http://www.samiakostypos.gr    2551                Permission
Tharros Vioton          http://www.tharrosvioton.gr    1179                Permission
Thraki News             http://www.e-thrakinews.gr     11365               Permission
To Vima tis Aigialias   Not available yet              Not available yet   Permission
Foni tou Nestou         Not available yet              Not available yet   Permission



CONTRIBUTORS

Konstantinos Pechlivanis (kpechlivanis21@gmail.com) in the context of his MSc thesis at the Technical University of Crete and NCSR "Demokritos", on Corpus Based Methods for Learning Models of Metaphor in Modern Greek.

Eirini Florou (eirini.florou@gmail.com) in the context of her PhD at the University of Athens, on Metaphor Detection in Greek.

Both under the supervision of Stasinos Konstantopoulos.



AVAILABLE DATA

Two kinds of data are available for download. The first is the whole corpus, available either as plain text or as HTML. The text files are parsed versions that include a header with key information such as the "Newspaper Name", "Domain", "Date" and "Title" of the article. The HTML files are the original files as downloaded from the web.
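
The exact layout of the text-file header is not specified on this page. As a minimal sketch, assuming each header field is written as a "Key: value" line and the header ends at the first blank line (both are assumptions, as is the function name parse_article), a text file could be split into its header and body like this:

```python
def parse_article(text):
    """Split an article into (header, body).

    ASSUMPTION: the header is a run of "Key: value" lines ending at the
    first blank line. The real corpus layout may differ.
    """
    header = {}
    lines = text.splitlines()
    body_start = len(lines)
    for i, line in enumerate(lines):
        if not line.strip():
            # A blank line closes the header; the body starts on the next line.
            body_start = i + 1
            break
        key, sep, value = line.partition(":")
        if not sep:
            # Not a "Key: value" line, so treat it as the start of the body.
            body_start = i
            break
        header[key.strip()] = value.strip()
    body = "\n".join(lines[body_start:]).strip()
    return header, body


if __name__ == "__main__":
    # Hypothetical sample; field values are made up for illustration.
    sample = (
        "Newspaper Name: Avgi\n"
        "Domain: politics\n"
        "Date: 2015-01-01\n"
        "Title: Example title\n"
        "\n"
        "Body text of the article."
    )
    header, body = parse_article(sample)
    print(header["Title"], "|", body)
```

Splitting on the first colon only keeps any later colons (for example inside a title) as part of the value.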

The second is the annotated data produced as part of the thesis work on Metaphor Detection. The texts were selected from the whole corpus and annotated by native Greek speakers with expertise in linguistics. The annotated files mark not only the phrases that have a non-literal meaning in a given text but also further information: the domain of the document, the domain of the metaphorical phrase, the type of metaphor, and delexical phrases. Annotation was carried out using Ellogon, a multi-lingual, cross-platform, general-purpose language engineering environment developed to aid both researchers in computational linguistics and companies that produce and deliver language engineering systems. The annotation guidelines can be downloaded here.


Corpus:





Annotations:




PoS Tags

As part of the Metaphor Detection system, we created a Part of Speech (PoS) annotated corpus from the newspapers Avgi, Lefkaditika Nea and Thraki News. It gives the part of speech of each word as assigned by the Ellogon Part of Speech tagger. The PoS corpus is available here.




Vocabulary

As part of the Metaphor Detection system, we compiled a vocabulary of 43,812 single words from the newspapers Avgi, Lefkaditika Nea and Thraki News. The vocabulary is available here.
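
This page does not describe how the vocabulary was built. As an illustration only, the sketch below shows one simple way to count single words across a set of texts: lowercase each text and count runs of letters. The tokenisation rule and the function name build_vocabulary are assumptions, not the method used for the released vocabulary.

```python
import re
from collections import Counter

# Runs of letters only (no digits, no underscore). In Python 3, str patterns
# are Unicode-aware by default, so Greek letters are matched as well.
WORD_RE = re.compile(r"[^\W\d_]+")


def build_vocabulary(texts):
    """Return a Counter mapping each lowercased word to its frequency.

    ASSUMPTION: a word is any maximal run of letters. The released
    vocabulary may have been built with different rules.
    """
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    return counts


if __name__ == "__main__":
    # Hypothetical sample sentences, for illustration only.
    vocab = build_vocabulary(["Η αυγή και η θάλασσα", "η θάλασσα το 2015"])
    print(len(vocab), "distinct words")
    print(vocab.most_common(2))
```

The size of the vocabulary is then the number of keys in the counter, and the counts can be used to drop very rare words if needed.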




Crawler

The crawler is also available. It consists of two parts. The first is a bash script that downloads the HTML files of the whole site for each available newspaper and separates the HTML files. The second extracts content such as the plain text and the title from the HTML files. The whole implementation, with more details, is available on GitHub.
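
To give a rough idea of what the second part involves, here is a minimal sketch of pulling the title and the visible text out of an HTML page using only the Python standard library. It is not the crawler's actual code; the class name TextExtractor and the function name extract are hypothetical, and a real newspaper page would need site-specific rules to drop menus, adverts and similar clutter.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the <title> text and the visible text of an HTML page."""

    SKIP = {"script", "style"}  # elements whose content is never visible text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._skip_depth == 0 and data.strip():
            # Keep non-empty text chunks that are outside script/style.
            self.text_parts.append(data.strip())


def extract(html):
    """Return (title, text) for an HTML string; text chunks are newline-joined."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    title = "".join(parser.title_parts).strip()
    text = "\n".join(parser.text_parts)
    return title, text


if __name__ == "__main__":
    # Hypothetical page, for illustration only.
    page = (
        "<html><head><title>Τίτλος</title></head>"
        "<body><p>Πρώτη παράγραφος.</p><script>var x = 1;</script></body></html>"
    )
    print(extract(page))
```

The standard-library parser is used here to keep the sketch free of dependencies; the actual implementation may rely on different tools.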



Greek Stemmer

As part of the Metaphor Detection system, we implemented a Greek Stemmer. The whole implementation, with more details, is available on Bitbucket.



2016 - Institute of Informatics and Telecommunications | National Centre for Scientific Research "Demokritos"