Newspaper Data Collection

THE IIT NEWSPAPER DATA

The IIT Newspaper Corpus is a collection of roughly 75,200 newspaper articles, crawled with permission from the publishers.

Each article is associated with the editors' own classification and with the mapping of that classification to the top-level domains of the IPTC taxonomy.

Newspaper	Site	Crawled Articles	License
Anagnostis	http://www.anagnostis.org	2723	Permission
Avgi	http://www.avgi.gr	38255	Creative Commons
Lefkaditika Nea	http://www.kolivas.de	5305	Permission
Machitis	maxitisthrakis.blogspot.com	11893	Permission
Methorios	http://www.methorios.gr	1961	Permission
Samiakos Tupos	http://www.samiakostypos.gr	2551	Permission
Tharros Vioton	http://www.tharrosvioton.gr	1179	Permission
Thraki News	http://www.e-thrakinews.gr	11365	Permission
To Vima tis Aigialias	Not available yet	Not available yet	Permission
Foni tou Nestou	Not available yet	Not available yet	Permission

CONTRIBUTORS

Konstantinos Pechlivanis (kpechlivanis21@gmail.com) in the context of his MSc thesis at Technical University of Crete and NCSR "Demokritos", on Corpus Based Methods for Learning Models of Metaphor in Modern Greek.

Eirini Florou (eirini.florou@gmail.com) in the context of her PhD at University of Athens, on Metaphor Detection in Greek.

Both under the supervision of Stasinos Konstantopoulos.

AVAILABLE DATA

Two kinds of data are available for download. The first option contains the whole corpus, available either as text files or as HTML files. The text files are parsed and include a header with key information such as the "Newspaper Name", "Domain", "Date" and "Title" of the article. The HTML files are the original files as downloaded from the web.
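As an illustration of how the header of a text file might be read, here is a minimal Python sketch. The four field names come from the description above; the "Name: value" line layout, the blank line separating header from body, the function name split_header and the sample article are all assumptions made for this example and may not match the actual files.

```python
from typing import Dict, Tuple

# Field names quoted in the corpus description. The "Name: value" layout and
# the blank line between header and body are ASSUMPTIONS for illustration.
HEADER_FIELDS = ("Newspaper Name", "Domain", "Date", "Title")


def split_header(raw: str) -> Tuple[Dict[str, str], str]:
    """Split a text-format article into (header fields, body text)."""
    head, _, body = raw.partition("\n\n")
    fields: Dict[str, str] = {}
    for line in head.splitlines():
        name, sep, value = line.partition(":")
        # Keep only lines that look like one of the known header fields.
        if sep and name.strip() in HEADER_FIELDS:
            fields[name.strip()] = value.strip()
    return fields, body.strip()


# Made-up sample, not taken from the corpus.
sample = (
    "Newspaper Name: Avgi\n"
    "Domain: politics\n"
    "Date: 2014-03-01\n"
    "Title: Example title\n"
    "\n"
    "Body text of the article."
)
header, body = split_header(sample)
```

If the real header uses a different delimiter or layout, only the line-splitting logic inside split_header would need to change.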

The second option contains the annotated data produced as part of the thesis work on Metaphor Detection. The texts were selected from the whole corpus and annotated by native Greek speakers with expertise in linguistics. The annotated files contain not only the phrases that have a non-literal meaning in a given text but also further annotations, such as the domain of the document, the domain of the metaphor phrase, the type of metaphor, and delexical phrases. Annotation was carried out using Ellogon, a multi-lingual, cross-platform, general-purpose language engineering environment developed to aid both researchers in computational linguistics and companies that produce and deliver language engineering systems. The annotation guidelines can be downloaded here.

PoS Tags

In the context of the Metaphor Detection system, we created a Part of Speech (PoS) annotated corpus from the newspapers Avgi, Lefkaditika Nea and Thraki News. This corpus gives the part of speech of each word as assigned by the Ellogon Part of Speech tagger. The PoS corpus is available here.

Vocabulary

In the context of the Metaphor Detection system, we estimated a vocabulary of 43,812 single words from the newspapers Avgi, Lefkaditika Nea and Thraki News. The vocabulary is available here.
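One simple way to derive a vocabulary of single words from a set of texts is sketched below in Python. This is not necessarily the procedure used for the released vocabulary: the tokenisation rule (runs of letters), the lowercasing, the Unicode normalisation and the name build_vocabulary are assumptions chosen for illustration, and the two sentences are made up.

```python
import re
import unicodedata
from collections import Counter
from typing import Iterable

# Runs of letters only (Unicode-aware); digits and underscores are excluded.
# This tokenisation rule is an ASSUMPTION, not the project's documented one.
WORD_RE = re.compile(r"[^\W\d_]+")


def build_vocabulary(texts: Iterable[str]) -> Counter:
    """Count lowercased word types over an iterable of document texts."""
    counts: Counter = Counter()
    for text in texts:
        # NFC keeps accented Greek letters as single code points.
        normalised = unicodedata.normalize("NFC", text)
        counts.update(token.lower() for token in WORD_RE.findall(normalised))
    return counts


# Made-up sample sentences, not taken from the corpus.
docs = ["Η θάλασσα είναι ήρεμη.", "Η ήρεμη θάλασσα."]
vocab = build_vocabulary(docs)
vocabulary_size = len(vocab)  # number of distinct single words
```

The size of the vocabulary is the number of keys in the counter; the counts themselves are kept because frequency thresholds are a common way to prune rare or noisy tokens.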

Crawler

The crawler is also available. It consists of two parts. The first part is a bash script that downloads the HTML files of the whole site for each available newspaper and separates the HTML files. The second part extracts the content of each article, such as the plain text and the title, from the HTML files. The whole implementation, with more details, is available on GitHub.
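The second step, pulling the title and plain text out of a downloaded HTML file, can be sketched with Python's standard html.parser module. This is an illustrative sketch only, not the crawler's actual code: the class name TextExtractor, the function extract and the sample page are hypothetical, and real newspaper pages would need site-specific rules to drop menus and other non-article text.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class TextExtractor(HTMLParser):
    """Collect the <title> and the visible text of an HTML page."""

    SKIP = {"script", "style"}  # elements whose content is not article text

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.title_parts: List[str] = []
        self.text_parts: List[str] = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html: str) -> Tuple[str, str]:
    """Return (title, plain text) for one HTML document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.title_parts).split())
    return title, "\n".join(parser.text_parts)


# Made-up page, not taken from any of the crawled sites.
page = (
    "<html><head><title>Example title</title>"
    "<style>p { color: red; }</style></head>"
    "<body><p>First paragraph.</p><script>var x = 1;</script>"
    "<p>Second paragraph.</p></body></html>"
)
title, text = extract(page)
```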

Greek Stemmer

In the context of the Metaphor Detection system, we implemented a Greek Stemmer. The whole implementation, with more details, is available on Bitbucket.

2016 - Institute of Informatics and Telecommunications | National Centre for Scientific Research "Demokritos"