CORPORA

contrast_it: italiano in prospettiva contrastiva / Italian in a contrastive perspective

CONTRAST-IT and COMPARE-IT: two comparable corpora

CONTRAST-IT and COMPARE-IT are comparable corpora: both consist of collections of similar texts (see "General features of the corpora" below).

CONTRAST-IT and COMPARE-IT are a multilingual and a monolingual corpus, respectively. They were created during two research projects funded by the Swiss National Science Foundation (ICOCP and ISAaC) with the aim of investigating Italian from a contrastive and comparative perspective. The publications resulting from these projects are listed under: CONTRAST-IT. Specific references.

The CONTRAST-IT and COMPARE-IT corpora include the following language components:

CONTRAST-IT corpus: Italian (from Italy), French (from France), Spanish (from Spain), English (from the UK), German (from Germany)
COMPARE-IT corpus: Italian (from Italy), Italian (from Switzerland), Italian (from Canada)

The CONTRAST-IT and COMPARE-IT corpora make it possible to work on a wide range of language pairs or groups, as well as on a single language. For instance, the two corpora can be used to investigate the following language combinations:

Italian vs French and/or Spanish
Italian vs German
Italian from Italy vs Italian from Switzerland and/or Canada
Italian-French-Spanish vs English-German
English vs German
French vs German vs English
…
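
As a rough illustration of how many pairwise comparisons the components listed above allow, the following minimal sketch enumerates all unordered pairs within each corpus. The variable names, the labels and the function `language_pairs` are hypothetical and are not part of the corpora or of any tool distributed with them.

```python
from itertools import combinations

# Language components as listed on this page. The labels below are
# hypothetical; they are not identifiers used by the corpora themselves.
CONTRAST_IT = ["Italian", "French", "Spanish", "English", "German"]
COMPARE_IT = ["Italian (Italy)", "Italian (Switzerland)", "Italian (Canada)"]


def language_pairs(components):
    """Return every unordered pair of components as a list of tuples."""
    return list(combinations(components, 2))


contrast_pairs = language_pairs(CONTRAST_IT)
compare_pairs = language_pairs(COMPARE_IT)

# Print each possible pairing, e.g. "Italian vs French".
for first, second in contrast_pairs + compare_pairs:
    print(f"{first} vs {second}")
```

Larger groupings, such as Italian-French-Spanish vs English-German, would have to be defined by hand, since they reflect a research question rather than a simple enumeration.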

General features of the corpora

The comparable CONTRAST-IT and COMPARE-IT corpora are based on:

- comparable text collections

the text collections are comparable in terms of genre: they include articles published electronically in online daily newspapers
the text collections are comparable in terms of date: the articles were published between 2011 and 2012
the text collections are to some extent also comparable in terms of content; they include at least the following thematic sections of the newspapers: politics, economy and sports

- original text collections

the text collections do not include translated texts

- full-length text collections

the text collections do not include text samples
the text collections include each newspaper article together with its title, author(s), and date of publication

- commonly occurring text collections

the texts collected are produced by the mass media industry and represent one of the most commonly occurring forms of written text

- small to medium size text collections

the CONTRAST-IT corpus is based on a text collection of ca. 1.5 million words
the COMPARE-IT corpus is based on a text collection of ca. 500,000 words

CONTRAST-IT and COMPARE-IT are high-quality corpora. All the texts have been manually checked to ensure that all their components are present and that they belong to the correct news section (politics, economy, sports, etc.).

More information on the design of the CONTRAST-IT and COMPARE-IT corpora is available on the pages devoted to each corpus.