CORPORA

 

CONTRAST-IT and COMPARE-IT: two comparable corpora

 

CONTRAST-IT and COMPARE-IT are both comparable corpora: each is a collection of similar texts (see "General features of the corpora" below).

CONTRAST-IT is a multilingual corpus and COMPARE-IT a monolingual one. They were created in the course of two research projects funded by the Swiss National Science Foundation (ICOCP and ISAaC) with the aim of investigating Italian from a contrastive and comparative perspective. The publications resulting from these projects are listed under "CONTRAST-IT. Specific references".

The CONTRAST-IT and COMPARE-IT corpora include the following language components:

  • CONTRAST-IT corpus: Italian (from Italy), French (from France), Spanish (from Spain), English (from the UK), German (from Germany)
  • COMPARE-IT corpus: Italian from Italy; Italian from Switzerland; Italian from Canada

The CONTRAST-IT and COMPARE-IT corpora can be used to study a wide array of language pairs or groups, as well as a single language. For instance, the two corpora can be used to investigate the following language combinations:

  • Italian vs French and/or Spanish
  • Italian vs German
  • Italian from Italy vs Italian from Switzerland and/or Canada
  • Italian-French-Spanish vs English-German
  • English vs German
  • French vs German vs English

 

General features of the corpora

 

The comparable CONTRAST-IT and COMPARE-IT corpora are based on:

- comparable text collections

  • the text collections are comparable in terms of genre: they include articles published electronically in online daily newspapers
  • the text collections are comparable in terms of date: the articles were published between 2011 and 2012
  • the text collections are to some extent also comparable in terms of content: they include at least the following thematic sections of the newspapers: politics, economy and sports

- original text collections

  • the text collections do not include translated texts

- full-length text collections

  • the text collections do not include text samples
  • the text collections include newspaper articles, their title, author(s), and date of publication

- commonly occurring text collections

  • the texts collected are produced by the mass media industry and represent one of the most commonly occurring forms of written text

- small to medium size text collections

  • the CONTRAST-IT corpus is based on a text collection of ca. 1.5 million words
  • the COMPARE-IT corpus is based on a text collection of ca. 500,000 words

CONTRAST-IT and COMPARE-IT are high-quality corpora. All texts have been manually checked to ensure that all their components are present and that each text belongs to the correct news section (politics, economy, sports, etc.).
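
For users who load the corpora into their own tools, the checks described above can be approximated in code. The following Python sketch is only an illustration and not the procedure used by the corpus compilers, who checked the texts manually. The record layout (`Article`), the function `check_article` and the set `KNOWN_SECTIONS` are hypothetical names; the fields mirror the components listed above (title, author(s), date of publication, news section), and the date range follows the stated publication period.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Hypothetical section labels: the description names politics, economy and
# sports and implies that further sections exist ("etc.").
KNOWN_SECTIONS = {"politics", "economy", "sports"}


@dataclass
class Article:
    """One full-length newspaper article (assumed record layout)."""
    title: str
    authors: List[str]
    published: Optional[date]
    section: str
    body: str


def check_article(article: Article) -> List[str]:
    """Return the names of components that are missing or look suspicious."""
    issues = []
    if not article.title.strip():
        issues.append("title")
    if not article.authors:
        issues.append("authors")
    # The text collections are described as published between 2011 and 2012.
    if article.published is None or not 2011 <= article.published.year <= 2012:
        issues.append("date")
    if article.section not in KNOWN_SECTIONS:
        issues.append("section")
    if not article.body.strip():
        issues.append("body")
    return issues
```

An article for which `check_article` returns an empty list passes these automatic checks; anything else would still need to be inspected by hand, for example because a section label outside the three listed here may be perfectly valid.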

More information on the design of the CONTRAST-IT and COMPARE-IT corpora is available on the pages devoted to each corpus.