Carolina Corpus

Carolina is a large corpus of contemporary Brazilian Portuguese texts (1970-2021), with information on origin and typology. It has been openly available for free download since March 8, 2022. The current version, Ada 1.2 (March 8, 2022), contains 823 million tokens in more than two million texts, totaling more than 11 GB.

Fundamentals

The Carolina Corpus is built with an original methodology that we call WaC-wiPT: Web as Corpus with Provenance and Typology information. We consider provenance a crucial goal for web-based corpora, alongside typology and balance management. In addition to facilitating copyright compliance and typological labeling, provenance information makes it possible to answer questions about the origin of texts and broadens the range of uses of the corpus.
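As a rough illustration of why provenance and typology metadata are useful, the sketch below models a text record that carries both and shows two simple queries over such records. It is a minimal sketch and not the corpus's actual schema: the class CorpusText, its field names, the helper functions, and the sample records are all hypothetical and were invented for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusText:
    """One text plus provenance and typology metadata.

    Hypothetical structure: these field names are assumptions made for
    illustration and do not reproduce the Carolina Corpus format.
    """
    text_id: str
    source_url: str  # provenance: where the text was collected
    license: str     # provenance: terms under which it may be reused
    typology: str    # broad text type, e.g. "journalistic" or "legal"
    body: str


def filter_by_typology(texts, typology):
    """Return only the texts labeled with the given typology."""
    return [t for t in texts if t.typology == typology]


def typology_balance(texts):
    """Count texts per typology, a first step toward checking balance."""
    return Counter(t.typology for t in texts)


# Invented sample records; they do not come from the Carolina Corpus.
sample = [
    CorpusText("t1", "https://example.org/news/1", "CC-BY", "journalistic", "..."),
    CorpusText("t2", "https://example.org/news/2", "CC-BY", "journalistic", "..."),
    CorpusText("t3", "https://example.org/law/1", "public-domain", "legal", "..."),
]

legal_texts = filter_by_typology(sample, "legal")
balance = typology_balance(sample)
```

Because each record keeps its source and license next to its text type, questions such as "which legal texts came from which site, and under what terms?" reduce to simple filters over the metadata.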

Carolina Project

The Carolina project is part of the larger Natural Language Processing project (NLP2) of the Center for Artificial Intelligence at the University of São Paulo (C4AI). It is developed by a multidisciplinary team of linguists and computer scientists, members of C4AI and the Virtual Laboratory of Digital Humanities (LaViHD).

"Carolina"

Carolina Michaelis in a photo from 1876

The Carolina Corpus was named in honor of Carolina Michaelis de Vasconcelos (1851-1925), a German philologist and linguist based in Portugal, author of A Saudade Portuguesa, and the first woman to work as a professor at the Faculty of Arts of the University of Lisbon, in 1911.

This tribute reflects the aspiration that drives the LaViHD computational team: to advance toward the cutting edge of knowledge, valuing the Portuguese language and its history, on the path to a science carried out by women.