Context

Carolina is developed by a multidisciplinary team of linguists and computer scientists, members of the Virtual Digital Humanities Laboratory - LaViHD and the Artificial Intelligence Center of the University of São Paulo - C4AI. C4AI-USP's mission is to produce advanced research in Artificial Intelligence in Brazil.

The Natural Language Processing Project - NLP2 is one of C4AI's challenges, and its general objective is to develop systems that advance the state of the art of Natural Language Processing for Brazilian Portuguese, reaching a new level of generation quality and performance compared to what exists today.

C4AI-USP's NLP2 is currently building several corpora, including Carolina; CORAA, the Corpus of Annotated Audios of Spoken Portuguese; and Portinari, the Annotated Corpus of Portuguese.

Carolina will be a corpus of contemporary Portuguese for wide use, including serving as a “mothership” for the other corpora produced at C4AI-USP (encompassing the audio transcriptions from CORAA, the unlabeled raw texts from Portinari and other future corpora). Read more about the NLP2 challenge on the Project page of C4AI.

Fundamentals

The Carolina Corpus is designed with an original methodology that we call WaC-wiPT: Web as Corpus with Provenance and Typology information. We consider provenance a crucial goal for web-based corpora, combined with typology and balance management. In addition to facilitating copyright compliance and typological labeling, provenance information makes it possible to answer questions about the origin of texts and broadens the scope of use of the corpus.

The initial work of the Carolina team sought to approach textual typology in a broad sense, free from a strict theoretical commitment, as a crucial methodological tool in the development of a collection of texts of such a significant size - allowing the organization of searches and the selection and balancing of texts.

The current version of the corpus, with around 830 million tokens and 2 million texts, includes material from the Brazilian judiciary and legislation, literary works in the public domain, journalistic texts, texts from social networks and wikis, and texts already published in other corpora.

v1 Ada (1.0, 1.1 or 1.2) is not yet balanced in terms of the origin of the texts and their broad typology, as shown by the data in the graph on the side, which relates to version 1.2. Note that v1 corresponds to only part of the documents that the team has already prospected and is still processing.