Documentation

Carolina is fully available for free download. The current version is offered without support programs, which are planned for future releases. By downloading the corpus, you agree to the Terms of use.

Terms of use

Carolina is composed of texts gathered from several digital repositories under a variety of licenses, which must be strictly observed when using the corpus. The specific license for each document included in the Corpus is detailed in its metadata. These licenses range from broad public domain licenses to partial sharing licenses with restrictions on commercial use. No documents without a sharing license were included in the Corpus.

The Corpus header is under the license Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), detailed at https://creativecommons.org/licenses/by-nc-sa/4.0 .
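Because licenses vary per document, it can help to check them programmatically before reusing any text. The sketch below is only an illustration: it assumes each document's license is available as a Creative Commons URL, and the `records` list and its field names are hypothetical, not the Corpus's actual metadata format. It relies on the fact that Creative Commons license URLs encode their terms in the path (for example `by-nc-sa`).

```python
from urllib.parse import urlparse


def cc_license_terms(url: str) -> set:
    """Return the Creative Commons terms encoded in a license URL.

    ".../licenses/by-nc-sa/4.0" yields {"by", "nc", "sa"}.
    URLs that do not follow the /licenses/<terms>/<version> pattern
    (including public domain dedications) yield an empty set.
    """
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    is_cc = parsed.netloc.endswith("creativecommons.org")
    if is_cc and len(parts) >= 2 and parts[0] == "licenses":
        return set(parts[1].split("-"))
    return set()


# Hypothetical records; the real metadata layout may differ.
records = [
    {"id": "doc-1", "license": "https://creativecommons.org/licenses/by-nc-sa/4.0"},
    {"id": "doc-2", "license": "https://creativecommons.org/licenses/by/4.0/"},
    {"id": "doc-3", "license": "https://creativecommons.org/publicdomain/zero/1.0/"},
]

# Documents whose license carries the NonCommercial ("nc") term.
noncommercial_ids = [r["id"] for r in records if "nc" in cc_license_terms(r["license"])]
print(noncommercial_ids)
```

A check like this only flags the NonCommercial term; it is not a substitute for reading the license attached to each document.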

Acknowledgments

Carolina was built by a team of linguists and computer scientists, members of the Laboratório Virtual de Humanidades Digitais (LaViHD) and of the Centro de Inteligência Artificial da Universidade de São Paulo (C4AI).

How to cite the current version of Carolina:

Crespo, Maria Clara Ramos Morales; Rocha, Maria Lina de Souza Jeannine; Sturzeneker, Mariana Lourenço; Serras, Felipe Ribas; Mello, Guilherme Lamartine de; Costa, Aline Silva; Palma, Mayara Feliciano; Mesquita, Renata Morais; Guets, Raquel de Paula; Silva, Mariana Marques da; Finger, Marcelo; Paixão de Sousa, Maria Clara; Namiuti, Cristiane; Monte, Vanessa Martins do. 2023. Carolina: a General Corpus of Contemporary Brazilian Portuguese with Provenance, Typology and Versioning Information. arXiv preprint arXiv:2303.16098. Available at: https://arxiv.org/abs/2303.16098.

Source

All documents that are part of the Corpus are annotated with detailed headers, which include complete information on origin, authorship and sharing licenses.

Structure (tags and schema)

The data structure in Carolina follows the guidelines of the TEI (Text Encoding Initiative), which define a specific XML schema. Tags specific to Carolina were developed to comply with the WaC-wiPT methodology.
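As an illustration of what reading a TEI-encoded document can look like, here is a minimal sketch using Python's standard library. The sample XML is hypothetical and heavily simplified: it uses generic TEI elements (`teiHeader`, `titleStmt`, `licence`, `sourceDesc`) and is not validated against the TEI schema, and Carolina's actual headers include additional, corpus-specific tags that are not shown here.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; the prefix "tei" is chosen locally for the queries.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, simplified document; not taken from the Corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Example document</title></titleStmt>
      <publicationStmt>
        <availability>
          <licence target="https://creativecommons.org/licenses/by-nc-sa/4.0">CC BY-NC-SA 4.0</licence>
        </availability>
      </publicationStmt>
      <sourceDesc><bibl>https://example.org/source-page</bibl></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><p>Texto de exemplo.</p></body></text>
</TEI>"""


def read_header(xml_string: str) -> dict:
    """Extract title, license URL, source and body text from a TEI string."""
    root = ET.fromstring(xml_string)
    licence = root.find(".//tei:licence", TEI_NS)
    paragraphs = root.iterfind(".//tei:body/tei:p", TEI_NS)
    return {
        "title": root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS),
        "licence_url": licence.get("target") if licence is not None else None,
        "source": root.findtext(".//tei:sourceDesc/tei:bibl", default="", namespaces=TEI_NS),
        "text": " ".join(p.text or "" for p in paragraphs),
    }


info = read_header(SAMPLE)
print(info["title"], info["licence_url"])
```

For files of the Corpus's size, a streaming parser such as `ET.iterparse` would likely be a better fit than loading whole documents into memory.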

Latest version

Latest version: 1.2 Ada
Date of release: March 8, 2023
Size: ~11 GB
Download - HuggingFace | 08.03.2023

Access the interactive search, provenance, and schema pages of version 1.2 Ada to see its metadata, text origin, and data structure, respectively:

Search 1.2 Ada | Source 1.2 Ada | (Tags and Schema) 1.2 Ada

Previous versions

1.1 Ada - 08/04/2022
1.0 Ada