Documentation
Carolina is fully available for free download. The current version is offered without support programs, which are planned for future releases. By downloading the corpus, you agree to the Terms of use.
Terms of use
Carolina is composed of texts gathered from several digital repositories under a variety of licenses, which must be strictly observed when using the corpus. The specific license of each document included in the Corpus is detailed in its metadata. Licenses range from broad public-domain licenses to partial sharing licenses with restrictions on commercial use. No documents without a sharing license were included in the Corpus.
The Corpus header is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), detailed at https://creativecommons.org/licenses/by-nc-sa/4.0.
Acknowledgments
Carolina was built by a team of linguists and computer scientists, members of the Laboratório Virtual de Humanidades Digitais (LaViHD) and of the Centro de Inteligência Artificial da Universidade de São Paulo (C4AI).
How to cite the current version of Carolina:
Crespo, Maria Clara Ramos Morales; Rocha, Maria Lina de Souza Jeannine; Sturzeneker, Mariana Lourenço; Serras, Felipe Ribas; Mello, Guilherme Lamartine de; Costa, Aline Silva; Palma, Mayara Feliciano; Mesquita, Renata Morais; Guets, Raquel de Paula; Silva, Mariana Marques da; Finger, Marcelo; Paixão de Sousa, Maria Clara; Namiuti, Cristiane; Monte, Vanessa Martins do. 2023. Carolina: a General Corpus of Contemporary Brazilian Portuguese with Provenance, Typology and Versioning Information. arXiv preprint arXiv:2303.16098. Available at: https://arxiv.org/abs/2303.16098.
Source
All documents that are part of the Corpus are annotated with detailed headers, which include complete information on origin, authorship and sharing licenses.
Structure (tags and schema)
The data structure in Carolina follows the guidelines of the TEI (Text Encoding Initiative), which define a specific XML schema. Tags were developed specifically for Carolina to comply with the WaC-wiPT methodology.
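As a rough illustration of what reading a TEI-encoded document can look like, the following sketch parses a small sample with Python's standard library. The sample is hypothetical and not taken from Carolina: it uses only general TEI elements (teiHeader, titleStmt, licence, sourceDesc), and the actual Carolina header and its corpus-specific tags may differ. The function name read_header is likewise only illustrative.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; queries must be namespace-qualified.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample document (assumption: NOT an actual Carolina file).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Example document</title>
        <author>Example Author</author>
      </titleStmt>
      <publicationStmt>
        <availability>
          <licence target="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</licence>
        </availability>
      </publicationStmt>
      <sourceDesc>
        <bibl><ptr target="https://example.org/source"/></bibl>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><p>Texto de exemplo.</p></body></text>
</TEI>"""


def read_header(xml_string):
    """Extract origin, authorship and licence fields from a TEI string."""
    root = ET.fromstring(xml_string)

    def text_of(path):
        # Return the stripped text of the first match, or None if absent.
        node = root.find(path, TEI_NS)
        return node.text.strip() if node is not None and node.text else None

    licence = root.find(".//tei:licence", TEI_NS)
    source = root.find(".//tei:sourceDesc//tei:ptr", TEI_NS)
    return {
        "title": text_of(".//tei:titleStmt/tei:title"),
        "author": text_of(".//tei:titleStmt/tei:author"),
        "licence": text_of(".//tei:licence"),
        "licence_url": licence.get("target") if licence is not None else None,
        "source_url": source.get("target") if source is not None else None,
        "body": text_of(".//tei:text/tei:body/tei:p"),
    }


if __name__ == "__main__":
    for key, value in read_header(SAMPLE).items():
        print(f"{key}: {value}")
```

Returning None for missing fields, rather than raising an error, keeps the sketch usable on documents whose headers omit some of these elements.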
Latest version 1.2 Ada
Latest version: 1.2 Ada
Date of release: March 8, 2022
Size: ~11 GB
Download: HuggingFace | 08.03.2023
Access the interactive search, provenance and schema of version 1.2 Ada to see metadata, text origin and data structure respectively:
Search 1.2 Ada | Source 1.2 Ada | Tags and Schema 1.2 Ada