The corpus was developed as a linguistic resource for research on Automatic Summarization and on its relation to other issues, in order to foster studies of discourse processing.
Summ-it consists of fifty texts from the science domain, extracted from the Science section of the Brazilian daily newspaper Folha de São Paulo (FSP), together with:
I. Human summaries produced by experts in summarization (Coelho, 2007), who rewrote the original texts in a compressed form.
II. Automatic summaries, obtained with GistSumm (Pardo et al., 2002, 2003) and SuPor-2 (Leite and Rino, 2006a, 2006b, 2006c). All summaries were generated with a 70% compression rate, so each summary corresponds to roughly 30% of its original text.
III. Manually underlined sentences that contain relevant information from the original texts (see 3.2).
IV. Texts semi-automatically annotated with morpho-syntactic information, with the aid of the syntactic parser PALAVRAS (available at: http://visl.sdu.dk/visl/pt/) and the Xtractor converter (available at: http://abc.di.uevora.pt/xtractor/).
V. Texts semi-automatically annotated with co-reference information for noun phrases (MMAX) and with rhetorical relations (RST) (cf. Carbonel et al., 2007, Fuchs, 2008, and Collovini et al., 2007). The first annotation aims to identify the discourse entities (e.g. noun phrases) that are referred to or taken up again in the text; the second makes it possible to structure a text by relating its discourse units through RST relations.
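The compression-rate arithmetic in item II can be illustrated with a minimal sketch. It assumes that length is measured in words, which the description above does not specify, and the function names are hypothetical; they are not part of GistSumm or SuPor-2.

```python
def target_length(source_words: int, compression_rate: float = 0.70) -> int:
    """Approximate summary length, in words, for a given compression rate.

    Assumption: compression rate = 1 - (summary length / source length),
    with both lengths measured in words.
    """
    if source_words < 0:
        raise ValueError("source_words must be non-negative")
    if not 0.0 <= compression_rate < 1.0:
        raise ValueError("compression_rate must be in [0, 1)")
    return round(source_words * (1.0 - compression_rate))


def compression_rate(source_words: int, summary_words: int) -> float:
    """Compression rate actually achieved by a summary (same assumption)."""
    if source_words <= 0:
        raise ValueError("source_words must be positive")
    return 1.0 - summary_words / source_words


if __name__ == "__main__":
    # A 1000-word source at a 70% compression rate leaves about 30% of it.
    print(target_length(1000))  # 300
    # Rounded for display, since the result is a float.
    print(round(compression_rate(1000, 300), 2))
```

Under this reading, a higher compression rate means a shorter summary, so a 70% rate keeps roughly three words in ten.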
- Xtractor converter