EvalES

The Spanish Evaluation Benchmark

Datasets

What is EvalES?

The EvalES benchmark consists of 7 tasks: Named Entity Recognition and Classification (CoNLL-NERC), Part-of-Speech Tagging (UD-POS), Text Classification (MLDoc), Paraphrase Identification (PAWS-X), Semantic Textual Similarity (STS), Question Answering (SQAC), and Textual Entailment (XNLI).

CoNLL-NERC

CoNLL-NERC is the Spanish dataset of the CoNLL-2002 Shared Task (Tjong Kim Sang, 2002). The dataset is annotated with four types of named entities (persons, locations, organizations, and other miscellaneous entities) in the standard Beginning-Inside-Outside (BIO) format. The corpus consists of 8,324 training sentences with 19,400 named entities, 1,916 development sentences with 4,568 named entities, and 1,518 test sentences with 3,644 named entities.
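
To make the BIO scheme concrete, here is a minimal sketch that tags a hand-made Spanish sentence ("El Gobierno de España visitó Bruselas", "The Government of Spain visited Brussels") and recovers entity spans from the tags. The sentence, its labels, and the helper `bio_to_spans` are invented for illustration, not taken from the corpus or from any EvalES tooling.

```python
# Illustrative only: a hand-made Spanish sentence in the BIO scheme used by
# CoNLL-2002 (B- opens an entity, I- continues it, O is outside any entity).
tokens = ["El", "Gobierno", "de", "España", "visitó", "Bruselas", "."]
tags   = ["O",  "B-ORG",    "I-ORG", "I-ORG", "O",    "B-LOC",    "O"]

def bio_to_spans(tokens, tags):
    """Collect (entity_type, tokens) pairs from a BIO-tagged sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last entity
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.append((etype, tokens[start:i]))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

print(bio_to_spans(tokens, tags))
# [('ORG', ['Gobierno', 'de', 'España']), ('LOC', ['Bruselas'])]
```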

UD-POS

UD-POS comes from the AnCora corpus, as distributed in the Universal Dependencies treebanks. The original annotation was done in a constituency framework as part of the AnCora project at the University of Barcelona (Taulé, Martí, and Recasens, 2008). It was converted to dependencies for the CoNLL 2009 shared task; the CoNLL 2009 version was later converted to HamleDT and then to Universal Dependencies.
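
For reference, UD treebanks are distributed in the tab-separated CoNLL-U format, with FORM in column 2 and UPOS in column 4. Below is a minimal, hand-written sketch of reading those two columns; the sentence "El gato duerme" ("The cat sleeps") is invented, not quoted from AnCora.

```python
# Three hand-written CoNLL-U lines (columns: ID, FORM, LEMMA, UPOS, XPOS,
# FEATS, HEAD, DEPREL, DEPS, MISC) for the sentence "El gato duerme".
rows = [
    "1\tEl\tel\tDET\t_\t_\t2\tdet\t_\t_",
    "2\tgato\tgato\tNOUN\t_\t_\t3\tnsubj\t_\t_",
    "3\tduerme\tdormir\tVERB\t_\t_\t0\troot\t_\t_",
]

for line in rows:
    cols = line.split("\t")
    print(cols[1], cols[3])  # FORM and UPOS: El DET / gato NOUN / duerme VERB
```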

MLDoc

For document classification, we use the Multilingual Document Classification Corpus (MLDoc) (Schwenk and Li, 2018; Lewis et al., 2004), a cross-lingual document classification dataset covering 8 languages. We use the Spanish portion to evaluate our models on monolingual classification. The corpus consists of 14,458 news articles from Reuters classified into four categories: Corporate/Industrial, Economics, Government/Social, and Markets.
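
Since MLDoc is a four-way classification task, a natural metric is plain accuracy over the predicted categories. A minimal sketch, with dummy gold labels and predictions invented for illustration:

```python
# Dummy labels over the four MLDoc categories, for illustration only.
gold = ["Markets", "Economics", "Economics", "Government/Social"]
pred = ["Markets", "Economics", "Corporate/Industrial", "Government/Social"]

# Accuracy: the fraction of documents assigned their gold category.
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.75
```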

PAWS-X

For paraphrase identification, we use the Cross-lingual Adversarial Dataset for Paraphrase Identification (PAWS-X) (Yang et al., 2019), a multilingual dataset containing 49,401 sentence pairs for training, 2,000 pairs for the development set, and 2,000 pairs for the test set. It is important to note that this dataset contains machine-translated text, and as a consequence some of the Spanish sentences might not be entirely correct.
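
The task reduces to binary classification over sentence pairs. The hypothetical examples below (invented, not drawn from PAWS-X) illustrate the adversarial nature of the dataset: the negative pair has near-complete word overlap but a different meaning.

```python
# Hypothetical PAWS-X-style items: a sentence pair plus a binary label
# (1 = paraphrase, 0 = not a paraphrase). Spanish text invented for illustration.
examples = [
    {"sentence1": "El río cruza la ciudad de norte a sur.",    # "The river crosses the city north to south."
     "sentence2": "El río atraviesa la ciudad de norte a sur.",
     "label": 1},
    {"sentence1": "Ana llamó a Luis antes de salir.",          # "Ana called Luis before leaving."
     "sentence2": "Luis llamó a Ana antes de salir.",          # swapped arguments: same words, different meaning
     "label": 0},
]

for ex in examples:
    print(ex["label"], "|", ex["sentence1"], "||", ex["sentence2"])
```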

STS

For Semantic Textual Similarity, we collected the Spanish test sets from SemEval-2014 (Agirre et al., 2014) and SemEval-2015 (Agirre et al., 2015). Since no training data was provided for the Spanish subtask, we randomly split the combined data into 1,321 sentence pairs for the train set, 78 for the development set, and 156 for the test set. Note that the resulting development set is smaller than the test set.
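
STS systems are conventionally scored by correlating predicted similarities with the gold 0-5 annotations. A minimal sketch using Pearson correlation follows; the scores are dummies, and treating Pearson as the reported metric is an assumption here, not something stated above.

```python
from scipy.stats import pearsonr

gold = [4.8, 2.5, 0.0, 3.2, 1.1]  # dummy human similarity scores on the 0-5 scale
pred = [4.5, 2.9, 0.4, 3.0, 1.5]  # dummy model predictions

# Pearson correlation between predicted and gold similarity scores.
r, _ = pearsonr(gold, pred)
print(f"Pearson r = {r:.3f}")
```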

SQAC

For Question Answering, we built SQAC, a new extractive QA dataset in Spanish with texts drawn from the Spanish Wikipedia, encyclopedic articles, newswire articles from Wikinews, and the Spanish section of the AnCora corpus. The corpus consists of 18,817 questions with annotated answer spans over 6,247 textual contexts.
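
Extractive QA datasets of this kind are typically distributed in a SQuAD-style layout, where each answer is given as its text plus a character offset into the context. A hypothetical record is sketched below; the content is invented, and the exact field names of the SQAC release may differ.

```python
# Hypothetical SQuAD-style record: the answer span is identified by its text
# and its character offset into the context.
example = {
    "context": "Miguel de Cervantes publicó la primera parte del Quijote en 1605.",
    "question": "¿En qué año se publicó la primera parte del Quijote?",  # "In what year was the first part of Don Quixote published?"
    "answers": {"text": ["1605"], "answer_start": [60]},
}

# The answer span can be recovered directly from the offset:
start = example["answers"]["answer_start"][0]
text = example["answers"]["text"][0]
assert example["context"][start:start + len(text)] == text
```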

XNLI

For Textual Entailment, we use the Spanish part of the Cross-Lingual NLI Corpus (XNLI) (Conneau et al., 2018). This evaluation corpus consists of a collection of 400,202 premise-hypothesis pairs, annotated for textual entailment via crowdsourcing.
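
Each XNLI item is a premise-hypothesis pair labelled with one of three classes: entailment, neutral, or contradiction. A hypothetical example (Spanish text invented, not quoted from the corpus):

```python
# Hypothetical XNLI-style item with one of the three NLI labels.
example = {
    "premise": "El museo abre todos los días excepto los lunes.",  # "The museum opens every day except Mondays."
    "hypothesis": "El museo está cerrado los lunes.",              # "The museum is closed on Mondays."
    "label": "entailment",  # other possible labels: "neutral", "contradiction"
}
print(example["label"], "|", example["premise"], "=>", example["hypothesis"])
```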

MASSIVE

MASSIVE is a parallel dataset of over 1M utterances across 51 languages, with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation. Utterances span 60 intents and include 55 slot types. MASSIVE was created by localizing the SLURP dataset, which is composed of general Intelligent Voice Assistant single-shot interactions.
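
For orientation, MASSIVE pairs each raw utterance with an intent label and an inline-annotated copy marking slot values. The sketch below shows a hypothetical Spanish record; the field names follow MASSIVE's released JSON-lines layout as we understand it, and the specific intent and slot names are illustrative, not checked against the official schema.

```python
# Hypothetical MASSIVE-style record (es-ES): "utt" is the raw utterance,
# "annot_utt" marks slots inline as "[slot : value]", "intent" is the label.
example = {
    "locale": "es-ES",
    "utt": "pon una alarma a las siete de la mañana",  # "set an alarm for seven in the morning"
    "annot_utt": "pon una alarma a las [time : siete de la mañana]",
    "intent": "alarm_set",
}
print(example["intent"], "|", example["annot_utt"])
```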