The test set described in was used as the basis for the assessment of word embeddings. An example entry in this data set would read: ‘Berlin Germany Lisbon Portugal’. With a four-word relation such as this one, semantic analogies can be tested by taking any combination of three of the four word vectors in an entry and checking whether the resulting vector is similar to the vector of the fourth word, the one left out of the combination. In the example above, the completed analogy should read: ‘Berlin is to Germany as Lisbon is to Portugal’.
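As an illustration of how one such entry can be tested, the following minimal sketch combines three word vectors and looks for the nearest remaining word by cosine similarity. It assumes the commonly used vector-offset formulation (b - a + c), which the text above does not explicitly name. The three-dimensional vectors, the distractor words and the function names are invented for this example and do not come from any trained model.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only
# (real embeddings are learned and have hundreds of dimensions).
EMBEDDINGS = {
    "Berlin":   [1.0, 0.0, 0.0],
    "Germany":  [1.0, 1.0, 0.0],
    "Lisbon":   [0.0, 0.0, 1.0],
    "Portugal": [0.0, 1.0, 1.0],
    "Paris":    [0.7, 0.0, 0.7],  # distractor
    "France":   [0.7, 1.0, 0.7],  # distractor
}


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, embeddings):
    """Return the word most similar to b - a + c, excluding a, b and c.

    Assumption: the vector-offset method is used to combine the three
    known words of the entry 'a b c ?'.
    """
    va, vb, vc = embeddings[a], embeddings[b], embeddings[c]
    target = [y - x + z for x, y, z in zip(va, vb, vc)]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


# 'Berlin is to Germany as Lisbon is to ?'
predicted = solve_analogy("Berlin", "Germany", "Lisbon", EMBEDDINGS)
print(predicted)
```

An entry counts as correct when the predicted word matches the word that was left out. Reordering the arguments tests a different combination of the same entry, for example `solve_analogy("Germany", "Berlin", "Portugal", EMBEDDINGS)` asks for the capital instead of the country.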
The test set contains five types of semantic analogy: common capitals and countries, all capitals and countries, currency, cities and states, and family relations. Nine types of syntactic analogy are also represented: adjective to adverb, opposite, comparative, superlative, present participle, nationality (adjective), past tense, plural nouns and plural verbs. The test set contains a total of 8869 semantic and 10675 syntactic entries.
For the evaluation of the Portuguese word embeddings, the original English test set was translated into Portuguese by skilled, native Portuguese-speaking language experts. The resulting translations, LX-4WAnalogies, and the corresponding English terms are available at http://github.com/nlx-group.