LX-SimLex-999 was created from SimLex-999 (Hill et al., 2015), which in turn was based on the University of South Florida Free Association Database (USF) (Nelson et al., 2014).
SimLex-999 was created under strict guidelines. Both words in each pair belong to the same morphosyntactic category, and multiword expressions and named entities were excluded from the data set. Besides the morphosyntactic category criterion, the concreteness of each word was also taken into account. The word pairs in the USF data set had been tagged by human annotators with a concreteness level on a scale of 1 to 7. In the creation of SimLex-999, this classification was used to exclude pairs in which one of the concepts was more concrete than the other.
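The selection criteria described above can be illustrated with a minimal sketch. It is not the procedure actually used by Hill et al. (2015): the `CandidatePair` structure, the `is_eligible` function, the `MAX_CONCRETENESS_GAP` threshold, the whitespace test for multiword expressions and all example words and ratings are hypothetical and serve only to make the criteria concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidatePair:
    """A candidate word pair (hypothetical structure, for illustration only)."""
    word1: str
    word2: str
    pos1: str              # morphosyntactic category, e.g. "N", "V", "A"
    pos2: str
    concreteness1: float   # human rating on the 1-7 scale
    concreteness2: float


# Assumption: the source does not state how large a concreteness difference
# was tolerated, so this threshold is purely illustrative.
MAX_CONCRETENESS_GAP = 1.0


def is_eligible(pair, named_entities=frozenset()):
    """Return True if the pair satisfies all the criteria sketched in the text."""
    same_category = pair.pos1 == pair.pos2
    # Assumption: a multiword expression is detected by the presence of a space.
    single_words = " " not in pair.word1 and " " not in pair.word2
    no_named_entities = (pair.word1 not in named_entities
                         and pair.word2 not in named_entities)
    similar_concreteness = (abs(pair.concreteness1 - pair.concreteness2)
                            <= MAX_CONCRETENESS_GAP)
    return (same_category and single_words
            and no_named_entities and similar_concreteness)


# Made-up candidates and ratings, one per exclusion criterion.
candidates = [
    CandidatePair("car", "bus", "N", "N", 6.5, 6.3),         # satisfies all criteria
    CandidatePair("run", "fast", "V", "A", 4.0, 3.5),        # different categories
    CandidatePair("ice cream", "cake", "N", "N", 6.0, 6.2),  # multiword expression
    CandidatePair("Paris", "city", "N", "N", 6.0, 5.8),      # named entity
    CandidatePair("idea", "book", "N", "N", 2.0, 6.5),       # concreteness mismatch
]

selected = [p for p in candidates
            if is_eligible(p, named_entities=frozenset({"Paris"}))]
```

In this toy example only the pair car/bus passes all four checks; each of the other pairs violates exactly one criterion.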
The result was 999 word pairs, distributed as follows: 666 noun-noun pairs, 222 verb-verb pairs and 111 adjective-adjective pairs. Each pair received a score on a scale from 0 (totally unrelated) to 6 (very similar).