# Contextual Rare Words (CRW)

Mikhail Khodak\*, Nikunj Saunshi\*, Yingyu Liang, Tengyu Ma, Brandon Stewart, Sanjeev Arora

#### Run the following commands to evaluate the baselines

```
# Extract context/
tar -xvf context.tar
# Extract vectors.txt
bzip2 -dk vectors.txt.bz2
# Extract transform.txt
bzip2 -dk transform.txt.bz2
# Evaluate baselines
python3 crw.py
```

#### Extract the subcorpus

```
bzip2 -dk WWC.txt.bz2
```

#### Directory Listing After Extraction

- `CRW-562.txt`: contains 562 data points. Each line contains a pair of words and the corresponding human-rated similarity score; the second word in each pair is the rare word.
- `rarevocab.txt`: each line contains a rare word and its count in the original Westbury Wikipedia Corpus (WWC) [1].
- `WWC.txt`: a subcorpus of the original WWC that does not contain any rare word.
- `WWC_vocab.txt`: words and their counts for all words occurring at least 100 times in the WWC subcorpus.
- `context/<rare word>.txt`: contains 255 sentences (contexts) for that rare word.
- `vectors.txt`: word2vec embeddings trained on the WWC subcorpus for all words in `WWC_vocab.txt`.
- `transform.txt`: linear induction transform learned on the WWC subcorpus using the embeddings in `vectors.txt`.
- `crw.py`: script evaluating the baselines.

#### Dataset Construction

CRW is constructed from all pairs in the Rare Words (RW) dataset [2] whose rarer word occurs between 512 and 10000 times in the original WWC. This yields 562 distinct pairs and a set of 455 distinct rare words. The rare words and their counts in the original WWC are stored in `rarevocab.txt`. The WWC subcorpus contains the sentences from the original WWC that do not contain any rare word. For every rare word we sample 255 contexts (sentences) that contain that rare word and no other rare word; these contexts are stored in one file per rare word in the `context/` directory.
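The baseline evaluated by `crw.py` induces a vector for each rare word by averaging the embeddings of the words in its sampled contexts and applying the learned linear transform, then scores word pairs by cosine similarity against the human ratings. The sketch below illustrates that induction step; the whitespace-separated file format in `load_vectors` and the helper names are assumptions for illustration, not the exact interface of `crw.py`.

```python
# Minimal sketch of a la carte induction, assuming `vectors.txt` stores
# one embedding per line as "word v1 v2 ... vd" (an assumption; check
# crw.py for the actual parsing).
import numpy as np

def load_vectors(path):
    """Load word embeddings from a whitespace-separated text file."""
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            vecs[parts[0]] = np.array(parts[1:], dtype=float)
    return vecs

def induce(contexts, vecs, transform):
    """Induce a rare-word embedding: average the embeddings of all
    in-vocabulary words across the rare word's contexts (lists of
    tokens), then apply the linear induction transform."""
    ctx_vecs = [vecs[w] for sent in contexts for w in sent if w in vecs]
    return transform @ np.mean(ctx_vecs, axis=0)

def cosine(u, v):
    """Cosine similarity used to score a (word, rare word) pair."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

The induced similarities would then be compared to the human scores in `CRW-562.txt` with a rank correlation such as Spearman's rho.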
#### Citation

If you use this dataset, please cite:

```
@InProceedings{khodak-EtAl:2018:Long,
  author    = {Khodak, Mikhail and Saunshi, Nikunj and Liang, Yingyu and Ma, Tengyu and Stewart, Brandon and Arora, Sanjeev},
  title     = {A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors},
  booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month     = {July},
  year      = {2018},
  address   = {Melbourne, Australia},
  publisher = {Association for Computational Linguistics},
  url       = {}
}
```

#### References

[1] Cyrus Shaoul and Chris Westbury. 2010. The Westbury Lab Wikipedia Corpus.

[2] Thang Luong, Richard Socher, and Christopher Manning. 2013. Better word representations with recursive neural networks for morphology. In Proc. CoNLL.