# Wiki-3029 Dataset

```
tar -xzf Wiki-3029.tar.gz
```

This dataset is constructed from Wikipedia articles in the raw/wn/ subset. For each article containing at least 200 sentences of length at least 20, we randomly sampled 200 such sentences, yielding 3029 articles with 200 sentences each.

The Wiki-3029 directory contains 3029 text files, where each file corresponds to an article (class) and each line in the file corresponds to a sentence (data point) from that article. Within each file, the first 140 sentences form the train set, the next 20 the validation set, and the final 40 the test set (see the loading sketch below).

If you use this dataset, please cite:

```
@InProceedings{CURL:19,
  title     = {A Theoretical Analysis of Contrastive Unsupervised Representation Learning},
  author    = {Arora, Sanjeev and Khandeparkar, Hrishikesh and Khodak, Mikhail and Plevrakis, Orestis and Saunshi, Nikunj},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  year      = {2019}
}
```
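For concreteness, here is a minimal Python sketch of loading the files into the splits described above. It assumes the extracted Wiki-3029 directory sits in the working directory and that each file holds one sentence per line; the function name `load_wiki3029` and the sorted file order used for class indexing are our own choices, not part of the dataset.

```
from pathlib import Path

def load_wiki3029(root="Wiki-3029"):
    """Return {"train": ..., "valid": ..., "test": ...} lists of
    (sentence, class_index) pairs, following the split above: per file,
    first 140 sentences -> train, next 20 -> validation, final 40 -> test.
    """
    splits = {"train": [], "valid": [], "test": []}
    # One file per article (class); sort for a stable class indexing
    # (the indexing scheme is an assumption, not specified by the dataset).
    files = sorted(p for p in Path(root).iterdir() if p.is_file())
    for label, path in enumerate(files):
        with open(path, encoding="utf-8") as f:
            sentences = [line.strip() for line in f if line.strip()]
        splits["train"].extend((s, label) for s in sentences[:140])
        splits["valid"].extend((s, label) for s in sentences[140:160])
        splits["test"].extend((s, label) for s in sentences[160:200])
    return splits

if __name__ == "__main__":
    data = load_wiki3029()
    # With 3029 files of 200 sentences each, this prints
    # {'train': 424060, 'valid': 60580, 'test': 121160}.
    print({k: len(v) for k, v in data.items()})
```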