Massively-Multi-Class NLP Classification Dataset

Each file name, without the '.json' extension, is the name of a class (any '/' in the class name is replaced by '~'). All files are in the following format:

    {'data': 
        {'0':
            {'0': list of sentences (as tab-separated string) in first paragraph of first section,
             '1': list of sentences (as tab-separated string) in second paragraph of first section,
              .
              .
            },
         '1':
            {'0': list of sentences (as tab-separated string) in first paragraph of second section,
             '1': list of sentences (as tab-separated string) in second paragraph of second section,
              .
              .
            },
          .
          .
        },
     'sections':
        {'0': title of first section (as string),
         '1': title of second section (as string),
          .
          .
        }
    }
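
A minimal loading sketch in Python, for illustration only. It assumes that the files are valid UTF-8 JSON, that the sentences in a paragraph are separated by a literal tab character, and that every section key under 'data' also appears under 'sections'. The function names `class_name_from_path`, `parse_page` and `load_page` are hypothetical and not part of the dataset.

```python
import json
from pathlib import Path


def class_name_from_path(path):
    """Recover the class name from a file name: drop '.json', map '~' back to '/'."""
    name = Path(path).name
    if name.endswith(".json"):
        name = name[: -len(".json")]
    return name.replace("~", "/")


def parse_page(page):
    """Turn one decoded page into a list of (section_title, paragraphs) pairs.

    Each paragraph is returned as a list of sentences, obtained by splitting
    the stored string on tabs (assumed separator).
    """
    sections = []
    # Keys are string indices ('0', '1', ...), so sort them numerically.
    for sec_key in sorted(page["data"], key=int):
        paragraphs = [
            page["data"][sec_key][par_key].split("\t")
            for par_key in sorted(page["data"][sec_key], key=int)
        ]
        sections.append((page["sections"][sec_key], paragraphs))
    return sections


def load_page(path):
    """Read one file and return (class_name, parsed_sections)."""
    with open(path, encoding="utf-8") as f:
        return class_name_from_path(path), parse_page(json.load(f))
```

Sorting the keys with `key=int` keeps sections and paragraphs in document order even when there are ten or more of them, where plain string sorting would place '10' before '2'.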

The dataset contains five subsets:

    * full: 18199 Wikipedia pages containing at least 8 sections with at least 12 sentences each and average sentence length at least 5; further filtering done on titles

    * uni: 3084 pages in full with single-word titles

    * wn: 3253 pages in full whose title corresponds to a WordNet synset

    * mini: 1748 pages in the intersection of uni and wn

    * tiny: 967 pages in mini with titles that are at most 12 characters long and correspond to at least one WordNet synset that is not a proper noun
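
The containment relations among the subsets listed above can be checked with a short Python sketch. It assumes each subset is available as a set of page titles; how those sets are obtained is not specified here. The function name `check_subset_nesting` is hypothetical. The WordNet conditions are not checked, since that would need an external lexical resource.

```python
def check_subset_nesting(full, uni, wn, mini, tiny):
    """Return True if the title sets satisfy the containments described above.

    Checks only set containment and the 12-character limit on tiny titles;
    the WordNet-related conditions are deliberately left out.
    """
    return (
        uni <= full                 # uni pages are drawn from full
        and wn <= full              # wn pages are drawn from full
        and mini <= (uni & wn)      # mini lies in the intersection of uni and wn
        and tiny <= mini            # tiny pages are drawn from mini
        and all(len(title) <= 12 for title in tiny)
    )
```

Calling it with the five title sets returns a single boolean, which makes it usable as a quick sanity check after downloading or regenerating the subsets.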