Our training data consists of large-scale text collected from news, webpages, and novels. Drawing text from diverse domains provides coverage of many types of words and phrases. Moreover, the recently collected webpages and news data allow us to learn semantic representations of newly emerging words.
Vocabulary building. To enrich our vocabulary, we include phrases from Wikipedia and Baidu Baike. We also apply the phrase discovery approach described in "Corpus-based Semantic Class Mining: Distributional vs. Pattern-Based Approaches", which improves the coverage of emerging phrases.
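As a rough illustration of the vocabulary-building step, the sketch below merges corpus words, knowledge-base entry titles, and phrases discovered from the corpus into one vocabulary. It is a minimal sketch under stated assumptions, not the procedure used here: the discovery step is a simple pointwise mutual information (PMI) filter over adjacent token pairs, swapped in as a stand-in for the cited phrase discovery approach, and the function names, thresholds, and toy inputs are all hypothetical.

```python
import math
from collections import Counter


def discover_bigram_phrases(sentences, min_count=2, min_pmi=1.0):
    """Return adjacent token pairs whose PMI is at least min_pmi.

    Assumption: a plain PMI filter stands in for the cited phrase
    discovery approach; it is not that method.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    phrases = set()
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue  # skip pairs too rare to score reliably
        p_pair = count / total_bigrams
        p_first = unigrams[first] / total_unigrams
        p_second = unigrams[second] / total_unigrams
        pmi = math.log(p_pair / (p_first * p_second))
        if pmi >= min_pmi:
            phrases.add(f"{first} {second}")
    return phrases


def build_vocabulary(corpus_words, knowledge_base_titles, discovered_phrases):
    """Union of single words, knowledge-base titles, and discovered phrases."""
    return set(corpus_words) | set(knowledge_base_titles) | set(discovered_phrases)


# Toy, hypothetical inputs (pre-tokenized sentences and entry titles).
sentences = [
    ["deep", "learning", "is", "fun"],
    ["deep", "learning", "works"],
    ["fun", "is", "fun"],
]
titles = ["machine translation"]

phrases = discover_bigram_phrases(sentences)
words = {token for sentence in sentences for token in sentence}
vocabulary = build_vocabulary(words, titles, phrases)
```

Keeping the three sources separate until the final union makes it easy to swap the PMI filter for a different discovery method without touching the rest of the pipeline.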