Unsupervised learning of text representations aims to convert natural language into vector representations. These vectors are then used in larger models, such as neural networks, to improve performance on supervised tasks. This line of work includes Word2Vec, Skip-thought, ELMo, BERT, and improved BERT variants such as RoBERTa and ALBERT. To evaluate the effectiveness of these representations, researchers have created suites of natural language processing (NLP) tasks, including SentEval and GLUE. These suites measure how well text representations improve a variety of NLP tasks, including text classification, semantic relatedness and similarity, question answering, and sequence labeling. This thesis discusses our work on both sides: we develop methods to train better language representations, and we develop better NLP task suites to evaluate them. Most of our unsupervised pretrained models use free text resources available online as training data. We use texts and their categories to improve text classification. We use Wikipedia category hierarchies to improve natural language inference. We use Wikipedia document structure to learn sentence representations that carry discourse information. We also use the hyperlink structure of Wikipedia to learn entity representations. Along with this work, we propose a variety of test suites with standardized tasks to evaluate text representations in these aspects.



