Knowledge Discovery from Big Text Data

Access & Terms of Use: open access
Copyright: Park, Eunkyung
Textual data are ubiquitous, and their analytics have great potential in many applications. This thesis makes three contributions.

First, I propose a method for building accurate sentiment classifiers from imbalanced textual data. I construct topic vectors to capture local and global patterns in a corpus, then apply the synthetic minority over-sampling technique (SMOTE) to balance the data. Because residual over-fitting remains prominent after SMOTE, I propose an autoencoded oversampling framework that reconstructs balanced datasets. Extensive experiments on datasets with various imbalance ratios and numbers of classes show that the approach is sound and effective.

Second, research on language models has focused on optimizing prediction accuracy, but such models cannot support decision-making without rationalization, so explainable models are becoming essential. Owing to the small-n-large-p setting, multicollinearity, and uncommon words in the transcripts of video ads, conventional models identify only a few, or even no, significant ad words. I propose Explainability Maximized Lasso (EMLasso), which maximizes the number of significant features while delivering excellent prediction accuracy. I also find that the number of significant features is mainly determined by the correlations among explanatory variables and by model complexity.

Third, I extend EMLasso with Variance Inflation Factor (VIF) iterations. EMLasso anchors the initial set of candidate words by maximizing the number of significant words; subsequent VIF iterations then adjust this set, excluding falsely significant words and identifying additional true ones. Compared with EMLasso alone, EMLasso+VIF excludes about 67 falsely significant words and identifies about 15 additional true words.
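The SMOTE step in the first contribution generates synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours. A minimal sketch of that interpolation (a hypothetical helper for illustration, not the thesis's actual pipeline, which additionally uses topic vectors and an autoencoder):

```python
import numpy as np

def smote_oversample(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples (SMOTE): each new
    point lies on the segment between a random minority sample and
    one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours per sample
    idx = rng.integers(0, len(X_min), n_new)   # random base samples
    nb = nn[idx, rng.integers(0, k, n_new)]    # a random neighbour for each
    gap = rng.random((n_new, 1))               # interpolation factor in [0, 1)
    return X_min[idx] + gap * (X_min[nb] - X_min[idx])

# toy minority class: four points at the corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_syn = smote_oversample(X_min, n_new=6)       # six synthetic samples
```

Because each synthetic point is a convex combination of two existing minority samples, SMOTE densifies the minority region rather than duplicating points, which is the property the autoencoded framework builds on to curb residual over-fitting.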
As a result, EMLasso+VIF achieves about 15 times higher F1 accuracy than the standard OLS+VIF method. I also find that EMLasso+VIF has about twice the F1 accuracy of EMLasso, underscoring the need to additionally exclude highly correlated variables. In summary, this thesis addresses the imbalanced-class and small-n-large-p problems in the text domain, which are common but critical difficulties in knowledge discovery from text data.
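The VIF iterations above screen out highly collinear words. The statistic itself is standard: for feature j, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing column j on the remaining columns. A minimal numpy sketch (an illustrative helper, not the thesis's EMLasso+VIF procedure):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor per column of X:
    VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing column j
    on the remaining columns (plus an intercept)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1.0 - resid.var() / y.var()       # with intercept, resid has mean 0
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)          # nearly collinear with x1
x3 = rng.normal(size=200)                      # independent
X = np.column_stack([x1, x2, x3])
v = vif(X)                                     # expect v[0], v[1] large; v[2] near 1
```

Iteratively dropping the feature with the largest VIF until all values fall below a threshold (commonly 5 or 10) is the usual form such iterations take; columns like x1 and x2 above are exactly the kind of redundant pair the procedure removes.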
Author(s): Park, Eunkyung
Degree Type: Masters Thesis
UNSW Faculty