Basic Info
Student: ZHU Xingye (Joseph)
Supervisor: Prof. Francis C.M. Lau
Dissertation Abstract
Public perception analysis helps improve services and detect issues. This project conducts sentiment analysis and topic labelling on Hong Kong MTR-related tweets in a Siemens application scenario and compares the algorithms adopted for each task. For sentiment analysis, we apply traditional deep neural networks such as RNN and CNN to a large public sentiment dataset. For topic labelling, we crawled, labeled, and augmented our own dataset and adopted the latest transfer learning techniques such as BERT and ULMFiT. For both tasks, we use FastText, a lightweight yet fast and powerful text classification algorithm, as the baseline.
In our experiments, RNN and ULMFiT achieved the best performance in the sentiment analysis and topic labelling tasks, respectively. Our experiments suggest that feature extraction determines model performance, while the most suitable level of feature extraction depends on the dataset (size, quality, etc.) and the label categories. Under-extraction (e.g., CNN) and over-extraction (e.g., BERT) may both lead to worse performance. Introducing transfer learning to NLP-related tasks in public perception analysis is also promising, especially when labeled samples are limited.
Project Workflow
Demos
For the demonstration models, we select RNN for the Sentiment Analysis task and ULMFiT for the Topic Labeling task; each achieved the best performance on its respective task in our project.
Below we demonstrate classification of the example text “Why the escalator broke again? And the train has major delay!” to show that our system works.
The ground-truth labels for the example text are easy to determine:
- Sentiment: NEGATIVE
- Topic: TRAIN_SERVICE & FACILITIES
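As a rough illustration of how such a prediction could be reproduced programmatically, below is a minimal inference sketch for the sentiment model, assuming the trained RNN was exported as a Keras model and the fitted tokenizer was pickled. The file names and sequence length are hypothetical, not the project's exact artifacts.

```python
# Minimal inference sketch for the sentiment demo (hypothetical file names).
import pickle

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 40  # assumed maximum tweet length used during training

model = tf.keras.models.load_model("sentiment_rnn.h5")    # hypothetical path
with open("tokenizer.pkl", "rb") as f:                    # hypothetical path
    tokenizer = pickle.load(f)

text = "Why the escalator broke again? And the train has major delay!"
seq = pad_sequences(tokenizer.texts_to_sequences([text]), maxlen=MAX_LEN)
prob_positive = float(model.predict(seq)[0][0])

print("NEGATIVE" if prob_positive < 0.5 else "POSITIVE")  # expected: NEGATIVE
```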
Sentiment Analysis Demo
Topic Labeling Demo
Video Demos
Here we demonstrate the process of model training and evaluation.
For the Topic Labeling task, we will demonstrate one binary classifier per algorithm; the other categories are similar.
Topic Labeling - ULMFiT - TRAIN_SERVICE
Topic Labeling - BERT - PEOPLE
Topic Labeling - FastText - FACILITIES
Experiment Results
Sentiment Analysis
We trained RNN, CNN, and FastText sentiment classifiers on the Sentiment140 dataset using TensorFlow; a minimal sketch of the RNN classifier is given after this paragraph, and the comparison results are shown below.
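The sketch below shows the general shape of such an RNN classifier in TensorFlow/Keras; the vocabulary size and layer sizes are illustrative assumptions rather than the exact settings used in our experiments.

```python
# Minimal sketch of a bidirectional-LSTM sentiment classifier in TensorFlow/Keras.
# Hyperparameters below are illustrative assumptions.
import tensorflow as tf

VOCAB_SIZE = 20000   # assumed tokenizer vocabulary size
EMBED_DIM = 128

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # positive vs. negative
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

# x_train / y_train: padded token-id sequences and 0/1 labels from Sentiment140
# model.fit(x_train, y_train, validation_split=0.1, epochs=3, batch_size=256)
```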
Topic Labeling
We trained binary classifiers for the following categories:
- TRAIN_SERVICE
- FACILITIES
- PEOPLE
using the following algorithms:
- ULMFiT
- BERT
- FastText
on our own crawled dataset (each category has around 450 tweets). A minimal ULMFiT fine-tuning sketch follows.
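The sketch below outlines ULMFiT fine-tuning for one binary classifier with the fastai v1 API: fine-tune the pretrained AWD-LSTM language model on our tweets, then train the classifier head with gradual unfreezing. File names, column names, and hyperparameters are illustrative assumptions, not the project's exact configuration.

```python
# ULMFiT sketch with fastai v1 (hypothetical file and column names).
from fastai.text import *

path = "."  # hypothetical data directory

# Step 1: fine-tune the pretrained AWD-LSTM language model on unlabeled tweets.
data_lm = TextLMDataBunch.from_csv(path, "tweets_unlabeled.csv", text_cols="text")
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn_lm.fit_one_cycle(1, 1e-2)
learn_lm.save_encoder("ft_enc")

# Step 2: train the binary TRAIN_SERVICE classifier on the fine-tuned encoder.
data_clas = TextClasDataBunch.from_csv(
    path, "train_service.csv", text_cols="text", label_cols="label",
    vocab=data_lm.train_ds.vocab, bs=32)
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder("ft_enc")
learn.fit_one_cycle(1, 2e-2)     # train only the classifier head first
learn.freeze_to(-2)              # gradual unfreezing, as in ULMFiT
learn.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))

print(learn.predict("why the escalator broke again and the train has major delay"))
```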
The comparison results are shown below.
Conclusions
- Feature extraction determines the performance of models.
- Deeper extractors help obtain more abstract, higher-quality features.
- In terms of single-layer extractor capability: Transformer > RNN > FastText > CNN.
- The level of feature extraction required might depend on the size of the dataset and the level of features required by the categories.
- The lower the level of features required by the categories, the less powerful a feature extractor we need.
- Too much feature extraction might lead to a decline in performance.
- When using transfer learning techniques, further modification of the model structure might be necessary to reach the best performance on our downstream task.
- FastText could serve as a good baseline and works on small datasets as well; a minimal baseline sketch is given after this list.
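To illustrate the last point, here is a minimal sketch of a FastText baseline using the official fasttext Python bindings; the file names are hypothetical, and the training files are assumed to use FastText's "__label__X text" format.

```python
# Minimal FastText baseline sketch (official fasttext Python bindings).
# train.txt / valid.txt are hypothetical files in "__label__X text" format,
# e.g. "__label__TRAIN_SERVICE the train has major delay".
import fasttext

model = fasttext.train_supervised(
    input="train.txt",   # hypothetical training file
    lr=0.5,
    epoch=25,
    wordNgrams=2,        # word bigrams help with short, noisy tweets
)

print(model.test("valid.txt"))   # (num_samples, precision@1, recall@1)
print(model.predict("why the escalator broke again"))
```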