Pipeline
- Before feature engineering, apply a clean_tweet function that takes a raw tweet as input and removes URLs, user mentions, special characters, and extra spaces. The output is a cleaned tweet containing only the relevant textual content.
- Then normalize each cleaned tweet with the tokenizer of a pre-trained model.
- Then compute an embedding for each tweet using the pre-trained model.
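The cleaning step above can be sketched as a small regex-based function. This is a minimal illustration of the behavior described (remove URLs, user mentions, special characters, extra spaces); the exact patterns in the project's clean_tweet may differ.

```python
import re

def clean_tweet(tweet: str) -> str:
    """Remove URLs, user mentions, special characters, and extra spaces."""
    tweet = re.sub(r"http\S+|www\.\S+", "", tweet)  # URLs
    tweet = re.sub(r"@\w+", "", tweet)              # user mentions
    tweet = re.sub(r"[^A-Za-z0-9\s]", "", tweet)    # special characters
    tweet = re.sub(r"\s+", " ", tweet).strip()      # extra spaces
    return tweet

print(clean_tweet("@user Check this out! https://t.co/abc #wow"))
# → Check this out wow
```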
BERT model choices: BERT, BERTweet, XLNet
- Load the pre-trained BERTweet tokenizer and model (bertweet-base)
- Read the clean_text column and use the tokenizer to get the tokens of each cleaned tweet
- Convert the tokens to a tensor and run them through the model to generate BERTweet embeddings
- For each tweet, average the 768-dimensional token embeddings to obtain a single 768-dimensional tweet embedding
- Store the results for the first 100 tweets in 100_tweets_with_bertweet_embeddings.csv
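The embedding steps above can be sketched as follows, assuming the Hugging Face transformers library with PyTorch and the "vinai/bertweet-base" checkpoint; the helper names (mean_pool, embed_tweets) are illustrative, not from the original code. The averaging masks out padding tokens so they do not dilute the tweet embedding.

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor,
              attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the 768-dim token embeddings of each tweet, ignoring padding."""
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, 768)
    counts = mask.sum(dim=1).clamp(min=1e-9)        # real-token count per tweet
    return summed / counts                          # one 768-dim vector per tweet

def embed_tweets(texts, tokenizer, model):
    """Tokenize cleaned tweets and return one 768-dim embedding per tweet."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return mean_pool(out.last_hidden_state, batch["attention_mask"])

# Usage (downloads the model on first run):
#   from transformers import AutoTokenizer, AutoModel
#   tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
#   model = AutoModel.from_pretrained("vinai/bertweet-base")
#   embeddings = embed_tweets(["good morning", "so bad"], tokenizer, model)
#   # embeddings has shape (2, 768); rows can then be written to the CSV
```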