BERT feature embedding of tweets

Pipeline

  1. Before feature engineering, apply the clean_tweet function, which takes a raw tweet as input and removes URLs, user mentions, special characters, and extra spaces. The output is a cleaned tweet containing only the relevant textual content.
  2. Normalize and tokenize the cleaned tweets with the pre-trained model's tokenizer.
  3. Generate an embedding for each tweet using the pre-trained model.
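The cleaning step above can be sketched with plain regular expressions; the exact patterns below are assumptions, not the project's verbatim implementation:

```python
import re

def clean_tweet(tweet: str) -> str:
    """Strip URLs, user mentions, special characters, and extra spaces."""
    tweet = re.sub(r"https?://\S+|www\.\S+", "", tweet)  # URLs (assumed pattern)
    tweet = re.sub(r"@\w+", "", tweet)                   # user mentions
    tweet = re.sub(r"[^A-Za-z0-9\s]", "", tweet)         # special characters
    tweet = re.sub(r"\s+", " ", tweet).strip()           # collapse extra spaces
    return tweet
```

For example, `clean_tweet("@user check https://t.co/abc this!!")` yields `"check this"`.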

BERT model choices: BERT, BERTweet, XLNet

  1. Load the pre-trained BERTweet tokenizer and model (bertweet-base)
  2. Read the clean_text column and tokenize each cleaned tweet
  3. Convert the tokens to tensors and generate BERTweet embeddings
  4. For each tweet, average the 768-dimensional embeddings of its tokens to obtain a single 768-dimensional tweet embedding
  5. Store the first 100 tweets’ results in 100_tweets_with_bertweet_embeddings.csv
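Steps 1–4 can be sketched with the Hugging Face transformers library. The `vinai/bertweet-base` checkpoint name, the batch size, the max length, and the `mean_pool` helper are assumptions filling in details the notes leave open:

```python
import torch
from transformers import AutoModel, AutoTokenizer


def mean_pool(last_hidden_state, attention_mask):
    # Average token embeddings, ignoring padding positions,
    # to get one 768-dim vector per tweet.
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(1) / mask.sum(1)


def embed_tweets(texts, model_name="vinai/bertweet-base", batch_size=32):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    embeddings = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(
                texts[i : i + batch_size],
                padding=True,
                truncation=True,
                max_length=128,  # assumed tweet-length cap
                return_tensors="pt",
            )
            out = model(**batch)
            embeddings.append(
                mean_pool(out.last_hidden_state, batch["attention_mask"])
            )
    return torch.cat(embeddings).numpy()  # shape: (n_tweets, 768)
```

A possible use for step 5 (column names are assumptions): read the cleaned tweets with pandas, call `embed_tweets(df["clean_text"].tolist()[:100])`, attach the resulting array as columns, and write the frame to 100_tweets_with_bertweet_embeddings.csv.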
