Pipeline
- Before feature engineering, apply a clean_tweet function that takes a raw tweet as input and removes URLs, user mentions, special characters, and extra spaces. The output is a cleaned tweet containing only the relevant textual content.
- Then normalize each cleaned tweet with the tokenizer of a pre-trained model.
- Then compute an embedding for each tweet using the pre-trained model.
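The cleaning step above can be sketched as a small regex-based function. This is a minimal illustration of the behavior described (remove URLs, user mentions, special characters, extra spaces); the exact patterns in the project's clean_tweet may differ.

```python
import re

def clean_tweet(tweet: str) -> str:
    """Remove URLs, user mentions, special characters, and extra spaces."""
    tweet = re.sub(r"http\S+|www\.\S+", "", tweet)  # URLs
    tweet = re.sub(r"@\w+", "", tweet)              # user mentions
    tweet = re.sub(r"[^A-Za-z0-9\s]", "", tweet)    # special characters
    tweet = re.sub(r"\s+", " ", tweet).strip()      # extra spaces
    return tweet

print(clean_tweet("@user Check this out! https://t.co/abc #wow"))
# → Check this out wow
```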
BERT model choices: BERT, BERTweet, XLNet
- Load the pre-trained BERTweet tokenizer and model (bertweet-base)
- Read the clean_text column and use the tokenizer to get the tokens of each cleaned tweet
- Convert the tokens to a tensor and run them through the model to generate BERTweet embeddings
- For each tweet, average the 768-dimensional token embeddings to obtain a single 768-dimensional tweet embedding
- Store the results for the first 100 tweets in 100_tweets_with_bertweet_embeddings.csv
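The embedding steps above can be sketched as follows, assuming the Hugging Face transformers library with PyTorch and the "vinai/bertweet-base" checkpoint; the helper names (mean_pool, embed_tweets) are illustrative, not from the original code. The averaging masks out padding tokens so they do not dilute the tweet embedding.

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor,
              attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the 768-dim token embeddings of each tweet, ignoring padding."""
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, 768)
    counts = mask.sum(dim=1).clamp(min=1e-9)        # real-token count per tweet
    return summed / counts                          # one 768-dim vector per tweet

def embed_tweets(texts, tokenizer, model):
    """Tokenize cleaned tweets and return one 768-dim embedding per tweet."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return mean_pool(out.last_hidden_state, batch["attention_mask"])

# Usage (downloads the model on first run):
#   from transformers import AutoTokenizer, AutoModel
#   tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
#   model = AutoModel.from_pretrained("vinai/bertweet-base")
#   embeddings = embed_tweets(["good morning", "so bad"], tokenizer, model)
#   # embeddings has shape (2, 768); rows can then be written to the CSV
```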