NLP (VII): Text Sentiment Classification with sklearn (Part 2)

In this section we use gensim to turn words into vectors.

Tokenizing with spaCy

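import numpy as np
import spacy

# twitter_train_df and twitter_test_df are assumed to be the DataFrames from the previous section.
# Tokenize train and test together so that both share one vocabulary.
all_texts = twitter_train_df['text'].tolist() + twitter_test_df['text'].tolist()
all_tokenized_texts = []
token_freq_dict = {}  # lowercased token -> number of occurrences in the corpus
nlp = spacy.load("en_core_web_sm")

for tweet in all_texts:
  doc = nlp(tweet)
  token_tweet = []
  for token in doc:
    token = token.text.lower()
    token_tweet.append(token)
    # Count how often each token occurs across the whole corpus.
    token_freq_dict[token] = token_freq_dict.get(token, 0) + 1
  all_tokenized_texts.append(token_tweet)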
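
The frequency counting in the loop above can also be written with collections.Counter from the standard library. The sketch below is only illustrative: it assumes the tweets are already tokenized, and the sample token lists are made up rather than taken from the dataset.

```python
from collections import Counter

# Hypothetical, already-tokenized tweets (not from the real dataset).
sample_tokenized = [
    ["i", "love", "this", "movie"],
    ["i", "hate", "this", "weather"],
]

# Counter.update adds one count per token in each list.
sample_freq = Counter()
for tokens in sample_tokenized:
    sample_freq.update(tokens)

print(sample_freq["i"])       # 2
print(sample_freq["movie"])   # 1
print(sample_freq["unseen"])  # 0, a Counter returns 0 for missing keys
```

A Counter behaves like a dict, so a lookup such as sample_freq[token] >= 5 works the same way as with a hand-built frequency dict.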

Vectorizing tokens with gensim

For usage of the gensim package, see the official documentation:
https://radimrehurek.com/gensim/models/word2vec.html

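from gensim.models import Word2Vec

# gensim >= 4.0 names this parameter vector_size; gensim 3.x called it size.
# min_count=5 is the default: tokens seen fewer than 5 times get no vector.
model = Word2Vec(all_tokenized_texts, vector_size=300, min_count=5)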

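Each tweet's vector can be computed by averaging the vectors of its tokens. Tokens that occur fewer than 5 times are skipped, because Word2Vec's default min_count of 5 leaves them out of the vocabulary; a tweet with no remaining tokens gets a zero vector: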

all_vec_tweets = []
for tweet in all_tokenized_texts:
  tw_vecs = []
  for token in tweet:
    # Only tokens seen at least 5 times are in the Word2Vec vocabulary.
    if token_freq_dict[token] >= 5:
      tw_vecs.append(model.wv[token].tolist())
  if len(tw_vecs) == 0:
    # No in-vocabulary tokens: fall back to a zero vector.
    all_vec_tweets.append(np.zeros(300).tolist())
  else:
    all_vec_tweets.append(np.mean(np.array(tw_vecs), axis=0).tolist())
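
The averaging step can be checked by hand on a small case. The sketch below is a simplified stand-in, not the code used on the dataset: mean_vector is a hypothetical helper, and a plain dict of 2-dimensional arrays takes the place of the trained model.

```python
import numpy as np

def mean_vector(tokens, vectors, dim):
    """Average the vectors of the tokens found in `vectors`; zeros if none are."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)

# Hypothetical 2-dimensional embeddings, for illustration only.
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "movie": np.array([0.0, 1.0]),
}

print(mean_vector(["good", "movie", "zzz"], toy_vectors, 2))  # [0.5 0.5]
print(mean_vector(["zzz"], toy_vectors, 2))                   # [0. 0.]
```

The unknown token "zzz" is ignored in the first call, and the second call falls back to the zero vector because no token has an embedding.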

Training the model with sklearn

This step is the same as in the previous section.

from sklearn.linear_model import LogisticRegression

# Train rows come first in all_vec_tweets, so slice at len(twitter_train_df).
train_X = np.array(all_vec_tweets[:len(twitter_train_df)])
train_y = twitter_train_df['sentiment']

test_X = np.array(all_vec_tweets[len(twitter_train_df):])
test_y = twitter_test_df['sentiment']

clf = LogisticRegression(random_state=0).fit(train_X, train_y)
print("The accuracy of the trained classifier is " + str(clf.score(test_X, test_y) * 100) + "%")
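
The split above works because the training texts were placed before the test texts when all_texts was built, so the first len(twitter_train_df) vectors belong to the training set. The following sketch repeats the same slicing on a tiny made-up feature matrix; all names and values in it are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature rows: 4 training rows followed by 2 test rows.
all_rows = np.array([
    [0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1],  # train
    [0.1, 0.1], [1.0, 1.0],                          # test
])
n_train = 4
toy_train_y = [0, 0, 1, 1]
toy_test_y = [0, 1]

# Slice at n_train, the same way the real code slices at len(twitter_train_df).
toy_train_X, toy_test_X = all_rows[:n_train], all_rows[n_train:]

toy_clf = LogisticRegression(random_state=0).fit(toy_train_X, toy_train_y)
toy_score = toy_clf.score(toy_test_X, toy_test_y)  # fraction of correct predictions
print(toy_score)
```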