NLP with BoW & NLTK: common natural language processing techniques, bag-of-words (BoW) and the NLTK library


Output

[[0 1 1 0 1 0 0 0 1 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0]
 [1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 1 1 1 1 1]]

BoW: vocabulary in sorted order (lowercased, punctuation dropped): ['by', 'career', 'combined', 'congress', 'for', 'government', 'huawei', 'imposed', 'in', 'james', 'jordan', 'lebron', 'michael', 'passed', 'playoffs', 'points', 'regular', 'restrictions', 'sales', 'season', 'sues', 'the', 'today', 'unconstitutional', 'us']

NLTK: every token in the sentence (including punctuation): ['Today', ',', 'LeBron', 'James', 'passed', 'Michael', 'Jordan', 'in', 'career', 'points', 'for', 'regular', 'season', ',', 'playoffs', 'combined', '.']
NLTK: every token in the sentence (including punctuation): ['Today', ',', 'Huawei', 'Sues', 'the', 'US', 'Government', 'for', 'Unconstitutional', 'Sales', 'Restrictions', 'Imposed', 'by', 'Congress', '.']

NLTK: every token in the sentence (including punctuation), in sorted order: [',', '.', 'James', 'Jordan', 'LeBron', 'Michael', 'Today', 'career', 'combined', 'for', 'in', 'passed', 'playoffs', 'points', 'regular', 'season']
NLTK: every token in the sentence (including punctuation), in sorted order: [',', '.', 'Congress', 'Government', 'Huawei', 'Imposed', 'Restrictions', 'Sales', 'Sues', 'Today', 'US', 'Unconstitutional', 'by', 'for', 'the']

['today', ',', 'lebron', 'jame', 'pass', 'michael', 'jordan', 'in', 'career', 'point', 'for', 'regular', 'season', ',', 'playoff', 'combin', '.']
['today', ',', 'huawei', 'sue', 'the', 'US', 'govern', 'for', 'unconstitut', 'sale', 'restrict', 'impos', 'by', 'congress', '.']

NLTK: every token in the sentence (including punctuation) with its part-of-speech tag: [('Today', 'NN'), (',', ','), ('LeBron', 'NNP'), ('James', 'NNP'), ('passed', 'VBD'), ('Michael', 'NNP'), ('Jordan', 'NNP'), ('in', 'IN'), ('career', 'NN'), ('points', 'NNS'), ('for', 'IN'), ('regular', 'JJ'), ('season', 'NN'), (',', ','), ('playoffs', 'NNS'), ('combined', 'VBD'), ('.', '.')]
NLTK: every token in the sentence (including punctuation) with its part-of-speech tag: [('Today', 'NN'), (',', ','), ('Huawei', 'NNP'), ('Sues', 'NNP'), ('the', 'DT'), ('US', 'NNP'), ('Government', 'NNP'), ('for', 'IN'), ('Unconstitutional', 'NNP'), ('Sales', 'NNS'), ('Restrictions', 'NNS'), ('Imposed', 'VBN'), ('by', 'IN'), ('Congress', 'NNP'), ('.', '.')]

Implementation

Test sentences, taken from the day's news:
sent1 = 'Today, LeBron James passed Michael Jordan in career points for regular season, playoffs combined.'
sent2 = 'Today, Huawei Sues the US Government for Unconstitutional Sales Restrictions Imposed by Congress.'
# Chinese rendering of sent1: '今天,勒布朗·詹姆斯在常规赛和季后赛的总得分中超过了迈克尔·乔丹。'

# 1. Turn the sample text into feature vectors with the bag-of-words (BoW) method

from sklearn.feature_extraction.text import CountVectorizer

sent1 = 'Today, LeBron James passed Michael Jordan in career points for regular season, playoffs combined.'
sent2 = 'Today, Huawei Sues the US Government for Unconstitutional Sales Restrictions Imposed by Congress.'
count_vec = CountVectorizer()
sentences = [sent1, sent2]

print(count_vec.fit_transform(sentences).toarray())
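# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (available since 1.0) replaces it.
print('BoW: vocabulary in sorted order (lowercased, punctuation dropped):', count_vec.get_feature_names_out().tolist())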
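
To make clear what the count matrix encodes, here is a minimal pure-Python sketch of the same bag-of-words step. It assumes CountVectorizer's default settings: lowercasing, plus a token pattern that keeps only runs of two or more word characters, which is why the punctuation disappears. The helper names tokenize and bag_of_words are hypothetical and not part of scikit-learn.

```python
import re

sent1 = 'Today, LeBron James passed Michael Jordan in career points for regular season, playoffs combined.'
sent2 = 'Today, Huawei Sues the US Government for Unconstitutional Sales Restrictions Imposed by Congress.'

# Assumed to mirror CountVectorizer's default token pattern:
# tokens of at least two word characters, so ',' and '.' are dropped.
TOKEN_RE = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens (hypothetical helper)."""
    return TOKEN_RE.findall(text.lower())


def bag_of_words(sentences):
    """Return (sorted vocabulary, count matrix) for a list of sentences."""
    tokenized = [tokenize(s) for s in sentences]
    # One column per distinct token, in sorted order.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    column = {word: i for i, word in enumerate(vocab)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok in toks:
            row[column[tok]] += 1  # count occurrences per sentence
        matrix.append(row)
    return vocab, matrix


vocab, matrix = bag_of_words([sent1, sent2])
print(vocab)
for row in matrix:
    print(row)
```

Each row is one sentence and each column is one vocabulary word, so the only columns set in both rows are the ones for 'for' and 'today', the two words the sentences share.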

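# 2. Use NLTK for a finer-grained look at the words in the two sentences: tokenization, word forms (stemming), and part-of-speech tagging.
# word_tokenize and pos_tag rely on NLTK data packages; if one is missing, the LookupError names the resource to fetch with nltk.download(...).
import nltk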

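# Split each sentence into tokens; punctuation marks become tokens of their own.
tokens_1 = nltk.word_tokenize(sent1)
tokens_2 = nltk.word_tokenize(sent2)
print('NLTK: every token in the sentence (including punctuation):', tokens_1)
print('NLTK: every token in the sentence (including punctuation):', tokens_2)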

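# Deduplicate and sort the tokens to get a per-sentence vocabulary (case-sensitive).
vocab_1 = sorted(set(tokens_1))
vocab_2 = sorted(set(tokens_2))
print('NLTK: every token in the sentence (including punctuation), in sorted order:', vocab_1)
print('NLTK: every token in the sentence (including punctuation), in sorted order:', vocab_2)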

# Reduce each token to its stem with the Porter stemmer.
stemmer = nltk.stem.PorterStemmer()
stem_1 = [stemmer.stem(t) for t in tokens_1]
stem_2 = [stemmer.stem(t) for t in tokens_2]
print(stem_1)
print(stem_2)

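# Tag each token with its part of speech.
pos_tag_1 = nltk.tag.pos_tag(tokens_1)
pos_tag_2 = nltk.tag.pos_tag(tokens_2)
print('NLTK: every token in the sentence (including punctuation) with its part-of-speech tag:', pos_tag_1)
print('NLTK: every token in the sentence (including punctuation) with its part-of-speech tag:', pos_tag_2)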
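
The tagged result is a plain list of (word, tag) tuples, so it can be summarized with the standard library alone. The sketch below copies the recorded tags for sent2 from the output above instead of calling NLTK again, so it runs without any NLTK data. The names tagged, tag_counts and words_by_tag are illustrative only.

```python
from collections import Counter, defaultdict

# (word, tag) pairs copied from the recorded pos_tag output for sent2.
tagged = [('Today', 'NN'), (',', ','), ('Huawei', 'NNP'), ('Sues', 'NNP'),
          ('the', 'DT'), ('US', 'NNP'), ('Government', 'NNP'), ('for', 'IN'),
          ('Unconstitutional', 'NNP'), ('Sales', 'NNS'), ('Restrictions', 'NNS'),
          ('Imposed', 'VBN'), ('by', 'IN'), ('Congress', 'NNP'), ('.', '.')]

# How often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged)

# Which words received each tag, in sentence order.
words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

print(tag_counts.most_common())
print(words_by_tag['NNP'])
```

Six of the fifteen tokens are tagged NNP (proper noun), among them 'Sues' and 'Unconstitutional'. A likely reason is the headline's title-case capitalization, so the tags for sent2 should be read with some caution.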