Forming bigrams of words in a list of sentences with Python
Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors on Stack Overflow (not to this page).
Original question: http://stackoverflow.com/questions/21844546/
Asked by Hypothetical Ninja
I have a list of sentences:
text = ['cant railway station','citadel hotel',' police stn']
I need to form bigram pairs and store them in a variable. The problem is that when I do that, I get a pair of sentences instead of words. Here is what I did:
text2 = [[word for word in line.split()] for line in text]
bigrams = nltk.bigrams(text2)
print(bigrams)
which yields
[(['cant', 'railway', 'station'], ['citadel', 'hotel']), (['citadel', 'hotel'], ['police', 'stn'])]
Here 'cant railway station' and 'citadel hotel' form one bigram, which is not what I need. What I want is
[([cant],[railway]),([railway],[station]),([citadel],[hotel]), and so on...
The last word of the first sentence should not merge with the first word of second sentence. What should I do to make it work?
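The sentence pairs in the output above appear because a bigram function pairs adjacent items of whatever sequence it is given; when the items are token lists, each pair holds two whole sentences. The following is a minimal sketch of that effect using only the standard library. The helper `adjacent_pairs` is a hypothetical stand-in for `nltk.bigrams`, not NLTK's own code.

```python
def adjacent_pairs(items):
    """Pair each item with its successor (hypothetical stand-in for nltk.bigrams)."""
    return list(zip(items, items[1:]))

text = ['cant railway station', 'citadel hotel', ' police stn']
text2 = [line.split() for line in text]

# The items are whole sentences, so each pair holds two token lists.
print(adjacent_pairs(text2))

# The items are words, so each pair holds two words from the same sentence.
print([pair for words in text2 for pair in adjacent_pairs(words)])
```

Applying the pairing once per sentence, as in the second call, is what keeps the last word of one sentence from merging with the first word of the next.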
Accepted answer by butch
Using list comprehensions and zip:
>>> text = ["this is a sentence", "so is this one"]
>>> bigrams = [b for l in text for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
>>> print(bigrams)
[('this', 'is'), ('is', 'a'), ('a', 'sentence'), ('so', 'is'), ('is', 'this'), ('this', 'one')]
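One detail worth checking against the question's data: `' police stn'` starts with a space, and `split(" ")` would return an empty string as its first token. The sketch below is a small variant of the same zip idea that calls `split()` with no argument, which collapses runs of whitespace. The function name `sentence_bigrams` is made up for illustration.

```python
def sentence_bigrams(sentences):
    """Return word bigrams that never cross a sentence boundary."""
    pairs = []
    for sentence in sentences:
        # split() with no argument ignores leading, trailing and repeated whitespace
        words = sentence.split()
        # zip stops at the shorter input, so one-word sentences yield nothing
        pairs.extend(zip(words, words[1:]))
    return pairs

text = ['cant railway station', 'citadel hotel', ' police stn']
print(sentence_bigrams(text))
# [('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]
```

Splitting each sentence once also avoids calling `split` twice per sentence, though for short inputs the difference is negligible.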
Answer by Dan
Rather than turning your text into lists of strings, start with each sentence separately as a string. I've also removed punctuation and stopwords; just remove these portions if they are irrelevant to you:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    stemmer = PorterStemmer()
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)
    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)
    result = [' '.join([stemmer.stem(w).lower() for w in x.split()]) for x in tokens if x.lower() not in stopwords.words('english') and len(x) > 8]
    return result
To use it, do the following:
for line in sentence:
    features = get_bigrams(line)
    # train set here
Note that this goes a little further and actually statistically scores the bigrams (which will come in handy in training the model).
Answer by alfasin
Without nltk:
ans = []
text = ['cant railway station','citadel hotel',' police stn']
for line in text:
    arr = line.split()
    for i in range(len(arr)-1):
        ans.append([[arr[i]], [arr[i+1]]])
print(ans) #prints: [[['cant'], ['railway']], [['railway'], ['station']], [['citadel'], ['hotel']], [['police'], ['stn']]]
Answer by Tanveer Alam
>>> text = ['cant railway station','citadel hotel',' police stn']
>>> bigrams = [(ele, tex.split()[i+1]) for tex in text  for i,ele in enumerate(tex.split()) if i < len(tex.split())-1]
>>> bigrams
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]
This uses the enumerate and split functions.
Answer by Jay Marm
Just fixing Dan's code:
def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    stemmer = PorterStemmer()
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)
    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)
    result = [' '.join([stemmer.stem(w).lower() for w in x.split()]) for x in tokens if x.lower() not in stopwords.words('english') and len(x) > 8]
    return result
Answer by gurinder
from nltk import word_tokenize
from nltk.util import ngrams
text = ['cant railway station', 'citadel hotel', 'police stn']
for line in text:
    # word_tokenize is imported directly, so call it without the nltk. prefix
    # (it needs NLTK's tokenizer data to be downloaded beforehand)
    token = word_tokenize(line)
    bigram = list(ngrams(token, 2))
    # the 2 selects bigrams; change it to get n-grams of a different size
    print(bigram)
Answer by avi
Read the dataset
import pandas as pd
import nltk as nk  # assumed alias; the answer refers to nltk as nk
df = pd.read_csv('dataset.csv', skiprows=6, index_col="No")
Collect all available months
df["Month"] = df["Date(ET)"].apply(lambda x: x.split('/')[0])
Create tokens of all tweets per month
tokens = df.groupby("Month")["Contents"].sum().apply(lambda x: x.split(' '))
Create bigrams per month
bigrams = tokens.apply(lambda x: list(nk.ngrams(x, 2)))
Count bigrams per month
count_bigrams = bigrams.apply(lambda x: list(x.count(item) for item in x))
Wrap up the result in neat dataframes
# .iloc selects by position, since the Series is indexed by month label
month1 = pd.DataFrame(data=count_bigrams.iloc[0], index=bigrams.iloc[0], columns=["Count"])
month2 = pd.DataFrame(data=count_bigrams.iloc[1], index=bigrams.iloc[1], columns=["Count"])
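The per-month counting step can also be sketched with the standard library alone. The version below is an illustration under stated assumptions, not the answer's method: the `tweets` records are made-up sample data standing in for the CSV rows, and `bigram_counts_by_month` is a hypothetical helper. It builds bigrams one tweet at a time, so pairs never span two tweets, and it uses `collections.Counter` so each distinct bigram is counted once rather than once per occurrence.

```python
from collections import Counter, defaultdict

# Made-up (month, tweet) records standing in for the CSV rows.
tweets = [
    ("01", "new railway station opens"),
    ("01", "railway station delayed"),
    ("02", "citadel hotel opens"),
]

def bigram_counts_by_month(records):
    """Map each month to a Counter of the word bigrams found in its tweets."""
    counts = defaultdict(Counter)
    for month, content in records:
        words = content.split()
        # Pair each word with its successor within this tweet only.
        counts[month].update(zip(words, words[1:]))
    return counts

counts = bigram_counts_by_month(tweets)
print(counts["01"].most_common(1))
# [(('railway', 'station'), 2)]
```

If a DataFrame is still wanted, each month's `Counter` can be passed to pandas afterwards; that step is left out here to keep the sketch dependency-free.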
Answer by saicharan
There are a number of ways to solve it, but I solved it this way:
>>> from nltk import bigrams
>>> text = ['cant railway station','citadel hotel',' police stn']
>>> text2 = [[word for word in line.split()] for line in text]
>>> text2
[['cant', 'railway', 'station'], ['citadel', 'hotel'], ['police', 'stn']]
>>> output = []
>>> for i in range(len(text2)):
...     output = output + list(bigrams(text2[i]))
...
>>> # A list comprehension would also work here
>>> output
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]
Answer by Radio Controlled
I think the best and most general way to do it is the following:
n      = 2
ngrams = []
for l in L:
    for i in range(n,len(l)+1):
        ngrams.append(l[i-n:i])
or in other words:
ngrams = [ l[i-n:i] for l in L for i in range(n,len(l)+1) ]
This should work for any n and any sequence l. If a sequence has no n-grams of length n (because it is shorter than n), it contributes nothing, and if no sequence has any, the result is the empty list.
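As a rough illustration of the slicing approach applied to the question's data, the sketch below wraps it in a function. The name `ngrams_per_sequence`, the check on `n`, and the conversion of each window to a tuple are additions made here for the example; they are not part of the answer.

```python
def ngrams_per_sequence(sequences, n=2):
    """Collect length-n windows from each sequence, never spanning two sequences."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Window i covers positions i-n .. i-1; sequences shorter than n give no windows.
    return [tuple(seq[i - n:i]) for seq in sequences for i in range(n, len(seq) + 1)]

text = ['cant railway station', 'citadel hotel', ' police stn']
tokens = [line.split() for line in text]

print(ngrams_per_sequence(tokens, 2))
# [('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]
print(ngrams_per_sequence(tokens, 3))
# [('cant', 'railway', 'station')]
print(ngrams_per_sequence(tokens, 4))
# []
```

Only the first sentence has three words, so it is the only one that yields a trigram, and no sentence is long enough for n = 4.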

