Forming bigrams of words in a list of sentences with Python

Question

Asked by Hypothetical Ninja

I have a list of sentences:

text = ['cant railway station','citadel hotel',' police stn']

I need to form bigram pairs and store them in a variable. The problem is that when I do that, I get a pair of sentences instead of words. Here is what I did:

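import nltk

text2 = [[word for word in line.split()] for line in text]
bigrams = nltk.bigrams(text2)
# nltk.bigrams returns a generator in current NLTK versions, so materialize it before printing
print(list(bigrams))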

which yields

[(['cant', 'railway', 'station'], ['citadel', 'hotel']), (['citadel', 'hotel'], ['police', 'stn'])]

Here 'cant railway station' and 'citadel hotel' form one bigram. What I want is

[(['cant'], ['railway']), (['railway'], ['station']), (['citadel'], ['hotel']), and so on...

The last word of the first sentence should not merge with the first word of second sentence. What should I do to make it work?

Answer 1

Accepted answer by butch

Using list comprehensions and zip:

>>> text = ["this is a sentence", "so is this one"]
>>> bigrams = [b for l in text for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
>>> print(bigrams)
[('this', 'is'), ('is', 'a'), ('a', 'sentence'), ('so', 'is'), ('is', 'this'), ('this', 'one')]
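
One caveat when applying this to the list from the question: split(" ") treats every single space as a separator, so the leading space in ' police stn' produces an empty first token and therefore a bigram ('', 'police'). A minimal sketch that calls split() with no argument and splits each sentence only once might look like this (the function name sentence_bigrams is made up for illustration):

```python
def sentence_bigrams(sentences):
    """Return word bigrams that never cross a sentence boundary."""
    pairs = []
    for sentence in sentences:
        # split() with no argument ignores leading and repeated whitespace
        words = sentence.split()
        # pair each word with its successor inside the same sentence
        pairs.extend(zip(words, words[1:]))
    return pairs


text = ['cant railway station', 'citadel hotel', ' police stn']
print(sentence_bigrams(text))
# [('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]
```

A sentence with a single word simply contributes no pairs, because zipping a one-element list with an empty slice yields nothing.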

Answer 2

Answered by Dan

Rather than turning your text into lists of strings, start with each sentence separately as a string. I've also removed punctuation and stopwords; just remove those portions if they are irrelevant to you:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

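def get_bigrams(myString):
    # Split the sentence into word and punctuation tokens.
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    stemmer = PorterStemmer()
    # Score candidate bigrams with the chi-squared measure and keep at most the 500 best.
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)

    # Append each bigram to the token list as a single "word1 word2" string.
    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)

    # Build the stopword set once instead of reloading the list for every token.
    stop_words = set(stopwords.words('english'))
    # Keep entries longer than 8 characters that are not stopwords; stem and lowercase each word.
    result = [' '.join([stemmer.stem(w).lower() for w in x.split()]) for x in tokens if x.lower() not in stop_words and len(x) > 8]
    return result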

To use it, call it on each sentence:

for line in sentence:
    features = get_bigrams(line)
    # train set here

Note that this goes a little further and actually statistically scores the bigrams (which will come in handy in training the model).

Answer 3

Answered by alfasin

Without nltk:

ans = []
text = ['cant railway station','citadel hotel',' police stn']
for line in text:
    arr = line.split()
    for i in range(len(arr)-1):
        ans.append([[arr[i]], [arr[i+1]]])


print(ans) #prints: [[['cant'], ['railway']], [['railway'], ['station']], [['citadel'], ['hotel']], [['police'], ['stn']]]

Answer 4

Answered by Tanveer Alam

>>> text = ['cant railway station','citadel hotel',' police stn']
>>> bigrams = [(ele, tex.split()[i+1]) for tex in text  for i,ele in enumerate(tex.split()) if i < len(tex.split())-1]
>>> bigrams
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]

This uses the enumerate and split functions.

Answer 5

Answered by Jay Marm

Just fixing Dan's code:

def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    stemmer = PorterStemmer()
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)

    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)

    result = [' '.join([stemmer.stem(w).lower() for w in x.split()]) for x in tokens if x.lower() not in stopwords.words('english') and len(x) > 8]
    return result

Answer 6

Answered by gurinder

from nltk import word_tokenize
from nltk.util import ngrams

text = ['cant railway station', 'citadel hotel', 'police stn']
for line in text:
    # word_tokenize is imported by name, so call it without the nltk. prefix
    token = word_tokenize(line)
    # the 2 selects bigrams; change it to get n-grams of a different size
    bigram = list(ngrams(token, 2))
    print(bigram)

Answer 7

Answered by avi

Read the dataset

import pandas as pd
import nltk as nk  # assumed alias, since the code below calls nk.ngrams

df = pd.read_csv('dataset.csv', skiprows = 6, index_col = "No")

Collect all available months

df["Month"] = df["Date(ET)"].apply(lambda x : x.split('/')[0])

Create tokens of all tweets per month

tokens = df.groupby("Month")["Contents"].sum().apply(lambda x : x.split(' '))

Create bigrams per month

bigrams = tokens.apply(lambda x : list(nk.ngrams(x, 2)))

Count bigrams per month

count_bigrams = bigrams.apply(lambda x : list(x.count(item) for item in x))

Wrap up the result in neat dataframes

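# .iloc selects by position; the Series is indexed by month label, not by integer
month1 = pd.DataFrame(data = count_bigrams.iloc[0], index= bigrams.iloc[0], columns= ["Count"])
month2 = pd.DataFrame(data = count_bigrams.iloc[1], index= bigrams.iloc[1], columns= ["Count"])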
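
Calling x.count(item) for every item rescans the whole list each time, which can get slow for long token lists. As a rough alternative sketch, collections.Counter tallies the bigrams in a single pass. Note that it yields one count per distinct bigram rather than one per occurrence. The sample tokens and the helper name count_bigrams_fast are invented for illustration and are not part of the original dataset:

```python
from collections import Counter


def count_bigrams_fast(tokens):
    """Count each adjacent word pair in a single pass over the tokens."""
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical tokens standing in for one month's tweets.
tokens = "the cat sat on the cat".split(' ')
counts = count_bigrams_fast(tokens)
print(counts.most_common(1))  # [(('the', 'cat'), 2)]
```

The resulting Counter behaves like a dict, so it could be passed to pandas (for example via pd.Series(counts)) if a table is still wanted.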

Answer 8

Answered by saicharan

There are a number of ways to solve it, but I solved it this way:

>>> from nltk import bigrams
>>> text = ['cant railway station','citadel hotel',' police stn']
>>> text2 = [[word for word in line.split()] for line in text]
>>> text2
[['cant', 'railway', 'station'], ['citadel', 'hotel'], ['police', 'stn']]
>>> output = []
>>> for i in range(len(text2)):
...     output = output + list(bigrams(text2[i]))
...
>>> # A list comprehension would also work here
>>> output
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]

Answer 9

Answered by Radio Controlled

I think the best and most general way to do it is the following:

n      = 2
ngrams = []

for l in L:
    for i in range(n,len(l)+1):
        ngrams.append(l[i-n:i])

or in other words:

ngrams = [ l[i-n:i] for l in L for i in range(n,len(l)+1) ]

This should work for any n and any sequence l. If there are no ngrams of length n, it returns the empty list.
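
As a quick check of the comprehension above, here is a small sketch in which L is assumed to be a list of already tokenized sentences (the helper name all_ngrams and the sample data are illustrative only):

```python
def all_ngrams(L, n):
    """Collect every length-n slice from each sequence in L."""
    return [l[i - n:i] for l in L for i in range(n, len(l) + 1)]


# Assumed input: each sentence already split into a list of words.
L = [s.split() for s in ['cant railway station', 'citadel hotel', ' police stn']]

print(all_ngrams(L, 2))
# [['cant', 'railway'], ['railway', 'station'], ['citadel', 'hotel'], ['police', 'stn']]
print(all_ngrams(L, 3))
# [['cant', 'railway', 'station']]
```

Because each n-gram is a slice of its own sentence, the results are lists here and never span two sentences; sentences shorter than n contribute nothing.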
