Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/31387905/

Time: 2020-08-19 09:54:39  Source: igfitidea

Converting plural to singular in a text file with Python

python text stemming plural singular

Asked by theintern

I have txt files that look like this:

word, 23
Words, 2
test, 1
tests, 4

And I want them to look like this:

word, 23
word, 2
test, 1
test, 4

I want to be able to take a txt file in Python and convert plural words to singular. Here's my code:

import nltk

f = raw_input("Please enter a filename: ")

def openfile(f):
    with open(f,'r') as a:
       a = a.read()
       a = a.lower()
       return a

def stem(a):
    p = nltk.PorterStemmer()
    [p.stem(word) for word in a]
    return a

def returnfile(f, a):
    with open(f,'w') as d:
        d = d.write(a)
    #d.close()

print openfile(f)
print stem(openfile(f))
print returnfile(f, stem(openfile(f)))

I have also tried these two definitions instead of the stem definition:

def singular(a):
    for line in a:
        line = line[0]
        line = str(line)
        stemmer = nltk.PorterStemmer()
        line = stemmer.stem(line)
        return line

def stem(a):
    for word in a:
        for suffix in ['s']:
            if word.endswith(suffix):
                return word[:-len(suffix)]
            return word

Afterwards I'd like to take duplicate words (e.g. test and test) and merge them by adding up the numbers next to them. For example:

word, 25
test, 5

I'm not sure how to do that. A solution would be nice but not necessary.

Accepted answer by NBartley

It seems like you're pretty familiar with Python, but I'll still try to explain some of the steps. Let's start with the first question of depluralizing words. When you read in a multiline file (the word, number csv in your case) with a.read(), you're going to be reading the entire body of the file into one big string.

def openfile(f):
    with open(f,'r') as a:
        a = a.read() # a will equal 'soc, 32\nsoc, 1\n...' in your example
        a = a.lower()
        return a

This is fine and all, but when you want to pass the result into stem(), it will be as one big string, and not as a list of words. This means that when you iterate through the input with for word in a, you will be iterating through each individual character of the input string and applying the stemmer to those individual characters.

def stem(a):
    p = nltk.PorterStemmer()
    a = [p.stem(word) for word in a] # ['s', 'o', 'c', ',', ' ', '3', '2', '\n', ...]
    return a

This definitely doesn't work for your purposes, and there are a few different things we can do.

  1. We can change it so that we read the input file as one list of lines
  2. We can use the big string and break it down into a list ourselves.
  3. We can go through and stem each line in the list of lines one at a time.
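
As a rough illustration of option 2, here is a minimal sketch written for Python 3. The sample string stands in for the file contents and is an assumption, not the asker's real data. It shows why iterating over the raw string yields characters, and how splitting it into lines first gives something a stemmer can work with.

```python
# Hypothetical sample standing in for what a.read() would return.
text = "soc, 32\nsocs, 1\ndogs, 8\n"

# Iterating over the string directly yields single characters, not words.
chars = list(text)[:4]

# Option 2: break the big string into a list of lines ourselves.
lines = text.lower().splitlines()

# Each line can then be split into its word and its count.
words = [line.split(',')[0].strip() for line in lines]

print(chars)
print(lines)
print(words)
```

Unlike readlines(), splitlines() drops the trailing newline from each line, so the newlines have to be added back when writing the result out.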

Just for expedience's sake, let's roll with #1. This will require changing openfile(f) to the following:

def openfile(f):
    with open(f,'r') as a:
        a = a.readlines() # a will equal ['soc, 32\n', 'socs, 1\n', ...] in your example
        b = [x.lower() for x in a]
        return b

This should give us b as a list of lines, i.e. ['soc, 32\n', 'socs, 1\n', ...]. Note that each line keeps its trailing newline. So the next problem becomes what to do with the list of strings when we pass it to stem(). One way is the following:

def stem(a):
    p = nltk.PorterStemmer()
    b = []
    for line in a:
        split_line = line.split(',') #break it up so we can get access to the word
        new_line = str(p.stem(split_line[0])) + ',' + split_line[1] #put it back together 
        b.append(new_line) #add it to the new list of lines
    return b

This is definitely a pretty rough solution, but should adequately iterate through all of the lines in your input, and depluralize them. It's rough because splitting strings and reassembling them isn't particularly fast when you scale it up. However, if you're satisfied with that, then all that's left is to iterate through the list of new lines, and write them to your file. In my experience it's usually safer to write to a new file, but this should work fine.

def returnfile(f, a):
    with open(f,'w') as d:
        for line in a:
            d.write(line)


print openfile(f)
print stem(openfile(f))
print returnfile(f, stem(openfile(f)))

When I have the following input.txt

soc, 32
socs, 1
dogs, 8

I get the following stdout:

Please enter a filename: input.txt
['soc, 32\n', 'socs, 1\n', 'dogs, 8\n']
['soc, 32\n', 'soc, 1\n', 'dog, 8\n']
None

And input.txt looks like this:

soc, 32
soc, 1
dog, 8
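
The answer notes that writing to a new file is usually safer than overwriting the input. A minimal sketch of that variant, written for Python 3, might look like the following. The helper names output_name and write_lines and the "_singular" suffix are made up for illustration and do not come from the original answer.

```python
import os

def output_name(path, suffix="_singular"):
    # Hypothetical naming scheme: 'input.txt' -> 'input_singular.txt'.
    root, ext = os.path.splitext(path)
    return root + suffix + ext

def write_lines(path, lines):
    # Each line is expected to carry its own trailing newline,
    # which is how readlines() returns them.
    with open(path, "w") as out:
        out.writelines(lines)

print(output_name("input.txt"))  # input_singular.txt
```

Calling write_lines(output_name(f), stem(openfile(f))) would then leave the original file untouched, so a bad stemming pass cannot destroy the input data.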


The second question, about merging the counts for identical words, changes our solution from above. As suggested in the comments, you should look at using a dictionary to solve this. Instead of handling everything as one big list, the better (and probably more Pythonic) way is to iterate through each line of your input and stem it as you process it. I'll write up code for this in a bit, if you're still working to figure it out.
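
One possible sketch of the dictionary approach described above, written for Python 3. This is not the answerer's code. The function names merge_counts and naive_singular are invented here, and naive_singular is only a placeholder that strips one trailing "s"; a real stemmer or singularizer should be passed in instead.

```python
from collections import defaultdict

def naive_singular(word):
    # Placeholder only (an assumption for this sketch): strips one trailing 's'.
    return word[:-1] if word.endswith("s") and len(word) > 1 else word

def merge_counts(lines, singularize=naive_singular):
    # Map each singular word to the running total of its counts.
    totals = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, count = line.rsplit(",", 1)
        totals[singularize(word.strip().lower())] += int(count)
    return dict(totals)

# Sample data taken from the question.
lines = ["word, 23\n", "Words, 2\n", "test, 1\n", "tests, 4\n"]
merged = merge_counts(lines)
for word, total in merged.items():
    print("{}, {}".format(word, total))
```

With the question's sample data this yields word, 25 and test, 5. To use the stemmer from the earlier code, something like merge_counts(lines, singularize=nltk.PorterStemmer().stem) should work, assuming nltk is installed.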

Answer by Albyorix

If you have complex words to singularize, I'd advise against stemming; use a proper Python package such as pattern instead:

from pattern.text.en import singularize

plurals = ['caresses', 'flies', 'dies', 'mules', 'geese', 'mice', 'bars', 'foos',
           'families', 'dogs', 'child', 'wolves']

singles = [singularize(plural) for plural in plurals]
print singles

returns:

['caress', 'fly', 'dy', 'mule', 'goose', 'mouse', 'bar', 'foo', 'family', 'dog', 'child', 'wolf']

It's not perfect, but it's the best I've found (96% according to the docs): http://www.clips.ua.ac.be/pages/pattern-en#pluralization

Answer by Vadym Pasko

The Nodebox English Linguistics library contains scripts for converting plural forms to singular forms and vice versa. Check out the tutorial: https://www.nodebox.net/code/index.php/Linguistics#pluralization

To convert plural to singular, just import the singular module and use the singular() function. It handles proper conversions for words with different endings, irregular forms, etc.

from en import singular
print(singular('analyses'))    # analysis
print(singular('planetoids'))  # planetoid
print(singular('children'))    # child