Disclaimer: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/19790188/

Date: 2020-08-19 14:37:51  Source: igfitidea

Expanding English language contractions in Python

python nlp text-processing

Asked by Maarten

The English language has a couple of contractions. For instance:

you've -> you have
he's -> he is

These can sometimes cause headaches when you are doing natural language processing. Is there a Python library that can expand these contractions?

Answer by alko

You don't need a library; this can be done with a regular expression, for example.

>>> import re
>>> contractions_dict = {
...     'didn\'t': 'did not',
...     'don\'t': 'do not',
... }
>>> contractions_re = re.compile('(%s)' % '|'.join(contractions_dict.keys()))
>>> def expand_contractions(s, contractions_dict=contractions_dict):
...     def replace(match):
...         return contractions_dict[match.group(0)]
...     return contractions_re.sub(replace, s)
...
>>> expand_contractions('You don\'t need a library')
'You do not need a library'
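
The pattern above matches the dictionary keys literally and case-sensitively, so it would miss "Don't" and could match inside a longer word. Below is a minimal sketch of one way to tighten it. It assumes a small illustrative table whose keys are lowercase and begin and end with a letter; the names CONTRACTIONS, build_pattern and expand are hypothetical and not part of the answer.

```python
import re

# Small illustrative table; a real one would hold many more entries.
# Assumption: keys are lowercase and start and end with a letter.
CONTRACTIONS = {
    "didn't": "did not",
    "don't": "do not",
    "can't": "cannot",
}

def build_pattern(table):
    # Escape each key and try longer keys first so that overlapping
    # entries (e.g. "can't've" vs "can't") resolve to the longest match.
    keys = sorted(table, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in keys)
    # \b keeps the match from starting or ending inside a longer word.
    return re.compile(r"\b(%s)\b" % alternation, re.IGNORECASE)

def expand(text, table=CONTRACTIONS):
    pattern = build_pattern(table)

    def replace(match):
        word = match.group(0)
        expanded = table[word.lower()]
        # Preserve a leading capital, e.g. "Don't" -> "Do not".
        if word[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return pattern.sub(replace, text)

print(expand("Don't worry, you don't need a library"))
```

Sorting the keys by length only matters once the table contains entries that are prefixes of one another, but it is cheap and avoids surprises as the table grows.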

Answer by Hyman_of_All_Trades

I would like to add a little to alko's answer here. If you check Wikipedia, the number of English language contractions mentioned there is less than 100. Granted, in real scenarios this number could be higher. Still, I am pretty sure that 200-300 words are all you will have for English contractions. Now, do you want a separate library for those? (I don't think what you are looking for actually exists, though.) You can easily solve this problem with a dictionary and a regex. I would recommend using a nice tokenizer such as Natural Language Toolkit, and the rest you should have no problem implementing yourself.
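
The dictionary-plus-tokenizer idea might look like the following sketch. To stay dependency-free it uses a simple regular-expression tokenizer as a stand-in for an NLTK tokenizer, which is an assumption of this example; the table and the names tokenize and expand_tokens are hypothetical as well.

```python
import re

# Tiny illustrative table; keys are assumed to be lowercase.
CONTRACTIONS = {
    "you've": "you have",
    "don't": "do not",
    "i'm": "I am",
}

# Stand-in tokenizer: words (optionally with inner apostrophes), or any
# single non-space character such as punctuation.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*|\S")

def tokenize(text):
    return TOKEN_RE.findall(text)

def expand_tokens(tokens, table=CONTRACTIONS):
    expanded = []
    for token in tokens:
        replacement = table.get(token.lower())
        if replacement is None:
            expanded.append(token)
        else:
            # An expansion may be several words, so split it back into tokens.
            expanded.extend(replacement.split())
    return expanded

tokens = tokenize("I'm sure you've seen this, don't worry.")
print(expand_tokens(tokens))
```

Working on tokens rather than raw text means a lookup can never fire in the middle of a word, at the cost of having to join the tokens back into a sentence afterwards.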

Answer by arturomp

I made that Wikipedia contraction-to-expansion page into a Python dictionary (see below).

Note, as you might expect, that you definitely want to use double quotes when querying the dictionary, since the keys themselves contain apostrophes.

Also, I've left multiple options in, as in the Wikipedia page. Feel free to modify it as you wish. Note that disambiguation to the right expansion would be a tricky problem!

contractions = { 
"ain't": "am not / are not / is not / has not / have not",
"aren't": "are not / am not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he had / he would",
"he'd've": "he would have",
"he'll": "he shall / he will",
"he'll've": "he shall have / he will have",
"he's": "he has / he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how has / how is / how does",
"I'd": "I had / I would",
"I'd've": "I would have",
"I'll": "I shall / I will",
"I'll've": "I shall have / I will have",
"I'm": "I am",
"I've": "I have",
"isn't": "is not",
"it'd": "it had / it would",
"it'd've": "it would have",
"it'll": "it shall / it will",
"it'll've": "it shall have / it will have",
"it's": "it has / it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she had / she would",
"she'd've": "she would have",
"she'll": "she shall / she will",
"she'll've": "she shall have / she will have",
"she's": "she has / she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as / so is",
"that'd": "that would / that had",
"that'd've": "that would have",
"that's": "that has / that is",
"there'd": "there had / there would",
"there'd've": "there would have",
"there's": "there has / there is",
"they'd": "they had / they would",
"they'd've": "they would have",
"they'll": "they shall / they will",
"they'll've": "they shall have / they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we had / we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what shall / what will",
"what'll've": "what shall have / what will have",
"what're": "what are",
"what's": "what has / what is",
"what've": "what have",
"when's": "when has / when is",
"when've": "when have",
"where'd": "where did",
"where's": "where has / where is",
"where've": "where have",
"who'll": "who shall / who will",
"who'll've": "who shall have / who will have",
"who's": "who has / who is",
"who've": "who have",
"why's": "why has / why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you had / you would",
"you'd've": "you would have",
"you'll": "you shall / you will",
"you'll've": "you shall have / you will have",
"you're": "you are",
"you've": "you have"
}
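
Because several entries list alternatives separated by " / ", a lookup usually wants a list of candidate expansions rather than the raw string. Here is a minimal sketch, assuming a short excerpt of the dictionary above; the names contractions_excerpt and expansion_options are hypothetical.

```python
# Short excerpt of the dictionary above, for illustration only.
contractions_excerpt = {
    "can't": "cannot",
    "he's": "he has / he is",
    "I'd": "I had / I would",
}

def expansion_options(word, table=contractions_excerpt):
    # Try the word as written first (keys such as "I'd" are capitalised),
    # then fall back to its lowercase form (e.g. "He's" -> "he's").
    entry = table.get(word)
    if entry is None:
        entry = table.get(word.lower())
    if entry is None:
        return []
    return [option.strip() for option in entry.split("/")]

# Double quotes avoid having to escape the apostrophe inside the key.
print(expansion_options("he's"))
print(expansion_options("I'd"))
print(expansion_options("hello"))
```

Choosing among the returned candidates is the disambiguation problem mentioned above; this helper only exposes the options.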

Answer by Yannick

Even though this is an old question, I figured I might as well answer since there is still no real solution to this as far as I can see.

I have had to work on this in a related NLP project and I decided to tackle the problem since there didn't seem to be anything here. You can check my expander GitHub repository if you are interested.

It's a fairly badly optimized (I think) program based on NLTK, the Stanford Core NLP models, which you will have to download separately, and the dictionary in the previous answer. All the necessary information should be in the README and the lavishly commented code. I know commented code is dead code, but this is just how I write to keep things clear for myself.

The example inputs in expander.py are the following sentences:

    ["I won't let you get away with that",  # won't ->  will not
    "I'm a bad person",  # 'm -> am
    "It's his cat anyway",  # 's -> is
    "It's not what you think",  # 's -> is
    "It's a man's world",  # 's -> is and 's possessive
    "Catherine's been thinking about it",  # 's -> has
    "It'll be done",  # 'll -> will
    "Who'd've thought!",  # 'd -> would, 've -> have
    "She said she'd go.",  # she'd -> she would
    "She said she'd gone.",  # she'd -> had
    "Y'all'd've a great time, wouldn't it be so cold!", # Y'all'd've -> You all would have, wouldn't -> would not
    " My name is Hyman.",   # No replacements.
    "'Tis questionable whether Ma'am should be going.", # 'Tis -> it is, Ma'am -> madam
    "As history tells, 'twas the night before Christmas.", # 'Twas -> It was
    "Martha, Peter and Christine've been indulging in a menage-à-trois."] # 've -> have

To which the output is:

    ["I will not let you get away with that",
    "I am a bad person",
    "It is his cat anyway",
    "It is not what you think",
    "It is a man's world",
    "Catherine has been thinking about it",
    "It will be done",
    "Who would have thought!",
    "She said she would go.",
    "She said she had gone.",
    "You all would have a great time, would not it be so cold!",
    "My name is Hyman.",
    "It is questionable whether Madam should be going.",
    "As history tells, it was the night before Christmas.",
    "Martha, Peter and Christine have been indulging in a menage-à-trois."]

So for this small set of test sentences, which I came up with to test some edge cases, it works well.

Since this project has lost importance right now, I am not actively developing it anymore. Any help on this project would be appreciated. Things to be done are written in the TODO list. Or if you have any tips on how to improve my Python, I would also be very thankful.
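
The "she'd go" / "she'd gone" pair in the test sentences shows why a plain lookup is not enough: the right expansion depends on the verb form that follows. As a rough illustration only, and not a description of how the expander project works, a toy heuristic could pick "had" when the next word looks like a past participle. The participle list and the names looks_like_participle and expand_d are assumptions made for this sketch.

```python
import re

# Tiny hand-picked list of irregular past participles; a real system
# would rely on a part-of-speech tagger instead.
PAST_PARTICIPLES = {"gone", "been", "done", "seen", "taken", "known"}

def looks_like_participle(word):
    word = word.lower()
    # Crude guess: listed irregular forms, or a regular "-ed" ending.
    return word in PAST_PARTICIPLES or word.endswith("ed")

def expand_d(sentence):
    def replace(match):
        subject, next_word = match.group(1), match.group(2)
        aux = "had" if looks_like_participle(next_word) else "would"
        return "%s %s %s" % (subject, aux, next_word)

    # Subject + 'd + the word that follows it.
    return re.sub(r"\b(\w+)'d\s+(\w+)", replace, sentence)

print(expand_d("She said she'd go."))
print(expand_d("She said she'd gone."))
```

The "-ed" guess misfires on words such as "need", which is one reason a tagger-based approach is the more robust route.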

Answer by Yann Dubois

The answers above will work perfectly well and could be better for ambiguous contractions (although I would argue that there aren't that many ambiguous cases). I would use something more readable and easier to maintain:

import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase


test = "Hey I'm Yann, how're you and how's it going ? That's interesting: I'd love to hear more about it."
print(decontracted(test))
# Hey I am Yann, how are you and how is it going ? That is interesting: I would love to hear more about it.

It might have some flaws I didn't think about, though.

Reposted from my other answer.
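
One flaw of suffix rules like these is that they cannot tell a possessive 's from a contracted "is", or "had" from "would". The sketch below restates the same substitutions as a table (a restructuring assumed here for brevity, under the hypothetical names RULES and decontract) and shows both cases.

```python
import re

# The same substitutions as above, expressed as (pattern, replacement) pairs.
# Order matters: the specific rules must run before the general ones.
RULES = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'s", " is"),
    (r"'d", " would"),
    (r"'ll", " will"),
    (r"'t", " not"),
    (r"'ve", " have"),
    (r"'m", " am"),
]

def decontract(phrase):
    for pattern, replacement in RULES:
        phrase = re.sub(pattern, replacement, phrase)
    return phrase

# The possessive in "man's" is rewritten as if it were "man is".
print(decontract("It's a man's world"))    # It is a man is world
# "she'd gone" should become "she had gone", but 'd always maps to "would".
print(decontract("She said she'd gone."))  # She said she would gone.
```

Whether these misfires matter depends on the downstream task; for bag-of-words style preprocessing they are often tolerable.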

Answer by Joe9008

This is a very cool and easy-to-use library for the purpose: https://pypi.python.org/pypi/pycontractions/1.0.1

Example of use (detailed in link):

from pycontractions import Contractions

# Load your favorite word2vec model
cont = Contractions('GoogleNews-vectors-negative300.bin')

# optional, prevents loading on first expand_texts call
cont.load_models()

out = list(cont.expand_texts(["I'd like to know how I'd done that!",
                            "We're going to the zoo and I don't think I'll be home for dinner.",
                            "Theyre going to the zoo and she'll be home for dinner."], precise=True))
print(out)

You will also need GoogleNews-vectors-negative300.bin; a download link is given on the pycontractions page linked above. The example code is for Python 3.

Answer by Hammad Hassan

I have found a library for this, contractions. It's very simple.

import contractions
print(contractions.fix("you've"))
print(contractions.fix("he's"))

Output:

you have
he is