Python: Untokenize a sentence

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/21948019/


Python Untokenize a sentence

Tags: python, python-2.7, nltk

Asked by Brana

There are so many guides on how to tokenize a sentence, but I didn't find any on how to do the opposite.

import nltk
words = nltk.word_tokenize("I've found a medicine for my disease.")

The result I get is: ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']

Is there any function that reverts the tokenized sentence to the original state? The function tokenize.untokenize() for some reason doesn't work.

Edit:

I know that I can do, for example, the following, and this probably solves the problem, but I am curious whether there is an integrated function for this:

result = ' '.join(sentence).replace(' , ',',').replace(' .','.').replace(' !','!')
result = result.replace(' ?','?').replace(' : ',': ').replace(' \'', '\'')   

Accepted answer by alecxe

You can use "treebank detokenizer" - TreebankWordDetokenizer:

from nltk.tokenize.treebank import TreebankWordDetokenizer
TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
# 'the quick brown'
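
Applied to the tokens from the question, this should also rejoin the contraction and the final period (a quick check; exact spacing rules can vary slightly across nltk versions):

from nltk.tokenize.treebank import TreebankWordDetokenizer

tokens = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
TreebankWordDetokenizer().detokenize(tokens)
# expected: "I've found a medicine for my disease."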


There is also MosesDetokenizer, which was in nltk but got removed because of licensing issues; it is available as a standalone Sacremoses package.
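
A minimal sketch using the standalone package (assuming sacremoses is installed, e.g. via pip install sacremoses):

from sacremoses import MosesDetokenizer

md = MosesDetokenizer(lang='en')
md.detokenize(['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.'])
# roughly: "I've found a medicine for my disease."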

Answered by shaktimaan

Use the join function:

You could just do a ' '.join(words) to get back the original string.
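
For example (note that a plain join leaves spaces around the punctuation, so it does not exactly reproduce the original sentence):

words = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
' '.join(words)
# "I 've found a medicine for my disease ."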

Answered by dparpyani

The reason tokenize.untokenize does not work is that it needs more information than just the words. Here is an example program using tokenize.untokenize:

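# Python 2 example (the question is tagged python-2.7); on Python 3 you would use io.StringIO and print() instead.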
from StringIO import StringIO
import tokenize

sentence = "I've found a medicine for my disease.\n"
tokens = tokenize.generate_tokens(StringIO(sentence).readline)
print tokenize.untokenize(tokens)


Additional Help: Tokenize - Python Docs | Potential Problem


Answered by alvas

To reverse word_tokenize from nltk, I suggest looking at http://www.nltk.org/_modules/nltk/tokenize/punkt.html#PunktLanguageVars.word_tokenize and doing some reverse engineering.

Short of doing crazy hacks on nltk, you can try this:

>>> import nltk
>>> import string
>>> nltk.word_tokenize("I've found a medicine for my disease.")
['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
>>> tokens = nltk.word_tokenize("I've found a medicine for my disease.")
>>> "".join([" "+i if not i.startswith("'") and i not in string.punctuation else i for i in tokens]).strip()
"I've found a medicine for my disease."

Answered by Renklauf

Use token_utils.untokenize from here:

import re
def untokenize(words):
    """
    Untokenizing a text undoes the tokenizing operation, restoring
    punctuation and spaces to the places that people expect them to be.
    Ideally, `untokenize(tokenize(text))` should be identical to `text`,
    except for line breaks.
    """
    text = ' '.join(words)
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .',  '...')
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
    step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
         "can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()

tokenized = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
untokenize(tokenized)
# "I've found a medicine for my disease."

Answered by alemol

I propose to keep offsets in tokenization: (token, offset). I think this information is useful for processing the original sentence.

import re
from nltk.tokenize import word_tokenize

def offset_tokenize(text):
    tail = text
    accum = 0
    tokens = word_tokenize(text)
    info_tokens = []
    for tok in tokens:
        scaped_tok = re.escape(tok)
        m = re.search(scaped_tok, tail)
        start, end = m.span()
        # global offsets
        gs = accum + start
        ge = accum + end
        accum += end
        # keep searching in the rest
        tail = tail[end:]
        info_tokens.append((tok, (gs, ge)))
    return info_tokens

sent = '''I've found a medicine for my disease.

This is line:3.'''

toks_offsets = offset_tokenize(sent)

for t in toks_offsets:
    (tok, offset) = t
    print (tok == sent[offset[0]:offset[1]]), tok, sent[offset[0]:offset[1]]

Gives:

True I I
True 've 've
True found found
True a a
True medicine medicine
True for for
True my my
True disease disease
True . .
True This This
True is is
True line:3 line:3
True . .

Answered by Asad

I am using the following code, without any major library function, for detokenization purposes. I am using detokenization for some specific tokens.

_SPLITTER_ = r"([-.,/:!?\";)(])"

def basic_detokenizer(sentence):
""" This is the basic detokenizer helps us to resolves the issues we created by  our tokenizer"""
detokenize_sentence =[]
words = sentence.split(' ')
pos = 0
while( pos < len(words)):
    if words[pos] in '-/.' and pos > 0 and pos < len(words) - 1:
        left = detokenize_sentence.pop()
        detokenize_sentence.append(left +''.join(words[pos:pos + 2]))
        pos +=1
    elif  words[pos] in '[(' and pos < len(words) - 1:
        detokenize_sentence.append(''.join(words[pos:pos + 2]))   
        pos +=1        
    elif  words[pos] in ']).,:!?;' and pos > 0:
        left  = detokenize_sentence.pop()
        detokenize_sentence.append(left + ''.join(words[pos:pos + 1]))            
    else:
        detokenize_sentence.append(words[pos])
    pos +=1
return ' '.join(detokenize_sentence)
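
A quick check of the helper above on the question's sentence (a hypothetical whitespace-separated token string; punctuation and brackets are reattached, but contractions such as 've are not):

print(basic_detokenizer("I 've found a medicine for my disease ."))
# I 've found a medicine for my disease.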

Answered by Sathyanarayanan Kulasekaran

For me, it worked when I installed nltk 3.2.5:

pip install -U nltk

then,

import nltk
nltk.download('perluniprops')

from nltk.tokenize.moses import MosesDetokenizer

If you are using it inside a pandas DataFrame, then:

detokenizer = MosesDetokenizer()  # instantiate the detokenizer once
df['detoken'] = df['token_column'].apply(lambda x: detokenizer.detokenize(x, return_str=True))

Answered by gss

The reason there is no simple answer is that you actually need the span locations of the original tokens in the string. If you don't have that, and you aren't reverse engineering your original tokenization, your reassembled string is based on guesses about the tokenization rules that were used. If your tokenizer didn't give you spans, you can still do this if you have three things:

1) The original string

2) The original tokens

3) The modified tokens (I'm assuming you have changed the tokens in some way, because that is the only application for this I can think of if you already have #1)

Use the original token set to identify spans (wouldn't it be nice if the tokenizer did that?) and modify the string from back to front so the spans don't change as you go.

Here I'm using TweetTokenizer, but it shouldn't matter as long as the tokenizer you use doesn't change the values of your tokens so that they no longer appear in the original string.

import nltk

tokenizer=nltk.tokenize.casual.TweetTokenizer()
string="One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin."
tokens=tokenizer.tokenize(string)
replacement_tokens=list(tokens)
replacement_tokens[-3]="cute"

def detokenize(string,tokens,replacement_tokens):
    spans=[]
    cursor=0
    for token in tokens:
        while not string[cursor:cursor+len(token)]==token and cursor<len(string):
            cursor+=1        
        if cursor==len(string):break
        newcursor=cursor+len(token)
        spans.append((cursor,newcursor))
        cursor=newcursor
    i=len(tokens)-1
    for start,end in spans[::-1]:
        string=string[:start]+replacement_tokens[i]+string[end:]
        i-=1
    return string

>>> detokenize(string,tokens,replacement_tokens)
'One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a cute vermin.'

Answered by Uri

from nltk.tokenize.treebank import TreebankWordDetokenizer
TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
# 'the quick brown'