How to tokenize natural English text in an input file in Python?

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, cite the original URL, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/12703842/

Posted: 2020-08-18 11:39:16  Source: igfitidea

How to tokenize natural English text in an input file in python?

python nltk

Asked by Target

I want to tokenize an input file in Python. Please suggest how to do this; I am a new user of Python.

I have read a little about regular expressions, but I am still somewhat confused, so please suggest a link or a code overview for this.
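
For the regular-expression route mentioned in the question, here is a minimal sketch using Python's built-in re module; the pattern is only an illustrative choice that keeps words and punctuation marks as separate tokens:

import re

text = "Hello, world! This is a test."
# \w+ matches runs of word characters, [^\w\s] matches single punctuation marks
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)  # ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.']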

Answered by del

Try something like this:


import nltk

# read the whole file and split it into a flat list of word tokens
file_content = open("myfile.txt").read()
tokens = nltk.word_tokenize(file_content)
print(tokens)
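
Depending on your NLTK installation, word_tokenize may first require the Punkt tokenizer models; a one-time setup sketch:

import nltk
nltk.download('punkt')  # downloads the Punkt tokenizer models used by word_tokenize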

The NLTK tutorial is also full of easy-to-follow examples: http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html


Answered by alvas

Using NLTK


If your file is small:


  • Open the file with the context manager with open(...) as x,
  • then do a .read() and tokenize it with word_tokenize().

[code]:


from nltk.tokenize import word_tokenize

# read the whole file at once and tokenize it in a single call
with open('myfile.txt') as fin:
    tokens = word_tokenize(fin.read())

If your file is larger:


  • Open the file with the context manager with open(...) as x,
  • read the file line by line with a for-loop,
  • tokenize each line with word_tokenize(),
  • output to your desired format (with the output file opened in write mode).

[code]:


from __future__ import print_function
from nltk.tokenize import word_tokenize

# stream the input line by line and write one line of space-separated tokens per input line
with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        tokens = word_tokenize(line)
        print(' '.join(tokens), end='\n', file=fout)


Using SpaCy


from __future__ import print_function
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

nlp = English()
tokenizer = Tokenizer(nlp.vocab)

# open the output file in write mode; calling the tokenizer on a string returns a Doc of Token objects
with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        tokens = [token.text for token in tokenizer(line)]
        print(' '.join(tokens), end='\n', file=fout)
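
A shorter alternative sketch, assuming the same myfile.txt and tokens.txt files: the blank English() pipeline already bundles spaCy's rule-based tokenizer, so each line can be passed to nlp directly:

from spacy.lang.en import English

# a blank English pipeline performs tokenization only
nlp = English()

with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        # calling nlp on a string returns a Doc; each item is a Token
        print(' '.join(token.text for token in nlp(line)), file=fout)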

Answered by svp

with open ("file.txt", "r") as f1:
         data=str(f1.readlines())
         sent_tokenize(data)
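
sent_tokenize only splits the text into sentences; if word-level tokens are needed as well, each sentence can then be passed to word_tokenize. A minimal sketch, assuming the same file.txt:

from nltk.tokenize import sent_tokenize, word_tokenize

with open("file.txt", "r") as f1:
    data = f1.read()

# tokenize each sentence into words, yielding a list of token lists
tokens_per_sentence = [word_tokenize(sentence) for sentence in sent_tokenize(data)]
print(tokens_per_sentence)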