Original source: http://stackoverflow.com/questions/12703842/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to tokenize natural English text in an input file in python?
Asked by Target
I want to tokenize an input file in Python. Please give me some suggestions; I am a new user of Python.
I have read something about regular expressions, but I am still somewhat confused, so please suggest any link or code overview for the same.
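For reference, a tokenizer can be sketched with nothing but the standard library's re module. This is only a rough sketch (myfile.txt is a placeholder name); the answers below use proper tokenizers that handle punctuation and contractions much better.

import re

with open("myfile.txt") as f:
    text = f.read()

# \w+ matches runs of letters/digits/underscores; the optional (?:'\w+)
# keeps contractions like "don't" together. Crude compared to a real tokenizer.
tokens = re.findall(r"\w+(?:'\w+)?", text)
print(tokens)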
Answered by del
Try something like this:
import nltk

# Read the whole file and split it into a list of word tokens.
with open("myfile.txt") as f:
    file_content = f.read()
tokens = nltk.word_tokenize(file_content)
print(tokens)
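Note: word_tokenize depends on NLTK's Punkt model data; if it is not installed yet, a one-time download is needed first:

import nltk
nltk.download('punkt')  # one-time download of the Punkt tokenizer models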
The NLTK tutorial is also full of easy-to-follow examples: http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html
Answered by alvas
Using NLTK
If your file is small:
- Open the file with the context manager with open(...) as x,
- then do a .read() and tokenize it with word_tokenize().

[code]:
from nltk.tokenize import word_tokenize

with open('myfile.txt') as fin:
    tokens = word_tokenize(fin.read())
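As a quick illustration of the result, word_tokenize returns a plain list of strings, with punctuation and contractions split into their own tokens:

from nltk.tokenize import word_tokenize

print(word_tokenize("Don't hesitate to ask questions."))
# ['Do', "n't", 'hesitate', 'to', 'ask', 'questions', '.']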
If your file is larger:
- Open the file with the context manager with open(...) as x,
- read the file line by line with a for-loop,
- tokenize each line with word_tokenize(),
- output to your desired format (with the write flag set).

[code]:
from __future__ import print_function
from nltk.tokenize import word_tokenize

with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        tokens = word_tokenize(line)
        print(' '.join(tokens), end='\n', file=fout)
Using SpaCy
from __future__ import print_function
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

nlp = English()
tokenizer = Tokenizer(nlp.vocab)

with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        # Calling the tokenizer on a string returns a Doc of Token objects.
        tokens = tokenizer(line)
        print(' '.join(token.text for token in tokens), end='\n', file=fout)
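In more recent spaCy releases (v2/v3), a blank pipeline gives the same tokenizer-only behavior without constructing a Tokenizer by hand. A minimal sketch, assuming spaCy is installed (no trained model is required for a blank pipeline):

import spacy

nlp = spacy.blank("en")  # blank English pipeline: tokenization only

with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        doc = nlp(line)  # calling the pipeline on text returns a Doc of tokens
        print(' '.join(token.text for token in doc), file=fout)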
Answered by svp
with open ("file.txt", "r") as f1:
data=str(f1.readlines())
sent_tokenize(data)


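Note that sent_tokenize splits the text into sentences, not words. To get word tokens as well, the two tokenizers can be combined; a small sketch along the same lines:

from nltk.tokenize import sent_tokenize, word_tokenize

with open("file.txt", "r") as f1:
    data = f1.read()

# Split into sentences first, then tokenize each sentence into words.
for sentence in sent_tokenize(data):
    print(word_tokenize(sentence))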