Read .doc file with Python
Original question: http://stackoverflow.com/questions/36001482/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Italo Lemos
I have a test for a job application, and my task is to read some .doc files. Does anyone know a library that can do this? I started with plain Python code:
f = open('test.doc', 'r')
f.read()
but this does not return a readable string. I need to convert it to UTF-8.
Edit: I just want to get the text from this file.
Answered by Shivam Kotwalia
One can use the textract library. It takes care of both "doc" and "docx" files.
import textract
text = textract.process("path/to/file.extension")
You can even use 'antiword' (sudo apt-get install antiword): first convert the doc into docx, and then read it through docx2txt.
antiword filename.doc > filename.docx
Ultimately, textract uses antiword in the backend.
Answered by Billal Begueradj
You can use the python-docx2txt library to read text from Microsoft Word documents. It is an improvement over the python-docx library because it can also extract text from links, headers and footers. It can even extract images.
You can install it by running: pip install docx2txt
Let's download and read the first Microsoft document here:
import docx2txt
my_text = docx2txt.process("test.docx")
print(my_text)
Here is a screenshot of the terminal output of the above code:
EDIT:
This does NOT work for .doc files. The only reason I am keeping this answer is that some people seem to find it useful for .docx files.
Answered by 10SecTom
I was trying to do the same. I found lots of information on reading .docx but much less on .doc. Anyway, I managed to read the text using the following:
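import os
import win32com.client

# Requires Windows with Microsoft Word installed (pywin32 package).
word = win32com.client.Dispatch("Word.Application")
word.Visible = False
# Word resolves relative paths against its own working directory, so pass an absolute path.
doc = word.Documents.Open(os.path.abspath("myfile.doc"))
print(doc.Range().Text)
doc.Close(False)  # close without saving changes
word.Quit()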
Answered by Aslam Shaik
Prerequisites:
install antiword: sudo apt-get install antiword
install docx: pip install docx
from subprocess import Popen, PIPE
# Legacy 'docx' package (see the notice below); not used in the .doc path shown here.
from docx import opendocx, getdocumenttext

def document_to_text(filename, file_path):
    # Run antiword on the .doc file and capture its plain-text output.
    cmd = ['antiword', file_path]
    p = Popen(cmd, stdout=PIPE)
    stdout, stderr = p.communicate()
    # Drop any bytes that are not ASCII.
    return stdout.decode('ascii', 'ignore')

print(document_to_text('your_file_name', 'your_file_path'))
Notice: new versions of python-docx removed the opendocx and getdocumenttext functions. Make sure to pip install docx and not the newer python-docx.
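On Python 3 the same antiword call can be written with subprocess.run. This is only a minimal sketch: it assumes antiword is on the PATH and that its output decodes as UTF-8 (which may depend on your locale). The names doc_to_text and tool are hypothetical, chosen for illustration.

```python
import shutil
import subprocess


def doc_to_text(path, tool="antiword"):
    """Return the text of a .doc file via an external converter.

    Returns None when the converter executable cannot be found.
    """
    # shutil.which looks the executable up on PATH, like the shell does.
    if shutil.which(tool) is None:
        return None
    # check=True raises CalledProcessError if the tool exits with an error,
    # e.g. when the file is not a real Word document.
    result = subprocess.run([tool, path], capture_output=True, check=True)
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling doc_to_text("your_file_path") returns a str rather than bytes, so it can be searched or printed directly.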
Answered by Rahul Nimbal
I agree with Shivam's answer, except that textract doesn't exist for Windows. Also, for some reason, antiword fails to read the '.doc' files and gives an error:
'filename.doc' is not a word document. # This happens when the file wasn't generated via MS Office. E.g., web pages may be stored offline in .doc format.
So, I've got the following workaround to extract the text:
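from bs4 import BeautifulSoup as bs

filename = 'filename.doc'  # path to the HTML-based .doc file
with open(filename) as f:
    soup = bs(f.read(), 'html.parser')
# Drop <style> and <script> elements so only visible text remains.
for s in soup(['style', 'script']):
    s.extract()
tmpText = soup.get_text()
# Remove tabs and newlines, then trim surrounding whitespace.
text = tmpText.replace('\t', '').replace('\n', '').strip()
print(text)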
This script will work with most kinds of files. Have fun!
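The "is not a word document" error above suggests checking what a file really contains before choosing a tool. Here is a rough sketch that looks only at the first bytes: binary .doc files are OLE2 compound files beginning with D0 CF 11 E0 A1 B1 1A E1, .docx files are ZIP archives beginning with PK\x03\x04, and web pages saved as .doc usually contain an <html tag near the start. The name sniff_format and its return labels are hypothetical. Note that other formats also use the OLE2 and ZIP containers, so the result is a hint and not proof.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file header
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP local file header


def sniff_format(path):
    """Guess the container format of a file from its first bytes."""
    with open(path, "rb") as f:
        head = f.read(512)
    if head.startswith(OLE2_MAGIC):
        return "ole2"   # likely a binary .doc: try antiword or textract
    if head.startswith(ZIP_MAGIC):
        return "zip"    # likely a .docx: try docx2txt
    if b"<html" in head.lower():
        return "html"   # a saved web page: try BeautifulSoup
    return "unknown"
```

With this, a script can route each file to antiword, docx2txt or BeautifulSoup instead of trusting the .doc extension.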
Answered by lucas F
The answer from Shivam Kotwalia works perfectly. However, the object is returned as a bytes type. Sometimes you may need it as a string, for example to run a regular expression on it.
I recommend the following code (two lines from Shivam Kotwalia's answer):
import textract
text = textract.process("path/to/file.extension")
text = text.decode("utf-8")
The last line will convert the object text to a string.
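To illustrate why the decode step matters, here is a small sketch that takes bytes, like those returned by textract.process, and runs a regular expression on the decoded text. The helper name find_dates and the date pattern are hypothetical, and the input is assumed to be UTF-8.

```python
import re


def find_dates(raw):
    """Decode extracted bytes and return all YYYY-MM-DD dates found."""
    # A str pattern cannot be applied to bytes, so decode first.
    text = raw.decode("utf-8", errors="replace")
    return re.findall(r"\d{4}-\d{2}-\d{2}", text)


sample = "Due 2016-03-15, café closes 2016-04-01".encode("utf-8")
print(find_dates(sample))
```

The same function could be called as find_dates(textract.process("path/to/file.extension")).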