Read a .doc file with Python

Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/36001482/

Time: 2020-08-19 17:15:28  Source: igfitidea

Read .doc file with python

python, python-2.7

Asked by Italo Lemos

I have a test for a job application in which I need to read some .doc files. Does anyone know a library that can do this? I started with plain Python code:

f = open('test.doc', 'r')
f.read()

But this does not return a readable string. I need to convert it to UTF-8.

Edit: I just want to get the text from this file.
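The raw read() above returns unreadable data because a legacy .doc file is a binary container, not encoded text, so no text codec will turn it into a clean string. As a minimal sketch (the helper name sniff_container is hypothetical and not part of the original thread), the first bytes of a file show which container it really is: legacy Office files start with the OLE2 compound-file signature, while .docx files are ZIP archives.

```python
# Hypothetical helper, not from the original thread: guess the container
# format of a file from its leading bytes.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 compound file (legacy .doc, also .xls/.ppt)
ZIP_MAGIC = b"PK\x03\x04"                          # ZIP archive (.docx, but also any other ZIP)


def sniff_container(path):
    """Return 'ole2', 'zip', or 'other' based on the file signature."""
    with open(path, "rb") as f:  # binary mode: no text decoding is attempted
        head = f.read(8)
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return "other"
```

A result of 'other' for a file named something.doc suggests it is not a real Word document (for example, HTML or RTF saved with a .doc extension). The signature only identifies the container, so 'ole2' does not by itself prove the file is a Word document.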

Answered by Shivam Kotwalia

You can use the textract library. It handles both .doc and .docx files.

import textract
text = textract.process("path/to/file.extension")

You can also use antiword directly (sudo apt-get install antiword). It converts a .doc file to plain text, which you can redirect to a file:

antiword filename.doc > filename.txt

Ultimately, textract uses antiword in the backend for .doc files.

Answered by Billal Begueradj

You can use the python-docx2txt library to read text from Microsoft Word documents. It is an improvement over the python-docx library because it can also extract text from links, headers, and footers. It can even extract images.

You can install it by running: pip install docx2txt.

Let's download a sample Microsoft Word document, save it as test.docx, and read it:

import docx2txt
my_text = docx2txt.process("test.docx")
print(my_text)

[Screenshot: terminal output of the code above]

EDIT:

This does NOT work for .doc files. The only reason I am keeping this answer is that some people seem to find it useful for .docx files.
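The reason a .docx reader cannot open a .doc file is that the two formats are unrelated containers: a .docx file is a ZIP archive whose main text lives in word/document.xml, whereas a .doc file is a binary OLE2 file. The following is a minimal standard-library sketch of that idea, not how docx2txt itself is implemented; the function name docx_paragraphs is hypothetical, and the sketch ignores headers, footers, and other parts of the archive.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path_or_file):
    """Return the text of each paragraph in the main body of a .docx file."""
    # Raises zipfile.BadZipFile if the input is not a ZIP archive (e.g. a .doc).
    with zipfile.ZipFile(path_or_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements.
        paragraphs.append("".join(t.text or "" for t in p.iter(W_NS + "t")))
    return paragraphs
```

Calling docx_paragraphs("test.docx") returns a list of paragraph strings, while passing a legacy .doc file raises zipfile.BadZipFile, which matches the behaviour described in the edit above.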

Answered by 10SecTom

I was trying to do the same. I found lots of information on reading .docx files but much less on .doc. Anyway, I managed to read the text using the following:

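import os
import win32com.client

# Requires Windows with Microsoft Word installed, plus the pywin32 package.
word = win32com.client.Dispatch("Word.Application")
word.Visible = False
# Pass an absolute path; Word does not resolve paths relative to this script.
doc = word.Documents.Open(os.path.abspath("myfile.doc"))
print(doc.Range().Text)
# Close the document without saving, then shut down the Word process.
doc.Close(False)
word.Quit()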

Answered by Aslam Shaik

Prerequisites:

Install antiword: sudo apt-get install antiword

Install docx: pip install docx

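from subprocess import Popen, PIPE

# Legacy docx API (pip install docx); not used by the function below.
from docx import opendocx, getdocumenttext

def document_to_text(filename, file_path):
    # Run antiword on the .doc file and capture its plain-text output.
    cmd = ['antiword', file_path]
    p = Popen(cmd, stdout=PIPE)
    stdout, stderr = p.communicate()
    # Decode as ASCII and silently drop any bytes that are not ASCII.
    return stdout.decode('ascii', 'ignore')

print(document_to_text('your_file_name', 'your_file_path'))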

Note: newer versions of python-docx removed the opendocx and getdocumenttext functions. Make sure to run pip install docx and not install the newer python-docx.
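Decoding antiword's output as ASCII with errors ignored drops every non-ASCII character. As a hedged alternative for Python 3, the sketch below asks antiword for UTF-8 output through its -m option with the UTF-8.txt mapping file that ships with antiword, and decodes accordingly. The function names and the executable parameter are hypothetical, and the sketch assumes antiword is installed and on the PATH.

```python
import subprocess


def antiword_command(doc_path, executable="antiword"):
    """Build the argument list; '-m UTF-8.txt' selects antiword's UTF-8 mapping."""
    return [executable, "-m", "UTF-8.txt", doc_path]


def doc_to_text(doc_path, executable="antiword"):
    """Run antiword on a .doc file and return the extracted text as a str."""
    result = subprocess.run(
        antiword_command(doc_path, executable),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    # Replace any undecodable bytes instead of failing or dropping them silently.
    return result.stdout.decode("utf-8", errors="replace")
```

Usage would be text = doc_to_text("your_file_path"). If antiword is not installed, subprocess.run raises FileNotFoundError, which makes the missing prerequisite obvious.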

Answered by Rahul Nimbal

I agree with Shivam's answer, except that textract is not available for Windows. Also, for some reason antiword sometimes fails to read .doc files and gives an error:

'filename.doc' is not a word document. # This happens when the file wasn't generated via MS Office. Eg: Web-pages may be stored in .doc format offline.

So, I've got the following workaround to extract the text:

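from bs4 import BeautifulSoup as bs

# filename is the path to the web page that was saved with a .doc extension.
with open(filename) as f:
    soup = bs(f.read(), 'html.parser')
# Remove <style> and <script> elements so only visible text remains.
for s in soup(['style', 'script']):
    s.extract()
tmpText = soup.get_text()
# Strip tabs and newlines from the extracted text.
text = tmpText.replace('\t', '').replace('\n', '').strip()
print(text)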

This script works for .doc files that are really HTML, such as saved web pages. Have fun!
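If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser module. This is a swapped-in technique rather than the answer's own method, the names _TextOnly and html_to_text are hypothetical, and unlike the script above it joins text fragments with single spaces.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect text nodes, skipping the contents of <style> and <script>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(markup):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()
    # Join fragments, then collapse tabs, newlines, and repeated spaces.
    return " ".join(" ".join(parser.parts).split())
```

With a file that is really HTML, html_to_text(open(filename).read()) gives the page text as a single line.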

Answered by lucas F

The answer from Shivam Kotwalia works perfectly. However, the result is returned as a bytes object. Sometimes you need it as a string, for example to run regular expressions on it.

I recommend the following code (the first two lines are from Shivam Kotwalia's answer):

import textract

text = textract.process("path/to/file.extension")
text = text.decode("utf-8") 

The last line converts the object text from bytes to a string.
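To illustrate why the decode step matters for regular expressions, here is a small sketch. The bytes value is made-up sample data standing in for the result of textract.process(), so the snippet runs without textract installed.

```python
import re

# Made-up stand-in for the bytes returned by textract.process().
raw = "Invoice 42, total 19,90 \u20ac\n".encode("utf-8")

text = raw.decode("utf-8")           # bytes -> str
numbers = re.findall(r"\d+", text)   # a str pattern requires a str subject
print(numbers)
```

Passing raw directly to re.findall with the same str pattern raises a TypeError, because Python does not mix str patterns with bytes input.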