Reading a .doc file with Python

Question

Asked by Italo Lemos

I got a test for a job application, and my task is to read some .doc files. Does anyone know a library that can do this? I started with some raw Python code:

f = open('test.doc', 'r')
f.read()

But this does not return a readable string. I need to convert it to UTF-8.

Edit: I just want to get the text from this file.

Answer 1

Answered by Shivam Kotwalia

You can use the textract library. It handles both .doc and .docx files.

import textract
text = textract.process("path/to/file.extension")

You can also install antiword (sudo apt-get install antiword) and run it on the file directly. Note that antiword writes plain text rather than a real .docx file, so redirect its output to a text file:

antiword filename.doc > filename.txt

Ultimately, textract uses antiword in the backend for .doc files.

Answer 2

Answered by Billal Begueradj

You can use the python-docx2txt library to read text from Microsoft Word documents. It improves on the python-docx library in that it can also extract text from links, headers and footers. It can even extract images.

You can install it by running: pip install docx2txt.

Let's download and read the first Microsoft document on here:

import docx2txt
my_text = docx2txt.process("test.docx")
print(my_text)

[Screenshot: terminal output of the above code]

EDIT:

This does NOT work for .doc files. The only reason I am keeping this answer is that some people seem to find it useful for .docx files.
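Since docx2txt only handles .docx and antiword only handles .doc, one way to combine the answers in this thread is to pick the tool from the file extension. The following is a minimal sketch under that assumption; the names pick_extractor and extract_text are made up for illustration, and the .doc branch assumes the antiword binary is installed and on the PATH.

```python
import os
import subprocess


def pick_extractor(path):
    """Return the name of the tool to use for a file, based on its extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".docx":
        return "docx2txt"
    if ext == ".doc":
        return "antiword"
    raise ValueError("unsupported extension: %r" % ext)


def extract_text(path):
    """Extract text with docx2txt for .docx files and with antiword for .doc files."""
    tool = pick_extractor(path)
    if tool == "docx2txt":
        import docx2txt  # imported lazily so .doc handling works without it
        return docx2txt.process(path)
    # antiword prints the document's text to standard output.
    result = subprocess.run(["antiword", path], stdout=subprocess.PIPE, check=True)
    # Undecodable bytes are dropped rather than raising an error.
    return result.stdout.decode("utf-8", "ignore")
```

Calling extract_text("test.docx") would go through docx2txt, while extract_text("test.doc") would shell out to antiword. The extension is only a hint, so a mislabeled file can still fail in either branch.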

Answer 3

Answered by 10SecTom

I was trying to do the same. I found lots of information on reading .docx but much less on .doc. Anyway, I managed to read the text using the following:

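import os
import win32com.client

# Requires Windows with Microsoft Word installed (pywin32 package).
word = win32com.client.Dispatch("Word.Application")
word.Visible = False
# Word resolves relative paths against its own working directory, so pass an absolute path.
doc = word.Documents.Open(os.path.abspath("myfile.doc"))
print(doc.Range().Text)
# Close the document without saving changes, then shut Word down.
doc.Close(False)
word.Quit()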

Answer 4

Answered by Aslam Shaik

Prerequisites:

Install antiword: sudo apt-get install antiword

Install docx: pip install docx

from subprocess import Popen, PIPE


def document_to_text(filename, file_path):
    # Run antiword on the .doc file and capture its plain-text output.
    cmd = ['antiword', file_path]
    p = Popen(cmd, stdout=PIPE)
    stdout, stderr = p.communicate()
    # Drop any non-ASCII bytes instead of failing on them.
    return stdout.decode('ascii', 'ignore')


print(document_to_text('your_file_name', 'your_file_path'))

Note: newer versions of python-docx removed the opendocx and getdocumenttext functions. If you need them, make sure to pip install docx and not the new python-docx.

Answer 5

Answered by Rahul Nimbal

I agree with Shivam's answer, except that textract isn't available for Windows. Also, antiword sometimes fails to read '.doc' files and gives an error:

'filename.doc' is not a word document. # This happens when the file wasn't generated via MS Office. Eg: Web-pages may be stored in .doc format offline.

So, I've got the following workaround to extract the text:

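from bs4 import BeautifulSoup as bs

filename = "filename.doc"  # placeholder: an HTML page saved with a .doc extension
with open(filename) as f:
    soup = bs(f.read(), "html.parser")
# Remove style and script elements so only the visible text remains.
for s in soup(['style', 'script']):
    s.extract()
tmpText = soup.get_text()
# Strip tabs and newlines from the extracted text.
text = tmpText.replace('\t', '').replace('\n', '').strip()
print(text)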

This script works for files that are actually HTML saved with a .doc extension, such as the offline web pages mentioned above. Have fun!
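The "is not a word document" error above comes up when the file's content does not match its extension. One way to check this before choosing a tool is to look at the first bytes of the file: legacy Word .doc files are OLE2 compound files and start with a fixed 8-byte signature, while .docx files are ZIP archives. The sketch below assumes that is a good enough test; the function name sniff_word_format and its return labels are made up for illustration.

```python
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .doc (OLE2 compound file)
ZIP_MAGIC = b"PK\x03\x04"  # .docx is a ZIP archive


def sniff_word_format(path):
    """Guess the real format of a file from its first bytes, ignoring its extension."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(OLE2_MAGIC):
        return "doc"
    if head.startswith(ZIP_MAGIC):
        return "docx-or-zip"
    return "other"
```

A file that comes back as "other" is a candidate for the BeautifulSoup workaround, while "doc" should be readable by antiword. Other Office formats such as .xls share the OLE2 signature, so "doc" here only means "not HTML or plain text".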

Answer 6

Answered by lucas F

The answer from Shivam Kotwalia works perfectly. However, the result is returned as a bytes object. Sometimes you need it as a string, for example to apply a regular expression to it.

I recommend the following code (the first two lines are from Shivam Kotwalia's answer):

import textract

text = textract.process("path/to/file.extension")
text = text.decode("utf-8")

The last line converts the object text to a string.
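To see why the decode step matters, here is a small sketch that does not need textract installed. The bytes literal stands in for whatever textract.process returns, and its content is made up for illustration.

```python
import re

# Stand-in for the bytes returned by textract.process(...); the content is made up.
raw = b"Invoice 42 was paid on 2020-01-31.\n"

text = raw.decode("utf-8")  # bytes -> str
# A str pattern can now be applied to the decoded text.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
print(dates)
```

Applying the same str pattern to raw directly would raise a TypeError, because Python 3 does not allow a string pattern on a bytes-like object.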
