Doc, rtf and txt reader in Python

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/3278850/

Time: 2020-08-18 09:23:40  Source: igfitidea

Doc, rtf and txt reader in python

Tags: python, python-3.x

Asked by Rajeev

Like csv.reader(), are there any other functions which can read .rtf, .txt, and .doc files in Python?

Accepted answer by Jesse Dhillon

You can read a text file with:

# The with block closes the file once it has been read
with open("file.txt") as f:
    txt = f.read()

Try PyRTF for RTF files. I would think that reading MS Word .doc files is pretty unlikely unless you are on Windows and can use some of the native MS interfaces for reading those files. This article claims to show how to write scripts that interface with Word.

Answer by Noufal Ibrahim

csv is a specific format, so you need a "parser" to read it. This is what the csv module provides, as you've mentioned. Text files (usually suffixed with .txt) don't have any fixed "format", so you can just read them after opening them (Jesse's answer gives the details). CSV files are commonly text files too, so your distinction is not very accurate.
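
To make that distinction concrete, here is a minimal sketch that reads the same file once as plain text and once through csv.reader. The file name and the sample rows are made up for illustration.

```python
import csv
import os
import tempfile

# Hypothetical sample data, written to a temporary file for the demo.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    f.write('name,comment\nalice,"hello, world"\n')

# Read as plain text: one string, no structure imposed.
with open(path, encoding="utf-8") as f:
    raw_text = f.read()

# Read as CSV: the parser splits each row into fields and handles quoting.
with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

print(repr(raw_text))
print(rows)
```

The plain read keeps the quotes and commas exactly as stored, while csv.reader treats the quoted "hello, world" as a single field.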

As for RTF, there are a bunch of parsers. See this answer for details. The PyRTF library that Jesse mentioned seems to be the most popular, though.

Microsoft Word document files (usually suffixed with .doc) are another beast, since the format is proprietary. I don't have much experience with Python converters, but there are a few command-line ones (like wvHTML) which do a somewhat decent job. This question discusses quite a few. There's also the option of having MS Word itself do that for you via a COM interface, as Jesse has mentioned.

Answer by markling

I've had a real headache trying to do this simple thing for Word and Writer documents.

There is a simple solution: call OpenOffice on the command line to convert your target document to text, then load the text into Python.

Other conversion tools I tried produced unreliable output, while other Python OOo libraries were too complex.

If you just want to get at the text so you can process it, use this on the Linux command line:

soffice --headless --convert-to txt:Text /path_to/document_to_convert.doc

(call it from Python using subprocess if you want to automate it).

It will create a text file that you can simply load into Python.
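
A rough sketch of the subprocess approach might look like the following. It assumes soffice is on the PATH and that the converted file is named after the input with a .txt suffix; the helper names build_convert_command and doc_to_text are made up for this example, and the --outdir option is added so the output location is predictable.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir):
    """Build the soffice command line shown above, plus --outdir."""
    return [
        "soffice", "--headless",
        "--convert-to", "txt:Text",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def doc_to_text(doc_path, out_dir="."):
    """Convert a document with soffice and return the resulting text.

    Assumes soffice is installed and writes <stem>.txt into out_dir.
    """
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    txt_path = Path(out_dir) / (Path(doc_path).stem + ".txt")
    # The output encoding depends on the export filter; errors="replace"
    # is a defensive assumption so unexpected bytes do not raise.
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

Calling doc_to_text("/path_to/document_to_convert.doc") would then return the document's text as a string, provided the conversion succeeds.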

(Credit)

Answer by SystemOverflow LLC

import win32com.client

# tmpFile is assumed to hold the full path of the document to read
if tmpFile.endswith(('.xml', '.doc', '.docx')):
    # Start Word through COM without showing its window
    app = win32com.client.Dispatch("Word.Application")
    app.Visible = False
    doc = app.Documents.Open(tmpFile)

    # Pull the whole document body out as plain text
    docText = doc.Content.Text
    print(docText)
    doc.Close()
    app.Quit()

Answer by Rugved Modak

There is a Python module called 'docx' which you can use to read .docx files. You won't be able to read .doc files with it, though; that format is nearly obsolete nowadays.

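from docx import Document

# filepath is assumed to hold the path to a .docx file
doc = Document(filepath)
# Reading data
data = doc.paragraphs   # Paragraph objects; use .text to get each string
tables = doc.tables     # Table objects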

You can find it here on PyPI.
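
To turn those paragraph and table objects into plain strings, something like the sketch below could work. The helper names extract_text and read_docx are invented for illustration; the attributes used (.paragraphs, .tables, .rows, .cells, .text) are the ones python-docx exposes, and the import is done lazily so the helper can be exercised without the package installed.

```python
def extract_text(doc):
    """Collect paragraph text and table cell text from a Document-like object."""
    paragraphs = [p.text for p in doc.paragraphs]
    tables = [
        [[cell.text for cell in row.cells] for row in table.rows]
        for table in doc.tables
    ]
    return paragraphs, tables


def read_docx(filepath):
    """Open a .docx file with python-docx and extract its text."""
    from docx import Document  # lazy import; requires python-docx
    return extract_text(Document(filepath))
```

read_docx(filepath) returns a list of paragraph strings and, for each table, a list of rows of cell strings.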
