Convert PDF to DOC (Python/Bash)

Note: this page is a bilingual (Chinese/English) translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/26358281/

Time: 2020-08-19 00:23:44  Source: igfitidea

Tags: python, bash, pdf, docx, doc

Asked by AlvaroAV

I've seen some pages that allow users to upload a PDF and return a DOC file, like PdfToWord.

Is there any way to convert a PDF file to a DOC/DOCX file using Python or any Unix command?

Thanks in advance.

Accepted answer by ham-sandwich

If you have LibreOffice installed:

lowriter --invisible --convert-to doc '/your/file.pdf'

If you want to use Python for this:

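import os
import subprocess

# Walk the folder tree and convert every PDF with LibreOffice Writer.
for top, dirs, files in os.walk('/my/pdf/folder'):
    for filename in files:
        if filename.endswith('.pdf'):
            abspath = os.path.join(top, filename)
            # Pass an argument list instead of a shell string so paths
            # containing quotes or spaces are handled safely.
            subprocess.call(['lowriter', '--invisible', '--convert-to', 'doc', abspath])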
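
The loop above writes its output to the current working directory and does not report failures. A minimal sketch of a slightly more defensive variant follows. It makes several assumptions that are not in the original answer: that the `soffice` binary is on the PATH, that `--headless`, `--outdir`, and the `writer_pdf_import` input filter behave as in recent LibreOffice releases, and that the helper names `build_convert_command` and `convert_folder` are purely illustrative.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(pdf_path, out_dir, fmt="docx", binary="soffice"):
    """Return the argument list for a headless LibreOffice conversion.

    Assumption: the installed LibreOffice accepts these flags; the
    writer_pdf_import filter asks it to open the PDF in Writer.
    """
    return [
        binary,
        "--headless",
        "--infilter=writer_pdf_import",
        "--convert-to", fmt,
        "--outdir", str(out_dir),
        str(pdf_path),
    ]


def convert_folder(pdf_dir, out_dir, fmt="docx"):
    """Convert every .pdf under pdf_dir; return the PDFs whose conversion failed."""
    if shutil.which("soffice") is None:
        raise RuntimeError("soffice not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(Path(pdf_dir).rglob("*.pdf")):
        result = subprocess.run(build_convert_command(pdf, out_dir, fmt))
        if result.returncode != 0:
            failed.append(pdf)
    return failed


# Example call (paths are placeholders):
# failed = convert_folder("/my/pdf/folder", "/my/docx/folder")
```

Because all results go into a single output folder, PDFs with the same file name in different subfolders would overwrite each other. A zero exit code is also not a guarantee that the output file exists, so checking the output folder afterwards may be worthwhile.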

Answer by ham-sandwich

This is difficult because PDFs are presentation oriented and Word documents are content oriented. I have tested both and can recommend the following projects.

  1. PyPDF2
  2. PDFMiner

However, you are most definitely going to lose presentational aspects in the conversion.
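
To illustrate what a content-only conversion looks like, here is a rough sketch that extracts plain text with PDFMiner and writes it into a .docx file. It rests on assumptions not stated in the answer: that the `pdfminer.six` package (which provides `pdfminer.high_level.extract_text`) and the `python-docx` package are installed, and the helper names `split_paragraphs` and `pdf_text_to_docx` are hypothetical. Fonts, images, tables, and layout are all dropped.

```python
import re


def split_paragraphs(text):
    """Split extracted text into paragraphs at blank lines.

    Lines inside a paragraph are joined with single spaces. Control
    characters are removed, and form feeds (which PDFMiner emits between
    pages) are treated as line breaks, because neither is valid in a
    .docx XML body.
    """
    text = re.sub(r"[\x00-\x08\x0b\x0e-\x1f]", "", text)
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        joined = " ".join(
            line.strip() for line in block.splitlines() if line.strip()
        )
        if joined:
            paragraphs.append(joined)
    return paragraphs


def pdf_text_to_docx(pdf_path, docx_path):
    """Write the text of pdf_path into a new .docx file (text only)."""
    # Third-party imports stay local so split_paragraphs works without them.
    from pdfminer.high_level import extract_text  # pip install pdfminer.six
    from docx import Document  # pip install python-docx

    document = Document()
    for paragraph in split_paragraphs(extract_text(pdf_path)):
        document.add_paragraph(paragraph)
    document.save(docx_path)


# Example call (file names are placeholders):
# pdf_text_to_docx("Sample.pdf", "sample.docx")
```

The blank-line heuristic is a simplification; multi-column pages or tables will usually need more careful handling than this.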

Answer by Tilal Ahmad

You can use the GroupDocs.Conversion Cloud SDK for Python without installing any third-party tool or software.

Sample Python code:

# Import module
import groupdocs_conversion_cloud

# Get your app_sid and app_key at https://dashboard.groupdocs.cloud (free registration is required).
app_sid = "xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx"
app_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Create instance of the API
convert_api = groupdocs_conversion_cloud.ConvertApi.from_keys(app_sid, app_key)
file_api = groupdocs_conversion_cloud.FileApi.from_keys(app_sid, app_key)

try:
    # Upload the source file to storage
    filename = 'Sample.pdf'
    remote_name = 'Sample.pdf'
    output_name = 'sample.docx'
    strformat = 'docx'

    request_upload = groupdocs_conversion_cloud.UploadFileRequest(remote_name, filename)
    response_upload = file_api.upload_file(request_upload)

    # Convert PDF to Word document
    settings = groupdocs_conversion_cloud.ConvertSettings()
    settings.file_path = remote_name
    settings.format = strformat
    settings.output_path = output_name

    # Options applied when loading the PDF
    loadOptions = groupdocs_conversion_cloud.PdfLoadOptions()
    loadOptions.hide_pdf_annotations = True
    loadOptions.remove_embedded_files = False
    loadOptions.flatten_all_fields = True

    settings.load_options = loadOptions

    # Convert only the first page
    convertOptions = groupdocs_conversion_cloud.DocxConvertOptions()
    convertOptions.from_page = 1
    convertOptions.pages_count = 1

    settings.convert_options = convertOptions

    request = groupdocs_conversion_cloud.ConvertDocumentRequest(settings)
    response = convert_api.convert_document(request)

    print("Document converted successfully: " + str(response))
except groupdocs_conversion_cloud.ApiException as e:
    print("Exception when calling convert_document: {0}".format(e.message))

I'm a developer evangelist at Aspose.

Answer by eleks007

If you want to convert a PDF to an MS Word file type such as docx, I came across this.

Ahsin Shabbir wrote:

import glob
import os
import win32com.client

# Requires Windows with Microsoft Word installed (pywin32 package).
word = win32com.client.Dispatch("Word.Application")
word.Visible = False

pdfs_path = ""  # folder where the .pdf files are stored
reqs_path = ""  # output folder for the .docx files (not defined in the original snippet)
for doc in glob.iglob(os.path.join(pdfs_path, "*.pdf")):
    print(doc)
    filename = os.path.basename(doc)
    in_file = os.path.abspath(doc)
    print(in_file)
    wb = word.Documents.Open(in_file)
    # Same base name as the PDF, with a .docx extension
    out_name = os.path.splitext(filename)[0] + ".docx"
    out_file = os.path.abspath(os.path.join(reqs_path, out_name))
    print("outfile\n", out_file)
    wb.SaveAs2(out_file, FileFormat=16)  # 16 = wdFormatDocumentDefault (.docx)
    print("success...")
    wb.Close()

word.Quit()

This worked like a charm for me; it converted a 500-page PDF with formatting and images.