Parallel Document Conversion ODT > PDF with LibreOffice (bash)
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/15108618/
Asked by timlev
I am converting hundreds of ODT files to PDF files, and it takes a long time doing one after the other. I have a CPU with multiple cores. Is it possible to use bash or python to write a script to do these in parallel? Is there a way to parallelize (not sure if I'm using the right word) batch document conversion using libreoffice from the command line? I have been doing it in python/bash calling the following commands:
libreoffice --headless --convert-to pdf *appsmergeme.odt
OR
subprocess.call(str('cd $HOME; libreoffice --headless --convert-to pdf *appsmergeme.odt'), shell=True);
Thank you!
Tim
Accepted answer by Pancho Jay
You can run libreoffice as a daemon/service. Please check the following link, maybe it helps you too: Daemonize the LibreOffice service
Another possibility is to use unoconv. "unoconv is a command line utility that can convert any file format that OpenOffice can import, to any file format that OpenOffice is capable of exporting."
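As an illustration (not part of the original answer), a minimal sketch of the unoconv route, assuming unoconv is installed; the sleep is just a crude way to wait for the listener to come up:

# start a single LibreOffice listener in the background
unoconv --listener &
sleep 5

# convert every ODT in the current directory through that one instance
unoconv -f pdf *.odt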
Answered by user3165514
This thread and its answers are old. I tested LibreOffice 4.4 and can confirm that I can run libreoffice concurrently. See my script:
for odt in test*odt ; do
    echo "$odt"
    # start each conversion in the background
    soffice --headless --convert-to pdf "$odt" &
    ps -ef | grep ffice
done
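Not part of the original answer, but if you want to cap how many soffice processes run at once instead of launching them all with &, a sketch using GNU xargs (an assumption about your environment) could look like this:

# run at most 4 conversions at a time; adjust -P to your core count
printf '%s\0' test*odt | xargs -0 -n 1 -P 4 soffice --headless --convert-to pdf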
Answered by Chickenmarkus
Since the author already introduced Python as a valid option:
import subprocess
import os, glob
from multiprocessing.dummy import Pool  # wrapper around the threading module

def worker(fname, dstdir=os.path.expanduser("~")):
    subprocess.call(["libreoffice", "--headless", "--convert-to", "pdf", fname],
                    cwd=dstdir)

pool = Pool()
pool.map(worker, glob.iglob(
    os.path.join(os.path.expanduser("~"), "*appsmergeme.odt")
))
Using a thread pool instead of a process pool via multiprocessing.dummy is sufficient because the new processes that provide the real parallelism are spawned by subprocess.call() anyway.
We can set the command as well as the current working directory cwd directly. There is no need to load a shell for each file just to do that. Furthermore, os.path enables cross-platform interoperability.
Answered by luca76
I've written a program in golang to batch convert thousands of doc/xls files.
- define the "root" variable value to the path of your documents to convert
- already converted documents to pdf are skipped (if not, comment the check condition in the visit() function)
- here I'm using 4 threads (I have an Intel i3 with 4 cores). You can modify the value in the main() function
Sometimes LibreOffice fails to convert a file, so you have to open it and convert it to PDF manually. Luckily, that happened for only 10 of the 16,000 documents I had to convert.
package main

import (
    "fmt"
    "os"
    "os/exec"
    "path/filepath"
    "strings"
    "sync"
)

// root dir of your documents to convert
var root = "/.../conversion-from-office/"

var tasks = make(chan *exec.Cmd, 64)

func visit(path string, f os.FileInfo, err error) error {
    if f.IsDir() {
        // fmt.Printf("Entering %s\n", path)
    } else {
        // filepath.Ext includes the leading dot; skip files that are already PDFs
        ext := filepath.Ext(path)
        if strings.ToLower(ext) != ".pdf" {
            outfile := path[0:len(path)-len(ext)] + ".pdf"
            // skip documents whose PDF already exists
            if _, err := os.Stat(outfile); os.IsNotExist(err) {
                fmt.Printf("Converting %s\n", path)
                outdir := filepath.Dir(path)
                tasks <- exec.Command("soffice", "--headless", "--convert-to", "pdf", path, "--outdir", outdir)
            }
        }
    }
    return nil
}

func main() {
    // spawn four worker goroutines
    var wg sync.WaitGroup
    // the ...; i < 4; ... indicates that I'm using 4 threads
    for i := 0; i < 4; i++ {
        wg.Add(1)
        go func() {
            for cmd := range tasks {
                cmd.Run()
            }
            wg.Done()
        }()
    }

    err := filepath.Walk(root, visit)
    fmt.Printf("filepath.Walk() returned %v\n", err)
    close(tasks)

    // wait for the workers to finish
    wg.Wait()
}
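Assuming the source is saved as convert.go (the file name is mine, not the author's), it can be run with:

go run convert.go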
Answered by abhayAndPoorvisDad
We had a similar problem with unoconv. unoconv internally makes use of libreoffice. We solved it by sending multiple files to unoconv in a single invocation. So, instead of iterating over all files, we partition the set of files into buckets, each bucket representing an output format, and then make as many calls as there are buckets.
I am pretty sure libreoffice also has a similar mode.
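A minimal sketch of that bucketing idea (the directory names and formats here are my own illustration, not from the original answer): keep the files for each output format in their own directory and call unoconv once per directory:

# one unoconv invocation per output-format bucket
unoconv -f pdf  pdf-bucket/*.odt
unoconv -f docx docx-bucket/*.odt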
Answered by BRPocock
Untested, but potentially valid:
You /may/ be able to:
- Divide up the files into a number of parallel batches in some equitable way, e.g. placing them all in folders;
- Create a distinct local user account to handle each folder;
- Run Libreoffice serially as each user
e.g.
for paralleluser in timlev1 timlev2 timlev3 timlev4 ; do
    # escape $file so it is expanded by the inner shell, not this one,
    # and background each su so the per-user batches actually run in parallel
    su - $paralleluser -c \
        "for file in /var/spool/pdfbatches/$paralleluser/* ; do \
             libreoffice --headless --convert-to pdf \$file ; done" &
done
wait
By using su - you won't accidentally inherit any environment variables from your real session, so the parallel processes shouldn't interfere with one another (aside from competing for resources).
Keep in mind that these are likely I/O-bound tasks, so running one per CPU core will probably not speed you up very much.
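As a sketch of the first step, splitting the files into per-user folders could be done round-robin like this (the user names and spool layout match the loop above, and the directories are assumed to already exist):

users=(timlev1 timlev2 timlev3 timlev4)
i=0
for file in "$HOME"/*appsmergeme.odt ; do
    # hand files out round-robin, one per user in turn
    mv "$file" "/var/spool/pdfbatches/${users[$(( i % ${#users[@]} ))]}/"
    i=$(( i + 1 ))
done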

