Linux: Extracting vector images from a PDF file

Question

Asked by v923z

Is there a command-line tool on Linux that would extract figures from a PDF file and save them in a vector format? I know about pdfimages, but that would create a bitmap, and that is not what I need.

Answer 1

Answered by Dingo

Not for images only, as you seem to need, but

pdftocairo

http://poppler.freedesktop.org/

http://www.manpagez.com/man/1/pdftocairo/ (manpage)

is able to render a PDF page to other vector formats such as PS, EPS, and SVG.

Assuming you have a PDF page with vector images, you can render this page to SVG and then copy only the image you are interested in.

Note: pdftocairo cannot render a multipage PDF to a multipage SVG.

If you need to convert several PDF pages to SVG, you first need to select the page range and then burst it into single-page PDF files.

Example (converting pages 1-10 of a PDF file to SVG):

1°

pdftk file.pdf cat 1-10 output 1-10.pdf

2°

pdftk 1-10.pdf burst

3°

for f in pg_*.pdf; do pdftocairo -svg "$f"; done

4°

Finally, with sodipodi or inkscape, you can extract the images you are interested in from the SVG rendering of each PDF page.
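
The page-by-page conversion could also be approached without splitting the PDF first. The following is a minimal sketch, assuming pdftocairo's -f and -l options (first and last page to convert) can both be set to the same number to render one page at a time. This is a different route from the pdftk steps above, not a replacement verified against every Poppler version. The function name pdf_pages_to_svg, the output naming scheme, and the DRY_RUN variable are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Sketch: render pages FIRST..LAST of a PDF to one SVG file per page.
# Assumption: pdftocairo accepts -f/-l to select the page range, so
# -f N -l N renders only page N.
# pdf_pages_to_svg is a hypothetical helper name, not part of any tool.
pdf_pages_to_svg() {
    input=$1
    first=$2
    last=$3
    base=${input%.pdf}          # strip the .pdf suffix for output names
    page=$first
    while [ "$page" -le "$last" ]; do
        out="${base}-${page}.svg"
        if [ -n "${DRY_RUN:-}" ]; then
            # Only print the command, so the loop can be checked
            # without Poppler installed.
            echo "pdftocairo -svg -f $page -l $page $input $out"
        else
            pdftocairo -svg -f "$page" -l "$page" "$input" "$out"
        fi
        page=$((page + 1))
    done
}
```

For example, `pdf_pages_to_svg file.pdf 1 10` would write file-1.svg through file-10.svg. Setting DRY_RUN=1 beforehand prints the commands instead of running them.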

Answer 2

Answered by Falko Menge

This article describes the tools gpdfx, inkscape and pdf2svg, which are not completely command-line based but still sound helpful.

Answer 3

Answered by David van Driessche

What do you consider a "figure"? This is a concept that doesn't exist in PDF. The reason there are so many tools that can extract images from a PDF file is that images are a very clearly identified entity.

Your "figures" however, are much less clearly defined. PDF files may contain lots of vector content that you wouldn't call a figure. Text can be stroked for example, which would make it vector art and as such it might be confused with your figures. Other decorative elements may be used in the background of the pages. Text may be underlined, which would be a vector element...

In the other direction, your "figure" may contain a caption that is text, further complicating things.

As PDF doesn't have the notion of a figure, you'll have to figure out how to isolate one on a PDF page (perhaps because the creator application always adds metadata to them, or because they use a special color, or...). If you can isolate them, it should be possible to trim everything irrelevant on the page and export what you need as EPS or SVG using some of the techniques described in the other answer.
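
One possible way to do the trimming, sketched under assumptions: pdftocairo has crop options (-x and -y for the top-left corner of the crop area, -W and -H for its width and height), which its manpage describes as being in points for vector output. The sketch below assumes those options apply to SVG output and that you have already worked out the figure's bounding box yourself. The helper name crop_region_to_svg, the DRY_RUN variable, and all coordinates are hypothetical.

```shell
#!/bin/sh
# Sketch: export one rectangular region of one PDF page as SVG.
# Assumption: pdftocairo's -x/-y/-W/-H crop options work with -svg and
# take values in points. crop_region_to_svg is a hypothetical helper.
crop_region_to_svg() {
    input=$1; page=$2
    x=$3; y=$4; width=$5; height=$6
    output=$7
    # Build the command once so the dry run prints exactly what would run.
    set -- pdftocairo -svg -f "$page" -l "$page" \
        -x "$x" -y "$y" -W "$width" -H "$height" "$input" "$output"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

A call such as `crop_region_to_svg file.pdf 3 72 144 300 200 figure.svg` would then try to export a 300 x 200 point region of page 3. Any vector content that happens to fall inside that rectangle (captions, underlines, background decoration) would come along with it, which is the ambiguity described above.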
