Linux: How to search the contents of multiple PDF files?

Question

Asked by Jestin Joy

How can I search the contents of PDF files in a directory and its subdirectories? I am looking for command-line tools. It seems that grep can't search PDF files.

Answer 1

Accepted answer by sjr

Your distribution should provide a utility called pdftotext:

find /path -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "your pattern"' sh {} \;

The "-" is necessary to make pdftotext write to stdout rather than to a file. The --with-filename and --label= options put the file name in grep's output. The optional --color flag tells grep to colorize its output on the terminal.

(In Ubuntu, pdftotext is provided by the package xpdf-utils or poppler-utils.)

This method, using pdftotext and grep, has an advantage over pdfgrep if you want to use features of GNU grep that pdfgrep doesn't support. Note: pdfgrep-1.3.x supports the -C option for printing lines of context.
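
To make the point about extra GNU grep features concrete, here is a minimal sketch that wraps the same find/pdftotext/grep pipeline in a function and forwards any additional options to grep. The function name pdf_grep_tree and its argument order are made up for this illustration; it assumes GNU grep (for --label) and a find that supports -iname.

```shell
# pdf_grep_tree PATTERN [DIR] [GREP_OPTIONS...]
# Hypothetical helper (the name is an assumption, not an existing tool).
# Searches the text of every PDF under DIR (default: current directory)
# and passes any further arguments straight through to grep.
pdf_grep_tree() {
    if [ "$#" -lt 1 ]; then
        echo "usage: pdf_grep_tree PATTERN [DIR] [GREP_OPTIONS...]" >&2
        return 2
    fi
    pattern=$1
    dir=${2:-.}
    # Drop PATTERN and DIR so that "$@" holds only the extra grep options.
    if [ "$#" -ge 2 ]; then shift 2; else shift 1; fi
    # The file name is handed to the inner shell as an argument instead of
    # being pasted into the script text, so unusual file names stay safe.
    find "$dir" -type f -iname '*.pdf' -exec sh -c '
        pattern=$1; file=$2; shift 2
        pdftotext "$file" - 2>/dev/null |
            grep --with-filename --label="$file" "$@" -e "$pattern"
    ' sh "$pattern" {} "$@" \;
}
```

For example, pdf_grep_tree 'your pattern' /path -i -w -A 2 would run a case-insensitive whole-word search and print two lines of trailing context per match, using grep options that may not all be available in pdfgrep.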

Answer 2

Answered by Nylon Smile

You need a tool such as pdf2text to convert the PDF to a text file first, and then search inside the text. (You will probably lose some information or symbols in the conversion.)
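
The convert-then-search idea can be sketched as a small caching step: extract each PDF to a text file once, then run ordinary grep over the text as often as needed. This sketch uses pdftotext (from the accepted answer) as the converter; the function name pdf_text_cache and the cache layout are assumptions made for illustration.

```shell
# pdf_text_cache SRC_DIR CACHE_DIR
# Hypothetical helper: mirrors every PDF under SRC_DIR as a .txt file under
# CACHE_DIR, skipping PDFs whose cached text is already up to date.
# Pass SRC_DIR without a trailing slash. File names containing newlines
# are not handled by this simple line-based loop.
pdf_text_cache() {
    src=$1
    cache=$2
    find "$src" -type f -iname '*.pdf' | while IFS= read -r pdf; do
        rel=${pdf#"$src"/}            # path relative to SRC_DIR
        txt=$cache/$rel.txt           # e.g. CACHE_DIR/sub/report.pdf.txt
        mkdir -p "$(dirname "$txt")"
        # Convert only when there is no cached text or the PDF is newer.
        if [ ! -e "$txt" ] || [ "$pdf" -nt "$txt" ]; then
            pdftotext "$pdf" "$txt"
        fi
    done
}
```

After pdf_text_cache ~/papers ~/.cache/pdftext (both paths are examples), a plain grep -ril 'your pattern' ~/.cache/pdftext lists the cached text files that match. Repeated searches avoid re-running the conversion, at the cost of extra disk space.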

If you are using a programming language, there are probably PDF libraries written for this purpose, e.g. http://search.cpan.org/dist/CAM-PDF/ for Perl.

Answer 3

Answered by acathur

Try using 'acroread' in a simple script like the one above.

Answer 4

Answered by Graeme

There is pdfgrep, which does exactly what its name suggests.

pdfgrep -R 'a pattern to search recursively from path' /some/path

I've used it for simple searches and it worked fine.

(There are packages in Debian, Ubuntu and Fedora.)

Since version 1.3.0, pdfgrep supports recursive search. This version has been available in Ubuntu since Ubuntu 12.10 (Quantal).
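
If the installed pdfgrep predates 1.3.0 and therefore lacks -R, one possible workaround is to let find do the recursion and hand the files to pdfgrep. This is only a sketch: the function name pdfgrep_tree is made up, and it assumes your pdfgrep accepts -H and several file arguments in one call.

```shell
# pdfgrep_tree PATTERN [DIR]
# Hypothetical fallback for pdfgrep builds without recursive search:
# find walks DIR (default: current directory) and passes the PDFs it
# finds to pdfgrep in batches.
pdfgrep_tree() {
    pattern=$1
    dir=${2:-.}
    # -H asks pdfgrep to print the file name for each match.
    find "$dir" -type f -iname '*.pdf' -exec pdfgrep -H "$pattern" {} +
}
```

For example, pdfgrep_tree 'a pattern' /some/path should behave roughly like pdfgrep -R 'a pattern' /some/path on newer versions.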

Answer 5

Answered by phil

I made this destructive small script. Have fun with it.

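function pdfsearch()
{
    # Usage: pdfsearch PATTERN  (searches all PDFs below the current directory)
    find . -iname '*.pdf' | while IFS= read -r filename
    do
        #echo -e "\033[34;1m// === PDF Document:\033[33;1m $filename\033[0m"
        # "Destructive": the extracted text is written next to each PDF as "<name>.pdf."
        pdftotext -q -enc ASCII7 "$filename" "$filename."; grep -s -H --color=always -i -- "$1" "$filename."
        # remove it!  rm -f "$filename."
    done
}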

Answer 6

Answered by Paul Weibert

I had the same problem, so I wrote a script that searches all PDF files in the specified folder for a string and prints the PDF files which matched the query string.
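
The script itself is not reproduced here, but the behaviour described (print only the names of PDFs that contain a string) can be sketched with pdftotext and grep. This is an independent illustration, not the author's script; the function name pdf_files_matching is made up.

```shell
# pdf_files_matching STRING [DIR]
# Hypothetical helper: prints the path of every PDF under DIR (default:
# current directory) whose extracted text contains STRING.
# grep -F treats STRING as a fixed string rather than a regular expression,
# and grep -q reports only whether a match exists.
pdf_files_matching() {
    query=$1
    dir=${2:-.}
    find "$dir" -type f -iname '*.pdf' -exec sh -c '
        query=$1; shift
        for pdf in "$@"; do
            if pdftotext "$pdf" - 2>/dev/null | grep -qF -e "$query"; then
                printf "%s\n" "$pdf"
            fi
        done
    ' sh "$query" {} +
}
```

For example, pdf_files_matching 'query string' ~/Documents prints one matching path per line, which is convenient to pipe into other commands.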

Maybe this will be helpful to you.

You can download it here

Answer 7

Answered by Aleksey Kontsevich

If you want to see file names with pdftotext, use the following command:

find . -name '*.pdf' -exec echo {} \; -exec pdftotext {} - \; | grep "pattern\|pdf"

Answer 8

Answered by Glutanimate

Recoll is a fantastic full-text GUI search application for Unix/Linux that supports dozens of different formats, including PDF. It can even pass the exact page number and search term of a query to the document viewer, which lets you jump to the result right from its GUI.

Recoll also comes with a viable command-line interface and a web-browser interface.

Answer 9

Answered by Craig

There is an open-source common resource grep tool, crgrep, which searches within PDF files as well as other resources such as content nested in archives, database tables, image metadata, POM file dependencies and web resources, and combinations of these, including recursive search.

The full description under the Files tab pretty much covers what the tool supports.

I developed crgrep as an open-source tool.

Answer 10

Answered by arkhi

My current version of pdfgrep (1.3.0) allows the following:

pdfgrep -HiR 'pattern' /path

From pdfgrep --help:

-H: Print the file name for each match.
-i: Ignore case distinctions.
-R: Search directories recursively.

It works well on my Ubuntu.

