How to search contents of multiple PDF files?
Original question: http://stackoverflow.com/questions/4643438/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not this page): Stack Overflow
Asked by Jestin Joy
How can I search the contents of PDF files in a directory and its subdirectories? I am looking for command-line tools. It seems that grep can't search PDF files.
Accepted answer by sjr
Your distribution should provide a utility called pdftotext:
find /path -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "your pattern"' sh {} \;
The "-" is necessary to make pdftotext write to stdout instead of to a file. The --with-filename and --label= options put the file name in the output of grep. The optional --color flag tells grep to highlight matches in color on the terminal.
(In Ubuntu, pdftotext is provided by the package xpdf-utils or poppler-utils.)
This method, using pdftotext and grep, has an advantage over pdfgrep if you want to use features of GNU grep that pdfgrep doesn't support. Note: pdfgrep-1.3.x supports the -C option for printing lines of context.
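As a sketch of that advantage, the pdftotext and grep pipeline can be wrapped in a small function that hands every extra argument to grep unchanged. The function name pdf_grep_r and its argument order are made up for illustration and are not part of the answer; the sketch assumes pdftotext and GNU grep are installed.

```shell
# Hypothetical helper (name and argument order are assumptions):
#   pdf_grep_r DIR [GREP_OPTIONS...] PATTERN
# The body runs in a subshell "( ... )" so its variables stay local.
pdf_grep_r() (
    dir=$1
    shift
    # One file name per line: names containing newlines are not handled.
    find "$dir" -type f -name '*.pdf' | while IFS= read -r file; do
        # "-" sends the extracted text to stdout; the remaining arguments
        # ("$@") go to grep unchanged, so -n, -i, -E, -C 2 ... all work.
        pdftotext "$file" - </dev/null 2>/dev/null |
            grep --with-filename --label="$file" "$@"
    done
)
```

For example, pdf_grep_r ~/papers -n -i -E 'fourier|laplace' would print the file name and line number of each case-insensitive match of either word.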
Answer by Nylon Smile
You need a tool like pdf2text to first convert your PDF to a text file and then search inside the text. (You will probably lose some information or symbols in the conversion.)
If you are using a programming language, there are probably PDF libraries written for this purpose, e.g. http://search.cpan.org/dist/CAM-PDF/ for Perl.
Answer by acathur
Try using 'acroread' in a simple script like the one above.
Answer by Graeme
There is pdfgrep, which does exactly what its name suggests.
pdfgrep -R 'a pattern to search recursively from path' /some/path
I've used it for simple searches and it worked fine.
(There are packages in Debian, Ubuntu and Fedora.)
Since version 1.3.0, pdfgrep supports recursive search. This version has been available in Ubuntu since Ubuntu 12.10 (Quantal).
Answer by phil
I made this destructive small script. Have fun with it.
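function pdfsearch()
{
    # Usage: pdfsearch PATTERN   (searches PDFs below the current directory)
    find . -iname '*.pdf' | while IFS= read -r filename
    do
        #echo -e "\033[34;1m// === PDF Document:\033[33;1m $filename\033[0m"
        # Writes the extracted text to "<name>.pdf." next to the PDF, then greps it for $1
        pdftotext -q -enc ASCII7 "$filename" "$filename."; grep -s -H --color=always -i -- "$1" "$filename."
        # remove it!  rm -f "$filename."
    done
}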
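The script above leaves a text file next to every PDF it converts. A non-destructive variant could write to a single scratch file instead; the following is only a sketch under that assumption, and the name pdfsearch_tmp is hypothetical. It keeps the same pdftotext options as the original script.

```shell
# Sketch of a non-destructive variant (the function name is hypothetical).
# Usage: pdfsearch_tmp PATTERN   (searches PDFs below the current directory)
pdfsearch_tmp() (
    pattern=$1
    # One scratch file outside the searched tree, removed when this
    # subshell exits.
    tmp=$(mktemp) || exit 1
    trap 'rm -f "$tmp"' EXIT
    # One file name per line: names containing newlines are not handled.
    find . -iname '*.pdf' | while IFS= read -r filename; do
        pdftotext -q -enc ASCII7 "$filename" "$tmp" </dev/null || continue
        # grep reads the scratch file on stdin, so --label can show the
        # PDF name instead of the scratch file name.
        grep -H --label="$filename" -i -- "$pattern" <"$tmp"
    done
)
```

Because the function body is a subshell, the trap and the variables do not leak into the calling shell.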
Answer by Paul Weibert
I had the same problem, so I wrote a script which searches all PDF files in the specified folder for a string and prints the PDF files which matched the query string.
Maybe this will be helpful to you.
You can download it here
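The script itself is not reproduced on this page. As a rough idea of what a script of this kind might do (this is not the author's code, and the name pdf_files_matching is made up), each PDF can be converted with pdftotext and tested with grep -q, printing only the names of the files that match:

```shell
# Hypothetical sketch, not the script from the answer.
# Usage: pdf_files_matching DIR STRING
# Prints the path of every PDF under DIR whose text contains STRING.
pdf_files_matching() (
    dir=$1
    string=$2
    # One file name per line: names containing newlines are not handled.
    find "$dir" -type f -name '*.pdf' | while IFS= read -r file; do
        # -q: no output, only an exit status; -F: treat STRING literally.
        if pdftotext "$file" - </dev/null 2>/dev/null | grep -q -F -- "$string"; then
            printf '%s\n' "$file"
        fi
    done
)
```

A call such as pdf_files_matching ~/docs 'invoice' would list one matching file per line, which is convenient for piping into other tools.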
Answer by Aleksey Kontsevich
If you want to see file names with pdftotext, use the following command:
find . -name '*.pdf' -exec echo {} \; -exec pdftotext {} - \; | grep "pattern\|pdf"
Answer by Glutanimate
Recoll is a fantastic full-text GUI search application for Unix/Linux that supports dozens of different formats, including PDF. It can even pass the exact page number and search term of a query to the document viewer, which lets you jump to the result right from its GUI.
Recoll also comes with a viable command-line interface and a web-browser interface.
Answer by Craig
There is an open source "common resource grep" tool, crgrep, which searches within PDF files and also other resources such as content nested in archives, database tables, image metadata, POM file dependencies and web resources, as well as combinations of these, including recursive search.
The full description under the Files tab pretty much covers what the tool supports.
I developed crgrep as an open source tool.
Answer by arkhi
My current version of pdfgrep (1.3.0) allows the following:
pdfgrep -HiR 'pattern' /path
According to pdfgrep --help:
-H: Print the file name for each match.
-i: Ignore case distinctions.
-R: Search directories recursively.
It works well on my Ubuntu.