
Disclaimer: this page is a mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/10854858/

Date: 2020-09-06 13:30:32  Source: igfitidea

Best practices for searchable archive of thousands of documents (pdf and/or xml)

Tags: xml, pdf, lucene, full-text-search, elasticsearch

Asked by Meltemi

Revisiting a stalled project and looking for advice in modernizing thousands of "old" documents and making them available via web.

Documents exist in various formats, some obsolete: (.doc, PageMaker, hardcopy (OCR), PDF, etc.). Funds are available to migrate the documents into a 'modern' format, and many of the hardcopies have already been OCR'd into PDFs - we had originally assumed that PDF would be the final format but we're open to suggestions (XML?).

Once all docs are in a common format we would like to make their contents available and searchable via a web interface. We'd like the flexibility to return only portions (pages?) of the entire document where a search 'hit' is found (I believe Lucene/elasticsearch makes this possible?!?) Might it be more flexible if content was all XML? If so how/where to store the XML? Directly in database, or as discrete files in the filesystem? What about embedded images/graphs in the documents?

Curious how others might approach this. There is no "wrong" answer I'm just looking for as many inputs as possible to help us proceed.

Thanks for any advice.

Answered by DrTech

In summary: I'm going to be recommending ElasticSearch, but let's break the problem down and talk about how to implement it:

There are a few parts to this:

  1. Extracting the text from your docs to make them indexable
  2. Making this text available as full text search
  3. Returning highlighted snippets of the doc
  4. Knowing where in the doc those snippets are found to allow for paging
  5. Returning the full doc

What can ElasticSearch provide:

  1. ElasticSearch (like Solr) uses Tika to extract text and metadata from a wide variety of doc formats
  2. It, pretty obviously, provides powerful full text search. It can be configured to analyse each doc in the appropriate language, with stemming, boosting the relevance of certain fields (eg title more important than content), ngrams etc, ie standard Lucene stuff
  3. It can return highlighted snippets for each search result
  4. It DOESN'T know where those snippets occur in your doc
  5. It can store the original doc as an attachment, or it can store and return the extracted text. But it'll return the whole doc, not a page.

You could just send the whole doc to ElasticSearch as an attachment, and you'd get full text search. But the sticking points are (4) and (5) above: knowing where you are in a doc, and returning parts of a doc.

Storing individual pages is probably sufficient for your where-am-I purposes (although you could equally go down to paragraph level), but you want them grouped in a way that a doc would be returned in the search results, even if search keywords appear on different pages.

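As a sketch of the page-splitting step: if your text extractor inserts a form feed character between pages (pdftotext does this for PDFs, and some Tika configurations can too), the extracted text can be cut into per-page records tied back to their parent doc. The field names here are illustrative, not anything ES requires:

```python
def split_into_pages(doc_id, text):
    """Split extracted text on form feeds (\\f), which many PDF-to-text
    tools emit between pages, into per-page records keyed to the parent doc."""
    pages = text.split("\f")
    return [
        {"doc_id": doc_id, "page": num, "text": page.strip()}
        for num, page in enumerate(pages, start=1)
    ]

records = split_into_pages(123, "First page text\fSecond page text")
```

If your extractor doesn't mark page boundaries, splitting on paragraphs (blank lines) gives the finer-grained alternative mentioned above.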
First the indexing part: storing your docs in ElasticSearch:

  1. Use Tika (or whatever you're comfortable with) to extract the text from each doc. Leave it as plain text, or as HTML to preserve some formatting. (forget about XML, no need for it).
  2. Also extract the metadata for each doc: title, authors, chapters, language, dates etc
  3. Store the original doc in your filesystem, and record the path so that you can serve it later
  4. In ElasticSearch, index a "doc" doc which contains all of the metadata, and possibly the list of chapters
  5. Index each page as a "page" doc, which contains:

    • A parent field which contains the ID of the "doc" doc (see "Parent-child relationship" below)
    • The text
    • The page number
    • Maybe the chapter title or number
    • Any metadata which you want to be searchable

Now for searching. How you do this depends on how you want to present your results - by page, or grouped by doc.

Results by page are easy. This query returns a list of matching pages (each page is returned in full) plus a list of highlighted snippets from the page:

curl -XGET 'http://127.0.0.1:9200/my_index/page/_search?pretty=1'  -d '
{
   "query" : {
      "text" : {
         "text" : "interesting keywords"
      }
   },
   "highlight" : {
      "fields" : {
         "text" : {}
      }
   }
}
'

Displaying results grouped by "doc" with highlights from the text is a bit trickier. It can't be done with a single query, but a little client side grouping will get you there. One approach might be:

Step 1: Do a top-children-query to find the parent ("doc") whose children ("page") best match the query:

curl -XGET 'http://127.0.0.1:9200/my_index/doc/_search?pretty=1'  -d '
{
   "query" : {
      "top_children" : {
         "query" : {
            "text" : {
               "text" : "interesting keywords"
            }
         },
         "score" : "sum",
         "type" : "page",
         "factor" : "5"
      }
   }
}
'

Step 2: Collect the "doc" IDs from the above query and issue a new query to get the snippets from the matching "page" docs:

curl -XGET 'http://127.0.0.1:9200/my_index/page/_search?pretty=1'  -d '
{
   "query" : {
      "filtered" : {
         "query" : {
            "text" : {
               "text" : "interesting keywords"
            }
         },
         "filter" : {
            "terms" : {
               "doc_id" : [1, 2, 3]
            }
         }
      }
   },
   "highlight" : {
      "fields" : {
         "text" : {}
      }
   }
}
'

Step 3: In your app, group the results from the above query by doc and display them.

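That client-side grouping in Step 3 is a few lines of code. Assuming the standard ES hit structure, something like this preserves the order in which each doc first appears (ie best-scoring doc first):

```python
from collections import OrderedDict

def group_hits_by_doc(hits):
    """Group page-level search hits by their doc_id, keeping docs in the
    order their first (highest-ranked) page appears in the results."""
    grouped = OrderedDict()
    for hit in hits:
        doc_id = hit["_source"]["doc_id"]
        grouped.setdefault(doc_id, []).append(hit)
    return grouped

hits = [
    {"_source": {"doc_id": 1, "page": 4}},
    {"_source": {"doc_id": 2, "page": 1}},
    {"_source": {"doc_id": 1, "page": 9}},
]
grouped = group_hits_by_doc(hits)
```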
With the search results from the second query, you already have the full text of the page which you can display. To move to the next page, you can just search for it:

curl -XGET 'http://127.0.0.1:9200/my_index/page/_search?pretty=1'  -d '
{
   "query" : {
      "constant_score" : {
         "filter" : {
            "and" : [
               {
                  "term" : {
                     "doc_id" : 1
                  }
               },
               {
                  "term" : {
                     "page" : 2
                  }
               }
            ]
         }
      }
   },
   "size" : 1
}
'

Or alternatively, give the "page" docs an ID consisting of $doc_id_$page_num (eg 123_2), then you can just retrieve that page:

curl -XGET 'http://127.0.0.1:9200/my_index/page/123_2'
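The composite ID scheme is trivial to generate and to parse back; a minimal pair of helpers might look like:

```python
def page_doc_id(doc_id, page_num):
    """Build the "page" document ID as $doc_id_$page_num."""
    return f"{doc_id}_{page_num}"

def parse_page_doc_id(page_id):
    """Recover (doc_id, page_num) from a composite page ID."""
    doc_id, _, page_num = page_id.rpartition("_")
    return int(doc_id), int(page_num)
```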

Parent-child relationship:

Normally, in ES (and most NoSQL solutions) each doc/object is independent - there are no real relationships. By establishing a parent-child relationship between the "doc" and the "page", ElasticSearch makes sure that the child docs (ie the "page") are stored on the same shard as the parent doc (the "doc").

This enables you to run the top-children-query, which will find the best matching "doc" based on the content of the "pages".

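The parent-child link is declared in the "page" type's mapping before any "page" docs are indexed. In the ES versions this answer targets it looks roughly like the fragment below (a sketch; the exact API has changed across ES versions):

```json
{
   "page" : {
      "_parent" : {
         "type" : "doc"
      }
   }
}
```

This would be PUT to http://127.0.0.1:9200/my_index/page/_mapping.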
Answered by Josh Siok

I've built and maintain an application that indexes and searches 70k+ PDF documents. I found it was necessary to pull out the plain text from the PDFs, store the contents in SQL and index the SQL table using Lucene. Otherwise, performance was horrible.

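The store-extracted-text-in-SQL step this answer describes is straightforward; a minimal sketch (using SQLite as a stand-in for whatever SQL server you use, with an assumed one-table schema) might be:

```python
import sqlite3

# Store extracted plain text in SQL so the indexer (Lucene, in this answer)
# reads from the table instead of re-parsing the PDFs on every rebuild.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE documents (
                    id INTEGER PRIMARY KEY,
                    path TEXT NOT NULL,
                    content TEXT NOT NULL)""")
conn.execute("INSERT INTO documents (path, content) VALUES (?, ?)",
             ("/archive/report.pdf", "plain text extracted from the PDF"))
conn.commit()
rows = conn.execute("SELECT path, content FROM documents").fetchall()
```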
Answered by Dave Newton

Use Sunspot or RSolr or similar; they handle most major document formats. They use Solr/Lucene.
