java: Search engine in Java?

Question

Asked by lana

I am trying to create a search engine, just to learn and get more experience in Java.
My intention is to store about 100 files on a server, a mixture of HTML, XML, DOC, and TXT, and for each file to have metadata.
So when I search for a keyword, it should display a file with its meta description, like Google does.
My question is: apart from HTML, can you add metadata to any other file formats, so that the meta description is shown?
Would you be able to point me towards a Java search engine that can search within file formats (txt, html) and display the result?
I am working on my own code for this, but would like to have a look at other people's code for some help.

Answer 1

Answered by Dave Newton

Lucene is the canonical Java search engine.

For adding documents from a variety of sources, take a look at Apache Tika, and for a full-blown system with service/web interfaces, Solr.

Lucene allows arbitrary metadata to be associated with its documents. Tika will automatically extract metadata from a variety of formats.

Answer 2

Answered by Thomas

1) My question is, apart from HTML, can you add metadata to any other file formats, so that the meta description is shown?

In general, you would use a database and store the metadata there along with the document. You would then do a keyword search using a database query (possibly using SQL LIKE or ILIKE).

The files might either be stored on the hard drive, with just their paths in the DB, or be put into the database as a CLOB or BLOB, depending on whether you have text or binary documents.
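To make the idea concrete, here is a minimal sketch in plain Java that keeps the "rows" in memory instead of a real database. The class name MetadataSearchSketch, the FileRecord type, and the sample paths are all hypothetical, invented for illustration. The search method only imitates what a query along the lines of ILIKE '%keyword%' on the description column would do; it is not how a database implements it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: an in-memory stand-in for a table of (path, description) rows.
class MetadataSearchSketch {

    /** One row: where the file lives on disk plus its metadata description. */
    static final class FileRecord {
        final String path;
        final String description;

        FileRecord(String path, String description) {
            this.path = path;
            this.description = description;
        }
    }

    /** Case-insensitive substring match on the description, similar in spirit to ILIKE '%keyword%'. */
    static List<FileRecord> search(List<FileRecord> rows, String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        List<FileRecord> hits = new ArrayList<>();
        for (FileRecord row : rows) {
            if (row.description.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(row);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Sample data is made up for the demonstration.
        List<FileRecord> rows = new ArrayList<>();
        rows.add(new FileRecord("docs/report.doc", "Quarterly sales report"));
        rows.add(new FileRecord("docs/notes.txt", "Meeting notes about the search engine"));

        // Print each matching file with its description, like a result listing.
        for (FileRecord hit : search(rows, "SEARCH")) {
            System.out.println(hit.path + " - " + hit.description);
        }
    }
}
```

Because the metadata lives in the record rather than in the file, this works the same way for .doc, .txt, or any other format. With a real database, the loop would be replaced by a query.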

2) Would you be able to point me towards a Java search engine that can search within file formats (txt, html) and display the result?

Try Apache Lucene.

Answer 3

Answered by sbridges

Look at Apache Nutch.

Apache Nutch is an open source web-search software project.

Nutch builds on top of Lucene/Solr for indexing and Tika for parsing documents, and adds its own web crawler.

Answer 4

Answered by stivlo

Google completely ignores meta descriptions nowadays, because they have been either abused or not filled with meaningful values.
Lucene and/or Solr might do what you want; take a look.
100 files is a very small amount. If it's for exercise, you won't have any problem managing this amount of data in any way you like.
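For a hand-rolled exercise at this scale, one common approach (not something this answer prescribes) is a small in-memory inverted index: a map from each word to the set of files containing it. The sketch below assumes the text has already been extracted from each file. The class name InvertedIndexSketch and its methods are hypothetical, and the tokenization is deliberately naive.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of a tiny in-memory inverted index: word -> files containing it.
class InvertedIndexSketch {

    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Splits the text on anything that is not a letter or digit and records each word for the file. */
    void add(String fileName, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(fileName);
        }
    }

    /** Returns the files containing the exact word, ignoring case; empty if there are none. */
    Set<String> search(String keyword) {
        Set<String> hits = postings.get(keyword.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<String>emptySet() : Collections.unmodifiableSet(hits);
    }

    public static void main(String[] args) {
        // Sample file names and contents are made up for the demonstration.
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("intro.txt", "Lucene is a search library");
        index.add("solr.html", "Solr builds on Lucene");
        System.out.println(index.search("lucene"));
    }
}
```

This only handles single-word, exact-match lookups. Ranking, stemming, phrase queries, and fuzzy matching are what libraries such as Lucene add on top.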

Answer 5

Answered by vector

... Lucene and Solr come to mind as far as other people's code is concerned.

Answer 6

Answered by Dewfy

A really good one is Lucene. There are a lot of plugins (which would allow you, for example, to read from .doc files), plus support for multiple languages and a lot of algorithms (such as Levenshtein distance).

Answer 7

Answered by Matthijs Bierman

You'll have to use several libraries. First of all, as many people mentioned before, you can use Lucene to do the actual searching. However, Lucene only handles plain text, so you need to extract the text from the files you index. For this, you could use Apache Tika.

To get started, you should probably buy the book Lucene in Action, 2nd edition. Most of the examples in there are still up to date. If you want to be a cheapskate, you could also just look at the provided source code on that page.

Answer 8

Answered by Alfredo Osorio

Use Apache Tika to extract metadata.

Apache Tika: The Apache Tika toolkit is an ASFv2-licensed open source tool for extracting information from digital documents. Tika allows search engines, content management systems, and other applications that work with various kinds of digital documents to easily detect and extract metadata and content from all major file formats.
