Search Engine in Java?
Original question: http://stackoverflow.com/questions/7930474/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow.
Asked by lana
I am trying to create a search engine, just to learn and get more experience in Java.
My intention is to store about 100 files on a server (a mixture of html, xml, doc and txt), and for each file to have metadata.
So when I search for a keyword, it should display each matching file with its meta description, like Google does.
My question is: apart from html, can you add metadata to other file formats, so that the meta description is shown?
Could you point me towards a Java search engine that can search within file formats (txt, html) and display the results?
I am working on my own code for this, but would like to look at other people's code for some help.
Answered by Dave Newton
Lucene is the canonical Java search engine.
For adding documents from a variety of sources, take a look at Apache Tika; for a full-blown system with service/web interfaces, look at Solr.
Lucene allows arbitrary metadata to be associated with its documents. Tika will automatically cull metadata from a variety of formats.
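To make the idea of "documents with arbitrary metadata" concrete, here is a minimal plain-Java sketch of a keyword index. It deliberately does not use Lucene's API; the class name MiniIndex, its methods, and the "description" metadata key are all made up for illustration, and a real index would add ranking, stemming, and persistence.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical plain-Java stand-in for an index; this is NOT Lucene's API.
class MiniIndex {
    // One indexed file: where it lives plus free-form metadata fields.
    static final class Doc {
        final String path;
        final Map<String, String> metadata;

        Doc(String path, Map<String, String> metadata) {
            this.path = path;
            this.metadata = metadata;
        }
    }

    private final List<Doc> docs = new ArrayList<>();
    // Inverted index: lower-cased term -> ids of the documents containing it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    void add(String path, String text, Map<String, String> metadata) {
        int id = docs.size();
        docs.add(new Doc(path, metadata));
        // Split on anything that is not a letter or digit.
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new LinkedHashSet<>()).add(id);
            }
        }
    }

    // Returns every document whose text contained the exact keyword (case-insensitive).
    List<Doc> search(String keyword) {
        List<Doc> hits = new ArrayList<>();
        for (int id : postings.getOrDefault(keyword.toLowerCase(Locale.ROOT), Set.of())) {
            hits.add(docs.get(id));
        }
        return hits;
    }

    public static void main(String[] args) {
        MiniIndex index = new MiniIndex();
        index.add("notes.txt", "Lucene is a search library", Map.of("description", "Notes on Lucene"));
        index.add("todo.txt", "Buy milk", Map.of("description", "Shopping list"));
        for (Doc hit : index.search("lucene")) {
            System.out.println(hit.path + " - " + hit.metadata.get("description"));
        }
    }
}
```

Because the metadata lives next to the document rather than inside the file, the same approach works for txt, doc, or any other format that has no built-in description field.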
Answered by Thomas
1) My question is: apart from html, can you add metadata to other file formats, so that the meta description is shown?
In general you would use a database and store the metadata along with the document there. You'd then do a keyword search using a database query (possibly using SQL LIKE or ILIKE).
The files might either be stored on the hard drive with just their paths in the DB, or be put into the database as a CLOB or BLOB, depending on whether you have text or binary documents.
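As a rough illustration of the "paths plus metadata in a table" idea, the sketch below keeps the rows in memory and emulates what a case-insensitive LIKE '%keyword%' query would match. No database or JDBC driver is involved; MetadataTable, Row, and the column names are hypothetical, and a real setup would run the query against the database instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical in-memory stand-in for a "documents" table; no database is used.
class MetadataTable {
    // One row: the file's path on disk plus its stored description.
    static final class Row {
        final String path;
        final String description;

        Row(String path, String description) {
            this.path = path;
            this.description = description;
        }
    }

    private final List<Row> rows = new ArrayList<>();

    void insert(String path, String description) {
        rows.add(new Row(path, description));
    }

    // Roughly what "WHERE description ILIKE '%keyword%'" would match:
    // a case-insensitive substring test against every row.
    // Unlike SQL, '%' and '_' in the keyword are treated as literal characters.
    List<Row> findByKeyword(String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        List<Row> matches = new ArrayList<>();
        for (Row row : rows) {
            if (row.description.toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(row);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        MetadataTable table = new MetadataTable();
        table.insert("/files/report.doc", "Quarterly sales report");
        table.insert("/files/readme.txt", "Setup instructions");
        for (Row row : table.findByKeyword("REPORT")) {
            System.out.println(row.path + ": " + row.description);
        }
    }
}
```

A full scan like this is fine for about 100 files; for larger collections an index is the usual choice.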
2) Could you point me towards a Java search engine that can search within file formats (txt, html) and display the results?
Try Apache Lucene.
Answered by sbridges
Look at Apache Nutch.
Apache Nutch is an open source web-search software project.
Nutch builds on top of Lucene/Solr for indexing and Tika for parsing documents, and adds its own web crawler.
Answered by stivlo
- Google completely ignores meta descriptions nowadays, because they have been either abused or not filled with meaningful values.
- Lucene and/or Solr might do what you want; take a look.
- 100 files is a very small amount. If it's for exercise, you won't have any problem managing this amount of data in any way you like.
Answered by vector
Answered by Dewfy
Answered by Matthijs Bierman
You'll have to use several libraries. First of all, as many people have already mentioned, you can use Lucene to do the actual searching. However, Lucene only handles plain text, so you need to extract the text from the files you index. For this, you could use Apache Tika.
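To show what "extract the text before indexing" means in practice, here is a crude sketch for HTML only, using nothing but java.util.regex. It is a stand-in for illustration, not how Tika works: the class HtmlTextExtractor and its methods are hypothetical, the regexes assume simple, well-formed markup with the name attribute before content, and HTML entities are not decoded. For doc, pdf and similar formats a real parser is needed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Crude, hypothetical stand-in for a real parser such as Tika; HTML only.
class HtmlTextExtractor {
    // Assumes the form <meta name="description" content="..."> with name first.
    private static final Pattern META_DESCRIPTION = Pattern.compile(
            "<meta\\s+name=[\"']description[\"']\\s+content=[\"'](.*?)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the meta description, or an empty string if none is present.
    static String metaDescription(String html) {
        Matcher m = META_DESCRIPTION.matcher(html);
        return m.find() ? m.group(1) : "";
    }

    // Drops script/style bodies and all tags, leaving whitespace-normalized text.
    static String plainText(String html) {
        return html
                .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
                .replaceAll("(?s)<[^>]*>", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><meta name=\"description\" content=\"A demo page\">"
                + "<title>Demo</title></head><body><h1>Hello</h1><p>Search me.</p></body></html>";
        System.out.println(metaDescription(html)); // the text to show in results
        System.out.println(plainText(html));       // the text to feed to the index
    }
}
```

The plain text is what gets indexed for searching, while the description is stored alongside the document so it can be shown in the result list.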
To get started, you should probably buy the book Lucene in Action 2nd edition. Most of the examples in there are still up to date. If you want to be a cheapskate you could also just look at the provided source code on that page.
Answered by Alfredo Osorio
Use Apache Tika to extract metadata.
Apache Tika: The Apache Tika toolkit is an ASFv2 licensed open source tool for extracting information from digital documents. Tika allows search engines, content management systems and other applications that work with various kinds of digital documents to easily detect and extract metadata and content from all major file formats.