Python file indexing and searching
Source: http://stackoverflow.com/questions/532312/
Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by Staale
I have a large set of files (HDF) that I need to make searchable. In Java I would use Lucene for this, as it is a file and document indexing engine. I don't know what the Python equivalent would be, though.
Can anyone recommend a library for indexing a large collection of files for fast search? Or is rolling your own the preferred way?
I have looked at PyLucene and Lupy, but both projects seem rather inactive and unsupported, so I am not sure whether I should rely on them.
Final notes: Whoosh and PyLucene seem promising, but Whoosh is still alpha, so I am not sure I want to rely on it; I have problems compiling PyLucene, and there are no actual releases of it. After looking a bit more at the data, I found it is mostly numbers and default text strings, so as of now an indexing engine won't help me. Hopefully these libraries will stabilize and later visitors will find some use for them.
Accepted answer by A. Coady
Lupy has been retired, and its developers recommend PyLucene instead. As for PyLucene, its mailing list activity may be low, but it is definitely supported. In fact, it recently became an official Apache subproject.
You may also want to look at a new contender: Whoosh. It's similar to Lucene, but implemented in pure Python.
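For readers wondering what engines such as Lucene and Whoosh do under the hood, the core data structure is an inverted index: a map from each term to the documents that contain it. The following is a minimal sketch of that idea using only the standard library. It is not Whoosh's or Lucene's API, and the names `tokenize`, `InvertedIndex`, and the sample file names are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Toy inverted index: maps each term to the set of document ids containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        """Register every term of `text` as occurring in `doc_id`."""
        for term in tokenize(text):
            self._postings[term].add(doc_id)

    def search(self, query):
        """Return the sorted ids of documents containing every query term (AND)."""
        terms = tokenize(query)
        if not terms:
            return []
        # Use .get so that unknown terms do not create empty entries.
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return sorted(result)


# Hypothetical documents: a file name plus some descriptive text.
index = InvertedIndex()
index.add("run1.hdf", "temperature sensor calibration run")
index.add("run2.hdf", "pressure sensor test run")
index.add("notes.txt", "calibration notes for the pressure sensor")

print(index.search("sensor run"))            # ['run1.hdf', 'run2.hdf']
print(index.search("pressure calibration"))  # ['notes.txt']
```

A real engine adds much more on top of this (ranking, stemming, on-disk storage, incremental updates), which is the main argument for using a library rather than rolling your own.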
Answer by batbrat
I haven't done indexing before; however, the following may be helpful:
- pyIndex - http://rgaucher.info/beta/pyIndex/ -- File indexing library for Python
- http://www.xml.com/pub/a/ws/2003/05/13/email.html -- A script for searching Outlook email using Python and Lucene
- http://gadfly.sourceforge.net/ -- Aaron Watters's Gadfly database (I think you can use this one for indexing; I haven't used it myself.)
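The last item suggests using a relational database as the index. Gadfly itself is not shown here; as a stand-in, the sketch below uses the standard-library `sqlite3` module to keep a small catalogue of per-file attributes and query it by value range. The table layout, column names, and sample rows are all hypothetical.

```python
import sqlite3

# In-memory database for the example; pass a file path to persist the catalogue.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file_attrs (path TEXT, name TEXT, value REAL)")
# An index on (name, value) lets range queries avoid scanning the whole table.
conn.execute("CREATE INDEX idx_attrs ON file_attrs (name, value)")

# Hypothetical attributes extracted from each file beforehand.
rows = [
    ("data/run1.hdf", "temperature", 291.5),
    ("data/run2.hdf", "temperature", 305.0),
    ("data/run3.hdf", "temperature", 310.2),
    ("data/run3.hdf", "pressure", 101.3),
]
conn.executemany("INSERT INTO file_attrs VALUES (?, ?, ?)", rows)


def find_files(conn, name, low, high):
    """Return paths of files whose attribute `name` lies in [low, high]."""
    cursor = conn.execute(
        "SELECT DISTINCT path FROM file_attrs "
        "WHERE name = ? AND value BETWEEN ? AND ? ORDER BY path",
        (name, low, high),
    )
    return [row[0] for row in cursor]


print(find_files(conn, "temperature", 300, 320))  # ['data/run2.hdf', 'data/run3.hdf']
```

This kind of lookup suits data that is mostly numbers, the case the asker describes in the final notes, where a full-text engine adds little.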
As far as using HDF files goes, I have heard of a module called h5py.
I hope this helps.
Answer by Seb
Answer by Rob Young
A popular C++-based information retrieval library that is often used with Python is Xapian: http://xapian.org/
It's incredibly quick and can happily manage large amounts of data; however, it's not quite as easily extensible as Lucene.
Answer by Saurabh
Elasticsearch can be used to index documents and search them by keyword.
It can also be integrated with graph databases and Hadoop.
Some URLs:
1) https://www.elastic.co/products/elasticsearch
2) https://towardsdatascience.com/getting-started-with-elasticsearch-in-python-c3598e718380
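As a rough illustration of indexing a document and searching by keyword, the sketch below builds the two HTTP requests with the standard library instead of a client package. It assumes an Elasticsearch 7 or later node at `http://localhost:9200` and a hypothetical index named `files`; the helper names and sample document are also made up. The requests are only constructed here; sending them (for example with `urllib.request.urlopen`) requires a running server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: a local, unsecured node
INDEX = "files"                     # hypothetical index name


def _json_request(method, url, payload):
    """Build (but do not send) an HTTP request with a JSON body."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


def index_request(doc_id, document):
    """Request that stores `document` under `doc_id` (PUT /<index>/_doc/<id>)."""
    return _json_request("PUT", f"{BASE_URL}/{INDEX}/_doc/{doc_id}", document)


def search_request(field, keywords):
    """Request that runs a `match` query on one field (POST /<index>/_search)."""
    query = {"query": {"match": {field: keywords}}}
    return _json_request("POST", f"{BASE_URL}/{INDEX}/_search", query)


# Hypothetical document describing one file.
req = index_request("1", {"path": "data/run1.hdf", "description": "sensor calibration run"})
print(req.get_method(), req.full_url)

req = search_request("description", "calibration")
print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

In practice the official Python client wraps these endpoints; the raw requests are shown only to make the index-then-search flow visible.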