Google-like Search Engine in PHP/MySQL
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/502238/
Asked by lkessler
We have OCRed thousands of pages of newspaper articles. The newspaper, issue, date, page number and OCRed text of each page have been put into a MySQL database.
We now want to build a Google-like search engine in PHP to find the pages given a query. It's got to be fast, and take no more than a second for any search.
How should we do it?
Accepted answer by cnu
You can also try out Sphinx (SphinxSearch). Craigslist uses Sphinx, and it can connect to both MySQL and PostgreSQL.
Answer by James Brady
If MySQL's fulltext search is taking 20 seconds per query, you either have it misconfigured or are running it on underpowered hardware. Some big sites are successfully using plain old MyISAM searching.
My vote goes for Solr, however. It's based on Lucene, so you get all the richness and performance of that best-of-breed product, but with a RESTful API, making it very easy to use from PHP. There's even a dW article.
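As a rough illustration of what "RESTful" means here (this is not part of the original answer), a search is just an HTTP GET against Solr's standard `/select` handler. The sketch below is in Python and only builds the request URL. The host, the core name `newspapers` and the field name `text` are assumptions made up for the example.

```python
from urllib.parse import urlencode


def build_solr_select_url(base_url, core, query, rows=10):
    """Build a URL for Solr's standard /select request handler."""
    # q = the query, rows = page size, wt = response format (JSON here)
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/solr/{core}/select?{params}"


# Hypothetical host, core and field names, for illustration only.
url = build_solr_select_url("http://localhost:8983", "newspapers", 'text:"city council"')
print(url)
```

Fetching that URL with any HTTP client returns a JSON document of matching pages, so the PHP side needs little more than an HTTP request and `json_decode`.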
Answer by Glenn
There are some interesting search engines for you to take a look at. I don't know what you mean by "Google like" so I'm just going to ignore that part.
- Take a look at the Lucene engine. The original is high performance but written in Java. There is a port of Lucene to PHP (already mentioned elsewhere), but it is too slow.
- Take a serious look at the Xapian Project. It's fast. It's written in C++, so you'll most probably have to build it for your target server(s), but it has PHP bindings.
Answer by Sun
You could put all the files on Google Docs, then scrape the results to your own web site.
My concern is that OCR accuracy is still an issue, so one consideration for a search requirement is the ability to perform "fuzzy" searches. Fuzzy means that when the OCR confuses two words such as "hat" and "hot", the search engine is smart enough to return results that are similar but not exact. In Oracle, there is a package called UTL_MATCH that compares the similarity between two strings: http://docs.oracle.com/cd/E11882_01/appdev.112/e25788/u_match.htm#ARPLS352
A function like this would be useful.
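To make the idea concrete, here is a minimal sketch of one common string-similarity measure, Levenshtein edit distance, written in Python. It is not taken from the answer and is not Oracle's implementation. The function names and the 0 to 100 scaling are assumptions chosen for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character edits turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the distance to 0..100, where 100 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - edit_distance(a, b) / longest))


print(edit_distance("hat", "hot"))  # one substitution apart
print(similarity("hat", "hot"))
```

A search front end could use a measure like this to accept index terms within a small distance of the query term, at the cost of comparing against many candidate terms.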
Answer by Silver Dragon
Your scenario suggests that you'd like to roll your own; good starting points for a general search engine would include:
- Software Engineering for Internet Applications / Search
- The Anatomy of a Large-Scale Hypertextual Web Search Engine, by some guys
- If your document structure suggests inter-linking features, you can exploit that in the ranking system; see PageRank
If you want to use an off-the-shelf solution:
- If your application is web-based and available on the public internet, you really have to come up with a very good reason not to go with Google Site Search
- Lucene has a port for PHP
Answer by Pradeep
Why don't you try something like Google Search Appliance or Google Enterprise? It will have a cost associated with it, but it will save you from reinventing the wheel and give you "Google-like" search.
Answer by Darryl Hein
Answer by CMS
Answer by Michael MD
SQLite has quite good full-text search capability (look up SQLite FTS3/FTS4; it's surprisingly good).
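A minimal sketch of what that looks like, using Python's built-in `sqlite3` module and assuming the underlying SQLite build was compiled with FTS4 enabled. The table layout, column names and sample rows are invented for the example; they only mirror the fields mentioned in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# An FTS4 virtual table indexes every column for full-text MATCH queries.
conn.execute(
    "CREATE VIRTUAL TABLE pages USING fts4(newspaper, issue_date, page_no, ocr_text)"
)
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?, ?)",
    [
        ("Daily Gazette", "1923-05-01", 1, "The city council approved the new bridge."),
        ("Daily Gazette", "1923-05-01", 2, "Farmers expect a record harvest this year."),
        ("Evening Star", "1931-10-12", 4, "Harvest festival draws large crowds downtown."),
    ],
)


def search(query):
    """Return (newspaper, page_no) for pages whose OCR text matches the query."""
    sql = "SELECT newspaper, page_no FROM pages WHERE ocr_text MATCH ? ORDER BY docid"
    return conn.execute(sql, (query,)).fetchall()


print(search("harvest"))          # single term, case-insensitive for ASCII
print(search('"city council"'))   # phrase query
```

PHP can query the same kind of table through PDO's SQLite driver, so the SQL carries over unchanged.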
If you want a simple PHP DIY approach, indexing with lots of small files split by a hash of the terms being indexed can work very well, and searching can be very fast even in PHP if you take care designing it. (The idea is that a search on a term only needs to read one very small file containing the terms that match the hash, along with their record IDs. You could use bitarray slices to represent record IDs if you want to save disk space.) But indexing every word for fulltext would be slow in PHP; that part should really be done in C.
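The layout described above can be sketched roughly as follows. This is an illustration in Python rather than the answerer's code. The bucket count, the JSON file format, the tokenizer and all function names are assumptions, and a real index would need incremental updates and a more compact encoding.

```python
import hashlib
import json
import os
import re
import tempfile

NUM_BUCKETS = 256  # assumed value; more buckets means smaller files per lookup


def bucket_path(index_dir, term):
    """Map a term to the one small file that may hold its postings."""
    digest = hashlib.md5(term.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return os.path.join(index_dir, f"{bucket:03d}.json")


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(index_dir, pages):
    """pages maps record_id -> text. Writes one JSON file per non-empty bucket."""
    buckets = {}
    for record_id, text in pages.items():
        for term in set(tokenize(text)):
            path = bucket_path(index_dir, term)
            buckets.setdefault(path, {}).setdefault(term, []).append(record_id)
    for path, postings in buckets.items():
        with open(path, "w", encoding="utf-8") as f:
            json.dump(postings, f)


def search(index_dir, term):
    """Read only the single bucket file that can contain the term."""
    term = term.lower()
    path = bucket_path(index_dir, term)
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return sorted(json.load(f).get(term, []))


index_dir = tempfile.mkdtemp()
build_index(index_dir, {1: "City council approves bridge", 2: "Bridge repairs delayed"})
print(search(index_dir, "bridge"))
```

Each lookup opens one file regardless of how large the whole index is, which is what keeps the search side cheap even in an interpreted language.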
For "fuzzy" searches, maybe look at using metaphone hashes.
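PHP ships built-in `metaphone()` and `soundex()` functions for this. Python's standard library has neither, so the sketch below swaps in classic Soundex, a simpler phonetic hash than Metaphone, purely to show the principle: words that sound alike hash to the same key, and the index stores the key instead of the raw word. The function name and structure are my own, not from the answer.

```python
# Soundex consonant groups; vowels, h, w and y have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex key for a word ('' if it has no letters)."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    last_code = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != last_code:
            result += code
        last_code = code  # a vowel resets this, so a repeated code is kept
    return (result + "000")[:4]


print(soundex("hat"), soundex("hot"))
print(soundex("Robert"), soundex("Rupert"))
```

Phonetic keys tolerate vowel confusions such as "hat" versus "hot", but they were designed for spoken-sound similarity, so OCR errors that swap visually similar consonants may still need an edit-distance style comparison.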
For pre-built fulltext tools, check out these: SQLite FTS3/FTS4 (SQLite has very good fulltext search capability!), Sphinx, and KinoSearch (KinoSearch is a bit like Lucene, but the back end is C with a nice, easy Perl wrapper; there is also CLucene, but I think that's still pre-alpha).
Java Lucene (or anything Java-based) probably needs a lot of RAM to be set aside to run a JVM, so it's probably not so great if you are on a budget.

