Oracle 模糊文本搜索
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/14283140/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Oracle Fuzzy text search
提问by Lloyd
How would one go about doing a fuzzy name search in Oracle?
如何在 Oracle 中进行模糊名称搜索?
For example:
例如:
Our data system has the preferred mailing as:
我们的数据系统有首选的邮寄方式:
Mr. Nicolas Jim Raleigh
尼古拉斯·吉姆·罗利先生
But in Facebook, or other search field, the name passed to the algorithm is:
但在 Facebook 或其他搜索领域,传递给算法的名称是:
Nick Jim Raleigh
尼克·吉姆·罗利
The process would run the search name against all of the preferred names, then return the result that contains the most matching characters:
该过程将针对所有首选名称运行搜索名称,然后返回包含最匹配字符的结果:
Mr. [Nic]olas Jim Raleigh
[ Nic]olas Jim Raleigh先生
[Nic]k Jim Raleigh
[尼克]ķ吉姆罗利
16 out of my searched name's 17 characters appear in the preferred name, and we could return a ranked suggestion.
我搜索的名称的 17 个字符中有 16 个出现在首选名称中,我们可以返回排名建议。
[EDITED TO ADD]
[编辑添加]
After initial suggestion, and reading of Oracle's Text Query optionsI have created an index on the table
在初步建议和阅读Oracle 的文本查询选项后,我在表上创建了一个索引
create index ADD_EX_INDX3 on address_extract(pref_mail_name)
indextype is ctxsys.context
parameters ('DATASTORE CTXSYS.DEFAULT_DATASTORE');
and now am able sucessfully retireve
现在可以成功退休
select score(1), ae.pref_mail_name
from address_extract ae
where contains(pref_mail_name,'fuzzy(raleigh,,,weight)',1) > 0
order by score(1) desck
Which returns
哪个返回
100 Mr. Raleigh H. Jameson
100 Mr. Nicolas Jim Raleigh
100 Ms. Susanne M. Raleigh
66 Mrs. LaReign Smith
66 Ms. Rahil Smith
62 Mr. Smith Ragalie
I am struggling to to a full name search however. How would I go about doing the full name?
然而,我正在努力进行全名搜索。我将如何去做全名?
回答by APC
Name matching is hard. Oracle's Text indexing supports fuzzy matching and stemming, which is a start, but consider these names:
姓名匹配很难。Oracle 的文本索引支持模糊匹配和词干提取,这是一个开始,但请考虑以下名称:
- Nicholas Raleigh
- Nihcolas Raleigh
- Nico Raleigh
- Nik Raleigh
- Nicky Raleigh
- Nick Raleigh
- Nikolaus Raleigh
- Nicola Raleigh
- Nikki Raleigh
- Nikola Raleigh
- Nikolai Raleigh
- Nikolaj Raleigh
- 尼古拉斯·罗利
- 尼古拉斯·罗利
- 尼科·罗利
- 尼克·罗利
- 尼基·罗利
- 尼克·罗利
- 尼古拉斯·罗利
- 尼古拉·罗利
- 妮基·罗利
- 尼古拉·罗利
- 尼古拉·罗利
- 尼古拉·罗利
Attempting to match those through abstractions, be it Levenshtein Distance or Double Metaphone, is going to generate false positives and false negatives. This is the nature of abstraction. The best way to get a focused and accurate result set is with a thesaurus (and even this isn't perfect). Unfortunately, assembling a comprehensive thesaurus of names is a gigantic undertaking; to get a sense of the task check out the stats on the NameX site.
尝试通过抽象来匹配这些,无论是 Levenshtein Distance 还是 Double Metaphone,都会产生误报和漏报。这就是抽象的本质。获得重点准确的结果集的最佳方法是使用同义词库(即使这并不完美)。不幸的是,组装一个全面的名称词库是一项艰巨的任务;要了解任务,请查看NameX 站点上的统计数据。
Update: Oracle 11gR2 includes an extension to Oracle Text tailored to name searching. This is highly neat, and definitely the first place to start. Find out more.
更新:Oracle 11gR2 包括为名称搜索量身定制的 Oracle Text 扩展。这是非常整洁的,绝对是第一个开始的地方。 了解更多。