Oracle 中的模糊文本搜索
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/25258954/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Fuzzy text searching in Oracle
提问by Ufuk Can Bicici
I have a large Oracle DB table which contains street names for a whole country, which has 600000+ rows. In my application, I take an address string as input and want to check whether specific substrings of this address string matches one or many of the street names in the table, such that I can label that address substring as the name of a street.
我有一个大型 Oracle DB 表,其中包含整个国家/地区的街道名称,其中有 600000 多行。在我的应用程序中,我将一个地址字符串作为输入,并希望检查该地址字符串的特定子字符串是否与表中的一个或多个街道名称匹配,以便我可以将该地址子字符串标记为街道名称。
Clearly, this should be a fuzzy text matching problem, there is only a small chance that the substring I query has an exact match with the street names in DB table. So there should be some kind of fuzzy text matching approach. I am trying to read the Oracle documentation at http://docs.oracle.com/cd/B28359_01/text.111/b28303/query.htmin which CONTAINS and CATSEARCH search operators are explained. But these seem to be used for more complex tasks like searching a match for the given string in documents. I just want to do that for a column of a table.
显然,这应该是一个模糊文本匹配问题,我查询的子字符串与 DB 表中的街道名称完全匹配的可能性很小。所以应该有某种模糊文本匹配的方法。我正在尝试阅读http://docs.oracle.com/cd/B28359_01/text.111/b28303/query.htm 上的 Oracle 文档,其中解释了 CONTAINS 和 CATSEARCH 搜索运算符。但是这些似乎用于更复杂的任务,例如在文档中搜索给定字符串的匹配项。我只想对表格的一列执行此操作。
What do you suggest me in this case, does Oracle have support for such kind of fuzzy text matching queries?
在这种情况下,您对我有何建议,Oracle 是否支持此类模糊文本匹配查询?
采纳答案by Jon Heller
UTL_MATCHcontains methods for matching strings and comparing their similarity. The edit distance, also known as the Levenshtein Distance, might be a good place to start. Since one string is a substring it may help to compare the edit distance relative to the size of the strings.
UTL_MATCH包含匹配字符串和比较它们的相似性的方法。编辑距离,也称为 Levenshtein 距离,可能是一个很好的起点。由于一个字符串是一个子字符串,因此比较编辑距离相对于字符串的大小可能会有所帮助。
--Addresses that are most similar to each substring.
select substring, address, edit_ratio
from
(
--Rank edit ratios.
select substring, address, edit_ratio
,dense_rank() over (partition by substring order by edit_ratio desc) edit_ratio_rank
from
(
--Calculate edit ratio - edit distance relative to string sizes.
select
substring,
address,
(length(address) - UTL_MATCH.EDIT_DISTANCE(substring, address))/length(substring) edit_ratio
from
(
--Fake addreses (from http://names.igopaygo.com/street/north_american_address)
select '526 Burning Hill Big Beaver District of Columbia 20041' address from dual union all
select '5206 Hidden Rise Whitebead Michigan 48426' address from dual union all
select '2714 Noble Drive Milk River Michigan 48770' address from dual union all
select '8325 Grand Wagon Private Sleeping Buffalo Arkansas 72265' address from dual union all
select '968 Iron Corner Wacker Arkansas 72793' address from dual
) addresses
cross join
(
--Address substrings.
select 'Michigan' substring from dual union all
select 'Not-So-Hidden Rise' substring from dual union all
select '123 Fake Street' substring from dual
)
order by substring, edit_ratio desc
)
)
where edit_ratio_rank = 1
order by substring, address;
These results are not great but hopefully this is at least a good starting point. It should work with any language. But you'll still probably want to combine this with some language- or locale- specific comparison rules.
这些结果不是很好,但希望这至少是一个很好的起点。它应该适用于任何语言。但是您可能仍然希望将其与某些特定于语言或区域设置的比较规则相结合。
SUBSTRING ADDRESS EDIT_RATIO
--------- ------- ----------
123 Fake Street 526 Burning Hill Big Beaver District of Columbia 20041 0.5333
Michigan 2714 Noble Drive Milk River Michigan 48770 1
Michigan 5206 Hidden Rise Whitebead Michigan 48426 1
Not-So-Hidden Rise 5206 Hidden Rise Whitebead Michigan 48426 0.5
回答by Axel Kemper
You could make use of the SOUNDEXfunction available in Oracledatabases. SOUNDEX
computes a numeric signature of a text string. This can be used to find strings which sound similar and thus reduce the number of string comparisons.
您可以利用Oracle数据库中可用的SOUNDEX函数。计算文本字符串的数字签名。这可用于查找听起来相似的字符串,从而减少字符串比较的次数。SOUNDEX
Edited:If SOUNDEX
is not suitable for your local language, you can ask Googlefor a phonetic signature or phonetic matching function which performs better. This function has to be evaluated once per new table entry and once for every query. Therefore, it does not need to reside in Oracle.
编辑:如果SOUNDEX
不适合您的本地语言,您可以向Google寻求性能更好的语音签名或语音匹配功能。此函数必须为每个新表条目评估一次,并且为每个查询评估一次。因此,它不需要驻留在Oracle 中。
Example: A Turkish SOUNDEX is promoted here.
示例:此处宣传土耳其语 SOUNDEX 。
To increase the matching quality, the street name spelling should be unified in a first step. This could be done by applying a set of rules:
为提高匹配质量,第一步应统一街道名称拼写。这可以通过应用一组规则来完成:
Simplified example rules:
简化示例规则:
- Convert all characters to lowercase
- Remove "str." at the end of a name
- Remove "drv." at the end of a name
- Remove "place" at the end of a name
- Remove "ave." at the end of a name
- Sort names with multiple words alphabetically
- Drop auxiliary words like "of", "and", "the", ...
- 将所有字符转换为小写
- 删除“str”。在名字的末尾
- 删除“驱动程序”。在名字的末尾
- 删除名称末尾的“地点”
- 删除“ave”。在名字的末尾
- 按字母顺序对包含多个单词的名称进行排序
- 删除辅助词,如“of”、“and”、“the”、...