Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, link to the original, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/12979390/

Find possible duplicates in two columns ignoring case and special characters

sql postgresql duplicates pattern-matching case-insensitive

Asked by Ghostman

Query

SELECT COUNT(*), name, number
FROM   tbl
GROUP  BY name, number
HAVING COUNT(*) > 1

It sometimes fails to find duplicates that differ only in lower and upper case.
E.g.: sunny and Sunny don't show up as duplicates.
So how can I find all possible duplicates in PostgreSQL across two columns?

Answered by Erwin Brandstetter

lower() / upper()

Use one of these to fold characters to either lower or upper case. Special characters are not affected:

SELECT count(*), lower(name), number
FROM   tbl
GROUP  BY lower(name), number
HAVING count(*) > 1;
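
The grouped query reports each duplicated combination once. To list the underlying rows as well, a window function can attach the group size to every row. This is a minimal sketch, assuming the table tbl and the columns name and number from the question; the aliases sub and ct are made up for illustration:

```sql
-- List every row whose (lower(name), number) combination occurs more than once.
SELECT *
FROM  (
   SELECT *, count(*) OVER (PARTITION BY lower(name), number) AS ct
   FROM   tbl
   ) sub
WHERE  ct > 1
ORDER  BY lower(name), number;
```

The window function keeps the original spelling of each name visible, which helps when deciding which of the variants to keep.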

unaccent()

If you actually want to ignore diacritic signs, like your comments imply, install the additional module unaccent, which provides a text search dictionary that removes accents and also the general purpose function unaccent():

CREATE EXTENSION unaccent;

Makes it very simple:

SELECT lower(unaccent('Büßercafé')) AS norm;

Result:

busercafe

This doesn't strip non-letters. To do that, add regexp_replace(), as @Craig mentioned:

SELECT lower(unaccent(regexp_replace('$s^o&f!t Büßercafé', '\W', '', 'g'))) AS norm;

Result:

softbusercafe
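
Applied to the table from the question, the same expression can serve as the grouping key. This is a sketch, assuming tbl(name, number) from the question and an installed unaccent extension; the alias norm_name is made up for illustration:

```sql
-- Group on a normalized name: non-word characters removed,
-- accents stripped, then folded to lower case.
SELECT count(*)
     , lower(unaccent(regexp_replace(name, '\W', '', 'g'))) AS norm_name
     , number
FROM   tbl
GROUP  BY 2, 3          -- positional references to norm_name and number
HAVING count(*) > 1;
```

Note that '\W' leaves the underscore in place, so names that differ only by an underscore still land in different groups.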

You can even build a functional index on top of that:
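
The index definition itself is not reproduced here. The following is a hedged sketch, not the answer's own code. An expression index only accepts IMMUTABLE functions, and unaccent() is declared STABLE, so it cannot be indexed directly. One common workaround is a thin IMMUTABLE wrapper. The names f_unaccent and tbl_norm_name_idx are made up, and the schema-qualified call assumes the extension was installed into the public schema:

```sql
-- Hypothetical IMMUTABLE wrapper; assumes unaccent lives in schema public.
CREATE OR REPLACE FUNCTION f_unaccent(text)
  RETURNS text
  LANGUAGE sql IMMUTABLE STRICT AS
$func$
SELECT public.unaccent('public.unaccent', $1)
$func$;

-- Expression index on the normalized name plus the second column.
CREATE INDEX tbl_norm_name_idx
ON tbl (lower(f_unaccent(regexp_replace(name, '\W', '', 'g'))), number);
```

Queries have to use exactly the same expression for the planner to consider the index. Declaring the wrapper IMMUTABLE is a promise to PostgreSQL: if the unaccent rules ever change, the index has to be rebuilt.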

Answered by Palpatim

PostgreSQL by default is case sensitive. You can force it to be case-insensitive during searches by converting all values to a single case:

SELECT COUNT(*), lower(name), number
FROM   tbl
GROUP  BY lower(name), number
HAVING COUNT(*) > 1;
  • NOTE: This has not been tested in Postgres

Answered by Craig Ringer

(Updated answer after clarification from the poster): The idea of "unaccenting" or stripping accents (diacritics) is generally bogus. It's OK-ish if you're matching data to find out whether some misguided user or application munged résumé into resume, but it's totally wrong to change one into the other, as they're different words. Even then it'll only kind of work, and should be combined with a string-similarity matching system like trigrams or Levenshtein distances.

The idea of "unaccenting" presumes that any accented character has a single valid equivalent unaccented character, or at least that any given accented character is replaced with at most one unaccented character in an ascii-ized representation of the word. That simply isn't true; in one language an accented "o" might be a "u" sound, while in another it might be a long "oo", and the "ascii-ized" spelling conventions might reflect that. Thus, in one language the correct "un-accenting" of a made-up dummy word that ends in such an accented "o" might be "Tapu", and in another this imaginary word might be ascii-ized to "Tapoo". In neither case will the "un-accented" form "Tapo" match what people actually write when forced into the ascii character set. Words with diacritics may also be ascii-ized into a hyphenated word.

You can see this in English with ligatures, where the word dæmon is ascii-ized to daemon. If you stripped the ligature you'd get dmon, which wouldn't match daemon, the common spelling. The same is true of æther, which is typically ascii-ized to aether or ether. You can also see this in German with ß, typically "expanded" as ss.
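
The string-similarity matching mentioned earlier can be sketched with the additional module pg_trgm, which provides similarity() and the % operator. This assumes tbl(name, number) from the question; the aliases a, b, similar_name and sim are made up, and the module's default similarity threshold applies:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Pair up rows with the same number whose names are similar but not identical.
SELECT a.name, b.name AS similar_name, a.number
     , similarity(a.name, b.name) AS sim
FROM   tbl a
JOIN   tbl b ON  a.number = b.number
             AND a.name   < b.name     -- report each pair once, skip identical names
             AND a.name   % b.name     -- trigram similarity above the threshold
ORDER  BY sim DESC;
```

Exact duplicates are already covered by the GROUP BY queries; this only surfaces near matches for manual review. The levenshtein() function from the fuzzystrmatch module is an alternative measure.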

If you must attempt to "un-accent", "normalize" or "strip" accents:

You can use a character class regular expression to strip out all but a specified set of characters. In this case we use the \W escape (shorthand for the character class [^[:alnum:]_], as per the manual) to exclude "symbols" but not accented characters:

regress=# SELECT regexp_replace(lower(x),'\W','','g') 
          FROM ( VALUES ('$s^o&f!t'),('Café') ) vals(x);
 regexp_replace 
----------------
 soft
 café
(2 rows)

If you want to filter out accented chars too you can define your own character class:

regress=# SELECT regexp_replace(lower(x),'[^a-z0-9]','','g')
          FROM ( VALUES ('$s^o&f!t'),('Café') ) vals(x);
 regexp_replace 
----------------
 soft
 caf
(2 rows)

If you actually intended to replace some accented characters with similar unaccented characters, you could use translate(), as per this wiki article:

regress=# SELECT translate(
        lower(x),
        'àáâãäåèéêëìíîïòóôõöùúûü',   -- accented characters to replace (abbreviated list)
        'aaaaaaeeeeiiiiooooouuuu'    -- unaccented counterparts, position by position
    )
    FROM ( VALUES ('$s^o&f!t'),('Café') ) vals(x);

 translate 
-----------
 $s^o&f!t
 cafe
(2 rows)