SQL 查找包含相似字符串的sql记录
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/5299996/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Find sql records containing similar strings
提问by Nial
I have the following table with 2 columns: ID and Title containing over 500.000 records. For example:
我有一个包含 2 列的下表:ID 和 Title 包含超过 500.000 条记录。例如:
ID Title
-- ------------------------
1 Aliens
2 Aliens (1986)
3 Aliens vs Predator
4 Aliens 2
5 The making of "Aliens"
I need to find records that are very similar, and by that I mean they are different by 3-6 letters, usually this difference is at the end of the Titles. So I have to design a query that returns the records no. 1,2 and 4. I already looked at levenstein distance but I don't know how to apply it. Also because of the number of records the query shouldn't take all night long.
我需要找到非常相似的记录,我的意思是它们相差 3-6 个字母,通常这种区别在标题的末尾。所以我必须设计一个返回记录号的查询。1,2 和 4。我已经看过 levenstein 距离,但我不知道如何应用它。同样由于记录的数量,查询不应该整夜进行。
Thanks for any idea or suggestion
感谢您的任何想法或建议
采纳答案by Simen S
If you really want to define similarity in the exact way that you have formulated in your question, then you would - as you say - have to implement the Levensthein Distance calculation. Either in code calculated on each row retrieved by a DataReader or as a SQL Server function.
如果您真的想以您在问题中制定的确切方式定义相似性,那么您将 - 正如您所说 - 必须实施 Levensthein 距离计算。在对 DataReader 检索的每一行计算的代码中或作为 SQL Server 函数。
The problem stated is actually more tricky than it may appear at first sight, because you cannot assume to know what the mutually sharedelements between two strings may be.
所陈述的问题实际上比乍一看更棘手,因为您无法假设知道两个字符串之间相互共享的元素可能是什么。
So in addition to Levensthein Distance you probably also want to specify a minimum number of consecutive characters that actually have to match (in order for sufficient similarity to be concluded).
因此,除了 Levensthein 距离之外,您可能还想指定实际必须匹配的最少连续字符数(以便得出足够的相似性)。
In sum: It sounds like an overly complicated and time consuming/slow approach.
总而言之:这听起来像是一种过于复杂且耗时/缓慢的方法。
Interestingly, in SQL Server 2008 you have the DIFFERENCEfunction which maybe used for something like this.
有趣的是,在 SQL Server 2008 中,您有DIFFERENCE函数,它可以用于类似的事情。
It evaluates the phonetic value of two strings and calculates the difference. I'm unsure if you will get it to work properly for multi-word expressions such as movie titles since it doesn't deal well with spaces or numbers and puts too much emphasis on the beginning of the string, but it is still an interesting predicate to be aware of.
它评估两个字符串的语音值并计算差异。我不确定你是否能让它适用于多词表达,比如电影片名,因为它不能很好地处理空格或数字,并且过分强调字符串的开头,但它仍然很有趣要知道的谓词。
If what you are actuallytrying to describe is some sort of search feature, then you should look into the Full Text Searchcapabilities of SQL Server 2008. It provides built-in Thesaurus support, fancy SQL predicatesand a ranking mechanism for "best matches"
如果您实际上要描述的是某种搜索功能,那么您应该查看SQL Server 2008的全文搜索功能。它提供了内置的同义词支持、花哨的 SQL谓词和“最佳匹配”的排名机制
EDIT: If you are looking to eliminate duplicates maybe you could look into SSIS Fuzzy Lookup and Fuzzy Group Transformation. I have not tried this myself, but it looks like a promising lead.
编辑:如果您想消除重复项,也许您可以查看 SSIS Fuzzy Lookup 和 Fuzzy Group Transformation。我自己还没有尝试过,但它看起来很有希望。
EDIT2: If you don't want to dig into SSIS and still struggle with the performance of the Levensthein Distance algorithm, you could perhaps try this algorithmwhich appears to be less complex.
EDIT2:如果您不想深入研究 SSIS 并且仍然为 Levensthein 距离算法的性能而苦恼,您也许可以尝试这种似乎不那么复杂的算法。
回答by mattmc3
For all the Googlers out there that run into this question, though it's already been marked as answered, I figured I'd share some code to help with this. If you're able to do CLR user-defined functions on your SQL Server, you can implement your own Levensthein Distance algorithm and then from there create a function that gives you a 'similarity score' called dbo.GetSimilarityScore()
. I've based my score case-insensitivity, without much weight to jumbled word order and non-alphanumeric characters. You can adjust your scoring algorithm as needed, but this is a good start. Credit to this code project linkfor getting me started.
对于所有遇到这个问题的 Google 员工,虽然它已经被标记为已回答,但我想我会分享一些代码来帮助解决这个问题。如果您能够在 SQL Server 上执行 CLR 用户定义函数,则可以实现您自己的 Levensthein 距离算法,然后从那里创建一个函数,该函数为您提供名为dbo.GetSimilarityScore()
. 我的分数不区分大小写,对混乱的词序和非字母数字字符没有太大影响。您可以根据需要调整评分算法,但这是一个好的开始。感谢这个代码项目的链接为我安排开始。
Option Explicit On
Option Strict On
Option Compare Binary
Option Infer On
Imports System
Imports System.Collections.Generic
Imports System.Data
Imports System.Data.SqlClient
Imports System.Data.SqlTypes
Imports System.Text
Imports System.Text.RegularExpressions
Imports Microsoft.SqlServer.Server
Partial Public Class UserDefinedFunctions
Private Const Xms As RegexOptions = RegexOptions.IgnorePatternWhitespace Or RegexOptions.Multiline Or RegexOptions.Singleline
Private Const Xmsi As RegexOptions = Xms Or RegexOptions.IgnoreCase
''' <summary>
''' Compute the distance between two strings.
''' </summary>
''' <param name="s1">The first of the two strings.</param>
''' <param name="s2">The second of the two strings.</param>
''' <returns>The Levenshtein cost.</returns>
<Microsoft.SqlServer.Server.SqlFunction()> _
Public Shared Function ComputeLevenstheinDistance(ByVal string1 As SqlString, ByVal string2 As SqlString) As SqlInt32
If string1.IsNull OrElse string2.IsNull Then Return SqlInt32.Null
Dim s1 As String = string1.Value
Dim s2 As String = string2.Value
Dim n As Integer = s1.Length
Dim m As Integer = s2.Length
Dim d As Integer(,) = New Integer(n, m) {}
' Step 1
If n = 0 Then Return m
If m = 0 Then Return n
' Step 2
For i As Integer = 0 To n
d(i, 0) = i
Next
For j As Integer = 0 To m
d(0, j) = j
Next
' Step 3
For i As Integer = 1 To n
'Step 4
For j As Integer = 1 To m
' Step 5
Dim cost As Integer = If((s2(j - 1) = s1(i - 1)), 0, 1)
' Step 6
d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
Next
Next
' Step 7
Return d(n, m)
End Function
''' <summary>
''' Returns a score between 0.0-1.0 indicating how closely two strings match. 1.0 is a 100%
''' T-SQL equality match, and the score goes down from there towards 0.0 for less similar strings.
''' </summary>
<Microsoft.SqlServer.Server.SqlFunction()> _
Public Shared Function GetSimilarityScore(string1 As SqlString, string2 As SqlString) As SqlDouble
If string1.IsNull OrElse string2.IsNull Then Return SqlInt32.Null
Dim s1 As String = string1.Value.ToUpper().TrimEnd(" "c)
Dim s2 As String = string2.Value.ToUpper().TrimEnd(" "c)
If s1 = s2 Then Return 1.0F ' At this point, T-SQL would consider them the same, so I will too
Dim score1 As SqlDouble = InternalGetSimilarityScore(s1, s2)
If score1.IsNull Then Return SqlDouble.Null
Dim mod1 As String = GetSimilarityString(s1)
Dim mod2 As String = GetSimilarityString(s2)
Dim score2 As SqlDouble = InternalGetSimilarityScore(mod1, mod2)
If score2.IsNull Then Return SqlDouble.Null
If score1 = 1.0F AndAlso score2 = 1.0F Then Return 1.0F
If score1 = 0.0F AndAlso score2 = 0.0F Then Return 0.0F
' Return weighted result
Return (score1 * 0.2F) + (score2 * 0.8F)
End Function
Private Shared Function InternalGetSimilarityScore(s1 As String, s2 As String) As SqlDouble
Dim dist As SqlInt32 = ComputeLevenstheinDistance(s1, s2)
Dim maxLen As Integer = If(s1.Length > s2.Length, s1.Length, s2.Length)
If maxLen = 0 Then Return 1.0F
Return 1.0F - Convert.ToDouble(dist.Value) / Convert.ToDouble(maxLen)
End Function
''' <summary>
''' Removes all non-alpha numeric characters and then sorts
''' the words in alphabetical order.
''' </summary>
Private Shared Function GetSimilarityString(s1 As String) As String
Dim normString = Regex.Replace(If(s1, ""), "\W|_", " ", Xms)
normString = Regex.Replace(normString, "\s+", " ", Xms).Trim()
Dim words As New List(Of String)(normString.Split(" "c))
words.Sort()
Return String.Join(" ", words.ToArray())
End Function
End Class
回答by Clodoaldo Neto
select id, title
from my_table
where
title like 'Aliens%'
and
len(rtrim(title)) < len('Aliens') + 7
回答by Reynaldo
From what you've asked I imagine the differences you're looking for should not be more than a single word at the end of the original title. Is that why 1,2 and 4 are returned?
根据您的要求,我想您正在寻找的差异不应超过原始标题末尾的一个词。这就是返回 1,2 和 4 的原因吗?
Anyway I've made a query that checks the difference at the end consists of a single word, without spaces.
无论如何,我做了一个查询,检查末尾的差异由一个单词组成,没有空格。
declare @title varchar(20)
set @title = 'Aliens'
select id, title
from movies with (nolock)
where ltrim(title) like @title + '%'
and Charindex(' ', ltrim(right(title, len(title) - len(@title)))) = 0
and len(ltrim(right(title, len(title) - len(@title)))) < 7
hope it helps.
希望能帮助到你。
回答by Transact Charlie
if you are using sql server 2008 you should be able to use the FULLTEXT functionality.
如果您使用的是 sql server 2008,您应该能够使用 FULLTEXT 功能。
The basic steps are:
基本步骤是:
1) Create a fulltext index over the column. This will tokenise each string (stremmers, splitters, etc) and let you search for 'LIKE THIS' strings.
1) 在列上创建全文索引。这将标记每个字符串(stremmers、splitters 等)并让您搜索“LIKE THIS”字符串。
The disclaimer is that I've never had to use it but I think it can do what you want.
免责声明是我从来没有使用过它,但我认为它可以做你想做的事。
Start reading here: http://msdn.microsoft.com/en-us/library/ms142571.aspx
从这里开始阅读:http: //msdn.microsoft.com/en-us/library/ms142571.aspx