Fastest way to remove non-numeric characters from a VARCHAR in SQL Server

Source: Stack Overflow, http://stackoverflow.com/questions/106206/. This content is provided under the CC BY-SA 4.0 license: you are free to use and share it, but you must attribute it to the original authors.

Time: 2020-08-31 23:30:46

Tags: sql, sql-server, performance, optimization

Asked by Dan Herbert

I'm writing an import utility that is using phone numbers as a unique key within the import.

I need to check that the phone number does not already exist in my DB. The problem is that phone numbers in the DB could have things like dashes and parentheses and possibly other things. I wrote a function to remove these things, but it is slow, and with thousands of records in my DB and thousands of records to import at once, this process can be unacceptably slow. I've already indexed the phone number column.

I tried using the script from this post:
T-SQL trim &nbsp (and other non-alphanumeric characters)

But that didn't speed it up any.

Is there a faster way to remove non-numeric characters? Something that can perform well when 10,000 to 100,000 records have to be compared.

Whatever is done needs to perform fast.

Update
Given what people responded with, I think I'm going to have to clean the fields before I run the import utility.

To answer the question of what I'm writing the import utility in, it is a C# app. I'm comparing BIGINT to BIGINT now, with no need to alter DB data and I'm still taking a performance hit with a very small set of data (about 2000 records).

Could comparing BIGINT to BIGINT be slowing things down?

I've optimized the code side of my app as much as I can (removed regexes, removed unnecessary DB calls). Although I can't isolate SQL as the source of the problem anymore, I still feel like it is.

Accepted answer by Scott Nichols

I may misunderstand, but you've got two sets of data to remove the strings from: one for the current data in the database, and then a new set whenever you import.

For updating the existing records, I would just use SQL; that only has to happen once.

However, SQL isn't optimized for this sort of operation. Since you said you are writing an import utility, I would do those updates in the context of the import utility itself, not in SQL. This would be much better performance-wise. What are you writing the utility in?

Also, I may be completely misunderstanding the process, so I apologize if off-base.

Edit:
For the initial update, if you are using SQL Server 2005, you could try a CLR function. Here's a quick one using regex. I'm not sure how the performance would compare; I've never used this myself except for a quick test right now.

using System.Data.SqlTypes;
using System.Text.RegularExpressions;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
    // Built once and reused across calls; \D matches any non-digit character.
    private static readonly Regex NonDigit = new Regex(@"\D");

    [SqlFunction]
    public static SqlString StripNonNumeric(SqlString input)
    {
        // Pass NULL through; reading input.Value on a NULL SqlString throws.
        if (input.IsNull)
            return SqlString.Null;

        return new SqlString(NonDigit.Replace(input.Value, ""));
    }
}

After this is deployed, to update you could just use:

UPDATE table SET phoneNumber = dbo.StripNonNumeric(phoneNumber)
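
Deploying the CLR function above might look roughly like the following. This is only a sketch: the assembly name PhoneUtils, the DLL path, and the nvarchar(4000) parameter size are placeholder assumptions that do not come from the answer, and newer SQL Server versions may impose additional CLR security requirements.

```sql
-- Enable CLR integration on the instance (it is off by default).
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;
GO

-- Register the compiled assembly; the name and path here are hypothetical.
CREATE ASSEMBLY PhoneUtils
FROM 'C:\assemblies\PhoneUtils.dll'
WITH PERMISSION_SET = SAFE;
GO

-- Expose the static C# method as a T-SQL scalar function.
CREATE FUNCTION dbo.StripNonNumeric (@input nvarchar(4000))
RETURNS nvarchar(4000)
AS EXTERNAL NAME PhoneUtils.UserDefinedFunctions.StripNonNumeric;
GO
```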

Answer by David Coster

I saw this solution with T-SQL code and PATINDEX. I like it :-)

CREATE Function [fnRemoveNonNumericCharacters](@strText VARCHAR(1000))
RETURNS VARCHAR(1000)
AS
BEGIN
    WHILE PATINDEX('%[^0-9]%', @strText) > 0
    BEGIN
        SET @strText = STUFF(@strText, PATINDEX('%[^0-9]%', @strText), 1, '')
    END
    RETURN @strText
END
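
A quick way to check the function is to call it on a sample value. The sample phone number is made up, and the call assumes the function ended up in the dbo schema, since the CREATE statement above does not name one.

```sql
-- Scalar functions must be called with at least a two-part name.
SELECT dbo.fnRemoveNonNumericCharacters('(555) 123-4567') AS digits;
-- returns 5551234567
```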

Answer by Brainwater

replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(string,'a',''),'b',''),'c',''),'d',''),'e',''),'f',''),'g',''),'h',''),'i',''),'j',''),'k',''),'l',''),'m',''),'n',''),'o',''),'p',''),'q',''),'r',''),'s',''),'t',''),'u',''),'v',''),'w',''),'x',''),'y',''),'z',''),'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G',''),'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P',''),'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')*1 AS string,

:)

Answer by Tom

If you don't want to create a function, or you need just a single inline call in T-SQL, you could try:

set @Phone = REPLACE(REPLACE(REPLACE(REPLACE(@Phone,'(',''),' ',''),'-',''),')','')

Of course, this is specific to removing phone number formatting; it is not a generic "remove all special characters from a string" function.
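
On SQL Server 2017 or later, the same idea could possibly be written with TRANSLATE instead of a chain of REPLACE calls. This is a sketch rather than part of the original answer; the sample value and the extra '.' character are assumptions added for illustration.

```sql
-- TRANSLATE requires SQL Server 2017 or later.
DECLARE @Phone varchar(50) = '(555) 123-4567';

-- Map each formatting character to a space, then strip all spaces at once.
-- The second and third arguments must have the same length (4 characters each).
SET @Phone = REPLACE(TRANSLATE(@Phone, '()-.', '    '), ' ', '');

SELECT @Phone AS digits;  -- 5551234567
```

Like the nested REPLACE version, this only removes the characters that are listed explicitly.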

Answer by AdamE

Simple function:

CREATE FUNCTION [dbo].[RemoveAlphaCharacters](@InputString VARCHAR(1000))
RETURNS VARCHAR(1000)
AS
BEGIN
  WHILE PATINDEX('%[^0-9]%',@InputString)>0
        SET @InputString = STUFF(@InputString,PATINDEX('%[^0-9]%',@InputString),1,'')     
  RETURN @InputString
END

GO

Answer by Debayan Samaddar

create function dbo.RemoveNonNumericChar(@str varchar(500))  
returns varchar(500)  
begin  
declare @startingIndex int  
set @startingIndex=0  
while 1=1  
begin  
    set @startingIndex= patindex('%[^0-9]%',@str)  
    if @startingIndex <> 0  
    begin  
        set @str = replace(@str,substring(@str,@startingIndex,1),'')  
    end  
    else    break;   
end  
return @str  
end

go  

select dbo.RemoveNonNumericChar('aisdfhoiqwei352345234@#$%^$@345345%^@#$^')  

Answer by Dennis Allen

I know it is late to the game, but here is a function that I created for T-SQL that quickly removes non-numeric characters. Note that I have a schema named "String" where I put utility functions for strings.

CREATE FUNCTION String.ComparablePhone( @string nvarchar(32) ) RETURNS bigint AS
BEGIN
    DECLARE @out bigint;

-- 1. table of unique characters to be kept
    DECLARE @keepers table ( chr nchar(1) not null primary key );
    INSERT INTO @keepers ( chr ) VALUES (N'0'),(N'1'),(N'2'),(N'3'),(N'4'),(N'5'),(N'6'),(N'7'),(N'8'),(N'9');

-- 2. Identify the characters in the string to remove
    WITH found ( id, position ) AS
    (
        SELECT 
            ROW_NUMBER() OVER (ORDER BY (n1+n10) DESC), -- since we are using stuff, for the position to continue to be accurate, start from the greatest position and work towards the smallest
            (n1+n10)
        FROM 
            (SELECT 0 AS n1 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) AS d1,
            (SELECT 0 AS n10 UNION SELECT 10 UNION SELECT 20 UNION SELECT 30) AS d10
        WHERE
            (n1+n10) BETWEEN 1 AND len(@string)
            AND substring(@string, (n1+n10), 1) NOT IN (SELECT chr FROM @keepers)
    )
-- 3. Use stuff to snuff out the identified characters
    SELECT 
        @string = stuff( @string, position, 1, '' )
    FROM 
        found
    ORDER BY
        id ASC; -- important to process the removals in order, see ROW_NUMBER() above

-- 4. Try and convert the results to a bigint   
    IF len(@string) = 0
        RETURN NULL; -- an empty string converts to 0

    RETURN convert(bigint,@string); 
END

Then, to use it for the comparison when inserting, something like this:

INSERT INTO Contacts ( phone, first_name, last_name )
SELECT i.phone, i.first_name, i.last_name
FROM Imported AS i
LEFT JOIN Contacts AS c ON String.ComparablePhone(c.phone) = String.ComparablePhone(i.phone)
WHERE c.phone IS NULL -- Exclude those that already exist
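
Because the join above calls the function on both columns, SQL Server generally cannot use an index on phone for the comparison. One possible variation, sketched here with hypothetical temp table and column names (#ExistingKeys, #ImportedKeys, phone_key), computes each comparable value once and then joins on the stored results.

```sql
-- Compute the comparable form once per existing row.
SELECT c.phone, String.ComparablePhone(c.phone) AS phone_key
INTO #ExistingKeys
FROM Contacts AS c;

-- Do the same once per imported row.
SELECT i.phone, i.first_name, i.last_name,
       String.ComparablePhone(i.phone) AS phone_key
INTO #ImportedKeys
FROM Imported AS i;

-- Plain bigint-to-bigint comparison on the precomputed keys.
INSERT INTO Contacts ( phone, first_name, last_name )
SELECT i.phone, i.first_name, i.last_name
FROM #ImportedKeys AS i
LEFT JOIN #ExistingKeys AS e ON e.phone_key = i.phone_key
WHERE e.phone_key IS NULL; -- exclude numbers that already exist
```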

Answer by Dan Williams

Can you remove them in a nightly process, storing them in a separate field, then do an update on changed records right before you run the process?

Or on the insert/update, store the "numeric" format, to reference later. A trigger would be an easy way to do it.
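
A persisted computed column is one alternative to a trigger for keeping a stored "numeric" form up to date. The sketch below is not from the answer: it assumes a Contacts table with a phone column, and the names phone_digits and IX_Contacts_phone_digits, the varchar(32) size, and the list of stripped characters are all hypothetical.

```sql
-- The expression uses only deterministic built-ins,
-- so the computed column can be persisted and indexed.
ALTER TABLE Contacts
ADD phone_digits AS
    CAST(REPLACE(REPLACE(REPLACE(REPLACE(phone, '(', ''), ')', ''), '-', ''), ' ', '')
         AS varchar(32))
    PERSISTED;
GO

CREATE INDEX IX_Contacts_phone_digits ON Contacts (phone_digits);
GO

-- Lookups can then compare against the stored, stripped value.
SELECT phone FROM Contacts WHERE phone_digits = '5551234567';
```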

Answer by Grank

Working with varchars is fundamentally slow and inefficient compared to working with numerics, for obvious reasons. The functions you link to in the original post will indeed be quite slow, as they loop through each character in the string to determine whether or not it's a number. Do that for thousands of records and the process is bound to be slow. This is the perfect job for regular expressions, but they're not natively supported in SQL Server. You can add support using a CLR function, but it's hard to say how slow this will be without trying it. I would definitely expect it to be significantly faster than looping through each character of each phone number, however!

Once you get the phone numbers formatted in your database so that they're only numbers, you could switch to a numeric type in SQL which would yield lightning-fast comparisons against other numeric types. You might find that, depending on how fast your new data is coming in, doing the trimming and conversion to numeric on the database side is plenty fast enough once what you're comparing to is properly formatted, but if possible, you would be better off writing an import utility in a .NET language that would take care of these formatting issues before hitting the database.
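
Once the stored values contain only digits, moving them to a numeric column might be sketched as follows. The Contacts table, the phone column, and the phone_numeric column name are assumptions used for illustration.

```sql
-- Hypothetical column that holds the digits-only number as a bigint.
ALTER TABLE Contacts ADD phone_numeric bigint NULL;
GO

-- TRY_CONVERT (SQL Server 2012+) yields NULL instead of raising an error
-- for values that cannot be converted.
UPDATE Contacts
SET phone_numeric = TRY_CONVERT(bigint, phone)
WHERE phone <> ''                  -- an empty string would convert to 0
  AND phone NOT LIKE '%[^0-9]%';   -- only rows that are already digits-only
```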

Either way though, you're going to have a big problem regarding optional formatting. Even if your numbers are guaranteed to be only North American in origin, some people will put the 1 in front of a fully area-code qualified phone number and others will not, which will cause the potential for multiple entries of the same phone number. Furthermore, depending on what your data represents, some people will be using their home phone number which might have several people living there, so a unique constraint on it would only allow one database member per household. Some would use their work number and have the same problem, and some would or wouldn't include the extension which would cause artificial uniqueness potential again.
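
For the leading 1 problem, one possible normalization is to drop the 1 from 11-digit numbers after the formatting has been stripped. This is a sketch that assumes North American numbers only; the variable name and sample value are made up.

```sql
DECLARE @digits varchar(20) = '15551234567';  -- already stripped to digits

-- Treat an 11-digit number starting with 1 as a 10-digit number plus country code.
SELECT CASE
           WHEN LEN(@digits) = 11 AND LEFT(@digits, 1) = '1'
               THEN RIGHT(@digits, 10)
           ELSE @digits
       END AS normalized;  -- 5551234567
```

Extensions and shared household or work numbers would still need separate handling.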

All of that may or may not impact you, depending on your particular data and usages, but it's important to keep in mind!

Answer by Mike L

I would try Scott's CLR function first but add a WHERE clause to reduce the number of records updated.

UPDATE table SET phoneNumber = dbo.StripNonNumeric(phoneNumber) 
WHERE phonenumber like '%[^0-9]%'

If you know that the great majority of your records have non-numeric characters, it might not help, though.