如何从 T-SQL 中的字符串中删除扩展 ASCII 字符?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/15259622/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-01 14:06:35  来源:igfitidea点击:

How do I remove extended ASCII characters from a string in T-SQL?

sqlsql-serverstringtsqlascii

提问by Codeman

I need to filter out (remove) extended ASCII characters from a SELECT statement in T-SQL.

我需要从 T-SQL 的 SELECT 语句中过滤掉(删除)扩展的 ASCII 字符。

I'm using a stored procedure to do so.

我正在使用存储过程来这样做。

Expected input:

预期输入:

????eeee????

Expected output:

预期输出:

eeee

All that I've found is for MySQL.

我发现的所有内容都是针对MySQL 的

I'm using :

我正在使用 :

Microsoft SQL Server Management Studio  11.0.2100.60
Microsoft .NET Framework    4.0.30319.17929

回答by amit kohan

OK, give this a try. It seems the same issue they have. Anyway you need to modify it based on your requirements.

好的,试试这个。他们似乎有同样的问题。无论如何,您需要根据您的要求对其进行修改。

CREATE FUNCTION RemoveNonASCII 
(
    @nstring nvarchar(255)
)
RETURNS varchar(255)
AS
BEGIN

    DECLARE @Result varchar(255)
    SET @Result = ''

    DECLARE @nchar nvarchar(1)
    DECLARE @position int

    SET @position = 1
    WHILE @position <= LEN(@nstring)
    BEGIN
        SET @nchar = SUBSTRING(@nstring, @position, 1)
        --Unicode & ASCII are the same from 1 to 255.
        --Only Unicode goes beyond 255
        --0 to 31 are non-printable characters
        IF UNICODE(@nchar) between 32 and 255
            SET @Result = @Result + @nchar
        SET @position = @position + 1
    END

    RETURN @Result

END
GO

Check it out at SqlServerCentral

SqlServerCentral查看

回答by Shnugo

The accepted answer is using a loop which should be avoided...

接受的答案是使用一个应该避免的循环......

My solution is completely inlineable, it's easy to create an UDF (or maybe even better: an inline TVF) from this.

我的解决方案是完全内联的,很容易从中创建 UDF(或者甚至更好:内联 TVF)。

The idea: Create a set of running numbers (here it's limited with the count of objects in sys.objects, but there are tons of example how to create a numbers tally on the fly). In the second CTE the strings are splitted to single characters. The final select comes back with the cleaned string.

想法:创建一组运行数字(这里受 sys.objects 中对象数量的限制,但有大量示例如何即时创建数字计数)。在第二个 CTE 中,字符串被拆分为单个字符。最终选择返回清洁后的字符串。

DECLARE @tbl TABLE(ID INT IDENTITY, EvilString NVARCHAR(100));
INSERT INTO @tbl(EvilString) VALUES('????eeee????'),('?a??b?eeee???c?');

WITH RunningNumbers AS
(
    SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS Nmbr
    FROM sys.objects
)
,SingleChars AS
(
    SELECT tbl.ID,rn.Nmbr,SUBSTRING(tbl.EvilString,rn.Nmbr,1) AS Chr
    FROM @tbl AS tbl
    CROSS APPLY (SELECT TOP(LEN(tbl.EvilString)) Nmbr FROM RunningNumbers) AS rn 
)
SELECT ID,EvilString
      ,(
        SELECT '' + Chr 
        FROM SingleChars AS sc
        WHERE sc.ID=tbl.ID AND ASCII(Chr)<128
        ORDER BY sc.Nmbr
        FOR XML PATH('')
      ) AS GoodString
FROM @tbl As tbl

The result

结果

1   ????eeee????    eeee
2   ?a??b?eeee???c? abeeeec

Here is another answerfrom me where this approach is used to replace all specialcharacters with securecharacters to get plain latin

这是我的另一个答案,该方法用于将所有特殊字符替换为安全字符以获得纯拉丁语

回答by Clicks to Bricks

Just correcting above code (it was cutting DOTs)

只是更正上面的代码(它正在削减 DOT)

CREATE FUNCTION [dbo].[RemoveNonASCII] 
(
    @nstring nvarchar(255)
)
RETURNS nvarchar(255)
AS
BEGIN

    DECLARE @Result nvarchar(255)
    SET @Result = ''

    DECLARE @nchar nvarchar(1)
    DECLARE @position int

    SET @position = 1
    WHILE @position <= LEN(@nstring)
    BEGIN
        SET @nchar = SUBSTRING(@nstring, @position, 1)
        --Unicode & ASCII are the same from 1 to 255.
        --Only Unicode goes beyond 255
        --0 to 31 are non-printable characters
        IF (UNICODE(@nchar) between 192 and 198) or (UNICODE(@nchar) between 225 and 230) -- letter A or a with accents
            SET @nchar = 'a'
        IF (UNICODE(@nchar) between 200 and 203) or (UNICODE(@nchar) between 232 and 235) -- letter E or e with accents
            SET @nchar = 'e'
        IF (UNICODE(@nchar) between 204 and 207) or (UNICODE(@nchar) between 236 and 239) -- letter I or i with accents
            SET @nchar = 'i'
        IF (UNICODE(@nchar) between 210 and 214) or (UNICODE(@nchar) between 242 and 246) or (UNICODE(@nchar)=240) -- letter O or o with accents
            SET @nchar = 'o'
        IF (UNICODE(@nchar) between 217 and 220) or (UNICODE(@nchar) between 249 and 252)  -- letter U or u with accents
            SET @nchar = 'u'
        IF (UNICODE(@nchar)=199)  or (UNICODE(@nchar)=231)  -- letter ? or ? 
            SET @nchar = 'c'
        IF (UNICODE(@nchar)=209)  or (UNICODE(@nchar)=241)  -- letter ? or ? 
            SET @nchar = 'n'
        IF (UNICODE(@nchar) between 45 and 46) or (UNICODE(@nchar) between 48 and 57) or (UNICODE(@nchar)  between 64 and 90) or (UNICODE(@nchar) = 95) or (UNICODE(@nchar)  between 97 and 122)
            SET @Result = @Result + @nchar
        SET @position = @position + 1
    END
    set @Result = lower(@Result) -- e-mails in lower case
    RETURN @Result

END

回答by AlxSts

Thank you for sharing your code.

感谢您分享您的代码。

I needed something like it, not just for cleaning up e-mail addresses but for general purpose, filtering user web site input before it reaches SAP ERP via integration modules.

我需要类似的东西,不仅用于清理电子邮件地址,而且用于一般目的,在通过集成模块到达 SAP ERP 之前过滤用户网站输入。

Running in Brazil, it must obey language accents...

在巴西运行,它必须遵守语言口音......

Here goes the resulting code.

这是结果代码。

Maybe it can help someone someday, like it did to me.

也许有一天它可以帮助某人,就像对我一样。

     IF EXISTS
     (
         SELECT * 
           FROM sys.objects 
          WHERE object_id = OBJECT_ID(N'[dbo].[fnRemoveNonASCII]') 
            AND type IN (N'FN')
     )
         DROP FUNCTION dbo.fnRemoveNonASCII
     GO

     CREATE FUNCTION [dbo].[fnRemoveNonASCII] 
     (
         @nstring nvarchar(MAX) 
     )
     RETURNS nvarchar(MAX)
     AS
     BEGIN

         DECLARE @nchar nvarchar(1)      -- individual char in string
         DECLARE @nUnicode nvarchar(3)   -- ASCII for individual char in string
         DECLARE @position int           -- subscript to control loop in the string
         DECLARE @Result nvarchar(MAX)   -- return valus

         SET @Result = ''
         SET @position = 1

         WHILE @position <= LEN(@nstring)

            BEGIN

                --Unicode & ASCII are the same from 1 to 255.
                --Only Unicode goes beyond 255
                --0 to 31 are non-printable characters

                SET @nchar = SUBSTRING(@nstring, @position, 1)

                SET @nUnicode = UNICODE(@nChar) 

                IF @nUnicode = 10 
                   OR @nUnicode = 13
                   OR @nUnicode BETWEEN 32  AND 126
                   OR @nUnicode = 160
                   OR @nUnicode BETWEEN 192 AND 207
                   OR @nUnicode BETWEEN 210 AND 213
                   OR @nUnicode BETWEEN 217 AND 219
                   OR @nUnicode BETWEEN 224 AND 227
                   OR @nUnicode BETWEEN 231 AND 234
                   OR @nUnicode = 236
                   OR @nUnicode = 237
                   OR @nUnicode BETWEEN 242 AND 245
                   OR @nUnicode = 247
                   OR @nUnicode = 249
                   OR @nUnicode = 250        

                   SET @Result = @Result + @nchar

                ELSE IF @nUnicode = 9 -- TAB
                   SET @Result =  @Result + '  '

                ELSE
                   SET @Result =  @Result + ' '

                SET @position = @position + 1
         END

         RETURN @Result
     END

     /*
     ---------------------------------------------------------------------------------------------------------------

     -- Tabela dos caracteres Unicode/ASCII exportáveis

     &#9;   |   &#x9;      |    %9     =    TAB
     &#10;     |    &#xa;      |    %a     =    0A   Line Feed
     &#13;     |    &#xd;      |    %d     =    0D   Carriage Return
     &#32;      |   &#x20;  |   %20 =   <space>  
     &#33;     |    &#x21;  |   %21 =   !            
     &#34;     |    &#x22;  |   %22 =   "            
     &#35;     |    &#x23;  |   %23 =   #            
     &#36;     |    &#x24;  |   %24 =   $            
     &#37;     |    &#x25;  |   %25 =   %            
     &#38;     |    &#x26;  |   %26 =   &            
     &#39;     |    &#x27;  |   %27 =   '            
     &#40;     |    &#x28;  |   %28 =   (            
     &#41;     |    &#x29;  |   %29 =   )            
     &#42;     |    &#x2a;  |   %2a =   *            
     &#43;     |    &#x2b;  |   %2b =   +            
     &#44;     |    &#x2c;  |   %2c =   ,            
     &#45;     |    &#x2d;  |   %2d =   -            
     &#46;     |    &#x2e;  |   %2e =   .            
     &#47;     |    &#x2f;  |   %2f =   /            
     &#48;     |    &#x30;  |   %30 =   0            
     &#49;     |    &#x31;  |   %31 =   1            
     &#50;     |    &#x32;  |   %32 =   2            
     &#51;     |    &#x33;  |   %33 =   3            
     &#52;     |    &#x34;  |   %34 =   4            
     &#53;     |    &#x35;  |   %35 =   5            
     &#54;     |    &#x36;  |   %36 =   6            
     &#55;     |    &#x37;  |   %37 =   7            
     &#56;     |    &#x38;  |   %38 =   8            
     &#57;     |    &#x39;  |   %39 =   9            
     &#58;     |    &#x3a;  |   %3a =   :            
     &#59;     |    &#x3b;  |   %3b =   ;            
     &#60;     |    &#x3c;  |   %3c =   <            
     &#61;     |    &#x3d;  |   %3d =   =            
     &#62;     |    &#x3e;  |   %3e =   >            
     &#63;     |    &#x3f;  |   %3f =   ?            
     &#64;     |    &#x40;  |   %40 =   @   
     &#65;     |    &#x41;  |   %41 =   A
     &#66;     |    &#x42;  |   %42 =   B
     &#67;     |    &#x43;  |   %43 =   C
     &#68;     |    &#x44;  |   %44 =   D
     &#69;     |    &#x45;  |   %45 =   E
     &#70;     |    &#x46;  |   %46 =   F
     &#71;     |    &#x47;  |   %47 =   G
     &#72;     |    &#x48;  |   %48 =   H
     &#73;     |    &#x49;  |   %49 =   I
     &#74;     |    &#x4a;  |   %4a =   J
     &#75;     |    &#x4b;  |   %4b =   K
     &#76;     |    &#x4c;  |   %4c =   L
     &#77;     |    &#x4d;  |   %4d =   M
     &#78;     |    &#x4e;  |   %4e =   N
     &#79;     |    &#x4f;  |   %4f =   O 
     &#80;     |    &#x50;  |   %50 =   P
     &#81;     |    &#x51;  |   %51 =   Q
     &#82;     |    &#x52;  |   %52 =   R
     &#83;     |    &#x53;  |   %53 =   S
     &#84;     |    &#x54;  |   %54 =   T
     &#85;     |    &#x55;  |   %55 =   U
     &#86;     |    &#x56;  |   %56 =   V
     &#87;     |    &#x57;  |   %57 =   W
     &#88;     |    &#x58;  |   %58 =   X
     &#89;     |    &#x59;  |   %59 =   Y
     &#90;     |    &#x5a;  |   %5a =   Z
     &#91;     |    &#x5b;  |   %5b =   [
     &#92;     |    &#x5c;  |   %5c =   \
     &#93;     |    &#x5d;  |   %5d =   ]
     &#94;     |    &#x5e;  |   %5e =   ^
     &#95;     |    &#x5f;  |   %5f =   _
     &#96;     |    &#x60;  |   %60 =   `
     &#97;     |    &#x61;  |   %61 =   a
     &#98;     |    &#x62;  |   %62 =   b
     &#99;     |    &#x63;  |   %63 =   c
     &#100; |   &#x64;  |   %64 =   d
     &#101; |   &#x65;  |   %65 =   e
     &#102; |   &#x66;  |   %66 =   f
     &#103; |   &#x67;  |   %67 =   g
     &#104; |   &#x68;  |   %68 =   h
     &#105; |   &#x69;  |   %69 =   i
     &#106; |   &#x6a;  |   %6a =   j
     &#107; |   &#x6b;  |   %6b =   k
     &#108; |   &#x6c;  |   %6c =   l
     &#109; |   &#x6d;  |   %6d =   m
     &#110; |   &#x6e;  |   %6e =   n
     &#111; |   &#x6f;  |   %6f =   o
     &#112; |   &#x70;  |   %70 =   p
     &#113; |   &#x71;  |   %71 =   q
     &#114; |   &#x72;  |   %72 =   r
     &#115; |   &#x73;  |   %73 =   s
     &#116; |   &#x74;  |   %74 =   t
     &#117; |   &#x75;  |   %75 =   u
     &#118; |   &#x76;  |   %76 =   v
     &#119; |   &#x77;  |   %77 =   w
     &#120; |   &#x78;  |   %78 =   x
     &#121; |   &#x79;  |   %79 =   y
     &#122; |   &#x7a;  |   %7a =   z
     &#123; |   &#x7b;  |   %7b =   {
     &#124; |   &#x7c;  |   %7c =   |
     &#125; |   &#x7d;  |   %7d =   }
     &#126; |   &#x7e;  |   %7e =   ~
     &#160; |   &#xa0;  |   %a0 =   <nbsp>      
     &#192; |   &#xc0;  |   %c0 =   à
     &#193; |   &#xc1;  |   %c1 =   á
     &#194; |   &#xc2;  |   %c2 =   ?
     &#195; |   &#xc3;  |   %c3 =   ?
     &#196; |   &#xc4;  |   %c4 =   ?
     &#197; |   &#xc5;  |   %c5 =   ?
     &#198; |   &#xc6;  |   %c6 =   ?
     &#199; |   &#xc7;  |   %c7 =   ?
     &#200; |   &#xc8;  |   %c8 =   è
     &#201; |   &#xc9;  |   %c9 =   é
     &#202; |   &#xca;  |   %ca =   ê
     &#203; |   &#xcb;  |   %cb =   ?
     &#204; |   &#xcc;  |   %cc =   ì
     &#205; |   &#xcd;  |   %cd =   í
     &#206; |   &#xce;  |   %ce =   ?
     &#207; |   &#xcf;  |   %cf =   ?
     &#210; |   &#xd2;  |   %d2 =   ò
     &#211; |   &#xd3;  |   %d3 =   ó
     &#212; |   &#xd4;  |   %d4 =   ?
     &#213; |   &#xd5;  |   %d5 =   ?
     &#217; |   &#xd9;  |   %d9 =   ù
     &#218; |   &#xda;  |   %da =   ú
     &#219; |   &#xdb;  |   %db =   ?
     &#224; |   &#xe0;  |   %e0 =   à
     &#225; |   &#xe1;  |   %e1 =   á
     &#226; |   &#xe2;  |   %e2 =   a
     &#227; |   &#xe3;  |   %e3 =   ?
     &#231; |   &#xe7;  |   %e7 =   ?
     &#232; |   &#xe8;  |   %e8 =   è
     &#233; |   &#xe9;  |   %e9 =   é
     &#234; |   &#xea;  |   %ea =   ê
     &#236; |   &#xec;  |   %ec =   ì
     &#237; |   &#xed;  |   %ed =   í
     &#242; |   &#xf2;  |   %f2 =   ò
     &#243; |   &#xf3;  |   %f3 =   ó
     &#244; |   &#xf4;  |   %f4 =   ?
     &#245; |   &#xf5;  |   %f5 =   ?
     &#247; |   &#xf7;  |   %f7 =   ÷
     &#249; |   &#xf9;  |   %f9 =   ù
     &#250; |   &#xfa;  |   %fa =   ú
     */

     GO     

回答by brenkdar

The solution provided by amit kohan works, but was to slow for me. With large string data, the iteration trough each character is not optimal. I've made the following function, with isn't very compact but fast.

amit kohan 提供的解决方案有效,但对我来说很慢。对于大字符串数据,每个字符的迭代槽不是最佳的。我做了以下功能,不是很紧凑但很快。

CREATE FUNCTIONdbo.F_StripLowAscii
(
   @Name nvarchar(max)
)
RETURN nvarchar(max) as
BEGIN
DECLARE @Result nvarchar(max) 

    If @Name IS NULL 
      RETURN @Name 

    DECLARE @BlankRange VARCHAR(15) 
    DECLARE @FoundAt INTEGER 

    -- ASCII CHAR #0 needs a special treatment
    SET @BlankRange = '%[' + CHAR(0) + ']%' 
    SET @FoundAt = PATINDEX(@BlankRange ,@Name COLLATE SQL_Latin1_General_CP850_Bin) 
    WHILE @FoundAt > 0 
    BEGIN 
        SET @name = left(@name, @FoundAt-1 ) + SUBSTRING(@name, @FoundAt+1, LEN(@Name))
        SET @FoundAt = PATINDEX(@BlankRange ,@Name COLLATE SQL_Latin1_General_CP850_Bin) 
    END 


    SET @BlankRange = '%[' + CHAR(1)+'-'+CHAR(8) + ']%' 
    SET @FoundAt = PATINDEX(@BlankRange ,@Name COLLATE SQL_Latin1_General_CP850_Bin) 
    WHILE @FoundAt > 0 
    BEGIN 
        SET @name = Replace(@Name, SUBSTRING(@Name, @FoundAt,1),'')
        SET @FoundAt = PATINDEX(@BlankRange ,@Name COLLATE SQL_Latin1_General_CP850_Bin) 
    END 

    SET @BlankRange = '%[' + CHAR(11)+'-'+CHAR(12) + ']%' 
    SET @FoundAt = PATINDEX(@BlankRange ,@Name COLLATE SQL_Latin1_General_CP850_Bin) 
    WHILE @FoundAt > 0 
    BEGIN 
        SET @name = Replace(@Name, SUBSTRING(@Name, @FoundAt,1),'')
        SET @FoundAt = PATINDEX(@BlankRange ,@Name COLLATE SQL_Latin1_General_CP850_Bin) 
    END 

    SET @BlankRange = '%[' + CHAR(14)+'-'+CHAR(31) + ']%' 
    SET @FoundAt = PATINDEX(@BlankRange ,@Name COLLATE SQL_Latin1_General_CP850_Bin) 
    WHILE @FoundAt > 0 
    BEGIN 
        SET @name = Replace(@Name, SUBSTRING(@Name, @FoundAt,1),'')
        SET @FoundAt = PATINDEX(@BlankRange ,@Name COLLATE SQL_Latin1_General_CP850_Bin) 
    END 

    RETURN @Name 

END
GO