C++ Strip non-ASCII Characters from string

Source: Stack Overflow, http://stackoverflow.com/questions/10178700/ (page timestamp: 2020-08-27 13:43:10). Content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.

Tags: c++, string, ascii

Asked by AnthonyW

Before you get started: yes, I know this is a duplicate question, and yes, I have looked at the posted solutions. My problem is that I could not get them to work.

bool invalidChar (char c)
{ 
    return !isprint((unsigned)c); 
}
void stripUnicode(string & str)
{
    str.erase(remove_if(str.begin(),str.end(), invalidChar), str.end()); 
}

I tested this method on "Prusæus, Ægyptians," and it did nothing. I also attempted to substitute isprint for isalnum.

The real problem occurs when, in another section of my program, I convert string->wstring->string. The conversion balks if there are Unicode chars in the string->wstring conversion.

Ref:

How can you strip non-ASCII characters from a string? (in C#)

How to strip all non alphanumeric characters from a string in c++?

Edit:

I would still like to remove all non-ASCII chars regardless, but if it helps, here is where I am crashing:

// Convert to wstring
wchar_t* UnicodeTextBuffer = new wchar_t[ANSIWord.length()+1];
wmemset(UnicodeTextBuffer, 0, ANSIWord.length()+1);
mbstowcs(UnicodeTextBuffer, ANSIWord.c_str(), ANSIWord.length());
wWord = UnicodeTextBuffer; //CRASH

Error Dialog

MSVC++ Debug Library

Debug Assertion Failed!

Program: //myproject

File: f:\dd\vctools\crt_bld\self_x86\crt\src\isctype.c

Line: //Above

Expression:(unsigned)(c+1)<=256
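
As a hedged sketch of the string->wstring step above: std::mbstowcs returns (size_t)-1 when it meets a byte sequence that is not valid in the current C locale, so checking that return value lets the caller handle bad input instead of continuing with a partly filled buffer. The helper name toWide is hypothetical and not part of the original post. The sketch also swaps the raw new[] buffer for a std::vector so nothing leaks.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): converts a multibyte
// string to a wide string using the current C locale.
// Returns false if mbstowcs reports an invalid multibyte sequence.
bool toWide(const std::string& in, std::wstring& out)
{
    // One extra slot so the buffer is always non-empty and zero-terminated.
    std::vector<wchar_t> buffer(in.length() + 1, L'\0');

    // Convert at most in.length() wide characters.
    std::size_t n = std::mbstowcs(buffer.data(), in.c_str(), in.length());
    if (n == static_cast<std::size_t>(-1))
        return false; // invalid byte sequence for this locale

    out.assign(buffer.data(), n);
    return true;
}
```

A caller would test the result, for example `if (!toWide(ANSIWord, wWord)) { /* handle bad input */ }`, before using the wide string.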

Edit:

Further compounding the matter: the .txt file I am reading in from is ANSI encoded. Everything within should be valid.

Solution:

bool invalidChar (char c) 
{  
    return !(c>=0 && c <128);   
} 
void stripUnicode(string & str) 
{ 
    str.erase(remove_if(str.begin(),str.end(), invalidChar), str.end());  
}

If someone else would like to copy/paste this, I can check this question off.

EDIT:

For future reference: try using the __isascii, iswascii commands

Answered by AnthonyW

Solution:

bool invalidChar (char c) 
{  
    return !(c>=0 && c <128);   
} 
void stripUnicode(string & str) 
{ 
    str.erase(remove_if(str.begin(),str.end(), invalidChar), str.end());  
}

EDIT:

For future reference: try using the __isascii, iswascii commands
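
A self-contained variant of the solution above, offered as a sketch rather than the author's exact code: it adds the required headers and std:: qualifiers and compares the byte as unsigned char, which gives the same result as the c>=0 && c<128 test whether plain char is signed or unsigned. The names isNonAscii and stripNonAscii are illustrative.

```cpp
#include <algorithm>
#include <string>

// Returns true for any byte outside the 7-bit ASCII range [0, 127].
bool isNonAscii(char c)
{
    return static_cast<unsigned char>(c) > 127;
}

// Erase-remove idiom: remove_if moves the kept characters to the front,
// erase then trims the leftover tail.
void stripNonAscii(std::string& str)
{
    str.erase(std::remove_if(str.begin(), str.end(), isNonAscii), str.end());
}
```

Given the UTF-8 bytes of "Prusæus, Ægyptians," this leaves "Prusus, gyptians," because both bytes of each two-byte character are dropped.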

Answered by James Kanze

At least one problem is in your invalidChar function. It should be:

return !isprint( static_cast<unsigned char>( c ) );

Casting a char to an unsigned is likely to give some very, very big values if the char is negative (UINT_MAX+1 + c). Passing such a value to isprint is undefined behavior.
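
To illustrate the cast, here is a minimal sketch (isPrintableByte is a hypothetical name, not from the original answer). Converting through unsigned char first keeps the argument in the range 0..UCHAR_MAX that isprint accepts, whereas a negative char cast straight to unsigned wraps to a value near UINT_MAX.

```cpp
#include <cctype>

// Hypothetical wrapper: safe to call with any char value, including
// negative ones on platforms where plain char is signed.
bool isPrintableByte(char c)
{
    // unsigned char keeps the value within 0..UCHAR_MAX, as isprint requires.
    return std::isprint(static_cast<unsigned char>(c)) != 0;
}
```

The result still depends on the current locale, so a byte above 127 may or may not count as printable.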

Answered by Adrian McCarthy

isprint depends on the locale, so the character in question must be printable in the current locale.

If you want strictly ASCII, check the range for [0..127]. If you want printable ASCII, check the range and isprint.
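
The two checks described above could be sketched as follows. The helper names isAscii and isPrintableAscii are hypothetical, and the range test is what keeps the second check from depending on how the locale treats bytes above 127.

```cpp
#include <cctype>

// Strict ASCII: any byte in [0..127], control characters included.
bool isAscii(char c)
{
    return static_cast<unsigned char>(c) <= 127;
}

// Printable ASCII: in range and also accepted by isprint.
bool isPrintableAscii(char c)
{
    unsigned char u = static_cast<unsigned char>(c);
    return u <= 127 && std::isprint(u) != 0;
}
```

Either predicate can be negated and passed to remove_if in place of invalidChar, depending on whether control characters such as tabs should survive.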
