Original question: http://stackoverflow.com/questions/1231764/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
NSString - Convert to pure alphabet only (i.e. remove accents+punctuation)
Asked by Peter Hosey
I'm trying to compare names without any punctuation, spaces, accents etc. At the moment I am doing the following:
-(NSString*) prepareString:(NSString*)a {
    //remove any accents and punctuation;
    a=[[[NSString alloc] initWithData:[a dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES] encoding:NSASCIIStringEncoding] autorelease];
    a=[a stringByReplacingOccurrencesOfString:@" " withString:@""];
    a=[a stringByReplacingOccurrencesOfString:@"'" withString:@""];
    a=[a stringByReplacingOccurrencesOfString:@"`" withString:@""];
    a=[a stringByReplacingOccurrencesOfString:@"-" withString:@""];
    a=[a stringByReplacingOccurrencesOfString:@"_" withString:@""];
    a=[a lowercaseString];
    return a;
}
However, I need to do this for hundreds of strings and I need to make this more efficient. Any ideas?
Answered by Peter N Lewis
NSString* finish = [[start componentsSeparatedByCharactersInSet:[[NSCharacterSet letterCharacterSet] invertedSet]] componentsJoinedByString:@""];
Answered by Peter Hosey
Before using any of these solutions, don't forget to use decomposedStringWithCanonicalMapping to decompose any accented letters. This will turn, for example, é (U+00E9) into e followed by a combining acute accent (U+0065 U+0301). Then, when you strip out the non-alphanumeric characters, the unaccented letters will remain.
The reason why this is important is that you probably don't want, say, “dän” and “dün”* to be treated as the same. If you stripped out all accented letters, as some of these solutions may do, you'll end up with “dn”, so those strings will compare as equal.
So, you should decompose them first, so that you can strip the accents and leave the letters.
*Example from German. Thanks to Joris Weimar for providing it.
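The decompose-then-strip idea can be sketched roughly as follows. The helper name LettersOnlyKey is hypothetical and not from any of the answers. One assumption worth flagging: to my understanding, letterCharacterSet covers Unicode marks (category M*) as well as letters, so this sketch explicitly subtracts nonBaseCharacterSet; otherwise the combining accents would survive the filter.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: reduces a name to lowercase base letters only.
static NSString *LettersOnlyKey(NSString *name) {
    // 1. Decompose, so "é" (U+00E9) becomes "e" + U+0301.
    NSString *decomposed = [name decomposedStringWithCanonicalMapping];

    // 2. Accept letters, but not combining marks (assumed to be part of
    //    letterCharacterSet, hence the explicit intersection).
    NSMutableCharacterSet *accepted = [NSMutableCharacterSet letterCharacterSet];
    [accepted formIntersectionWithCharacterSet:
        [[NSCharacterSet nonBaseCharacterSet] invertedSet]];

    // 3. Drop everything else and lowercase the result.
    NSArray *parts = [decomposed
        componentsSeparatedByCharactersInSet:[accepted invertedSet]];
    return [[parts componentsJoinedByString:@""] lowercaseString];
}
```

With this, “dän” and “dün” would reduce to different keys (“dan” and “dun”) instead of both collapsing to “dn”.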
Answered by Sophie Alpert
On a similar question, Ole Begemann suggests using stringByFoldingWithOptions: and I believe this is the best solution here:
NSString *accentedString = @"álgeBra";
NSString *unaccentedString = [accentedString stringByFoldingWithOptions:NSDiacriticInsensitiveSearch locale:[NSLocale currentLocale]];
Depending on the nature of the strings you want to convert, you might want to set a fixed locale (e.g. English) instead of using the user's current locale. That way, you can be sure to get the same results on every machine.
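A minimal sketch of the fixed-locale variant, assuming "en_US" is an acceptable locale for your data. The helper name FoldedName and the added NSCaseInsensitiveSearch option are my own choices for illustration, not part of the answer above.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: folds away diacritics and case using a fixed locale,
// so the result does not depend on the user's settings.
static NSString *FoldedName(NSString *name) {
    NSLocale *fixedLocale = [NSLocale localeWithLocaleIdentifier:@"en_US"];
    NSStringCompareOptions options =
        NSDiacriticInsensitiveSearch | NSCaseInsensitiveSearch;
    return [name stringByFoldingWithOptions:options locale:fixedLocale];
}
```

Folding only handles accents and case; punctuation and whitespace would still have to be removed separately, for example with the character-set filtering shown in the other answers.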
Answered by uchuugaka
If you are trying to compare strings, use one of these methods. Don't try to change the data.
- (NSComparisonResult)localizedCompare:(NSString *)aString
- (NSComparisonResult)localizedCaseInsensitiveCompare:(NSString *)aString
- (NSComparisonResult)compare:(NSString *)aString options:(NSStringCompareOptions)mask range:(NSRange)range locale:(id)locale
You NEED to consider the user's locale to do things right with strings, particularly things like names. In most languages, characters like ä and å are not the same, other than that they look similar. They are inherently distinct characters with meanings distinct from each other, but the actual rules and semantics are specific to each locale.
The correct way to compare and sort strings is by considering the user's locale. Anything else is naive, wrong and very 1990s. Stop doing it.
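As a hedged illustration of comparing without modifying the data, the sketch below uses compare:options:range:locale: with diacritic- and case-insensitive options. NamesMatch is a hypothetical helper; which options are appropriate depends on your data, and the result may vary between locales.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES if the two names are equal when case and
// diacritics are ignored under the given locale.
static BOOL NamesMatch(NSString *a, NSString *b, NSLocale *locale) {
    NSStringCompareOptions options =
        NSCaseInsensitiveSearch | NSDiacriticInsensitiveSearch;
    NSComparisonResult result = [a compare:b
                                   options:options
                                     range:NSMakeRange(0, [a length])
                                    locale:locale];
    return result == NSOrderedSame;
}
```

A caller would typically pass [NSLocale currentLocale] for user-facing comparisons. Note that this does not ignore punctuation or spaces.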
If you are trying to pass data to a system that cannot support non-ASCII, well, this is just the wrong thing to do. Pass it as data blobs.
Also normalize your strings first (see Peter Hosey's post), either precomposing or decomposing; basically, pick one normalized form.
- (NSString *)decomposedStringWithCanonicalMapping
- (NSString *)decomposedStringWithCompatibilityMapping
- (NSString *)precomposedStringWithCanonicalMapping
- (NSString *)precomposedStringWithCompatibilityMapping
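To illustrate why picking one normalized form matters, here is a small sketch in which the same visible character arrives in two different encodings. EqualAfterNormalization is a hypothetical helper name, and the choice of the precomposed canonical form is arbitrary; any of the four forms works as long as both sides use the same one.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: compares two strings after bringing both to the
// same (precomposed, canonical) normalization form.
static BOOL EqualAfterNormalization(NSString *a, NSString *b) {
    NSString *na = [a precomposedStringWithCanonicalMapping];
    NSString *nb = [b precomposedStringWithCanonicalMapping];
    return [na isEqualToString:nb];
}

// Example inputs: "é" as one code point, and as "e" + combining acute.
static NSString *PrecomposedE(void) { return @"\u00E9"; }
static NSString *DecomposedE(void)  { return @"e\u0301"; }
```

Here EqualAfterNormalization(PrecomposedE(), DecomposedE()) should report the two strings as equal even though their lengths in UTF-16 units differ (1 versus 2).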
No, it's not nearly as simple and easy as we tend to think. Yes, it requires informed and careful decision making. (A bit of non-English language experience helps.)
Answered by Frédéric Feytons
One important clarification on BillyTheKid18756's answer (it was corrected by Luiz, but this was not obvious from the explanation of the code):
DO NOT USE stringWithCString as a second step to remove accents. It can add unwanted characters at the end of your string, because the NSData is not NULL-terminated (as stringWithCString expects).
Or use it and append an additional NULL byte to your NSData, like Luiz did in his code.
I think a simpler answer is to replace:
NSString *sanitizedText = [NSString stringWithCString:[sanitizedData bytes] encoding:NSASCIIStringEncoding];
with:
NSString *sanitizedText = [[[NSString alloc] initWithData:sanitizedData encoding:NSASCIIStringEncoding] autorelease];
Taking BillyTheKid18756's code again, here is the complete corrected code:
// The input text
NSString *text = @"BûvérÈ!@$&%^&(*^(_()-*/48";
// Defining what characters to accept
NSMutableCharacterSet *acceptedCharacters = [[NSMutableCharacterSet alloc] init];
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet letterCharacterSet]];
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet decimalDigitCharacterSet]];
[acceptedCharacters addCharactersInString:@" _-.!"];
// Turn accented letters into normal letters (optional)
NSData *sanitizedData = [text dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
// Corrected back-conversion from NSData to NSString
NSString *sanitizedText = [[[NSString alloc] initWithData:sanitizedData encoding:NSASCIIStringEncoding] autorelease];
// Removing unaccepted characters
NSString* output = [[sanitizedText componentsSeparatedByCharactersInSet:[acceptedCharacters invertedSet]] componentsJoinedByString:@""];
Answered by Alex Reynolds
Consider using the RegexKit framework. You could do something like:
NSString *searchString = @"This is neat.";
NSString *regexString = @"[\\W]"; // the backslash must be escaped inside an Objective-C string literal
NSString *replaceWithString = @"";
NSString *replacedString = [searchString stringByReplacingOccurrencesOfRegex:regexString withString:replaceWithString];
NSLog (@"%@", replacedString);
//... Thisisneat
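If adding a third-party framework is not an option, a similar effect can probably be had with Foundation alone. The sketch below swaps RegexKit for NSString's built-in NSRegularExpressionSearch option; this is a different technique from the one in the answer, and StripNonWordCharacters is a hypothetical helper name.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: removes every non-word character (\W) using
// Foundation's regular-expression search option instead of RegexKit.
static NSString *StripNonWordCharacters(NSString *input) {
    return [input stringByReplacingOccurrencesOfString:@"\\W"
                                            withString:@""
                                               options:NSRegularExpressionSearch
                                                 range:NSMakeRange(0, [input length])];
}
```

Note that \W keeps digits and underscores; a pattern such as @"[^\\p{L}]" would be one way to keep letters only.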
Answered by Quinn Taylor
Consider using NSScanner, and specifically the methods -setCharactersToBeSkipped: (which accepts an NSCharacterSet) and -scanString:intoString: (which accepts a string and returns the scanned string by reference).
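One possible way to apply NSScanner to this problem is sketched below. It uses -scanCharactersFromSet:intoString: rather than the -scanString:intoString: method named above, because the goal is to collect runs of letters rather than match a fixed string; that substitution and the helper name LettersViaScanner are my own assumptions.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: collects only the letters of the input by telling
// the scanner to skip everything that is not a letter.
static NSString *LettersViaScanner(NSString *input) {
    NSCharacterSet *letters = [NSCharacterSet letterCharacterSet];
    NSScanner *scanner = [NSScanner scannerWithString:input];
    [scanner setCharactersToBeSkipped:[letters invertedSet]];

    NSMutableString *result =
        [NSMutableString stringWithCapacity:[input length]];
    NSString *chunk = nil;
    // Each call skips non-letters, then reads the next run of letters;
    // it returns NO once no letters remain.
    while ([scanner scanCharactersFromSet:letters intoString:&chunk]) {
        [result appendString:chunk];
    }
    return result;
}
```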
You may also want to couple this with -[NSString localizedCompare:], or perhaps -[NSString compare:options:] with the NSDiacriticInsensitiveSearch option. That could spare you from having to remove/replace accents, so you can focus on removing punctuation, whitespace, etc.
If you must use an approach like the one you presented in your question, at least use an NSMutableString and replaceOccurrencesOfString:withString:options:range:, which will be much more efficient than creating tons of nearly-identical autoreleased strings. It could be that just reducing the number of allocations will boost performance "enough" for the time being.
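The NSMutableString suggestion might look roughly like this. PrepareNameInPlace is a hypothetical name, and the list of characters to remove simply mirrors the one in the question; accent removal is left out here and would still need one of the approaches from the other answers.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: strips a fixed list of characters by editing one
// mutable string in place instead of creating a new string per step.
static NSString *PrepareNameInPlace(NSString *input) {
    NSMutableString *working = [NSMutableString stringWithString:input];
    NSArray *unwanted = @[@" ", @"'", @"`", @"-", @"_"];
    for (NSString *token in unwanted) {
        // The range is recomputed each pass because the length shrinks.
        [working replaceOccurrencesOfString:token
                                 withString:@""
                                    options:NSLiteralSearch
                                      range:NSMakeRange(0, [working length])];
    }
    return [working lowercaseString];
}
```

For example, PrepareNameInPlace(@"Mary-Jane O'Neil") would give "maryjaneoneil". Whether this is measurably faster for a few hundred strings is something to confirm with a profiler.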
Answered by Vegard
To give a complete example, combining the answers from Luiz and Peter and adding a few lines, you get the code below.
The code does the following:
- Creates a set of accepted characters
- Turns accented letters into normal letters
- Removes characters not in the set
Objective-C
// The input text
NSString *text = @"BûvérÈ!@$&%^&(*^(_()-*/48";
// Create set of accepted characters
NSMutableCharacterSet *acceptedCharacters = [[NSMutableCharacterSet alloc] init];
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet letterCharacterSet]];
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet decimalDigitCharacterSet]];
[acceptedCharacters addCharactersInString:@" _-.!"];
// Turn accented letters into normal letters (optional)
NSData *sanitizedData = [text dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
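// initWithData: is used because the NSData bytes are not NULL-terminated, which stringWithCString: would require
NSString *sanitizedText = [[[NSString alloc] initWithData:sanitizedData encoding:NSASCIIStringEncoding] autorelease];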
// Remove characters not in the set
NSString* output = [[sanitizedText componentsSeparatedByCharactersInSet:[acceptedCharacters invertedSet]] componentsJoinedByString:@""];
Swift (2.2) example
let text = "BûvérÈ!@$&%^&(*^(_()-*/48"
// Create set of accepted characters
let acceptedCharacters = NSMutableCharacterSet()
acceptedCharacters.formUnionWithCharacterSet(NSCharacterSet.letterCharacterSet())
acceptedCharacters.formUnionWithCharacterSet(NSCharacterSet.decimalDigitCharacterSet())
acceptedCharacters.addCharactersInString(" _-.!")
// Turn accented letters into normal letters (optional)
let sanitizedData = text.dataUsingEncoding(NSASCIIStringEncoding, allowLossyConversion: true)
let sanitizedText = String(data: sanitizedData!, encoding: NSASCIIStringEncoding)
// Remove characters not in the set
let components = sanitizedText!.componentsSeparatedByCharactersInSet(acceptedCharacters.invertedSet)
let output = components.joinWithSeparator("")
Output
The output for both examples would be: BuverE!_-48
Answered by Luiz Scheidegger
Just bumped into this, maybe it's too late, but here is what worked for me:
// text is the input string, and this just removes accents from the letters
// lossy encoding turns accented letters into normal letters;
// mutableCopy is needed because dataUsingEncoding: returns an immutable NSData
NSMutableData *sanitizedData = [[[text dataUsingEncoding:NSASCIIStringEncoding
                                    allowLossyConversion:YES] mutableCopy] autorelease];
// increasing the length by 1 appends a 0 byte (increaseLengthBy:
// guarantees to fill the new space with 0s), effectively turning
// sanitizedData into a C string
[sanitizedData increaseLengthBy:1];
// now we just create a string from the C string in sanitizedData
NSString *final = [NSString stringWithCString:[sanitizedData bytes] encoding:NSASCIIStringEncoding];
Answered by lorean
@interface NSString (Filtering)
// Returns a copy of the receiver with every character in charSet removed.
- (NSString*)stringByFilteringCharacters:(NSCharacterSet*)charSet;
@end
@implementation NSString (Filtering)
- (NSString*)stringByFilteringCharacters:(NSCharacterSet*)charSet {
    NSMutableString * mutString = [NSMutableString stringWithCapacity:[self length]];
    for (NSUInteger i = 0; i < [self length]; i++){
        // unichar (not char), so that non-ASCII UTF-16 units are not truncated
        unichar c = [self characterAtIndex:i];
        if(![charSet characterIsMember:c]) [mutString appendFormat:@"%C", c];
    }
    return [NSString stringWithString:mutString];
}
@end

