Original source: http://stackoverflow.com/questions/1105169/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
HTML character decoding in Objective-C / Cocoa Touch
Asked by treznik
First of all, I found this: Objective C HTML escape/unescape, but it doesn't work for me.
My encoded characters (they come from an RSS feed, btw) look like this: &#038;
I searched all over the net and found related discussions, but no fix for my particular encoding. I think they are called hexadecimal characters.
Accepted answer by Matt Bridges
Those are called Character Entity References. When they take the form of &#<number>; they are called numeric entity references. Basically, it's a string representation of the character code that should be substituted. In the case of &#038;, it represents the character with the value of 38, which is &. (Numeric references denote Unicode code points; for values up to 255 these coincide with the ISO-8859-1 character encoding scheme.)
The reason the ampersand has to be encoded in RSS is it's a reserved special character.
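To illustrate why the ampersand is reserved, here is a minimal sketch of escaping text before it is placed inside an RSS/XML element. The function name EscapeForXML is hypothetical (it is not from any library), and the sketch covers only the five predefined XML entities.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: escapes the five characters that XML reserves.
// "&" must be replaced first; otherwise the ampersands introduced by the
// later replacements would themselves be escaped a second time.
static NSString *EscapeForXML(NSString *text) {
    NSString *escaped = [text stringByReplacingOccurrencesOfString:@"&" withString:@"&amp;"];
    escaped = [escaped stringByReplacingOccurrencesOfString:@"<" withString:@"&lt;"];
    escaped = [escaped stringByReplacingOccurrencesOfString:@">" withString:@"&gt;"];
    escaped = [escaped stringByReplacingOccurrencesOfString:@"\"" withString:@"&quot;"];
    escaped = [escaped stringByReplacingOccurrencesOfString:@"'" withString:@"&apos;"];
    return escaped;
}
```

A feed producer that skipped this step would emit a bare & and the feed would no longer be well-formed XML, which is why consumers see sequences such as &#038; or &amp; instead.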
What you need to do is parse the string and replace the entities with the character matching the value between &# and ;. I don't know of any great ways to do this in Objective-C, but this Stack Overflow question might be of some help.
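The parse-and-replace step described above could be sketched with NSRegularExpression, which ships with Foundation. The function name DecodeNumericEntities is hypothetical, the sketch assumes ARC, and it handles only numeric references (decimal and hexadecimal), not named ones such as &amp;.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: decodes decimal (&#38;) and hexadecimal (&#x26;)
// numeric character references. Anything else is copied through unchanged.
static NSString *DecodeNumericEntities(NSString *input) {
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));"
                                                  options:0
                                                    error:NULL];
    NSMutableString *result = [NSMutableString stringWithCapacity:input.length];
    __block NSUInteger cursor = 0;

    [regex enumerateMatchesInString:input
                            options:0
                              range:NSMakeRange(0, input.length)
                         usingBlock:^(NSTextCheckingResult *match, NSMatchingFlags flags, BOOL *stop) {
        // Copy the literal text that precedes this reference.
        NSRange literal = NSMakeRange(cursor, match.range.location - cursor);
        [result appendString:[input substringWithRange:literal]];
        cursor = NSMaxRange(match.range);

        // Capture group 1 holds hex digits, group 2 holds decimal digits.
        NSRange hexRange = [match rangeAtIndex:1];
        BOOL isHex = (hexRange.location != NSNotFound);
        NSString *digits = [input substringWithRange:(isHex ? hexRange : [match rangeAtIndex:2])];
        unsigned long long value = strtoull(digits.UTF8String, NULL, isHex ? 16 : 10);

        // Keep the original text when the value is not a Unicode scalar value.
        BOOL valid = value > 0 && value <= 0x10FFFF && !(value >= 0xD800 && value <= 0xDFFF);
        NSString *decoded = nil;
        if (valid) {
            // Going through UTF-32 turns code points above U+FFFF into a
            // proper surrogate pair, which a single %C cannot do.
            uint32_t scalar = CFSwapInt32HostToLittle((uint32_t)value);
            decoded = [[NSString alloc] initWithBytes:&scalar
                                               length:sizeof(scalar)
                                             encoding:NSUTF32LittleEndianStringEncoding];
        }
        [result appendString:(decoded ?: [input substringWithRange:match.range])];
    }];

    // Copy whatever follows the last reference.
    [result appendString:[input substringFromIndex:cursor]];
    return result;
}
```

Calling DecodeNumericEntities(@"Tom &#038; Jerry") is intended to give back the string with a plain ampersand in place of the reference.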
Edit: Since answering this some two years ago there are some great solutions; see @Michael Waterfall's answer below.
Answer by Michael Waterfall
Check out my NSString category for HTML. Here are the methods available:
- (NSString *)stringByConvertingHTMLToPlainText;
- (NSString *)stringByDecodingHTMLEntities;
- (NSString *)stringByEncodingHTMLEntities;
- (NSString *)stringWithNewLinesAsBRs;
- (NSString *)stringByRemovingNewLinesAndWhitespace;
Answer by Walty Yeung
The one by Daniel is basically very nice, and I fixed a few issues there:
Removed the skipped-character set from the NSScanner (otherwise spaces between two consecutive entities would be ignored):
[scanner setCharactersToBeSkipped:nil];
Fixed the parsing when there are isolated '&' symbols (I am not sure what the 'correct' output for this is; I just compared it against Firefox):
e.g.
&#ABC DF & B&#39;  & C&#39; Items (288)
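The first fix can be seen in isolation with a small sketch. By default an NSScanner skips whitespace and newlines before each scan, so a space that sits between two entities is silently dropped; setting charactersToBeSkipped to nil preserves it. The helper name TextAfterFirstEntity is hypothetical and exists only for this demonstration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: consumes a leading "&amp;" and returns whatever text
// the scanner then finds before the next "&".
static NSString *TextAfterFirstEntity(NSString *input, BOOL keepDefaultSkipping) {
    NSScanner *scanner = [NSScanner scannerWithString:input];
    if (!keepDefaultSkipping) {
        // Same call as in the fix: treat every character as significant.
        [scanner setCharactersToBeSkipped:nil];
    }
    [scanner scanString:@"&amp;" intoString:NULL];

    // With the default skip set, the scanner first discards the space, then
    // finds "&" immediately, scans nothing, and returns NO.
    NSString *between = nil;
    BOOL found = [scanner scanUpToString:@"&" intoString:&between];
    return found ? between : @"";
}
```

For the input "&amp; &amp;", the default scanner yields an empty string between the two entities, while the scanner with skipping disabled yields the single space.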
Here is the modified code:
- (NSString *)stringByDecodingXMLEntities {
    NSUInteger myLength = [self length];
    NSUInteger ampIndex = [self rangeOfString:@"&" options:NSLiteralSearch].location;

    // Short-circuit if there are no ampersands.
    if (ampIndex == NSNotFound) {
        return self;
    }
    // Make result string with some extra capacity.
    NSMutableString *result = [NSMutableString stringWithCapacity:(myLength * 1.25)];

    // First iteration doesn't need to scan to & since we did that already, but for code simplicity's sake we'll do it again with the scanner.
    NSScanner *scanner = [NSScanner scannerWithString:self];

    [scanner setCharactersToBeSkipped:nil];

    NSCharacterSet *boundaryCharacterSet = [NSCharacterSet characterSetWithCharactersInString:@" \t\n\r;"];

    do {
        // Scan up to the next entity or the end of the string.
        NSString *nonEntityString = nil;
        if ([scanner scanUpToString:@"&" intoString:&nonEntityString]) {
            [result appendString:nonEntityString];
        }
        if ([scanner isAtEnd]) {
            goto finish;
        }
        // Scan either a HTML or numeric character entity reference.
        if ([scanner scanString:@"&amp;" intoString:NULL])
            [result appendString:@"&"];
        else if ([scanner scanString:@"&apos;" intoString:NULL])
            [result appendString:@"'"];
        else if ([scanner scanString:@"&quot;" intoString:NULL])
            [result appendString:@"\""];
        else if ([scanner scanString:@"&lt;" intoString:NULL])
            [result appendString:@"<"];
        else if ([scanner scanString:@"&gt;" intoString:NULL])
            [result appendString:@">"];
        else if ([scanner scanString:@"&#" intoString:NULL]) {
            BOOL gotNumber;
            unsigned charCode;
            NSString *xForHex = @"";

            // Is it hex or decimal?
            if ([scanner scanString:@"x" intoString:&xForHex]) {
                gotNumber = [scanner scanHexInt:&charCode];
            }
            else {
                gotNumber = [scanner scanInt:(int*)&charCode];
            }

            if (gotNumber) {
                [result appendFormat:@"%C", (unichar)charCode];
                [scanner scanString:@";" intoString:NULL];
            }
            else {
                NSString *unknownEntity = @"";
                [scanner scanUpToCharactersFromSet:boundaryCharacterSet intoString:&unknownEntity];
                [result appendFormat:@"&#%@%@", xForHex, unknownEntity];
                //[scanner scanUpToString:@";" intoString:&unknownEntity];
                //[result appendFormat:@"&#%@%@;", xForHex, unknownEntity];
                NSLog(@"Expected numeric character entity but got &#%@%@;", xForHex, unknownEntity);
            }
        }
        else {
            NSString *amp;
            [scanner scanString:@"&" intoString:&amp];  // an isolated & symbol
            [result appendString:amp];
            /*
            NSString *unknownEntity = @"";
            [scanner scanUpToString:@";" intoString:&unknownEntity];
            NSString *semicolon = @"";
            [scanner scanString:@";" intoString:&semicolon];
            [result appendFormat:@"%@%@", unknownEntity, semicolon];
            NSLog(@"Unsupported XML character entity %@%@", unknownEntity, semicolon);
            */
        }
    }
    while (![scanner isAtEnd]);

finish:
    return result;
}
Answer by Bryan Luby
As of iOS 7, you can decode HTML characters natively by using an NSAttributedString with the NSHTMLTextDocumentType attribute:
NSString *htmlString = @"&#63743; &amp; &#38; &lt; &gt; &trade; &copy; &hearts; &clubs; &spades; &diams;";
NSData *stringData = [htmlString dataUsingEncoding:NSUTF8StringEncoding];

NSDictionary *options = @{NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType};
NSAttributedString *decodedString;
decodedString = [[NSAttributedString alloc] initWithData:stringData
                                                 options:options
                                      documentAttributes:NULL
                                                   error:NULL];
The decoded attributed string will now be displayed as:  & & < > ™ © ♥ ♣ ♠ ♦.
Note: This will only work if called on the main thread.
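Given the main-thread restriction, one way to call this from arbitrary code is to hop to the main queue only when needed. The following is a sketch under stated assumptions: the helper name PlainTextFromHTML is hypothetical, ARC is assumed, and the explicit NSCharacterEncodingDocumentAttribute option is an addition that the answer above does not use.

```objc
#import <UIKit/UIKit.h>

// Hypothetical helper: decodes an HTML fragment and returns its plain text,
// making sure the NSAttributedString import runs on the main thread.
static NSString *PlainTextFromHTML(NSString *html) {
    __block NSString *plain = nil;

    void (^decode)(void) = ^{
        NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];
        NSDictionary *options = @{
            NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
            // Assumption: state the encoding so non-ASCII input is read as UTF-8.
            NSCharacterEncodingDocumentAttribute: @(NSUTF8StringEncoding)
        };
        NSAttributedString *attributed =
            [[NSAttributedString alloc] initWithData:data
                                             options:options
                                  documentAttributes:NULL
                                               error:NULL];
        plain = attributed.string;
    };

    if ([NSThread isMainThread]) {
        // Already on the main thread; dispatch_sync here would deadlock.
        decode();
    } else {
        // Blocks the calling thread until the main thread has finished.
        dispatch_sync(dispatch_get_main_queue(), decode);
    }
    return plain;
}
```

Because the background caller blocks until the main thread is free, this is best kept to short strings such as feed titles; for larger documents one of the scanner-based approaches on this page avoids the main thread altogether.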
Answer by Nikita Rybak
Nobody seems to mention one of the simplest options: Google Toolbox for Mac
(Despite the name, this works on iOS too.)
https://github.com/google/google-toolbox-for-mac/blob/master/Foundation/GTMNSString%2BHTML.h
/// Get a string where internal characters that are escaped for HTML are unescaped
//
///  For example, '&amp;' becomes '&'
///  Handles &#32; and &#x32; cases as well
///
//  Returns:
//    Autoreleased NSString
//
- (NSString *)gtm_stringByUnescapingFromHTML;
And I had to include only three files in the project: the header, the implementation, and GTMDefines.h.
Answer by Daniel Dickison
I ought to post this on GitHub or something. This goes in a category of NSString, uses NSScanner for the implementation, and handles both hex and decimal numeric character entities as well as the usual symbolic ones.
Also, it handles malformed strings (when you have an & followed by an invalid sequence of characters) relatively gracefully, which turned out to be crucial in my released app that uses this code.
- (NSString *)stringByDecodingXMLEntities {
    NSUInteger myLength = [self length];
    NSUInteger ampIndex = [self rangeOfString:@"&" options:NSLiteralSearch].location;

    // Short-circuit if there are no ampersands.
    if (ampIndex == NSNotFound) {
        return self;
    }
    // Make result string with some extra capacity.
    NSMutableString *result = [NSMutableString stringWithCapacity:(myLength * 1.25)];

    // First iteration doesn't need to scan to & since we did that already, but for code simplicity's sake we'll do it again with the scanner.
    NSScanner *scanner = [NSScanner scannerWithString:self];
    do {
        // Scan up to the next entity or the end of the string.
        NSString *nonEntityString = nil;
        if ([scanner scanUpToString:@"&" intoString:&nonEntityString]) {
            [result appendString:nonEntityString];
        }
        if ([scanner isAtEnd]) {
            goto finish;
        }
        // Scan either a HTML or numeric character entity reference.
        if ([scanner scanString:@"&amp;" intoString:NULL])
            [result appendString:@"&"];
        else if ([scanner scanString:@"&apos;" intoString:NULL])
            [result appendString:@"'"];
        else if ([scanner scanString:@"&quot;" intoString:NULL])
            [result appendString:@"\""];
        else if ([scanner scanString:@"&lt;" intoString:NULL])
            [result appendString:@"<"];
        else if ([scanner scanString:@"&gt;" intoString:NULL])
            [result appendString:@">"];
        else if ([scanner scanString:@"&#" intoString:NULL]) {
            BOOL gotNumber;
            unsigned charCode;
            NSString *xForHex = @"";

            // Is it hex or decimal?
            if ([scanner scanString:@"x" intoString:&xForHex]) {
                gotNumber = [scanner scanHexInt:&charCode];
            }
            else {
                gotNumber = [scanner scanInt:(int*)&charCode];
            }
            if (gotNumber) {
                [result appendFormat:@"%C", (unichar)charCode];
            }
            else {
                NSString *unknownEntity = @"";
                [scanner scanUpToString:@";" intoString:&unknownEntity];
                [result appendFormat:@"&#%@%@;", xForHex, unknownEntity];
                NSLog(@"Expected numeric character entity but got &#%@%@;", xForHex, unknownEntity);
            }
            [scanner scanString:@";" intoString:NULL];
        }
        else {
            NSString *unknownEntity = @"";
            [scanner scanUpToString:@";" intoString:&unknownEntity];
            NSString *semicolon = @"";
            [scanner scanString:@";" intoString:&semicolon];
            [result appendFormat:@"%@%@", unknownEntity, semicolon];
            NSLog(@"Unsupported XML character entity %@%@", unknownEntity, semicolon);
        }
    }
    while (![scanner isAtEnd]);

finish:
    return result;
}
Answer by realsugar
This is the way I do it using the RegexKitLite framework:
- (NSString *)decodeHtmlUnicodeCharacters:(NSString *)html {
    NSString *result = [html copy];
    // Capture group 1 holds the decimal digits of each &#NNN; reference.
    NSArray *matches = [result arrayOfCaptureComponentsMatchedByRegex:@"\\&#([\\d]+);"];

    if (![matches count])
        return result;

    for (int i = 0; i < [matches count]; i++) {
        NSArray *array = [matches objectAtIndex:i];
        NSString *charCode = [array objectAtIndex:1];
        int code = [charCode intValue];
        NSString *character = [NSString stringWithFormat:@"%C", (unichar)code];
        result = [result stringByReplacingOccurrencesOfString:[array objectAtIndex:0]
                                                   withString:character];
    }
    return result;
}
Hope this will help someone.
Answer by Krishna Gupta
You can use just this function to solve this problem.
+ (NSString *)decodeHtmlUnicodeCharactersToString:(NSString *)str
{
    NSMutableString *string = [NSMutableString stringWithString:str];
    NSUInteger i = 0;
    while (i + 2 < [string length])
    {
        // Look for the "&#" that starts a decimal reference such as &#39;
        if ([string characterAtIndex:i] == '&' && [string characterAtIndex:i + 1] == '#')
        {
            // Collect the decimal digits that follow "&#" (any count, not just two).
            NSUInteger k = i + 2;
            while (k < [string length])
            {
                unichar c = [string characterAtIndex:k];
                if (c < '0' || c > '9')
                    break;
                ++k;
            }
            // A valid reference has at least one digit and ends with ';'.
            if (k > i + 2 && k < [string length] && [string characterAtIndex:k] == ';')
            {
                // read integer value i.e, 39
                NSString *unicodeStr = [string substringWithRange:NSMakeRange(i + 2, k - (i + 2))];
                NSString *replaceStr = [NSString stringWithFormat:@"%C", (unichar)[unicodeStr intValue]];
                // &#39; replace with '
                [string replaceCharactersInRange:NSMakeRange(i, k - i + 1) withString:replaceStr];
            }
        }
        // Step past the current (possibly just decoded) character.
        ++i;
    }
    return string;
}
Answer by Max Chuquimia
Here's a Swift version of Walty Yeung's answer:
extension String {
    // Named HTML entities and the characters they stand for.
    static private let mappings = [
        "&quot;" : "\"", "&amp;" : "&", "&lt;" : "<", "&gt;" : ">", "&nbsp;" : "\u{00A0}",
        "&iexcl;" : "¡", "&cent;" : "¢", "&pound;" : "£", "&curren;" : "¤", "&yen;" : "¥",
        "&brvbar;" : "¦", "&sect;" : "§", "&uml;" : "¨", "&copy;" : "©", "&ordf;" : "ª",
        "&laquo;" : "«", "&not;" : "¬", "&reg;" : "®", "&macr;" : "¯", "&deg;" : "°",
        "&plusmn;" : "±", "&sup2;" : "²", "&sup3;" : "³", "&acute;" : "´", "&micro;" : "µ",
        "&para;" : "¶", "&middot;" : "·", "&cedil;" : "¸", "&sup1;" : "¹", "&ordm;" : "º",
        "&raquo;" : "»", "&frac14;" : "¼", "&frac12;" : "½", "&frac34;" : "¾", "&iquest;" : "¿",
        "&times;" : "×", "&divide;" : "÷", "&ETH;" : "Ð", "&eth;" : "ð", "&THORN;" : "Þ",
        "&thorn;" : "þ", "&AElig;" : "Æ", "&aelig;" : "æ", "&OElig;" : "Œ", "&oelig;" : "œ",
        "&Aring;" : "Å", "&Oslash;" : "Ø", "&Ccedil;" : "Ç", "&ccedil;" : "ç", "&szlig;" : "ß",
        "&Ntilde;" : "Ñ", "&ntilde;" : "ñ",
    ]

    func stringByDecodingXMLEntities() -> String {
        guard let _ = self.rangeOfString("&", options: [.LiteralSearch]) else {
            return self
        }

        var result = ""

        let scanner = NSScanner(string: self)
        scanner.charactersToBeSkipped = nil

        let boundaryCharacterSet = NSCharacterSet(charactersInString: " \t\n\r;")

        repeat {
            var nonEntityString: NSString? = nil

            if scanner.scanUpToString("&", intoString: &nonEntityString) {
                if let s = nonEntityString as? String {
                    result.appendContentsOf(s)
                }
            }

            if scanner.atEnd {
                break
            }

            var didBreak = false
            for (k, v) in String.mappings {
                if scanner.scanString(k, intoString: nil) {
                    result.appendContentsOf(v)
                    didBreak = true
                    break
                }
            }

            if !didBreak {
                if scanner.scanString("&#", intoString: nil) {
                    var gotNumber = false
                    var charCodeUInt: UInt32 = 0
                    var charCodeInt: Int32 = -1
                    var xForHex: NSString? = nil

                    if scanner.scanString("x", intoString: &xForHex) {
                        gotNumber = scanner.scanHexInt(&charCodeUInt)
                    }
                    else {
                        gotNumber = scanner.scanInt(&charCodeInt)
                    }

                    if gotNumber {
                        let newChar = String(format: "%C", (charCodeInt > -1) ? charCodeInt : charCodeUInt)
                        result.appendContentsOf(newChar)
                        scanner.scanString(";", intoString: nil)
                    }
                    else {
                        var unknownEntity: NSString? = nil
                        scanner.scanUpToCharactersFromSet(boundaryCharacterSet, intoString: &unknownEntity)
                        let h = xForHex ?? ""
                        let u = unknownEntity ?? ""
                        result.appendContentsOf("&#\(h)\(u)")
                    }
                }
                else {
                    // An isolated & symbol.
                    scanner.scanString("&", intoString: nil)
                    result.appendContentsOf("&")
                }
            }
        } while (!scanner.atEnd)

        return result
    }
}
Answer by angelos.p
Actually, the great MWFeedParser framework by Michael Waterfall (referred to in his answer) has been forked by rmchaara, who has updated it with ARC support!
You can find it on GitHub.
It really works great; I used the stringByDecodingHTMLEntities method and it works flawlessly.