iOS HTML Unicode to NSString?

December 27, 2023

I'm in the process of porting an Android app to iOS and I've hit a small roadblock. I'm pulling HTML encoded data from a webpage but some of the data is presented in Unicode to di

Solution 1:

It's pretty easy to write your own HTML entity decoder. Just scan the string looking for &, read up to the following ;, then interpret the result. If it's "amp", "lt", "gt", or "quot", replace it with the relevant character. If it starts with #, it's a numeric entity: if the # is followed by an "x", treat the rest as hexadecimal, otherwise as decimal. Read the number, and then insert the character into your string (if you're writing to an NSMutableString you can use [str appendFormat:@"%C", thechar]). NSScanner can make the string scanning pretty easy, especially since it already knows how to read hex numbers.

I just whipped up a function that should do this for you. Note, I haven't actually tested this, so you should run it through its paces:

- (NSString *)stringByDecodingHTMLEntitiesInString:(NSString *)input {
    NSMutableString *results = [NSMutableString string];
    NSScanner *scanner = [NSScanner scannerWithString:input];
    [scanner setCharactersToBeSkipped:nil];
    while (![scanner isAtEnd]) {
        NSString *temp;
        if ([scanner scanUpToString:@"&" intoString:&temp]) {
            [results appendString:temp];
        }
        if ([scanner scanString:@"&" intoString:NULL]) {
            BOOL valid = YES;
            unsigned c = 0;
            // index of the & itself, so an invalid entity can be emitted whole
            NSUInteger savedLocation = [scanner scanLocation] - 1;
            if ([scanner scanString:@"#" intoString:NULL]) {
                // it's a numeric entity
                if ([scanner scanString:@"x" intoString:NULL]) {
                    // hexadecimal
                    unsigned int value;
                    if ([scanner scanHexInt:&value]) {
                        c = value;
                    } else {
                        valid = NO;
                    }
                } else {
                    // decimal
                    int value;
                    if ([scanner scanInt:&value] && value >= 0) {
                        c = value;
                    } else {
                        valid = NO;
                    }
                }
                if (![scanner scanString:@";" intoString:NULL]) {
                    // not ;-terminated, bail out and emit the whole entity
                    valid = NO;
                }
            } else {
                if (![scanner scanUpToString:@";" intoString:&temp]) {
                    // &; is not a valid entity
                    valid = NO;
                } else if (![scanner scanString:@";" intoString:NULL]) {
                    // there was no trailing ;
                    valid = NO;
                } else if ([temp isEqualToString:@"amp"]) {
                    c = '&';
                } else if ([temp isEqualToString:@"quot"]) {
                    c = '"';
                } else if ([temp isEqualToString:@"lt"]) {
                    c = '<';
                } else if ([temp isEqualToString:@"gt"]) {
                    c = '>';
                } else {
                    // unknown entity
                    valid = NO;
                }
            }
            if (!valid) {
                // we errored, just emit the whole thing raw
                [results appendString:[input substringWithRange:NSMakeRange(savedLocation, [scanner scanLocation]-savedLocation)]];
            } else {
                [results appendFormat:@"%C", c];
            }
        }
    }
    return results;
}
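
One thing to watch: the function above appends each decoded character with %C, which takes a single unichar (one UTF-16 unit). A reference to a code point above U+FFFF, such as &#x1F600;, needs a surrogate pair instead. Below is a minimal sketch of one way to handle that, assuming ARC. AppendCodePoint is a hypothetical helper name, not part of the original answer.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: appends one Unicode code point to a mutable string.
// Returns NO (and appends nothing) if the value is not a valid Unicode scalar.
static BOOL AppendCodePoint(NSMutableString *str, uint32_t codePoint) {
    if (codePoint > 0x10FFFF || (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
        return NO; // out of range, or a lone surrogate
    }
    unichar units[2];
    NSUInteger count;
    if (codePoint <= 0xFFFF) {
        // Fits in a single UTF-16 unit.
        units[0] = (unichar)codePoint;
        count = 1;
    } else {
        // Split into a high/low surrogate pair.
        uint32_t v = codePoint - 0x10000;
        units[0] = (unichar)(0xD800 + (v >> 10));
        units[1] = (unichar)(0xDC00 + (v & 0x3FF));
        count = 2;
    }
    [str appendString:[NSString stringWithCharacters:units length:count]];
    return YES;
}
```

In the decoder, the final appendFormat call could be swapped for AppendCodePoint(results, c), treating a NO return the same way as an invalid entity.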

Solution 2:

The &#(number); construct in HTML (and XML) is known as a character reference. It's not Unicode-specific, other than in that all characters in HTML are defined in terms of Unicode, whether included verbatim or encoded as a character or entity reference. (Entity references are the named ones that look like &eacute; or &amp;, and if you are scraping an HTML page you will certainly have to deal with those as well.)

There isn't a function in the standard library for decoding character or entity references. See this question for approaches to decoding HTML text content. If you only have character references and the standard XML entities like &amp;, you can get away with leveraging NSXMLParser to parse an <element>+yourstring+</element>, but this won't handle HTML-specific entities like &eacute;.
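
The NSXMLParser trick can be sketched roughly as follows, assuming ARC and UTF-8 input. TextCollector and DecodeXMLReferences are hypothetical names chosen for illustration. The wrapper element name is arbitrary.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical delegate that accumulates the text nodes it is handed.
@interface TextCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableString *text;
@end

@implementation TextCollector
- (instancetype)init {
    if ((self = [super init])) {
        _text = [NSMutableString string];
    }
    return self;
}
- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    // May be called several times for one text node, so append.
    [self.text appendString:string];
}
@end

// Wraps the input in a dummy element and lets the XML parser resolve
// character references and the predefined XML entities.
// Returns nil if the wrapped input is not well-formed XML.
static NSString *DecodeXMLReferences(NSString *input) {
    NSString *wrapped = [NSString stringWithFormat:@"<element>%@</element>", input];
    NSData *data = [wrapped dataUsingEncoding:NSUTF8StringEncoding];
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
    TextCollector *collector = [[TextCollector alloc] init];
    parser.delegate = collector;
    return [parser parse] ? [collector.text copy] : nil;
}
```

A call such as DecodeXMLReferences(@"caf&#233; &amp; more") should give back the decoded text. Input containing a bare & or an HTML-only entity is expected to make the parse fail, which is the limitation described above.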

In general, screen-scraping is best done using a proper HTML parser, rather than string-hacking. This will convert all text content into text nodes, converting the character and entity references as it goes. However, again, there is no HTML parser available in the standard library. If the target page is well-formed standalone XHTML you can again use NSXMLParser. Otherwise you might like to try libxml2, which offers an HTML parser as well as XML. See this question for some background.
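
As a rough illustration of the libxml2 route, the sketch below parses an HTML fragment and returns its text content. It assumes the project links against libxml2 and has its headers on the search path. TextContentOfHTML is a hypothetical helper name.

```objective-c
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/tree.h>

// Hypothetical helper: parses HTML and returns the concatenated text of all
// nodes, with character and entity references resolved, or nil on failure.
static NSString *TextContentOfHTML(NSString *html) {
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];
    htmlDocPtr doc = htmlReadMemory(data.bytes, (int)data.length, NULL, "UTF-8",
                                    HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL) {
        return nil;
    }

    NSString *result = nil;
    xmlNodePtr root = xmlDocGetRootElement(doc);
    if (root != NULL) {
        // xmlNodeGetContent allocates; the buffer must be released with xmlFree.
        xmlChar *content = xmlNodeGetContent(root);
        if (content != NULL) {
            result = [NSString stringWithUTF8String:(const char *)content];
            xmlFree(content);
        }
    }
    xmlFreeDoc(doc);
    return result;
}
```

Taking the content of the root node is the bluntest option, since it also includes text inside elements such as script or style. A real scraper would more likely walk the tree or use XPath to pick out the nodes it wants.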

Solution 3:

If you get data from a website, you will have an NS(Mutable)Data object as your receive buffer. You just have to convert that NSData into an NSString via NSString *myString = [[NSString alloc] initWithData:myRecvData encoding:NSUnicodeStringEncoding]; if your server is sending Unicode (NSUnicodeStringEncoding means UTF-16). If your server is sending UTF-8 or another encoding, adjust the string encoding in your receiving code to match, for example NSUTF8StringEncoding.

Here is a list of all supported string encoding types.

Edit: take a look at this SO thread.
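
A minimal sketch of the NSData-to-NSString conversion, assuming ARC. StringFromResponseData is a hypothetical helper, and trying UTF-8 first is only one possible policy. If the server declares a charset, it is better to use that.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: decodes a response body, trying UTF-8 first and
// falling back to a caller-supplied encoding.
static NSString *StringFromResponseData(NSData *data, NSStringEncoding fallback) {
    NSString *s = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
    if (s == nil) {
        // initWithData:encoding: returns nil when the bytes are not valid
        // in the requested encoding, so try the fallback.
        s = [[NSString alloc] initWithData:data encoding:fallback];
    }
    return s;
}
```

Note that this only turns bytes into characters. Any &#...; references in the resulting string are still literal text and need one of the decoding approaches from the other answers.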
