[RESOLVED] Decoding characters obtained from web pages

**Niya** · Apr 9th, 2013, 11:12 AM

wow.....I don't have a clue what you did there lol

**Evil_Giraffe** · Apr 9th, 2013, 04:05 PM

Originally Posted by Niya

wow.....I don't have a clue what you did there lol

First up, I use StringInfo.GetTextElementEnumerator to get a TextElementEnumerator (a non-generic IEnumerator) whose elements are strings, each holding a single grapheme - what you and I think of as a single character. A grapheme may be several Unicode code points, and each code point is potentially (but usually not) two Char instances.
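To make the Char / code point / grapheme distinction concrete, here is a minimal sketch. The module name and the `CountTextElements` helper are made up for illustration; only the `StringInfo`, `TextElementEnumerator` and `Char` calls come from the framework.

```vbnet
Imports System
Imports System.Globalization

Module GraphemeDemo
    ' Hypothetical helper: counts text elements by walking the enumerator by hand.
    Function CountTextElements(s As String) As Integer
        Dim count As Integer = 0
        Dim e As TextElementEnumerator = StringInfo.GetTextElementEnumerator(s)
        While e.MoveNext()
            count += 1
        End While
        Return count
    End Function

    Sub Main()
        ' One grapheme made of two code points: "e" + combining acute accent (U+0301).
        Dim accented As String = "e" & ChrW(&H301)
        ' One grapheme made of one code point that needs two Chars (a surrogate pair).
        Dim clef As String = Char.ConvertFromUtf32(&H1D11E)

        Console.WriteLine("{0} Char(s), {1} text element(s)", accented.Length, CountTextElements(accented))
        Console.WriteLine("{0} Char(s), {1} text element(s)", clef.Length, CountTextElements(clef))
    End Sub
End Module
```

Both strings report two Chars but only one text element, which is why indexing a String by Char is not the same as indexing it by character.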
Because it's an IEnumerator, not an IEnumerable, I can't use For Each. (It would probably be two minutes' work to write a function that wraps an IEnumerator in an IEnumerable, or even to find the implementation that's in the BCL already, but meh.)
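For what it's worth, one way such a wrapper might look, assuming a compiler with iterator support (VB 11 / Visual Studio 2012 or later). The `TextElements` function name is invented here; it is not a BCL method.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Globalization

Module TextElementExtensions
    ' Hypothetical wrapper: exposes the text elements of a string as an
    ' IEnumerable(Of String) so callers can use For Each.
    Public Iterator Function TextElements(source As String) As IEnumerable(Of String)
        Dim e As TextElementEnumerator = StringInfo.GetTextElementEnumerator(source)
        While e.MoveNext()
            Yield e.GetTextElement()
        End While
    End Function

    Sub Main()
        ' "cafe" followed by a combining acute accent: five Chars, four text elements.
        For Each element As String In TextElements("cafe" & ChrW(&H301))
            Console.WriteLine(element.Length)
        Next
    End Sub
End Module
```

Because the function is an iterator, nothing is enumerated until the For Each starts pulling values.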
For each element I normalise it using Form D - this decomposes the string into a plain base character followed by combining characters (such as "e" followed by "acute accent" rather than the composed "e with acute accent" code point). We then take the first Char from that string, which is the base character, and drop the rest. Note that if the base character needs more than one Char to encode, this corrupts the string, hence my warning about Surrogate Pairs. That would be easy enough to account for if I could be bothered, given that Char has a shared IsSurrogatePair method.
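A rough sketch of the whole approach with the surrogate pair case handled, in case it helps anyone landing here later. This is not the code from the earlier post; `StripCombiningMarks` and the module name are hypothetical, and it assumes the input is well-formed UTF-16 (String.Normalize throws an ArgumentException on invalid Unicode such as a lone surrogate).

```vbnet
Imports System
Imports System.Globalization
Imports System.Text

Module AccentStripper
    ' Hypothetical helper: keeps only the base character of each text element.
    Public Function StripCombiningMarks(input As String) As String
        Dim result As New StringBuilder(input.Length)
        Dim e As TextElementEnumerator = StringInfo.GetTextElementEnumerator(input)
        While e.MoveNext()
            ' Form D splits a composed character into base + combining marks.
            Dim decomposed As String = e.GetTextElement().Normalize(NormalizationForm.FormD)
            If Char.IsSurrogatePair(decomposed, 0) Then
                ' The base character is outside the BMP: keep both Chars of the pair.
                result.Append(decomposed, 0, 2)
            Else
                ' The base character fits in a single Char.
                result.Append(decomposed(0))
            End If
        End While
        Return result.ToString()
    End Function

    Sub Main()
        ' ChrW(&HE9) is the composed "e with acute accent" code point.
        Console.WriteLine(StripCombiningMarks("Caf" & ChrW(&HE9))) ' prints "Cafe"
    End Sub
End Module
```

Char.IsSurrogatePair(String, Integer) returns False when there is no Char after the given index, so single-Char elements fall through to the Else branch safely.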
