-
Apr 9th, 2013, 11:12 AM
#1
Re: Decoding characters obtained from web pages
wow.....I don't have a clue what you did there lol
-
Apr 9th, 2013, 04:05 PM
#2
Re: Decoding characters obtained from web pages
 Originally Posted by Niya
wow.....I don't have a clue what you did there lol
First up, I use StringInfo.GetTextElementEnumerator to get a TextElementEnumerator (a non-generic IEnumerator) whose elements are Strings, each holding a single grapheme: what you and I think of as a single character. A grapheme may be several Unicode code points, and each code point is either one Char or, for code points outside the Basic Multilingual Plane, a surrogate pair of two Chars (possible, but not likely in practice).
Because it's an IEnumerator and not an IEnumerable, I can't use For Each. It would probably be two minutes' work to write a function that wraps an IEnumerator in an IEnumerable, or even to find the implementation that's in the BCL already, but meh.
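For what it's worth, a minimal sketch of such a wrapper might look like the following. It assumes VB 11 or later (for Iterator functions), and the name TextElements is made up for illustration; it is not something from the BCL.

```vb
Imports System.Collections.Generic
Imports System.Globalization

Module TextElementExtensions
    ' Hypothetical helper: exposes the text elements (graphemes) of a string
    ' as an IEnumerable(Of String) so that For Each can be used on them.
    Iterator Function TextElements(input As String) As IEnumerable(Of String)
        Dim elements As TextElementEnumerator = StringInfo.GetTextElementEnumerator(input)
        While elements.MoveNext()
            ' GetTextElement returns the current grapheme as a String.
            Yield elements.GetTextElement()
        End While
    End Function
End Module
```

With that in place, `For Each element In TextElements(someString)` works as you'd expect.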
For each element I normalise it using Form D. This decomposes the element into a plain base character followed by its combining characters (such as "e" followed by "combining acute accent" rather than the single composed "e with acute accent" code point). We then take the first Char from that string, which is the base character, and discard the rest. Note that if the base character needs more than one Char to encode, this corrupts the string, hence my warning about surrogate pairs. That would be easy enough to account for if I could be bothered, given that Char has a shared IsSurrogatePair method.
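Put together, the steps above can be sketched roughly like this. Treat it as an illustration rather than the exact code from the earlier post: the name StripDiacritics is made up, and the surrogate-pair check is the extra bit I said I couldn't be bothered with.

```vb
Imports System.Globalization
Imports System.Text

Module AccentStripper
    ' Hypothetical helper: keeps only the base character of each grapheme.
    Function StripDiacritics(input As String) As String
        Dim result As New StringBuilder(input.Length)
        Dim elements As TextElementEnumerator = StringInfo.GetTextElementEnumerator(input)
        While elements.MoveNext()
            ' Form D splits a composed character into base + combining marks.
            Dim decomposed As String = elements.GetTextElement().Normalize(NormalizationForm.FormD)
            If Char.IsSurrogatePair(decomposed, 0) Then
                ' The base character occupies two Chars, so keep both.
                result.Append(decomposed, 0, 2)
            Else
                ' The base character is a single Char.
                result.Append(decomposed(0))
            End If
        End While
        Return result.ToString()
    End Function
End Module
```

For example, `StripDiacritics("café")` gives "cafe", because "é" decomposes to "e" plus a combining acute accent and only the "e" is kept.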