Thread: [RESOLVED] Decoding characters obtained from web pages

  1. #1
    Niya (Angel of Code)
    Join Date: Nov 2011
    Posts: 9,017

    Re: Decoding characters obtained from web pages

    wow.....I don't have a clue what you did there lol

  2. #2
    Evil_Giraffe (PowerPoster)
    Join Date: Aug 2002
    Location: Suffolk, UK
    Posts: 2,555

    Re: Decoding characters obtained from web pages

    Originally posted by Niya:
    "wow.....I don't have a clue what you did there lol"

    First up, I use StringInfo.GetTextElementEnumerator to get a TextElementEnumerator (a non-generic IEnumerator) whose elements are strings, each holding a single grapheme: what you and I think of as a single character. A grapheme may be several Unicode code points, and each code point may (though it usually won't) take two Char instances.
    Because it's an IEnumerator, not an IEnumerable, I can't use For Each. It would probably be two minutes' work to write a function that wraps an IEnumerator in an IEnumerable, or even to find an implementation that's already in the BCL, but meh.
    I normalise each element using Form D (NormalizationForm.FormD). This decomposes it into a plain base character followed by combining characters, such as "e" followed by "combining acute accent" rather than the composed "e with acute accent" code point. We then take the first Char from that string, which is the base character, and leave the rest. Note that if the base character needs more than one Char to encode, this corrupts the string, hence my warning about surrogate pairs. That would be easy enough to account for if I could be bothered, given that Char has an IsSurrogatePair method.
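For anyone who wants to see the idea in isolation, here is a rough sketch of the same approach in Python. It is not the VB.NET code from earlier in the thread, only an analogue. The function name base_characters is made up for illustration. It also simplifies one step: instead of walking text elements and keeping the first Char of each, it decomposes the whole string and drops every combining mark, which gives the same result for ordinary accented letters. Python strings are sequences of code points rather than UTF-16 Chars, so the surrogate pair problem does not come up here.

```python
import unicodedata


def base_characters(text):
    """Return text with accents removed, keeping only base characters.

    Hypothetical helper for illustration: decompose to Normalization
    Form D, then drop every code point whose category is a mark (M*).
    """
    # NFD splits a composed character such as U+00E9 into
    # "e" (U+0065) followed by the combining acute accent (U+0301).
    decomposed = unicodedata.normalize("NFD", text)

    # Categories Mn, Mc and Me are combining marks; everything else is kept.
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


if __name__ == "__main__":
    print(base_characters("caf\u00e9"))   # cafe
    print(base_characters("e\u0301"))     # e
```

Characters with no decomposition (for example "ø" or "ß") pass through unchanged, since Form D has nothing to split off.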
