Thread: [RESOLVED] Decoding characters obtained from web pages

  1. #1

    Thread Starter
    Super Moderator si_the_geek's Avatar
    Join Date
    Jul 2002
    Location
    Bristol, UK
    Posts
    41,974

    [RESOLVED] Decoding characters obtained from web pages

    I am downloading pages from the web using a variety of methods, including this:
    Code:
    Dim URLString As String = "http://www.example.com/page.htm"
    Dim MyWebClient As Net.WebClient = New Net.WebClient()
    Dim HTML As String = MyWebClient.DownloadString(URLString)
    No matter which method is used for downloading, quite often the HTML contains characters that are encoded somehow, such as: Ã© and Âº and â€˜

    If I save the text to a file, eg:
    Code:
    My.Computer.FileSystem.WriteAllText(filePath, HTML, False, System.Text.Encoding.Default)
    ...some of the issues are dealt with (eg: Ã© becomes é, Âº becomes º, and â€˜ becomes ‘)

    Is there a way I can do this conversion without saving to a file and reloading? (or hard-coding conversions as I find them!)

    I've tried several things with no luck, including this:
    Code:
        Dim encodedBytes As Byte() = System.Text.UTF8Encoding.UTF8.GetBytes(HTML)
        Dim decodedString As String = System.Text.UTF8Encoding.UTF8.GetString(encodedBytes)


    If possible, I'd also like to convert characters with accents etc to their 'simple' character (eg: instead of é and Ø I'd like to get e and O)

  2. #2
    PowerPoster Evil_Giraffe's Avatar
    Join Date
    Aug 2002
    Location
    Suffolk, UK
    Posts
    2,555

    Re: Decoding characters obtained from web pages

    You need to either set the Encoding property of the WebClient to the appropriate encoding, or use DownloadData to get a byte array and then use an Encoding to convert it as a separate step. The latter might be preferable if you are not sure of the encoding: read the beginning of the HTML document until you reach the encoding declaration (<meta http-equiv="Content-Type" content="text/html; charset=XXX">). You should be able to read at least that far in most encodings, no matter what the document is actually encoded in. Once you know the declared encoding, you can re-interpret the bytes accordingly without needing to re-download the data (since you've got the bytes locally).
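
    A minimal sketch of the DownloadData approach, not a definitive implementation. DetectEncoding and DownloadHtml are hypothetical helper names, and the 4096-byte look-ahead, the simple regular expression, and the UTF-8 fallback are all assumptions. The pattern matches the first "charset=" found anywhere in that window, and it will not find a declaration in a UTF-16 page.

```vbnet
Imports System.Net
Imports System.Text
Imports System.Text.RegularExpressions

Module CharsetSniffing

    ' Hypothetical helper: look for a charset declaration near the start of the
    ' raw bytes and return the matching Encoding (UTF-8 if none is usable).
    Function DetectEncoding(ByVal rawBytes As Byte()) As Encoding
        ' ISO-8859-1 maps every byte to exactly one Char and never fails, so an
        ' ASCII-compatible <meta> declaration stays readable whatever the real
        ' encoding of the page is.
        Dim headLength As Integer = Math.Min(rawBytes.Length, 4096)
        Dim head As String = Encoding.GetEncoding("ISO-8859-1").GetString(rawBytes, 0, headLength)

        Dim m As Match = Regex.Match(head, "charset\s*=\s*[""']?([A-Za-z0-9_\-]+)", RegexOptions.IgnoreCase)
        If m.Success Then
            Try
                Return Encoding.GetEncoding(m.Groups(1).Value)
            Catch ex As ArgumentException
                ' Unknown or unsupported charset name: fall through to the default.
            End Try
        End If

        ' Assumption: treat pages without a usable declaration as UTF-8.
        Return Encoding.UTF8
    End Function

    ' Hypothetical helper: download once, then decode the local bytes.
    Function DownloadHtml(ByVal url As String) As String
        Using client As New WebClient()
            Dim rawBytes As Byte() = client.DownloadData(url)
            Return DetectEncoding(rawBytes).GetString(rawBytes)
        End Using
    End Function

End Module
```

    Because the bytes are kept in memory, the page can be decoded a second time with a different Encoding if the first guess turns out to be wrong.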

  3. #3
    Angel of Code Niya's Avatar
    Join Date
    Nov 2011
    Posts
    9,017

    Re: Decoding characters obtained from web pages

    Do you know the encoding used by the web page? The HTTP header is supposed to have a field that can tell you that.
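
    As a rough sketch of reading that field: the charset normally travels in the Content-Type response header (for example "text/html; charset=ISO-8859-1"). EncodingFromContentType and DownloadUsingHeader are hypothetical names, and the UTF-8 fallback is an assumption for servers that send no charset.

```vbnet
Imports System.Net
Imports System.Net.Mime
Imports System.Text

Module HeaderCharset

    ' Hypothetical helper: turn a Content-Type header value into an Encoding,
    ' or return Nothing when no usable charset is given.
    Function EncodingFromContentType(ByVal headerValue As String) As Encoding
        If String.IsNullOrEmpty(headerValue) Then Return Nothing
        Try
            Dim parsed As New ContentType(headerValue)
            If String.IsNullOrEmpty(parsed.CharSet) Then Return Nothing
            Return Encoding.GetEncoding(parsed.CharSet)
        Catch ex As FormatException
            Return Nothing   ' the header value could not be parsed
        Catch ex As ArgumentException
            Return Nothing   ' the charset name is not a known encoding
        End Try
    End Function

    Function DownloadUsingHeader(ByVal url As String) As String
        Using client As New WebClient()
            Dim rawBytes As Byte() = client.DownloadData(url)
            ' ResponseHeaders is filled in once the download has completed.
            Dim declared As Encoding = EncodingFromContentType(client.ResponseHeaders("Content-Type"))
            ' Assumption: fall back to UTF-8 when the server sends no charset.
            Return If(declared, Encoding.UTF8).GetString(rawBytes)
        End Using
    End Function

End Module
```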

    [EDIT]

    Nvm...EG's answer is better.
    Treeview with NodeAdded/NodesRemoved events | BlinkLabel control | Calculate Permutations | Object Enums | ComboBox with centered items | .Net Internals article(not mine) | Wizard Control | Understanding Multi-Threading | Simple file compression | Demon Arena

    Copy/move files using Windows Shell | I'm not wanted

    C++ programmers will dismiss you as a cretinous simpleton for your inability to keep track of pointers chained 6 levels deep and Java programmers will pillory you for buying into the evils of Microsoft. Meanwhile C# programmers will get paid just a little bit more than you for writing exactly the same code and VB6 programmers will continue to whitter on about "footprints". - FunkyDexter

    There's just no reason to use garbage like InputBox. - jmcilhinney

    The threads I start are Niya and Olaf free zones. No arguing about the benefits of VB6 over .NET here please. Happiness must reign. - yereverluvinuncleber

  4. #4
    PowerPoster Evil_Giraffe's Avatar
    Join Date
    Aug 2002
    Location
    Suffolk, UK
    Posts
    2,555

    Re: Decoding characters obtained from web pages

    Quote Originally Posted by si_the_geek
    If possible, I'd also like to convert characters with accents etc to their 'simple' character (eg: instead of é and Ø I'd like to get e and O)
    Your problem there is that (apparently) Ø is not an "accented" O, but a character in its own right. The following code (very quickly dashed off, no warranties) strips combining characters off each base character, but it doesn't deal properly with graphemes where the base character is a surrogate pair (i.e. a character outside the Basic Multilingual Plane, which takes more than one Char to represent) - but I suspect you don't care about those. If the document contains them, they will get corrupted.

    vbnet Code:
    Imports System.Text
    Imports System.Globalization

    Module Module1

        Sub Main()
            Dim accented As String = "eg: instead of é and Ø I'd like to get e and O"
            Console.WriteLine(accented)
            Dim deaccented As String = DeAccent(accented)
            Console.WriteLine(deaccented)
            Console.ReadLine()
        End Sub

        ' Keeps only the base character of each grapheme.
        ' Caveat: a base character stored as a surrogate pair gets corrupted.
        Private Function DeAccent(ByVal accented As String) As String
            Dim deaccentedBuilder As New StringBuilder
            Dim graphemeIterator = StringInfo.GetTextElementEnumerator(accented)

            While graphemeIterator.MoveNext()
                Dim grapheme As String = graphemeIterator.Current.ToString()
                ' Form D splits a composed character into base + combining marks.
                Dim normalisedGrapheme As String = grapheme.Normalize(NormalizationForm.FormD)
                ' The first Char is the base character; the marks are dropped.
                deaccentedBuilder.Append(normalisedGrapheme(0))
            End While

            Return deaccentedBuilder.ToString()
        End Function
    End Module

    Output:
    Code:
    eg: instead of é and Ø I'd like to get e and O
    eg: instead of e and Ø I'd like to get e and O

  5. #5
    Angel of Code Niya's Avatar
    Join Date
    Nov 2011
    Posts
    9,017

    Re: Decoding characters obtained from web pages

    wow.....I don't have a clue what you did there lol

  6. #6

    Thread Starter
    Super Moderator si_the_geek's Avatar
    Join Date
    Jul 2002
    Location
    Bristol, UK
    Posts
    41,974

    Re: Decoding characters obtained from web pages

    The web pages aren't necessarily consistent with their encoding, so it seems like the DownloadData option is the way to go.

    A quick test worked, so I'm hopeful that when implemented fully it will work well for the various encodings used.

    After putting a problem part of a UTF-8 page into a byte array, this was all that was needed:
    Code:
    Dim decodedString = System.Text.Encoding.GetEncoding("UTF-8").GetString(encodedBytes)
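
    For reference, a small sketch of why decoding the bytes with the right encoding fixes the problem, and how a string that has already been mis-decoded can sometimes be repaired. It assumes the text was UTF-8 read as ISO-8859-1; the module and variable names are made up for illustration. Encoding a string to UTF-8 bytes and straight back again returns the same string, which is why that round trip changes nothing.

```vbnet
Imports System.Text

Module RepairMojibake

    Sub Main()
        Dim latin1 As Encoding = Encoding.GetEncoding("ISO-8859-1")

        ' "é" is stored in UTF-8 as the two bytes C3 A9.
        Dim utf8Bytes As Byte() = Encoding.UTF8.GetBytes("é")

        ' Reading those bytes with a single-byte encoding gives two characters.
        Dim misread As String = latin1.GetString(utf8Bytes)

        ' Re-encoding with the same wrong encoding recovers the original bytes,
        ' which can then be decoded as UTF-8.
        Dim repaired As String = Encoding.UTF8.GetString(latin1.GetBytes(misread))

        Console.WriteLine(misread.Length)    ' 2
        Console.WriteLine(repaired = "é")    ' True
    End Sub

End Module
```

    This only works when the wrong encoding maps every byte to a distinct character, so decoding the downloaded bytes correctly in the first place is the safer route.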


    That DeAccent routine seems to work nicely so far... I'll have a play around with the data I'm getting to see if there are any issues.

  7. #7
    Super Moderator dday9's Avatar
    Join Date
    Mar 2011
    Posts
    12,397

    Re: Decoding characters obtained from web pages

    Quote Originally Posted by Evil_Giraffe
    Output:
    Code:
    eg: instead of é and Ø I'd like to get e and O
    eg: instead of e and Ø I'd like to get e and O
    I wouldn't like any e & o's!
    "Code is like humor. When you have to explain it, it is bad." - Cory House
    VbLessons | HtmlLessons | CssLessons | Code Tags | Sword of Fury - Jameram

  8. #8
    PowerPoster Evil_Giraffe's Avatar
    Join Date
    Aug 2002
    Location
    Suffolk, UK
    Posts
    2,555

    Re: Decoding characters obtained from web pages

    Quote Originally Posted by Niya
    wow.....I don't have a clue what you did there lol
    First up, I use StringInfo.GetTextElementEnumerator to get a TextElementEnumerator (a non-generic IEnumerator) whose Current value is a string holding a single grapheme - what you and I think of as a single character (this may be several Unicode code points, and each code point is potentially (but probably not) several Char instances).
    Because it's an IEnumerator, not an IEnumerable, I can't use For Each (it would probably be two minutes' work to write a function that wraps an IEnumerator in an IEnumerable, or even to find an implementation that's in the BCL already, but meh).
    For each element I normalise it using Form D - where a decomposition exists, this converts the string to a plain base character followed by combining characters (such as "e" followed by "combining acute accent", rather than the composed "e with acute accent" code point). We then take the first Char from that string, which is the base character, and leave the rest. Note that if the base character needs more than one Char to encode, this corrupts the string, hence my warning about surrogate pairs - it would be easy enough to account for if I could be bothered, given that Char has an IsSurrogatePair method.
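
    A possible surrogate-aware variant of the same idea, offered as a sketch only. It assumes the input is well-formed Unicode (Normalize can throw on unpaired surrogates), and the module name is made up.

```vbnet
Imports System.Globalization
Imports System.Text

Module SurrogateSafeDeAccent

    Function DeAccent(ByVal accented As String) As String
        Dim builder As New StringBuilder(accented.Length)
        Dim graphemes As TextElementEnumerator = StringInfo.GetTextElementEnumerator(accented)

        While graphemes.MoveNext()
            ' Decompose the grapheme into base character + combining marks.
            Dim decomposed As String = graphemes.GetTextElement().Normalize(NormalizationForm.FormD)
            ' Keep one Char, or two when the base character is a surrogate pair.
            Dim baseLength As Integer = If(Char.IsSurrogatePair(decomposed, 0), 2, 1)
            builder.Append(decomposed, 0, baseLength)
        End While

        Return builder.ToString()
    End Function

    Sub Main()
        Console.WriteLine(DeAccent("café"))   ' cafe
    End Sub

End Module
```

    Characters such as Ø that have no decomposition still pass through unchanged, exactly as in the original routine.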

  9. #9

    Thread Starter
    Super Moderator si_the_geek's Avatar
    Join Date
    Jul 2002
    Location
    Bristol, UK
    Posts
    41,974

    Re: Decoding characters obtained from web pages

    The Encoding part is working very well (apart from issues with merging it into a large project!), and the DeAccent function is good too.

    The only issue I've seen with DeAccent so far is that it doesn't "fix" a few characters ( ø œ æ ß ), but it hasn't corrupted/broken anything, and I can hard-code conversions for the few exceptions.
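
    One way those hard-coded conversions might look, as a sketch. The replacement strings are assumptions (for example, whether ß should become "ss"), and ReplaceSpecials is a hypothetical name; it is meant to run before or after DeAccent.

```vbnet
Imports System.Collections.Generic
Imports System.Text

Module ManualReplacements

    ' Assumed replacements for characters that have no Unicode decomposition.
    Private ReadOnly Replacements As New Dictionary(Of Char, String) From {
        {"ø"c, "o"}, {"Ø"c, "O"},
        {"œ"c, "oe"}, {"Œ"c, "OE"},
        {"æ"c, "ae"}, {"Æ"c, "AE"},
        {"ß"c, "ss"}
    }

    ' Hypothetical helper: swap each listed character for its plain spelling.
    Function ReplaceSpecials(ByVal source As String) As String
        Dim builder As New StringBuilder(source.Length)
        For Each ch As Char In source
            Dim replacement As String = Nothing
            If Replacements.TryGetValue(ch, replacement) Then
                builder.Append(replacement)
            Else
                builder.Append(ch)
            End If
        Next
        Return builder.ToString()
    End Function

End Module
```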


    Thanks for the help
