[RESOLVED] Decoding characters obtained from web pages

**si_the_geek** · Apr 9th, 2013, 10:24 AM

I am downloading pages from the web using a variety of methods, including this:

Code:

    Dim URLString As String = "http://www.example.com/page.htm"
    Dim MyWebClient As New Net.WebClient()
    Dim HTML As String = MyWebClient.DownloadString(URLString)

No matter which method is used for downloading, quite often the HTML contains characters that are encoded somehow, such as: Ã© and Âº and â€˜

If I save the text to a file, eg:

Code:

My.Computer.FileSystem.WriteAllText(filePath, HTML, False, System.Text.Encoding.Default)

...some of the issues are dealt with (eg: Ã© becomes é , Âº becomes º , and â€˜ becomes ‘ )

Is there a way I can do this conversion without saving to a file and reloading? (or hard-coding conversions as I find them!)

I've tried several things with no luck, including this:

Code:

    Dim encodedBytes As Byte() = System.Text.UTF8Encoding.UTF8.GetBytes(HTML)
    Dim decodedString As String = System.Text.UTF8Encoding.UTF8.GetString(encodedBytes)
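A note on why the attempt above changes nothing: encoding a string to UTF-8 and decoding it straight back as UTF-8 is a round trip, so the garbled characters come back unchanged. The file save helps because it writes the text out with a different encoding than the one used to read it back. A minimal in-memory sketch of the same idea follows. It assumes the page was UTF-8 and was mis-decoded as Windows-1252 (code page 1252, which is what Encoding.Default typically is on a Western European Windows system running .NET Framework); RepairMojibake is a hypothetical helper name, not something from the thread.

```vbnet
Imports System
Imports System.Text

Module MojibakeSketch
    ' Hypothetical helper: undo a wrong decode by re-encoding the text with the
    ' encoding that was wrongly applied, then decoding those bytes as UTF-8.
    Function RepairMojibake(ByVal garbled As String) As String
        ' Assumption: the text was mis-decoded as Windows-1252.
        ' (On .NET Core / .NET 5+ this code page needs the code pages
        ' encoding provider to be registered first.)
        Dim wrongEncoding As Encoding = Encoding.GetEncoding(1252)
        Dim originalBytes As Byte() = wrongEncoding.GetBytes(garbled)
        Return Encoding.UTF8.GetString(originalBytes)
    End Function

    Sub Main()
        ' U+00C3 U+00A9 is what "e with acute accent" looks like when its
        ' UTF-8 bytes (C3 A9) are read as Windows-1252.
        Dim garbled As String = ChrW(&HC3).ToString() & ChrW(&HA9).ToString()
        Console.WriteLine(RepairMojibake(garbled))
    End Sub
End Module
```

This only repairs text whose wrong decode was lossless. Decoding the downloaded bytes with the right encoding in the first place avoids the problem entirely.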

If possible, I'd also like to convert characters with accents etc to their 'simple' character (eg: instead of é and Ø I'd like to get e and O)

**Evil_Giraffe** · Apr 9th, 2013, 10:36 AM

You need to either set the Encoding property of the WebClient to the appropriate encoding, or use DownloadData to get a byte array and then use the Encoding directly to convert it as a separate step. The latter is probably preferable if you are not sure of the encoding: read the beginning of the HTML document until you reach the encoding declaration (<meta http-equiv="Content-Type" content="text/html; charset=XXX">). You should be able to read at least that far in most encodings, no matter what the document is actually encoded in. Once you know the declared encoding, you can re-interpret the bytes accordingly without re-downloading the data, since you already have them locally.
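The DownloadData approach described above could be sketched roughly as follows. This is only an illustration: SniffCharset and DownloadAndDecode are hypothetical names, the 4096-byte window and the UTF-8 fallback are assumptions, and the regular expression is a loose match for charset=... rather than a real HTML parser. It will not find the declaration in a UTF-16 document, for example.

```vbnet
Imports System
Imports System.Net
Imports System.Text
Imports System.Text.RegularExpressions

Module CharsetSniffSketch
    ' Hypothetical helper: find a charset declaration near the start of the
    ' document and return the matching Encoding.
    Function SniffCharset(ByVal data As Byte()) As Encoding
        ' Decode only the head of the document with a single-byte encoding,
        ' so that every byte maps to some character and nothing throws.
        Dim headLength As Integer = Math.Min(data.Length, 4096)
        Dim head As String = Encoding.GetEncoding("ISO-8859-1").GetString(data, 0, headLength)

        Dim found As Match = Regex.Match(head, "charset\s*=\s*[""']?([A-Za-z0-9_\-]+)", RegexOptions.IgnoreCase)
        If found.Success Then
            Try
                Return Encoding.GetEncoding(found.Groups(1).Value)
            Catch ex As ArgumentException
                ' Unknown or unsupported charset name: use the fallback below.
            End Try
        End If

        ' Assumption: fall back to UTF-8 when nothing usable is declared.
        Return Encoding.UTF8
    End Function

    ' Hypothetical helper: download the raw bytes once, then decode them
    ' locally with whatever encoding the document declares.
    Function DownloadAndDecode(ByVal url As String) As String
        Using client As New WebClient()
            Dim data As Byte() = client.DownloadData(url)
            Return SniffCharset(data).GetString(data)
        End Using
    End Function
End Module
```

Because the bytes are kept in memory, a wrong guess can be corrected by decoding the same array again with a different Encoding.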

**Niya** · Apr 9th, 2013, 10:37 AM

Do you know the encoding used by the web page? The HTTP header is supposed to have a field that can tell you that.

[EDIT]

Nvm...EG's answer is better.
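For reference, the header field in question is Content-Type, which may carry a charset parameter (for example "text/html; charset=utf-8"). A rough sketch of reading it after a DownloadData call is below. EncodingFromContentType and DownloadUsingHeader are hypothetical names, and the UTF-8 fallback is an assumption. Servers sometimes omit the parameter or get it wrong, so this complements rather than replaces looking inside the document.

```vbnet
Imports System
Imports System.Net
Imports System.Net.Mime
Imports System.Text

Module HeaderCharsetSketch
    ' Hypothetical helper: turn a Content-Type header value into an Encoding,
    ' or return Nothing when no usable charset is declared.
    Function EncodingFromContentType(ByVal headerValue As String) As Encoding
        If String.IsNullOrEmpty(headerValue) Then Return Nothing
        Try
            Dim parsed As New ContentType(headerValue)
            Dim charsetName As String = parsed.CharSet
            If String.IsNullOrEmpty(charsetName) Then Return Nothing
            Return Encoding.GetEncoding(charsetName)
        Catch ex As FormatException
            ' The header value could not be parsed.
            Return Nothing
        Catch ex As ArgumentException
            ' The charset name is not known to this system.
            Return Nothing
        End Try
    End Function

    ' Hypothetical helper: download the bytes, then decode them using the
    ' charset from the response header if there is one.
    Function DownloadUsingHeader(ByVal url As String) As String
        Using client As New WebClient()
            Dim data As Byte() = client.DownloadData(url)
            Dim declared As Encoding = EncodingFromContentType(client.ResponseHeaders("Content-Type"))
            ' Assumption: fall back to UTF-8 when the header gives no charset.
            If declared Is Nothing Then declared = Encoding.UTF8
            Return declared.GetString(data)
        End Using
    End Function
End Module
```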

**Evil_Giraffe** · Apr 9th, 2013, 10:58 AM

Originally Posted by si_the_geek

If possible, I'd also like to convert characters with accents etc to their 'simple' character (eg: instead of é and Ø I'd like to get e and O)

Your problem there is that (apparently) Ø is not an "accented" O, but a character in its own right. The following code (very quickly dashed off, no warranties) strips combining characters off each base character, but it doesn't deal properly with graphemes where the base character is a surrogate pair (i.e. takes more than one Char to represent) outside the Basic Multilingual Plane - but I suspect you don't care about those. If the document contains them, they will get corrupted.

vbnet Code:

Imports System.Text
Imports System.Globalization
 
Module Module1
 
    Sub Main()
        Dim accented As String = "eg: instead of é and Ø I'd like to get e and O"
        Console.WriteLine(accented)
        Dim deaccented As String = DeAccent(accented)
        Console.WriteLine(deaccented)
        Console.ReadLine()
    End Sub
 
    Private Function DeAccent(ByVal accented As String) As String
        Dim deaccentedBuilder As New StringBuilder
        Dim graphemeIterator = StringInfo.GetTextElementEnumerator(accented)
 
        While graphemeIterator.MoveNext()
            Dim grapheme As String = graphemeIterator.Current.ToString()
            Dim normalisedGrapheme As String = grapheme.Normalize(NormalizationForm.FormD)
            deaccentedBuilder.Append(normalisedGrapheme(0))
        End While
 
        Return deaccentedBuilder.ToString()
    End Function
End Module

Output:

Code:

eg: instead of é and Ø I'd like to get e and O
eg: instead of e and Ø I'd like to get e and O

**Niya** · Apr 9th, 2013, 11:12 AM

wow.....I don't have a clue what you did there lol

**si_the_geek** · Apr 9th, 2013, 11:12 AM

The web pages aren't necessarily consistent with their encoding, so it seems like the DownloadData option is the way to go.

A quick test worked, so I'm hopeful that when implemented fully it will work well for the various encodings used.

After putting a problem part of a UTF-8 page into a byte array, this was all that was needed:

Code:

Dim decodedString = System.Text.Encoding.GetEncoding("UTF-8").GetString(encodedBytes)

That DeAccent routine seems to work nicely so far... I'll have a play around with the data I'm getting to see if there are any issues.

**dday9** · Apr 9th, 2013, 03:53 PM

Originally Posted by Evil_Giraffe

Output:

Code:

eg: instead of é and Ø I'd like to get e and O
eg: instead of e and Ø I'd like to get e and O

I wouldn't like any e & o's!

**Evil_Giraffe** · Apr 9th, 2013, 04:05 PM

Originally Posted by Niya

wow.....I don't have a clue what you did there lol

First up, I use StringInfo.GetTextElementEnumerator to get a TextElementEnumerator (a non-generic IEnumerator) where each element is a string holding a single grapheme - what you and I think of as a single character (this may be several Unicode code points, and each code point is potentially (but probably not) several Char instances).

Because it's an IEnumerator, not an IEnumerable, I can't use For Each (it would probably be two minutes' work to write a function that wraps up an IEnumerator in an IEnumerable, or even to find the implementation that's in the BCL already, but meh).

For each element I normalise it using Form D - this converts each string to a plain base character plus combining characters (such as "e" followed by "acute accent" rather than the composed "e with acute accent" code point). We then take the first Char from that string, which is the base character, and leave the rest. Note that if the base character needs to be encoded with more than one Char, this corrupts the string, hence my warning about surrogate pairs - it would be easy enough to account for if I could be bothered, given that Char has an IsSurrogatePair method.
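A variant of the DeAccent routine that accounts for surrogate pairs, along the lines hinted at above, might look like this. It is only a sketch: DeAccentSafe is a hypothetical name, and the one change is that when the decomposed grapheme starts with a surrogate pair, both Chars of the pair are kept instead of just the first.

```vbnet
Imports System
Imports System.Globalization
Imports System.Text

Module DeAccentSafeSketch
    ' Hypothetical variant of DeAccent: keeps the base character of each
    ' grapheme, including base characters that need two Chars.
    Function DeAccentSafe(ByVal accented As String) As String
        Dim builder As New StringBuilder()
        Dim graphemes As TextElementEnumerator = StringInfo.GetTextElementEnumerator(accented)

        While graphemes.MoveNext()
            ' Form D splits a composed character into base + combining marks.
            Dim normalised As String = graphemes.GetTextElement().Normalize(NormalizationForm.FormD)

            If normalised.Length >= 2 AndAlso Char.IsSurrogatePair(normalised, 0) Then
                ' Base character lies outside the BMP: keep both halves of the pair.
                builder.Append(normalised, 0, 2)
            Else
                ' Base character fits in one Char: keep it, drop the marks.
                builder.Append(normalised(0))
            End If
        End While

        Return builder.ToString()
    End Function
End Module
```

Characters such as Ø that have no decomposition are still passed through unchanged, exactly as in the original routine.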

**si_the_geek** · Apr 10th, 2013, 01:26 PM

The Encoding part is working very well (apart from issues with merging it into a large project!), and the DeAccent function is good too.

The only issue I've seen with DeAccent so far is that it doesn't "fix" a few characters ( ø œ æ ß ), but it hasn't corrupted/broken anything, and I can hard-code conversions for the few exceptions.

Thanks for the help
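The hard-coded conversions for the exceptions mentioned above could be kept in a small lookup table, roughly as sketched here. ReplaceSpecials is a hypothetical name, and apart from Ø to O (which is what was asked for earlier in the thread) the replacement strings are illustrative assumptions, not choices made in the thread.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module FallbackMapSketch
    ' Characters with no Unicode decomposition, mapped to plain replacements.
    ' Code points, in order: o-slash (lower, upper), oe ligature (lower, upper),
    ' ae ligature (lower, upper), sharp s.
    ' Assumption: the replacement strings are illustrative only.
    Private ReadOnly Replacements As New Dictionary(Of Char, String) From {
        {ChrW(&HF8), "o"},
        {ChrW(&HD8), "O"},
        {ChrW(&H153), "oe"},
        {ChrW(&H152), "OE"},
        {ChrW(&HE6), "ae"},
        {ChrW(&HC6), "AE"},
        {ChrW(&HDF), "ss"}
    }

    ' Hypothetical helper: swap each listed character for its replacement
    ' and copy every other character through unchanged.
    Function ReplaceSpecials(ByVal input As String) As String
        Dim builder As New StringBuilder(input.Length)
        For Each c As Char In input
            Dim replacement As String = Nothing
            If Replacements.TryGetValue(c, replacement) Then
                builder.Append(replacement)
            Else
                builder.Append(c)
            End If
        Next
        Return builder.ToString()
    End Function
End Module
```

Running this before or after the DeAccent step should give the same result, since none of these characters is changed by normalisation to Form D.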
