
Thread: WebClient.Encoding and diamond question marks

  1. #1
    New Member
    Join Date
    May 07
    Posts
    5

    WebClient.Encoding and diamond question marks

    I'm trying to scrape a webpage using this code, cut out a piece of text I want from the HTML code, and finally show it in a textbox. The problem I'm having is that special/foreign characters are shown as diamond question marks. The webpage says it's charset=ISO-8859-1.

    I tried setting the encoding with the line that is commented out below, but it made no difference. I tried the other encodings as well like UTF8.

    What shall I do to read this webpage in a better way?

    Here's the code I have so far:

    Code:
            Dim myWebClient As New Net.WebClient()
            'myWebClient.Encoding = System.Text.Encoding.GetEncoding("ISO-8859-1")
            Dim myStream As IO.Stream = myWebClient.OpenRead(url)
            Dim sr As New IO.StreamReader(myStream)
            Dim myString As String = sr.ReadToEnd
            myStream.Close()

  2. #2
    PowerPoster techgnome's Avatar
    Join Date
    May 02
    Posts
    21,786

    Re: WebClient.Encoding and diamond question marks

    It's likely that the encoding is fine... what isn't fine is the text box. That diamond question mark means that it recognizes that there is a character there, but it's not displayable in the current font... I take that back. It probably is the encoding, but on the textbox side, not the stream/WebClient side. What's happening is that your stream comes back in a particular encoding, but the text box may not be in sync with it, so when you pass it the string to display, it doesn't know what to do with it and does what it can. I don't know that you can set the encoding on a text box, as it is pretty basic. You may want to look at the RichTextBox; it might allow a little more flexibility.

    -tg

  3. #3
    PowerPoster dunfiddlin's Avatar
    Join Date
    Jun 12
    Posts
    5,960

    Re: WebClient.Encoding and diamond question marks

    How are these characters encoded in the HTML itself? CSS? Individual &H values? Could we have a sample? Have you tried looking at the Hex version of the text to determine exactly what is being saved? What about if you use the DownloadString method rather than a Stream?

  4. #4
    Powered By Medtronic dbasnett's Avatar
    Join Date
    Dec 07
    Location
    Pointless Forest 38.517,-92.023
    Posts
    7,281

    Re: WebClient.Encoding and diamond question marks

    Could you post the webpage url that is giving you problems?

  5. #5
    Hyperactive Member
    Join Date
    Jul 11
    Location
    UK
    Posts
    438

    Re: WebClient.Encoding and diamond question marks

    A StreamReader uses UTF-8 encoding by default, so you could try modifying your code slightly:
    vb.net Code:
    Dim myString As String ' declared here to broaden its scope

    Using myWebClient As New Net.WebClient()
        Using myStream As IO.Stream = myWebClient.OpenRead(url)
            ' Decode the bytes as ISO-8859-1 instead of the StreamReader default (UTF-8)
            Using sr As New IO.StreamReader(myStream, System.Text.Encoding.GetEncoding("ISO-8859-1"))
                myString = sr.ReadToEnd()
            End Using ' StreamReader
        End Using ' Stream
    End Using ' WebClient

    The WebClient.Encoding Property is used by the WebClient when it uploads or downloads Strings, so you might also try:

    vb.net Code:
    Dim myString As String ' declared here to broaden its scope

    Using myWebClient As New Net.WebClient()
        myWebClient.Encoding = System.Text.Encoding.GetEncoding("ISO-8859-1")
        myString = myWebClient.DownloadString(url)
    End Using ' WebClient
    although it defaults to the system's default encoding, so you could probably get away without explicitly setting the encoding in your particular case. It's best to set it explicitly when you can, though. See the MSDN for more details, and also the Remarks section here.
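
    To see why the decoder choice matters, here is a small sketch that needs no network access. It encodes a character with one encoding and decodes the bytes with another. The module name EncodingMismatchDemo and the helper Transcode are made up for this illustration; they are not part of the original code.

    ```vbnet
    Imports System
    Imports System.Text

    Module EncodingMismatchDemo
        ' Hypothetical helper: encode sample with one encoding, then decode the
        ' resulting bytes with another, as happens when a reader guesses wrong.
        Function Transcode(sample As String, writtenAs As Encoding, readAs As Encoding) As String
            Return readAs.GetString(writtenAs.GetBytes(sample))
        End Function

        Sub Main()
            Dim latin1 As Encoding = Encoding.GetEncoding("ISO-8859-1")
            Dim oUmlaut As String = ChrW(&HF6) ' the character ö

            ' ISO-8859-1 bytes read as UTF-8: the lone byte &HF6 is not valid
            ' UTF-8, so the decoder substitutes U+FFFD (the diamond question mark).
            Dim asUtf8 As String = Transcode(oUmlaut, latin1, Encoding.UTF8)
            Console.WriteLine(asUtf8 = ChrW(&HFFFD))

            ' UTF-8 bytes read as ISO-8859-1: the two bytes &HC3 &HB6 each
            ' become a separate character (U+00C3 followed by U+00B6).
            Dim asLatin1 As String = Transcode(oUmlaut, Encoding.UTF8, latin1)
            Console.WriteLine(asLatin1.Length)
        End Sub
    End Module
    ```

    The first case is what a StreamReader left at its UTF-8 default does to an ISO-8859-1 page. The second is the reverse mismatch, which produces two-character sequences rather than diamonds.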
    Last edited by Inferrd; Nov 15th, 2012 at 12:32 AM. Reason: Noticed a Scoping issue with the way I modified the original code

  6. #6
    New Member
    Join Date
    May 07
    Posts
    5

    Re: WebClient.Encoding and diamond question marks

    Thanks everyone for your suggestions, but especially Inferrd. Your first suggestion enabled me to move on. I didn't know you could set the encoding in StreamReader that way.
    Code:
    Dim sr As New IO.StreamReader(myStream, System.Text.Encoding.GetEncoding("ISO-8859-1"))
    or even just
    Code:
    Dim sr As New IO.StreamReader(myStream, myWebClient.Encoding)
    works.

    However...

    I'm actually scraping more than one webpage. The other one now started giving me problems with this new approach:

    ö becomes Ã¶
    ä becomes Ã¤
    é becomes Ã©
    etc...

    So no diamond question marks this time; it's the other way around.

    I've examined myWebClient.Encoding and can see no difference between the attributes of the two websites. I'm beginning to suspect that one of them is reporting one type of encoding, but is actually another one. At the same time, they both render properly in Chrome.
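
    One way to check what each server claims is to read the charset from the Content-Type response header instead of hardcoding it. A minimal sketch, assuming the server sends that header at all: EncodingFromContentType and DownloadText are hypothetical helper names, and falling back to UTF-8 is an assumption, not something either site confirms. It does not look at any charset declared inside the HTML itself.

    ```vbnet
    Imports System
    Imports System.Net
    Imports System.Net.Mime
    Imports System.Text

    Module CharsetSketch
        ' Hypothetical helper: returns the encoding named by the charset
        ' parameter of a Content-Type header value, or fallback if there is none.
        Function EncodingFromContentType(headerValue As String, fallback As Encoding) As Encoding
            If String.IsNullOrEmpty(headerValue) Then Return fallback
            Try
                Dim parsed As New ContentType(headerValue)
                If String.IsNullOrEmpty(parsed.CharSet) Then Return fallback
                Return Encoding.GetEncoding(parsed.CharSet)
            Catch ex As FormatException
                Return fallback ' header value could not be parsed
            Catch ex As ArgumentException
                Return fallback ' charset name not known on this system
            End Try
        End Function

        ' Hypothetical helper: download the raw bytes first, then decode them
        ' with whatever encoding the response header names.
        Function DownloadText(url As String) As String
            Using client As New WebClient()
                Dim raw As Byte() = client.DownloadData(url)
                Dim header As String = client.ResponseHeaders("Content-Type")
                Return EncodingFromContentType(header, Encoding.UTF8).GetString(raw)
            End Using
        End Function
    End Module
    ```

    If a site really does report one encoding and send another, this will still decode it wrongly, so a per-site override may remain necessary.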

    So this ugly solution is what I've come up with so far, basically hardcoding one behavior if the link sent to the function is a specific one. I guess I'll see how many special cases I have to define. Do you guys have any suggestion on how to improve this? Or maybe explain what's going on with these webpages?

    Code:
        Private Function ProcessURL(url As String) As String
            Dim myWebClient As New Net.WebClient()
    
            'this next line has no effect
            'myWebClient.Encoding = System.Text.Encoding.GetEncoding("ISO-8859-1")
    
            Dim myStream As IO.Stream = myWebClient.OpenRead(url)
    
            Dim myString As String
    
            If url Like "*codeword*" Then
                'either approach works here
                'Dim sr As New IO.StreamReader(myStream, System.Text.Encoding.GetEncoding("ISO-8859-1"))
                Dim sr As New IO.StreamReader(myStream, myWebClient.Encoding)
                myString = sr.ReadToEnd()
            Else
                Dim sr As New IO.StreamReader(myStream)
                myString = sr.ReadToEnd()
            End If
    
            myStream.Close()
            Return myString
        End Function
    P.S. I tried to "dim" the StreamReader before the if statement, but then I don't know how to enforce the encoding on it afterwards. Sorry, I don't know the proper lingo...
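
    On the P.S.: one possible way around declaring the StreamReader twice is to decide on the encoding first and create the reader once. A sketch only, reusing the "*codeword*" placeholder from above; ChooseEncoding is a hypothetical helper name, and treating every other page as UTF-8 is an assumption based on StreamReader's default.

    ```vbnet
    Imports System
    Imports System.IO
    Imports System.Net
    Imports System.Text

    Module ProcessUrlSketch
        ' Hypothetical helper: pick the encoding from the URL alone,
        ' so the special cases live in one place.
        Function ChooseEncoding(url As String) As Encoding
            If url Like "*codeword*" Then
                Return Encoding.GetEncoding("ISO-8859-1")
            End If
            Return Encoding.UTF8 ' same as StreamReader's default
        End Function

        Function ProcessURL(url As String) As String
            Using myWebClient As New WebClient()
                Using myStream As Stream = myWebClient.OpenRead(url)
                    ' The encoding is fixed when the reader is constructed,
                    ' so it is chosen before the reader is created.
                    Using sr As New StreamReader(myStream, ChooseEncoding(url))
                        Return sr.ReadToEnd()
                    End Using
                End Using
            End Using
        End Function
    End Module
    ```

    The Using blocks also close the stream and the WebClient if ReadToEnd throws.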
    Last edited by InterClaw; Nov 16th, 2012 at 03:26 AM. Reason: spelling
