Thread: WebClient.Encoding and diamond question marks

  1. #1

    Thread Starter
    New Member
    Join Date
    May 2007
    Posts
    5

    WebClient.Encoding and diamond question marks

    I'm trying to scrape a webpage using this code, cut out a piece of text I want from the HTML code, and finally show it in a textbox. The problem I'm having is that special/foreign characters are shown as diamond question marks. The webpage says it's charset=ISO-8859-1.

    I tried setting the encoding with the line that is commented out below, but it made no difference. I tried other encodings as well, such as UTF8.

    What shall I do to read this webpage in a better way?

    Here's the code I have so far:

    Code:
            Dim myWebClient As New Net.WebClient()
            'myWebClient.Encoding = System.Text.Encoding.GetEncoding("ISO-8859-1")
            Dim myStream As IO.Stream = myWebClient.OpenRead(url)
            Dim sr As New IO.StreamReader(myStream)
            Dim myString As String = sr.ReadToEnd
            myStream.Close()

  2. #2
    Smooth Moperator techgnome's Avatar
    Join Date
    May 2002
    Posts
    34,531

    Re: WebClient.Encoding and diamond question marks

    It's likely that the encoding is fine; what isn't fine is the text box. That diamond question mark means it recognizes that there is a character there, but it's not displayable in the current font... I take that back. It probably is the encoding, but on the textbox side, not the stream/WebClient side. What's happening is that your stream comes back in a particular encoding but the text box may not be in sync, so when you pass it the string to display, it doesn't know what to do with it and does what it can. I don't know that you can set the encoding on a TextBox, as it is pretty basic. You may want to look at the RichTextBox; it might allow a little more flexibility.

    -tg

  3. #3
    PowerPoster dunfiddlin's Avatar
    Join Date
    Jun 2012
    Posts
    8,245

    Re: WebClient.Encoding and diamond question marks

    How are these characters encoded in the HTML itself? CSS? Individual &H values? Could we have a sample? Have you tried looking at the Hex version of the text to determine exactly what is being saved? What about if you use the DownloadString method rather than a Stream?
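
    To make the "look at the hex" suggestion concrete, here is a minimal sketch that prints raw bytes before any decoding takes place. The byte array is a made-up stand-in for whatever `WebClient.DownloadData(url)` would return for the real page, and `HexDemo` is just an illustrative name.

    ```vbnet
    Imports System

    Module HexDemo
        Sub Main()
            ' Stand-in for: Dim raw As Byte() = myWebClient.DownloadData(url)
            ' These sample bytes are "för" as ISO-8859-1 would encode it (assumed data).
            Dim raw As Byte() = {&H66, &HF6, &H72}

            ' A single byte F6 for "ö" points at ISO-8859-1 (or Windows-1252);
            ' a UTF-8 page would have sent the two bytes C3 B6 instead.
            Console.WriteLine(BitConverter.ToString(raw)) ' prints 66-F6-72
        End Sub
    End Module
    ```

    Comparing the bytes of one known accented character against both encodings is usually enough to tell which one the server is actually sending.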

  4. #4
    Powered By Medtronic dbasnett's Avatar
    Join Date
    Dec 2007
    Location
    Jefferson City, MO
    Posts
    9,754

    Re: WebClient.Encoding and diamond question marks

    Could you post the webpage url that is giving you problems?

  5. #5
    Frenzied Member
    Join Date
    Jul 2011
    Location
    UK
    Posts
    1,335

    Re: WebClient.Encoding and diamond question marks

    A StreamReader uses UTF-8 encoding by default, so you could try modifying your code slightly:
    vb.net Code:
    Dim myString As String ' declared here to broaden its scope

    Using myWebClient As New Net.WebClient()
        Using myStream As IO.Stream = myWebClient.OpenRead(url)
            ' Decode the bytes as ISO-8859-1 instead of the StreamReader default (UTF-8)
            Using sr As New IO.StreamReader(myStream, System.Text.Encoding.GetEncoding("ISO-8859-1"))
                myString = sr.ReadToEnd()
            End Using ' StreamReader
        End Using ' Stream
    End Using ' WebClient
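
    As a rough illustration of why the default reader produces diamond question marks, the sketch below decodes the same byte twice, once with the default UTF-8 reader and once with an explicit ISO-8859-1 reader. The in-memory byte is an assumed stand-in for the page content; nothing here touches the network, and `ReaderDemo` is an illustrative name.

    ```vbnet
    Imports System
    Imports System.IO
    Imports System.Text

    Module ReaderDemo
        Sub Main()
            ' The single byte F6 is "ö" in ISO-8859-1 but is not valid UTF-8 on its own.
            Dim raw As Byte() = {&HF6}

            ' Default StreamReader: decodes as UTF-8 and substitutes U+FFFD,
            ' the replacement character that is drawn as a diamond question mark.
            Using sr As New StreamReader(New MemoryStream(raw))
                Dim text As String = sr.ReadToEnd()
                Console.WriteLine(AscW(text(0)).ToString("X4")) ' prints FFFD
            End Using

            ' Explicit encoding: the same byte maps to U+00F6, i.e. "ö".
            Using sr As New StreamReader(New MemoryStream(raw), Encoding.GetEncoding("ISO-8859-1"))
                Dim text As String = sr.ReadToEnd()
                Console.WriteLine(AscW(text(0)).ToString("X4")) ' prints 00F6
            End Using
        End Sub
    End Module
    ```

    The code points are printed as hex so the result does not depend on the console's own encoding.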

    The WebClient.Encoding Property is used by the WebClient when it uploads or downloads Strings, so you might also try:

    vb.net Code:
    Dim myString As String ' declared here to broaden its scope

    Using myWebClient As New Net.WebClient()
        myWebClient.Encoding = System.Text.Encoding.GetEncoding("ISO-8859-1")
        myString = myWebClient.DownloadString(url)
    End Using ' WebClient
    although it defaults to the system's default encoding, so you could probably get away without explicitly setting the encoding in your particular case. It's best to set it explicitly when you can, though. See the MSDN for more details, and also the Remarks section here.
    Last edited by Inferrd; Nov 15th, 2012 at 01:32 AM. Reason: Noticed a Scoping issue with the way I modified the original code

  6. #6

    Thread Starter
    New Member
    Join Date
    May 2007
    Posts
    5

    Re: WebClient.Encoding and diamond question marks

    Thanks everyone for your suggestions, but especially Inferrd. Your first suggestion enabled me to move on. I didn't know you could set the encoding in StreamReader that way.
    Code:
    Dim sr As New IO.StreamReader(myStream, System.Text.Encoding.GetEncoding("ISO-8859-1"))
    or even just
    Code:
    Dim sr As New IO.StreamReader(myStream, myWebClient.Encoding)
    works.

    However...

    I'm actually scraping more than one webpage. The other one now started giving me problems with this new approach:

    ö becomes Ã¶
    ä becomes Ã¤
    é becomes Ã©
    etc...

    So not diamond question marks this time; it's the other way around.
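
    One accented letter turning into two characters is what UTF-8 bytes look like when they are decoded as ISO-8859-1, which would suggest (an assumption, since the page itself isn't shown) that this second site is actually served as UTF-8. A small sketch of the effect, with `MojibakeDemo` as an illustrative name:

    ```vbnet
    Imports System
    Imports System.Text

    Module MojibakeDemo
        Sub Main()
            Dim latin1 As Encoding = Encoding.GetEncoding("ISO-8859-1")

            ' ChrW(&HF6) is "ö"; UTF-8 encodes it as the two bytes C3 B6.
            Dim utf8Bytes As Byte() = Encoding.UTF8.GetBytes(ChrW(&HF6).ToString())
            Console.WriteLine(BitConverter.ToString(utf8Bytes)) ' prints C3-B6

            ' Decoding those two bytes as ISO-8859-1 yields two characters.
            Dim garbled As String = latin1.GetString(utf8Bytes)
            Console.WriteLine(garbled.Length) ' prints 2

            ' Decoding them as UTF-8 yields the single original character again.
            Dim restored As String = Encoding.UTF8.GetString(utf8Bytes)
            Console.WriteLine(restored.Length) ' prints 1
        End Sub
    End Module
    ```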

    I've examined myWebClient.Encoding and can see no difference between the attributes of the two websites. I'm beginning to suspect that one of them is reporting one type of encoding, but is actually another one. At the same time, they both render properly in Chrome.

    So this ugly solution is what I've come up with so far, basically hardcoding one behavior if the link sent to the function is a specific one. I guess I'll see how many special cases I have to define. Do you guys have any suggestion on how to improve this? Or maybe explain what's going on with these webpages?

    Code:
        Private Function ProcessURL(url As String) As String
            Dim myWebClient As New Net.WebClient()
    
            'this next line has no effect
            'myWebClient.Encoding = System.Text.Encoding.GetEncoding("ISO-8859-1")
    
            Dim myStream As IO.Stream = myWebClient.OpenRead(url)
    
            Dim myString As String
    
            If url Like "*codeword*" Then
                'either approach works here
                'Dim sr As New IO.StreamReader(myStream, System.Text.Encoding.GetEncoding("ISO-8859-1"))
                Dim sr As New IO.StreamReader(myStream, myWebClient.Encoding)
                myString = sr.ReadToEnd()
            Else
                Dim sr As New IO.StreamReader(myStream)
                myString = sr.ReadToEnd()
            End If
    
            myStream.Close()
            Return myString
        End Function
    P.S. I tried to "dim" the StreamReader before the if statement, but then I don't know how to enforce the encoding on it afterwards. Sorry, I don't know the proper lingo...
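
    One possible way to avoid hardcoding URLs is to choose the encoding from the charset the server reports in its Content-Type header, then create a single StreamReader with that encoding. The sketch below is only an illustration under stated assumptions: `EncodingFromContentType` is a hypothetical helper, the UTF-8 fallback is a guess rather than a rule, and the header strings are sample values.

    ```vbnet
    Imports System
    Imports System.Net.Mime
    Imports System.Text

    Module CharsetDemo
        ' Hypothetical helper: picks an Encoding from a Content-Type header value,
        ' falling back to UTF-8 (an assumption) when no usable charset is present.
        Function EncodingFromContentType(headerValue As String) As Encoding
            If String.IsNullOrEmpty(headerValue) Then Return Encoding.UTF8
            Try
                Dim parsed As New ContentType(headerValue)
                If String.IsNullOrEmpty(parsed.CharSet) Then Return Encoding.UTF8
                Return Encoding.GetEncoding(parsed.CharSet)
            Catch ex As FormatException
                Return Encoding.UTF8 ' malformed header
            Catch ex As ArgumentException
                Return Encoding.UTF8 ' unknown charset name
            End Try
        End Function

        Sub Main()
            Console.WriteLine(EncodingFromContentType("text/html; charset=ISO-8859-1").WebName) ' iso-8859-1
            Console.WriteLine(EncodingFromContentType("text/html; charset=utf-8").WebName)      ' utf-8
            Console.WriteLine(EncodingFromContentType("text/html").WebName)                     ' utf-8
        End Sub
    End Module
    ```

    In ProcessURL, the header should be available as myWebClient.ResponseHeaders("Content-Type") once OpenRead has returned, so the If/Else could shrink to choosing an Encoding variable first and then writing one Dim sr As New IO.StreamReader(myStream, chosenEncoding). That also covers the P.S.: declare the encoding before the reader, not the reader itself. As far as I know, WebClient.Encoding is the client's own setting and is not updated from the response, which would explain why it looks identical for both sites. A page that declares its charset only in a meta tag, or a server that reports the wrong one, would still need special handling.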
    Last edited by InterClaw; Nov 16th, 2012 at 04:26 AM. Reason: spelling
