WebClient.Encoding and diamond question marks
I'm trying to scrape a webpage using this code, cut out a piece of text I want from the HTML code, and finally show it in a textbox. The problem I'm having is that special/foreign characters are shown as diamond question marks. The webpage says it's charset=ISO-8859-1.
I tried setting the encoding with the line that is commented out below, but it made no difference. I also tried other encodings, such as UTF-8.
What shall I do to read this webpage in a better way?
Here's the code I have so far:
Code:
Dim myWebClient As New Net.WebClient()
'myWebClient.Encoding = System.Text.Encoding.GetEncoding("ISO-8859-1")
Dim myStream As IO.Stream = myWebClient.OpenRead(url)
Dim sr As New IO.StreamReader(myStream)
Dim myString As String = sr.ReadToEnd
myStream.Close()
Re: WebClient.Encoding and diamond question marks
It's likely that the encoding is fine... what isn't fine is the text box. That diamond question mark means it recognizes that there is a character there, but it's not displayable in the current font. Actually, I take that back: it probably is the encoding, but on the textbox side, not the stream/WebClient side. What's happening is that your stream comes back in a particular encoding, but the text box may not be in sync with it, so when you pass it the string to display, it doesn't know what to do with it and does what it can. I don't know that you can set the encoding on a text box, as it is pretty basic. You may want to look at the RichTextBox; it might allow a little more flexibility.
-tg
Re: WebClient.Encoding and diamond question marks
How are these characters encoded in the HTML itself? CSS? Individual &H values? Could we have a sample? Have you tried looking at the Hex version of the text to determine exactly what is being saved? What about if you use the DownloadString method rather than a Stream?
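For the hex check suggested above, a minimal sketch might look like the following. HexPreview is a hypothetical helper name, not something from the original code; it only formats bytes that have already been downloaded, for example with WebClient.DownloadData.

```vbnet
Imports System

Module HexPeek
    ' Hypothetical helper: returns the first `count` bytes as dash-separated hex.
    Function HexPreview(data As Byte(), count As Integer) As String
        Dim n As Integer = Math.Min(count, data.Length)
        If n <= 0 Then Return String.Empty
        Return BitConverter.ToString(data, 0, n)
    End Function
End Module
```

Something like HexPreview(myWebClient.DownloadData(url), 64) would show the raw bytes before any decoding. If an ö in the page shows up as the single byte F6, the page is ISO-8859-1 (or Windows-1252); if it shows up as C3-B6, it is UTF-8.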
Re: WebClient.Encoding and diamond question marks
Could you post the webpage url that is giving you problems?
Re: WebClient.Encoding and diamond question marks
A StreamReader uses UTF-8 encoding by default, so you could try modifying your code slightly:
vb.net Code:
Dim myString As String ' declared here to broaden its Scope
Using myWebClient As New Net.WebClient()
    Using myStream As IO.Stream = myWebClient.OpenRead(url)
        Using sr As New IO.StreamReader(myStream, System.Text.Encoding.GetEncoding("ISO-8859-1"))
            myString = sr.ReadToEnd()
        End Using ' StreamReader
    End Using ' Stream
End Using ' WebClient
The WebClient.Encoding Property is used by the WebClient when it uploads or downloads Strings, so you might also try:
vb.net Code:
Dim myString As String ' declared here to broaden its Scope
Using myWebClient As New Net.WebClient()
    myWebClient.Encoding = System.Text.Encoding.GetEncoding("ISO-8859-1")
    myString = myWebClient.DownloadString(url)
End Using ' WebClient
although it defaults to the system's default encoding, so you could probably get away without explicitly setting the encoding in your particular case. It's best to set it explicitly when you can, though. See the MSDN for more details, and also the Remarks section here.
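To see where the diamond question marks come from, here is a small offline sketch, assuming the page really does send ISO-8859-1 bytes. The module and function names are made up for illustration. It decodes the same three bytes once as UTF-8 (the StreamReader default) and once as ISO-8859-1.

```vbnet
Imports System
Imports System.Text

Module ReplacementCharDemo
    ' Decodes the same ISO-8859-1 bytes two ways.
    ' Index 0: decoded as UTF-8; index 1: decoded as ISO-8859-1.
    Function DecodeBoth() As String()
        ' "för" as an ISO-8859-1 server would send it: ö is the single byte &HF6.
        Dim raw As Byte() = New Byte() {&H66, &HF6, &H72}
        Dim latin1 As Encoding = Encoding.GetEncoding("ISO-8859-1")
        Dim asUtf8 As String = Encoding.UTF8.GetString(raw)
        Dim asLatin1 As String = latin1.GetString(raw)
        Return New String() {asUtf8, asLatin1}
    End Function

    Sub Main()
        Dim results As String() = DecodeBoth()
        Console.WriteLine("UTF-8 decode contains U+FFFD: {0}", results(0).IndexOf(ChrW(&HFFFD)) >= 0)
        Console.WriteLine("ISO-8859-1 decode length: {0}", results(1).Length)
    End Sub
End Module
```

The byte &HF6 is not valid UTF-8 on its own, so the UTF-8 decoder substitutes the replacement character U+FFFD, which most fonts draw as a diamond with a question mark. By the time the string reaches the textbox the original character is already gone.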
Re: WebClient.Encoding and diamond question marks
Thanks everyone for your suggestions, but especially Inferrd. Your first suggestion enabled me to move on. I didn't know you could set the encoding in StreamReader that way.
Code:
Dim sr As New IO.StreamReader(myStream, System.Text.Encoding.GetEncoding("ISO-8859-1"))
or even just
Code:
Dim sr As New IO.StreamReader(myStream, myWebClient.Encoding)
works.
However...
I'm actually scraping more than one webpage, and the other one has now started giving me problems with this new approach:
ö becomes Ã¶
ä becomes Ã¤
é becomes Ã©
etc...
So not diamond question marks this time, but the other way around.
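One possible explanation, offered as an assumption rather than a diagnosis: the second page is actually served as UTF-8, and reading UTF-8 bytes through an ISO-8859-1 reader turns each accented letter into two characters (ö shows up as Ã¶). The sketch below reproduces that offline; the names are made up for illustration.

```vbnet
Imports System
Imports System.Text

Module MojibakeDemo
    ' Decodes the two bytes UTF-8 uses for ö (U+00F6) with the given encoding.
    Function DecodeUtf8BytesAs(enc As Encoding) As String
        Dim raw As Byte() = New Byte() {&HC3, &HB6}
        Return enc.GetString(raw)
    End Function

    Sub Main()
        Dim wrong As String = DecodeUtf8BytesAs(Encoding.GetEncoding("ISO-8859-1"))
        Dim right As String = DecodeUtf8BytesAs(Encoding.UTF8)
        Console.WriteLine("As ISO-8859-1: {0} character(s)", wrong.Length)
        Console.WriteLine("As UTF-8: {0} character(s)", right.Length)
    End Sub
End Module
```

ISO-8859-1 maps every byte to exactly one character, so it never produces a replacement character; it silently yields two wrong characters instead. That would fit a page that reads correctly with the default (UTF-8) StreamReader and breaks with the ISO-8859-1 one.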
I've examined myWebClient.Encoding and can see no difference between the attributes of the two websites. I'm beginning to suspect that one of them is reporting one type of encoding, but is actually another one. At the same time, they both render properly in Chrome.
So this ugly solution is what I've come up with so far, basically hardcoding one behavior if the link sent to the function is a specific one. I guess I'll see how many special cases I have to define. Do you guys have any suggestion on how to improve this? Or maybe explain what's going on with these webpages? :)
Code:
Private Function ProcessURL(url As String) As String
    Dim myWebClient As New Net.WebClient()
    'this next line has no effect
    'myWebClient.Encoding = System.Text.Encoding.GetEncoding("ISO-8859-1")
    Dim myStream As IO.Stream = myWebClient.OpenRead(url)
    Dim myString As String
    If url Like "*codeword*" Then
        'either approach works here
        'Dim sr As New IO.StreamReader(myStream, System.Text.Encoding.GetEncoding("ISO-8859-1"))
        Dim sr As New IO.StreamReader(myStream, myWebClient.Encoding)
        myString = sr.ReadToEnd()
    Else
        Dim sr As New IO.StreamReader(myStream)
        myString = sr.ReadToEnd()
    End If
    myStream.Close()
    Return myString
End Function
P.S. I tried to "dim" the StreamReader before the if statement, but then I don't know how to enforce the encoding on it afterwards. Sorry, I don't know the proper lingo...:)
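On the P.S.: a StreamReader's encoding is fixed when it is constructed, so one way around the If/Else is to pick the Encoding first and create the reader once. The sketch below goes a step further and, as an assumption about what would help here, takes the encoding from the charset in the response's Content-Type header, falling back to UTF-8. EncodingFromContentType is a hypothetical helper, and a server can omit the charset or report the wrong one, so treat this as a starting point rather than a fix.

```vbnet
Imports System
Imports System.Net
Imports System.Net.Mime
Imports System.Text

Module CharsetHelper
    ' Hypothetical helper: picks an Encoding from a Content-Type header value,
    ' falling back to UTF-8 when no usable charset is present.
    Function EncodingFromContentType(headerValue As String) As Encoding
        If String.IsNullOrWhiteSpace(headerValue) Then Return Encoding.UTF8
        Try
            Dim charset As String = New ContentType(headerValue).CharSet
            If String.IsNullOrEmpty(charset) Then Return Encoding.UTF8
            Return Encoding.GetEncoding(charset)
        Catch ex As FormatException
            ' The header could not be parsed.
            Return Encoding.UTF8
        Catch ex As ArgumentException
            ' The charset name is not one .NET recognizes.
            Return Encoding.UTF8
        End Try
    End Function

    ' Same job as the original ProcessURL, but the encoding is chosen
    ' before the StreamReader is created.
    Function ProcessURL(url As String) As String
        Using client As New WebClient()
            Using stream As IO.Stream = client.OpenRead(url)
                ' ResponseHeaders is filled in once OpenRead has returned.
                Dim enc As Encoding = EncodingFromContentType(client.ResponseHeaders("Content-Type"))
                Using sr As New IO.StreamReader(stream, enc)
                    Return sr.ReadToEnd()
                End Using
            End Using
        End Using
    End Function
End Module
```

This may also explain why myWebClient.Encoding looks identical for both sites: it is a setting on the client, so it says nothing about what either server actually sent.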