-
Nov 14th, 2012, 11:30 AM
#1
Thread Starter
New Member
WebClient.Encoding and diamond question marks
I'm trying to scrape a webpage using this code, cut out a piece of text I want from the HTML code, and finally show it in a textbox. The problem I'm having is that special/foreign characters are shown as diamond question marks. The webpage says it's charset=ISO-8859-1.
I tried setting the encoding with the line that is commented out below, but it made no difference. I tried the other encodings as well like UTF8.
What shall I do to read this webpage in a better way?
Here's the code I have so far:
Code:
Dim myWebClient As New Net.WebClient()
'myWebClient.Encoding = System.Text.Encoding.GetEncoding("ISO-8859-1")
Dim myStream As IO.Stream = myWebClient.OpenRead(url)
Dim sr As New IO.StreamReader(myStream)
Dim myString As String = sr.ReadToEnd
myStream.Close()
-
Nov 14th, 2012, 11:50 AM
#2
Re: WebClient.Encoding and diamond question marks
It's likely that the encoding is fine... what isn't fine is the text box... that diamond question mark means that it recognizes that there is a character there, but it's not displayable given the current font... I take that back... it probably is the encoding, but it's on the text box side, not the stream/WebClient side. What's happening is that your stream comes back in a particular encoding, but the text box may not be in sync, so when you pass it the string to display, it doesn't know what to do with it. So it does what it can. I don't know that you can set the encoding on a text box, as it is pretty basic. You may want to look at the RichTextBox; it might allow for a little more flexibility.
-tg
-
Nov 14th, 2012, 12:07 PM
#3
Re: WebClient.Encoding and diamond question marks
How are these characters encoded in the HTML itself? CSS? Individual &H values? Could we have a sample? Have you tried looking at the Hex version of the text to determine exactly what is being saved? What about if you use the DownloadString method rather than a Stream?
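A minimal sketch of the "look at the hex" idea, assuming the page can be fetched with WebClient.DownloadData (which returns the undecoded bytes). The module name, the HexPreview helper and the placeholder URL are made up for illustration. As a reference point, ö is the single byte F6 in ISO-8859-1 and the two bytes C3 B6 in UTF-8.
vb.net Code:
Imports System
Imports System.Net

Module HexDumpSketch
    ' Hypothetical helper: formats the first maxBytes bytes as hex pairs, e.g. "C3-B6".
    Function HexPreview(data As Byte(), maxBytes As Integer) As String
        Dim count As Integer = Math.Min(maxBytes, data.Length)
        Return BitConverter.ToString(data, 0, count)
    End Function

    Sub Main()
        Dim url As String = "http://example.com/" ' placeholder, not the page from this thread
        Using client As New WebClient()
            ' DownloadData does no decoding, so these are the bytes the server actually sent
            Dim raw As Byte() = client.DownloadData(url)
            Console.WriteLine(HexPreview(raw, 64))
        End Using
    End Sub
End Module
In practice you would dump the bytes around one of the problem characters rather than the first 64 bytes of the page.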
-
Nov 14th, 2012, 12:39 PM
#4
Re: WebClient.Encoding and diamond question marks
Could you post the webpage url that is giving you problems?
-
Nov 15th, 2012, 01:20 AM
#5
Re: WebClient.Encoding and diamond question marks
A StreamReader uses UTF-8 encoding by default, so you could try modifying your code slightly:
vb.net Code:
Dim myString As String ' declared here to broaden its Scope
Using myWebClient As New Net.WebClient()
    Using myStream As IO.Stream = myWebClient.OpenRead(url)
        Using sr As New IO.StreamReader(myStream, System.Text.Encoding.GetEncoding("ISO-8859-1"))
            myString = sr.ReadToEnd
        End Using ' StreamReader
    End Using ' Stream
End Using ' WebClient
The WebClient.Encoding Property is used by the WebClient when it uploads or downloads Strings, so you might also try:
vb.net Code:
Dim myString As String ' declared here to broaden its Scope
Using myWebClient As New Net.WebClient()
    myWebClient.Encoding = System.Text.Encoding.GetEncoding("ISO-8859-1")
    myString = myWebClient.DownloadString(url)
End Using ' WebClient
although it defaults to the system's default encoding, so you could probably get away without explicitly setting the encoding in your particular case. It's best to set it explicitly when you can, though. See the MSDN for more details, and also the Remarks section here.
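To illustrate why the StreamReader's UTF-8 default produces the diamond question marks, here is a small sketch that needs no network access. It decodes the single ISO-8859-1 byte for ö both ways and prints the resulting code points. The module and variable names are made up for the example. Note that the replacement character U+FFFD is already in the string after decoding, before anything is handed to a text box.
vb.net Code:
Imports System
Imports System.Text

Module DecodeSketch
    Sub Main()
        ' "ö" as a single ISO-8859-1 byte; on its own this is not a valid UTF-8 sequence
        Dim latin1Bytes As Byte() = {&HF6}

        Dim asUtf8 As String = Encoding.UTF8.GetString(latin1Bytes)
        Dim asLatin1 As String = Encoding.GetEncoding("ISO-8859-1").GetString(latin1Bytes)

        ' Print code points rather than glyphs, so the console font does not matter
        Console.WriteLine("UTF-8 decode:      U+{0:X4}", AscW(asUtf8(0)))   ' U+FFFD (replacement character)
        Console.WriteLine("ISO-8859-1 decode: U+{0:X4}", AscW(asLatin1(0))) ' U+00F6 (ö)
    End Sub
End Module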
Last edited by Inferrd; Nov 15th, 2012 at 01:32 AM.
Reason: Noticed a Scoping issue with the way I modified the original code
-
Nov 16th, 2012, 04:23 AM
#6
Thread Starter
New Member
Re: WebClient.Encoding and diamond question marks
Thanks everyone for your suggestions, but especially Inferrd. Your first suggestion enabled me to move on. I didn't know you could set the encoding in StreamReader that way.
Code:
Dim sr As New IO.StreamReader(myStream, System.Text.Encoding.GetEncoding("ISO-8859-1"))
or even just
Code:
Dim sr As New IO.StreamReader(myStream, myWebClient.Encoding)
works.
However...
I'm actually scraping more than one webpage. The other one now started giving me problems with this new approach:
ö becomes Ã¶
ä becomes Ã¤
é becomes Ã©
etc...
So no diamond question marks this time; it's the other way around.
I've examined myWebClient.Encoding and can see no difference between the attributes of the two websites. I'm beginning to suspect that one of them is reporting one type of encoding, but is actually another one. At the same time, they both render properly in Chrome.
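For what it's worth, two characters appearing where one was expected is the pattern you get when UTF-8 bytes are decoded with a single-byte encoding such as ISO-8859-1. That the second page is really UTF-8 is an assumption here, not something the thread confirms. A small sketch of the effect, with made-up module and variable names:
vb.net Code:
Imports System
Imports System.Text

Module MojibakeSketch
    Sub Main()
        ' "ö" (U+00F6) encoded as UTF-8 is the two bytes C3 B6
        Dim utf8Bytes As Byte() = Encoding.UTF8.GetBytes(ChrW(&HF6).ToString())

        ' Decoding those bytes as ISO-8859-1 maps each byte to its own character
        Dim misread As String = Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes)

        For Each c As Char In misread
            Console.Write("U+{0:X4} ", AscW(c)) ' U+00C3 U+00B6, i.e. "Ã¶"
        Next
        Console.WriteLine()
    End Sub
End Module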
So this ugly solution is what I've come up with so far, basically hardcoding one behavior if the link sent to the function is a specific one. I guess I'll see how many special cases I have to define. Do you guys have any suggestion on how to improve this? Or maybe explain what's going on with these webpages?
Code:
Private Function ProcessURL(url As String) As String
    Dim myWebClient As New Net.WebClient()
    'this next line has no effect
    'myWebClient.Encoding = System.Text.Encoding.GetEncoding("ISO-8859-1")
    Dim myStream As IO.Stream = myWebClient.OpenRead(url)
    Dim myString As String
    If url Like "*codeword*" Then
        'either approach works here
        'Dim sr As New IO.StreamReader(myStream, System.Text.Encoding.GetEncoding("ISO-8859-1"))
        Dim sr As New IO.StreamReader(myStream, myWebClient.Encoding)
        myString = sr.ReadToEnd()
    Else
        Dim sr As New IO.StreamReader(myStream)
        myString = sr.ReadToEnd()
    End If
    myStream.Close()
    Return myString
End Function
P.S. I tried to "dim" the StreamReader before the if statement, but then I don't know how to enforce the encoding on it afterwards. Sorry, I don't know the proper lingo...
Last edited by InterClaw; Nov 16th, 2012 at 04:26 AM.
Reason: spelling
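One possible way to avoid hardcoding a special case per URL, offered as a sketch rather than a definitive fix: download the raw bytes, try a strict UTF-8 decode first, and fall back to ISO-8859-1 only if the bytes are not valid UTF-8. This assumes the pages are always one of those two encodings. The DecodeHtml helper and the module name are made up for illustration. Main exercises the helper with sample bytes so it can be tried without a network connection.
vb.net Code:
Imports System
Imports System.Net
Imports System.Text

Module DecodeWithFallbackSketch
    ' Hypothetical helper: strict UTF-8 first, ISO-8859-1 as the fallback
    Function DecodeHtml(raw As Byte()) As String
        ' Second argument True = throw on invalid bytes instead of substituting U+FFFD
        Dim strictUtf8 As New UTF8Encoding(False, True)
        Try
            Return strictUtf8.GetString(raw)
        Catch ex As DecoderFallbackException
            ' Every byte value maps to a character in ISO-8859-1, so this decode cannot fail
            Return Encoding.GetEncoding("ISO-8859-1").GetString(raw)
        End Try
    End Function

    Function ProcessURL(url As String) As String
        Using client As New WebClient()
            Return DecodeHtml(client.DownloadData(url))
        End Using
    End Function

    Sub Main()
        Dim fromLatin1 As String = DecodeHtml(New Byte() {&HF6})     ' ö as ISO-8859-1
        Dim fromUtf8 As String = DecodeHtml(New Byte() {&HC3, &HB6}) ' ö as UTF-8
        Console.WriteLine(AscW(fromLatin1(0)) = &HF6) ' True
        Console.WriteLine(AscW(fromUtf8(0)) = &HF6)   ' True
    End Sub
End Module
The trade-off is that an ISO-8859-1 page whose bytes happen to form valid UTF-8 would be misread, although that is unlikely for ordinary text. Reading the charset from the response's Content-Type header (WebClient.ResponseHeaders) is another option, provided the server reports it correctly.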