-
May 15th, 2013, 07:50 AM
#1
Thread Starter
New Member
Getting webpage content of a website
Hi,
I have created a VB.NET app that gets the source code of a given URL. It works fine for hundreds of web pages, but I came across this particular website:
http://www.sbuniv.edu/
It returns HTTP status code 200 OK.
But when I tried to display the HTML source, it showed small square boxes, like how a Word document appears when you open it as plain text.
The site is in English and it's UTF-8. I am not sure what exactly the issue is. I know it has nothing to do with my code and is something to do with the site.
I think it is missing some header info, but my application does show the content type, so I am puzzled.
---------------------------
---------------------------
�
---------------------------
OK
---------------------------
-
May 15th, 2013, 09:01 AM
#2
Re: Getting webpage content of a website
I did get the "how it appears when you open Word in text" description, believe it or not, but how are you displaying the text? What code are you using? Are there any relevant errors? Without this information, we really can't help you too much. We can only speculate.
-
May 15th, 2013, 09:29 AM
#3
Thread Starter
New Member
Re: Getting webpage content of a website
There is no error.
Try
    ' Create the request and send a browser-like User-Agent
    Dim request As System.Net.HttpWebRequest = DirectCast(System.Net.WebRequest.Create(url), System.Net.HttpWebRequest)
    request.UserAgent = "Mozilla/5.0 (Windows NT 5.1; rv:20.0) Gecko/20100101 Firefox/20.0"
    Dim response As System.Net.HttpWebResponse = DirectCast(request.GetResponse(), System.Net.HttpWebResponse)
    ' Read the whole response body as text
    Dim sr As System.IO.StreamReader = New System.IO.StreamReader(response.GetResponseStream())
    source = sr.ReadToEnd()
    sr.Close()
    response.Close()
    MessageBox.Show("source:" & source)
Catch ex1 As Exception
    MessageBox.Show("errorlist0:" & ex1.Message)
    error_trace += "errortraceno1"
    error_log += "errortraceno1:" & ex1.Message
End Try
-
May 15th, 2013, 10:15 AM
#4
Re: Getting webpage content of a website
Hmm, I'm getting some odd output as well, but you also have to understand I'm terrible at web scraping. As for getting the full HTML source of a page, use this:
Code:
Dim sourceString As String = New System.Net.WebClient().DownloadString("http://www.google.edu/")
Console.WriteLine(sourceString)
As you can tell, it works for google, but it's not working for the site you gave in post #1. I'll try to look into it further.
-
May 15th, 2013, 10:20 AM
#5
Re: Getting webpage content of a website
Well, after tinkering with it, this works:
Code:
Dim wc As New System.Net.WebClient
wc.Encoding = System.Text.Encoding.UTF8
Dim sourceString As String = wc.DownloadString("http://www.google.edu/")
Console.WriteLine(sourceString)
I just set the encoding for the WebClient to UTF8.
-
May 15th, 2013, 10:45 AM
#6
Re: Getting webpage content of a website
Interesting - I wonder if this is a "weak attempt" at hiding page source??
-
May 15th, 2013, 11:00 AM
#7
Thread Starter
New Member
Re: Getting webpage content of a website
Hi,
Thanks, mate. In fact, I searched Google for how to add UTF-8 encoding using HttpWebRequest, and only then did I post here.
Can you tell me how to do it using HttpWebRequest rather than WebClient? HttpWebRequest has many advantages over WebClient.
Also, I have one more query. Let's say a site sends malformed headers, so it's not possible to get the HTTP status code directly from the header. Is there any other way to get the HTTP status code besides the header info? You know, not all sites have an error-free design.
-
May 15th, 2013, 11:21 AM
#8
Thread Starter
New Member
Re: Getting webpage content of a website
I also tried these two encodings:
Code:
Dim sr As System.IO.StreamReader = New System.IO.StreamReader(response.GetResponseStream(), System.Text.Encoding.UTF8)
Code:
Dim sr As System.IO.StreamReader = New System.IO.StreamReader(response.GetResponseStream(), System.Text.Encoding.GetEncoding("ISO-8859-1"))
But neither works for me.
-
May 15th, 2013, 11:35 AM
#9
Thread Starter
New Member
Re: Getting webpage content of a website
This WebClient code also does not get the HTML source.
 Originally Posted by dday9
Well after tinkering with it this works:
Code:
Dim wc As New System.Net.WebClient
wc.Encoding = System.Text.Encoding.UTF8
Dim sourceString As String = wc.DownloadString("http://www.google.edu/")
Console.WriteLine(sourceString)
I just set the encoding for the webclient to UTF8
-
May 15th, 2013, 12:34 PM
#10
Re: Getting webpage content of a website
Hmm, that should've worked... I have to say, I'm not too sure then. Like I said, I'm no expert in web scraping, so perhaps someone else should step in. This is about the most I can help.
-
May 15th, 2013, 01:13 PM
#11
Thread Starter
New Member
Re: Getting webpage content of a website
I tried many methods: byte reading, ReadToEnd, and Encoding.GetString(responseBuffer). Nothing works for me.
-
May 15th, 2013, 01:43 PM
#12
Re: Getting webpage content of a website
This is not related to the encoding but rather to the fact that the web server uses compression. To fix it, just add this line to your code (right after you've created your request object):
Code:
request.AutomaticDecompression = Net.DecompressionMethods.GZip Or Net.DecompressionMethods.Deflate
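For anyone who wants to see that line in context, here is a minimal sketch of the whole fetch, assuming a console or WinForms project that targets the .NET Framework. The module name `PageFetcher` and the helpers `CreateRequest` and `FetchHtml` are made up for illustration; only `AutomaticDecompression` and the user agent string come from this thread.

```vbnet
Imports System.IO
Imports System.Net

Module PageFetcher
    ' Hypothetical helper: builds a request that advertises gzip/deflate support
    ' and lets the framework decompress the response body transparently.
    Function CreateRequest(url As String) As HttpWebRequest
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
        request.UserAgent = "Mozilla/5.0 (Windows NT 5.1; rv:20.0) Gecko/20100101 Firefox/20.0"
        request.AutomaticDecompression = DecompressionMethods.GZip Or DecompressionMethods.Deflate
        Return request
    End Function

    ' Hypothetical helper: downloads the page and returns the body as text.
    ' StreamReader defaults to UTF-8, which matches the site discussed here.
    Function FetchHtml(url As String) As String
        Dim request As HttpWebRequest = CreateRequest(url)
        Using response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
            Using reader As New StreamReader(response.GetResponseStream())
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function
End Module
```

Calling `PageFetcher.FetchHtml("http://www.sbuniv.edu/")` should then return readable HTML instead of raw compressed bytes. The `Using` blocks close the reader and the response even if reading throws.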
-
May 15th, 2013, 03:57 PM
#13
Re: Getting webpage content of a website
 Originally Posted by Joacim Andersson
This is not related to the encoding but rather to the fact that the web server uses compression.
Does compression help to obfuscate the source of the page as well? Or am I totally off-base on this question anyway?
-
May 15th, 2013, 04:01 PM
#14
Re: Getting webpage content of a website
No, it doesn't really obfuscate anything, since gzip and deflate are open, standard formats and any modern web browser will decompress the response, so View Source will show everything. But then again, if you try to read a zipped text document using Notepad, it will look like it's been obfuscated. The compression is really only used to save on bandwidth.
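To illustrate the Notepad point, here is a small offline sketch that compresses a string with `GZipStream`, checks for the two-byte gzip signature (`1F 8B`), and decompresses it again. The module name `GzipDemo` and its three functions are hypothetical names used only for this example; nothing here touches the network.

```vbnet
Imports System.IO
Imports System.IO.Compression
Imports System.Text

Module GzipDemo
    ' Compresses UTF-8 text into a gzip byte array.
    Function Compress(content As String) As Byte()
        Using packedStream As New MemoryStream()
            Using gz As New GZipStream(packedStream, CompressionMode.Compress)
                Dim raw As Byte() = Encoding.UTF8.GetBytes(content)
                gz.Write(raw, 0, raw.Length)
            End Using
            ' ToArray still works after the GZipStream has closed the MemoryStream.
            Return packedStream.ToArray()
        End Using
    End Function

    ' Decompresses a gzip byte array back into UTF-8 text.
    Function Decompress(data As Byte()) As String
        Using source As New MemoryStream(data)
            Using gz As New GZipStream(source, CompressionMode.Decompress)
                Using reader As New StreamReader(gz, Encoding.UTF8)
                    Return reader.ReadToEnd()
                End Using
            End Using
        End Using
    End Function

    ' Gzip data always begins with the signature bytes 1F 8B.
    Function LooksGzipped(data As Byte()) As Boolean
        Return data.Length >= 2 AndAlso data(0) = &H1F AndAlso data(1) = &H8B
    End Function
End Module
```

Decoding the output of `Compress` directly as UTF-8 text gives the kind of square-box garbage shown in post #1, while `Decompress` returns the original string unchanged. A check like `LooksGzipped` on the first bytes of a response is one quick way to tell a compression problem from an encoding problem.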
-
May 15th, 2013, 04:07 PM
#15
Re: Getting webpage content of a website
Oh well - I thought I might be able to hide some of my web code from casual view - guess I'll step out of this thread now!
Bows head - backs out the door...
-
May 16th, 2013, 12:53 AM
#16
Thread Starter
New Member
Re: Getting webpage content of a website
Hi Joacim, it works, thanks.
 Originally Posted by Joacim Andersson
This is not related to the encoding but rather to the fact that the web server uses compression. To fix it, just add this line to your code (right after you've created your request object):
Code:
request.AutomaticDecompression = Net.DecompressionMethods.GZip Or Net.DecompressionMethods.Deflate