Thread: Getting webpage content of a website

  1. #1

    Thread Starter
    New Member
    Join Date
    May 2013
    Posts
    11

    Getting webpage content of a website

    Hi,

    I have created a VB.NET app that gets the source code of a given URL. It works fine for hundreds of web pages, but I came across this particular website:
    http://www.sbuniv.edu/

    It returns HTTP status code 200 OK.
    But when I tried to display the HTML source, it shows small square boxes, like "how it appear when you open word in text".
    The site is in English and it's UTF-8. I am not sure what exactly the issue is. I know it's nothing to do with my code and it is something to do with the site.
    I think it is missing some header info, but my application does show the content type, so I am puzzled.

    ---------------------------

    ---------------------------
    �
    ---------------------------
    OK
    ---------------------------

  2. #2
    Super Moderator dday9's Avatar
    Join Date
    Mar 2011
    Posts
    12,382

    Re: Getting webpage content of a website

    I did get "how it appear when you open word in text" believe it or not, but how are you displaying the text? What code are you using? Are there any relevant errors that occur? Without this information, we really can't help you too much. We can only speculate.
    "Code is like humor. When you have to explain it, it is bad." - Cory House
    VbLessons | HtmlLessons | CssLessons | Code Tags | Sword of Fury - Jameram

  3. #3

    Thread Starter
    New Member
    Join Date
    May 2013
    Posts
    11

    Re: Getting webpage content of a website

    There is no error.

    Try
        ' url, source, error_trace and error_log are presumably declared elsewhere in the app.
        Dim request As System.Net.HttpWebRequest = DirectCast(System.Net.WebRequest.Create(url), System.Net.HttpWebRequest)
        request.UserAgent = "Mozilla/5.0 (Windows NT 5.1; rv:20.0) Gecko/20100101 Firefox/20.0"
        Dim response As System.Net.HttpWebResponse = DirectCast(request.GetResponse(), System.Net.HttpWebResponse)
        Dim sr As New System.IO.StreamReader(response.GetResponseStream())
        source = sr.ReadToEnd()
        ' Close the reader before the response it wraps.
        sr.Close()
        response.Close()
        MessageBox.Show("source:" & source)
    Catch ex1 As Exception
        MessageBox.Show("errorlist0:" & ex1.Message)
        error_trace += "errortraceno1"
        error_log += "errortraceno1:" & ex1.Message
    End Try

  4. #4
    Super Moderator dday9's Avatar
    Join Date
    Mar 2011
    Posts
    12,382

    Re: Getting webpage content of a website

    Hmm, I'm getting some odd output as well, but you also have to understand I'm terrible at web scraping. To get the full HTML source of a page, use this:
    Code:
            Dim sourceString As String = New System.Net.WebClient().DownloadString("http://www.google.edu/")
            Console.WriteLine(sourceString)
    As you can tell, it works for Google, but it's not working for the site you gave in post #1. I'll try to look into it further.

  5. #5
    Super Moderator dday9's Avatar
    Join Date
    Mar 2011
    Posts
    12,382

    Re: Getting webpage content of a website

    Well after tinkering with it this works:
    Code:
            Dim wc As New System.Net.WebClient
            wc.Encoding = System.Text.Encoding.UTF8
            Dim sourceString As String = wc.DownloadString("http://www.google.edu/")
            Console.WriteLine(sourceString)
    I just set the encoding for the WebClient to UTF8.

  6. #6
    MS SQL Powerposter szlamany's Avatar
    Join Date
    Mar 2004
    Location
    Connecticut
    Posts
    18,263

    Re: Getting webpage content of a website

    Interesting - I wonder if this is a "weak attempt" at hiding page source??

    *** Read the sticky in the DB forum about how to get your question answered quickly!! ***

    Please remember to rate posts! Rate any post you find helpful - even in old threads! Use the link to the left - "Rate this Post".

    Some Informative Links:
    [ SQL Rules to Live By ] [ Reserved SQL keywords ] [ When to use INDEX HINTS! ] [ Passing Multi-item Parameters to STORED PROCEDURES ]
    [ Solution to non-domain Windows Authentication ] [ Crazy things we do to shrink log files ] [ SQL 2005 Features ] [ Loading Pictures from DB ]

    MS MVP 2006, 2007, 2008

  7. #7

    Thread Starter
    New Member
    Join Date
    May 2013
    Posts
    11

    Re: Getting webpage content of a website

    hi,
    Thanks, mate. In fact, I searched Google for how to set UTF-8 encoding with HttpWebRequest, and only then did I post here.
    Can you tell me how to do it using HttpWebRequest rather than WebClient? HttpWebRequest has many advantages over WebClient.

    I also have one more query. Let's say a site sends malformed headers, so it's not possible to read the HTTP status code directly from the header. Is there any other way to get the HTTP status code besides the header info? Not all sites have an error-free design.

  8. #8

    Thread Starter
    New Member
    Join Date
    May 2013
    Posts
    11

    Re: Getting webpage content of a website

    I also tried these two encodings:

    Code:
    Dim sr As System.IO.StreamReader = New System.IO.StreamReader(response.GetResponseStream(), System.Text.Encoding.UTF8)
    Code:
    Dim sr As System.IO.StreamReader = New System.IO.StreamReader(response.GetResponseStream(), System.Text.Encoding.GetEncoding("ISO-8859-1"))
    But neither works for me.

  9. #9

    Thread Starter
    New Member
    Join Date
    May 2013
    Posts
    11

    Re: Getting webpage content of a website

    This WebClient code is also not getting the HTML source.

    Quote Originally Posted by dday9 View Post
    Well after tinkering with it this works:
    Code:
            Dim wc As New System.Net.WebClient
            wc.Encoding = System.Text.Encoding.UTF8
            Dim sourceString As String = wc.DownloadString("http://www.google.edu/")
            Console.WriteLine(sourceString)
    I just set the encoding for the webclient to UTF8

  10. #10
    Super Moderator dday9's Avatar
    Join Date
    Mar 2011
    Posts
    12,382

    Re: Getting webpage content of a website

    Hmm, that should've worked... I have to say, I'm not too sure then. Like I said, I'm no expert in web scraping, so perhaps someone else should step in. This is about the most I could help.

  11. #11

    Thread Starter
    New Member
    Join Date
    May 2013
    Posts
    11

    Re: Getting webpage content of a website

    I tried many methods: reading the bytes, ReadToEnd, Encoding.GetString(responseBuffer). Nothing is working for me.

  12. #12
    I'm about to be a PowerPoster! Joacim Andersson's Avatar
    Join Date
    Jan 1999
    Location
    Sweden
    Posts
    14,649

    Re: Getting webpage content of a website

    This is not related to the encoding but rather to the fact that the web server uses compression. To fix it, just add this line to your code (right after you've created your request object):
    Code:
    request.AutomaticDecompression = Net.DecompressionMethods.GZip Or Net.DecompressionMethods.Deflate
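
    For context, here is a minimal sketch of how that line might sit in a complete download routine on the .NET Framework. The module name FetchPageSketch and the function GetPageSource are made up for illustration and are not from the thread starter's app; the URL is the one from post #1.

    ```vbnet
    Imports System.IO
    Imports System.Net

    Module FetchPageSketch
        ' Hypothetical helper: downloads a page and returns its HTML as a string.
        Function GetPageSource(url As String) As String
            Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
            ' Ask the framework to undo gzip/deflate before the stream is read.
            request.AutomaticDecompression = DecompressionMethods.GZip Or DecompressionMethods.Deflate

            Using response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
                ' StreamReader defaults to UTF-8 when no encoding is passed.
                Using reader As New StreamReader(response.GetResponseStream())
                    Return reader.ReadToEnd()
                End Using
            End Using
        End Function

        Sub Main()
            Console.WriteLine(GetPageSource("http://www.sbuniv.edu/"))
        End Sub
    End Module
    ```

    The Using blocks dispose the reader and the response even if ReadToEnd throws, which is why there are no explicit Close calls.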

  13. #13
    MS SQL Powerposter szlamany's Avatar
    Join Date
    Mar 2004
    Location
    Connecticut
    Posts
    18,263

    Re: Getting webpage content of a website

    Quote Originally Posted by Joacim Andersson View Post
    This is not related to the encoding but rather to the fact that the web server uses compression.
    Does compression help to obfuscate the source of the page as well? Or am I totally off-base on this question anyway?


  14. #14
    I'm about to be a PowerPoster! Joacim Andersson's Avatar
    Join Date
    Jan 1999
    Location
    Sweden
    Posts
    14,649

    Re: Getting webpage content of a website

    No, it doesn't really obfuscate anything, since the compression formats (gzip and deflate) are open standards and any modern web browser will decompress the response, so View Source will show everything. But then again, if you try to read a zipped text document using Notepad it will look like it's been obfuscated. The compression is really only used to save on bandwidth usage.
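
    To illustrate the Notepad comparison, here is a small self-contained sketch. It assumes nothing about the site in post #1: the HTML string is made up, and the gzip round trip happens in memory. Decoding the compressed bytes as text prints unreadable characters, while decompressing first prints the original markup.

    ```vbnet
    Imports System.IO
    Imports System.IO.Compression
    Imports System.Text

    Module CompressionDemo
        Sub Main()
            ' Made-up page content, for illustration only.
            Dim html As String = "<html><body>Hello</body></html>"

            ' Compress the text the way a server would for a gzip-encoded response.
            Dim compressed As Byte()
            Using output As New MemoryStream()
                Using zipper As New GZipStream(output, CompressionMode.Compress)
                    Dim raw As Byte() = Encoding.UTF8.GetBytes(html)
                    zipper.Write(raw, 0, raw.Length)
                End Using
                ' ToArray still works after the GZipStream has closed the MemoryStream.
                compressed = output.ToArray()
            End Using

            ' Treating the compressed bytes as UTF-8 text gives unreadable characters.
            Console.WriteLine(Encoding.UTF8.GetString(compressed))

            ' Decompressing first recovers the original markup.
            Using compressedStream As New MemoryStream(compressed)
                Using unzipper As New GZipStream(compressedStream, CompressionMode.Decompress)
                    Using reader As New StreamReader(unzipper, Encoding.UTF8)
                        Console.WriteLine(reader.ReadToEnd())
                    End Using
                End Using
            End Using
        End Sub
    End Module
    ```

    This is also why changing the StreamReader encoding earlier in the thread made no difference: the bytes were still compressed when they were decoded.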

  15. #15
    MS SQL Powerposter szlamany's Avatar
    Join Date
    Mar 2004
    Location
    Connecticut
    Posts
    18,263

    Re: Getting webpage content of a website

    Oh well - I thought I might be able to hide some of my web code from casual view - guess I'll step out of this thread now!

    Bows head - backs out the door...


  16. #16

    Thread Starter
    New Member
    Join Date
    May 2013
    Posts
    11

    Re: Getting webpage content of a website

    Hi Joacim, it works, thanks.
    Quote Originally Posted by Joacim Andersson View Post
    This is not related to the encoding but rather to the fact that the web server uses compression. To fix it, just add this line to your code (right after you've created your request object):
    Code:
    request.AutomaticDecompression = Net.DecompressionMethods.GZip Or Net.DecompressionMethods.Deflate
