Thread: Saving a web page

  1. #1

    Thread Starter
    Addicted Member VBGangsta
    Join Date
    Aug 2003
    Location
    New York
    Posts
    219

    Saving a web page

    Hi, after my browser navigates to the URL that I want, I want my program to save that page in a file called result.htm. Here is how it gets to the URL: wbWeb.Navigate(url). Can someone help?
    -Rob

  2. #2
    Your Ad Here! Edneeis
    Join Date
    Feb 2000
    Location
    Moreno Valley, CA (SoCal)
    Posts
    7,339
    Upon further inspection it seems you can get it via the document object. Add a reference to the COM component 'Microsoft HTML Object Library', then add a NavigateComplete2 event handler:
    VB Code:
    Private Sub wbWeb_NavigateComplete2(ByVal sender As Object, ByVal e As AxSHDocVw.DWebBrowserEvents2_NavigateComplete2Event) Handles wbWeb.NavigateComplete2
        'the control exposes Document as Object, so cast it to the MSHTML document type
        Dim doc As mshtml.HTMLDocument = DirectCast(wbWeb.Document, mshtml.HTMLDocument)
        Dim sData As String = doc.documentElement.innerHTML 'this is the html of the page
    End Sub

    NOTE: If you are navigating to other pages and not just this one then you probably want to set up a flag of some sort so you only get the html on this page.
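
    For the original goal of ending up with a file called result.htm, one option is to write the captured string to disk. What follows is a minimal sketch that assumes the HTML is already held in a string; the SaveHtml helper, the module name, and the placeholder markup are hypothetical and not part of the code posted above.

    ```vbnet
    Imports System
    Imports System.IO

    Module SaveSketch
        ' Hypothetical helper: writes an HTML string to the given path,
        ' overwriting any existing file.
        Sub SaveHtml(ByVal html As String, ByVal path As String)
            Dim sw As New StreamWriter(path, False)
            Try
                sw.Write(html)
            Finally
                sw.Close()
            End Try
        End Sub

        Sub Main()
            ' Placeholder markup standing in for the sData string from the handler.
            SaveHtml("<html><body>placeholder</body></html>", "result.htm")
            Console.WriteLine(File.Exists("result.htm"))
        End Sub
    End Module
    ```

    Inside the form, the same call would presumably go at the end of the NavigateComplete2 handler, for example SaveHtml(sData, "result.htm").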

  3. #3

    Thread Starter
    Addicted Member VBGangsta
    How do I initiate that event? And how do I parse it?
    Last edited by VBGangsta; Oct 8th, 2003 at 09:42 PM.
    -Rob

  4. #4
    Your Ad Here! Edneeis
    You don't initiate the event; it is raised automatically once navigation to the page completes. You should already have the parsing code from the previous topics you've posted. It's the RegularExpressions stuff.

  5. #5

    Thread Starter
    Addicted Member VBGangsta
    I tried parsing it but no luck. It doesn't get all the code. But I thought of a different way. First it navigates to the web page that I need the HTML from. I save that page as result.htm. Then I open it and save it as a text file. Then I parse it with your code. This works because I have tried it, but I don't know how to download that web page. Do you?
    -Rob

  6. #6
    Your Ad Here! Edneeis
    Originally posted by Edneeis
    Upon further inspection it seems you can get it via the document object. Add a reference to the COM component 'Microsoft HTML Object Library', then add a NavigateComplete2 event handler:
    VB Code:
    Private Sub wbWeb_NavigateComplete2(ByVal sender As Object, ByVal e As AxSHDocVw.DWebBrowserEvents2_NavigateComplete2Event) Handles wbWeb.NavigateComplete2
        'the control exposes Document as Object, so cast it to the MSHTML document type
        Dim doc As mshtml.HTMLDocument = DirectCast(wbWeb.Document, mshtml.HTMLDocument)
        Dim sData As String = doc.documentElement.innerHTML 'this is the html of the page
    End Sub

    NOTE: If you are navigating to other pages and not just this one then you probably want to set up a flag of some sort so you only get the html on this page.

    That's what this does. There is no need to save it as an HTML page and then as a text file; that code puts the whole page in sData.

  7. #7

    Thread Starter
    Addicted Member VBGangsta
    I am trying that, but it keeps saying that sData can't be referred to before it is declared. But it is declared before it. I added the reference. Here's the code.
    VB Code:
    Public Class Form1
        Inherits System.Windows.Forms.Form

    #Region " Windows Form Designer generated code "

        Private Sub wbWeb_NavigateComplete2(ByVal sender As Object, ByVal e As AxSHDocVw.DWebBrowserEvents2_NavigateComplete2Event) Handles AxWebBrowser1.NavigateComplete2
            Dim doc As mshtml.HTMLDocument = AxWebBrowser1.Document
            Dim sData As String = doc.documentElement.innerHTML() 'this is the html of the page
        End Sub

        Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click

            AxWebBrowser1.Navigate("http://www.outwar.com/rankings.php?type=2&find=120&submit=go")

        End Sub

        Private Sub Button2_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button2.Click
            Dim sr As New IO.StreamReader(sData)
            Dim sData As String = sr.ReadToEnd
            sr.Close()

            Dim pattern As String = "(?<=\>)\w+(?=\<\/a\>)"
            Dim reg As New System.Text.RegularExpressions.Regex(pattern)
            Dim mcol As System.Text.RegularExpressions.MatchCollection = reg.Matches(sData)
            For Each m As System.Text.RegularExpressions.Match In mcol
                ListBox1.Items.Add(m.Value)
            Next
        End Sub

    End Class
    -Rob

  8. #8
    Your Ad Here! Edneeis
    You should read up on scope. If you declare a variable in one sub you can't access it in another. It actually doesn't exist outside of the sub it was declared in. If you need something to be reached from different subs/functions then declare it in the form itself.
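
    The scope rule can be shown outside a form as well. The following is a minimal console sketch, not the form code from this thread: the module and Sub names are hypothetical, and a module-level field stands in for a variable declared in the form itself.

    ```vbnet
    Imports System

    Module ScopeSketch
        ' Declared at module level (the analogue of declaring it in the form),
        ' so every Sub in this module can read and write it.
        Private sData As String = ""

        Sub LoadPage()
            ' Hypothetical stand-in for the HTML captured in NavigateComplete2.
            sData = "<a href=""#"">Rob</a>"
        End Sub

        Sub ShowPage()
            ' Compiles because sData is a field; a Dim inside LoadPage
            ' would not be visible from this Sub.
            Console.WriteLine(sData)
        End Sub

        Sub Main()
            LoadPage()
            ShowPage()
        End Sub
    End Module
    ```

    If the declaration of sData is moved inside LoadPage as a Dim, ShowPage no longer compiles, which is the same kind of error reported for Button2_Click.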

    I assume the button fills a list of some sort from the data on the web. So really you'll need to navigate there with every button click, right?

    Try this:
    VB Code:
    'all in the form

    Private CatchData As Boolean = False

    Private Sub wbWeb_NavigateComplete2(ByVal sender As Object, ByVal e As AxSHDocVw.DWebBrowserEvents2_NavigateComplete2Event) Handles wbWeb.NavigateComplete2
        If CatchData Then
            'get html (Document is typed as Object, so cast it)
            Dim doc As mshtml.HTMLDocument = DirectCast(wbWeb.Document, mshtml.HTMLDocument)
            Dim sData As String = doc.documentElement.innerHTML

            'convert html to list
            Dim pattern As String = "(?<=\> )\w+(?=\<\/a\> )"
            Dim reg As New System.Text.RegularExpressions.Regex(pattern)
            Dim mcol As System.Text.RegularExpressions.MatchCollection = reg.Matches(sData)
            For Each m As System.Text.RegularExpressions.Match In mcol
                ListBox1.Items.Add(m.Value)
            Next
            'reset flag
            CatchData = False
        End If
    End Sub

    Private Sub Button2_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button2.Click
        'set flag so the next completed navigation is captured
        CatchData = True
        wbWeb.Navigate("http://www.outwar.com/rankings.php?type=2&find=120&submit=go")
    End Sub
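
    To see what the lookbehind/lookahead pattern picks out, it can be run against a small string outside the form. This is a minimal sketch under stated assumptions: the sample markup is hypothetical (the real rankings page may look different), and RegexOptions.IgnoreCase is an addition made here because the innerHTML dump later in the thread shows upper-case tags. The pattern is the one without spaces from post #7; a space inside a lookaround, as in (?<=\> ), has to be present literally in the markup for a match.

    ```vbnet
    Imports System
    Imports System.Text.RegularExpressions

    Module RegexSketch
        Sub Main()
            ' Hypothetical sample markup with upper-case tags.
            Dim html As String = "<TD><A href=""profile.php?id=1"">Rob</A></TD>" & _
                                 "<TD><A href=""profile.php?id=2"">Ed_2</A></TD>"

            ' (?<=\>) requires a ">" immediately before the word,
            ' (?=\<\/a\>) requires "</a>" immediately after it.
            Dim reg As New Regex("(?<=\>)\w+(?=\<\/a\>)", RegexOptions.IgnoreCase)

            For Each m As Match In reg.Matches(html)
                Console.WriteLine(m.Value)
            Next
        End Sub
    End Module
    ```

    With this sample input the loop prints the two link texts, Rob and Ed_2. Without IgnoreCase the lower-case "</a>" in the pattern would not match the upper-case "</A>" in the sample.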

  9. #9
    Frenzied Member dynamic_sysop
    Join Date
    Jun 2003
    Location
    Ashby, Leicestershire.
    Posts
    1,142
    As Edneeis said, the string that holds the HTML must be available to all subs, not declared inside a sub. Here's a quick example using a RichTextBox to receive the string that holds the HTML and to save the HTML file to the hard drive:
    VB Code:
    Dim htmlDoc As mshtml.HTMLDocument '/// reference to Microsoft.mshtml.
    Dim source As String '/// this must not be inside a sub, but available to all subs.
    '/// below the windows generated code area^^^.

    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        AxWebBrowser1.Navigate("http://vbforums.com")
    End Sub

    Private Sub AxWebBrowser1_NavigateComplete2(ByVal sender As Object, ByVal e As AxSHDocVw.DWebBrowserEvents2_NavigateComplete2Event) Handles AxWebBrowser1.NavigateComplete2
        htmlDoc = DirectCast(AxWebBrowser1.Document, mshtml.HTMLDocument)
        source = htmlDoc.documentElement.innerHTML

        RichTextBox1.Text = source '/// test to see if source holds the html from the webpage.
        RichTextBox1.SaveFile("C:\someHtml.htm", RichTextBoxStreamType.PlainText) '/// save the htm file to a location on the harddrive.
    End Sub

    By the way, if you want the text but not the HTML tags, you can use the innerText property rather than innerHTML, e.g.:
    VB Code:
    source = htmlDoc.documentElement.innerText
    '/// gets the text of the website without the html tags ^^^.
    Last edited by dynamic_sysop; Oct 9th, 2003 at 03:21 AM.

  10. #10

    Thread Starter
    Addicted Member VBGangsta
    Here is the code I now have. I don't know what's going on: after I press the button to navigate to the page (which it does), nothing happens, and nothing gets added to the list box.

    VB Code:
    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click

        wbweb.Navigate("http://www.outwar.com/rankings.php?type=2&find=120&submit=go")

    End Sub
    Private Sub wbweb_NavigateComplete2(ByVal sender As Object, ByVal e As AxSHDocVw.DWebBrowserEvents2_NavigateComplete2Event) Handles wbweb.NavigateComplete2
        If CatchData Then
            'get html
            Dim sdata2 As String
            Dim doc As mshtml.HTMLDocument = wbweb.Document
            Dim sData As String = doc.documentElement.innerHTML
            sData = sData2
            'convert html to list
            Dim pattern As String = "(?<=\> )\w+(?=\<\/a\> )"
            Dim reg As New System.Text.RegularExpressions.Regex(pattern)
            Dim mcol As System.Text.RegularExpressions.MatchCollection = reg.Matches(sData)
            For Each m As System.Text.RegularExpressions.Match In mcol
                ListBox1.Items.Add(m.Value)

            Next
            'Reset(flag)
            CatchData = False
        End If
    End Sub
    -Rob

  11. #11
    Your Ad Here! Edneeis
    You didn't add all the code I posted. You forgot to declare the CatchData variable in the form and to set it to true before the navigate call.

  12. #12

    Thread Starter
    Addicted Member VBGangsta
    I did; it's above the Windows generated code. Still not working. Does it work for you?
    -Rob

  13. #13
    Your Ad Here! Edneeis
    Getting the web page worked, but I can't log in to get the correct HTML on the page that you are looking for. You still don't have CatchData = True right above the Navigate call.
    VB Code:
    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        CatchData = True
        wbweb.Navigate("http://www.outwar.com/rankings.php?type=2&find=120&submit=go")
    End Sub

  14. #14

    Thread Starter
    Addicted Member VBGangsta
    OK, I did that, but when I set a breakpoint and looked at the document, it is broken down further into subfolders. Are you sure this will work?
    -Rob

  15. #15
    Your Ad Here! Edneeis
    What doesn't work about it? Are you not getting the data into the string (sData)? Is it not finding the html stuff you are looking for? Are you getting an error?

  16. #16

    Thread Starter
    Addicted Member VBGangsta
    Nothing is being added to the list, so I am guessing that it is not getting the HTML properly. Why is it saving it to a .doc? Can't you just put it in a variable?
    -Rob

  17. #17
    Your Ad Here! Edneeis
    It doesn't save it as a .doc. doc is the name of a variable of the HTMLDocument type. You have to cast to this type so you can get the innerHTML of the document. But the page never gets saved anywhere. All its text gets put into the variable sData.

    You need to debug the app. Set a breakpoint or show the sData variable in a msgbox so you can see if you are getting the right html. Make sure it contains html like what you posted before.

  18. #18

    Thread Starter
    Addicted Member VBGangsta
    This is what I get as HTML. Why is this happening?

    doc.documentElement.innerHTML "<HEAD><TITLE>Outwar.com Round 12 - The land of Monsters, Gangsters, and Pop Stars!</TITLE>
    <META http-equiv=Content-Language content=en-us>
    <META http-equiv=Content-Type content="text/html; charset=windows-1252">
    <STYLE type=text/css>
    <!--
    #dek {POSITION:absolute;VISIBILITY:hidden;Z-INDEX:200;}
    //-->
    </STYLE>
    <LINK href="style.css" type=text/css rel=STYLESHEET>
    <SCRIPT language=JavaScript>
    <!--

    function SymError()
    {
    return true;
    }

    window.onerror = SymError;

    var SymRealWinOpen = window.open;

    function SymWinOpen(url, name, attributes)
    {
    return (new Object());
    }

    window.open = SymWinOpen;

    //-->
    </SCRIPT>
    </HEAD>" String
    -Rob

  19. #19
    Your Ad Here! Edneeis
    I don't know; you'll have to try and find another way, I guess. Or check other parts of the document object. I tried.

  20. #20

    Thread Starter
    Addicted Member VBGangsta
    I found another way, but I need to know how to save the web page I navigate to as a .htm file.
    -Rob

  21. #21

    Thread Starter
    Addicted Member VBGangsta
    Hey Edneeis, I talked to other programmers who do similar things to what I am doing, and they said that doing it the way you said is the right (and only) way to do it. That code is very close, except it's not getting all the information I need. Do you know a way of getting the whole HTML from the page, not just the inner text? Thanks.
    -Rob

  22. #22
    Your Ad Here! Edneeis
    No, innerText should be the inner text of the whole document. I was reading up, and it seems the problem is that you are getting the DOM after it is executed, but all the other methods bypass the login mechanism for the site and just get the page (so it won't have the data you want). Sorry, you'll have to research the Document object and see if you can find something. I don't know that much about it.
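
    For reference, the kind of direct download mentioned here can be sketched with System.Net.WebClient. This is an illustration only, using the vbforums.com URL from earlier in the thread as a stand-in and a hypothetical module name. It issues its own HTTP request outside the browser control, so it carries no login session, which is the limitation described in this post.

    ```vbnet
    Imports System
    Imports System.Net

    Module DownloadSketch
        Sub Main()
            Dim client As New WebClient()
            Try
                ' Fetches the raw response with a fresh request:
                ' no browser cookies and no logged-in session are sent.
                client.DownloadFile("http://vbforums.com", "result.htm")
            Finally
                client.Dispose()
            End Try
        End Sub
    End Module
    ```

    For a page that needs a login, the saved file would likely contain the login or public view rather than the data shown in the browser control.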

  23. #23

    Thread Starter
    Addicted Member VBGangsta
    What is this topic called so I know what to research?
    -Rob

  24. #24
    Your Ad Here! Edneeis
    I don't know; that'll be part of the research. Anything on the WebBrowser control's Document property.

  25. #25

    Thread Starter
    Addicted Member VBGangsta
    I found out that every other web page on the web works correctly with that code except the web page I want. This is so because there is something unique about the website, and I have yet to figure it out, but I will.
    -Rob

  26. #26

    Thread Starter
    Addicted Member VBGangsta
    I think it may possibly be getting the frame of the page, but I am not sure yet.
    -Rob
