
Thread: [RESOLVED] Downloading html text with API

  1. #1

    Thread Starter
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    [RESOLVED] Downloading html text with API

    I wrote this function to download html text, but the problem I'm having is it seems to be reading it from a hidden cache.

    Code:
    Private Declare Function InternetCloseHandle Lib "wininet.dll" (ByVal hInet As Long) As Integer
    Private Declare Function InternetOpen Lib "wininet.dll" Alias "InternetOpenA" (ByVal sAgent As String, ByVal lAccessType As Long, ByVal sProxyName As String, ByVal sProxyBypass As String, ByVal lFlags As Long) As Long
    Private Declare Function InternetOpenUrl Lib "wininet.dll" Alias "InternetOpenUrlA" (ByVal hOpen As Long, ByVal sUrl As String, ByVal sHeaders As String, ByVal lLength As Long, ByVal lFlags As Long, ByVal lContext As Long) As Long
    Private Declare Function InternetReadFile Lib "wininet.dll" (ByVal hFile As Long, ByVal sBuffer As String, ByVal lNumBytesToRead As Long, lNumberOfBytesRead As Long) As Integer
    
    Public Function DownloadURL(ByVal URL As String) As String
        Const INTERNET_OPEN_TYPE_PRECONFIG = 0
        Const INTERNET_OPEN_TYPE_DIRECT = 1
        Const INTERNET_OPEN_TYPE_PROXY = 3
        Const scUserAgent = "VB Project"
        Const INTERNET_FLAG_RELOAD = &H80000000
        Dim lngOpen As Long
        Dim lngOpenURL As Long
        Dim blnReturn As Boolean
        Dim strReadBuffer As String * 2048
        Dim lngNumberOfBytesRead As Long
        Dim strBuffer As String
    
        lngOpen = InternetOpen(scUserAgent, INTERNET_OPEN_TYPE_PRECONFIG, vbNullString, vbNullString, 0)
        lngOpenURL = InternetOpenUrl(lngOpen, URL, vbNullString, 0, INTERNET_FLAG_RELOAD, 0)
        Do
            strReadBuffer = vbNullString
            blnReturn = InternetReadFile(lngOpenURL, strReadBuffer, Len(strReadBuffer), lngNumberOfBytesRead)
            strBuffer = strBuffer & Left$(strReadBuffer, lngNumberOfBytesRead)
            If Not CBool(lngNumberOfBytesRead) Then Exit Do
        Loop
        If lngOpenURL <> 0 Then InternetCloseHandle (lngOpenURL)
        If lngOpen <> 0 Then InternetCloseHandle (lngOpen)
        DownloadURL = strBuffer
    End Function
    Here's what I did:

    1) Ran my function to grab a wiki page.
    2) Manually corrected some text on that same wiki page using Firefox.
    3) Manually cleared all internet cache on Internet Explorer.
    4) Manually cleared all internet cache on Firefox. (Just because.)
    5) Re-ran my function to grab the new corrected version of that same wiki page.

    The function is still returning the original text, from before I made the changes in step 2. If I go to the page in Firefox I get the new correct text, but for whatever reason my function is returning the old text.

    I also notice that Internet Explorer (11) is showing the old text when I go to the page, even though I deleted all my cache. I manually Ctrl+F5'ed to force a refresh, but still no joy. Still in IE11, I chose Edit Page and the edit text shows the correct text. If I now hit the Back button, IE11 finally shows the correct text. But even after all that, my function still returns the old text.

    Please help! There's something like 75 pages that I crawl, and there's no way I can manually do all that hoop-jumping each time I need to update my data. Especially since I don't know what may or may not have changed, as (obviously) I'm not the only person who maintains this particular wiki.

  2. #2
    Default Member Bonnie West's Avatar
    Join Date
    Jun 2012
    Location
    InIDE
    Posts
    4,060

    Re: Downloading html text with API

    Try including the INTERNET_FLAG_DONT_CACHE a.k.a. INTERNET_FLAG_NO_CACHE_WRITE flag:

    Quote Originally Posted by Bonnie West View Post
    Code:
    
    Private Declare Function CloseHandle Lib "kernel32.dll" (ByVal hObject As Long) As Long
    Private Declare Function CreateFileW Lib "kernel32.dll" (ByVal lpFileName As Long, ByVal dwDesiredAccess As Long, ByVal dwShareMode As Long, ByVal lpSecurityAttributes As Long, ByVal dwCreationDisposition As Long, Optional ByVal dwFlagsAndAttributes As Long, Optional ByVal hTemplateFile As Long) As Long
    Private Declare Function GetQueueStatus Lib "user32.dll" (ByVal Flags As Long) As Long
    Private Declare Function InternetCloseHandle Lib "wininet.dll" (ByVal hInternet As Long) As Long
    Private Declare Function InternetOpenW Lib "wininet.dll" (ByVal lpszAgent As Long, ByVal dwAccessType As Long, ByVal lpszProxyName As Long, ByVal lpszProxyBypass As Long, ByVal dwFlags As Long) As Long
    Private Declare Function InternetOpenUrlW Lib "wininet.dll" (ByVal hInternet As Long, ByVal lpszUrl As Long, ByVal lpszHeaders As Long, ByVal dwHeadersLength As Long, ByVal dwFlags As Long, ByVal dwContext As Long) As Long
    Private Declare Function InternetReadFile Lib "wininet.dll" (ByVal hFile As Long, ByVal lpBuffer As Long, ByVal dwNumberOfBytesToRead As Long, ByRef lpdwNumberOfBytesRead As Long) As Long
    Private Declare Function SysReAllocStringLen Lib "oleaut32.dll" (ByVal pBSTR As Long, Optional ByVal pszStrPtr As Long, Optional ByVal Length As Long) As Long
    Private Declare Function WriteFile Lib "kernel32.dll" (ByVal hFile As Long, ByVal lpBuffer As Long, ByVal nNumberOfBytesToWrite As Long, Optional ByRef lpNumberOfBytesWritten As Long, Optional ByVal lpOverlapped As Long) As Long
    
    'Downloads the remote file specified by the sURL argument to the local file pointed
    'to by the sFileName parameter. The optional Chunk parameter determines the number
    'of bytes to be downloaded at a time. Bigger chunks download faster while smaller
    'ones enable the GUI to be more responsive. Returns the total number of bytes
    'successfully written to disk. Maximum download size of 2,047.99 MB only.
    
    Public Function DownloadURL2File(ByRef sURL As String, ByRef sFileName As String, Optional ByVal Chunk As Long = 1024&) As Long
        Const INTERNET_OPEN_TYPE_DIRECT = 1&, INTERNET_FLAG_DONT_CACHE = &H4000000, INTERNET_FLAG_RELOAD = &H80000000
        Const GENERIC_WRITE = &H40000000, FILE_SHARE_NONE = 0&, CREATE_ALWAYS = 2&, QS_ALLINPUT = &H4FF&
        Const INVALID_HANDLE_VALUE = -1&, ERROR_INSUFFICIENT_BUFFER = &H7A&
        Dim hInternet As Long, hURL As Long, hFile As Long, nBytesRead As Long, nBytesWritten As Long
        Dim bSuccess As Boolean, sBuffer_Ptr As Long, sBuffer_Size As Long, sBuffer As String
    
        Select Case True
            Case LenB(sURL) = 0&, LenB(sFileName) = 0&, Chunk < 2&:  Exit Function
        End Select
    
        hInternet = InternetOpenW(StrPtr(App.Title), INTERNET_OPEN_TYPE_DIRECT, 0&, 0&, 0&)
        If hInternet Then
            hURL = InternetOpenUrlW(hInternet, StrPtr(sURL), 0&, 0&, INTERNET_FLAG_DONT_CACHE Or INTERNET_FLAG_RELOAD, 0&)
            If hURL Then
                hFile = CreateFileW(StrPtr(sFileName), GENERIC_WRITE, FILE_SHARE_NONE, 0&, CREATE_ALWAYS) 'Overwrite existing
                If hFile <> INVALID_HANDLE_VALUE Then
                    Do: SysReAllocStringLen VarPtr(sBuffer), , (sBuffer_Size + Chunk) * 0.5!
                        sBuffer_Size = LenB(sBuffer):   sBuffer_Ptr = StrPtr(sBuffer)
                        Do While InternetReadFile(hURL, sBuffer_Ptr, sBuffer_Size, nBytesRead)
                            If nBytesRead Then
                                bSuccess = (WriteFile(hFile, sBuffer_Ptr, nBytesRead, nBytesWritten) <> 0&) _
                                            And (nBytesWritten = nBytesRead): Debug.Assert bSuccess
                                If bSuccess Then DownloadURL2File = DownloadURL2File + nBytesWritten
                                If GetQueueStatus(QS_ALLINPUT) And &HFFFF0000 Then DoEvents
                            Else
                                Exit Do
                            End If
                        Loop
                    Loop While Err.LastDllError = ERROR_INSUFFICIENT_BUFFER
                    hFile = CloseHandle(hFile):                               Debug.Assert hFile
                End If
                hURL = InternetCloseHandle(hURL):                             Debug.Assert hURL
            End If
            hInternet = InternetCloseHandle(hInternet):                       Debug.Assert hInternet
        End If
    End Function
     
    

    BTW, the return values of your InternetCloseHandle and InternetReadFile APIs should be Long rather than Integer.
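
    For reference, a hedged sketch of how that correction might look when applied to the declarations from post #1. A Win32 BOOL is a 32-bit value, which is why Long is the matching VB6 type. The module-level Const lines are an assumption made for illustration (post #1 declares its constants inside the function), and the commented call combines the two flags the same way DownloadURL2File does:

    Code:
    'Declarations from post #1 with the return types changed from Integer to Long
    Private Declare Function InternetCloseHandle Lib "wininet.dll" (ByVal hInet As Long) As Long
    Private Declare Function InternetReadFile Lib "wininet.dll" (ByVal hFile As Long, ByVal sBuffer As String, ByVal lNumBytesToRead As Long, lNumberOfBytesRead As Long) As Long

    'Assumed placement: module level instead of inside DownloadURL
    Private Const INTERNET_FLAG_RELOAD As Long = &H80000000
    Private Const INTERNET_FLAG_NO_CACHE_WRITE As Long = &H4000000

    'Inside DownloadURL, the open call would then become:
    '   lngOpenURL = InternetOpenUrl(lngOpen, URL, vbNullString, 0, _
    '       INTERNET_FLAG_RELOAD Or INTERNET_FLAG_NO_CACHE_WRITE, 0)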
    Last edited by Bonnie West; Sep 24th, 2015 at 09:15 AM.
    On Local Error Resume Next: If Not Empty Is Nothing Then Do While Null: ReDim i(True To False) As Currency: Loop: Else Debug.Assert CCur(CLng(CInt(CBool(False Imp True Xor False Eqv True)))): Stop: On Local Error GoTo 0
    Declare Sub CrashVB Lib "msvbvm60" (Optional DontPassMe As Any)

  3. #3

    Thread Starter
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    Re: Downloading html text with API

    No joy, still getting the old cached value.

    The old cached value is also showing when I open the page in IE11. How do I clear the cache in IE11? (I only ever use Firefox.) If I could just do that maybe this would work. When I choose "Gear Icon" => Safety => Delete Browsing History, check Temporary Internet Files and clear it, IE11 says files are cleared. But it's clearly lying to me.

    Note that I don't need this to work for other people, just me, so if there's some manual task I have to do to clear the cache to make this work that's fine by me.

  4. #4
    PowerPoster
    Join Date
    Jan 2008
    Posts
    11,074

    Re: Downloading html text with API

    Quote Originally Posted by Ellis Dee View Post
    No joy, still getting the old cached value.

    The old cached value is also showing when I open the page in IE11. How do I clear the cache in IE11? (I only ever use Firefox.) If I could just do that maybe this would work. When I choose "Gear Icon" => Safety => Delete Browsing History, check Temporary Internet Files and clear it, IE11 says files are cleared. But it's clearly lying to me.

    Note that I don't need this to work for other people, just me, so if there's some manual task I have to do to clear the cache to make this work that's fine by me.
    If you can't clear it using code, then clear it yourself: navigate to the cache folders, select all and delete, provided Windows lets you do this. I can do it on my PC, but that's me.
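
    If clearing by code is preferred, one possible sketch uses WinINet's DeleteUrlCacheEntry to drop a single URL from the local WinINet cache. PurgeCachedURL is a hypothetical helper name, not something from this thread. It only touches the local cache, so it can't help if the stale copy is being served by a proxy or by the web server itself:

    Code:
    Private Declare Function DeleteUrlCacheEntry Lib "wininet.dll" Alias "DeleteUrlCacheEntryA" (ByVal lpszUrlName As String) As Long

    'Hypothetical helper: removes one URL from the local WinINet cache.
    'Returns True if an entry was deleted, False otherwise (for example
    'when the URL was never cached in the first place).
    Public Function PurgeCachedURL(ByVal URL As String) As Boolean
        PurgeCachedURL = (DeleteUrlCacheEntry(URL) <> 0)
    End Function

    Calling PurgeCachedURL on each page's URL right before downloading it would be the obvious way to try this.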


    Anything I post is an example only and is not intended to be the only solution, the total solution nor the final solution to your request nor do I claim that it is. If you find it useful then it is entirely up to you to make whatever changes necessary you feel are adequate for your purposes.

  5. #5

    Thread Starter
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    Re: Downloading html text with API

    Quote Originally Posted by jmsrickland View Post
    If you can't clear it using code, then clear it yourself: navigate to the cache folders, select all and delete, provided Windows lets you do this. I can do it on my PC, but that's me.
    Good thought, but still no joy.

    I navigated to "C:\Users\[Me]\AppData\Local\Microsoft\Windows\Temporary Internet Files" (after choosing "Show System Files" in explorer) and deleted every one of the hundreds of files still in there.

    Re-ran my crawler and...still the old values are retrieved.

    EDIT: Though when I went to empty my recycle bin just to be sure, the recycle bin shows as empty even though I just deleted hundreds of files.

  6. #6
    VB-aholic & Lovin' It LaVolpe's Avatar
    Join Date
    Oct 2007
    Location
    Beside Waldo
    Posts
    19,541

    Re: Downloading html text with API

    Maybe supply an added fake query, i.e... "www.somesite.com/?rnd=1234"

    The above tweak seems to work every time for me. In practice I replace 1234 with a truly random number, and I parse the URL to see whether a query is already included so that I can append my random query with the proper prefix: ? if there is no query yet, & if there is. The rnd parameter name is fixed, but it was just 3 characters I chose on a whim.

    Easy enough to test.
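
    A minimal sketch of that idea, for anyone who wants a starting point. AppendCacheBuster is a hypothetical name, the 6-digit range is an arbitrary choice, and the #fragment handling is an extra assumption that may not matter for your URLs:

    Code:
    'Hypothetical helper: appends a throwaway "rnd" query parameter so that
    'the URL differs on every call and a cached copy never matches.
    'Call Randomize once at startup so the sequence differs between runs.
    Public Function AppendCacheBuster(ByVal URL As String) As String
        Dim strFragment As String
        Dim lngPos As Long

        'Split off any #fragment so the query is inserted before it
        lngPos = InStr(URL, "#")
        If lngPos > 0 Then
            strFragment = Mid$(URL, lngPos)
            URL = Left$(URL, lngPos - 1)
        End If

        'Use "&" when the URL already carries a query string, "?" otherwise
        If InStr(URL, "?") > 0 Then
            URL = URL & "&"
        Else
            URL = URL & "?"
        End If

        AppendCacheBuster = URL & "rnd=" & CStr(CLng(Int(Rnd * 1000000))) & strFragment
    End Function

    Passing AppendCacheBuster("http://www.somesite.com/") to the download function would then request something like http://www.somesite.com/?rnd=482913, with a different number on each call.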
    Insomnia is just a byproduct of, "It can't be done"

    Classics Enthusiast? Here's my 1969 Mustang Mach I Fastback. Her sister '67 Coupe has been adopted

    Newbie? Novice? Bored? Spend a few minutes browsing the FAQ section of the forum.
    Read the HitchHiker's Guide to Getting Help on the Forums.
    Here is the list of TAGs you can use to format your posts
    Here are VB6 Help Files online


    {Alpha Image Control} {Memory Leak FAQ} {Unicode Open/Save Dialog} {Resource Image Viewer/Extractor}
    {VB and DPI Tutorial} {Manifest Creator} {UserControl Button Template} {stdPicture Render Usage}

  7. #7

    Thread Starter
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    Re: Downloading html text with API

    Sweet, that works!

    Let me tell you how far down the rabbit-hole I was going. First, I found this article which mentions the simple Temporary Internet Files folder, which I had already tried manually clearing. But the second thing it mentions is index.dat:

    And then, where is the index.dat file located in Windows 7 | 8? Index.dat are files hidden on your computer that contain all of the Web sites that you have ever visited. Every URL, and every Web page is listed there. To access it, you will have to type in Explorers address bar the following location and click go:

    C:\Users\username\AppData\Local\Microsoft\Windows\Temporary Internet Files\Content.IE5

    Only then will you be able to see the index.dat file. Conclusion ? The Content.IE5 folder is super hidden!
    Typing that address into Windows explorer -- substituting my username, of course -- did indeed bring up a HUGE cache of internet files that were not previously visible. Despite only ever using Firefox on this computer, the index.dat monstrosity contained tens of thousands of files taking up many gigabytes of space. What the heck? So I manually deleted all of them. STILL no joy.

    So then I found this thread, where apparently the discussion is based around trying to clear out the complete, permanent, un-deletable and invisible cache of every website you've ever visited that is automatically maintained by Windows.

    Jeezum-crow, what happened to my innocence? I thought the index.dat thing was bad, but when deleting even that didn't clear my cache I'm starting to think that Windows truly does remember everywhere you've ever been on the internet, regardless of what browser you used to get there.

  8. #8
    PowerPoster
    Join Date
    Jan 2008
    Posts
    11,074

    Re: [RESOLVED] Downloading html text with API

    Maybe if you had rebooted it might have completed the deletion



  9. #9

    Thread Starter
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    Re: [RESOLVED] Downloading html text with API

    I'll test it next time I have to reboot.

  10. #10
    VB-aholic & Lovin' It LaVolpe's Avatar
    Join Date
    Oct 2007
    Location
    Beside Waldo
    Posts
    19,541

    Re: [RESOLVED] Downloading html text with API

    Quote Originally Posted by Ellis Dee
    Sweet, that works!
    @Ellis. Just to be clear, that random query must be different each time you use it. You may actually want to call Randomize Timer at startup. If the value is not random for each call, you can still get cached data whenever the URL is exactly the same as the last time it was used. Just wanted to make that point crystal clear.

  11. #11

    Thread Starter
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    Re: [RESOLVED] Downloading html text with API

    Yep, I figured as much but appreciate the clarification.

    I wrote a wrapper function that generates a random 4-digit number as the query before passing it to the (updated) function in the OP, just for this particular issue in this particular program. I'm hesitant to add the functionality to my general utility, but instead will write a wrapper any time I need it.

    This particular program already uses random numbers elsewhere so there's already a Randomize (no Timer) statement in Sub Main().

    EDIT: And despite the fact that it's been many years since I have been a regular poster on these boards, I still can't rep you until I spread it around more, LaVolpe.
    Last edited by Ellis Dee; Sep 24th, 2015 at 11:26 PM.

  12. #12
    Default Member Bonnie West's Avatar
    Join Date
    Jun 2012
    Location
    InIDE
    Posts
    4,060

    Re: [RESOLVED] Downloading html text with API

    I remembered just now that someone had an issue similar to yours quite some time ago and this tiny, API-free routine worked for him:

    Quote Originally Posted by Bonnie West View Post
    Code:
    Public Sub SaveWebPageToFile(ByRef URL As String, ByRef FileName As String, Optional ByRef Charset As String = "utf-8")
        Const adSaveCreateOverWrite = 2&
        Dim oHttpReq As Object
    
        Set oHttpReq = CreateObject("WinHttp.WinHttpRequest.5.1")
        oHttpReq.Open "GET", URL
        oHttpReq.Send
    
        With CreateObject("ADODB.Stream")
            .Open
            .Charset = Charset
            .WriteText oHttpReq.ResponseText
            .SaveToFile FileName, adSaveCreateOverWrite
            .Close
        End With
    End Sub
    I believe it should also work for you without requiring you to employ workarounds.

  13. #13

    Thread Starter
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    Re: [RESOLVED] Downloading html text with API

    That looks interesting, but requires a dependency. The app I'm writing is specifically designed to be dependency-free (only pure native VB6 plus API) so that end users don't have to run any kind of installation regardless what OS they're using. Some are still running XP (Luddites!), and a couple are even running the app on WINE.

  14. #14
    Default Member Bonnie West's Avatar
    Join Date
    Jun 2012
    Location
    InIDE
    Posts
    4,060

    Re: [RESOLVED] Downloading html text with API

    Quote Originally Posted by Ellis Dee View Post
    That looks interesting, but requires a dependency. The app I'm writing is specifically designed to be dependency-free (only pure native VB6 plus API) so that end users don't have to run any kind of installation regardless what OS they're using.
    As stated here, the Windows HTTP Services is actually not a dependency on currently supported platforms (and on some unsupported ones):

    Quote Originally Posted by MSDN
    Run-time requirements

    WinHTTP 5.1 offers improvements over version 5.0. It is included in the operating system. For more information about new features, see What's New in WinHTTP 5.1 and What's New in Windows Server 2008 and Windows Vista.
    BTW, in addition to the WinHttpRequest COM object, WinHTTP also has Interfaces and Functions that I believe are also usable in VB.

    Quote Originally Posted by Ellis Dee View Post
    Some are still running XP (Luddites!), ...
    According to WinHTTP Versions, WinHTTP 5.1 has been available in Windows XP since SP1:

    Quote Originally Posted by Bonnie West View Post
    Quote Originally Posted by MSDN
    With version 5.1, WinHTTP is an operating-system component of the following operating systems:

    • Windows 2000, Service Pack 3 and later (except Datacenter Server)
    • Windows XP with Service Pack 1 (SP1) and later
    • Windows Server 2003 with Service Pack 1 (SP1) and later
    Quote Originally Posted by Ellis Dee View Post
    ... and a couple are even running the app on WINE.
    Well, I don't know whether WINE already supports WinHTTP or not, but if they plan on being compatible with Windows, then they ought to implement that API.
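
    One way to sidestep the uncertainty is to probe for the COM object at run time and fall back to the WinINet routine when it is missing. This is only a sketch: IsWinHttpAvailable is a hypothetical helper name, and whether creation succeeds under WINE is something you would have to test there:

    Code:
    'Hypothetical helper: True if the WinHttpRequest 5.1 COM object can be created
    Public Function IsWinHttpAvailable() As Boolean
        Dim oTest As Object

        On Error Resume Next    'CreateObject raises an error if the class is not registered
        Set oTest = CreateObject("WinHttp.WinHttpRequest.5.1")
        IsWinHttpAvailable = (Err.Number = 0) And Not (oTest Is Nothing)
        On Error GoTo 0
    End Function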

  15. #15
    VB-aholic & Lovin' It LaVolpe's Avatar
    Join Date
    Oct 2007
    Location
    Beside Waldo
    Posts
    19,541

    Re: [RESOLVED] Downloading html text with API

    @Bonnie. I've used Microsoft.XMLHTTP & MSXML2.ServerXMLHTTP objects and GET also would retrieve cached data. That is when I discovered the fake query workaround and applied it. Do those libraries wrap WinHTTP? Don't know.

    Edited. If this quote is correct, it answers the question I had:
    Msxml2.XMLHTTP and Msxml2.ServerXMLHTTP are two components share the similar interface for fetching XML files over HTTP protocal. The former is built upon URLMon, which relies on WinINet. The later is built upon WinHTTP, which is a server friendly replacement for WinINet. To put it simple - ServerXMLHTTP = XML + WinHTTP.
    @Ellis. I had no issues using a random number in excess of 4 digits, i.e., Int(Rnd*vbWhite)
    Last edited by LaVolpe; Sep 25th, 2015 at 08:21 AM.

  16. #16
    PowerPoster
    Join Date
    Jul 2001
    Location
    Tucson, AZ
    Posts
    2,166

    Re: [RESOLVED] Downloading html text with API

    Just a link to interesting article on web caching.

    http://www.cisco.com/web/about/ac123...0800c8903.html

  17. #17

    Thread Starter
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    Re: [RESOLVED] Downloading html text with API

    Quote Originally Posted by Bonnie West View Post
    As stated here, the Windows HTTP Services is actually not a dependency on currently supported platforms (and on some unsupported ones):
    I was actually referring to the ADO calls, but on closer inspection that doesn't look to be part of the actual code, but instead is just a way to save text to a file. Removing that bit leaves a nice, straightforward option.
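
    Roughly what I have in mind, as an untested sketch: the same late-bound WinHttpRequest calls as in post #12, minus the ADODB.Stream part, returning the text instead of saving it. DownloadURLWinHttp is just a placeholder name, and treating anything other than status 200 as "no result" is my own simplification:

    Code:
    'Sketch only: returns the response body as a string, or an empty string
    'if the server answers with anything other than 200 OK.
    'Network failures (e.g. host not found) raise a run-time error in Send,
    'which is deliberately left for the caller to handle.
    Public Function DownloadURLWinHttp(ByVal URL As String) As String
        Dim oHttpReq As Object

        Set oHttpReq = CreateObject("WinHttp.WinHttpRequest.5.1")
        oHttpReq.Open "GET", URL, False     'False = synchronous request
        oHttpReq.Send

        If oHttpReq.Status = 200 Then DownloadURLWinHttp = oHttpReq.ResponseText
    End Function

    As far as I know WinHTTP doesn't keep a local URL cache the way WinINet does, though a caching proxy in between could presumably still hand back stale pages.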

    For saving text to a file, I usually just use this:

    Code:
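    Public Sub SaveStringAs(File As String, Text As String)
        Dim FileNumber As Long
    
        FileNumber = FreeFile
        'Output mode creates the file, or overwrites it if it already exists
        Open File For Output As #FileNumber
        'Print # appends a line break after Text; end the line with ";" to suppress it
        Print #FileNumber, Text
        Close #FileNumber
    End Sub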
