[RESOLVED] Downloading html text with API

I wrote this function to download html text, but the problem I'm having is it seems to be reading it from a hidden cache.

Code:

Private Declare Function InternetCloseHandle Lib "wininet.dll" (ByVal hInet As Long) As Integer
Private Declare Function InternetOpen Lib "wininet.dll" Alias "InternetOpenA" (ByVal sAgent As String, ByVal lAccessType As Long, ByVal sProxyName As String, ByVal sProxyBypass As String, ByVal lFlags As Long) As Long
Private Declare Function InternetOpenUrl Lib "wininet.dll" Alias "InternetOpenUrlA" (ByVal hOpen As Long, ByVal sUrl As String, ByVal sHeaders As String, ByVal lLength As Long, ByVal lFlags As Long, ByVal lContext As Long) As Long
Private Declare Function InternetReadFile Lib "wininet.dll" (ByVal hFile As Long, ByVal sBuffer As String, ByVal lNumBytesToRead As Long, lNumberOfBytesRead As Long) As Integer

Public Function DownloadURL(ByVal URL As String) As String
    Const INTERNET_OPEN_TYPE_PRECONFIG = 0
    Const INTERNET_OPEN_TYPE_DIRECT = 1
    Const INTERNET_OPEN_TYPE_PROXY = 3
    Const scUserAgent = "VB Project"
    Const INTERNET_FLAG_RELOAD = &H80000000

    Dim lngOpen As Long
    Dim lngOpenURL As Long
    Dim blnReturn As Boolean
    Dim strReadBuffer As String * 2048
    Dim lngNumberOfBytesRead As Long
    Dim strBuffer As String

    lngOpen = InternetOpen(scUserAgent, INTERNET_OPEN_TYPE_PRECONFIG, vbNullString, vbNullString, 0)
    lngOpenURL = InternetOpenUrl(lngOpen, URL, vbNullString, 0, INTERNET_FLAG_RELOAD, 0)
    Do
        strReadBuffer = vbNullString
        blnReturn = InternetReadFile(lngOpenURL, strReadBuffer, Len(strReadBuffer), lngNumberOfBytesRead)
        strBuffer = strBuffer & Left$(strReadBuffer, lngNumberOfBytesRead)
        If Not CBool(lngNumberOfBytesRead) Then Exit Do
    Loop
    If lngOpenURL <> 0 Then InternetCloseHandle (lngOpenURL)
    If lngOpen <> 0 Then InternetCloseHandle (lngOpen)
    DownloadURL = strBuffer
End Function

Here's what I did:

1) Ran my function to grab a wiki page.
2) Manually corrected some text on that same wiki page using Firefox.
3) Manually cleared all internet cache on Internet Explorer.
4) Manually cleared all internet cache on Firefox. (Just because.)
5) Re-ran my function to grab the new corrected version of that same wiki page.

The function is still returning the original text, from before I made the changes in step 2. If I go to the page in Firefox I get the new correct text, but for whatever reason my function is returning the old text.

I also notice that Internet Explorer (11) is showing the old text when I go to the page, even though I deleted all my cache. I manually Ctrl+F5'ed to force a refresh, but still no joy. Still in IE11, I chose Edit Page and the edit text shows the correct text. If I now hit the Back button, IE11 finally shows the correct text. But even after all that, my function still returns the old text.

Please help! There's something like 75 pages that I crawl, and there's no way I can manually do all that hoop-jumping each time I need to update my data. Especially since I don't know what may or may not have changed, as (obviously) I'm not the only person who maintains this particular wiki.

Re: Downloading html text with API

Try including the INTERNET_FLAG_DONT_CACHE a.k.a. INTERNET_FLAG_NO_CACHE_WRITE flag:

Quote:

Originally Posted by Bonnie West

Code:

Private Declare Function CloseHandle Lib "kernel32.dll" (ByVal hObject As Long) As Long
Private Declare Function CreateFileW Lib "kernel32.dll" (ByVal lpFileName As Long, ByVal dwDesiredAccess As Long, ByVal dwShareMode As Long, ByVal lpSecurityAttributes As Long, ByVal dwCreationDisposition As Long, Optional ByVal dwFlagsAndAttributes As Long, Optional ByVal hTemplateFile As Long) As Long
Private Declare Function GetQueueStatus Lib "user32.dll" (ByVal Flags As Long) As Long
Private Declare Function InternetCloseHandle Lib "wininet.dll" (ByVal hInternet As Long) As Long
Private Declare Function InternetOpenW Lib "wininet.dll" (ByVal lpszAgent As Long, ByVal dwAccessType As Long, ByVal lpszProxyName As Long, ByVal lpszProxyBypass As Long, ByVal dwFlags As Long) As Long
Private Declare Function InternetOpenUrlW Lib "wininet.dll" (ByVal hInternet As Long, ByVal lpszUrl As Long, ByVal lpszHeaders As Long, ByVal dwHeadersLength As Long, ByVal dwFlags As Long, ByVal dwContext As Long) As Long
Private Declare Function InternetReadFile Lib "wininet.dll" (ByVal hFile As Long, ByVal lpBuffer As Long, ByVal dwNumberOfBytesToRead As Long, ByRef lpdwNumberOfBytesRead As Long) As Long
Private Declare Function SysReAllocStringLen Lib "oleaut32.dll" (ByVal pBSTR As Long, Optional ByVal pszStrPtr As Long, Optional ByVal Length As Long) As Long
Private Declare Function WriteFile Lib "kernel32.dll" (ByVal hFile As Long, ByVal lpBuffer As Long, ByVal nNumberOfBytesToWrite As Long, Optional ByRef lpNumberOfBytesWritten As Long, Optional ByVal lpOverlapped As Long) As Long

'Downloads the remote file specified by the sURL argument to the local file pointed
'by the sFileName parameter. The optional Chunk parameter determines the number
'of bytes to be downloaded at a time. Bigger chunks download faster while smaller
'ones enables the GUI to be more responsive. Returns the total number of bytes
'successfully written to disk. Maximum download size of 2,047.99 MB only.

Public Function DownloadURL2File(ByRef sURL As String, ByRef sFileName As String, Optional ByVal Chunk As Long = 1024&) As Long
    Const INTERNET_OPEN_TYPE_DIRECT = 1&, INTERNET_FLAG_DONT_CACHE = &H4000000, INTERNET_FLAG_RELOAD = &H80000000
    Const GENERIC_WRITE = &H40000000, FILE_SHARE_NONE = 0&, CREATE_ALWAYS = 2&, QS_ALLINPUT = &H4FF&
    Const INVALID_HANDLE_VALUE = -1&, ERROR_INSUFFICIENT_BUFFER = &H7A&
    Dim hInternet As Long, hURL As Long, hFile As Long, nBytesRead As Long, nBytesWritten As Long
    Dim bSuccess As Boolean, sBuffer_Ptr As Long, sBuffer_Size As Long, sBuffer As String

    Select Case True
        Case LenB(sURL) = 0&, LenB(sFileName) = 0&, Chunk < 2&: Exit Function
    End Select

    hInternet = InternetOpenW(StrPtr(App.Title), INTERNET_OPEN_TYPE_DIRECT, 0&, 0&, 0&)

    If hInternet Then
        hURL = InternetOpenUrlW(hInternet, StrPtr(sURL), 0&, 0&, INTERNET_FLAG_DONT_CACHE Or INTERNET_FLAG_RELOAD, 0&)

        If hURL Then
            hFile = CreateFileW(StrPtr(sFileName), GENERIC_WRITE, FILE_SHARE_NONE, 0&, CREATE_ALWAYS) 'Overwrite existing

            If hFile <> INVALID_HANDLE_VALUE Then
                Do: SysReAllocStringLen VarPtr(sBuffer), , (sBuffer_Size + Chunk) * 0.5!
                    sBuffer_Size = LenB(sBuffer): sBuffer_Ptr = StrPtr(sBuffer)

                    Do While InternetReadFile(hURL, sBuffer_Ptr, sBuffer_Size, nBytesRead)
                        If nBytesRead Then
                            bSuccess = (WriteFile(hFile, sBuffer_Ptr, nBytesRead, nBytesWritten) <> 0&) _
                                   And (nBytesWritten = nBytesRead): Debug.Assert bSuccess
                            If bSuccess Then DownloadURL2File = DownloadURL2File + nBytesWritten
                            If GetQueueStatus(QS_ALLINPUT) And &HFFFF0000 Then DoEvents
                        Else
                            Exit Do
                        End If
                    Loop
                Loop While Err.LastDllError = ERROR_INSUFFICIENT_BUFFER

                hFile = CloseHandle(hFile): Debug.Assert hFile
            End If

            hURL = InternetCloseHandle(hURL): Debug.Assert hURL
        End If

        hInternet = InternetCloseHandle(hInternet): Debug.Assert hInternet
    End If
End Function

BTW, the return values of your InternetCloseHandle and InternetReadFile APIs should be Long rather than Integer. ;)
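
For reference, here is a sketch of those two declarations with only the return type changed. Both APIs return a Win32 BOOL, which is 32 bits wide, so Long is the matching VB6 type. The parameter names are kept from the original post.

```vb
' InternetCloseHandle and InternetReadFile both return a Win32 BOOL (32-bit),
' so the VB6 return type should be Long rather than the 16-bit Integer.
Private Declare Function InternetCloseHandle Lib "wininet.dll" (ByVal hInet As Long) As Long
Private Declare Function InternetReadFile Lib "wininet.dll" (ByVal hFile As Long, ByVal sBuffer As String, ByVal lNumBytesToRead As Long, lNumberOfBytesRead As Long) As Long
```

The existing assignment blnReturn = InternetReadFile(...) should still work unchanged, since any nonzero Long converts to True.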

Re: Downloading html text with API

No joy, still getting the old cached value.

The old cached value is also showing when I open the page in IE11. How do I clear the cache in IE11? (I only ever use Firefox.) If I could just do that maybe this would work. When I choose "Gear Icon" => Safety => Delete Browsing History, check Temporary Internet Files and clear it, IE11 says files are cleared. But it's clearly lying to me.

Note that I don't need this to work for other people, just me, so if there's some manual task I have to do to clear the cache to make this work that's fine by me.

Re: Downloading html text with API

Quote:

Originally Posted by Ellis Dee

No joy, still getting the old cached value.

The old cached value is also showing when I open the page in IE11. How do I clear the cache in IE11? (I only ever use Firefox.) If I could just do that maybe this would work. When I choose "Gear Icon" => Safety => Delete Browsing History, check Temporary Internet Files and clear it, IE11 says files are cleared. But it's clearly lying to me.

Note that I don't need this to work for other people, just me, so if there's some manual task I have to do to clear the cache to make this work that's fine by me.

If you can't clear it using code then clear it yourself by navigating to the cache folders and do a select all and delete providing it allows you to do this - I can do it on my PC but that's me.
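
If you want to try it from code first, here is a minimal sketch that uses the WinINet DeleteUrlCacheEntry API to drop the cache entry for a single URL. ClearCachedURL is a hypothetical helper name, not something from the code posted earlier. It only touches the local WinINet cache, so it won't help if the stale copy is coming from a proxy or from the server itself.

```vb
Private Declare Function DeleteUrlCacheEntry Lib "wininet.dll" Alias "DeleteUrlCacheEntryA" (ByVal lpszUrlName As String) As Long

' Hypothetical helper: removes the WinINet cache entry for one URL.
' Returns True if an entry was deleted, False if there was no entry
' for that URL or the call failed.
Public Function ClearCachedURL(ByVal URL As String) As Boolean
    ClearCachedURL = (DeleteUrlCacheEntry(URL) <> 0)
End Function
```

You would call ClearCachedURL with the exact URL you are about to download, just before requesting the page.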

Re: Downloading html text with API

Quote:

Originally Posted by jmsrickland

If you can't clear it using code then clear it yourself by navigating to the cache folders and do a select all and delete providing it allows you to do this - I can do it on my PC but that's me.

Good thought, but still no joy.

I navigated to "C:\Users\[Me]\AppData\Local\Microsoft\Windows\Temporary Internet Files" (after choosing "Show System Files" in explorer) and deleted every one of the hundreds of files still in there.

Re-ran my crawler and...still the old values are retrieved.

EDIT: Though when I went to empty my recycle bin just to be sure, the recycle bin shows as empty even though I just deleted hundreds of files.

Re: Downloading html text with API

Maybe supply an added fake query, i.e... "www.somesite.com/?rnd=1234"

The above tweak seems to work every time for me. In practice I replace 1234 with a truly random number, and I parse the URL to see whether it already includes a query so that I can append my random parameter with the proper prefix, ? or &. The parameter name rnd is fixed; it was just 3 characters I chose on a whim.

Easy enough to test.
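
A minimal sketch of that idea, assuming a hypothetical helper named AddCacheBuster (the name and the fragment handling are my own illustration, not code from an actual project). It picks ? or & depending on whether the URL already has a query, and keeps any #fragment at the end.

```vb
' Hypothetical helper: appends a throwaway query parameter so that each
' request uses a URL that no cache has stored yet.
Public Function AddCacheBuster(ByVal URL As String) As String
    Dim strFragment As String
    Dim strSeparator As String
    Dim lngHash As Long

    ' Set any #fragment aside so the parameter lands in the query string.
    lngHash = InStr(URL, "#")
    If lngHash > 0 Then
        strFragment = Mid$(URL, lngHash)
        URL = Left$(URL, lngHash - 1)
    End If

    ' Use ? for the first parameter, & if a query is already present.
    If InStr(URL, "?") > 0 Then
        strSeparator = "&"
    Else
        strSeparator = "?"
    End If

    ' Call Randomize once at program start so the numbers differ between runs.
    AddCacheBuster = URL & strSeparator & "rnd=" & CStr(Int(Rnd * 1000000000)) & strFragment
End Function
```

Usage would look like strHtml = DownloadURL(AddCacheBuster("http://www.somesite.com/")). Most sites ignore a query parameter they don't recognize, but that is an assumption worth checking against the site you crawl.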

Re: Downloading html text with API

Sweet, that works!

Let me tell you how far down the rabbit-hole I was going. First, I found this article which mentions the simple Temporary Internet Files folder, which I had already tried manually clearing. But the second thing it mentions is index.dat:

Quote:

And then, where is the index.dat file located in Windows 7 | 8? Index.dat are files hidden on your computer that contain all of the Web sites that you have ever visited. Every URL, and every Web page is listed there. To access it, you will have to type in Explorers address bar the following location and click go:

C:\Users\username\AppData\Local\Microsoft\Windows\Temporary Internet Files\Content.IE5

Only then will you be able to see the index.dat file. Conclusion ? The Content.IE5 folder is super hidden!

Typing that address into Windows explorer -- substituting my username, of course -- did indeed bring up a HUGE cache of internet files that were not previously visible. Despite only ever using Firefox on this computer, the index.dat monstrosity contained tens of thousands of files taking up many gigabytes of space. What the heck? So I manually deleted all of them. STILL no joy.

So then I found this thread, where apparently the discussion is based around trying to clear out the complete, permanent, un-deletable and invisible cache of every website you've ever visited that is automatically maintained by Windows.

Jeezum-crow, what happened to my innocence? I thought the index.dat thing was bad, but when deleting even that didn't clear my cache I'm starting to think that Windows truly does remember everywhere you've ever been on the internet, regardless of what browser you used to get there.