[RESOLVED] Downloading html text with API
I wrote this function to download HTML text, but it seems to be reading the page from a hidden cache.
Code:
Private Declare Function InternetCloseHandle Lib "wininet.dll" (ByVal hInet As Long) As Integer
Private Declare Function InternetOpen Lib "wininet.dll" Alias "InternetOpenA" (ByVal sAgent As String, ByVal lAccessType As Long, ByVal sProxyName As String, ByVal sProxyBypass As String, ByVal lFlags As Long) As Long
Private Declare Function InternetOpenUrl Lib "wininet.dll" Alias "InternetOpenUrlA" (ByVal hOpen As Long, ByVal sUrl As String, ByVal sHeaders As String, ByVal lLength As Long, ByVal lFlags As Long, ByVal lContext As Long) As Long
Private Declare Function InternetReadFile Lib "wininet.dll" (ByVal hFile As Long, ByVal sBuffer As String, ByVal lNumBytesToRead As Long, lNumberOfBytesRead As Long) As Integer
Public Function DownloadURL(ByVal URL As String) As String
    Const INTERNET_OPEN_TYPE_PRECONFIG = 0
    Const INTERNET_OPEN_TYPE_DIRECT = 1
    Const INTERNET_OPEN_TYPE_PROXY = 3
    Const scUserAgent = "VB Project"
    Const INTERNET_FLAG_RELOAD = &H80000000
    Dim lngOpen As Long
    Dim lngOpenURL As Long
    Dim blnReturn As Boolean
    Dim strReadBuffer As String * 2048
    Dim lngNumberOfBytesRead As Long
    Dim strBuffer As String
    lngOpen = InternetOpen(scUserAgent, INTERNET_OPEN_TYPE_PRECONFIG, vbNullString, vbNullString, 0)
    lngOpenURL = InternetOpenUrl(lngOpen, URL, vbNullString, 0, INTERNET_FLAG_RELOAD, 0)
    Do
        strReadBuffer = vbNullString
        blnReturn = InternetReadFile(lngOpenURL, strReadBuffer, Len(strReadBuffer), lngNumberOfBytesRead)
        strBuffer = strBuffer & Left$(strReadBuffer, lngNumberOfBytesRead)
        If Not CBool(lngNumberOfBytesRead) Then Exit Do
    Loop
    If lngOpenURL <> 0 Then InternetCloseHandle (lngOpenURL)
    If lngOpen <> 0 Then InternetCloseHandle (lngOpen)
    DownloadURL = strBuffer
End Function
Here's what I did:
1) Ran my function to grab a wiki page.
2) Manually corrected some text on that same wiki page using Firefox.
3) Manually cleared all internet cache on Internet Explorer.
4) Manually cleared all internet cache on Firefox. (Just because.)
5) Re-ran my function to grab the new corrected version of that same wiki page.
The function is still returning the original text, from before I made the changes in step 2. If I go to the page in Firefox I get the new correct text, but for whatever reason my function is returning the old text.
I also notice that Internet Explorer (11) is showing the old text when I go to the page, even though I deleted all my cache. I manually Ctrl+F5'ed to force a refresh, but still no joy. Still in IE11, I chose Edit Page and the edit text shows the correct text. If I now hit the Back button, IE11 finally shows the correct text. But even after all that, my function still returns the old text.
Please help! There's something like 75 pages that I crawl, and there's no way I can manually do all that hoop-jumping each time I need to update my data. Especially since I don't know what may or may not have changed, as (obviously) I'm not the only person who maintains this particular wiki.
Re: Downloading html text with API
Try including the INTERNET_FLAG_DONT_CACHE flag (a.k.a. INTERNET_FLAG_NO_CACHE_WRITE):
Quote:
Originally Posted by Bonnie West
Code:
Private Declare Function CloseHandle Lib "kernel32.dll" (ByVal hObject As Long) As Long
Private Declare Function CreateFileW Lib "kernel32.dll" (ByVal lpFileName As Long, ByVal dwDesiredAccess As Long, ByVal dwShareMode As Long, ByVal lpSecurityAttributes As Long, ByVal dwCreationDisposition As Long, Optional ByVal dwFlagsAndAttributes As Long, Optional ByVal hTemplateFile As Long) As Long
Private Declare Function GetQueueStatus Lib "user32.dll" (ByVal Flags As Long) As Long
Private Declare Function InternetCloseHandle Lib "wininet.dll" (ByVal hInternet As Long) As Long
Private Declare Function InternetOpenW Lib "wininet.dll" (ByVal lpszAgent As Long, ByVal dwAccessType As Long, ByVal lpszProxyName As Long, ByVal lpszProxyBypass As Long, ByVal dwFlags As Long) As Long
Private Declare Function InternetOpenUrlW Lib "wininet.dll" (ByVal hInternet As Long, ByVal lpszUrl As Long, ByVal lpszHeaders As Long, ByVal dwHeadersLength As Long, ByVal dwFlags As Long, ByVal dwContext As Long) As Long
Private Declare Function InternetReadFile Lib "wininet.dll" (ByVal hFile As Long, ByVal lpBuffer As Long, ByVal dwNumberOfBytesToRead As Long, ByRef lpdwNumberOfBytesRead As Long) As Long
Private Declare Function SysReAllocStringLen Lib "oleaut32.dll" (ByVal pBSTR As Long, Optional ByVal pszStrPtr As Long, Optional ByVal Length As Long) As Long
Private Declare Function WriteFile Lib "kernel32.dll" (ByVal hFile As Long, ByVal lpBuffer As Long, ByVal nNumberOfBytesToWrite As Long, Optional ByRef lpNumberOfBytesWritten As Long, Optional ByVal lpOverlapped As Long) As Long
'Downloads the remote file specified by the sURL argument to the local file pointed
'to by the sFileName parameter. The optional Chunk parameter determines the number
'of bytes to be downloaded at a time. Bigger chunks download faster, while smaller
'ones enable the GUI to be more responsive. Returns the total number of bytes
'successfully written to disk. Maximum download size is 2,047.99 MB.
Public Function DownloadURL2File(ByRef sURL As String, ByRef sFileName As String, Optional ByVal Chunk As Long = 1024&) As Long
    Const INTERNET_OPEN_TYPE_DIRECT = 1&, INTERNET_FLAG_DONT_CACHE = &H4000000, INTERNET_FLAG_RELOAD = &H80000000
    Const GENERIC_WRITE = &H40000000, FILE_SHARE_NONE = 0&, CREATE_ALWAYS = 2&, QS_ALLINPUT = &H4FF&
    Const INVALID_HANDLE_VALUE = -1&, ERROR_INSUFFICIENT_BUFFER = &H7A&
    Dim hInternet As Long, hURL As Long, hFile As Long, nBytesRead As Long, nBytesWritten As Long
    Dim bSuccess As Boolean, sBuffer_Ptr As Long, sBuffer_Size As Long, sBuffer As String
    Select Case True
        Case LenB(sURL) = 0&, LenB(sFileName) = 0&, Chunk < 2&: Exit Function
    End Select
    hInternet = InternetOpenW(StrPtr(App.Title), INTERNET_OPEN_TYPE_DIRECT, 0&, 0&, 0&)
    If hInternet Then
        hURL = InternetOpenUrlW(hInternet, StrPtr(sURL), 0&, 0&, INTERNET_FLAG_DONT_CACHE Or INTERNET_FLAG_RELOAD, 0&)
        If hURL Then
            hFile = CreateFileW(StrPtr(sFileName), GENERIC_WRITE, FILE_SHARE_NONE, 0&, CREATE_ALWAYS) 'Overwrite existing
            If hFile <> INVALID_HANDLE_VALUE Then
                Do: SysReAllocStringLen VarPtr(sBuffer), , (sBuffer_Size + Chunk) * 0.5!
                    sBuffer_Size = LenB(sBuffer): sBuffer_Ptr = StrPtr(sBuffer)
                    Do While InternetReadFile(hURL, sBuffer_Ptr, sBuffer_Size, nBytesRead)
                        If nBytesRead Then
                            bSuccess = (WriteFile(hFile, sBuffer_Ptr, nBytesRead, nBytesWritten) <> 0&) _
                                   And (nBytesWritten = nBytesRead): Debug.Assert bSuccess
                            If bSuccess Then DownloadURL2File = DownloadURL2File + nBytesWritten
                            If GetQueueStatus(QS_ALLINPUT) And &HFFFF0000 Then DoEvents
                        Else
                            Exit Do
                        End If
                    Loop
                Loop While Err.LastDllError = ERROR_INSUFFICIENT_BUFFER
                hFile = CloseHandle(hFile): Debug.Assert hFile
            End If
            hURL = InternetCloseHandle(hURL): Debug.Assert hURL
        End If
        hInternet = InternetCloseHandle(hInternet): Debug.Assert hInternet
    End If
End Function
BTW, the return values of your InternetCloseHandle and InternetReadFile APIs should be Long rather than Integer. ;)
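For illustration, here is a minimal sketch of the original function with both suggestions from this post applied: the InternetCloseHandle and InternetReadFile return types changed to Long, and INTERNET_FLAG_NO_CACHE_WRITE combined with INTERNET_FLAG_RELOAD. The name DownloadURLNoCache, the variable names and the early exit on a failed InternetOpen are assumptions made for this example, not code from the thread, and it has not been tested against the wiki in question.
Code:
Private Declare Function InternetCloseHandle Lib "wininet.dll" (ByVal hInet As Long) As Long
Private Declare Function InternetOpen Lib "wininet.dll" Alias "InternetOpenA" (ByVal sAgent As String, ByVal lAccessType As Long, ByVal sProxyName As String, ByVal sProxyBypass As String, ByVal lFlags As Long) As Long
Private Declare Function InternetOpenUrl Lib "wininet.dll" Alias "InternetOpenUrlA" (ByVal hOpen As Long, ByVal sUrl As String, ByVal sHeaders As String, ByVal lLength As Long, ByVal lFlags As Long, ByVal lContext As Long) As Long
Private Declare Function InternetReadFile Lib "wininet.dll" (ByVal hFile As Long, ByVal sBuffer As String, ByVal lNumBytesToRead As Long, lNumberOfBytesRead As Long) As Long
'Hypothetical variant of DownloadURL; assumes single-byte (ANSI) page content, as the original does.
Public Function DownloadURLNoCache(ByVal URL As String) As String
    Const INTERNET_OPEN_TYPE_PRECONFIG = 0&
    Const INTERNET_FLAG_RELOAD = &H80000000
    Const INTERNET_FLAG_NO_CACHE_WRITE = &H4000000
    Dim hOpen As Long, hUrl As Long
    Dim strChunk As String, lngRead As Long, strResult As String
    hOpen = InternetOpen("VB Project", INTERNET_OPEN_TYPE_PRECONFIG, vbNullString, vbNullString, 0&)
    If hOpen = 0 Then Exit Function
    'Ask WinINet to skip its cache for the read and not to store the response
    hUrl = InternetOpenUrl(hOpen, URL, vbNullString, 0&, INTERNET_FLAG_RELOAD Or INTERNET_FLAG_NO_CACHE_WRITE, 0&)
    If hUrl <> 0 Then
        Do
            strChunk = String$(2048, vbNullChar)
            'Stop on failure or when no more data is available
            If InternetReadFile(hUrl, strChunk, Len(strChunk), lngRead) = 0 Then Exit Do
            If lngRead = 0 Then Exit Do
            strResult = strResult & Left$(strChunk, lngRead)
        Loop
        InternetCloseHandle hUrl
    End If
    InternetCloseHandle hOpen
    DownloadURLNoCache = strResult
End Function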
Re: Downloading html text with API
No joy, still getting the old cached value.
The old cached value is also showing when I open the page in IE11. How do I clear the cache in IE11? (I only ever use Firefox.) If I could just do that maybe this would work. When I choose "Gear Icon" => Safety => Delete Browsing History, check Temporary Internet Files and clear it, IE11 says files are cleared. But it's clearly lying to me.
Note that I don't need this to work for other people, just me, so if there's some manual task I have to do to clear the cache to make this work that's fine by me.
Re: Downloading html text with API
If you can't clear it using code, clear it yourself: navigate to the cache folders, select all, and delete, provided Windows allows it. I can do that on my PC, but that's me.
Re: Downloading html text with API
Good thought, but still no joy.
I navigated to "C:\Users\[Me]\AppData\Local\Microsoft\Windows\Temporary Internet Files" (after choosing "Show System Files" in explorer) and deleted every one of the hundreds of files still in there.
Re-ran my crawler and...still the old values are retrieved.
EDIT: Though when I went to empty my recycle bin just to be sure, the recycle bin shows as empty even though I just deleted hundreds of files.
Re: Downloading html text with API
Maybe append a fake query to the URL, e.g. "www.somesite.com/?rnd=1234"
The above tweak seems to work every time for me. In practice I replace 1234 with a random number, and I parse the URL to see whether a query is already included so that I can append my random query with the proper prefix: ? if there is no query yet, & if there is. The parameter name rnd is fixed; it was just 3 characters I chose on a whim.
Easy enough to test.
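The URL tweak described above can be sketched roughly as follows. The function name AddCacheBuster and the upper bound of 1,000,000 are assumptions chosen for illustration. The sketch also assumes the URL has no #fragment, which would have to be split off before appending the query.
Code:
'Hypothetical helper: appends a throwaway "rnd" query parameter to a URL.
Public Function AddCacheBuster(ByVal URL As String) As String
    Dim lngRandom As Long
    Dim strSeparator As String
    'Random whole number from 0 to 999999; call Randomize once at startup
    lngRandom = Int(Rnd * 1000000)
    'Use "&" if the URL already carries a query string, otherwise start one with "?"
    If InStr(1, URL, "?") > 0 Then
        strSeparator = "&"
    Else
        strSeparator = "?"
    End If
    AddCacheBuster = URL & strSeparator & "rnd=" & CStr(lngRandom)
End Function
Calling AddCacheBuster("http://www.somesite.com/") would return the same URL with ?rnd= and a number appended, while a URL that already ends in something like ?title=Foo would get &rnd= instead.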
Re: Downloading html text with API
Sweet, that works!
Let me tell you how far down the rabbit-hole I was going. First, I found this article which mentions the simple Temporary Internet Files folder, which I had already tried manually clearing. But the second thing it mentions is index.dat:
Quote:
And then, where is the index.dat file located in Windows 7 | 8? Index.dat are files hidden on your computer that contain all of the Web sites that you have ever visited. Every URL, and every Web page is listed there. To access it, you will have to type in Explorers address bar the following location and click go:
C:\Users\username\AppData\Local\Microsoft\Windows\Temporary Internet Files\Content.IE5
Only then will you be able to see the index.dat file. Conclusion ? The Content.IE5 folder is super hidden!
Typing that address into Windows explorer -- substituting my username, of course -- did indeed bring up a HUGE cache of internet files that were not previously visible. Despite only ever using Firefox on this computer, the index.dat monstrosity contained tens of thousands of files taking up many gigabytes of space. What the heck? So I manually deleted all of them. STILL no joy.
So then I found this thread, where apparently the discussion is based around trying to clear out the complete, permanent, un-deletable and invisible cache of every website you've ever visited that is automatically maintained by Windows.
Jeezum-crow, what happened to my innocence? I thought the index.dat thing was bad, but when deleting even that didn't clear my cache I'm starting to think that Windows truly does remember everywhere you've ever been on the internet, regardless of what browser you used to get there.
Re: [RESOLVED] Downloading html text with API
Maybe if you had rebooted, it might have completed the deletion.
Re: [RESOLVED] Downloading html text with API
I'll test it next time I have to reboot.
Re: [RESOLVED] Downloading html text with API
Quote:
Sweet, that works!
@Ellis. Just to be clear, that query value must be different each time you use it. You may want to call Randomize Timer at startup. If the value is not different for each call, you can get cached data whenever the URL is exactly the same as the last time it was used. Just wanted to make that point crystal clear.
Re: [RESOLVED] Downloading html text with API
Yep, I figured as much but appreciate the clarification.
I wrote a wrapper function that generates a random 4-digit number as the query before passing it to the (updated) function in the OP, just for this particular issue in this particular program. I'm hesitant to add the functionality to my general utility, but instead will write a wrapper any time I need it.
This particular program already uses random numbers elsewhere so there's already a Randomize (no Timer) statement in Sub Main().
EDIT: And despite the fact that it's been many years since I have been a regular poster on these boards, I still can't rep you until I spread it around more, LaVolpe.
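A wrapper along the lines described in this post might look like the sketch below. DownloadFreshURL is a hypothetical name. The sketch assumes the DownloadURL function from the first post is available in the same project and that Randomize has already been called, for example in Sub Main().
Code:
'Hypothetical wrapper around the DownloadURL function from the first post.
Public Function DownloadFreshURL(ByVal URL As String) As String
    Dim lngRandom As Long
    Dim strSeparator As String
    'Random 4-digit number, 1000 to 9999
    lngRandom = Int(Rnd * 9000) + 1000
    'Append to an existing query with "&", otherwise start one with "?"
    strSeparator = IIf(InStr(1, URL, "?") > 0, "&", "?")
    DownloadFreshURL = DownloadURL(URL & strSeparator & "rnd=" & CStr(lngRandom))
End Function
Keeping this in a separate wrapper leaves the general-purpose DownloadURL unchanged, which matches the preference stated above.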
Re: [RESOLVED] Downloading html text with API
I remembered just now that someone had an issue similar to yours quite some time ago and this tiny, API-free routine worked for him:
Quote:
Originally Posted by Bonnie West
Code:
Public Sub SaveWebPageToFile(ByRef URL As String, ByRef FileName As String, Optional ByRef Charset As String = "utf-8")
    Const adSaveCreateOverWrite = 2&
    Dim oHttpReq As Object
    Set oHttpReq = CreateObject("WinHttp.WinHttpRequest.5.1")
    oHttpReq.Open "GET", URL
    oHttpReq.Send
    With CreateObject("ADODB.Stream")
        .Open
        .Charset = Charset
        .WriteText oHttpReq.ResponseText
        .SaveToFile FileName, adSaveCreateOverWrite
        .Close
    End With
End Sub
I believe it should also work for you without requiring you to employ workarounds.
Re: [RESOLVED] Downloading html text with API
That looks interesting, but requires a dependency. The app I'm writing is specifically designed to be dependency-free (only pure native VB6 plus API) so that end users don't have to run any kind of installation regardless what OS they're using. Some are still running XP (Luddites!), and a couple are even running the app on WINE.
Re: [RESOLVED] Downloading html text with API
Quote:
Originally Posted by Ellis Dee
That looks interesting, but requires a dependency. The app I'm writing is specifically designed to be dependency-free (only pure native VB6 plus API) so that end users don't have to run any kind of installation regardless what OS they're using.
As stated here, the Windows HTTP Services is actually not a dependency on currently supported platforms (and on some unsupported ones):
Quote:
Originally Posted by MSDN
BTW, in addition to the WinHttpRequest COM object, WinHTTP also has Interfaces and Functions that I believe are also usable in VB.
Quote:
Originally Posted by Ellis Dee
Some are still running XP (Luddites!), ...
According to WinHTTP Versions, WinHTTP 5.1 has been available in Windows XP since SP1:
Quote:
Originally Posted by Bonnie West
Quote:
Originally Posted by MSDN
With version 5.1, WinHTTP is an operating-system component of the following operating systems:
- Windows 2000, Service Pack 3 and later (except Datacenter Server)
- Windows XP with Service Pack 1 (SP1) and later
- Windows Server 2003 with Service Pack 1 (SP1) and later
Quote:
Originally Posted by Ellis Dee
... and a couple are even running the app on WINE.
Well, I don't know whether WINE already supports WinHTTP or not, but if they plan on being compatible with Windows, then they ought to implement that API.
Re: [RESOLVED] Downloading html text with API
@Bonnie. I've used Microsoft.XMLHTTP & MSXML2.ServerXMLHTTP objects and GET also would retrieve cached data. That is when I discovered the fake query workaround and applied it. Do those libraries wrap WinHTTP? Don't know.
Edited: if this quote is correct, it answers the question I had.
Quote:
Msxml2.XMLHTTP and Msxml2.ServerXMLHTTP are two components that share a similar interface for fetching XML files over the HTTP protocol. The former is built upon URLMon, which relies on WinINet. The latter is built upon WinHTTP, which is a server-friendly replacement for WinINet. To put it simply: ServerXMLHTTP = XML + WinHTTP.
@Ellis. I had no issues using a random number in excess of 4 digits, e.g., Int(Rnd*vbWhite)
Re: [RESOLVED] Downloading html text with API
Quote:
Originally Posted by Bonnie West
As stated here, the Windows HTTP Services is actually not a dependency on currently supported platforms (and on some unsupported ones):
I was actually referring to the ADO calls, but on closer inspection that doesn't look to be part of the actual code, but instead is just a way to save text to a file. Removing that bit leaves a nice, straightforward option.
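With the ADODB.Stream part removed, the WinHTTP routine could be reduced to something like the sketch below. GetWebPageText is a hypothetical name, and returning an empty string for any status other than 200 is an assumption about the desired error handling.
Code:
'Hypothetical trimmed-down variant of SaveWebPageToFile that returns the page text.
Public Function GetWebPageText(ByRef URL As String) As String
    Dim oHttpReq As Object
    'Late-bound, so no project reference is needed
    Set oHttpReq = CreateObject("WinHttp.WinHttpRequest.5.1")
    oHttpReq.Open "GET", URL, False 'False = synchronous request
    oHttpReq.Send
    'Only return the body on HTTP 200 OK
    If oHttpReq.Status = 200 Then GetWebPageText = oHttpReq.ResponseText
End Function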
For saving text to a file, I usually just use this:
Code:
Public Sub SaveStringAs(File As String, Text As String)
    Dim FileNumber As Long
    FileNumber = FreeFile
    Open File For Output As #FileNumber
    Print #FileNumber, Text
    Close #FileNumber
End Sub