getting html through winsock
I am trying to use Winsock to get an HTML page. I'm able to get something with the code below, but I'm receiving header information as well. How do I do this properly?
Code:
'ws (winsock object) is declared and instantiated outside as a global
Dim bt() As Byte
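'HTTP lines end with CR LF (Chr(13) & Chr(10), i.e. vbCrLf); a blank line ends the request.
'HTTP/1.1 also requires a Host header, which this request does not send.
bt = System.Text.ASCIIEncoding.ASCII.GetBytes("GET / HTTP/1.1" & vbCrLf & vbCrLf)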
Dim x As Integer = 1
ws.SendData(bt)
Note: I know other ways to get web pages; I just want to learn how to do it using Winsock.
Re: getting html through winsock
By headers, do you mean stylesheets and the <head> tag? Do you get only that, or that along with the rest of the HTML? Shouldn't you expect to get that, rather than just the <body> tag?
Re: getting html through winsock
You'll always get at least a few headers. You just need to trim them off (making use of the important ones as you go).
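To make the trimming concrete: in HTTP the header block ends at the first blank line (CR LF CR LF), so that is a more reliable place to cut than anything inside the page. Here is a rough VB.NET sketch of the idea. SplitResponse and ParseHeaders are names made up for this example, and it assumes the whole response is already sitting in one string.

```vbnet
Imports System
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module HeaderSplitDemo
    ' Hypothetical helper: splits a raw HTTP response at the first blank line.
    ' Returns False if the blank line has not arrived yet.
    Function SplitResponse(ByVal raw As String, ByRef headerText As String, ByRef body As String) As Boolean
        Dim sep As String = vbCrLf & vbCrLf
        Dim pos As Integer = raw.IndexOf(sep, StringComparison.Ordinal)
        If pos < 0 Then Return False
        headerText = raw.Substring(0, pos)
        body = raw.Substring(pos + sep.Length)
        Return True
    End Function

    ' Hypothetical helper: turns "Name: value" lines into a case-insensitive table.
    ' Line 0 is the status line (e.g. "HTTP/1.1 200 OK"), so it is skipped.
    Function ParseHeaders(ByVal headerText As String) As Dictionary(Of String, String)
        Dim headers As New Dictionary(Of String, String)(StringComparer.OrdinalIgnoreCase)
        Dim lines() As String = headerText.Split(New String() {vbCrLf}, StringSplitOptions.None)
        For i As Integer = 1 To lines.Length - 1
            Dim colon As Integer = lines(i).IndexOf(":"c)
            If colon > 0 Then
                headers(lines(i).Substring(0, colon).Trim()) = lines(i).Substring(colon + 1).Trim()
            End If
        Next
        Return headers
    End Function

    Sub Main()
        ' Made-up sample response, for illustration only.
        Dim raw As String = "HTTP/1.1 200 OK" & vbCrLf & _
                            "Content-Type: text/html" & vbCrLf & _
                            "Content-Length: 13" & vbCrLf & vbCrLf & _
                            "<html></html>"
        Dim headerText As String = ""
        Dim body As String = ""
        If SplitResponse(raw, headerText, body) Then
            Console.WriteLine(ParseHeaders(headerText)("content-length"))
            Console.WriteLine(body)
        End If
    End Sub
End Module
```

Keeping the headers in a table, rather than throwing them away, is what lets you "make use of the important ones" later (Content-Length, Content-Type, and so on).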
Re: getting html through winsock
They look like headers. I think they're headers and the body is there as well. I'll try just removing them and see how it goes. Thanks.
Re: getting html through winsock
Can you post an example of the data you receive?
Re: getting html through winsock
Quote:
Originally Posted by benmartin101
They look like headers. I think they're headers and the body is there as well. I'll try just removing them and see how it goes. Thanks.
Is this what you are looking for?
If getHTML is a string containing everything returned by a POST or GET then
Code:
Str = Mid$(getHTML, InStr(getHTML, "<html"))
will strip the header info.
You will of course have to take into consideration the fact that "html" may be capitalized. And if you want other info such as the DOCTYPE then the InStr code will have to change accordingly.
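For what it's worth, here is one way the capitalization and DOCTYPE cases could be handled, written as a VB.NET sketch. ExtractMarkup is a made-up name; the only library calls are InStr and Mid from Microsoft.VisualBasic. It also guards against InStr returning 0, since Mid fails when the start position is 0.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module FindMarkupDemo
    ' Hypothetical helper: returns the text from the first "<!DOCTYPE" (or, failing
    ' that, the first "<html") onward, ignoring case. Returns "" if neither is found.
    Function ExtractMarkup(ByVal getHTML As String) As String
        Dim pos As Integer = InStr(1, getHTML, "<!DOCTYPE", CompareMethod.Text)
        If pos = 0 Then pos = InStr(1, getHTML, "<html", CompareMethod.Text)
        If pos = 0 Then Return "" ' Mid would throw on a start position of 0
        Return Mid(getHTML, pos)
    End Function

    Sub Main()
        ' Made-up sample response with an upper-case tag.
        Dim getHTML As String = "HTTP/1.1 200 OK" & vbCrLf & vbCrLf & "<HTML><body>hi</body></HTML>"
        Console.WriteLine(ExtractMarkup(getHTML))
    End Sub
End Module
```

This still only works for responses that really are HTML pages, so treat it as a quick fix rather than a general way of finding the body.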
Re: getting html through winsock
You'll also have to consider that, while it is technically "improper", some servers will return garbage after the end of the valid content. These servers generally expect the user agent (your HTTP client) to respect the Content-Length or Transfer-Encoding header.
In the most general terms you can't rely on <HTML> and </HTML> as delimiters either. A text, image, CSS, script, etc. file won't have these, and as suggested HTML data can have prefixes and even suffixes outside the page markup itself.
In the end this is why rolling your own code to handle HTTP requests is usually a waste of time. The effort to create code that works better than "doesn't crash, most of the time" just isn't worth it.
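As a rough illustration of respecting Content-Length, the VB.NET sketch below cuts the received bytes down to the declared length, so trailing garbage is dropped. TakeBody is a made-up name, the headers table is assumed to be case-insensitive, and chunked Transfer-Encoding is deliberately not handled here (it needs its own decoder). Note that Content-Length counts bytes, not characters, which is why this works on a Byte array.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text
Imports Microsoft.VisualBasic

Module BodyLengthDemo
    ' Hypothetical helper: returns exactly the body the server declared.
    ' Returns Nothing while more data is still expected.
    Function TakeBody(ByVal headers As Dictionary(Of String, String), ByVal received() As Byte) As Byte()
        Dim te As String = Nothing
        If headers.TryGetValue("Transfer-Encoding", te) AndAlso te.ToLowerInvariant().Contains("chunked") Then
            Throw New NotSupportedException("Chunked bodies need a chunk decoder.")
        End If

        Dim lenText As String = Nothing
        Dim length As Integer
        If Not headers.TryGetValue("Content-Length", lenText) _
           OrElse Not Integer.TryParse(lenText, length) _
           OrElse length < 0 Then
            ' No usable length: the body runs until the server closes the connection.
            Return received
        End If

        If received.Length < length Then Return Nothing ' keep receiving

        ' Copy only the declared number of bytes; anything after that is ignored.
        Dim body(length - 1) As Byte
        Array.Copy(received, body, length)
        Return body
    End Function

    Sub Main()
        ' Made-up sample: 13 bytes of content followed by junk.
        Dim headers As New Dictionary(Of String, String)(StringComparer.OrdinalIgnoreCase)
        headers("Content-Length") = "13"
        Dim received() As Byte = Encoding.ASCII.GetBytes("<html></html>" & "garbage")
        Dim body() As Byte = TakeBody(headers, received)
        Console.WriteLine(Encoding.ASCII.GetString(body))
    End Sub
End Module
```

The Nothing return value is the useful part for a Winsock client: it tells you to keep appending incoming data until the declared number of bytes has arrived.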
Sometimes of course the effort is useful in expanding your knowledge.
http://www.ietf.org/rfc/rfc2616.txt
Re: getting html through winsock
I did that before to grab an HTML page using Winsock. Below is my code...
Code:
Private Sub Winsock1_Connect()
    Dim Chunks As String
    'Build the request; every header line ends with CR LF.
    Chunks = "GET /index.html" & " HTTP/1.1" & vbCrLf
    Chunks = Chunks & "Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*" & vbCrLf
    Chunks = Chunks & "Accept-Language: en-us" & vbCrLf
    'Accept-Encoding: gzip, deflate is left out on purpose. Advertising it lets the
    'server send a compressed body, which this code cannot decode.
    Chunks = Chunks & "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" & vbCrLf
    Chunks = Chunks & "Host: " & Server.Text & vbCrLf
    'The second vbCrLf produces the blank line that ends the request headers.
    Chunks = Chunks & "Connection: Keep-Alive" & vbCrLf & vbCrLf
    Winsock1.SendData Chunks
End Sub
Private Sub Winsock1_DataArrival(ByVal bytesTotal As Long)
    Dim Data As String
    Winsock1.GetData Data, vbString, bytesTotal
    'DataArrival can fire several times per response, so append instead of overwriting.
    HTMLSource.Text = HTMLSource.Text & Data
End Sub
I think that is how I got the source of an HTML page using Winsock.