getting html through winsock


benmartin101
Mar 6th, 2007, 04:48 PM
I am trying to use winsock to get an HTML page. I'm able to get something with the code below, but I'm receiving header information as well. How do I do this properly?


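'ws (winsock object) is declared and instantiated outside as a global
Dim bt() As Byte
'Each HTTP line ends with CR LF, i.e. Chr(13) & Chr(10) in that order; an empty line ends the request
'Note: HTTP/1.1 also requires a Host header, which is not sent here
bt = System.Text.ASCIIEncoding.ASCII.GetBytes("GET / HTTP/1.1" & Chr(13) & Chr(10) & Chr(13) & Chr(10))
Dim x As Integer = 1

ws.SendData(bt)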


Note: I know other ways to get webpages, I just want to learn how to do it using winsock.

manavo11
Mar 6th, 2007, 04:53 PM
By saying headers, you mean stylesheets and the <head> tag? Do you get only that or that as well as all the HTML? Shouldn't you get that instead of just the <body> tag? :ehh:

dilettante
Mar 6th, 2007, 05:47 PM
You'll always get at least a few headers. You just need to trim them off (making use of the important ones as you go).

benmartin101
Mar 7th, 2007, 05:31 PM
They look like headers. I think they're headers and the body is there as well.. I'll try just removing them and see how it goes. Thanks.

manavo11
Mar 7th, 2007, 08:08 PM
Can you post an example of the data you receive?

ccoder
Mar 8th, 2007, 04:49 PM
benmartin101 wrote: "They look like headers. I think they're headers and the body is there as well.. I'll try just removing them and see how it goes. Thanks."
Is this what you are looking for?

If getHTML is a string containing everything returned by a POST or GET then

Str = Mid$(getHTML, InStr(getHTML, "<html"))

will strip the header info.

You will of course have to take into consideration the fact that "html" may be capitalized. And if you want other info such as the DOCTYPE then the InStr code will have to change accordingly.
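
One way to avoid depending on the "<html" tag at all: HTTP ends the header block with an empty line, so the first CRLF CRLF pair marks where the body starts. Below is a minimal VB6 sketch of that idea, assuming the whole response is already in one string. SplitResponse and its parameter names are made up for illustration and are not from the posts above.

```vb
' Hypothetical helper: splits a raw HTTP response at the first empty line.
' Returns True when the end of the header block was found.
Public Function SplitResponse(ByVal Raw As String, _
                              ByRef Headers As String, _
                              ByRef Body As String) As Boolean
    Dim Pos As Long

    Headers = ""
    Body = ""

    ' The header block ends at the first CRLF CRLF sequence
    Pos = InStr(1, Raw, vbCrLf & vbCrLf, vbBinaryCompare)
    If Pos = 0 Then Exit Function       ' headers not complete yet

    Headers = Left$(Raw, Pos - 1)       ' status line plus header lines
    Body = Mid$(Raw, Pos + 4)           ' everything after the empty line
    SplitResponse = True
End Function
```

Called as SplitResponse(getHTML, Hdr, Body), this keeps the DOCTYPE and works for non-HTML content too. It does not undo chunked or compressed transfer encodings.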

dilettante
Mar 9th, 2007, 04:37 PM
You'll also have to consider that, while it is technically "improper", some servers will return garbage after the end of the valid content. These servers generally expect the user agent (your HTTP client) to respect the Content-Length or Transfer-Encoding header.
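
As a rough illustration of respecting Content-Length, here is a VB6 sketch that assumes the headers and body have already been separated into two strings. GetHeaderValue and TrimToContentLength are hypothetical helper names. The sketch treats one character as one byte, which only holds for single-byte text, and it does not handle Transfer-Encoding: chunked.

```vb
' Hypothetical helper: returns the value of a header, or "" if it is absent.
Public Function GetHeaderValue(ByVal Headers As String, ByVal Name As String) As String
    Dim Lines() As String
    Dim i As Long
    Dim Colon As Long

    Lines = Split(Headers, vbCrLf)
    For i = 1 To UBound(Lines)          ' index 0 is the status line
        Colon = InStr(Lines(i), ":")
        If Colon > 1 Then
            ' Header names are case-insensitive
            If StrComp(Trim$(Left$(Lines(i), Colon - 1)), Name, vbTextCompare) = 0 Then
                GetHeaderValue = Trim$(Mid$(Lines(i), Colon + 1))
                Exit Function
            End If
        End If
    Next i
End Function

' Hypothetical helper: drops anything the server sent past Content-Length.
Public Function TrimToContentLength(ByVal Headers As String, ByVal Body As String) As String
    Dim LenText As String
    Dim n As Double

    TrimToContentLength = Body          ' default: leave the body untouched
    LenText = GetHeaderValue(Headers, "Content-Length")
    If IsNumeric(LenText) Then
        n = CDbl(LenText)
        If n >= 0 And n < Len(Body) Then
            TrimToContentLength = Left$(Body, CLng(n))
        End If
    End If
End Function
```

If the header is missing or not numeric, the body is returned unchanged, which seemed the safer default for a sketch.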

In the most general terms you can't rely on <HTML> and </HTML> as delimiters either. A text, image, css, script, etc. file won't have these, and as suggested HTML data can have prefixes and even suffixes outside the page markup itself.

In the end this is why rolling your own code to handle HTTP requests is usually a waste of time. The effort to create code that works better than "doesn't crash, most of the time" just isn't worth it.


Sometimes of course the effort is useful in expanding your knowledge.

http://www.ietf.org/rfc/rfc2616.txt

Nerd-Man
Mar 10th, 2007, 08:54 AM
I did that before to grab an HTML page using winsock. Below is my code...

Private Sub Winsock1_Connect()
    Dim Chunks As String
    'Build the request: every line ends with CRLF and an empty line ends the request
    Chunks = "GET /index.html" & " HTTP/1.1" & vbCrLf
    Chunks = Chunks & "Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*" & vbCrLf
    Chunks = Chunks & "Accept-Language: en-us" & vbCrLf
    'Note: with gzip/deflate accepted, the server may send a compressed body
    Chunks = Chunks & "Accept-Encoding: gzip, deflate" & vbCrLf
    Chunks = Chunks & "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" & vbCrLf
    Chunks = Chunks & "Host: " & Server.Text & vbCrLf
    Chunks = Chunks & "Connection: Keep-Alive" & vbCrLf & vbCrLf
    Winsock1.SendData Chunks
End Sub

Private Sub Winsock1_DataArrival(ByVal bytesTotal As Long)
    Dim Data As String
    Winsock1.GetData Data, vbString, bytesTotal
    'DataArrival can fire several times per page, so append instead of overwriting
    HTMLSource.Text = HTMLSource.Text & Data
End Sub

I think that is how I got the source of an HTML page using winsock.