-
Please Help, Need A.L.I.C.E Web page Spider
See the link below. I can even pay a little for a complete solution. I'm trying this in VB6.
http://www.nabble.com/Help%2C-Need-A...-t1720827.html
I need a spider that will crawl a web page, grammatically parse it, and create AIML (Artificial Intelligence Markup Language) data. This data will be saved to an AIML file and used to teach a chatterbot the contents of the web page.
The way we see it working is:
1. The spider crawls a page examining the text of each sentence.
2. Then, using a grammatical parser, it reformulates that sentence data into possible patterns and responses to be entered as data in the AIML file.
3. Then it formats this into a standard AIML file and lets you save the code or copy and paste it to another source.
This will require someone experienced in AIML as well as in grammatical sentence parsing.
The idea is to end up with a stand-alone script, possibly a desktop VB program, though I hear that existing Perl and PHP extensions may make this easier. Open to other suggestions. Preferably a desktop application to start.
-
Re: Please Help, Need A.L.I.C.E Web page Spider
I believe this is a forum for PROGRAMMERS to discuss PROGRAMMING, not for people to find someone to write their programs for them.
-
Re: Please Help, Need A.L.I.C.E Web page Spider
smUX's post is a little harsh and he shouldn't have responded like that, but basically he is right in that the Classic Visual Basic forum is for questions concerning problems people are having while developing their applications. Sometimes (rarely actually) someone will write a whole app for someone but I strongly doubt that that will happen with something as complex as you desire. I could however move the thread to the Open Positions forum and while that forum is designed for jobs, someone might be willing to earn a few bucks helping you with this. Let me know what you want me to do.
-
Re: Please Help, Need A.L.I.C.E Web page Spider
Could it be in both ? I'm already trying to program it myself so I also need hints.
-
Re: Please Help, Need A.L.I.C.E Web page Spider
Yeah, just call me Mr Coarse :-)
Ty, I'd suggest that if you're planning to code it yourself you should come to us when you have PROBLEMS with the code rather than asking us to write it for you (whether remuneration is provided or not).
I've been known to write apps for people for free if they're in the same genre as the sort of stuff I find fun to code...this idea is close but I know nothing about AIML. I could easily write the parsing shizzle and get the outputted stuff into an array for someone else to write the rest of the code for :-)
-
Re: Please Help, Need A.L.I.C.E Web page Spider
Man that would be sooo great smux
-
Re: Please Help, Need A.L.I.C.E Web page Spider
I said *could* not would :-P
If you need help with any coding problems, ask in the forum...if I know the answer (and am on) I will help :-)
-
Re: Please Help, Need A.L.I.C.E Web page Spider
We don't generally allow the same thread in two places, so post or attach what code you have here and let us know the problems you are having.
-
Re: Please Help, Need A.L.I.C.E Web page Spider
Just 1 web page?
And what type of page is it?
What exactly is in the page?
Anyway basically you just grab the page text into a string and work with that in VB .. lots of examples in the networking and ASP section ..
-
Re: Please Help, Need A.L.I.C.E Web page Spider
Ok, right now I'm just starting and trying to open the url. I expect the parsing and converting to aiml will be main issues. I'll post more as problems arise since my vb is rusty. This is for a work project.
-
Re: Please Help, Need A.L.I.C.E Web page Spider
I'm pretty sure I have the URL open, but what's the easiest way to parse everything on the web page? Thanks.
-
Re: Please Help, Need A.L.I.C.E Web page Spider
What are you using to open the webpage?
-
Re: Please Help, Need A.L.I.C.E Web page Spider
If you're using Inet then you would use "HTML = inet1.openurl..." to open a URL, and the variable HTML would then contain (you can guess :-)) the HTML from that URL.
Parsing an HTML file for links is very simple. You can either go the split() way followed by instr() on each element of the array you generate, or you can just use instr(). If you just want the URLs, the split() way would be easiest. You simply do a linkarray = split(lcase(HTML),"<a href") (using lcase ensures that it matches whether the source code has lower or upper case, though it also lower-cases the URLs themselves :-)), which splits the HTML into array elements using "<a href" as the split point...meaning that each element after the first now starts with one URL. Element 0 holds whatever came before the first link, and you then need to find out which element holds the first URL you want to parse; that's up to you to work out though :-)
Once you know where your first URL is, you would parse it out by doing a=instr(linkarray(x),chr(34))+1 (chr(34) is a " which is usually used to enclose a URL in HTML source :-)) and then you would do the same again, but passing the value a as the start position so it searches from there...like this: b=instr(a,linkarray(x),chr(34))...This now gives you the start and end points for the URL, so all you now need to do is a mid(linkarray(x),a,b-a) to get the URL out :-)
Note that if you're using split() you NEED to dim the variable thus: "Dim linkarray() As String"...the () tells the program that it is a dynamic array, which split() will size for you :-)
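A minimal sketch of the Split()/InStr() approach described above, assuming every href value is wrapped in double quotes. ExtractLinks is a made-up helper name, not something from the thread, and it uses vbTextCompare instead of LCase so the URLs keep their original case:

```vb
' Sketch: collect the quoted href values from a page's HTML.
' ExtractLinks is a hypothetical helper name, not part of VB6.
Public Function ExtractLinks(ByVal sHtml As String) As Collection
    Dim parts() As String
    Dim result As New Collection
    Dim i As Long, a As Long, b As Long

    ' vbTextCompare matches "<A HREF" and "<a href" alike
    ' without lower-casing the URLs themselves.
    parts = Split(sHtml, "<a href", -1, vbTextCompare)

    ' parts(0) is whatever came before the first link, so start at 1.
    For i = 1 To UBound(parts)
        a = InStr(parts(i), Chr$(34))               ' opening quote
        If a > 0 Then
            b = InStr(a + 1, parts(i), Chr$(34))    ' closing quote
            If b > a Then
                result.Add Mid$(parts(i), a + 1, b - a - 1)
            End If
        End If
    Next i

    Set ExtractLinks = result
End Function
```

Looping over the result with For Each and Debug.Print should list one URL per link. Links written with single quotes or no quotes at all would need extra handling.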
-
Re: Please Help, Need A.L.I.C.E Web page Spider
The only disadvantage of using Inet this way is that OpenURL isn't asynchronous: it pauses processing in your program until it has the HTML, so your app can't do anything else while the page downloads.
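For what it's worth, the Internet Transfer control also has an Execute method that returns straight away and reports progress through the StateChanged event, which avoids the blocking OpenURL call. A rough sketch, assuming a form with a control named Inet1 and a button named Command1 (both names and the URL are placeholders):

```vb
' Sketch for a form holding an Internet Transfer control named Inet1
' (the control name and the target URL are assumptions).
Private Sub Command1_Click()
    ' Execute returns immediately; the response arrives via StateChanged.
    Inet1.Execute "http://www.example.com", "GET"
End Sub

Private Sub Inet1_StateChanged(ByVal State As Integer)
    Dim sHtml As String, sChunk As String

    If State = icResponseCompleted Then
        ' Read the buffered response in pieces until nothing is left.
        Do
            sChunk = Inet1.GetChunk(1024, icString)
            sHtml = sHtml & sChunk
        Loop While Len(sChunk) > 0
        Debug.Print sHtml
    ElseIf State = icError Then
        Debug.Print "Error " & Inet1.ResponseCode & ": " & Inet1.ResponseInfo
    End If
End Sub
```

The form stays responsive while the request runs, at the cost of splitting the logic across two procedures.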
-
Re: Please Help, Need A.L.I.C.E Web page Spider
this grabs all URLs .. i still prefer winsock, but this is a quick way to do it .. use MSXML 4.0 if your server supports it so you can set timeouts ..
VB Code:
'// REFERENCE Microsoft XML, Version 2.0
'// GRAB URLS //
Private Sub Command1_Click()
Dim sText As String
Dim sArray() As String
Dim i As Integer
sText = SendRequest("http://www.thatlinkofmine.com")
If Len(sText) Then
sArray = Split(sText, "<a href", , vbTextCompare)
For i = 1 To UBound(sArray) ' element 0 is the text before the first link
Debug.Print Replace(StripText(sArray(i), "=", ">"), """", "")
Next i
Else
Debug.Print "Nothing to display"
End If
End Sub
'// GET TEXT FROM WEB PAGE
Private Function SendRequest(ByVal strUrl As String) _
As String
On Error Resume Next
Dim objHTTP As New MSXML.XMLHTTPRequest
objHTTP.Open "GET", strUrl, False
objHTTP.setRequestHeader "Content-Type", "text/html"
If Err = 0 Then
objHTTP.send
SendRequest = objHTTP.responseText
Else
MsgBox "Error " & Err.Number & _
vbNewLine & Err.Description
End If
End Function
'// STRIP TEXT FROM STRING
Public Function StripText(ByVal source As String, _
ByVal start As String, ByVal finish As String) As String
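Dim iPos As Long, iPoe As Long ' Long: InStr positions can exceed 32,767 on large pages
iPos = InStr(1, source, start, vbTextCompare)
If iPos = 0 Then Exit Function ' start marker missing: return ""
iPos = iPos + Len(start)
iPoe = InStr(iPos, source, finish, vbTextCompare)
If iPoe = 0 Then Exit Function ' finish marker missing: return ""
StripText = Trim$(Mid$(source, iPos, (iPoe - iPos)))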
End Function
-
Re: Please Help, Need A.L.I.C.E Web page Spider
Thanks guys. I'm actually trying to parse every word on the page, so I guess I'll have to modify your suggestions slightly.
-
Re: Please Help, Need A.L.I.C.E Web page Spider
Every single word? That's probably even easier. First you would use a split() to split the whole thing by spaces, then you'd have all the words in an array. You might want to filter the HTML out of the file first though...splitting at each < and deleting everything up to and including the following > should be enough, although a bit much in this case I think. Someone else might have a better idea :-)
-
Re: Please Help, Need A.L.I.C.E Web page Spider
Well, if you're looking to capture sentences, what you'd do is use split() to put all of the sentences into an array first; you'd need to look for all sentence-ending punctuation (! . ?). Then you can split each sentence up by the spaces and look at each individual word.
-
Re: Please Help, Need A.L.I.C.E Web page Spider
this cleans all the HTML tags and just shows the text with spaces ..
VB Code:
Private Sub Command1_Click()
Dim strHTML As String
Dim key1 As Long
Dim key2 As Long
strHTML = "<meta name=""keywords"" content=""test"">This Text<font size=""2"">More Text</font><img src=""test.jpg"">Lots of Text"
Do While InStr(strHTML, ">") > 0
key1 = InStr(1, strHTML, "<", 1)
key1 = key1 + Len("<")
key2 = InStr(key1, strHTML, ">", 1)
strHTML = Replace(strHTML, "<" & Trim(Mid(strHTML, key1, (key2 - key1))) & ">", " ")
Loop
Debug.Print strHTML
End Sub
-
Re: Please Help, Need A.L.I.C.E Web page Spider
Actually my bad, I'd better just look for whole sentences for my array right now. Dayjo, would I have to do a split() for each punctuation like (!,., ?) or is there a way to enter them all at once with one split(). Also would it still have the html junk in it, like mine does right now ?
-
Re: Please Help, Need A.L.I.C.E Web page Spider
Rory, I'm getting an infinite loop with your sample I'm not sure why ?
-
Re: Please Help, Need A.L.I.C.E Web page Spider
Quote:
Originally Posted by tyademosu
Actually my bad, I'd better just look for whole sentences for my array right now. Dayjo, would I have to do a split() for each punctuation like (!,., ?) or is there a way to enter them all at once with one split(). Also would it still have the html junk in it, like mine does right now ?
You might also have to worry about sentences like "The car costs 27,999.99."
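Split() only takes a single delimiter, so one way around both problems is to walk the text once and break only where a ".", "!" or "?" is followed by whitespace or the end of the text. This is a minimal sketch under that assumption; SplitSentences is a hypothetical helper name, and abbreviations such as "Mr. Smith" would still be split in the wrong place:

```vb
' Sketch: split text into sentences on . ! or ? followed by whitespace.
' SplitSentences is a hypothetical helper name.
Public Function SplitSentences(ByVal sText As String) As Collection
    Dim result As New Collection
    Dim i As Long, startPos As Long
    Dim ch As String, nextCh As String, piece As String

    startPos = 1
    For i = 1 To Len(sText)
        ch = Mid$(sText, i, 1)
        If ch = "." Or ch = "!" Or ch = "?" Then
            nextCh = Mid$(sText, i + 1, 1)   ' "" at the end of the text
            ' Only break when the mark ends a word, so the "." inside
            ' 27,999.99 does not split the sentence.
            If nextCh = "" Or nextCh = " " Or nextCh = vbCr Or nextCh = vbLf Or nextCh = vbTab Then
                piece = Trim$(Mid$(sText, startPos, i - startPos + 1))
                If Len(piece) > 0 Then result.Add piece
                startPos = i + 1
            End If
        End If
    Next i

    ' Keep any trailing text that has no closing punctuation.
    piece = Trim$(Mid$(sText, startPos))
    If Len(piece) > 0 Then result.Add piece

    Set SplitSentences = result
End Function
```

Any HTML still in the text comes through untouched, so the tags need to be stripped before calling it. Trim$ only removes spaces, so pieces can still begin with a line break.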
-
Re: Please Help, Need A.L.I.C.E Web page Spider
Quote:
Originally Posted by tyademosu
Rory, I'm getting an infinite loop with your sample I'm not sure why ?
Yeah I did with yahoo too, though it works with all my sites .. oh well .. hey Ill throw something together a little later tonight .. in the middle of work right now ..
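One likely cause of the infinite loop: the loop keeps running while any ">" is left, but it only removes text when Replace() finds an exact "<...>" match. A stray ">" with no "<" before it, or a tag with padding inside the brackets that Trim() strips off, leaves the string unchanged, so the condition never becomes false. A sketch that cuts tags out by position instead (StripTags is a hypothetical helper name):

```vb
' Sketch: replace every <...> tag with a space, working by position.
' StripTags is a hypothetical helper name.
Public Function StripTags(ByVal sHtml As String) As String
    Dim openPos As Long, closePos As Long

    openPos = InStr(1, sHtml, "<")
    Do While openPos > 0
        closePos = InStr(openPos, sHtml, ">")
        If closePos = 0 Then Exit Do   ' unclosed tag: stop instead of spinning
        ' Cutting by position shortens the string on every pass,
        ' so the loop is guaranteed to finish.
        sHtml = Left$(sHtml, openPos - 1) & " " & Mid$(sHtml, closePos + 1)
        openPos = InStr(openPos, sHtml, "<")
    Loop

    StripTags = sHtml
End Function
```

Script and style contents are still left in the output, since only the tags themselves are removed.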
-
Re: Please Help, Need A.L.I.C.E Web page Spider
I am using this API (thanks to iprank) to download an HTML file directly to disk without the WebBrowser control, the advantage being that you can skip loading any images in the HTML file...
VB Code:
Private Declare Function URLDownloadToFile Lib "urlmon" Alias "URLDownloadToFileA" (ByVal pCaller As Long, ByVal szURL As String, ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long
Public Function DownloadFile(URL As String, LocalFilename As String) As Boolean
Dim lngRetVal As Long
lngRetVal = URLDownloadToFile(0, URL, LocalFilename, 0, 0)
If lngRetVal = 0 Then DownloadFile = True
End Function
Private Sub Form_Load()
DownloadFile "http://www.somesite.com", "c:\sample.txt"
End Sub
hope it helps!!
-
Re: Please Help, Need A.L.I.C.E Web page Spider
OK, this works and strips all HTML tags .. it splits at each < and then replaces everything up to the next >, which closes the tag .. of course you also get a lot of JavaScript that is not replaced .. but it gives you an idea .. I don't know what your web page contains, so I will need some more info or an example ..
VB Code:
'// REFERENCE Microsoft XML, Version 2.0
'// STRIP HTML TAGS //
Private Sub Command1_Click()
'// DECLARATIONS
Dim sText As String
Dim sArray() As String
Dim iPor As String ' THE TAG TEXT TO REPLACE
Dim iPoe As Long ' LONG: A CHUNK CAN BE LONGER THAN 32,767 CHARACTERS
Dim i As Integer
sText = SendRequest("http://www.yahoo.com") ' URL TO GRAB
If Len(sText) Then ' THERE IS TEXT
sArray = Split(sText, "<") ' SPLIT BY TAG START
For i = 0 To UBound(sArray) ' LOOP THROUGH
iPoe = InStr(sArray(i), ">") ' GET REPLACE LENGTH
If iPoe Then
iPor = "<" & Mid$(sArray(i), 1, (iPoe - 1)) & ">" ' OUR REPLACE STRING
sText = Trim$(Replace(sText, iPor, " ")) ' REPLACE IN TEXT
End If
Next i ' NEXT TAG START
Debug.Print sText ' DISPLAY FINAL TEXT
Else
Debug.Print "Nothing to display"
End If
End Sub
'// GET TEXT FROM WEB PAGE
Private Function SendRequest(ByVal strUrl As String) _
As String
On Error Resume Next
Dim objHTTP As New MSXML.XMLHTTPRequest ' CREATE OBJECT
objHTTP.Open "GET", strUrl, False ' START REQUEST
objHTTP.setRequestHeader "Content-Type", "text/html"
If Err = 0 Then ' NO ERRORS
objHTTP.send ' SEND REQUEST
SendRequest = objHTTP.responseText ' GET TEXT
Else
MsgBox "Error " & Err.Number & _
vbNewLine & Err.Description
End If
End Function
-
Re: Please Help, Need A.L.I.C.E Web page Spider
The only issue is that it saves the file in Unix format, so I use a function something like this one
http://www.a1vbcode.com/snippet-2938.asp
to convert it to a DOS-compatible one..
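If the linked snippet is unavailable, the conversion itself is short. This is a sketch of the general idea, not the linked code, and ToDosLineEndings is a hypothetical helper name:

```vb
' Sketch: normalise line endings to CRLF.
' ToDosLineEndings is a hypothetical helper name.
Public Function ToDosLineEndings(ByVal sText As String) As String
    ' Collapse CRLF and bare CR to LF first so existing CRLF pairs are not doubled.
    sText = Replace(sText, vbCrLf, vbLf)
    sText = Replace(sText, vbCr, vbLf)
    ToDosLineEndings = Replace(sText, vbLf, vbCrLf)
End Function
```

Running it on text that already uses CRLF leaves that text unchanged.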
-
Re: Please Help, Need A.L.I.C.E Web page Spider
Rory thanks for your input. Attached is the output for the main page I'm trying to parse at http://www.cookbookwiki.com/rice with the code you suggested.
-
Re: Please Help, Need A.L.I.C.E Web page Spider
We went a little further in another thread after yours .. see this thread ..
http://www.vbforums.com/showthread.php?t=412360
and the code in my post #13
http://www.vbforums.com/showpost.php...5&postcount=13
Maybe strips a little too much but will give you something more to work with ..
For example .. you can comment out the ones you don't want in the following section marked CUSTOM .. you may also want to comment out the RemoveLines call, as that gets rid of line breaks, which it seems you may want ..? If you keep the line breaks .. then you can split the text into an array of lines ..
VB Code:
i = RemoveLines(i)
i = RemoveTags(i, "<style", "</style>")
i = RemoveTags(i, "<script", "</script>")
i = RemoveTags(i, "<!--", "-->")
i = RemoveTags(i, "<", ">")
'// START CUSTOM
i = RemoveTags(i, "&#", ";") ' SPECIAL SYMBOLS
i = RemoveChars(i, "&nbsp;#&amp;#&quot;#&gt;#&lt;#[#]#""#;#:#.#,#'#/#$#%#?#!#|#(#)#=#-#+#&#*#©#®")
i = RemoveDigits(i, "0 1 2 3 4 5 6 7 8 9")
i = RemoveCommon(i, "a b c d e f g h i j k l m n o p q r s t u v w x y z")
i = RemoveCommon(i, "at and com is or of to that this then the was what with where who when")
'// END CUSTOM
i = RemoveMultiple(i, " ") ' GET RID OF MULTIPLE SPACES
i = StrConv(i, vbProperCase) ' UPPER CASE FIRST LETTER
-
Re: Please Help, Need A.L.I.C.E Web page Spider
Here is a modified version i just tested out ... seems to work pretty well ..
Adds line numbers to the text box also ..
-
Re: Please Help, Need A.L.I.C.E Web page Spider
Thanks so much again, Rory.
My ultimate goal for now is to read text from a FAQ web page and store the Q/A pairs somehow. They will then all be written to an AIML output file in the format below:
<aiml>
<category>
<pattern>WHAT ARE YOU</pattern>
<template>
I am the latest result in artificial intelligence,
which can reproduce the capabilities of the human brain
with greater speed and accuracy.
</template>
</category>
.
.
.
</aiml>
see -- http://www.alicebot.org/aiml.html
I can even compensate just a little for a complete solution or maybe you can continue to coach me.
-
Re: Please Help, Need A.L.I.C.E Web page Spider
I'd have to read up on this AI stuff .. the files I mean .. but basically, do you know what the FAQ looks like as far as the HTML goes ..?
It should be easy once you know what you are working with .. in other words, what does the Faq look like.. the questions and answers ..?
-
Re: Please Help, Need A.L.I.C.E Web page Spider
Any basic FAQ web page entered by the user should do, i.e. question followed by answer, and so on. Initially I wanted to use all web pages, but I think that is more of a research project and a little beyond basic AIML capabilities right now. The AIML output file really should be OK in the format I posted above, for now anyway.
-
Re: Please Help, Need A.L.I.C.E Web page Spider
They're going to differ depending on the web site developer's html design ..
do you have any site link samples in mind ..?
-
Re: Please Help, Need A.L.I.C.E Web page Spider
True. Well, a very elementary question/answer structure for now, maybe
http://www.vbforums.com/faq.php?faq=...b_why_register
and
http://gmail.google.com/mail/help/about.html -- or simpler
Using the WebBrowser control's innerText might be the fastest way to start. --
Msn messenger tyademosu(aht)hotmail.com
-
Re: Please Help, Need A.L.I.C.E Web page Spider
Here's a great FAQ page.
http://www.talkorigins.org/origins/faqs-qa.html.
I'll see if I can make the AIML output format easier, i.e. just providing the answers and letting the chat bot create the responses (teaching the chat bot) -- this might not be possible yet.
-
Re: Please Help, Need A.L.I.C.E Web page Spider
If all you want is the text.. you can use the HTML object...
Add a reference to the Microsoft HTML Object Library.
VB Code:
Dim HTML As HTMLDocument
Dim hText As String
Private Sub Form_Load()
Dim tHTML As New HTMLDocument
Set HTML = tHTML.createDocumentFromUrl("http://www.yahoo.com", vbNullString)
Do While HTML.readyState <> "complete"
DoEvents
Debug.Print HTML.readyState 'just to see it working
Loop
hText = HTML.documentElement.innerText
Debug.Print hText 'there is all the text.. NO HTML Tags
End Sub
That will ensure you get ONLY the text from the page..
Then you can split it out into sentences...
Here is a good idea.. if you will have MS Word available to you.. USE IT!
Create a new doc through VB.. dump the hText into it, then use Word's own code to split it up. Let Word do all the work for you!
VB Code:
For x = 1 To ThisDocument.Sentences.Count
Debug.Print ThisDocument.Sentences(x).Text
Next
I tested this with odd sentences.. like ones with numbers such as 27,199.00.
and it works perfectly ;)
-
Re: Please Help, Need A.L.I.C.E Web page Spider
Which ever method you get the text only .. each page is different .. for example the google link you posted, i was able to get them, but i also got the question links above them .. with some modifications i can get exactly what i need, but then if you take the page you posted below .. thats different also .. though that one is much easier as all it has is questions and answers .. we simply split the ? marks ..
-
Re: Please Help, Need A.L.I.C.E Web page Spider
Static I like your Word suggestion. Can you please elaborate : how do I create a new doc and dump into etc
-
Re: Please Help, Need A.L.I.C.E Web page Spider
heres a few more lines of code for that...
the final result is an array of sentences...
im sure you will need to play around a bit, remove blanks etc...
but it works....
VB Code:
Dim HTML As HTMLDocument
Dim hText As String
Private Sub Form_Load()
Dim tHTML As New HTMLDocument
Set HTML = tHTML.createDocumentFromUrl("http://www.yahoo.com", vbNullString)
Do While HTML.readyState <> "complete"
DoEvents
Debug.Print HTML.readyState 'just to see it working
Loop
hText = HTML.documentElement.innerText
Dim tmp As String
'Start new word app
Dim wrd As New Word.Application
Dim Doc As Word.document
'New Document
Set Doc = wrd.Documents.Add
Dim dSentences() As String
'just so we can see it....
wrd.Visible = True
'type the "text" into the doc
wrd.selection.TypeText Text:=hText
'set each sentence to an element in the array dSentences
ReDim dSentences(Doc.Sentences.Count - 1)
For x = 1 To Doc.Sentences.Count
dSentences(x - 1) = Doc.Sentences(x).Text
Next
'loop through and print the array
For x = 0 To UBound(dSentences)
Debug.Print dSentences(x)
Next
Doc.Close False
wrd.Quit False
Set Doc = Nothing
Set wrd = Nothing
End Sub
-
Re: Please Help, Need A.L.I.C.E Web page Spider
Thanks -- my vb6 recognizes wordctl.document -- but there is no word.document or word.application.
-
Re: Please Help, Need A.L.I.C.E Web page Spider
sorry. you need to add a reference to the Microsoft Word x.0 Object Library
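If adding the reference is a problem, Word can also be driven late-bound through CreateObject, which needs no project reference but gives up IntelliSense and compile-time checking. A minimal sketch, assuming Word is installed on the machine; WordSentences is a hypothetical helper name:

```vb
' Sketch: late-bound Word automation, no project reference required.
' WordSentences is a hypothetical helper name.
Public Function WordSentences(ByVal sText As String) As Collection
    Dim wrd As Object, doc As Object
    Dim result As New Collection
    Dim n As Long

    Set wrd = CreateObject("Word.Application")
    Set doc = wrd.Documents.Add
    doc.Content.Text = sText

    For n = 1 To doc.Sentences.Count
        result.Add doc.Sentences(n).Text
    Next n

    doc.Close False      ' discard the scratch document
    wrd.Quit False
    Set doc = Nothing
    Set wrd = Nothing

    Set WordSentences = result
End Function
```

Starting Word is slow, so it is worth calling this once per page rather than once per paragraph.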
-
Re: Please Help, Need A.L.I.C.E Web page Spider
Wow, Pretty incredible static. I found the word reference shortly after I posted my false alarm. I think I'm almost done, my next task is to take the questions and answers and dump them to a *.aiml file in the format :
<aiml>
<category>
<pattern>WHAT ARE YOU</pattern>
<template>
I am the latest result in artificial intelligence,
which can reproduce the capabilities of the human brain
with greater speed and accuracy.
</template>
</category>
.
.
.
</aiml>
I appreciate any inputs. Many thanks again.
-
Re: Please Help, Need A.L.I.C.E Web page Spider
Static, Rory, everyone thanks for your suggestions. The word suggestion was the turning point.
This code works for the last FAQ page I suggested and maybe similar formats. Unfortunately my chatbot is having trouble learning the output, but I don't think that's a VB6 problem. Hopefully I'll find a way to make the code work for other FAQ pages. My ultimate goal is to teach a chatbot info from any kind of page, but again that might be more of an AIML research project.
VB Code:
Private Sub Crawl_Click()
WebBrowser1.Navigate Url.Text
End Sub
Private Sub Form_Load()
WebBrowser1.Navigate Url.Text
End Sub
Private Sub WebBrowser1_DocumentComplete(ByVal pDisp As Object, Url As Variant)
Dim hText As String, terms() As String, x As Integer, i As Integer, question1 As Integer
i = 0
If (pDisp Is WebBrowser1.Application) Then
'Debug.Print WebBrowser1.Document.documentElement.innerText
hText = WebBrowser1.Document.documentElement.innerText
'terms() = Split(webpage, ".")
Dim tmp As String
'Start new word app
Dim wrd As New Word.Application
Dim Doc As Word.Document
'New Document
Set Doc = wrd.Documents.Add
Dim dSentences() As String
Dim questions() As String, answers() As String
'just so we can see it....
'wrd.Visible = True
'type the "text" into the doc
wrd.selection.TypeText Text:=hText
'set each sentence to an element in the array dSentences
ReDim dSentences(Doc.Sentences.Count - 1)
For x = 1 To Doc.Sentences.Count
dSentences(x - 1) = Doc.Sentences(x).Text
If InStr(Doc.Sentences(x).Text, "?") Then
i = i + 1
End If
Next
ReDim questions(i - 1)
ReDim answers(i - 1)
Dim j As Integer
j = 0
Dim response As String
i = 0
'loop through and populate the arrays
x = 0
Do While x <= UBound(dSentences)
If InStr(dSentences(x), "?") > 0 Then
'questions array
questions(i) = dSentences(x)
'answer array: gather every following sentence up to the next question
response = ""
x = x + 1
Do While x <= UBound(dSentences)
If InStr(dSentences(x), "?") > 0 Then Exit Do
response = response & dSentences(x)
x = x + 1
Loop
'store the answer at the same index as its question
answers(i) = response
i = i + 1
Else
x = x + 1 'text before the first question is skipped
End If
Loop
'write aiml file
Set fs = CreateObject("Scripting.FileSystemObject")
Set a = fs.CreateTextFile("C:\Documents and Settings\Administrator\Desktop\testfile.aiml", True)
a.Write "<aiml>"
For x = 0 To UBound(questions)
a.Writeline
a.Writeline
a.Write "<category>"
a.Writeline
a.Write "<pattern>" & questions(x) & "</pattern>"
a.Writeline
a.Write "<template>" & answers(x) & "</template>"
a.Writeline
a.Write "</category>"
Next
a.Writeline
a.Writeline
a.Write "</aiml>" ' aiml file written
Doc.Close False
wrd.Quit False
Set Doc = Nothing
Set wrd = Nothing
a.Close 'close aiml file
End If
End Sub
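One thing the file-writing loop above does not handle is text containing &, < or >, which would make the AIML file invalid XML. A minimal sketch of escaping the text before writing it; XmlEscape and AimlCategory are hypothetical helper names, not part of VB6 or of any AIML library:

```vb
' Sketch: escape text for use inside AIML elements and build one category.
' XmlEscape and AimlCategory are hypothetical helper names.
Public Function XmlEscape(ByVal sText As String) As String
    ' "&" must be replaced first, or the entities added below get re-escaped.
    sText = Replace(sText, "&", "&amp;")
    sText = Replace(sText, "<", "&lt;")
    sText = Replace(sText, ">", "&gt;")
    XmlEscape = sText
End Function

Public Function AimlCategory(ByVal sPattern As String, ByVal sTemplate As String) As String
    AimlCategory = "<category>" & vbCrLf & _
                   "<pattern>" & XmlEscape(sPattern) & "</pattern>" & vbCrLf & _
                   "<template>" & XmlEscape(sTemplate) & "</template>" & vbCrLf & _
                   "</category>" & vbCrLf
End Function
```

In the loop above, the a.Write calls for one category could then be replaced by a single a.Write AimlCategory(questions(x), answers(x)).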
-
Re: Please Help, Need A.L.I.C.E Web page Spider
tyademosu - if you are all set with this post.. could you mark it resolved?
(Click thread tools - Mark thread resolved)
THANKS!
-
Re: Please Help, Need A.L.I.C.E Web page Spider
The job is actually only about half done for me. I'm hoping I can leave this open for a while longer.
-
Re: Please Help, Need A.L.I.C.E Web page Spider
Quick question: can anyone recommend a fast way to remove the "?" at the end of each question? Thanks again.
format ...john? ....jane? etc
-
Re: Please Help, Need A.L.I.C.E Web page Spider
False alarm. Replace() seems to work great.
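Beyond the trailing "?", AIML interpreters generally normalize what the user types (upper case, punctuation removed) before matching it against patterns, so a pattern holding lower-case text or punctuation may never match. That may be part of why the bot had trouble learning the output, though this is a guess. A sketch of a stricter clean-up than a single Replace(); ToAimlPattern is a hypothetical helper name and the rules are a simplification of what a real interpreter does:

```vb
' Sketch: turn a question into an AIML-style pattern.
' ToAimlPattern is a hypothetical helper name.
Public Function ToAimlPattern(ByVal sQuestion As String) As String
    Dim i As Long
    Dim ch As String, sOut As String

    sQuestion = UCase$(sQuestion)
    For i = 1 To Len(sQuestion)
        ch = Mid$(sQuestion, i, 1)
        Select Case ch
            Case "A" To "Z", "0" To "9"
                sOut = sOut & ch
            Case Else
                ' Punctuation, line breaks and other characters become a space.
                sOut = sOut & " "
        End Select
    Next i

    ' Collapse the runs of spaces left behind by removed characters.
    Do While InStr(sOut, "  ") > 0
        sOut = Replace(sOut, "  ", " ")
    Loop

    ToAimlPattern = Trim$(sOut)
End Function
```

For example, ToAimlPattern("Who is John?") returns "WHO IS JOHN". Contractions come out split ("What's" becomes "WHAT S"), so they may need their own substitutions first.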