I need a spider created that will crawl a webpage, grammatically parse the page, and create AIML (Artificial Intelligence Markup Language) data. This data will be saved into an AIML file and used to teach a chatterbot the contents of the web page.
The way we see it working is:
1. The spider crawls a page examining the text of each sentence.
2. Then, using a grammatical parser, it will reformulate that sentence data into possible patterns and responses to be entered as data in the AIML file.
3. Then it will format this into a standard AIML file and allow you to save this code or copy and paste it to another source.
This will require someone experienced in AIML as well as grammatical sentence parsing.
The idea for this project is to get a stand-alone script, possibly a desktop VB script, but I hear that existing Perl and PHP extensions may make this easier. I'm open to other suggestions, though I'd prefer a desktop application to start.
Last edited by tyademosu; Jun 16th, 2006 at 05:51 PM.
I love helping noobs with their VB problems (probably because, as an amateur programmer, I am only slightly better at VB than them :-)) but if you SERIOUSLY want to get help for free from a community such as VBForums, you have to first have a grounding (basic knowledge) in VB6, otherwise you're way too much work to help...You've got to give a little if you want to get help from us, in other words!
And we DON'T do your homework. If your tutor doesn't teach you enough to help you make the project without his or her help, FIND A BETTER TUTOR or try reading books on programming! We are happy to help with minor things regarding the project, but you have to understand the rest of it if you want our help to be useful.
smUX's post is a little harsh and he shouldn't have responded like that, but basically he is right in that the Classic Visual Basic forum is for questions concerning problems people are having while developing their applications. Sometimes (rarely actually) someone will write a whole app for someone but I strongly doubt that that will happen with something as complex as you desire. I could however move the thread to the Open Positions forum and while that forum is designed for jobs, someone might be willing to earn a few bucks helping you with this. Let me know what you want me to do.
Ty, I'd suggest that if you're planning to code it yourself you should come to us when you have PROBLEMS with the code rather than asking us to write it for you (whether remuneration is provided or not).
I've been known to write apps for people for free if they're in the same genre as the sort of stuff I find fun to code...this idea is close but I know nothing about AIML. I could easily write the parsing shizzle and get the outputted stuff into an array for someone else to write the rest of the code for :-)
Ok, right now I'm just starting and trying to open the URL. I expect the parsing and converting to AIML will be the main issues. I'll post more as problems arise since my VB is rusty. This is for a work project.
if you're using inet then you would use "HTML = inet1.openurl..." when you use inet to open a URL, and the variable HTML would contain (you can guess :-)) the HTML from that URL.
Parsing an HTML file for links is very simple. You can either go the split() way followed by instr() at each part of the array you generate or you can just use instr. If you just want the URLs, the split() way would be easiest. You simply do a linkarray = split(lcase(HTML),"<a href") (using lcase ensures that it matches whether the source code has lower or upper case :-) but bear in mind that it also lower-cases the URLs themselves) which will basically split the HTML up into arrays using the "<a href" as a split point...meaning that each part of the array after the first now holds one URL. You then would need to find out which array holds the first URL you want to parse, that's up to you to work out though :-)
Once you know where your first URL is, you would parse it out by doing an a=instr(linkarray(x),chr(34))+1 (chr(34) is a " which is usually used to encapsulate a URL in HTML source :-)) and then you would do the same again, but before "linkarray(x)" you would have the value a so it looks from there...like this: b=instr(a,linkarray(x),chr(34))...This now gives you the start and end points for the URL, so all you now need to do is a mid() to get the URL out :-)
Note that if you're using split() you NEED to dim the variable thus: "DIM linkarray() as string"...the () tells the program that it will be redimmed later on with the right value (which is what split does :-))
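Putting the split()/instr()/mid() steps above together, it might look something like this. It's only a rough sketch: the function name ExtractLinks is made up for illustration, it assumes every href is wrapped in double quotes, and it uses the optional compare argument of Split() (vbTextCompare) instead of lcase so the URLs keep their original case.
VB Code:
Private Function ExtractLinks(ByVal HTML As String) As Collection
    Dim linkarray() As String
    Dim links As New Collection
    Dim x As Long, a As Long, b As Long

    ' vbTextCompare matches <A HREF as well as <a href
    ' without lower-casing the page itself
    linkarray = Split(HTML, "<a href", -1, vbTextCompare)

    ' element 0 is whatever came before the first link, so start at 1
    For x = 1 To UBound(linkarray)
        a = InStr(linkarray(x), Chr(34)) + 1        ' just past the opening quote
        If a > 1 Then
            b = InStr(a, linkarray(x), Chr(34))     ' the closing quote
            If b > a Then links.Add Mid$(linkarray(x), a, b - a)
        End If
    Next x

    Set ExtractLinks = links
End Function
You could then loop over the result with For Each and Debug.Print each entry to see what came back. Pieces with no quote in them are simply skipped.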
The only disadvantage of using Inet's OpenURL is that it isn't asynchronous: it pauses processing in your program until it has the HTML, so your app can't do anything else while the page is downloading.
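For what it's worth, the Inet control also has an Execute method that returns straight away and reports progress through the StateChanged event, which may get around the blocking. A minimal sketch, assuming a control named Inet1 and a button named Command1 on the form (both names are just placeholders), with the URL borrowed from later in this thread:
VB Code:
Private Sub Command1_Click()
    ' Execute returns immediately; the page arrives later
    Inet1.Execute "http://www.cookbookwiki.com/rice", "GET"
End Sub

Private Sub Inet1_StateChanged(ByVal State As Integer)
    Dim vtData As Variant
    Dim HTML As String

    If State = icResponseCompleted Then
        ' read the response in chunks until nothing is left
        vtData = Inet1.GetChunk(1024, icString)
        Do While Len(vtData) > 0
            HTML = HTML & vtData
            vtData = Inet1.GetChunk(1024, icString)
        Loop
        ' HTML now holds the page source, ready for parsing
    End If
End Sub
I haven't tried this against the page in question, so treat it as a starting point rather than a tested solution.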
If I helped you out, please consider adding to my reputation!
-- "The faulty interface lies between the chair and the keyboard" --
Every single word? That's probably even easier. First you would use a split() to split the whole thing by spaces then you'd have all the words in an array. you might want to first filter out the HTML of the file though...splitting at each < and deleting everything at/before the > afterwards should be enough, although a bit much in this case I think. Someone else might have a better idea :-)
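Here's roughly what the "split at each < and drop everything up to the >" idea could look like. StripTags is a made-up name, and this is only a sketch: it won't remove the contents of script blocks or decode things like &nbsp;.
VB Code:
Private Function StripTags(ByVal HTML As String) As String
    Dim parts() As String
    Dim i As Long, p As Long
    Dim result As String

    parts = Split(HTML, "<")
    If UBound(parts) < 0 Then Exit Function     ' empty input
    result = parts(0)                           ' text before the first tag
    For i = 1 To UBound(parts)
        p = InStr(parts(i), ">")
        ' keep only what follows the closing >; the space stops words
        ' running together. A piece with no > at all is dropped.
        If p > 0 Then result = result & " " & Mid$(parts(i), p + 1)
    Next i
    StripTags = result
End Function
After that, words = Split(StripTags(HTML), " ") gives you the words, though you will probably want to skip the empty elements that repeated spaces produce.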
Well, if you're looking to capture sentences, what you'd do is use split() to put all of the sentences into an array first; you'd need to look for all sentence-ending punctuation (! . ?). Then you can split each sentence up by the spaces and look at each individual word.
Actually my bad, I'd better just look for whole sentences for my array right now. Dayjo, would I have to do a split() for each punctuation mark (!, ., ?) or is there a way to enter them all at once with one split()? Also, would it still have the HTML junk in it, like mine does right now?
You might also have to worry about sentences like "The car costs 27,999.99."
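As far as I know split() only takes a single delimiter string, so one way to handle all three punctuation marks in one pass, and to leave numbers like 27,999.99 alone, is to walk through the text and only break where the mark is followed by white space or the end of the text. A sketch along those lines (SplitSentences is a made-up name; abbreviations such as "Dr." will still cause a false break):
VB Code:
Private Function SplitSentences(ByVal sText As String) As Collection
    Dim sentences As New Collection
    Dim i As Long, startPos As Long
    Dim ch As String, nextCh As String, s As String

    startPos = 1
    For i = 1 To Len(sText)
        ch = Mid$(sText, i, 1)
        If InStr(".!?", ch) > 0 Then
            nextCh = Mid$(sText, i + 1, 1)      ' "" at the end of the text
            ' only break when white space or nothing follows,
            ' so the "." inside 27,999.99 is ignored
            If nextCh = "" Or nextCh = " " Or nextCh = vbCr Or nextCh = vbLf Then
                s = Trim$(Mid$(sText, startPos, i - startPos + 1))
                If Len(s) > 0 Then sentences.Add s
                startPos = i + 1
            End If
        End If
    Next i

    ' anything left over that had no closing punctuation
    s = Trim$(Mid$(sText, startPos))
    If Len(s) > 0 Then sentences.Add s
    Set SplitSentences = sentences
End Function
Note that Trim$ only removes spaces, so stray line breaks can survive at the start of a sentence. Running the text through the tag stripping first should take care of the HTML junk.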
Rory, I'm getting an infinite loop with your sample and I'm not sure why.
Yeah I did with yahoo too, though it works with all my sites .. oh well .. hey Ill throw something together a little later tonight .. in the middle of work right now ..
i am using this api(thanks to iprank) to download an html file directly to file without the webbrowser control ,the advantage being that u can skip loading the images if any in an html file...
VB Code:
Private Declare Function URLDownloadToFile Lib "urlmon" Alias "URLDownloadToFileA" (ByVal pCaller As Long, ByVal szURL As String, ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long
Public Function DownloadFile(URL As String, LocalFilename As String) As Boolean
    DownloadFile = (URLDownloadToFile(0, URL, LocalFilename, 0, 0) = 0) ' assumed body: 0 (S_OK) means the download succeeded
End Function
Ok this works and strips all HTML tags .. splits the < then replaces everything up to the next > which closes the HTML tags .. of course you also get a lot of javascript stuff that is not replaced .. but gives you an idea .. I don't know what your web page contains so will need some more info or an example ..
VB Code:
'// REFERENCE Microsoft XML, Version 2.0
'// GRAB URLS //
Private Sub Command1_Click()
'// DECLARATIONS
Dim sText As String
Dim sArray() As String
Dim iPor As String
Dim iPoe As Integer
Dim i As Integer
sText = SendRequest("http://www.yahoo.com") ' URL TO GRAB
Rory thanks for your input. Attached is the output for the main page I'm trying to parse at http://www.cookbookwiki.com/rice with the code you suggested.
Maybe strips a little too much but will give you something more to work with ..
For Example .. you can comment out the ones you dont want, in the following section called CUSTOM .. you may also want to comment out the RemoveLines part, as that gets rid of .. lines... which it seems you may want ..? If you keep the lines .. then you can do an array and split the Lines ..
My ultimate goal for now is to read text from a FAQ webpage and store the Q/A pairs somehow. They will then all be written to an AIML output file in the format below
<aiml>
<category>
<pattern>WHAT ARE YOU</pattern>
<template>
I am the latest result in artificial intelligence,
which can reproduce the capabilities of the human brain
with greater speed and accuracy.
</template>
</category>
.
.
.
</aiml>
see -- http://www.alicebot.org/aiml.html
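Once the Q/A pairs are in two arrays, writing them out in the format above could be sketched like this. The names XmlEscape, MakeCategory and WriteAiml are made up for illustration, and I'm assuming (from the sample above) that patterns should be upper case with the punctuation stripped.
VB Code:
Private Function XmlEscape(ByVal s As String) As String
    s = Replace(s, "&", "&amp;")        ' must be done first
    s = Replace(s, "<", "&lt;")
    s = Replace(s, ">", "&gt;")
    XmlEscape = s
End Function

Private Function MakeCategory(ByVal question As String, ByVal answer As String) As String
    Dim pattern As String
    Dim v As Variant

    pattern = UCase$(Trim$(question))
    For Each v In Array("?", "!", ".", ",")     ' assumed: no punctuation in patterns
        pattern = Replace(pattern, v, "")
    Next v

    MakeCategory = "<category>" & vbCrLf & _
                   "<pattern>" & XmlEscape(pattern) & "</pattern>" & vbCrLf & _
                   "<template>" & vbCrLf & XmlEscape(Trim$(answer)) & vbCrLf & _
                   "</template>" & vbCrLf & "</category>" & vbCrLf
End Function

Private Sub WriteAiml(ByVal fileName As String, questions() As String, answers() As String)
    Dim f As Integer, i As Long

    f = FreeFile
    Open fileName For Output As #f
    Print #f, "<aiml>"
    For i = LBound(questions) To UBound(questions)
        ' the trailing ; stops Print adding a second line break
        Print #f, MakeCategory(questions(i), answers(i));
    Next i
    Print #f, "</aiml>"
    Close #f
End Sub
The escaping matters because an answer containing & or < would otherwise leave the file as invalid XML. Both arrays are expected to have the same bounds.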
I can even compensate just a little for a complete solution or maybe you can continue to coach me.
Last edited by tyademosu; Jun 28th, 2006 at 10:52 AM.
Any basic FAQ webpage entered by the user should do, i.e. question followed by answer, and so on. Initially I wanted to use all webpages but I think that is more of a research project and a little beyond basic AIML capabilities right now. The format of the AIML output file really should be ok in the format I posted above - for now anyways.
I'll see if I can make the aiml out format easier i.e just providing the answers and letting the chat bot create the responses (teaching the chat bot) -- this might not be possible yet.
if all you want is the text.. u can use the HTML OBject...
add a reference to the Microsoft HTML object library
VB Code:
Dim HTML As HTMLDocument
Dim hText As String
Private Sub Form_Load()
Dim tHTML As New HTMLDocument
Set HTML = tHTML.createDocumentFromUrl("http://www.yahoo.com", vbNullString)
Do While HTML.readyState <> "complete"
DoEvents
Debug.Print HTML.readyState 'just to see it working
Loop
hText = HTML.documentElement.innerText
Debug.Print hText 'there is all the text.. NO HTML Tags
End Sub
that will ensure you get ONLY the text from the page..
then u can split it out into sentences...
here is a good idea.. if u will have MS word available to you.. USE IT!
Create a new doc thru vb.. dump the hText into it then use code from word to split it up. Let Word do all the work for you!
VB Code:
For x = 1 To ThisDocument.Sentences.Count
Debug.Print ThisDocument.Sentences(x).Text
Next
i tested this with odd sentences.. like ones with #'s 27,199.00.
and it works perfectly
JPnyc rocks!! (Just ask him!)
If u have your answer please go to the thread tools and click "Mark Thread Resolved"
Using the WebBrowser control's innerText might be the fastest way to start. --
Msn messenger tyademosu(aht)hotmail.com
Which ever method you get the text only .. each page is different .. for example the google link you posted, i was able to get them, but i also got the question links above them .. with some modifications i can get exactly what i need, but then if you take the page you posted below .. thats different also .. though that one is much easier as all it has is questions and answers .. we simply split the ? marks ..
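For a page that really is just questions and answers, the "split on the ? marks" idea could be sketched as below. This assumes the text has already been broken into lines (for example with Split(hText, vbCrLf)), that every question sits on its own line ending in "?", and that whatever follows up to the next question is the answer. PairQA is a made-up name, and real pages will need tweaking.
VB Code:
Private Sub PairQA(lines() As String, questions As Collection, answers As Collection)
    Dim i As Long
    Dim s As String, q As String, a As String

    For i = LBound(lines) To UBound(lines)
        s = Trim$(lines(i))
        If Right$(s, 1) = "?" Then
            If Len(q) > 0 Then          ' store the pair we just finished
                questions.Add q
                answers.Add Trim$(a)
            End If
            q = s
            a = ""
        ElseIf Len(q) > 0 And Len(s) > 0 Then
            a = a & " " & s             ' answers can run over several lines
        End If
    Next i

    If Len(q) > 0 Then                  ' don't forget the last pair
        questions.Add q
        answers.Add Trim$(a)
    End If
End Sub
Pass in two Collections you have already created with New; any text before the first question is ignored.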
heres a few more lines of code for that...
the final result is an array of sentences...
im sure you will need to play around a bit, remove blanks etc...
but it works....
VB Code:
Dim HTML As HTMLDocument
Dim hText As String
Private Sub Form_Load()
Dim tHTML As New HTMLDocument
Set HTML = tHTML.createDocumentFromUrl("http://www.yahoo.com", vbNullString)
Do While HTML.readyState <> "complete"
DoEvents
Debug.Print HTML.readyState 'just to see it working
Loop
hText = HTML.documentElement.innerText
Dim tmp As String
'Start new word app
Dim wrd As New Word.Application
Dim Doc As Word.document
'New Document
Set Doc = wrd.Documents.Add
Dim dSentences() As String
'just so we can see it....
wrd.Visible = True
'type the "text" into the doc
wrd.selection.TypeText Text:=hText
'set each sentence to an element in the array dSentences
ReDim dSentences(Doc.Sentences.Count - 1)
For x = 1 To Doc.Sentences.Count
dSentences(x - 1) = Doc.Sentences(x).Text
Next
'loop through and print the array
For x = 0 To UBound(dSentences)
Debug.Print dSentences(x)
Next
Doc.Close False
wrd.Quit False
Set Doc = Nothing
Set wrd = Nothing
End Sub