
Thread: Please Help, Need A.L.I.C.E Web page Spider

  1. #1

    Thread Starter
    Member
    Join Date
    Jun 2006
    Posts
    41

    Please Help, Need A.L.I.C.E Web page Spider

    See the link below. I can even pay a little for a complete solution. I'm trying to do this in VB6.

    http://www.nabble.com/Help%2C-Need-A...-t1720827.html

    I need a spider created that will crawl a webpage, grammatically parse the page and create AIML (Artificial Intelligence Markup Language) data. This data will be saved into an AIML file and used to teach a chatterbot the contents of the web page.

    The way we see it working is:

    1. The spider crawls a page examining the text of each sentence.

    2. Then, using a grammatical parser, it will reformulate that sentence data into possible patterns and responses to be entered as data in the AIML file.

    3. Then it will format this into a standard AIML file and allow you to save this code or copy and paste it to another source.

    This will require someone experienced in AIML as well as grammatical sentence parsing.

    The idea for this project is to get a stand-alone script, possibly a desktop VB script. But I hear that existing Perl and PHP extensions may make this easier. I'm open to other suggestions, but would prefer a desktop application to start.
    Last edited by tyademosu; Jun 16th, 2006 at 05:51 PM.

  2. #2
    PowerPoster
    Join Date
    May 2006
    Location
    Location, location!
    Posts
    2,673

    Re: Please Help, Need A.L.I.C.E Web page Spider

    I believe this is a forum for PROGRAMMERS to discuss PROGRAMMING, not for people to find someone to write their programs for them
    Well, everyone else has been doing it :-)
    Loading a file into memory QUICKLY - Using SendKeys - HyperLabel - A highly customisable label replacement - Using resource files/DLLs with VB - Adding GZip to your projects
    Expect more to come in future
    If I have helped you, RATE ME! :-)

    I love helping noobs with their VB problems (probably because, as an amateur programmer, I am only slightly better at VB than them :-)) but if you SERIOUSLY want to get help for free from a community such as VBForums, you have to first have a grounding (basic knowledge) in VB6, otherwise you're way too much work to help...You've got to give a little if you want to get help from us, in other words!

    And we DON'T do your homework. If your tutor doesn't teach you enough to help you make the project without his or her help, FIND A BETTER TUTOR or try reading books on programming! We are happy to help with minor things regarding the project, but you have to understand the rest of it if you want our help to be useful.

  3. #3
    Former Admin/Moderator MartinLiss's Avatar
    Join Date
    Sep 1999
    Location
    San Jose, CA
    Posts
    33,431

    Re: Please Help, Need A.L.I.C.E Web page Spider

    smUX's post is a little harsh and he shouldn't have responded like that, but basically he is right in that the Classic Visual Basic forum is for questions concerning problems people are having while developing their applications. Sometimes (rarely actually) someone will write a whole app for someone but I strongly doubt that that will happen with something as complex as you desire. I could however move the thread to the Open Positions forum and while that forum is designed for jobs, someone might be willing to earn a few bucks helping you with this. Let me know what you want me to do.

  4. #4

    Thread Starter
    Member
    Join Date
    Jun 2006
    Posts
    41

    Re: Please Help, Need A.L.I.C.E Web page Spider

    Could it be in both? I'm already trying to program it myself so I also need hints.

  5. #5
    PowerPoster
    Join Date
    May 2006
    Location
    Location, location!
    Posts
    2,673

    Re: Please Help, Need A.L.I.C.E Web page Spider

    Yeah, just call me Mr Coarse :-)

    Ty, I'd suggest that if you're planning to code it yourself you should come to us when you have PROBLEMS with the code rather than asking us to write it for you (whether remuneration is provided or not)

    I've been known to write apps for people for free if they're in the same genre as the sort of stuff I find fun to code...this idea is close but I know nothing about AIML. I could easily write the parsing shizzle and get the outputted stuff into an array for someone else to write the rest of the code for :-)

  6. #6

    Thread Starter
    Member
    Join Date
    Jun 2006
    Posts
    41

    Re: Please Help, Need A.L.I.C.E Web page Spider

    Man that would be sooo great smux

  7. #7
    PowerPoster
    Join Date
    May 2006
    Location
    Location, location!
    Posts
    2,673

    Re: Please Help, Need A.L.I.C.E Web page Spider

    I said *could* not would :-P

    If you need help with any coding problems, ask in the forum...if I know the answer (and am on) I will help :-)

  8. #8

  9. #9
    PowerPoster
    Join Date
    May 2006
    Posts
    2,988

    Re: Please Help, Need A.L.I.C.E Web page Spider

    Just 1 web page?
    And what type of page is it?

    What exactly is in the page?

    Anyway basically you just grab the page text into a string and work with that in VB .. lots of examples in the networking and ASP section ..

  10. #10

    Thread Starter
    Member
    Join Date
    Jun 2006
    Posts
    41

    Re: Please Help, Need A.L.I.C.E Web page Spider

    Ok, right now I'm just starting and trying to open the URL. I expect the parsing and converting to AIML will be the main issues. I'll post more as problems arise since my VB is rusty. This is for a work project.

  11. #11

    Thread Starter
    Member
    Join Date
    Jun 2006
    Posts
    41

    Re: Please Help, Need A.L.I.C.E Web page Spider

    I'm pretty sure I have the URL open, but what's the easiest way to parse everything on the web page? Thanks.

  12. #12
    Fanatic Member
    Join Date
    Aug 2005
    Location
    South Africa
    Posts
    760

    Re: Please Help, Need A.L.I.C.E Web page Spider

    What are you using to open the webpage?
    If I helped you out, please consider adding to my reputation!

    -- "The faulty interface lies between the chair and the keyboard" --

    VB6 Programs By Me:
    ** Dictionary, Thesaurus & Rhyme-Generator In One ** WMP Recent Files List Editor ** Pretty Impressive Clock ** Extract Firefox History **

  13. #13

    Thread Starter
    Member
    Join Date
    Jun 2006
    Posts
    41

    Re: Please Help, Need A.L.I.C.E Web page Spider

    the inet control

  14. #14
    PowerPoster
    Join Date
    May 2006
    Location
    Location, location!
    Posts
    2,673

    Re: Please Help, Need A.L.I.C.E Web page Spider

    if you're using inet then you would use "HTML = inet1.openurl..." when you use inet to open a URL, and the variable HTML would contain (you can guess :-)) the HTML from that URL.

    Parsing an HTML file for links is very simple. You can either go the split() way followed by instr() on each element of the array you generate, or you can just use instr. If you just want the URLs, the split() way would be easiest. You simply do a linkarray = split(lcase(HTML),"<a href") (using lcase ensures that it matches whether the source code has lower or upper case, though it also lower-cases the URLs themselves :-)) which splits the HTML into array elements using "<a href" as the split point...meaning that each element after the first now holds one URL. You then would need to find out which element holds the first URL you want to parse, that's up to you to work out though :-)

    Once you know where your first URL is, you would parse it out by doing an a=instr(linkarray(x),chr(34))+1 (chr(34) is a " which is usually used to encapsulate a URL in HTML source :-)) and then you would do the same again, but before "linkarray(x)" you would have the value a so it looks from there...like this: b=instr(a,linkarray(x),chr(34))...This now gives you the start and end points for the URL, so all you now need to do is a mid() to get the URL out :-)

    Note that if you're using split() you NEED to dim the variable thus: "DIM linkarray() as string"...the () tells the program that it will be redimmed later on with the right value (which is what split does :-))
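
    A minimal sketch of the Split/InStr/Mid steps described in this post, assuming every href value is wrapped in double quotes. ExtractLinks is a hypothetical helper name, and passing vbTextCompare to Split is a swapped-in alternative to lcase() so the URLs keep their original case.

```vb
' Sketch only: ExtractLinks is a hypothetical helper, not code from the thread.
' Assumes each link looks like <a href="..."> with double-quoted URLs.
Public Function ExtractLinks(ByVal sHTML As String) As Collection
    Dim linkarray() As String
    Dim colLinks As New Collection
    Dim x As Long, a As Long, b As Long

    ' vbTextCompare matches <A HREF and <a href without lower-casing the page
    linkarray = Split(sHTML, "<a href", , vbTextCompare)

    ' Element 0 holds whatever came before the first link, so start at 1
    For x = 1 To UBound(linkarray)
        a = InStr(linkarray(x), Chr$(34))               ' opening quote
        If a > 0 Then
            b = InStr(a + 1, linkarray(x), Chr$(34))    ' closing quote
            If b > a Then
                colLinks.Add Mid$(linkarray(x), a + 1, b - a - 1)
            End If
        End If
    Next x

    Set ExtractLinks = colLinks
End Function
```

    Calling ExtractLinks(HTML) on the string returned by the inet control should give a Collection you can walk with For Each. Links written with single quotes or no quotes at all are not handled by this sketch.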

  15. #15
    Fanatic Member
    Join Date
    Aug 2005
    Location
    South Africa
    Posts
    760

    Re: Please Help, Need A.L.I.C.E Web page Spider

    The only disadvantage of using Inet's OpenURL is that it isn't asynchronous: it pauses processing in your program until it has the HTML, so your app can't do anything else while the page downloads.

  16. #16
    PowerPoster
    Join Date
    May 2006
    Posts
    2,988

    Re: Please Help, Need A.L.I.C.E Web page Spider

    this grabs all URLs .. i still prefer winsock, but this is a quick way to do it .. use MSXML 4.0 if your server supports it so you can set timeouts ..

    VB Code:
    '// REFERENCE Microsoft XML, Version 2.0
    '// GRAB URLS //

    Private Sub Command1_Click()
        Dim sText As String
        Dim sArray() As String
        Dim i As Long
        sText = SendRequest("http://www.thatlinkofmine.com")
        If Len(sText) Then
            sArray = Split(sText, "<a href", , vbTextCompare)
            ' Element 0 is the text before the first link, so start at 1
            For i = 1 To UBound(sArray)
                Debug.Print Replace(StripText(sArray(i), "=", ">"), """", "")
            Next i
        Else
            Debug.Print "Nothing to display"
        End If
    End Sub

    '// GET TEXT FROM WEB PAGE
    Private Function SendRequest(ByVal strUrl As String) _
        As String
        On Error Resume Next
        Dim objHTTP As New MSXML.XMLHTTPRequest
        objHTTP.Open "GET", strUrl, False   ' False = synchronous request
        objHTTP.setRequestHeader "Content-Type", "text/html"
        objHTTP.send
        ' Check after send so that failed requests are reported too
        If Err.Number = 0 Then
            SendRequest = objHTTP.responseText
        Else
            MsgBox "Error " & Err.Number & _
            vbNewLine & Err.Description
        End If
    End Function

    '// STRIP TEXT FROM STRING
    Public Function StripText(ByVal source As String, _
        ByVal start As String, ByVal finish As String) As String
        ' Long rather than Integer: a page can be longer than 32,767 characters
        Dim iPos As Long, iPoe As Long
        iPos = InStr(1, source, start, vbTextCompare)
        If iPos = 0 Then Exit Function      ' start marker missing: return ""
        iPos = iPos + Len(start)
        iPoe = InStr(iPos, source, finish, vbTextCompare)
        If iPoe = 0 Then Exit Function      ' end marker missing: return ""
        StripText = Trim$(Mid$(source, iPos, iPoe - iPos))
    End Function
    Last edited by rory; Jun 17th, 2006 at 08:51 AM.

  17. #17

    Thread Starter
    Member
    Join Date
    Jun 2006
    Posts
    41

    Re: Please Help, Need A.L.I.C.E Web page Spider

    Thanks guys. I'm actually trying to parse every word on the page so I guess I'll have to modify your suggestions slightly.

  18. #18
    PowerPoster
    Join Date
    May 2006
    Location
    Location, location!
    Posts
    2,673

    Re: Please Help, Need A.L.I.C.E Web page Spider

    Every single word? That's probably even easier. First you would use a split() to split the whole thing by spaces then you'd have all the words in an array. you might want to first filter out the HTML of the file though...splitting at each < and deleting everything at/before the > afterwards should be enough, although a bit much in this case I think. Someone else might have a better idea :-)

  19. #19
    Addicted Member Dayjo's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    130

    Re: Please Help, Need A.L.I.C.E Web page Spider

    Well if you're looking to capture sentences, what you'd do is use split() to put all of the sentences into an array first; you'd need to look for all sentence-ending punctuation (! . ?). Then you can split each sentence up by the spaces, and look at each individual word.

  20. #20
    PowerPoster
    Join Date
    May 2006
    Posts
    2,988

    Re: Please Help, Need A.L.I.C.E Web page Spider

    this cleans all the HTML tags and just shows the text with spaces ..

    VB Code:
    Private Sub Command1_Click()

        Dim strHTML As String
        Dim key1 As Long
        Dim key2 As Long

        strHTML = "<meta name=""keywords"" content=""test"">This Text<font size=""2"">More Text</font><img src=""test.jpg"">Lots of Text"

        ' Cut each tag out by position. Stopping as soon as no complete
        ' <...> pair is left avoids looping forever on a stray ">"
        ' or on a "<" that is never closed.
        key1 = InStr(1, strHTML, "<")
        Do While key1 > 0
            key2 = InStr(key1, strHTML, ">")
            If key2 = 0 Then Exit Do
            strHTML = Left$(strHTML, key1 - 1) & " " & Mid$(strHTML, key2 + 1)
            key1 = InStr(key1, strHTML, "<")
        Loop

        Debug.Print strHTML

    End Sub

  21. #21

    Thread Starter
    Member
    Join Date
    Jun 2006
    Posts
    41

    Re: Please Help, Need A.L.I.C.E Web page Spider

    Actually, my bad, I'd better just look for whole sentences for my array right now. Dayjo, would I have to do a split() for each punctuation mark (! . ?) or is there a way to enter them all at once with one split()? Also, would it still have the HTML junk in it, like mine does right now?

  22. #22

    Thread Starter
    Member
    Join Date
    Jun 2006
    Posts
    41

    Re: Please Help, Need A.L.I.C.E Web page Spider

    Rory, I'm getting an infinite loop with your sample. I'm not sure why.

  23. #23
    Former Admin/Moderator MartinLiss's Avatar
    Join Date
    Sep 1999
    Location
    San Jose, CA
    Posts
    33,431

    Re: Please Help, Need A.L.I.C.E Web page Spider

    Quote Originally Posted by tyademosu
    Actually my bad, I'd better just look for whole sentences for my array right now. Dayjo, would I have to do a split() for each punctuation like (!,., ?) or is there a way to enter them all at once with one split(). Also would it still have the html junk in it, like mine does right now ?
    You might also have to worry about sentences like "The car costs 27,999.99."
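
    One way to cope with numbers like that, as a rough sketch: only treat . ! ? as the end of a sentence when it is followed by whitespace or the end of the text. SplitSentences is a hypothetical helper name, not something from this thread, and it still breaks on abbreviations such as "e.g. " that are followed by a space.

```vb
' Sketch only: SplitSentences is a hypothetical helper, not code from the thread.
' A ".", "!" or "?" ends a sentence only when whitespace or the end of the
' text follows it, so the "." inside 27,999.99 is left alone.
Public Function SplitSentences(ByVal sText As String) As Collection
    Dim colOut As New Collection
    Dim i As Long, lStart As Long
    Dim sChar As String, sNext As String, sPiece As String

    lStart = 1
    For i = 1 To Len(sText)
        sChar = Mid$(sText, i, 1)
        If InStr(".!?", sChar) > 0 Then
            ' Peek at the following character ("" at the end of the text)
            sNext = Mid$(sText, i + 1, 1)
            If sNext = "" Or sNext = " " Or sNext = vbCr _
                Or sNext = vbLf Or sNext = vbTab Then
                sPiece = Trim$(Mid$(sText, lStart, i - lStart + 1))
                If Len(sPiece) > 0 Then colOut.Add sPiece
                lStart = i + 1
            End If
        End If
    Next i

    ' Keep any trailing text that has no closing punctuation
    sPiece = Trim$(Mid$(sText, lStart))
    If Len(sPiece) > 0 Then colOut.Add sPiece

    Set SplitSentences = colOut
End Function
```

    Because VB6's Split() takes a single delimiter, walking the string character by character like this is one alternative to calling Split() once per punctuation mark.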

  24. #24
    PowerPoster
    Join Date
    May 2006
    Posts
    2,988

    Re: Please Help, Need A.L.I.C.E Web page Spider

    Quote Originally Posted by tyademosu
    Rory, I'm getting an infinite loop with your sample I'm not sure why ?
    Yeah I did with yahoo too, though it works with all my sites .. oh well .. hey I'll throw something together a little later tonight .. in the middle of work right now ..

  25. #25
    Frenzied Member litlewiki's Avatar
    Join Date
    Dec 2005
    Location
    Zeta Reticuli Distro:Ubuntu Fiesty
    Posts
    1,162

    Re: Please Help, Need A.L.I.C.E Web page Spider

    I am using this API (thanks to iprank) to download an HTML file directly to disk without the WebBrowser control, the advantage being that you can skip loading any images in the HTML file...

    VB Code:
    Private Declare Function URLDownloadToFile Lib "urlmon" Alias "URLDownloadToFileA" (ByVal pCaller As Long, ByVal szURL As String, ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long

    Public Function DownloadFile(URL As String, LocalFilename As String) As Boolean
        Dim lngRetVal As Long
        lngRetVal = URLDownloadToFile(0, URL, LocalFilename, 0, 0)
        If lngRetVal = 0 Then DownloadFile = True
    End Function

    Private Sub Form_Load()

        DownloadFile "http://www.somesite.com", "c:\sample.txt"

    End Sub

    hope it helps!!

  26. #26
    PowerPoster
    Join Date
    May 2006
    Posts
    2,988

    Re: Please Help, Need A.L.I.C.E Web page Spider

    Ok this works and strips all HTML tags .. splits at each < then replaces everything up to the next > which closes the HTML tag .. of course you also get a lot of javascript stuff that is not replaced .. but it gives you an idea .. I don't know what your web page contains so I will need some more info or an example ..

    VB Code:
    '// REFERENCE Microsoft XML, Version 2.0
    '// STRIP HTML TAGS //

    Private Sub Command1_Click()

        '// DECLARATIONS
        Dim sText As String
        Dim sArray() As String
        Dim iPor As String
        Dim iPoe As Long                                            ' LONG: PAGES CAN EXCEED 32,767 CHARACTERS
        Dim i As Long

        sText = SendRequest("http://www.yahoo.com")                 ' URL TO GRAB
        If Len(sText) Then                                          ' THERE IS TEXT
            sArray = Split(sText, "<")                              ' SPLIT BY TAG START
            For i = 0 To UBound(sArray)                             ' LOOP THROUGH
                iPoe = InStr(sArray(i), ">")                        ' GET REPLACE LENGTH
                If iPoe Then
                    iPor = "<" & Mid$(sArray(i), 1, (iPoe - 1)) & ">"   ' OUR REPLACE STRING
                    sText = Trim$(Replace(sText, iPor, " "))        ' REPLACE IN TEXT
                End If
            Next i                                                  ' NEXT TAG START
            Debug.Print sText                                       ' DISPLAY FINAL TEXT
        Else
            Debug.Print "Nothing to display"
        End If

    End Sub

    '// GET TEXT FROM WEB PAGE
    Private Function SendRequest(ByVal strUrl As String) _
        As String
        On Error Resume Next
        Dim objHTTP As New MSXML.XMLHTTPRequest                     ' CREATE OBJECT
        objHTTP.Open "GET", strUrl, False                           ' START REQUEST (SYNCHRONOUS)
        objHTTP.setRequestHeader "Content-Type", "text/html"
        objHTTP.send                                                ' SEND REQUEST
        If Err.Number = 0 Then                                      ' NO ERRORS, INCLUDING FROM SEND
            SendRequest = objHTTP.responseText                      ' GET TEXT
        Else
            MsgBox "Error " & Err.Number & _
            vbNewLine & Err.Description
        End If
    End Function
    Last edited by rory; Jun 17th, 2006 at 11:02 PM.

  27. #27
    Frenzied Member litlewiki's Avatar
    Join Date
    Dec 2005
    Location
    Zeta Reticuli Distro:Ubuntu Fiesty
    Posts
    1,162

    Re: Please Help, Need A.L.I.C.E Web page Spider

    the only issue being that it saves in unix format so i use a function something like this one

    http://www.a1vbcode.com/snippet-2938.asp

    to convert it to a dos compatible one..

  28. #28

    Thread Starter
    Member
    Join Date
    Jun 2006
    Posts
    41

    Re: Please Help, Need A.L.I.C.E Web page Spider

    Rory thanks for your input. Attached is the output for the main page I'm trying to parse at http://www.cookbookwiki.com/rice with the code you suggested.

  29. #29
    PowerPoster
    Join Date
    May 2006
    Posts
    2,988

    Re: Please Help, Need A.L.I.C.E Web page Spider

    We went a little further in another thread after yours .. see this thread ..

    http://www.vbforums.com/showthread.php?t=412360

    and the code in my post #13
    http://www.vbforums.com/showpost.php...5&postcount=13

    Maybe strips a little too much but will give you something more to work with ..

    For Example .. you can comment out the ones you dont want, in the following section called CUSTOM .. you may also want to comment out the RemoveLines part, as that gets rid of .. lines... which it seems you may want ..? If you keep the lines .. then you can do an array and split the Lines ..

    VB Code:
    i = RemoveLines(i)
    i = RemoveTags(i, "<style", "</style>")
    i = RemoveTags(i, "<script", "</script>")
    i = RemoveTags(i, "<!--", "-->")
    i = RemoveTags(i, "<", ">")

    '// START CUSTOM
    i = RemoveTags(i, "&#", ";")  ' SPECIAL SYMBOLS
    i = RemoveChars(i, "&nbsp#&amp;#&quot#&gt;#&lt;#[#]#""#;#:#.#,#'#/#$#%#?#!#|#(#)#=#-#+#&#*#©#®")
    i = RemoveDigits(i, "0 1 2 3 4 5 6 7 8 9")
    i = RemoveCommon(i, "a b c d e f g h i j k l m n o p q r s t u v w x y z")
    i = RemoveCommon(i, "at and com is or of to that this then the was what with where who when")
    '// END CUSTOM

    i = RemoveMultiple(i, "  ")   ' GET RID OF MULTIPLE SPACES
    i = StrConv(i, vbProperCase)  ' UPPER CASE FIRST LETTER

  30. #30
    PowerPoster
    Join Date
    May 2006
    Posts
    2,988

    Re: Please Help, Need A.L.I.C.E Web page Spider

    Here is a modified version i just tested out ... seems to work pretty well ..
    Adds line numbers to the text box also ..

  31. #31

    Thread Starter
    Member
    Join Date
    Jun 2006
    Posts
    41

    Re: Please Help, Need A.L.I.C.E Web page Spider

    Thanks so much again, Rory.


    My ultimate goal for now is to read text from a FAQ web page and store the Q/A pairs somehow. They will then all be written to an AIML output file in the format below

    <aiml>
    <category>
    <pattern>WHAT ARE YOU</pattern>
    <template>
    I am the latest result in artificial intelligence,
    which can reproduce the capabilities of the human brain
    with greater speed and accuracy.
    </template>
    </category>
    .
    .
    .
    </aiml>
    see -- http://www.alicebot.org/aiml.html

    I can even compensate just a little for a complete solution or maybe you can continue to coach me.
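
    A rough sketch of writing Q/A pairs out in the shape of the sample above, assuming the questions and answers have already been collected into two parallel string arrays. WriteAiml, NormalizePattern and XmlEscape are hypothetical helper names, and upper-casing the pattern and dropping its punctuation is an assumption based on how the sample pattern looks.

```vb
' Sketch only: these helpers and the parallel-array layout are assumptions
' for illustration, not part of the thread or of any AIML library.
Public Sub WriteAiml(ByVal sPath As String, sQuestions() As String, sAnswers() As String)
    Dim iFile As Integer
    Dim i As Long

    iFile = FreeFile
    Open sPath For Output As #iFile
    Print #iFile, "<aiml>"
    For i = LBound(sQuestions) To UBound(sQuestions)
        Print #iFile, "<category>"
        Print #iFile, "<pattern>" & NormalizePattern(sQuestions(i)) & "</pattern>"
        Print #iFile, "<template>"
        Print #iFile, XmlEscape(sAnswers(i))
        Print #iFile, "</template>"
        Print #iFile, "</category>"
    Next i
    Print #iFile, "</aiml>"
    Close #iFile
End Sub

' Upper-case the question and keep only letters, digits and spaces
Private Function NormalizePattern(ByVal s As String) As String
    Dim i As Long, sChar As String, sOut As String
    For i = 1 To Len(s)
        sChar = Mid$(s, i, 1)
        If sChar Like "[A-Za-z0-9 ]" Then sOut = sOut & sChar
    Next i
    NormalizePattern = UCase$(Trim$(sOut))
End Function

' Escape the characters that would otherwise break the XML
Private Function XmlEscape(ByVal s As String) As String
    s = Replace(s, "&", "&amp;")    ' must come first
    s = Replace(s, "<", "&lt;")
    s = Replace(s, ">", "&gt;")
    XmlEscape = s
End Function
```

    With this, a question such as "What are you?" would be written as the pattern WHAT ARE YOU, and the answer text goes into the template unchanged apart from the escaping.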
    Last edited by tyademosu; Jun 28th, 2006 at 10:52 AM.

  32. #32
    PowerPoster
    Join Date
    May 2006
    Posts
    2,988

    Re: Please Help, Need A.L.I.C.E Web page Spider

    I'd have to read up on this AI stuff .. the files I mean .. but basically, do you know what the FAQ is as far as the HTML goes ..?

    It should be easy once you know what you are working with .. in other words, what does the Faq look like.. the questions and answers ..?

  33. #33

    Thread Starter
    Member
    Join Date
    Jun 2006
    Posts
    41

    Re: Please Help, Need A.L.I.C.E Web page Spider

    Any basic FAQ web page entered by the user should do, i.e. question followed by answer, etc. Initially I wanted to use all web pages but I think that is more of a research project and a little beyond basic AIML capabilities right now. The format of the AIML output file really should be OK in the format I posted above, for now anyway.

  34. #34
    PowerPoster
    Join Date
    May 2006
    Posts
    2,988

    Re: Please Help, Need A.L.I.C.E Web page Spider

    They're going to differ depending on the web site developer's html design ..

    do you have any site link samples in mind ..?

  35. #35

    Thread Starter
    Member
    Join Date
    Jun 2006
    Posts
    41

    Re: Please Help, Need A.L.I.C.E Web page Spider

    True. Well, a very elementary question/answer structure for now maybe

    http://www.vbforums.com/faq.php?faq=...b_why_register

    and

    http://gmail.google.com/mail/help/about.html -- or simpler

    using the WebBrowser control....innerText might be the fastest way to start. --

    Msn messenger tyademosu(aht)hotmail.com

  36. #36

    Thread Starter
    Member
    Join Date
    Jun 2006
    Posts
    41

    Re: Please Help, Need A.L.I.C.E Web page Spider

    Here's a great FAQ page.

    http://www.talkorigins.org/origins/faqs-qa.html

    I'll see if I can make the AIML output format easier, i.e. just providing the answers and letting the chat bot create the responses (teaching the chat bot) -- this might not be possible yet.

  37. #37
    PowerPoster Static's Avatar
    Join Date
    Oct 2000
    Location
    Rochester, NY
    Posts
    9,390

    Re: Please Help, Need A.L.I.C.E Web page Spider

    if all you want is the text.. u can use the HTML OBject...

    add a reference to the Microsoft HTML object library
    VB Code:
    Dim HTML As HTMLDocument
    Dim hText As String

    Private Sub Form_Load()
        Dim tHTML As New HTMLDocument
        Set HTML = tHTML.createDocumentFromUrl("http://www.yahoo.com", vbNullString)
        Do While HTML.readyState <> "complete"
            DoEvents
            Debug.Print HTML.readyState 'just to see it working
        Loop
        hText = HTML.documentElement.innerText

        Debug.Print hText 'there is all the text.. NO HTML Tags
    End Sub
    that will ensure you get ONLY the text from the page..
    then u can split it out into sentences...

    here is a good idea.. if u will have MS word available to you.. USE IT!
    Create a new doc thru vb.. dump the hText into it then use code from word to split it up. Let Word do all the work for you!
    VB Code:
    For x = 1 To ThisDocument.Sentences.Count
        Debug.Print ThisDocument.Sentences(x).Text
    Next

    i tested this with odd sentences.. like ones with #'s 27,199.00.
    and it works perfectly
    JPnyc rocks!! (Just ask him!)
    If u have your answer please go to the thread tools and click "Mark Thread Resolved"

  38. #38
    PowerPoster
    Join Date
    May 2006
    Posts
    2,988

    Re: Please Help, Need A.L.I.C.E Web page Spider

    Quote Originally Posted by tyademosu
    True. Well, a very elementary question/answer structure for now maybe

    http://www.vbforums.com/faq.php?faq=...b_why_register

    and

    http://gmail.google.com/mail/help/about.html -- or simpler

    using the WebBrowser control....innerText might be the fastest way to start. --

    Msn messenger tyademosu(aht)hotmail.com
    Whichever method you use to get just the text .. each page is different .. for example, with the google link you posted I was able to get the questions and answers, but I also got the question links above them .. with some modifications I can get exactly what I need, but then the page you posted below is different again .. though that one is much easier, as all it has is questions and answers .. we simply split at the ? marks ..
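
    Splitting at the ? marks could look something like the sketch below. It assumes the plain text of the page puts each question on its own line ending in "?" with the answer on the lines underneath, which will not hold for every FAQ page. ExtractQA is a hypothetical helper name.

```vb
' Sketch only: ExtractQA is a hypothetical helper, not code from the thread.
' Assumes each question sits on its own line ending in "?" and that the
' lines below it, up to the next question, make up the answer.
' Returns the number of pairs found; the two arrays are left untouched
' when that number is 0.
Public Function ExtractQA(ByVal sText As String, sQuestions() As String, _
    sAnswers() As String) As Long
    Dim sLines() As String
    Dim sLine As String
    Dim i As Long, n As Long

    n = -1
    ' Normalise line endings, then walk the text line by line
    sLines = Split(Replace(sText, vbCrLf, vbLf), vbLf)
    For i = 0 To UBound(sLines)
        sLine = Trim$(sLines(i))
        If Right$(sLine, 1) = "?" Then
            ' A new question starts a new pair
            n = n + 1
            ReDim Preserve sQuestions(n)
            ReDim Preserve sAnswers(n)
            sQuestions(n) = sLine
        ElseIf n >= 0 And Len(sLine) > 0 Then
            ' Anything else belongs to the answer of the latest question
            sAnswers(n) = Trim$(sAnswers(n) & " " & sLine)
        End If
    Next i

    ExtractQA = n + 1
End Function
```

    Declare the two arrays as dynamic (Dim sQuestions() As String, Dim sAnswers() As String) before calling it, and check the return value before using UBound on them. Text above the first question, such as a list of question links, is simply skipped unless those lines end in "?" themselves.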

  39. #39

    Thread Starter
    Member
    Join Date
    Jun 2006
    Posts
    41

    Re: Please Help, Need A.L.I.C.E Web page Spider

    Static, I like your Word suggestion. Can you please elaborate: how do I create a new doc and dump the text into it?

  40. #40
    PowerPoster Static's Avatar
    Join Date
    Oct 2000
    Location
    Rochester, NY
    Posts
    9,390

    Re: Please Help, Need A.L.I.C.E Web page Spider

    heres a few more lines of code for that...
    the final result is an array of sentences...
    im sure you will need to play around a bit, remove blanks etc...
    but it works....
    VB Code:
    ' Needs references to the Microsoft HTML Object Library
    ' and the Microsoft Word Object Library
    Dim HTML As HTMLDocument
    Dim hText As String

    Private Sub Form_Load()
        Dim tHTML As New HTMLDocument
        Dim x As Long
        Set HTML = tHTML.createDocumentFromUrl("http://www.yahoo.com", vbNullString)
        Do While HTML.readyState <> "complete"
            DoEvents
            Debug.Print HTML.readyState 'just to see it working
        Loop
        hText = HTML.documentElement.innerText
        Dim tmp As String
        'Start new word app
        Dim wrd As New Word.Application
        Dim Doc As Word.Document
        'New Document
        Set Doc = wrd.Documents.Add
        Dim dSentences() As String
        'just so we can see it....
        wrd.Visible = True
        'type the "text" into the doc
        wrd.Selection.TypeText Text:=hText
        'set each sentence to an element in the array dSentences
        ReDim dSentences(Doc.Sentences.Count - 1)
        For x = 1 To Doc.Sentences.Count
            dSentences(x - 1) = Doc.Sentences(x).Text
        Next
        'loop through and print the array
        For x = 0 To UBound(dSentences)
            Debug.Print dSentences(x)
        Next

        'close without saving, then shut Word down
        Doc.Close False
        wrd.Quit False

        Set Doc = Nothing
        Set wrd = Nothing

    End Sub
