Thread: Do you spend your whole life on the internet?

  1. #1

    Thread Starter
    Lively Member
    Join Date
    Jan 2000
    Posts
    123
    OK, so I do surf a lot, but it always seems that there is never enough time to go to every site I want. So here's my plan, the only problem being I don't know where to start. I was thinking about making a program smart enough that, if I gave it a list of things I wanted, it could go out to the sites in my favorites folder and also use a search engine automatically. It would look at the content and links to find related subject matter. It would then extract ONLY the text from each site and save it as a text file, rather than downloading every site.

  2. #2
    transcendental analytic kedaman's Avatar
    Join Date
    Mar 2000
    Location
    0x002F2EA8
    Posts
    7,221

    If you're up for making something of this, like an application that searches for the pages/files you specify, your app needs to be very intelligent. I'm not calling it AI directly, so that I won't upset people here, but it really needs to know how to search and find the right links, and it has to think like you. Good luck.
    Use
    writing software in C++ is like driving rivets into steel beam with a toothpick.
    writing haskell makes your life easier:
    reverse (p (6*9)) where p x|x==0=""|True=chr (48+z): p y where (y,z)=divMod x 13
    To throw away OOP for low level languages is myopia, to keep OOP is hyperopia. To throw away OOP for a high level language is insight.

  3. #3
    Guest

    There's a lot of AI stuff around lately!!!

  4. #4
    transcendental analytic kedaman's Avatar
    Join Date
    Mar 2000
    Location
    0x002F2EA8
    Posts
    7,221

    Yeah, I like AI. It sounds professional.

  5. #5

    Thread Starter
    Lively Member
    Join Date
    Jan 2000
    Posts
    123
    OK, so I thought it out for a while. Here's a list of questions I need answered...

    1. Get the program to fetch a site and temporarily save it.
    2. Search through the downloaded document and find the <P> and </P> tags and find the needed text.
    3. Save that information and delete the temporary copy.
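
    As a rough illustration of steps 1 to 3, here is a minimal Python sketch using only the standard library. It is one possible approach, not a finished tool: the names `ParagraphText`, `extract_paragraphs`, and `save_page_text` are made up for this example, the page is assumed to decode as UTF-8, and the download is held in memory, so there is no temporary copy to delete.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class ParagraphText(HTMLParser):
    """Collect the text inside <p> elements; script and style content is ignored."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buf = None   # pieces of the open paragraph, or None when outside <p>
        self._skip = 0     # > 0 while inside <script> or <style>

    def _flush(self):
        if self._buf is not None:
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
        self._buf = None

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "p":
            self._flush()   # a new <p> ends an unclosed previous one
            self._buf = []

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buf is not None and not self._skip:
            self._buf.append(data)


def extract_paragraphs(html):
    """Return the paragraph texts found in an HTML string."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    parser._flush()         # keep a final paragraph that was never closed
    return parser.paragraphs


def save_page_text(url, out_path):
    """Download url, keep only its paragraph text, and write it to out_path."""
    with urlopen(url) as response:            # held in memory, no temporary file
        html = response.read().decode("utf-8", errors="replace")
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("\n\n".join(extract_paragraphs(html)))
```

    Calling `save_page_text(some_url, "page.txt")` would cover all three steps for one page. Text that sits outside any <P> element is dropped by this sketch.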

    Ok, so those were the simple things I do not know, now it gets tougher...

    4. Find and retrieve the sites in my favorites folder
    5. Open up a search site like altavista and perform a search.
    6. Use the URLs from the search results and/or links on a page to go to other pages and gather results.
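
    Steps 4 and 6 can be sketched in Python as well. This assumes the favorites folder holds Internet Explorer style `.url` shortcut files, which are small INI files with an `[InternetShortcut]` section and a `URL=` entry, and that they can be read as UTF-8. The function names `favorite_urls` and `page_links` are hypothetical. Step 5 is left out on purpose, since the query URL format of any given search site would be a guess.

```python
import configparser
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin


def favorite_urls(folder):
    """Return the URL stored in every .url shortcut under folder, sorted by path."""
    urls = []
    for shortcut in sorted(Path(folder).rglob("*.url")):
        # interpolation=None keeps "%" in URLs from being treated specially
        parser = configparser.ConfigParser(interpolation=None, strict=False)
        try:
            parser.read(shortcut, encoding="utf-8")
            urls.append(parser["InternetShortcut"]["URL"])
        except (configparser.Error, KeyError, UnicodeDecodeError):
            continue   # not a readable Internet Shortcut; skip it
    return urls


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def page_links(html, base_url):
    """Return the absolute URLs linked from an HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links
```

    `page_links` works the same way on a search results page as on any other page, so once a results page has been downloaded, step 6 needs nothing more than this.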

    And last but not least...

    7. Perform an analysis of the material and generate a report with the URL, the number of terms matched (i.e. a most-likely-match calculation), and generate a summary of the site by using and identifying headings (i.e. create an outline).
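
    One simple reading of step 7, sketched in Python: count how often each search term appears in the visible text, add the counts up as a crude match score, and list the <H1> to <H6> headings as the outline. The names `OutlineParser` and `page_report` are invented for this example, terms are assumed to be single words, and a plain sum is only a stand-in for a real "most likely match" calculation.

```python
import re
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect visible text plus (level, heading text) pairs for <h1>..<h6>."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.text = []
        self.outline = []
        self._heading = None   # (level, [pieces]) while inside a heading
        self._skip = 0         # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in self.HEADINGS:
            self._heading = (int(tag[1]), [])

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag in self.HEADINGS and self._heading is not None:
            level, pieces = self._heading
            title = " ".join("".join(pieces).split())
            if title:
                self.outline.append((level, title))
            self._heading = None

    def handle_data(self, data):
        if self._skip:
            return
        self.text.append(data)
        if self._heading is not None:
            self._heading[1].append(data)


def page_report(url, html, terms):
    """Return the URL, per-term match counts, a total score, and the outline."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z0-9']+", " ".join(parser.text).lower())
    counts = {term: words.count(term.lower()) for term in terms}
    return {
        "url": url,
        "matches": counts,
        "score": sum(counts.values()),
        "outline": parser.outline,
    }
```

    Sorting the reports for several pages by `score` gives a rough ranking. Pages that use bold text or images instead of heading tags will produce an empty outline.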

    If anyone has any ideas on how to perform these tasks, HELP IS VERY MUCH APPRECIATED!

    P.S. If my spelling is bad I already know this, English isn't my best subject.

    P.P.S. Source code for these tasks is preferred.

    P.P.P.S ANY HELP IS BETTER THAN NO HELP!!!!!

    [Edited by ravcam on 04-12-2000 at 06:40 PM]

  6. #6
    Hyperactive Member
    Join Date
    Mar 2000
    Posts
    461
    Oh I like that...

    You have this wonderful idea but you would "prefer" people give you the source code to do it all.

    So what are you actually going to do yourself?


    I'm not going to give it to you on a silver platter, but I will give you some considerations for your questions.

    1. When you say "get a site and temporarily save it", that means downloading ALL HTML files, PHTML, ASP (don't forget some sites use scripts) and DHTML, traversing ALL branches, processing ALL CGI links AND JavaScript links (you know, those that do "javascript:window(xxxxx)"), and eventually putting it all back together in your temporary space.

    2. Not all webpages use both <P> and </P>. You will find the majority of them start their paragraphs but end them by starting a new one.

    3. The information you will save will contain single line titles, text indicating other links, unrelated items, completely irrelevant sections and the "text" some people put in to attract you to the site via the search engines in the first place. Getting anything useful out of that mess would be near impossible.

    4. What if you want to select only SOME of the sites in your favourites folder... or if you have a new site that is NOT in your favourites folder?

    5. "perform a search". Well you have to give the search some keywords... how are you going to determine what those keywords are just from a list of your favourites?

    6. How many of us get completely off-topic results from our search engines? You want to automatically collect ALL of the results of sites found... You will be storing 3-4MB of completely useless data clogging up everything that you MIGHT want to see and spend more time filtering through this information than if you actually went to the search engine yourself and looked.

    7. You have just PERFECTLY described what the search engines were designed for!!!! Why repeat what they do? These companies put a lot of time, effort and intelligence into their search sites, and that is the best there is at the moment... to think you can whip up some code that will use them but be able to better generate a list of useful sites is far-fetched.
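
    The worries in points 1 and 6 about traversing ALL branches and piling up useless data can at least be bounded. The Python sketch below walks links breadth-first with a depth limit and a page limit, never visits the same page twice, and skips anything that is not an http or https URL, which covers "javascript:" pseudo-links. The name `crawl`, its parameters, and their default values are assumptions made for this example, and `get_links` stands in for whatever downloads a page and returns the URLs it links to. Relevance filtering is a separate problem that this does not solve.

```python
from collections import deque
from urllib.parse import urldefrag, urlparse


def crawl(start_urls, get_links, max_depth=2, max_pages=50):
    """Breadth-first walk over links, bounded by depth and page count.

    get_links(url) must return the absolute URLs linked from url.
    Returns the visited URLs in the order they were visited.
    """
    seen = set()
    order = []
    queue = deque((url, 0) for url in start_urls)
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        url = urldefrag(url).url              # drop "#fragment" so pages are not revisited
        if url in seen or urlparse(url).scheme not in ("http", "https"):
            continue                          # skips javascript:, mailto:, and repeats
        seen.add(url)
        order.append(url)
        if depth < max_depth:
            for link in get_links(url):
                queue.append((link, depth + 1))
    return order
```

    With `max_depth=0` only the starting pages themselves are visited, which is the cheapest way to try the idea out on a favorites list.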


    It's a nice idea but not a very feasible one, and after all... "Feasibility Studies" are the first thing you should do before EVER deciding to write a piece of code... But hang on... you're not wanting to write the code... you want other people to.

  7. #7

    Thread Starter
    Lively Member
    Join Date
    Jan 2000
    Posts
    123
    TO Gen-X:

    Ok, first....

    1. If I remember correctly, scripts are embedded. I don't think it will be much of a problem to simply ignore them; after all, I am only going for actual content, and I don't know too many sites that use scripting for written content.

    2. I realize not all pages use the </P> tag. It's a simple matter of only recognizing the <P> tag, which should at least be used if you plan on good web design practice.

    3. By examining the contents of web pages more closely than a search engine, I hope to reduce the number of unneeded results; after all, I am using the search engine as a reference point.

    4. It uses the favorites folder also as a reference point.

    5. I have seen some pretty amazing things done in HTML, including the web site for one of my classes. You can enter information in a text box on the class page and it will automatically enter that information into the text box on a totally separate webpage and perform the action of clicking whatever it is that function does on the other site. The only problem is that you have to know something about the other site and the way it works.

    6. By only temporarily storing results until a good match can be found, it eliminates extra unrelated links. I am hoping to develop a way to accurately scan the contents, unlike most search engines, which only scan the META tags.

    7. As stated above, search engines have their limitations. While I recognize these, I am also aware that my computer may not do any better.

    And last of all, making a program from source code is not always easy, especially when you have little experience like me... I am only looking for a reference point on which to base my application; after all, I can't program something I don't know! Don't get me wrong on that idea... I am not looking for someone to do the work for me... and if I was, I would have my friend do it; after all, he is a much better programmer than I am and enjoys it a lot more!

  8. #8
    Hyperactive Member
    Join Date
    Mar 2000
    Posts
    461
    Rav...

    I do commend you on your ideas, but I have to remind you that there are professionals out there with years more experience than either you or I, who are at the top of their fields and doing this work, and THEY cannot do what you are talking about....

    1. A lot of sites DO use scripting, and the scripts, while possibly being embedded, are sometimes only reached by submitting a form using the POST method. As you would only be attempting to parse the HREF attribute of an A tag, you would also have to consider parsing the ACTION attribute out of a FORM tag, pre-generating the form fields and replicating the POST method.

    2. What about text in tables? This would NEVER have a <P> tag anywhere in the document but would have the information you sought carefully laid out in a table. Good web design NEVER uses the <P> tag because all that does is put a double space between text. GOOD web design would be to use both <P> and </P>, but nobody ever uses it.

    3. Look at it "more closely" than a search engine!?!?!?! Oh good luck there!!! These guys spend all day and night with the most skilled people in the industry coming up with better ways of "more closely" scanning their data. If you can do it better than they can, even using the search engines as a reference point, then apply for a job with Infoseek or the like, because they will pay you $1,000,000 a year to be an employee of theirs.

    4. Accepted

    5. You have to know about ALL the sites you go to and how they work.. not just a single and specific site. Every site is coded by someone different doing something different. You could not even begin to work out a set of definitive "ways" of looking at sites.

    6. Very FEW search engines scan the META tags because very FEW web pages ever use META tags. Most of them do a complete search in EXACTLY the way you are talking about. As for links... most webpages have a "related links" page or a "completely off-topic links" page and how would you determine if they should be traversed as well?

    7. Search engines DO have limitations... but they are far fewer than anything we can come up with.
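
    To make point 1 concrete, here is a minimal Python sketch of pulling the ACTION, METHOD, and INPUT fields out of a FORM tag and building the body of a POST request from them. The names `FormCollector`, `find_forms`, and `post_body` are hypothetical, and only INPUT tags are handled; SELECT and TEXTAREA fields, and anything a script fills in, are ignored.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormCollector(HTMLParser):
    """Record each form's target URL, method, and named input defaults."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.forms = []
        self._current = None   # the form being read, or None outside <form>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": urljoin(self.base_url, attrs.get("action") or ""),
                "method": (attrs.get("method") or "get").lower(),
                "fields": {},
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None and attrs.get("name"):
            self._current["fields"][attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def find_forms(html, base_url):
    """Return a list of form descriptions found in an HTML string."""
    collector = FormCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.forms


def post_body(form, **overrides):
    """URL-encode the form's fields, with overrides replacing the defaults."""
    return urlencode({**form["fields"], **overrides}).encode("ascii")
```

    Passing the result to `urllib.request.urlopen(form["action"], data=post_body(form, q="some text"))` sends it as a POST request. This only replicates what the form itself would submit; knowing which values to put in the fields still depends on the individual site.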

    Rav... Have a look on the internet for an application called WebWolf (or something Wolf... they have ImageWolf as well). This application tries to do exactly what you are talking about, but on a single webpage that you pass it. Look at the limitations it has on going to unrelated sites and links etc. and perhaps you will see the actual scope of what you are trying to do.

    Just think about it like this. If you could so easily come up with a method of retrieving useful information from the internet in anything LESS than 2 years' work, you would have done better than ALL the search engine sites, Sherlock for the Macintosh and about every other information broker in the world.

    Do you think that is something that is possible, considering, as you said, you haven't been doing this long? I know it's beyond me and I have been doing this for an eternity.

  9. #9

    Thread Starter
    Lively Member
    Join Date
    Jan 2000
    Posts
    123
    If it is impossible then consider this topic closed!
