
Thread: Stripping HTML file of Unwanted Junk

  1. #1

    Thread Starter
    PowerPoster
    Join Date
    Jul 2001
    Location
    Tucson, AZ
    Posts
    2,166

    Stripping HTML file of Unwanted Junk

    I don't do a lot with HTML, but periodically I download a page or two. Most of the time, the page does not offer a printable version (that is, just the key text along with any needed graphics).

    Is there an easy way to strip out all the junk (advertising GIFs, unneeded frames linking to other web pages, etc.)?

    By easy I mean an existing freeware program that does this, so I don't have to edit each page manually. NOTE: Each downloaded page normally brings along the same unneeded graphics.

    If not, are there key tags that are commonly used, so I can write a VB program to screen for them?

    Thanks
    David
    Last edited by dw85745; Sep 18th, 2004 at 09:51 AM.

  2. #2
    Kitten CornedBee's Avatar
    Join Date
    Aug 2001
    Location
    In a microchip!
    Posts
    11,594
    If you remove all <img>, <object> and <embed> tags, you should get a largely clean display. But it might destroy the layout, too, especially if the page is authored using old-style tables with spacer gifs.
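
    A rough sketch of that idea, written in Python rather than VB purely for illustration. It assumes the page parses reasonably with the standard library's html.parser; the names TagStripper and strip_tags are made up for this example. Everything inside an <object> element is dropped along with the tag itself.

```python
from html.parser import HTMLParser

UNWANTED = {"img", "object", "embed"}  # tags named in the post above
VOID = {"img", "embed"}                # these never take a closing tag


class TagStripper(HTMLParser):
    """Re-emits the input HTML, leaving out the UNWANTED elements."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside an <object>...</object>

    def emit(self, text):
        # Only keep output when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.out.append(text)

    def handle_starttag(self, tag, attrs):
        if tag not in UNWANTED:
            self.emit(self.get_starttag_text())
        elif tag not in VOID:
            self.skip_depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing form, e.g. <img ... /> or <br />.
        if tag not in UNWANTED:
            self.emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in UNWANTED:
            self.emit(f"</{tag}>")
        elif tag not in VOID and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        self.emit(data)

    def handle_entityref(self, name):
        self.emit(f"&{name};")

    def handle_charref(self, name):
        self.emit(f"&#{name};")

    def handle_comment(self, data):
        self.emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.emit(f"<!{decl}>")


def strip_tags(html):
    """Return html with <img>, <object> and <embed> elements removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

    Calling strip_tags on the text of a saved page returns the cleaned markup as a string. As noted above, a layout built on spacer GIFs will probably collapse once the images are gone.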
    All the buzzt
    CornedBee

    "Writing specifications is like writing a novel. Writing code is like writing poetry."
    - Anonymous, published by Raymond Chen

    Don't PM me with your problems, I scan most of the forums daily. If you do PM me, I will not answer your question.

  3. #3

    Thread Starter
    PowerPoster
    Join Date
    Jul 2001
    Location
    Tucson, AZ
    Posts
    2,166
    Thanks for the response, CornedBee.

    Will take your suggestions under consideration.

    I know I can't get 100%, but I came up with a couple of ideas:

    1. Convert to XHTML so the tags are well formed.

    2. Scan for any GIFs, etc., and consolidate them in one directory to eliminate duplicates.

    3. Delete all source directory information so only a clean reference to each file is left. I would save the original directory as a comment at the top of each HTML file, so I still know where I downloaded it from. (This way I can move the HTML to any directory of my choice.)

    4. Manually move the unwanted GIFs, etc. (from the consolidated directory) that I consider junk into a separate directory.
    WISH I COULD DO THIS BY CODE.

    5. Enumerate the directory of unwanted GIFs, etc., and compare it to the source HTML, deleting any tag group that references these unwanted items, then save the HTML file.
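
    Step 5 could look roughly like the sketch below, again in Python rather than VB for illustration only. It assumes the junk images are matched by bare file name (case-insensitive) and that only the <img> tag itself is removed, not a surrounding link or table cell. The function name remove_junk_images is hypothetical, and a regular expression like this will miss unusual markup.

```python
import re

# Matches a whole <img ...> tag, and the src attribute inside it
# (double-quoted, single-quoted or unquoted).
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
SRC_ATTR = re.compile(
    r"""\bsrc\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""", re.IGNORECASE
)


def remove_junk_images(html, junk_names):
    """Drop every <img> tag whose src file name is listed in junk_names."""
    junk = {name.lower() for name in junk_names}

    def replace(match):
        tag = match.group(0)
        src = SRC_ATTR.search(tag)
        if src is None:
            return tag  # no src attribute, leave the tag alone
        value = next(g for g in src.groups() if g is not None)
        # Compare on the bare file name, ignoring directory and query string.
        path = value.split("?")[0].replace("\\", "/")
        filename = path.rsplit("/", 1)[-1].lower()
        return "" if filename in junk else tag

    return IMG_TAG.sub(replace, html)
```

    The junk_names list could come from enumerating the junk directory, for example with os.listdir on whatever folder the unwanted files were moved to. Matching on file name alone means two different images that share a name would both be removed.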

    Got any thoughts on the above?

  4. #4
    Frenzied Member Jop's Avatar
    Join Date
    Mar 2000
    Location
    Amsterdam, the Netherlands
    Posts
    1,986
    Hey dw,
    I don't think converting to XHTML will solve your problem; the image tags will stay the same.

    I would suggest using a stylesheet to hide the stuff you don't need. You could also specify the media type as print, so that you still see the page in its full glory on screen but it prints as a stripped-down version.

    I set up a little example for you. Save it as a separate file and call it print.css, for example:

    Code:
    body{
    	background: none;
    	color: #000;
    }
    img, embed, object, form{
    	display: none;
    }
    It removes the background from the page and turns the text black. It also hides objects that you probably wouldn't need in print. You can add more items as you go along and find out what you need and what you don't.

    Once you've saved it, the only thing you have to paste into the HTML file you downloaded (in the <head> section) is:
    Code:
    <link rel="stylesheet" href="print.css" />
    Optionally, if you only want this to apply to print, you could use:
    Code:
    <link rel="stylesheet" href="print.css" media="print" />
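
    If pasting that line into every downloaded file by hand gets tedious, the step can be scripted. A minimal sketch in Python (not VB), assuming each page has a normal opening <head> tag and that print.css sits next to the HTML file; the function name add_print_stylesheet is made up for this example.

```python
import re

# The print-only link tag from the post above.
LINK = '<link rel="stylesheet" href="print.css" media="print" />'

# Opening <head> tag, with or without attributes (does not match <header>).
HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)


def add_print_stylesheet(html, link=LINK):
    """Insert the link tag right after <head>, unless it is already there."""
    if link in html:
        return html  # already added, do nothing
    match = HEAD_OPEN.search(html)
    if match is None:
        return html  # no <head> found, leave the page untouched
    end = match.end()
    return html[:end] + "\n" + link + html[end:]
```

    Reading each saved page, passing its text through add_print_stylesheet and writing the result back would cover a whole folder. Running it twice on the same file is harmless, because the link is only added once.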
    Jop - validweb.nl

    Alcohol doesn't solve any problems, but then again, neither does milk.

  5. #5

    Thread Starter
    PowerPoster
    Join Date
    Jul 2001
    Location
    Tucson, AZ
    Posts
    2,166
    Thanks Jop for your input and efforts on my behalf.

    Will give your suggestion a try.

    David
