Stripping HTML file of Unwanted Junk


dw85745
Sep 18th, 2004, 09:48 AM
I don't do a lot with HTML, but periodically I download a page or two. Most of the time, the page does not provide a printable format (that is, just the key text along with any needed graphics).

Is there an easy way to strip out all the junk (advertising GIFs, unneeded frames linking to other web pages, etc.)?

By easy I mean an already existing freeware program that does this, so I don't have to manually edit each page. NOTE: Each downloaded page normally brings along the same unneeded graphics.

If not, are there key tags that are commonly used, so I can write a VB program to screen for this info?

Thanks
David

CornedBee
Sep 19th, 2004, 05:10 AM
If you remove all <img>, <object> and <embed> tags, you should get a largely clean display. But it might destroy the layout, too, especially if the page is authored using old-style tables with spacer gifs.
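As a rough illustration of the tag removal described above, here is a minimal sketch in Python (not VB, and not something from this thread) built on the standard library's `html.parser`. The names `TagStripper` and `strip_tags` are made up for the example. It assumes that anything nested inside `<object>` can be thrown away along with it, and it rebuilds closing tags in lowercase, so treat it as a starting point rather than a finished tool.

```python
from html.parser import HTMLParser

# <img> and <embed> have no closing tag; <object> wraps fallback content,
# so everything nested inside it is dropped as well (an assumption).
VOID_TAGS = {"img", "embed"}
CONTAINER_TAGS = {"object"}


class TagStripper(HTMLParser):
    """Re-emit the HTML it is fed, minus the unwanted tags."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside an <object> element

    def _emit(self, text):
        if self.skip_depth == 0:
            self.out.append(text)

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINER_TAGS:
            self.skip_depth += 1
        elif tag not in VOID_TAGS:
            self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closed tags such as <br /> arrive here.
        if tag not in VOID_TAGS and tag not in CONTAINER_TAGS:
            self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in CONTAINER_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag not in VOID_TAGS:
            self._emit(f"</{tag}>")

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._emit(f"<!{decl}>")


def strip_tags(html):
    """Return html with <img>, <embed> and <object> elements removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling `strip_tags(page_text)` on the contents of a saved page returns the cleaned markup. As noted above, pages laid out with spacer GIFs may look different once the images are gone.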

dw85745
Sep 19th, 2004, 12:45 PM
Thanks for the response, CornedBee.

Will take your suggestions under consideration.

I know I can't get 100%, but I came up with a couple of ideas:

1. Convert to XHTML so the tags are well formed.

2. Scan for any GIFs, etc., and consolidate them in one directory to eliminate any duplicates.

3. Delete all source directory information so only a clean reference to the file is left. I would save the original directory as a comment at the top of each HTML file, to keep track of where I downloaded it from. (This way I can move the HTML file to any directory of my choice.)

4. Manually delete the unwanted GIFs, etc. (in the consolidated directory) that I consider junk and put them in a separate directory.
WISH I COULD DO THIS BY CODE.

5. Enumerate the directory of unwanted GIFs, etc. and compare it to the source HTML, deleting any tag group that contains these unwanted items. Then save the HTML file.
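Steps 2 and 5 of the list above could be sketched roughly as follows. This is Python rather than VB, and the function names `group_duplicates` and `drop_unwanted_images` are invented for the illustration. It assumes duplicates are byte-identical files (found by hashing), and it only removes the `<img>` tag itself, not a surrounding link or table cell, so the "tag group" part of step 5 is left open.

```python
import hashlib
import re
from collections import defaultdict


def group_duplicates(paths):
    """Step 2: return {content hash: [paths]} for byte-identical files."""
    groups = defaultdict(list)
    for path in paths:
        with open(path, "rb") as fh:
            digest = hashlib.sha256(fh.read()).hexdigest()
        groups[digest].append(path)
    # Keep only hashes shared by more than one file.
    return {h: ps for h, ps in groups.items() if len(ps) > 1}


IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
# src value in double quotes, single quotes, or unquoted.
SRC_ATTR = re.compile(
    r"""(?<![\w-])src\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.IGNORECASE,
)


def drop_unwanted_images(html, unwanted_names):
    """Step 5: remove <img> tags whose file name is in unwanted_names."""
    unwanted = {name.lower() for name in unwanted_names}

    def replace(match):
        tag = match.group(0)
        src = SRC_ATTR.search(tag)
        if not src:
            return tag  # no src attribute, leave the tag alone
        value = next(g for g in src.groups() if g is not None)
        # Ignore any query string or fragment, then keep only the file name.
        path = re.split(r"[?#]", value)[0].replace("\\", "/")
        name = path.rsplit("/", 1)[-1]
        return "" if name.lower() in unwanted else tag

    return IMG_TAG.sub(replace, html)
```

With the junk images moved to their own directory (step 4), something like `drop_unwanted_images(page_text, os.listdir(junk_dir))` would cover the comparison in step 5; `junk_dir` here is a placeholder. Matching is by file name only, so two different images that share a name would be treated as the same.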

Any thoughts on the above?

Jop
Sep 20th, 2004, 05:27 AM
Hey dw,
I don't think converting to XHTML will solve your problem; the image tags will stay the same.

I would suggest using a stylesheet to hide the stuff that you don't need. You could also specify the media type as print, so that you still see the page in its full glory on screen but it gets printed as a stripped-down version.

I set up a little example for you. Save it as a separate file and call it print.css, for example:


body {
    background: none;
    color: #000;
}
img, embed, object, form {
    display: none;
}


It removes the background from the page and turns the text black. It also hides objects that you probably wouldn't need in print; you can add more items as you go along and find out what you need and what you don't.

Once you have saved it, the only thing you have to paste into the HTML file you downloaded (in the <head> section) is:

<link rel="stylesheet" href="print.css" />


Optionally, if you only want this to apply when printing, you could use:

<link rel="stylesheet" href="print.css" media="print" />
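To avoid pasting that line into every downloaded page by hand, the insertion could be scripted. The following is a minimal sketch in Python, not part of the suggestion above; `add_print_stylesheet` and `process_folder` are made-up names. It assumes print.css sits next to the HTML files, that each page has a closing `</head>` tag, and that reading the files as Latin-1 is acceptable (that encoding round-trips every byte unchanged, but the choice is an assumption).

```python
import re
from pathlib import Path

LINK_TAG = '<link rel="stylesheet" href="print.css" media="print" />'
HEAD_CLOSE = re.compile(r"</head\s*>", re.IGNORECASE)


def add_print_stylesheet(html, link_tag=LINK_TAG):
    """Insert link_tag just before </head>, unless it is already present."""
    if 'href="print.css"' in html:
        return html  # already linked, nothing to do
    match = HEAD_CLOSE.search(html)
    if match is None:
        return html  # no </head> found, leave the page untouched
    pos = match.start()
    return html[:pos] + link_tag + "\n" + html[pos:]


def process_folder(folder):
    """Apply add_print_stylesheet to every .htm/.html file in folder."""
    for path in Path(folder).glob("*.htm*"):
        # latin-1 maps each byte to one character, so nothing is lost.
        text = path.read_text(encoding="latin-1")
        updated = add_print_stylesheet(text)
        if updated != text:
            path.write_text(updated, encoding="latin-1")
```

The link is placed at the end of the head so that it comes after the page's own stylesheets, which gives the print rules a better chance of winning when both set the same property. Running `process_folder` a second time leaves already-linked files alone.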

dw85745
Sep 20th, 2004, 11:30 AM
Thanks Jop for your input and efforts on my behalf.

Will give your suggestion a try.

David