
Thread: strip text from HTML

  1. #1

    Thread Starter
    Hyperactive Member stingrae
    Join Date
    Apr 2002
    Location
    Sydney
    Posts
    401

    strip text from HTML

    Does anybody here have an algorithm to strip the text from an HTML page?

    For instance, I have an HTML page with lots of data on it, but when I open it with a text stream reader, I get all the "color=", "border=", etc. crap as well as the text that I want.

    If anybody has a function lying around that does this, it would be greatly appreciated.
    "The passion lives to keep your faith, though all are different, all are great" ... Michael Hutchence 1960-1997.

    Windows & Web Developer
    Specialising in Visual Basic .Net & Client Server Programming & Client/Customer Relations Databases
    Sutherland Shire, Sydney Australia
    www.stingrae.com.au
    Developer of Arnold - Gym & Martial Arts Database Management System
    www.gymdatabase.com.au

  2. #2
    Addicted Member
    Join Date
    Sep 2003
    Posts
    227
    This is what I could think of:

    Pass your HTML to this function; it removes whatever is between < and >.


    Code:
        Public Function GetTextFromHtml(ByVal input As String) As String
            ' Split on "<": every piece after the first one starts inside a tag.
            Dim pieces() As String = input.Split("<"c)
            Dim output As New System.Text.StringBuilder()
            For i As Integer = 0 To pieces.Length - 1
                Dim piece As String = pieces(i)
                Dim closePos As Integer = piece.IndexOf(">"c)
                If i > 0 AndAlso closePos >= 0 Then
                    ' Keep only the text that follows the tag's closing ">".
                    piece = piece.Substring(closePos + 1)
                End If
                ' Skip pieces that hold nothing but whitespace.
                If piece.Trim().Length > 0 Then output.Append(piece.Trim()).Append(vbCrLf)
            Next
            Return output.ToString()
        End Function

  3. #3
    Addicted Member
    Join Date
    Sep 2003
    Posts
    227
    But you have to think of something for the double quotes in the HTML code, since you have to pass a string to the function.


    Is this what you want? If not, please let me know.
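
    On the double-quote point: quotes only need special handling when the HTML is pasted into the source code as a string literal, where VB.NET expects each quote to be doubled (for example "<font color=""red"">"). If the page is read from a file at run time, no escaping should be needed. A minimal sketch, assuming the page is saved locally; ReadAllHtml and the file name page.html are made up for illustration:

    ```vbnet
    Imports System.IO

    Module ReadHtmlExample
        ' Reads a whole file into one String. Quotes inside the file need no
        ' escaping, because the text never appears as a literal in the source.
        Public Function ReadAllHtml(ByVal path As String) As String
            Dim reader As New StreamReader(path)
            Try
                Return reader.ReadToEnd()
            Finally
                reader.Close()
            End Try
        End Function

        Sub Main()
            ' "page.html" is a placeholder file name for this sketch.
            Dim html As String = ReadAllHtml("page.html")
            Console.WriteLine(html.Length)
        End Sub
    End Module
    ```

    The returned string can then be handed straight to a function such as GetTextFromHtml.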
    Last edited by persianboy; Nov 19th, 2003 at 09:53 PM.

  4. #4
    Frenzied Member
    Join Date
    Oct 2002
    Location
    Gammapolis
    Posts
    1,474
    http://www.4guysfromrolla.com/webtech/042501-1.shtml

    However, you will still face problems on complicated pages.
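
    To give an idea of what a regular-expression approach can look like, here is a rough sketch. It is not copied from the linked article; StripTags is a made-up name, and the entity list is deliberately incomplete. It also shows where complicated pages cause trouble: script and style blocks, comments, and entities all need their own handling, and a ">" inside an attribute value will still confuse it.

    ```vbnet
    Imports System.Text.RegularExpressions

    Module StripTagsExample
        Public Function StripTags(ByVal html As String) As String
            Dim opts As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline
            ' Drop script and style blocks together with their contents.
            Dim text As String = Regex.Replace(html, "<(script|style)\b[^>]*>.*?</\1\s*>", "", opts)
            ' Drop comments, then replace any remaining tag with a space.
            text = Regex.Replace(text, "<!--.*?-->", "", opts)
            text = Regex.Replace(text, "<[^>]+>", " ", opts)
            ' Decode a few common entities (not a complete list); &amp; goes last.
            text = text.Replace("&nbsp;", " ").Replace("&lt;", "<").Replace("&gt;", ">").Replace("&amp;", "&")
            ' Collapse runs of whitespace into single spaces.
            Return Regex.Replace(text, "\s+", " ").Trim()
        End Function

        Sub Main()
            ' Prints: Hello world
            Console.WriteLine(StripTags("<p>Hello <b>world</b></p>"))
        End Sub
    End Module
    ```
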
    'Heading for the automatic overload'
    Marillion, Brave, The Great Escape, 1994

    'How will WE stand the FIRE TOMORROW?'
    Eloy, Silent Cries and Mighty Echoes, The Vision - Burning, 1979

  5. #5

    Thread Starter
    Hyperactive Member stingrae
    Join Date
    Apr 2002
    Location
    Sydney
    Posts
    401
    Hmmmm... thanks for the suggestions guys, but unfortunately they're not really returning readable output.

    Here's the page that I'm trying to get into a text format:

    http://www.ssp.co.uk/coupons/coupon.htm

    Any other ideas?

  6. #6
    Addicted Member Stick
    Join Date
    Aug 1999
    Location
    Iowa
    Posts
    152

    Hey... not sure if you figured it out yet, but I think what you want is just the text from the web page, right?
    Well, let me know if this is what you want.
    Attached Files

  7. #7

    Thread Starter
    Hyperactive Member stingrae
    Join Date
    Apr 2002
    Location
    Sydney
    Posts
    401
    Hey Stick,

    Thanks for that. Seems to work well. Will test more and let you know.

    Cheers.


  8. #8

    Thread Starter
    Hyperactive Member stingrae
    Join Date
    Apr 2002
    Location
    Sydney
    Posts
    401
    Hhmmmm... it works well for single HTML pages, but the problem comes when you introduce frames. It doesn't always seem to pick up the main frame.
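
    One possible way around the frames problem, sketched under assumptions: a frameset page holds no text of its own, only frame tags pointing at other pages, so the src values can be collected and each page fetched and stripped separately. This is not a change to the attached code, which is not shown in the thread; GetFrameSources and the sample markup are made up for illustration, and resolving relative URLs against the page address is left out.

    ```vbnet
    Imports System.Collections
    Imports System.Text.RegularExpressions

    Module FrameSourcesExample
        ' Returns the src values of all <frame> and <iframe> tags in the markup.
        Public Function GetFrameSources(ByVal html As String) As ArrayList
            Dim sources As New ArrayList()
            ' Matches src="x", src='x' and src=x inside a frame or iframe tag.
            Dim pattern As String = "<i?frame\b[^>]*?\bsrc\s*=\s*[""']?([^""'\s>]+)"
            For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
                sources.Add(m.Groups(1).Value)
            Next
            Return sources
        End Function

        Sub Main()
            ' Hypothetical frameset used only to exercise the function.
            Dim page As String = "<frameset><frame name=""menu"" src=""menu.htm"">" & _
                                 "<frame name=""main"" src=""main.htm""></frameset>"
            ' Prints menu.htm, then main.htm.
            For Each src As String In GetFrameSources(page)
                Console.WriteLine(src)
            Next
        End Sub
    End Module
    ```

    Which frame counts as the "main" one still has to be decided by name or position, since the markup itself does not say.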

