Does anybody here have an algorithm to extract just the text from an HTML page?
For instance, I have an HTML page with lots of data on it, but when I open it with a text stream reader, I get all the "color=", "border=" and other markup as well as the text that I want.
If anybody has a function lying around that does this, it would be greatly appreciated.
"The passion lives to keep your faith, though all are different, all are great" ... Michael Hutchence 1960-1997.
Windows & Web Developer Specialising in Visual Basic .Net & Client Server Programming & Client/Customer Relations Databases
Sutherland Shire, Sydney Australia www.stingrae.com.au Developer of Arnold - Gym & Martial Arts Database Management System www.gymdatabase.com.au
Pass your HTML to this function; it removes whatever is between < and >.
Code:
Public Function GetTextFromHtml(ByVal input As String) As String
    ' Split on "<" so every piece after the first begins with the inside of a tag.
    Dim segments() As String = input.Split("<"c)
    Dim output As New System.Text.StringBuilder()
    For Each segment As String In segments
        ' Everything up to and including the first ">" is the tag; keep what follows it.
        Dim closeIndex As Integer = segment.IndexOf(">"c)
        If closeIndex >= 0 Then
            output.Append(segment.Substring(closeIndex + 1)).Append(vbCrLf)
        Else
            ' No ">" in this piece, so treat the whole piece as plain text.
            output.Append(segment).Append(vbCrLf)
        End If
    Next
    Return output.ToString()
End Function
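For comparison, the same "remove whatever is between < and >" idea can be sketched with a regular expression instead of splitting the string. This is only a rough sketch: the module name and the StripTags helper are made up for illustration. It assumes every tag is closed with ">", and it does not treat script blocks, style blocks or comments specially, so their contents would be left in the output.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StripTagsSketch
    ' Hypothetical helper (not from the thread): removes anything that looks like <...>.
    Public Function StripTags(ByVal html As String) As String
        ' "<[^>]*>" matches "<", then any run of characters other than ">", then ">".
        Return Regex.Replace(html, "<[^>]*>", String.Empty)
    End Function

    Sub Main()
        ' Prints: Hello world
        Console.WriteLine(StripTags("<p color=""red"">Hello <b>world</b></p>"))
    End Sub
End Module
```

Unlike the function above, this version does not add a line break after each tag, so the text stays on whatever lines it had in the source HTML.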
Thanks for that. It seems to work well. I will test it more and let you know.
Cheers.
Hmm... it works well for single HTML pages, but the problem comes when you introduce frames. It doesn't always seem to pick up the main frame.
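One likely reason, offered as a guess rather than a diagnosis: a frameset page usually holds only frame tags that point at other documents, so stripping its tags leaves very little text. A possible workaround is to collect the src value of each frame, load those pages separately, and strip each one on its own. The sketch below covers only the first step. The module name, the GetFrameSources helper and the sample file names are invented for illustration, and the pattern assumes src values are quoted.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module FrameSourcesSketch
    ' Hypothetical helper (not from the thread): returns the src value of
    ' every <frame> or <iframe> tag found in the HTML.
    Public Function GetFrameSources(ByVal html As String) As List(Of String)
        Dim sources As New List(Of String)()
        ' Matches <frame ...> or <iframe ...> and captures a quoted src attribute value.
        Dim pattern As String = "<i?frame\b[^>]*?\bsrc\s*=\s*[""']([^""']*)[""']"
        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            sources.Add(m.Groups(1).Value)
        Next
        Return sources
    End Function

    Sub Main()
        ' Sample frameset page; the file names are placeholders.
        Dim page As String = "<frameset cols=""20%,80%""><frame src=""menu.html""><frame name=""main"" src=""main.html""></frameset>"
        For Each src As String In GetFrameSources(page)
            Console.WriteLine(src)
        Next
    End Sub
End Module
```

The returned values may be relative URLs, so they would need to be resolved against the address of the frameset page before each frame is downloaded and passed to the tag-stripping function.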