I need to split HTML code. I have the code as a string, and I need to get all the information between "text4" and "text4". There are several strings to extract, so I need to go through the complete page. How do I do this?
I'm sure something incredibly similar to this has been posted before, so have a look.
If you can't find it, I'll try to help.
I haven't come up with anything yet. I am trying this code, but I'm not sure how to use it properly.
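Private Sub Command2_Click()
    Dim stemp As String, tmp As String, p As Long
    stemp = text1.Text
    ' Skip the whole first "text4" marker, not just its first character
    tmp = Mid$(stemp, InStr(stemp, "text4") + Len("text4"))
    p = InStr(tmp, "text4")
    ' Left$ raises error 5 on a negative length, so check for a second marker
    If p > 0 Then box1.Text = Left$(tmp, p - 1)
End Sub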
Post an extract of the text you're trying to parse.
I need to get the "NEEDED INFO" out.
All of the information I need for each record is between one "text4" and the next. Each field needed within the record starts with "text5".
<td width="20" valign="top"><span class="text4">5.
</span></td><td valign="top"><span class="text5"><b>NEEDED INFO(COMPANY)</b></span></td><td><img src="../images/pixel.gif" width="1" height="1"></td><td align="right" rowspan="3">
<table border="0" cellpadding="0" cellspacing="0"><tr><td><A href="map.asp?SEARCH_AFFILIATE_DATA.x=0&DBAFFILIATEID=%7B3EA51E13%2DC923%2D4FA5%2DBCEB%2DF693C62 3D24F%7D&INNERCODE=072595A101A&language=En" title="Street Level Map"><img width="25" height="22" border="0" src="../images/directorymap.gif" alt="Street Level Map"></A></td><td><A href="iti.asp?Search_iti.x=0&DBAFFILIATEID=%7B3EA51E13%2DC923%2D4FA5%2DBCEB%2DF693C623D24F%7D&am p;INNERCODE=072595A101A&ITI_START_COUNTRYCODE=US&ITI_START_CITYNAME=COLUMBUS&ITI_START_Z IPCODE=39702&ITI_START_STATE=MS&ITI_START_ADDRESS=&SESSIONID={1F5033B2-C668-43E4-A89B-725ADC1FD992}"><img width="25" height="22" border="0" src="../images/directorydriveme.gif"></A></td></tr></table></td></tr><tr><td width="20"><img src="../images/pixel.gif" width="10" height="1"></td><td colspan="2"><span class="text5">NEEDED INFO(ADDRESS1)<br>NEEDED INFO(ADDRESS2)</span></td></tr><tr><td width="20"><img src="../images/pixel.gif" width="10" height="1"></td><td colspan="2"><span class="text5">NEEDED INFO (CITY<STATE<ZIP) </span></td></tr><tr><td width="20"><img src="../images/pixel.gif" width="10" height="1"></td><td colspan="3"><span class="text5">
NEEDEDINFO (PHONE AND FAX)</span></td></tr><tr><td width="20"><img src="../images/pixel.gif" width="10" height="1"></td><td colspan="3"><span class="text5"><A target="_blank" href="http://www.caseih.com/DEALERS/johnsonimp">http://www.caseih.com/DEALERS/johnsonimp</A></span></td></tr><tr><td colspan="4"><img src="../images/pixel.gif" width="1" height="10"></td></tr></table></td><td><img src="../images/pixel.gif" width="8" height="10"></td><td class="bordersearch"><img src="../images/pixel.gif" width="1" height="1"></td></tr><tr><td class="bordersearch"><img src="../images/pixel.gif" width="1" height="1"></td><td colspan="3" class="bordersearch"><img src="../images/pixel.gif" width="1" height="1"></td><td class="bordersearch"><img src="../images/pixel.gif" width="1" height="1"></td></tr><tr><td class="bordersearch"><img src="../images/pixel.gif" width="1" height="1"></td><td><img src="../images/pixel.gif" width="8" height="10"></td><td valign="top"><table width="100%" border="0" cellpadding="0" cellspacing="0"><tr><td width="20" valign="top"><span class="text4">6.
So what should be the final output? You need to eliminate some tags and the info nested in them?
So you want almost all of the html, starting with ">5. and ending with ">6. ?
That doesn't seem right to me...
If you look, all of the needed info is between "text4" and "text4".
This is a contact record, so between one text4 and the next there are several items that start with text5. I need to pull the name, address, city, state, zip, email and website, all of which start with text5 (within text4).
I have this code lying around... what it does is remove all HTML tags leaving the text.
VB Code:
Private Sub Command1_Click()
    Dim strWorking As String
    Dim strOutput As String
    Dim lngPosLessThan As Long

    strWorking = Text1.Text
    Do While Len(strWorking) > 0
        If Left(strWorking, 1) = "<" Then
            strWorking = Mid(strWorking, InStr(1, strWorking, ">") + 1)
        Else
            lngPosLessThan = InStr(1, strWorking, "<")
            If lngPosLessThan > 0 Then
                'Move non-Tag string to strOutput
                strOutput = strOutput & Left(strWorking, lngPosLessThan - 1)
                strWorking = Mid(strWorking, InStr(1, strWorking, "<"))
            Else
                'no other tag in string.
                strOutput = strOutput & strWorking
                strWorking = ""
            End If
        End If
    Loop
    Text2.Text = strOutput
End Sub
In case you're willing to enhance it to fit your needs.
This was my method for stripping HTML tags, may also be of some use
VB Code:
Function RemoveHTML(strHTML As String) As String
    Do
        tagOpen = InStr(1, strHTML, "<")
        tagClose = InStr(1, strHTML, ">")
        strHTML = Replace(strHTML, Mid(strHTML, tagOpen, tagClose - tagOpen + 1), "")
        Debug.Print strHTML
    Loop Until InStr(1, strHTML, "<") = 0
    RemoveHTML = strHTML
End Function
If it doesn't help at all, I'll write some custom code for you ;/
hi,
if each record is separated by text4 then
use this
split(entiretext,"text4")
Quote (originally posted by da_silvy): This was my method for stripping HTML tags, may also be of some use
Can I make a small suggestion?
VB Code:
Function RemoveHTML(strHTML As String) As String
    Do
        tagOpen = InStr(1, strHTML, "<")
        tagClose = InStr(tagOpen, strHTML, ">")
        strHTML = Replace(strHTML, Mid(strHTML, tagOpen, tagClose - tagOpen + 1), "")
        Debug.Print strHTML
    Loop Until InStr(1, strHTML, "<") = 0
    RemoveHTML = strHTML
End Function
Wouldn't that tiny change stop it from finding a stray '>' that might come before tagOpen?
Just a thought :D
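To make the effect of that change concrete, here is a minimal sketch. The helper name FirstTag and the sample text are made up for illustration; the only point is that starting the second InStr at tagOpen skips any ">" that appears earlier in the text.

```vb
' Hypothetical helper illustrating the suggested change.
Function FirstTag(ByVal s As String) As String
    Dim tagOpen As Long, tagClose As Long
    tagOpen = InStr(1, s, "<")
    If tagOpen = 0 Then Exit Function          ' no tag at all
    ' Start the search at tagOpen so a ">" earlier in the text is ignored
    tagClose = InStr(tagOpen, s, ">")
    If tagClose = 0 Then Exit Function         ' unterminated tag
    FirstTag = Mid$(s, tagOpen, tagClose - tagOpen + 1)
End Function

Sub DemoFirstTag()
    ' The text has a stray ">" at position 3, before the "<" at position 7.
    ' Searching from position 1 would give tagClose = 3 and a Mid length
    ' of 3 - 7 + 1 = -3, which raises run-time error 5.
    Debug.Print FirstTag("1 > 0 <b>bold</b>")  ' prints <b>
End Sub
```

Run DemoFirstTag from the Immediate window to try it.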
Far from perfect but it should do what you want, you'll need to tweak it yourself
VB Code:
Private Sub Form_Load()
    ' Long rather than Integer: positions overflow an Integer past 32,767 characters
    Dim StartPos As Long, EndPos As Long
    Dim SearchText As String, IsRecord As Boolean
    Dim strText As String

    StartPos = 1
    SearchText = Text1
    SearchText = Replace(SearchText, vbCrLf, " ")

    ' Clean up double spaces ?
    While InStr(1, SearchText, "  ")
        SearchText = Replace(SearchText, "  ", " ")
    Wend

    While InStr(StartPos, SearchText, "<span class=""") <> 0
        StartPos = InStr(StartPos, SearchText, "<span class=""")
        EndPos = InStr(StartPos, SearchText, "</span>")

        If Mid(SearchText, StartPos + Len("<span class="""), Len("text#")) = "text4" Then
            IsRecord = True
        Else
            IsRecord = False
        End If

        StartPos = StartPos + Len("<span class=""text#"">")
        'strText = Mid(SearchText, StartPos, EndPos - StartPos)
        strText = RemoveHTML(Trim(Mid(SearchText, StartPos, EndPos - StartPos)))

        ' Handle your stuff here
        If IsRecord Then
            List1.AddItem strText
        Else
            List1.AddItem vbTab & strText
        End If

        StartPos = EndPos
    Wend
    Text1 = SearchText
End Sub

Function RemoveHTML(strHTML As String) As String
    Do
        If InStr(1, strHTML, "<") = 0 Or InStr(1, strHTML, ">") = 0 Then Exit Do
        tagOpen = InStr(1, strHTML, "<")
        tagClose = InStr(1, strHTML, ">")
        strHTML = Replace(strHTML, Mid(strHTML, tagOpen, tagClose - tagOpen + 1), "")
        Debug.Print strHTML
    Loop Until InStr(1, strHTML, "<") = 0
    RemoveHTML = strHTML
End Function
Well-formed HTML (that is, XHTML) is also valid XML. This means you should be able to use an XML parser to analyse the HTML file, provided the page really is well-formed.
(saves you writing awkward code)
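A rough sketch of the XML-parser idea, assuming MSXML 6.0 is installed and the input is well-formed. The function name SpanTexts and the sample markup are hypothetical. The page posted earlier in this thread has unclosed img and br tags, so it would most likely be rejected by loadXML unless it is cleaned up first.

```vb
' Sketch only: late-bound MSXML, so no project reference is needed.
Function SpanTexts(ByVal sXml As String, ByVal sClass As String) As Collection
    Dim doc As Object, nodes As Object, i As Long

    Set SpanTexts = New Collection
    Set doc = CreateObject("MSXML2.DOMDocument.6.0")
    doc.async = False
    If Not doc.loadXML(sXml) Then
        ' Input that is not well-formed ends up here
        Debug.Print "Not well-formed: " & doc.parseError.reason
        Exit Function
    End If

    ' XPath: every span element whose class attribute equals sClass
    Set nodes = doc.selectNodes("//span[@class='" & sClass & "']")
    For i = 0 To nodes.length - 1
        SpanTexts.Add nodes.Item(i).Text
    Next i
End Function

Sub DemoSpanTexts()
    Dim sXml As String, v As Variant
    ' Hypothetical, well-formed stand-in for one record
    sXml = "<table><tr><td><span class=""text4"">5.</span></td>" & _
           "<td><span class=""text5"">COMPANY</span></td></tr>" & _
           "<tr><td><span class=""text5"">ADDRESS</span></td></tr></table>"
    For Each v In SpanTexts(sXml, "text5")
        Debug.Print v
    Next v
End Sub
```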
Quote (originally posted by Spajeoly): Can I make a small suggestion? [...] Wouldn't that tiny change stop it from finding a stray '>' that might come before tagOpen?
Heh, it was a 5-minute function written as a reply to a codebank post. Ideally that shouldn't actually happen: any literal ">" and "<" characters in the text would be HTML entities, i.e. &gt; etc.
it's a tighter piece of code with that small adjustment though, gw ;)
the code isn't perfect, it fails if there's no < or > in there either, for which there's a small fix in my last post ;p
Quote (originally posted by jayakumar): if each record is separated by text4 then use this: split(entiretext,"text4")
This way is easiest.
It doesn't actually do what he wants, if you check the rest of the posts.
I couldn't send this via PM so here it is :p
VB Code:
' Long rather than Integer: positions overflow an Integer past 32,767 characters
Dim StartPos As Long, EndPos As Long, BRPos As Long
Dim SearchText As String, IsRecord As Boolean, strBR() As String
Dim FirstField As Boolean, DealerCount As Integer, DealerPrefix As String
Dim strText As String
Dim strInput As String
Dim i As Long

Open "c:\caseih.txt" For Input As #1
Do Until EOF(1)
    ' Line Input reads the whole line; plain Input would split it at commas
    Line Input #1, strInput
    SearchText = SearchText & strInput & vbCrLf
Loop
Close #1

StartPos = 1
DealerCount = 0
SearchText = Mid(SearchText, InStr(1, SearchText, "../images/powered.gif"))
SearchText = Replace(SearchText, vbCrLf, "")

While InStr(StartPos, SearchText, "<span class=""") <> 0
    StartPos = InStr(StartPos, SearchText, "<span class=""")
    EndPos = InStr(StartPos, SearchText, "</span>")

    ' The span right after a text4 span is the first field of a dealer
    FirstField = False
    If IsRecord Then FirstField = True

    If Mid(SearchText, StartPos + Len("<span class="""), Len("text#")) = "text4" Then
        IsRecord = True
    Else
        IsRecord = False
    End If

    StartPos = StartPos + Len("<span class=""text#"">")
    strText = Mid(SearchText, StartPos, EndPos - StartPos)
    'strText = RemoveHTML(Trim(Mid(SearchText, StartPos, EndPos - StartPos)))
    strText = Replace(strText, "<b>", "")
    strText = Replace(strText, "</b>", "")

    ' Handle your stuff here
    If Not IsRecord Then
        DealerPrefix = ""
        If FirstField Then
            DealerCount = DealerCount + 1
            DealerPrefix = CStr(DealerCount) & "."
        End If

        ' Skip fields that contain a link
        If InStr(1, LCase(strText), "<a ") = 0 Then
            BRPos = InStr(1, strText, "<br>")
            If BRPos <> 0 Then
                strBR = Split(strText, "<br>")
                For i = 0 To UBound(strBR)
                    If Len(Trim(strBR(i))) <> 0 Then
                        List1.AddItem DealerPrefix & vbTab & Trim(strBR(i))
                        DealerPrefix = ""
                    End If
                Next
            Else
                List1.AddItem DealerPrefix & vbTab & strText
            End If
        End If
    End If

    StartPos = EndPos
Wend
I saved the website to a text file then used that, it parses it all, you just need to decide what you want to do with the information.
Hey,
it looks like it will work, but I get an error on this line:
VB Code:
SearchText = Mid(SearchText, InStr(1, SearchText, "../images/powered.gif"))
The error says "Invalid procedure call or argument" (run-time error 5).
I was parsing the entire website ;o
There are occurrences of spans using that class before the dealer information
That "powered.gif" is the powered by map blah blah, and the dealers follow...
You can just remove that line, the code's no different (really) from what I posted before.
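For what it's worth, run-time error 5 on that line is what Mid raises when its start argument is 0, and InStr returns 0 when it does not find the marker, so the most likely cause is that "../images/powered.gif" is not in the saved file. If you would rather keep the trimming step than delete it, a guard along these lines should do. The helper name TextFrom is made up for this sketch.

```vb
' Hypothetical helper: returns the part of sText starting at sMarker,
' or the whole text unchanged when the marker is not present.
Function TextFrom(ByVal sText As String, ByVal sMarker As String) As String
    Dim pos As Long
    pos = InStr(1, sText, sMarker)
    If pos > 0 Then
        TextFrom = Mid$(sText, pos)
    Else
        ' InStr returned 0; Mid$(sText, 0) would raise run-time error 5
        TextFrom = sText
    End If
End Function
```

The failing line would then become SearchText = TextFrom(SearchText, "../images/powered.gif").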
I guess I am still doing something wrong.
The program goes from
VB Code:
While InStr(StartPos, SearchText, "<span class=""") <> 0
straight to
VB Code:
End Sub
Here is the complete code.
VB Code:
Private Sub Command1_Click()
    ' Long rather than Integer: positions overflow an Integer past 32,767 characters
    Dim StartPos As Long, EndPos As Long, BRPos As Long
    Dim SearchText As String, IsRecord As Boolean, strBR() As String
    Dim FirstField As Boolean, DealerCount As Integer, DealerPrefix As String
    Dim strText As String
    Dim strInput As String
    Dim i As Long

    Open "c:\caseih.txt" For Input As #1
    Do Until EOF(1)
        ' Line Input reads the whole line; plain Input would split it at commas
        Line Input #1, strInput
        SearchText = SearchText & strInput & vbCrLf
    Loop
    Close #1

    StartPos = 1
    DealerCount = 0
    'SearchText = Mid(SearchText, InStr(1, SearchText, "../images/powered.gif"))
    SearchText = Replace(SearchText, vbCrLf, "")

    While InStr(StartPos, SearchText, "<span class=""") <> 0
        StartPos = InStr(StartPos, SearchText, "<span class=""")
        EndPos = InStr(StartPos, SearchText, "</span>")

        FirstField = False
        If IsRecord Then FirstField = True

        If Mid(SearchText, StartPos + Len("<span class="""), Len("text#")) = "text4" Then
            IsRecord = True
        Else
            IsRecord = False
        End If

        StartPos = StartPos + Len("<span class=""text#"">")
        strText = Mid(SearchText, StartPos, EndPos - StartPos)
        'strText = RemoveHTML(Trim(Mid(SearchText, StartPos, EndPos - StartPos)))
        strText = Replace(strText, "<b>", "")
        strText = Replace(strText, "</b>", "")

        ' Handle your stuff here
        If Not IsRecord Then
            DealerPrefix = ""
            If FirstField Then
                DealerCount = DealerCount + 1
                DealerPrefix = CStr(DealerCount) & "."
            End If

            If InStr(1, LCase(strText), "<a ") = 0 Then
                BRPos = InStr(1, strText, "<br>")
                If BRPos <> 0 Then
                    strBR = Split(strText, "<br>")
                    For i = 0 To UBound(strBR)
                        If Len(Trim(strBR(i))) <> 0 Then
                            List1.AddItem DealerPrefix & vbTab & Trim(strBR(i))
                            DealerPrefix = ""
                        End If
                    Next
                Else
                    List1.AddItem DealerPrefix & vbTab & strText
                End If
            End If
        End If

        StartPos = EndPos
    Wend
End Sub
have you set the search text or anything?
you can't use my code "as is", you have to do some work ._.
What do you mean by setting the search text? It is the string of HTML.
I'm sorry, but I don't understand what to do.
thanks for your help.
well you have to adjust the code for what you are doing
where are you getting the text which you want to parse from?
Split([entireHTMLstring],"text4") gives you an array containing
the text between all the text4's.
if you then run a Split(text4array(1,2,3,4,etc),"text5") it will give you the text between all the text5's.
course, then you gotta clean it up, cause I don't think you want half-tags in it.
use the replace function.
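The two-level Split described above could look roughly like this. It is only a sketch: the function name ExtractFields and the shortened sample string are made up, and it assumes every text5 field is closed by a </span> tag. Splitting on the full attribute text (class="text4" and class="text5">) rather than on the bare word leaves fewer half-tags to clean up.

```vb
' Hypothetical helper: returns each text5 field as "<record number><Tab><text>".
Function ExtractFields(ByVal sHtml As String) As Collection
    Dim records() As String, fields() As String
    Dim r As Long, f As Long, sValue As String

    Set ExtractFields = New Collection
    records = Split(sHtml, "class=""text4""")
    ' records(0) is whatever precedes the first marker, so start at 1
    For r = 1 To UBound(records)
        fields = Split(records(r), "class=""text5"">")
        ' fields(0) holds the record number; the wanted fields follow it
        For f = 1 To UBound(fields)
            ' Keep only the text in front of the closing </span>
            sValue = Split(fields(f), "</span>")(0)
            ExtractFields.Add CStr(r) & vbTab & Trim$(sValue)
        Next f
    Next r
End Function

Sub DemoExtractFields()
    Dim sHtml As String, v As Variant
    ' Shortened, hypothetical stand-in for the real page
    sHtml = "<span class=""text4"">5.</span><span class=""text5"">COMPANY</span>" & _
            "<span class=""text5"">ADDRESS</span><span class=""text4"">6."
    For Each v In ExtractFields(sHtml)
        Debug.Print v
    Next v
End Sub
```

Fields that still contain tags such as <b> or <br> would need the Replace clean-up mentioned above.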
da_silvy,
here is the text i am trying to parse.
Any help is appreciated.
That website's a joke, it's not html transitional at all.
VB Code:
' Long holds any string position; Integer overflows past 32,767 characters
Dim StartPos As Long, EndPos As Long, BRPos As Long
Dim SearchText As String, IsRecord As Boolean, strBR() As String
Dim FirstField As Boolean, DealerCount As Integer, DealerPrefix As String
Dim strText As String
Dim strInput As String
Dim i As Long

Open "c:\caseih.htm" For Input As #1
Do Until EOF(1)
    ' Line Input reads the whole line; plain Input would split it at commas
    Line Input #1, strInput
    SearchText = SearchText & strInput & vbCrLf
Loop
Close #1

StartPos = 1
DealerCount = 0
SearchText = Mid(SearchText, InStr(1, SearchText, "/powered.gif"))
SearchText = Replace(SearchText, vbCrLf, "")

' This file has no quotes around attribute values and mixed-case tags,
' so search case-insensitively for <span class= without the quote
While InStr(StartPos, LCase(SearchText), "<span class=") <> 0
    StartPos = InStr(StartPos, LCase(SearchText), "<span class=")
    EndPos = InStr(StartPos, LCase(SearchText), "</span>")

    FirstField = False
    If IsRecord Then FirstField = True

    If Mid(LCase(SearchText), StartPos + Len("<span class="), Len("text#")) = "text4" Then
        IsRecord = True
    Else
        IsRecord = False
    End If

    StartPos = StartPos + Len("<span class=text#>")
    strText = Mid(SearchText, StartPos, EndPos - StartPos)
    'strText = RemoveHTML(Trim(Mid(SearchText, StartPos, EndPos - StartPos)))
    strText = Replace(strText, "<b>", "")
    strText = Replace(strText, "</b>", "")

    ' Handle your stuff here
    If Not IsRecord Then
        DealerPrefix = ""
        If FirstField Then
            DealerCount = DealerCount + 1
            DealerPrefix = CStr(DealerCount) & "."
        End If

        If InStr(1, LCase(strText), "<a ") = 0 Then
            BRPos = InStr(1, strText, "<br>")
            If BRPos <> 0 Then
                strBR = Split(strText, "<br>")
                For i = 0 To UBound(strBR)
                    If Len(Trim(strBR(i))) <> 0 Then
                        List1.AddItem DealerPrefix & vbTab & Trim(strBR(i))
                        DealerPrefix = ""
                    End If
                Next
            Else
                List1.AddItem DealerPrefix & vbTab & strText
            End If
        End If
    End If

    StartPos = EndPos
Wend
You'll need to finish the clean-up yourself.