Splitting strings

I need to split HTML code. I have the code as a string, and I need to get all the information between "text4" and "text4". There are several strings to extract, so I need to go through the complete page. How do I do this?

I'm sure something incredibly similar to this has been posted before; have a look.

If you can't find it, I'll try to help.

I have been looking

I haven't come up with anything yet. I am trying this code, but I'm not sure how to use it properly.

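Private Sub Command2_Click()
    Dim sTemp As String, tmp As String
    Dim p As Long
    sTemp = Text1.Text

    ' Skip past the whole first marker, not just its first character
    p = InStr(sTemp, "text4")
    If p = 0 Then Exit Sub
    tmp = Mid$(sTemp, p + Len("text4"))
    ' Keep everything up to the next marker, if there is one
    p = InStr(tmp, "text4")
    If p > 0 Then box1.Text = Left$(tmp, p - 1)
End Sub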
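
To walk the complete page rather than stop at the first match, the search can be restarted from the position of the previous marker. This is only a minimal sketch: `TextBetweenMarkers` is a hypothetical name, and it assumes every consecutive pair of markers encloses one piece of wanted text.

```vb
' Hypothetical helper: collects every run of text that sits between
' consecutive occurrences of sMarker in sSource.
Public Function TextBetweenMarkers(ByVal sSource As String, _
                                   ByVal sMarker As String) As Collection
    Dim colResult As New Collection
    Dim lngStart As Long, lngNext As Long

    If Len(sMarker) > 0 Then
        lngStart = InStr(1, sSource, sMarker)
        Do While lngStart > 0
            lngStart = lngStart + Len(sMarker)            ' step past the marker
            lngNext = InStr(lngStart, sSource, sMarker)   ' find the following one
            If lngNext = 0 Then Exit Do                   ' no closing marker left
            colResult.Add Mid$(sSource, lngStart, lngNext - lngStart)
            lngStart = lngNext                            ' this marker opens the next piece
        Loop
    End If

    Set TextBetweenMarkers = colResult
End Function
```

Calling `TextBetweenMarkers(Text1.Text, "text4")` would then give a Collection to loop over with `For Each`.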

Post an extract of the text you're trying to parse.

Here is an extract of the text.

I need to get the "NEEDED INFO" out.

All of the information I need for each record is between "text4" and "text4". Each field needed within the record then starts with "text5".

<td width="20" valign="top">5.
</td><td valign="top">NEEDED INFO(COMPANY)</td><td><img src="../images/pixel.gif" width="1" height="1"></td><td align="right" rowspan="3">
<table border="0" cellpadding="0" cellspacing="0"><tr><td><A href="map.asp?SEARCH_AFFILIATE_DATA.x=0&DBAFFILIATEID=%7B3EA51E13%2DC923%2D4FA5%2DBCEB%2DF693C623D24F%7D&INNERCODE=072595A101A&language=En" title="Street Level Map"><img width="25" height="22" border="0" src="../images/directorymap.gif" alt="Street Level Map"></A></td><td><A href="iti.asp?Search_iti.x=0&DBAFFILIATEID=%7B3EA51E13%2DC923%2D4FA5%2DBCEB%2DF693C623D24F%7D&INNERCODE=072595A101A&ITI_START_COUNTRYCODE=US&ITI_START_CITYNAME=COLUMBUS&ITI_START_ZIPCODE=39702&ITI_START_STATE=MS&ITI_START_ADDRESS=&SESSIONID={1F5033B2-C668-43E4-A89B-725ADC1FD992}"><img width="25" height="22" border="0" src="../images/directorydriveme.gif"></A></td></tr></table></td></tr><tr><td width="20"><img src="../images/pixel.gif" width="10" height="1"></td><td colspan="2">NEEDED INFO(ADDRESS1) NEEDED INFO(ADDRESS2)</td></tr><tr><td width="20"><img src="../images/pixel.gif" width="10" height="1"></td><td colspan="2">NEEDED INFO (CITY<STATE<ZIP) </td></tr><tr><td width="20"><img src="../images/pixel.gif" width="10" height="1"></td><td colspan="3">
NEEDEDINFO (PHONE AND FAX)</td></tr><tr><td width="20"><img src="../images/pixel.gif" width="10" height="1"></td><td colspan="3"><A target="_blank" href="http://www.caseih.com/DEALERS/johnsonimp">http://www.caseih.com/DEALERS/johnsonimp</A></td></tr><tr><td colspan="4"><img src="../images/pixel.gif" width="1" height="10"></td></tr></table></td><td><img src="../images/pixel.gif" width="8" height="10"></td><td class="bordersearch"><img src="../images/pixel.gif" width="1" height="1"></td></tr><tr><td class="bordersearch"><img src="../images/pixel.gif" width="1" height="1"></td><td colspan="3" class="bordersearch"><img src="../images/pixel.gif" width="1" height="1"></td><td class="bordersearch"><img src="../images/pixel.gif" width="1" height="1"></td></tr><tr><td class="bordersearch"><img src="../images/pixel.gif" width="1" height="1"></td><td><img src="../images/pixel.gif" width="8" height="10"></td><td valign="top"><table width="100%" border="0" cellpadding="0" cellspacing="0"><tr><td width="20" valign="top">6.

So what should be the final output? You need to eliminate some tags and the info nested in them?

So you want almost all of the html, starting with ">5. and ending with ">6. ?

That doesn't seem right to me...

If you look, all of the needed info is between "text4" and "text4".

This is a contact record, so between the text4 and text4 there are several items that start with text5. I need to pull the name, address, city, state, zip, email and website, all of which start with text5 (within text4).

I have this code lying around... what it does is remove all HTML tags leaving the text.

VB Code:

Private Sub Command1_Click()
 
   Dim strWorking As String
   Dim strOutput As String
   Dim lngPosLessThan As Long
   
   strWorking = Text1.Text
   Do While Len(strWorking) > 0
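      If Left(strWorking, 1) = "<" Then
         'Unclosed tag: stop here, otherwise the loop would never end
         If InStr(1, strWorking, ">") = 0 Then Exit Do
         strWorking = Mid(strWorking, InStr(1, strWorking, ">") + 1)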
      Else
         lngPosLessThan = InStr(1, strWorking, "<")
         If lngPosLessThan > 0 Then
            'Move non-Tag string to strOutput
            strOutput = strOutput & Left(strWorking, lngPosLessThan - 1)
            strWorking = Mid(strWorking, InStr(1, strWorking, "<"))
         Else   'no other tag in string.
            strOutput = strOutput & strWorking
            strWorking = ""
         End If
      End If
   Loop
   Text2.Text = strOutput
End Sub

In case you're willing to enhance it to fit your needs.

This was my method for stripping HTML tags, may also be of some use

VB Code:

Function RemoveHTML(strHTML As String) As String
Do
    tagOpen = InStr(1, strHTML, "<")
    tagClose = InStr(1, strHTML, ">")
    strHTML = Replace(strHTML, Mid(strHTML, tagOpen, tagClose - tagOpen + 1), "")
    Debug.Print strHTML
Loop Until InStr(1, strHTML, "<") = 0
RemoveHTML = strHTML
End Function

If it doesn't help at all, I'll write some custom code for you ;/

hi,

if each record is separated by text4 then

use this

split(entiretext,"text4")

Quote:

Originally posted by da_silvy
This was my method for stripping HTML tags, may also be of some use

Can I make a small suggestion?

VB Code:

Function RemoveHTML(strHTML As String) As String
Do
    tagOpen = InStr(1, strHTML, "<")
    tagClose = InStr(tagOpen, strHTML, ">")
    strHTML = Replace(strHTML, Mid(strHTML, tagOpen, tagClose - tagOpen + 1), "")
    Debug.Print strHTML
Loop Until InStr(1, strHTML, "<") = 0
RemoveHTML = strHTML
End Function

Wouldn't that tiny change stop it from matching a stray '>' that comes before tagOpen?

Just a thought :D
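
To make the stray '>' case concrete: in a string such as `5 > 3 <b>yes</b>` the first '>' sits before the first '<', so searching for '>' from position 1 yields a negative length for Mid. The sketch below is a hypothetical variant (`StripTags` is an assumed name, not code from this thread) that searches from the '<' onwards, declares its variables and stops cleanly on an unclosed tag.

```vb
' Hypothetical variant of RemoveHTML, for illustration only
Public Function StripTags(ByVal sHtml As String) As String
    Dim lngOpen As Long, lngClose As Long

    lngOpen = InStr(1, sHtml, "<")
    Do While lngOpen > 0
        ' Search for ">" from the "<" onwards so an earlier ">" is ignored
        lngClose = InStr(lngOpen, sHtml, ">")
        If lngClose = 0 Then Exit Do          ' unclosed tag: leave the rest alone
        ' Cut out just this tag and keep the text on both sides
        sHtml = Left$(sHtml, lngOpen - 1) & Mid$(sHtml, lngClose + 1)
        lngOpen = InStr(lngOpen, sHtml, "<")  ' resume where the tag was removed
    Loop

    StripTags = sHtml
End Function
```

Unlike the Replace-based versions, this removes one tag per pass at a known position, which keeps the loop's progress easy to reason about.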

Far from perfect but it should do what you want, you'll need to tweak it yourself

VB Code:

Private Sub Form_Load()
    ' Long rather than Integer: positions in a page can exceed 32,767
    Dim StartPos As Long, EndPos As Long
    Dim SearchText As String, IsRecord As Boolean
    Dim strText As String
    StartPos = 1
    
    SearchText = Text1
 
    SearchText = Replace(SearchText, vbCrLf, " ")
    ' Clean up double spaces ?
    While InStr(1, SearchText, "  ")
        SearchText = Replace(SearchText, "  ", " ")
    Wend
 
    While InStr(StartPos, SearchText, "<span class=""") <> 0
        StartPos = InStr(StartPos, SearchText, "<span class=""")
        EndPos = InStr(StartPos, SearchText, "</span>")
        
        If Mid(SearchText, StartPos + Len("<span class="""), Len("text#")) = "text4" Then
            IsRecord = True
        Else
            IsRecord = False
        End If
        
        StartPos = StartPos + Len("<span class=""text#"">")
        'strText = Mid(SearchText, StartPos, EndPos - StartPos)
        strText = RemoveHTML(Trim(Mid(SearchText, StartPos, EndPos - StartPos)))
        
        ' Handle your stuff here
        If IsRecord Then
            List1.AddItem strText
        Else
            List1.AddItem vbTab & strText
        End If
        
        StartPos = EndPos
    Wend
 
    Text1 = SearchText
 
End Sub
 
 
Function RemoveHTML(strHTML As String) As String
Do
    If InStr(1, strHTML, "<") = 0 Or InStr(1, strHTML, ">") = 0 Then Exit Do
    tagOpen = InStr(1, strHTML, "<")
    tagClose = InStr(1, strHTML, ">")
    strHTML = Replace(strHTML, Mid(strHTML, tagOpen, tagClose - tagOpen + 1), "")
    Debug.Print strHTML
Loop Until InStr(1, strHTML, "<") = 0
RemoveHTML = strHTML
End Function

Well-formed XHTML is actually valid XML (HTML in general is not). If the page is well formed, you should be able to use an XML parser to analyse the HTML file.

(saves you writing awkward code)
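
A minimal sketch of the XML-parser route, assuming MSXML is installed and the page really is well-formed XML. `SpanTexts` is a hypothetical helper; the class names passed to it (such as "text5") are the ones mentioned in this thread.

```vb
' Hypothetical helper: returns the text of every <span> with the given class
Public Function SpanTexts(ByVal sXhtml As String, ByVal sClass As String) As Collection
    Dim doc As Object, node As Object
    Dim colTexts As New Collection

    ' Late-bound, so no project reference is needed; assumes MSXML is installed
    Set doc = CreateObject("MSXML2.DOMDocument")
    doc.async = False
    doc.setProperty "SelectionLanguage", "XPath"

    ' loadXML returns False for markup that is not well-formed XML
    If doc.loadXML(sXhtml) Then
        For Each node In doc.selectNodes("//span[@class='" & sClass & "']")
            colTexts.Add node.Text
        Next node
    End If

    Set SpanTexts = colTexts
End Function
```

Real-world pages often fail the `loadXML` step (unclosed tags, unquoted attributes, undeclared entities), in which case this returns an empty Collection and string handling is still needed.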

Quote:
Originally posted by Spajeoly
Can I make a small suggestion?

heh, it was a 5-minute function for a codebank reply. Ideally it shouldn't actually happen; any actual ">"s and "<"s would be HTML entities, i.e. &gt; etc.

it's a tighter piece of code with that small adjustment though, gw ;)

the code isn't perfect, it fails if there's no < or > in there either, for which there's a small fix in my last post ;p

Quote:

Originally posted by jayakumar
hi,

if each record is separated by text4 then

use this

split(entiretext,"text4")

This way is easiest.

It doesn't actually do what he wants, if you check the rest of the posts.

I couldn't send this via PM so here it is :p

VB Code:

    ' Long rather than Integer: positions in a page can exceed 32,767
    Dim StartPos As Long, EndPos As Long, BRPos As Long
    Dim SearchText As String, IsRecord As Boolean, strBR() As String
    Dim FirstField As Boolean, DealerCount As Integer, DealerPrefix As String
    Dim strText As String
    
    Dim strInput As String
    Open "c:\caseih.txt" For Input As #1
 
    
    Do Until EOF(1)
        ' Line Input reads a whole line; Input # would split it at commas
        Line Input #1, strInput
        SearchText = SearchText & strInput & vbCrLf
    Loop
    
    Close #1
 
    StartPos = 1
    DealerCount = 0
 
    
    SearchText = Mid(SearchText, InStr(1, SearchText, "../images/powered.gif"))
    SearchText = Replace(SearchText, vbCrLf, "")
    
    While InStr(StartPos, SearchText, "<span class=""") <> 0
        StartPos = InStr(StartPos, SearchText, "<span class=""")
        EndPos = InStr(StartPos, SearchText, "</span>")
        
        FirstField = False
        If IsRecord Then FirstField = True
        If Mid(SearchText, StartPos + Len("<span class="""), Len("text#")) = "text4" Then
            IsRecord = True
        Else
            IsRecord = False
        End If
        
        StartPos = StartPos + Len("<span class=""text#"">")
        strText = Mid(SearchText, StartPos, EndPos - StartPos)
        'strText = RemoveHTML(Trim(Mid(SearchText, StartPos, EndPos - StartPos)))
        
        strText = Replace(strText, "<b>", "")
        strText = Replace(strText, "</b>", "")
        
        ' Handle your stuff here
        If Not IsRecord Then
            DealerPrefix = ""
            If FirstField Then
                DealerCount = DealerCount + 1
                DealerPrefix = CStr(DealerCount) & "."
            End If
            
            If InStr(1, LCase(strText), "<a ") = 0 Then
                BRPos = InStr(1, strText, "<br>")
                If BRPos <> 0 Then
                    strBR = Split(strText, "<br>")
                    For i = 0 To UBound(strBR)
                        If Len(Trim(strBR(i))) <> 0 Then
                            List1.AddItem DealerPrefix & vbTab & Trim(strBR(i))
                            DealerPrefix = ""
                        End If
                    Next
                Else
                    List1.AddItem DealerPrefix & vbTab & strText
                End If
            End If
        End If
        StartPos = EndPos
    Wend

I saved the website to a text file then used that, it parses it all, you just need to decide what you want to do with the information.
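
For a saved page, the whole file can also be pulled into a string in one read instead of line by line. A minimal sketch, assuming a plain ANSI text file; `ReadWholeFile` is a hypothetical helper name.

```vb
' Hypothetical helper: returns the whole file as one string
Public Function ReadWholeFile(ByVal sPath As String) As String
    Dim intFile As Integer
    Dim sBuffer As String

    intFile = FreeFile                      ' next unused file number
    Open sPath For Binary Access Read As #intFile
    sBuffer = Space$(LOF(intFile))          ' size the buffer to the file length
    Get #intFile, , sBuffer                 ' fill it in a single read
    Close #intFile

    ReadWholeFile = sBuffer
End Function
```

With this, `SearchText = ReadWholeFile("c:\caseih.txt")` would replace the Open / Do Until EOF / Close section, and commas and quotes in the page are kept exactly as saved.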

Hey
It looks like it will work, but on this line I get an error.

PHP Code:

SearchText = Mid(SearchText, InStr(1, SearchText, "../images/powered.gif"))

The error says: Invalid procedure call or argument. Run-time error 5.

I was parsing the entire website ;o

There are occurrences of spans using that class before the dealer information.

That "powered.gif" is the powered by map blah blah, and the dealers follow...

You can just remove that line, the code's no different (really) from what I posted before.
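
Run-time error 5 on that line is consistent with the marker not being in the text: InStr then returns 0, and Mid does not accept a start position of 0. A guarded sketch, where `SkipTo` is a hypothetical helper name:

```vb
' Hypothetical helper: trims everything before sMarker, or returns the
' text unchanged when the marker is absent
Public Function SkipTo(ByVal sText As String, ByVal sMarker As String) As String
    Dim lngPos As Long

    lngPos = InStr(1, sText, sMarker, vbTextCompare)   ' case-insensitive search
    If lngPos > 0 Then
        SkipTo = Mid$(sText, lngPos)
    Else
        SkipTo = sText                                 ' InStr returned 0: nothing to skip
    End If
End Function
```

The line in question would then read `SearchText = SkipTo(SearchText, "../images/powered.gif")` and would no longer fail on pages that lack the image.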

I guess I am still doing something wrong.

Program goes from

PHP Code:

While InStr(StartPos, SearchText, "<span class=""") <> 0

to

PHP Code:

End Sub

here is complete code .

PHP Code:

Private Sub Command1_Click()
    Dim StartPos As Integer, EndPos As Integer, BRPos As Integer
    Dim SearchText As String, IsRecord As Boolean, strBR() As String
    Dim FirstField As Boolean, DealerCount As Integer, DealerPrefix As String
    Dim strText As String

    Dim strInput As String
    Open "c:\caseih.txt" For Input As #1

    Do Until EOF(1)
        Input #1, strInput
        SearchText = SearchText & strInput & vbCrLf
    Loop

    Close #1

    StartPos = 1
    DealerCount = 0

    'SearchText = Mid(SearchText, InStr(1, SearchText, "../images/powered.gif"))
    SearchText = Replace(SearchText, vbCrLf, "")

    While InStr(StartPos, SearchText, "<span class=""") <> 0
        StartPos = InStr(StartPos, SearchText, "<span class=""")
        EndPos = InStr(StartPos, SearchText, "</span>")

        FirstField = False
        If IsRecord Then FirstField = True
        If Mid(SearchText, StartPos + Len("<span class="""), Len("text#")) = "text4" Then
            IsRecord = True
        Else
            IsRecord = False
        End If

        StartPos = StartPos + Len("<span class=""text#"">")
        strText = Mid(SearchText, StartPos, EndPos - StartPos)
        'strText = RemoveHTML(Trim(Mid(SearchText, StartPos, EndPos - StartPos)))

        strText = Replace(strText, "<b>", "")
        strText = Replace(strText, "</b>", "")

        ' Handle your stuff here
        If Not IsRecord Then
            DealerPrefix = ""
            If FirstField Then
                DealerCount = DealerCount + 1
                DealerPrefix = CStr(DealerCount) & "."
            End If

            If InStr(1, LCase(strText), "<a ") = 0 Then
                BRPos = InStr(1, strText, "<br>")
                If BRPos <> 0 Then
                    strBR = Split(strText, "<br>")
                    For i = 0 To UBound(strBR)
                        If Len(Trim(strBR(i))) <> 0 Then
                            List1.AddItem DealerPrefix & vbTab & Trim(strBR(i))
                            DealerPrefix = ""
                        End If
                    Next
                Else
                    List1.AddItem DealerPrefix & vbTab & strText
                End If
            End If
        End If
        StartPos = EndPos
    Wend
End Sub

have you set the search text or anything?

you can't use my code "as is", you have to do some work ._.

searchtext

What do you mean by setting the SearchText? It is the string of HTML.
I'm sorry, but I don't understand what to do.

thanks for your help.

well you have to adjust the code for what you are doing

where are you getting the text which you want to parse from?

Split([entireHTMLstring],"text4") gives you an array containing
the text between all the text4's.

If you then run Split(text4array(i), "text5") on each element (1, 2, 3, 4, etc.), it will give you the text between all the text5's.

course, then you gotta clean it up, cause I don't think you want half-tags in it.

use the replace function.
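
The two-level Split plus Replace clean-up could look roughly like this. It is a sketch under assumptions: `CleanPiece` and `DumpFields` are hypothetical names, and the tag fragments being removed assume markup of the form `<span class="text5">value</span>`, which may not match the real page.

```vb
' Hypothetical clean-up for one Split piece; the fragments listed here
' assume markup of the form <span class="text5">value</span>
Public Function CleanPiece(ByVal sPiece As String) As String
    sPiece = Replace(sPiece, "<span class=""", "")   ' half-tag left at the end
    sPiece = Replace(sPiece, "</span>", "")
    sPiece = Replace(sPiece, """>", "")              ' half-tag left at the start
    CleanPiece = Trim$(sPiece)
End Function

Public Sub DumpFields(ByVal sPage As String)
    Dim sRecords() As String, sFields() As String
    Dim i As Long, j As Long

    sRecords = Split(sPage, "text4")
    For i = 1 To UBound(sRecords)            ' element 0 precedes the first text4
        sFields = Split(sRecords(i), "text5")
        For j = 1 To UBound(sFields)         ' element 0 is the text4 span itself
            Debug.Print "Record " & i & ", field " & j & ": " & CleanPiece(sFields(j))
        Next j
    Next i
End Sub
```

Any other tags inside a field (links, line breaks, bold) would still need their own Replace calls or a tag-stripping function.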

da_silvy,
here is the text I am trying to parse.

Any help is appreciated.

That website's a joke; it's not valid HTML Transitional at all.

VB Code:

    ' Long is the usual type for string positions and avoids Integer overflow
    Dim StartPos As Long, EndPos As Long, BRPos As Long
    Dim SearchText As String, IsRecord As Boolean, strBR() As String
    Dim FirstField As Boolean, DealerCount As Integer, DealerPrefix As String
    Dim strText As String
    
    Dim strInput As String
    Open "c:\caseih.htm" For Input As #1
 
    
    Do Until EOF(1)
        ' Line Input reads a whole line; Input # would split it at commas
        Line Input #1, strInput
        SearchText = SearchText & strInput & vbCrLf
    Loop
    
    Close #1
 
    StartPos = 1
    DealerCount = 0
 
    
    SearchText = Mid(SearchText, InStr(1, SearchText, "/powered.gif"))
    SearchText = Replace(SearchText, vbCrLf, "")
    
    While InStr(StartPos, LCase(SearchText), "<span class=") <> 0
        StartPos = InStr(StartPos, LCase(SearchText), "<span class=")
        EndPos = InStr(StartPos, LCase(SearchText), "</span>")
        
        FirstField = False
        If IsRecord Then FirstField = True
        If Mid(LCase(SearchText), StartPos + Len("<span class="), Len("text#")) = "text4" Then
            IsRecord = True
        Else
            IsRecord = False
        End If
        
        StartPos = StartPos + Len("<span class=text#>")
        strText = Mid(SearchText, StartPos, EndPos - StartPos)
        'strText = RemoveHTML(Trim(Mid(SearchText, StartPos, EndPos - StartPos)))
        
        strText = Replace(strText, "<b>", "")
        strText = Replace(strText, "</b>", "")
        
        ' Handle your stuff here
        If Not IsRecord Then
            DealerPrefix = ""
            If FirstField Then
                DealerCount = DealerCount + 1
                DealerPrefix = CStr(DealerCount) & "."
            End If
            
            If InStr(1, LCase(strText), "<a ") = 0 Then
                BRPos = InStr(1, strText, "<br>")
                If BRPos <> 0 Then
                    strBR = Split(strText, "<br>")
                    For i = 0 To UBound(strBR)
                        If Len(Trim(strBR(i))) <> 0 Then
                            List1.AddItem DealerPrefix & vbTab & Trim(strBR(i))
                            DealerPrefix = ""
                        End If
                    Next
                Else
                    List1.AddItem DealerPrefix & vbTab & strText
                End If
            End If
        End If
        StartPos = EndPos
    Wend

You'll need to fix up the clean-up yourself.