-
Jun 25th, 2009, 09:57 PM
#1
Thread Starter
Fanatic Member
Get the Text inbetween two words (such as HTML Tags) without RegEx
Since I've seen this asked a lot, here is a function for getting all the text in between two other strings (tags, words, etc.).
Code:
Private Function GetTagContents(ByVal Source As String, ByVal startTag As String, ByVal endTag As String) As List(Of String)
    Dim StringsFound As New List(Of String)
    Dim Index As Integer = Source.IndexOf(startTag) + startTag.Length
    While Index <> startTag.Length - 1
        StringsFound.Add(Source.Substring(Index, Source.IndexOf(endTag, Index) - Index))
        Index = Source.IndexOf(startTag, Index) + startTag.Length
    End While
    Return StringsFound
End Function
Example Scenario:
If Source is set to "I {b}love{/b} the word {b}life{/b} don't you?" and you pass "{b}" and "{/b}" as the starting and ending tags, the List {"love","life"} is returned. If the start tag doesn't appear at all in the Source string, the list's Count will be 0.
Explanation:
The first 2 lines are just variable declarations, although in the second one we go ahead and search for our first match with:
Code:
Source.IndexOf(startTag) + startTag.Length
As you can see, it's just a normal IndexOf, which gives us the index of the first start tag, but then I add startTag.Length. The reason for this addition is that we don't want the index of the startTag; we want the index of the text after the startTag, so adding the length of startTag to its index gives us what comes directly after it.
Then comes the While Loop. Our condition is:
Code:
While Index <> startTag.Length - 1
As you know, when IndexOf can't find the string, it returns -1. We can't just write "While Index <> -1" because we always add startTag.Length to the IndexOf result to get the index of the text after the tag. So "not found" shows up as -1 + startTag.Length, which the code writes as startTag.Length - 1 because it reads more easily.
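To make that arithmetic concrete, here is a minimal sketch. The module name SentinelDemo and the helper FirstContentIndex are made up for illustration and are not part of the function above. With the three-character tag "{b}", a failed search gives -1 + 3, which equals startTag.Length - 1.

```vbnet
Imports System

Module SentinelDemo
    ' Hypothetical helper: index of the text right after the first startTag.
    ' When the tag is missing, IndexOf gives -1, so the result is startTag.Length - 1.
    Public Function FirstContentIndex(ByVal Source As String, ByVal startTag As String) As Integer
        Return Source.IndexOf(startTag) + startTag.Length
    End Function

    Sub Main()
        Dim startTag As String = "{b}"
        ' Tag found at index 2, so the text after it starts at 2 + 3.
        Console.WriteLine(FirstContentIndex("I {b}love{/b}", startTag))
        ' Tag not found: -1 + 3, the same value as the loop's exit condition.
        Console.WriteLine(FirstContentIndex("no tags here", startTag))
        Console.WriteLine(startTag.Length - 1)
    End Sub
End Module
```

A real match can never produce this value, because a successful IndexOf is at least 0 and so the sum is at least startTag.Length.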
Then comes the first line of the loop:
Code:
StringsFound.Add(Source.Substring(Index, Source.IndexOf(endTag, Index) - Index))
It starts off with StringsFound.Add, so as you can tell we're going to add the string we just found to the list. If no start tag was found then the loop never runs, thanks to its condition. We still don't have the string to add, just its starting index, so inside the Add call we also extract the text between the tags. We take a Substring of Source starting at Index, which we already know from when we declared Index. For the length of the substring, we search for the endTag using IndexOf and subtract Index from the result. Notice that we don't add the endTag's length like we did with startTag: the index of endTag is the position just past the last character of the string we need, so we don't need to change it. Be aware that if the endTag is missing, IndexOf returns -1, the length becomes negative, and Substring throws an ArgumentOutOfRangeException.
Notice that there is an extra parameter in this IndexOf, though. Later we're going to move on to the next set of tags, and we don't want to get the index of the same endTag every time! The second parameter is the index at which to start looking for the endTag. That's easy: we want to start looking for the endTag right where the text starts, so we just use Index.
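A small sketch of what the second parameter changes, using the example string from the scenario above. The module name StartIndexDemo is made up for illustration, and LastIndexOf is used here only as a shortcut to reach the second "{b}".

```vbnet
Imports System

Module StartIndexDemo
    Sub Main()
        Dim Source As String = "I {b}love{/b} the word {b}life{/b} don't you?"
        Dim endTag As String = "{/b}"
        ' Index of the text after the second "{b}" (the "l" in "life").
        Dim Index As Integer = Source.LastIndexOf("{b}") + "{b}".Length

        ' Without a start index the search begins at 0 and finds the first "{/b}".
        Console.WriteLine(Source.IndexOf(endTag))
        ' With the start index the search skips ahead and finds the "{/b}" after "life".
        Console.WriteLine(Source.IndexOf(endTag, Index))
    End Sub
End Module
```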
Then comes the last line of the loop:
Code:
Index = Source.IndexOf(startTag, Index) + startTag.Length
It's nearly the same as the line where we declared Index; in fact it does the same thing. Can you spot the difference? Yes, there is a second parameter for this IndexOf. We do that for the same reason as in the first line of the loop: we don't want to find the same startTag over and over again. Since Index points at the text right after the startTag we just found, searching from Index finds the next startTag after it.
And that's it for the loop. The last line simply returns the List(Of String), StringsFound, as the result of the function.
Last edited by Vectris; Sep 8th, 2009 at 07:35 PM.
-
Jun 29th, 2009, 06:39 AM
#2
Addicted Member
Re: Get the Text inbetween two words (such as HTML Tags) without RegEx
What if there is more than one dog tag? How do I get all of them?
asdf<lol>rrrrr<dog>ruff</dog>akdje</lol><dog>mouse</dog>
-
Jun 29th, 2009, 11:48 AM
#3
Thread Starter
Fanatic Member
Re: Get the Text inbetween two words (such as HTML Tags) without RegEx
You could probably modify the code somehow to do that, or learn the RegEx way. If you want to try this way, then look at IndexOf() and its second parameter, the starting index.
-
Jun 29th, 2009, 01:39 PM
#4
Addicted Member
Re: Get the Text inbetween two words (such as HTML Tags) without RegEx
I use this and it works for me. It's not the best code, but it's all I could come up with. Sorry for my English.
Vb.net Code:
Private Function GetTagContents(ByVal Source As String, ByVal startTag As String, ByVal endTag As String) As String
    Dim firstIndex As Integer = Source.IndexOf(startTag) + startTag.Length
    Dim text As String = txtString.Text
    txtString.Text = text.Remove(0, Source.IndexOf(startTag) + startTag.Length + (Source.Substring(firstIndex, Source.IndexOf(endTag) - firstIndex)).Length + endTag.Length)
    Return Source.Substring(firstIndex, Source.IndexOf(endTag) - firstIndex)
End Function

Private Sub btnPokaziRez_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles btnPokaziRez.Click
    Dim s As String = txtString.Text
    Dim i As Integer = s.IndexOf(txtPocTag.Text)
    Do While (i <> -1)
        ListBox1.Items.Add(GetTagContents(txtString.Text, txtPocTag.Text, txtZavTag.Text))
        i = s.IndexOf(txtPocTag.Text, i + 1)
    Loop
End Sub
-
Aug 9th, 2009, 06:25 PM
#5
Thread Starter
Fanatic Member
Re: Get the Text inbetween two words (such as HTML Tags) without RegEx
Ok I updated the code so that it will find all the text in between tags, not just the first result. It will now return a List(Of String) with all the text in between tags that it finds.
Don't forget to check the .Count in case no results are found.
-
Aug 12th, 2009, 11:59 AM
#6
Addicted Member
Re: Get the Text inbetween two words (such as HTML Tags) without RegEx
Yes, this is better code than mine. Thanks anyway.
-
Aug 12th, 2009, 01:13 PM
#7
New Member
Re: Get the Text inbetween two words (such as HTML Tags) without RegEx
Sweet. Was just thinking about figuring out how to do this about an hour ago and then I stumble upon this when I wasn't even looking for it. Thanks for the code!
-
Aug 22nd, 2009, 02:20 AM
#8
Hyperactive Member
Re: Get the Text inbetween two words (such as HTML Tags) without RegEx
please ignore.
wrong thread
Last edited by csKanna; Aug 22nd, 2009 at 02:25 AM.
Kanna
-
Aug 22nd, 2009, 05:33 AM
#9
Re: Get the Text inbetween two words (such as HTML Tags) without RegEx
When looking for the endTag, shouldn't you use Index + 1 as the start index in the IndexOf function? When the tags are different it won't be a problem, but if the start and end tags are the same it keeps on finding the same start tag over and over, right? Didn't try it, but that's what I thought lol.
-
Sep 8th, 2009, 01:29 PM
#10
Lively Member
Re: Get the Text inbetween two words (such as HTML Tags) without RegEx
A friend of mine actually created something called GetBetweenAll, which adds a series of parsed strings to an array (for example, if you wanted to grab a bunch of items that were enveloped in the same tags within a table or something).
I don't think he'd mind if I posted it here (it's a little bit inefficient in terms of adding the items to a list, but that's easily modifiable):
Vb.net Code:
Public Sub GBA(ByRef strSource As String, ByRef strStart As String, ByRef strEnd As String, _
               ByVal lstAdd As ListBox, Optional ByRef startPos As Integer = 0)
    Dim iPos As Integer, iEnd As Integer, strResult As String, lenStart As Integer = strStart.Length
    Do Until iPos = -1
        strResult = String.Empty
        iPos = strSource.IndexOf(strStart, startPos)
        ' No further start tag: nothing left to parse.
        If iPos = -1 Then Exit Do
        iEnd = strSource.IndexOf(strEnd, iPos + lenStart)
        ' A start tag with no end tag after it: stop, otherwise startPos
        ' never advances and the loop never ends.
        If iEnd = -1 Then Exit Do
        strResult = strSource.Substring(iPos + lenStart, iEnd - (iPos + lenStart))
        lstAdd.Items.Add(strResult)
        startPos = iPos + lenStart
    Loop
End Sub
-
Sep 8th, 2009, 07:30 PM
#11
Thread Starter
Fanatic Member
Re: Get the Text inbetween two words (such as HTML Tags) without RegEx
@blupig
So you mean it basically does the same thing as my code? I'd rather you post your own topic for it then.
Mine's fewer lines, so I don't think it's as useful, that is, if they do the same thing. Props to your friend for writing it, but I'd rather you make a topic for it.
@Nick
I'll look at that and test it out. There are several other things that could cause problems, such as an unbalanced number of start and end tags. I sort of assume that the user of this code is going to be using it in a good-tag environment where the tag counts match and the start and end tags are different. Still, I'll look at it later and post back with what I find.
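For the two cases raised here (unbalanced tags, and identical start and end tags), one possible defensive variant is sketched below. It is not the code from the first post: the names SafeTagDemo and GetTagContentsSafe are hypothetical, the empty-argument check and the ordinal (exact character) comparison are assumptions of this sketch, and nested tags of the same kind are still not handled. It stops when an end tag is missing, and it resumes searching after the end tag it just consumed so that an end tag is never reused as the next start tag.

```vbnet
Imports System
Imports System.Collections.Generic

Module SafeTagDemo
    ' Hypothetical variant of GetTagContents with two extra safeguards.
    Public Function GetTagContentsSafe(ByVal Source As String, ByVal startTag As String, ByVal endTag As String) As List(Of String)
        Dim StringsFound As New List(Of String)
        ' Assumption: empty arguments give an empty result instead of an error.
        If String.IsNullOrEmpty(Source) OrElse String.IsNullOrEmpty(startTag) OrElse String.IsNullOrEmpty(endTag) Then
            Return StringsFound
        End If

        Dim startIndex As Integer = Source.IndexOf(startTag, StringComparison.Ordinal)
        While startIndex <> -1
            Dim contentIndex As Integer = startIndex + startTag.Length
            Dim endIndex As Integer = Source.IndexOf(endTag, contentIndex, StringComparison.Ordinal)
            ' Unbalanced input: a start tag with no end tag after it.
            If endIndex = -1 Then Exit While
            StringsFound.Add(Source.Substring(contentIndex, endIndex - contentIndex))
            ' Continue after the end tag so it is never taken as a start tag.
            startIndex = Source.IndexOf(startTag, endIndex + endTag.Length, StringComparison.Ordinal)
        End While
        Return StringsFound
    End Function

    Sub Main()
        ' Identical start and end tags.
        For Each s As String In GetTagContentsSafe("a|x|b|y|c", "|", "|")
            Console.WriteLine(s)
        Next
        ' The second start tag is never closed.
        For Each s As String In GetTagContentsSafe("{b}love{/b} and {b}life", "{b}", "{/b}")
            Console.WriteLine(s)
        Next
    End Sub
End Module
```

The trade-off is a few more lines in exchange for never passing a negative length to Substring.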
Last edited by Vectris; Sep 8th, 2009 at 07:33 PM.
-
Dec 8th, 2011, 11:43 PM
#12
New Member
Re: Get the Text inbetween two words (such as HTML Tags) without RegEx
@Vectris, do you have an example of how to use this code? I can't figure out how to apply it right now. I am scraping a web page's contents.
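A minimal usage sketch, assuming the page's HTML is already in a string (how it gets downloaded is outside the scope of the function). The module name UsageDemo and the variable names are made up for illustration; the function body is the one from the first post, made Public so it can be called from Main, and the sample string is the one from post #2.

```vbnet
Imports System
Imports System.Collections.Generic

Module UsageDemo
    ' The function from the first post, unchanged apart from being Public.
    Public Function GetTagContents(ByVal Source As String, ByVal startTag As String, ByVal endTag As String) As List(Of String)
        Dim StringsFound As New List(Of String)
        Dim Index As Integer = Source.IndexOf(startTag) + startTag.Length
        While Index <> startTag.Length - 1
            StringsFound.Add(Source.Substring(Index, Source.IndexOf(endTag, Index) - Index))
            Index = Source.IndexOf(startTag, Index) + startTag.Length
        End While
        Return StringsFound
    End Function

    Sub Main()
        ' Stand-in for scraped page contents.
        Dim html As String = "asdf<lol>rrrrr<dog>ruff</dog>akdje</lol><dog>mouse</dog>"
        Dim results As List(Of String) = GetTagContents(html, "<dog>", "</dog>")

        ' Check Count first in case nothing was found.
        If results.Count = 0 Then
            Console.WriteLine("No matches found.")
        Else
            For Each item As String In results
                Console.WriteLine(item)
            Next
        End If
    End Sub
End Module
```

For this input the list holds "ruff" and "mouse". Real pages often have attributes inside the opening tag (for example a class), in which case the exact startTag string has to match what is actually in the page.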