[RESOLVED] Parse HTML with Regex

**dogfighter** · Feb 18th, 2009, 11:44 AM

Thanks to threads found on this forum I've been able to grab a web page and place it into a document. I now have certain text that I need to extract from the page. This text is always surrounded by the same HTML tags.

<h3><a id=link1 href="http://www.site.com"><b>Listing Title</b></a></h3><cite>MyWebsite.com</cite>      Come visit my awesome website!<li class

I cut off at "<li class" because the actual class is not the same from one to the next.

The things I need to grab from this block are, in order:

link1
Listing Title
Come visit my awesome website!
MyWebsite.com

There are 10-15 of these blocks on a page, so I'd like to loop this and store each to its own set of text boxes...

text1.text = link1
text2.text = Listing Title

etc etc..

I'm slightly familiar with PHP so I know I have to do this with regex, but I'd really appreciate some help figuring out how to go about actually putting this together.

**Zach_VB6** · Feb 18th, 2009, 12:14 PM

You could just use a Mid$() function to parse through those, no need for Regex :P

Code:

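Public Function GB(rC As String, rS As String, rF As String, Optional lgB As Long = 1) As String
    Dim lgS As Long, lgE As Long
    lgS = InStr(lgB, rC, rS)                    ' locate the start string
    If lgS = 0 Then Exit Function               ' start string missing: return ""
    lgB = lgS + Len(rS)                         ' first character after the start string
    lgE = InStr(lgB, rC, rF)                    ' locate the end string
    If lgE > 0 Then GB = Mid$(rC, lgB, lgE - lgB)
End Function

GB("abcdef", "ab", "ef") returns "cd"

If either string is not found in rC, it returns an empty string.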
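To tie this back to the original question, here is a rough sketch of how the same InStr/Mid$ idea could walk through every listing block on the page. It assumes each block looks exactly like the sample in the first post (unquoted id, one `<b>` title, one `<cite>`, then the description up to `<li class`). The names NextBetween and ListListings are made up for this example, and the results go to the Immediate window rather than to text boxes.

```vb
' Hypothetical helper: returns the text between sStart and sEnd, searching
' sHtml from lPos. On success lPos is moved to the start of sEnd, so sEnd can
' also serve as the start string of the next call. Returns "" if either string
' is missing, leaving lPos unchanged.
Private Function NextBetween(sHtml As String, sStart As String, sEnd As String, lPos As Long) As String
    Dim lS As Long, lE As Long
    lS = InStr(lPos, sHtml, sStart)
    If lS = 0 Then Exit Function
    lS = lS + Len(sStart)
    lE = InStr(lS, sHtml, sEnd)
    If lE = 0 Then Exit Function
    NextBetween = Mid$(sHtml, lS, lE - lS)
    lPos = lE
End Function

' Hypothetical driver: prints the four fields of every listing block.
Public Sub ListListings(sHtml As String)
    Dim lPos As Long, lStart As Long
    Dim sId As String, sTitle As String, sSite As String, sDesc As String

    lPos = 1
    Do
        lStart = InStr(lPos, sHtml, "<h3><a id=")
        If lStart = 0 Then Exit Do              ' no more blocks
        lPos = lStart

        sId = NextBetween(sHtml, "<h3><a id=", " href=", lPos)
        sTitle = NextBetween(sHtml, "<b>", "</b>", lPos)
        sSite = NextBetween(sHtml, "<cite>", "</cite>", lPos)
        ' Trim$ removes the leading spaces only; tabs and line breaks would remain.
        sDesc = Trim$(NextBetween(sHtml, "</cite>", "<li class", lPos))

        Debug.Print sId & " | " & sTitle & " | " & sDesc & " | " & sSite

        ' Guarantee progress even if none of the fields were found.
        If lPos <= lStart Then lPos = lStart + 1
    Loop
End Sub
```

Inside the loop the Debug.Print line could be swapped for assignments to a control array of text boxes. One limitation: if a block is missing one of the tags, the search for that field can run on into the next block.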

**dogfighter** · Feb 18th, 2009, 04:32 PM

The simple thought of not having to use regex makes me tingle. Thanks for the tip, I'll try it and report back with the results!

**dogfighter** · Feb 18th, 2009, 04:59 PM

Ok it works, but there are pound signs and forward slashes in the areas of code I'm trying to match between that cause it to break...

I tried putting the two strings I'm matching between into string variables, but as soon as I add the pound sign, it starts returning the wrong match.

For example:

Starting the match with "<h3><a id=link1 href=" works fine

Starting the match with "<h3><a id=link1 href=#" causes it to return the wrong match... WAY wrong, like not even in the neighborhood of the string I'm looking for
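For anyone hitting something similar: InStr compares text literally, so a pound sign or forward slash has no special meaning to it. When adding a character changes the result this much, it may be worth checking whether the longer start string occurs in the page at all. The sketch below does only that; CheckMarker is a made-up name for this example.

```vb
' Hypothetical diagnostic: reports whether sMarker occurs in sHtml and, if so,
' shows up to 60 characters of the page starting at that position.
Public Sub CheckMarker(sHtml As String, sMarker As String)
    Dim lPos As Long
    lPos = InStr(1, sHtml, sMarker, vbBinaryCompare)   ' exact, case-sensitive search
    If lPos = 0 Then
        Debug.Print "Marker not found: " & sMarker
    Else
        Debug.Print "Marker found at position " & lPos & ": " & Mid$(sHtml, lPos, 60)
    End If
End Sub
```

Keep in mind that a literal double quote inside a VB string has to be doubled, so a marker meant to match `href="#` in the page would be written as `"href=""#"` in code.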

**su ki** · Feb 19th, 2009, 02:45 AM

Hey dogfighter,
since you said you're familiar with regular expressions, here is a sample for using them in VB.
Use the following function
and pass it the pattern and the text to be parsed.

vb Code:

Function TestRegExp(sPattern As String, sText As String) As String
    ' Needs a project reference to "Microsoft VBScript Regular Expressions 5.5".
    Dim oRegExp As RegExp
    Dim oMatch As Match
    Dim oMatches As MatchCollection
    Dim sOutput As String

    Set oRegExp = New RegExp

    oRegExp.Pattern = sPattern
    oRegExp.IgnoreCase = True
    oRegExp.Global = True       ' find every match, not just the first

    If oRegExp.Test(sText) Then
        Set oMatches = oRegExp.Execute(sText)
        For Each oMatch In oMatches
            ' FirstIndex is zero-based
            sOutput = sOutput & "Match found at position "
            sOutput = sOutput & oMatch.FirstIndex & ". Match Value is '"
            sOutput = sOutput & oMatch.Value & "'." & vbCrLf
        Next
    Else
        sOutput = "String Matching Failed"
    End If
    TestRegExp = sOutput
End Function
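As a rough illustration of how the same RegExp object could pull out the four fields from the first post, the sketch below uses capture groups and reads them back through SubMatches. The pattern is only a guess based on the sample block (unquoted id, quoted href, no line breaks inside a block), and PrintListings is a made-up name. It needs the same "Microsoft VBScript Regular Expressions 5.5" reference.

```vb
' Hypothetical example: prints id, title, description and site for each block.
Public Sub PrintListings(sHtml As String)
    Dim oRegExp As RegExp
    Dim oMatches As MatchCollection
    Dim oMatch As Match

    Set oRegExp = New RegExp
    ' Groups: (1) id, (2) title, (3) site name, (4) description.
    ' Double quotes inside the VB string are doubled.
    oRegExp.Pattern = "<h3><a id=(\S+) href=""[^""]*""><b>(.*?)</b></a></h3>" & _
                      "<cite>(.*?)</cite>\s*(.*?)<li class"
    oRegExp.IgnoreCase = True
    oRegExp.Global = True

    Set oMatches = oRegExp.Execute(sHtml)
    For Each oMatch In oMatches
        ' SubMatches is zero-based: index 0 is the first capture group.
        Debug.Print oMatch.SubMatches(0) & " | " & oMatch.SubMatches(1) & " | " & _
                    oMatch.SubMatches(3) & " | " & oMatch.SubMatches(2)
    Next
End Sub
```

Note that `.` in this engine does not match a line break, so a block that spans several lines in the page source would not match this pattern as written.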

**dogfighter** · Feb 19th, 2009, 10:23 AM

Appreciate it suki, but I'd like to avoid regex if I can. Zach's method was working just fine until I included that pound sign.

Can anyone shed some light on a way around this?

**dogfighter** · Feb 20th, 2009, 01:21 PM

Nvm, I was missing something in my match string, my error. Thanks to Zach and suki for your help.
