[RESOLVED] Parse HTML with Regex

Feb 18th, 2009, 11:44 AM
dogfighter

[RESOLVED] Parse HTML with Regex

Thanks to threads found on this forum I've been able to grab a web page and place it into a document. I now have certain text that I need to extract from the page. This text is always surrounded by the same HTML tags.

<h3><a id=link1 href="http://www.site.com"><b>Listing Title</b></a></h3><cite>MyWebsite.com</cite>      Come visit my awesome website!<li class

I cut off at "<li class" because the actual class is not the same from one to the next.

The things I need to grab from this block are, in order:

link1
Listing Title
Come visit my awesome website!
MyWebsite.com

There are 10-15 of these blocks on a page, so I'd like to loop this and store each to its own set of text boxes...

text1.text = link1
text2.text = Listing Title

etc etc..

I'm slightly familiar with PHP so I know I have to do this with regex, but I'd really appreciate some help figuring out how to go about actually putting this together. :confused:
Feb 18th, 2009, 12:14 PM
Zach_VB6

Re: Parse HTML with Regex

You could just use a Mid$() function to parse through those, no need for Regex :P

Code:

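Public Function GB(rC As String, rS As String, rF As String, Optional lgB As Long = 1) As String
    ' Returns the text in rC that lies between rS and rF, searching from lgB
    On Error Resume Next
    ' Move lgB to the first character after the start string
    lgB = InStr(lgB, rC, rS) + Len(rS)
    ' Take everything up to the next occurrence of the end string
    GB = Mid$(rC, lgB, InStr(lgB, rC, rF) - lgB)
End Function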

GB("abcdef", "ab", "ef") returns "cd"

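And if the end string rF is not found in rC, it returns an empty string. It does not check for the start string rS, though: if rS is missing, the search simply starts at character Len(rS) of rC, so it can return unrelated text.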
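
For the loop over 10-15 blocks asked about in the first post, a more defensive take on the same InStr/Mid$ idea might look like the sketch below. GetBetween and ListBlocks are illustrative names that do not appear in the thread, the markers are taken from the sample block in the first post (unquoted id, description ending at "<li class"), and Debug.Print stands in for the text boxes. Treat it as a starting point under those assumptions, not a tested solution for the real page.

```vb
' Sketch only: GetBetween and ListBlocks are illustrative names, not from the thread.

' Returns the text between sStart and sEnd, searching from lPos.
' lPos is passed ByRef: on success it is moved just past sEnd so the next
' call continues from there; if either marker is missing it is set to 0.
Public Function GetBetween(sText As String, sStart As String, sEnd As String, _
                           Optional ByRef lPos As Long = 1) As String
    Dim lFrom As Long
    Dim lTo As Long

    lFrom = InStr(lPos, sText, sStart)
    If lFrom > 0 Then
        lFrom = lFrom + Len(sStart)
        lTo = InStr(lFrom, sText, sEnd)
    End If

    If lFrom = 0 Or lTo = 0 Then
        lPos = 0
        Exit Function
    End If

    GetBetween = Mid$(sText, lFrom, lTo - lFrom)
    lPos = lTo + Len(sEnd)
End Function

' Walks the page block by block and prints the four fields of each block.
Public Sub ListBlocks(sPage As String)
    Dim lPos As Long, lTmp As Long
    Dim sId As String, sTitle As String, sSite As String, sDesc As String

    lPos = 1
    Do
        sId = GetBetween(sPage, "<h3><a id=", " href=", lPos)
        If lPos = 0 Then Exit Do
        sTitle = GetBetween(sPage, "<b>", "</b>", lPos)
        If lPos = 0 Then Exit Do
        lTmp = lPos                      ' look ahead without moving lPos
        sSite = GetBetween(sPage, "<cite>", "</cite>", lTmp)
        sDesc = Trim$(GetBetween(sPage, "</cite>", "<li class", lPos))
        If lPos = 0 Then Exit Do
        Debug.Print sId, sTitle, sDesc, sSite
    Loop
End Sub
```

Setting lPos to 0 on failure gives the caller one thing to test after each call, and it keeps a missing marker from silently producing text from somewhere else in the page.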
Feb 18th, 2009, 04:32 PM
dogfighter

Re: Parse HTML with Regex

The simple thought of not having to use regex makes me tingle. Thanks for the tip, I'll try it and report back with the results!
Feb 18th, 2009, 04:59 PM
dogfighter

Re: Parse HTML with Regex

OK, it works, but the text I'm matching between contains pound signs and forward slashes, and those seem to break it...

I tried putting the start and end markers into strings, but as soon as I add the pound sign to the start marker, it starts returning the wrong match.

For example:

Starting the match with "<h3><a id=link1 href=" works fine

Starting the match with "<h3><a id=link1 href=#" causes it to return the wrong match... WAY wrong, like not even in the neighborhood of the string I'm looking for
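
One plausible cause of this symptom, offered as an assumption rather than something the thread confirms: GB as posted never checks whether the start string was found. When InStr returns 0, the extraction begins at character Len(rS) of the page, which is usually near the very top and nowhere near the intended block. A quick way to test that theory is to check the marker on its own before extracting anything. CheckMarker below is a hypothetical helper, not part of the original code.

```vb
' Hypothetical helper: reports whether a marker occurs verbatim in the text.
Public Sub CheckMarker(sText As String, sMarker As String)
    Dim lAt As Long

    ' Binary compare, so case and every character must match exactly
    lAt = InStr(1, sText, sMarker, vbBinaryCompare)

    If lAt = 0 Then
        Debug.Print "Marker not found: " & sMarker
    Else
        Debug.Print "Marker found at position " & lAt
    End If
End Sub
```

If the longer marker reports "not found" while the shorter one is found, the extra characters do not appear in the page exactly as typed (for instance, a quote or space between them), and the pound sign itself is not the problem.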

Re: Parse HTML with Regex

Hey dogfighter,
Since you said you're familiar with regular expressions, here is a sample for using them in VB.
Use the following function,
passing it the pattern and the text to be parsed.

vb Code:

' Requires a project reference to "Microsoft VBScript Regular Expressions 5.5".
Function TestRegExp(sPattern As String, sText As String) As String
    Dim oRegExp As RegExp
    Dim oMatch As Match
    Dim oMatches As MatchCollection
    Dim sOutput As String

    Set oRegExp = New RegExp

    oRegExp.Pattern = sPattern
    oRegExp.IgnoreCase = True
    oRegExp.Global = True    ' find every match, not just the first

    If oRegExp.Test(sText) Then
        Set oMatches = oRegExp.Execute(sText)
        For Each oMatch In oMatches
            ' FirstIndex is zero-based
            sOutput = sOutput & "Match found at position "
            sOutput = sOutput & oMatch.FirstIndex & ". Match Value is '"
            sOutput = sOutput & oMatch.Value & "'." & vbCrLf
        Next
    Else
        sOutput = "String Matching Failed"
    End If
    TestRegExp = sOutput
End Function
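
To pull the four pieces out of each block rather than report whole matches, one option is to add capture groups to the pattern and read them through SubMatches. The sketch below assumes every block looks exactly like the sample in the first post. PrintListings and the pattern are illustrative, not from the thread, and untested against the real page.

```vb
' Sketch only. Requires a project reference to
' "Microsoft VBScript Regular Expressions 5.5".
Public Sub PrintListings(sPage As String)
    Dim oRegExp As RegExp
    Dim oMatch As Match

    Set oRegExp = New RegExp
    oRegExp.Global = True
    oRegExp.IgnoreCase = True

    ' Groups: (1) link id, (2) listing title, (3) site name, (4) description
    oRegExp.Pattern = "<h3><a id=(\S+) href=""[^""]*""><b>([^<]*)</b></a></h3>" & _
                      "<cite>([^<]*)</cite>\s*([^<]*)<li class"

    For Each oMatch In oRegExp.Execute(sPage)
        ' SubMatches is zero-based: index 0 is the first capture group
        Debug.Print oMatch.SubMatches(0)            ' link id
        Debug.Print oMatch.SubMatches(1)            ' listing title
        Debug.Print Trim$(oMatch.SubMatches(3))     ' description
        Debug.Print oMatch.SubMatches(2)            ' site name
    Next
End Sub
```

Inside the loop, the Debug.Print lines could be swapped for assignments to the text boxes. The pattern is deliberately strict, so any block that deviates from the sample markup is skipped rather than matched incorrectly.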

Feb 19th, 2009, 10:23 AM
dogfighter

Re: Parse HTML with Regex

Appreciate it suki, but I'd like to avoid regex if I can. Zach's method was working just fine until I included that pound sign.

Can anyone shed some light on a way around this?
Feb 20th, 2009, 01:21 PM
dogfighter

Re: Parse HTML with Regex

Never mind, I was missing something in my match string. My error. Thanks to Zach and suki for your help.