Thread: Regex Help!

  1. #1

    Thread Starter
    PowerPoster i00's Avatar
    Join Date
    Mar 2002
    Location
    1/2 way accross the galaxy.. and then some
    Posts
    2,388

    Regex Help!

    I have the following regular expression that I am writing ...

    <page\s[^>]*?(?<=\s)title\s*?=\s*?(?<tag>'|")(?<Title>.+?)\k<tag>


    This extracts attribute values from HTML-like syntax. In this case it gets the title value from the page element and captures it in the Title group.

    I need it to work in a multitude of situations. It currently works for the following:

    <page someproperty="asd" title="hello">
    <page someproperty="asd" title='hello'>


    However, I also want it to work when there are NO quotation marks, so that:

    <page someproperty="asd" title=hello>

    will match too.

    I can't for the life of me figure this one out!

    Thanks,
    Kris

  2. #2
    Bad man! ident's Avatar
    Join Date
    Mar 2009
    Location
    Cambridge
    Posts
    5,398

    Re: Regex Help!

    Your pattern is different from the actual HTML? You don't have to be greedy. Just as an example:

    vb Code:
    Option Strict On

    Imports System.Text.RegularExpressions

    Public Class Form1

        Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
            ' The look-behind anchors on "<pag...title=" plus an optional double quote;
            ' the look-ahead requires an optional double quote followed by ">".
            Dim pattern As String = "(?i)(?<=<pag.+title=""?)[a-z]+(?=""?>)"
            Dim items = {"<page someproperty=""asd"" title=""hello"">",
                         "<page someproperty=""asd"" title=hello>"}

            ' Show the matched title for each sample, one per line.
            MessageBox.Show(String.Join(Environment.NewLine,
                                        From item In items
                                        Select New Regex(pattern).Match(item).Value))
        End Sub

    End Class

    Actually, thinking about that, depending on how fussy you are, you will want a back reference, as "hello would match. I'll post back after dinner.

  3. #3

    Thread Starter
    PowerPoster i00's Avatar
    Join Date
    Mar 2002
    Location
    1/2 way accross the galaxy.. and then some
    Posts
    2,388

    Re: Regex Help!

    Quote Originally Posted by ident
    Your pattern is different from the actual HTML? You don't have to be greedy. Just as an example:

    ...

    Actually, thinking about that, depending on how fussy you are, you will want a back reference, as "hello would match. I'll post back after dinner.

    Thanks for your reply. Looking at yours, there seem to be a few issues with it.

    You have:
    (?i)(?<=<pag.+title="?)[a-z]+(?="?>)

    This will NOT match:

    <page title='hello'>

    And will incorrectly match:

    <page sometitle="hello"> <- This shouldn't match at all
    <pag title="hello"> <- This shouldn't match at all
    <page title="hello>there">
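    The cases above can be checked with a small console sketch that runs the pattern from post #2 against each sample. The module and function names (IdentPatternCheck, TryMatch) are made up for illustration; only the pattern and the sample strings come from the thread.

    ```vbnet
    Imports System
    Imports System.Text.RegularExpressions

    Module IdentPatternCheck
        ' Pattern quoted from post #2 (the (?i) prefix makes it case-insensitive).
        Private Const Pattern As String = "(?i)(?<=<pag.+title=""?)[a-z]+(?=""?>)"

        ' Hypothetical helper: returns the matched text, or Nothing when there is no match.
        Public Function TryMatch(ByVal input As String) As String
            Dim m As Match = Regex.Match(input, Pattern)
            Return If(m.Success, m.Value, Nothing)
        End Function

        Sub Main()
            Dim samples As String() = {
                "<page title='hello'>",
                "<page sometitle=""hello"">",
                "<pag title=""hello"">",
                "<page title=""hello>there"">"
            }
            For Each s As String In samples
                ' The binary If operator substitutes a label when TryMatch returns Nothing.
                Console.WriteLine("{0} -> {1}", s, If(TryMatch(s), "(no match)"))
            Next
        End Sub
    End Module
    ```

    Run as a console app, the single-quoted sample reports no match and the other three each report hello, which agrees with the list above: the look-behind accepts any text between "<pag" and "title=", and the look-ahead stops at the first ">" even inside a quoted value.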


    Regards,
    Kris

  4. #4
    Fanatic Member AceInfinity's Avatar
    Join Date
    May 2011
    Posts
    696

    Re: Regex Help!

    Perhaps it could be optimized a bit more, but I took what you had and added what was needed...
    vbnet Code:
    Imports System
    Imports System.Text.RegularExpressions

    Module Module1
        Sub Main()
            Dim arr As String() = New String() {
                "<page someproperty="""" title=""1"">",
                "<page someproperty=""asd"" title=""2"">",
                "<page someproperty=""asd"" title=3>",
                "<page someproperty=""asd"" title = 4>",
                "<page someproperty=""asd"" title = '5'>",
                "<page someproperty=""asd"" title = ""6"">",
                "<page  title='7'>",
                "<pAGe      TiTlE='8'>"
            }
            ' The opening quote is optional; \k<tag> then requires the same character (or nothing) before ">".
            Dim pattern As String = "<page\s[^>]*?(?<=\s)title\s*?=(\s+)?(?<tag>[""']?)(?<Title>.+)\k<tag>>"
            For Each s As String In arr
                Console.WriteLine("Matched: ""{0}""", Regex.Match(s, pattern, RegexOptions.IgnoreCase).Groups("Title").Value)
            Next
            Console.ReadKey()
        End Sub
    End Module

    I get this as output:
    Code:
    Matched: "1"
    Matched: "2"
    Matched: "3"
    Matched: "4"
    Matched: "5"
    Matched: "6"
    Matched: "7"
    Matched: "8"
    Last edited by AceInfinity; Nov 18th, 2013 at 10:58 PM. Reason: Improved the regex for odd cases
    <<<------------
    Improving Managed Code Performance | .NET Application Performance
    < Please if this helped you out. Any kind of thanks is gladly appreciated >


    .NET Programming (2012 - 2018)
    ®Crestron - DMC-T Certified Programmer | Software Developer
    <<<------------

  5. #5

    Thread Starter
    PowerPoster i00's Avatar
    Join Date
    Mar 2002
    Location
    1/2 way accross the galaxy.. and then some
    Posts
    2,388

    Re: Regex Help!

    Quote Originally Posted by AceInfinity
    Perhaps it could be optimized a bit more, but I took what you had and added what was needed...
    ...

    I get this as output:
    ...

    A few problems.
    Your pattern also matches:
    <page someproperty="asd" title="hello'>
    <page someproperty="asd" title='hello">

    which it shouldn't.
    It also captures "asd 123" in the Title group for:
    <page someproperty="asd" title=asd 123>
    where it should only capture "asd".

    So I still think my original pattern is the best one yet. It's a shame, since it doesn't work without quotes.
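    A minimal sketch to reproduce these cases, using the pattern from post #4 as posted. The names AcePatternCheck and CapturedTitle are hypothetical; the samples are the three lines above.

    ```vbnet
    Imports System
    Imports System.Text.RegularExpressions

    Module AcePatternCheck
        ' Pattern quoted from post #4.
        Private Const Pattern As String =
            "<page\s[^>]*?(?<=\s)title\s*?=(\s+)?(?<tag>[""']?)(?<Title>.+)\k<tag>>"

        ' Hypothetical helper: returns the Title capture, or Nothing when there is no match.
        Public Function CapturedTitle(ByVal input As String) As String
            Dim m As Match = Regex.Match(input, Pattern, RegexOptions.IgnoreCase)
            Return If(m.Success, m.Groups("Title").Value, Nothing)
        End Function

        Sub Main()
            Dim samples As String() = {
                "<page someproperty=""asd"" title=""hello'>",
                "<page someproperty=""asd"" title='hello"">",
                "<page someproperty=""asd"" title=asd 123>"
            }
            For Each s As String In samples
                ' Brackets make stray quote characters in the capture easy to see.
                Console.WriteLine("{0} -> [{1}]", s, If(CapturedTitle(s), "(no match)"))
            Next
        End Sub
    End Module
    ```

    Because the tag group is allowed to be empty, the engine backtracks to an empty tag when the closing quote does not match, and the greedy .+ then swallows the quote characters. The mismatched samples therefore match with the quotes inside the capture ("hello' and 'hello"), and the unquoted sample captures asd 123.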

    Kris

  6. #6
    Bad man! ident's Avatar
    Join Date
    Mar 2009
    Location
    Cambridge
    Posts
    5,398

    Re: Regex Help!

    I matched exactly what I assumed you wanted to match. Can you post all the HTML examples you want to match?

  7. #7
    Addicted Member
    Join Date
    Oct 2012
    Location
    Springfield, IL
    Posts
    142

    Re: Regex Help!

    Try this regex:
    <page\s[^>]*?(?<=\s)title\s*?=\s*?(?<tag>'|")?(?<Title>[^>"'\s]+)(?:[^>]*)?\k<tag>?

    vb.net Code:
    Dim TestRegex As New Regex("<page\s[^>]*?(?<=\s)title\s*?=\s*?(?<tag>'|"")?(?<Title>[^>""'\s]+)(?:[^>]*)?\k<tag>?",
                               RegexOptions.IgnoreCase Or RegexOptions.Multiline)


    Produces matches on the "Title" group like:

    [Attached image: regex_result.JPG]

  8. #8
    Fanatic Member AceInfinity's Avatar
    Join Date
    May 2011
    Posts
    696

    Re: Regex Help!

    Quote Originally Posted by i00
    A few problems...
    Your method also matches:
    <page someproperty="asd" title="hello'>
    <page someproperty="asd" title='hello">

    Which it shouldn't
    ... also it groups into tag "asd 123" in:
    <page someproperty="asd" title=asd 123>
    where it should only group "asd"

    .. so i still think my original method is the best one yet... shame since it doesn't work without quotes

    Kris

    Okay, but the issues you are giving me as feedback are very biased.

    1. I've never even seen HTML like this:
    Code:
    <page someproperty="asd" title="hello'>
    Nor do I think that it is even valid (I'd have to verify).

    2.
    ... also it groups into tag "asd 123" in:
    <page someproperty="asd" title=asd 123>
    where it should only group "asd"

    I could modify that regex to solve both of these, but in my mind that was never part of the original question.

    I went off the examples you gave, and threw in a couple others that I could think of. If you provide a full list of test cases, and what is supposed to match, I think this thread would have better replies for what you're looking for.

    How about this regex:
    Code:
    <page.+title(\s+)?=(\s+)?("[\w\s]+"|'[\w\s]+'|\w+)
    The portion after "<page.+" accurately matches these:
    Code:
    title=asd
    title="1"
    title="2"
    title=3
    title = 4
    title = '5'
    title = "6 tes lkjt"
    title='7'
    TiTlE='8'
    From this:
    Code:
    <page someproperty="asd" title="hello'>
    <page someproperty="asd" title='hello">
    <page someproperty="asd" lkjsdf=lksjdf title=asd 123 prp=lkj>
    <page someproperty="" title="1">
    <page someproperty="asd" title="2">
    <page someproperty="asd" title=3>
    <page someproperty="asd" title = 4>
    <page someproperty="asd" title = '5'>
    <page someproperty="asd"  someproperty="asd" title = "6 tes lkjt" someproperty="asd">
    <page  title='7'>
    <pAGe      TiTlE='8'>
    Quick and dirty.
    Last edited by AceInfinity; Nov 22nd, 2013 at 09:00 PM.

  9. #9

    Thread Starter
    PowerPoster i00's Avatar
    Join Date
    Mar 2002
    Location
    1/2 way accross the galaxy.. and then some
    Posts
    2,388

    Re: Regex Help!

    The following is valid:

    <a href="something" alt="Kris' day out">

    Yours would match Kris, not Kris' day out.

    And I need it to match ALL valid possibilities, as I am parsing a large variety of web pages.

    Kris

  10. #10
    Fanatic Member AceInfinity's Avatar
    Join Date
    May 2011
    Posts
    696

    Re: Regex Help!

    Code:
    title(\s+)?=(\s+)?("[\w\s']+"|'[\w\s]+'|\w+)
    All I had to do was add a single quote, now it matches.
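
    For the broader requirement in post #9, a \w and \s character class will still miss values containing other punctuation. One possible alternative, offered only as a sketch: give each quoting style its own branch and reuse the group name Title in every branch, which the .NET regex engine permits. The names TitleAttributeSketch and ExtractTitle are hypothetical, and the sketch assumes each input string is a single tag. A regex is not a full HTML parser, so text such as " title=" inside another attribute's value can still fool it.

    ```vbnet
    Imports System
    Imports System.Text.RegularExpressions

    Module TitleAttributeSketch
        ' Three branches share the group name "Title":
        '   "..."  double-quoted value, may contain ' and >
        '   '...'  single-quoted value, may contain " and >
        '   bare   unquoted value, ends at whitespace, a quote, or >
        Private Const Pattern As String =
            "<page\s[^>]*?(?<=\s)title\s*=\s*" &
            "(?:""(?<Title>[^""]*)""|'(?<Title>[^']*)'|(?<Title>[^\s""'>]+))"

        ' Hypothetical helper: returns the title value, or Nothing when nothing matches.
        Public Function ExtractTitle(ByVal html As String) As String
            Dim m As Match = Regex.Match(html, Pattern, RegexOptions.IgnoreCase)
            Return If(m.Success, m.Groups("Title").Value, Nothing)
        End Function

        Sub Main()
            Dim samples As String() = {
                "<page someproperty=""asd"" title=""hello"">",
                "<page someproperty=""asd"" title='hello'>",
                "<page someproperty=""asd"" title=hello>",
                "<page someproperty=""asd"" title=asd 123>",
                "<page title=""Kris' day out"">",
                "<page someproperty=""asd"" title=""hello'>",
                "<page sometitle=""hello"">"
            }
            For Each s As String In samples
                Console.WriteLine("{0} -> [{1}]", s, If(ExtractTitle(s), "(no match)"))
            Next
        End Sub
    End Module
    ```

    With these samples the first three give hello, the fourth gives asd, the fifth gives Kris' day out, and the last two give no match, since a mismatched quote never finds its closing partner and sometitle fails the (?<=\s) look-behind.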
