Thread: Meta Tags, RegExpression

  1. #1

    Thread Starter
    Junior Member
    Join Date
    Aug 2004
    Location
    Texas
    Posts
    25

    Meta Tags, RegExpression

    I was wondering if anyone knows a regular expression to pull out the content= part of a meta tag from a given website?

  2. #2
    Kitten CornedBee's Avatar
    Join Date
    Aug 2001
    Location
    In a microchip!
    Posts
    11,594
    '/<meta\w+([^=]+)="([^"]*)"\w+content="([^"]*)"\w*/?>/'
    This regular expression will catch every meta tag where the content attribute comes after the defining attribute. It stores 'http-equiv' or 'name' in $1, the value of that attribute in $2 and the value of the content attribute in $3. It should be simple to reverse to catch the other metas too.
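
A minimal sketch of how the three captures line up, written with Python's `re` module rather than PHP's preg functions. It assumes the `\w+` separators in the posted pattern are meant as whitespace (`\s+`), since `\w` cannot match the space between attributes. The sample tags and variable names are made up for illustration, and the second pattern is one possible reading of "reversed".

```python
import re

# Assumption: \w+ in the posted pattern is read as \s+ (whitespace between attributes).
NAME_FIRST = re.compile(r'<meta\s+([^=]+)="([^"]*)"\s+content="([^"]*)"\s*/?>')

# One way to "reverse" it: content first, then http-equiv or name.
CONTENT_FIRST = re.compile(r'<meta\s+content="([^"]*)"\s+([^=]+)="([^"]*)"\s*/?>')

# Hypothetical sample markup, not taken from any real site.
name_first_tag = '<meta name="description" content="A page about regular expressions">'
content_first_tag = '<meta content="5" http-equiv="Refresh" />'

# Groups 1, 2, 3 correspond to $1, $2, $3 in the post.
attr_name, attr_value, content = NAME_FIRST.search(name_first_tag).groups()

# In the reversed pattern the content moves to group 1.
rev_content, rev_name, rev_value = CONTENT_FIRST.search(content_first_tag).groups()

print(attr_name, attr_value, content)
print(rev_name, rev_value, rev_content)
```

Both patterns are case-sensitive and only accept double-quoted values, so they cover well-formed XHTML-style tags and nothing looser.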
    All the buzzt
    CornedBee

    "Writing specifications is like writing a novel. Writing code is like writing poetry."
    - Anonymous, published by Raymond Chen

    Don't PM me with your problems, I scan most of the forums daily. If you do PM me, I will not answer your question.

  3. #3
    VBA Nutter visualAd's Avatar
    Join Date
    Apr 2002
    Location
    Ickenham, UK
    Posts
    4,906
    Originally posted by CornedBee
    '/<meta\w+([^=]+)="([^"]*)"\w+content="([^"]*)"\w*/?>/'
    Impressive stuff

    I needed to do something similar a month ago: pull all the links inside anchor tags from a site. It was nice and easy to construct a regular expression based on valid, well-formed XHTML, but the truth is that many sites out there don't use it. So I had to create a massive regular expression which would match all kinds of possibilities.

    E.g. all of these are valid HTML 4:
    Code:
    <meta http-equiv="Refresh" content="5" />
    <META http-equiv='Refresh' content=5 >
    <meTa HTTP-equiv=Refresh coNtent=5 >
    <meta
    
              content="5"
                                         http-equiv=Refresh>
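
One way to cope with all four variants is to skip the regular expression and let a tolerant HTML parser split the attributes. The sketch below swaps in Python's standard `html.parser` module for that purpose; it is not what anyone in this thread used, and `MetaCollector` and `collect_metas` are hypothetical names chosen for the example.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects the attributes of every <meta> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; values keep their case.
        # Self-closing tags such as <meta ... /> also end up here by default.
        if tag == "meta":
            self.metas.append(dict(attrs))


def collect_metas(html):
    """Return one attribute dict per <meta> tag found in html."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return collector.metas


# The four variants from the post above.
variants = """<meta http-equiv="Refresh" content="5" />
<META http-equiv='Refresh' content=5 >
<meTa HTTP-equiv=Refresh coNtent=5 >
<meta

          content="5"
                                     http-equiv=Refresh>
"""

metas = collect_metas(variants)
for meta in metas:
    print(meta.get("http-equiv"), meta.get("content"))
```

The parser normalises case, quoting style, attribute order and line breaks, which are exactly the things that make the single regular expression grow.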

  4. #4

    Thread Starter
    Junior Member
    Join Date
    Aug 2004
    Location
    Texas
    Posts
    25
    In PHP you could have just put i after the final / and it would take care of the case stuff. As for the ' and ", I don't know.
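
A small sketch of what the case-insensitive modifier buys, using Python's `re.IGNORECASE` as the counterpart of the trailing `i` in a PHP pattern. The pattern is the first one from this thread with whitespace (`\s+`) assumed between attributes, and the upper-case tag is a made-up sample.

```python
import re

# Assumption: whitespace (\s+) separates the attributes.
pattern = r'<meta\s+([^=]+)="([^"]*)"\s+content="([^"]*)"\s*/?>'

# Hypothetical tag written entirely in upper case.
tag = '<META HTTP-EQUIV="Refresh" CONTENT="5">'

# Without the flag, the literal "meta" and "content" must match case exactly.
case_sensitive = re.search(pattern, tag)

# With the flag, the literals match in any case; captured text keeps its case.
case_insensitive = re.search(pattern, tag, re.IGNORECASE)

print(case_sensitive)
print(case_insensitive.groups())
```

The flag only affects the literal parts of the pattern, so it does nothing for single-quoted or unquoted attribute values.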

  5. #5
    Kitten CornedBee's Avatar
    Join Date
    Aug 2001
    Location
    In a microchip!
    Posts
    11,594
    DTD and visualAd are right, here's a new version. There was also an error in this.

    Watch out for board-removed double backslashes. Going through them, their count is 2, 1, 2, 2, 1, 2, 2.
    It seems the single backslashes before the single quote signs were removed.

    Code:
    '|<meta\\s+([^=]+)=(["\'])(.*?)\\2\\s+content=(["\'])([^"]*)\\4\\s*/?>|is'
    Should capture pretty much everything now, with $1 being 'http-equiv' or 'name', $3 its value and $5 the content.
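
For comparison, here is a sketch of the same pattern in Python's `re` syntax, with the parentheses balanced so that the groups line up with the $1, $3 and $5 described above. The `is` modifiers become `re.IGNORECASE | re.DOTALL`; the helper name `extract` and the sample tags are assumptions made for the example.

```python
import re

# Groups: 1 = attribute name, 2 = its quote, 3 = its value,
#         4 = quote around content, 5 = content value.
META_QUOTED = re.compile(
    r'''<meta\s+([^=]+)=(["'])(.*?)\2\s+content=(["'])([^"]*)\4\s*/?>''',
    re.IGNORECASE | re.DOTALL,
)


def extract(tag):
    """Return (attribute name, attribute value, content) or None if no match."""
    match = META_QUOTED.search(tag)
    if match is None:
        return None
    return match.group(1), match.group(3), match.group(5)


# Hypothetical sample tags.
double_quoted = extract('<meta http-equiv="Refresh" content="5" />')
single_quoted = extract("<META name='keywords' content='regex, php'>")
unquoted = extract('<meta http-equiv=Refresh content=5>')

print(double_quoted)
print(single_quoted)
print(unquoted)
```

The backreferences `\2` and `\4` force each value to close with the quote it opened with. Unquoted values and tags that put content first still fall outside this pattern.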
    Last edited by CornedBee; Dec 2nd, 2004 at 03:07 AM.
