-
Dec 1st, 2004, 03:04 PM
#1
Thread Starter
Junior Member
Meta Tags, RegExpression
I was wondering if anyone knew a regular expression to pull out the contents= part of a meta tag from a given website?
-
Dec 1st, 2004, 06:21 PM
#2
'/<meta\w+([^=]+)="([^"]*)"\w+content="([^"]*)"\w*/?>/'
This regular expression will catch every meta tag where the content attribute comes after the defining attribute. It stores 'http-equiv' or 'name' in $1, the value of that attribute in $2 and the value of the content attribute in $3. It should be simple to reverse to catch the other metas too.
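A minimal PHP sketch of how those groups come out, in case it helps. It assumes whitespace (`\s`) between the attributes rather than `\w`, and it uses `#` as the delimiter so the optional `/` near the end needs no escaping. The sample `$html` string is made up for illustration.

```php
<?php
// Hypothetical sample input; any HTML string would do.
$html = '<meta name="description" content="A test page" />';

// Same shape as the pattern above, but with \s (whitespace) between the
// attributes and # as the delimiter (both are assumptions of this sketch).
$pattern = '#<meta\s+([^=]+)="([^"]*)"\s+content="([^"]*)"\s*/?>#';

if (preg_match($pattern, $html, $m)) {
    echo $m[1], "\n"; // the attribute name: name
    echo $m[2], "\n"; // that attribute's value: description
    echo $m[3], "\n"; // the content value: A test page
}
```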
All the buzzt
 CornedBee
"Writing specifications is like writing a novel. Writing code is like writing poetry."
- Anonymous, published by Raymond Chen
Don't PM me with your problems, I scan most of the forums daily. If you do PM me, I will not answer your question.
-
Dec 1st, 2004, 07:50 PM
#3
Originally posted by CornedBee
'/<meta\w+([^=]+)="([^"]*)"\w+content="([^"]*)"\w*/?>/'
Impressive stuff
I needed to do something similar a month ago: pull all the links inside anchor tags from a site. It was nice and easy to construct a regular expression based on valid, well-formed XHTML, but the truth is many sites out there don't use it. So I had to create a massive regular expression which would match all kinds of possibilities.
E.g. all of these are valid HTML 4:
Code:
<meta http-equiv="Refresh" content="5" />
<META http-equiv='Refresh' content=5 >
<meTa HTTP-equiv=Refresh coNtent=5 >
<meta
content="5"
http-equiv=Refresh>
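One way to cope with variants like these without a single massive pattern is to work in two steps: first grab each meta tag, then pick its attributes apart. The following is only a sketch under that assumption; `meta_attributes` is a hypothetical helper name, and it is not a full HTML parser (a `>` inside an attribute value would confuse it).

```php
<?php
// Sketch: collect the attributes of every <meta> tag, whatever the
// case, quoting style or attribute order.
function meta_attributes(string $html): array
{
    $tags = [];
    // Step 1: capture the attribute text of each <meta ...> tag (i = any case).
    preg_match_all('#<meta\b([^>]*)>#i', $html, $metaMatches);
    foreach ($metaMatches[1] as $attrText) {
        $attrs = [];
        // Step 2: name = "double quoted", 'single quoted' or unquoted value.
        // (?| ... ) is a branch-reset group, so the value is always group 2.
        preg_match_all(
            '#([a-z][a-z0-9_:-]*)\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s"\'>/]+))#i',
            $attrText,
            $pairs,
            PREG_SET_ORDER
        );
        foreach ($pairs as $p) {
            $attrs[strtolower($p[1])] = $p[2]; // attribute names are case-insensitive
        }
        $tags[] = $attrs;
    }
    return $tags;
}

// The four variants from the post above.
$html = <<<'HTML'
<meta http-equiv="Refresh" content="5" />
<META http-equiv='Refresh' content=5 >
<meTa HTTP-equiv=Refresh coNtent=5 >
<meta
content="5"
http-equiv=Refresh>
HTML;

foreach (meta_attributes($html) as $attrs) {
    echo ($attrs['http-equiv'] ?? ''), ' => ', ($attrs['content'] ?? ''), "\n";
}
```

Each of the four tags should come out with http-equiv set to Refresh and content set to 5, regardless of which attribute came first.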
-
Dec 1st, 2004, 11:00 PM
#4
Thread Starter
Junior Member
In PHP you could have just put i after the final / and it would take care of the case stuff. As for the ' and ", I don't know.
-
Dec 2nd, 2004, 02:52 AM
#5
DTD and visualAd are right, here's a new version. There was also an error in this.
Watch out for board-removed double backslashes. Going through them, their count is 2, 1, 2, 2, 1, 2, 2.
It seems the single backslashes before the single quote signs were removed.
Code:
'|<meta\\s+([^=]+)=(["\'])(.*?)\\2\\s+content=(["\'])(.*?)\\4\\s*/?>|is'
Should capture pretty much everything with quoted attribute values now, with $1 being 'http-equiv' or 'name', $3 its value and $5 the content. Unquoted values and tags that put content first still won't match.
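A short usage sketch with preg_match_all, for anyone who wants to try it. The pattern here is written with balanced parentheses and a lazy `(.*?)` for the content value so that a single-quoted value stops at its matching quote; both are assumptions of this sketch. The page source is invented; in practice it might come from something like file_get_contents on the site's URL.

```php
<?php
// Hypothetical page source, made up for illustration.
$html = <<<'HTML'
<head>
<META NAME="keywords" CONTENT='php, regex' />
<meta http-equiv="Refresh" content="5">
</head>
HTML;

// i = case-insensitive, s = dot also matches newlines.
// \\2 and \\4 are backreferences to whichever quote character opened the value.
$pattern = '|<meta\\s+([^=]+)=(["\'])(.*?)\\2\\s+content=(["\'])(.*?)\\4\\s*/?>|is';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
    // $m[1] = attribute name, $m[3] = its value, $m[5] = the content
    echo $m[1], ' = ', $m[3], ' : ', $m[5], "\n";
}
```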
Last edited by CornedBee; Dec 2nd, 2004 at 03:07 AM.