I was wondering if anyone knew a regular expression to pull out the contents= part of a meta tag from a given website?
'/<meta\w+([^=]+)="([^"]*)"\w+content="([^"]*)"\w*/?>/'
This regular expression will catch every meta tag where the content attribute comes after the defining attribute. It stores 'http-equiv' or 'name' in $1, the value of that attribute in $2 and the value of the content attribute in $3. It should be simple to reverse to catch the other meta tags too.
Impressive stuff :thumb:
Quote:
Originally posted by CornedBee
'/<meta\w+([^=]+)="([^"]*)"\w+content="([^"]*)"\w*/?>/'
I needed to do something similar a month ago: pull all the links inside anchor tags from a site. It was nice and easy to construct a regular expression based on valid, well-formed XHTML, but the truth is many sites out there don't use it. So I had to create a massive regular expression which would match all kinds of possibilities.
E.g. all of these are valid HTML 4:
Code:
<meta http-equiv="Refresh" content="5" />
<META http-equiv='Refresh' content=5 >
<meTa HTTP-equiv=Refresh coNtent=5 >
<meta
content="5"
http-equiv=Refresh>
In PHP you could have just put an i after the final / and it would take care of the upper/lower case differences. As for the ' and " variants, I don't know.
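To illustrate the i modifier mentioned above, here is a minimal sketch. The sample tags and the variable names ($tags, $caseSensitive, $caseInsensitive) are made up for this example, and the pattern is a simplified one that only handles double-quoted http-equiv tags; it is not the pattern posted earlier in the thread.

```php
<?php
// Hypothetical sample tags: same markup, different letter case.
$tags = [
    '<meta http-equiv="Refresh" content="5" />',
    '<META HTTP-EQUIV="Refresh" CONTENT="5" />',
];

// Same pattern twice: once without modifiers, once with a trailing "i".
// The slash before ">" is escaped because "/" is also the delimiter.
$caseSensitive   = '/<meta\s+http-equiv="[^"]*"\s+content="([^"]*)"\s*\/?>/';
$caseInsensitive = '/<meta\s+http-equiv="[^"]*"\s+content="([^"]*)"\s*\/?>/i';

foreach ($tags as $tag) {
    $plain = preg_match($caseSensitive, $tag);        // 1 on match, 0 otherwise
    $withI = preg_match($caseInsensitive, $tag, $m);  // $m[1] holds the content value
    echo $tag, ' => without i: ', $plain, ', with i: ', $withI, "\n";
}
```

The modifier only deals with letter case; quote style and attribute order still have to be handled in the pattern itself.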
DTD and visualAd are right, so here's a new version. There was also an error in the original.
Watch out for board-removed double backslashes. Going through them, their count is 2, 1, 2, 2, 1, 2, 2.
It seems the single backslashes before the single quote signs were removed.
Should capture pretty much everything now, with $1 being 'http-equiv' or 'name', $3 its value and $5 the content.
Code:
'|<meta\\s+([^=]+)=(["\'])(.*?)\\2\\s+content=(["\'])(.*?)\\4\\s*/?>|is'
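A minimal sketch of how a pattern of this shape might be used with preg_match_all. The pattern below is written with balanced parentheses and a lazy content group, and $html is a made-up sample string rather than a real page, so treat it as an illustration and not a tested scraper.

```php
<?php
// Hypothetical input: two meta tags with different case and quote styles.
$html = '<meta name="description" content="Example page" />' . "\n"
      . "<META http-equiv='Refresh' content='5'>";

// "|" is the delimiter; "i" ignores case, "s" lets "." match newlines.
// Groups: 1 = attribute name, 2 = its quote, 3 = its value,
//         4 = quote around content, 5 = content value.
$pattern = '|<meta\\s+([^=]+)=(["\'])(.*?)\\2\\s+content=(["\'])(.*?)\\4\\s*/?>|is';

// PREG_SET_ORDER yields one array per matched tag.
$count = preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m[1], '=', $m[3], ' -> ', $m[5], "\n";
}
```

Note that this still expects quoted values and the content attribute in second position, so the unquoted and reordered variants listed earlier in the thread would not match.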