Meta Tags, RegExpression

Dec 1st, 2004, 03:04 PM
DTD33inc

I was wondering if anyone knows a regular expression to pull out the content= part of a meta tag from a given website?
Dec 1st, 2004, 06:21 PM
CornedBee

'/<meta\w+([^=]+)="([^"]*)"\w+content="([^"]*)"\w*/?>/'
This regular expression will catch every meta tag where the content attribute comes after the defining attribute. It stores 'http-equiv' or 'name' in $1, the value of that attribute in $2 and the value of the content attribute in $3. It should be simple to reverse the order to catch the other metas too.
Dec 1st, 2004, 07:50 PM
visualAd

Quote:

Originally posted by CornedBee
'/<meta\w+([^=]+)="([^"]*)"\w+content="([^"]*)"\w*/?>/'

Impressive stuff :thumb:

I needed to do something similar a month ago: pull all the links inside anchor tags from a site. It was nice and easy to construct a regular expression based on valid, well-formed XHTML, but the truth is that many sites out there don't use it. So I had to create a massive regular expression which would match all kinds of possibilities.

E.g. all of these are valid HTML 4:

Code:

<meta http-equiv="Refresh" content="5" />
<META http-equiv='Refresh' content=5 >
<meTa HTTP-equiv=Refresh coNtent=5 >
<meta content="5" http-equiv=Refresh>
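
For what it's worth, one way to sidestep the variant problem is to let an HTML parser normalise the markup instead of matching it with a regular expression. The following is only a rough sketch, not a drop-in solution: it swaps the regex for PHP's DOMDocument, assumes the dom extension is available, and the sample markup and the $contents variable are made up for illustration.

```php
<?php
// Hypothetical sample page containing the four meta tag variants.
$html = <<<'HTML'
<html><head>
<meta http-equiv="Refresh" content="5" />
<META http-equiv='Refresh' content=5 >
<meTa HTTP-equiv=Refresh coNtent=5 >
<meta content="5" http-equiv=Refresh>
</head><body></body></html>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Each entry is [defining attribute value, content value].
$contents = [];
foreach ($doc->getElementsByTagName('meta') as $meta) {
    // Use http-equiv if present, otherwise fall back to name.
    $key = $meta->getAttribute('http-equiv') ?: $meta->getAttribute('name');
    $contents[] = [$key, $meta->getAttribute('content')];
}

print_r($contents);
```

The parser takes care of tag and attribute case, the quoting style and the attribute order, so all four variants should come out the same way.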
Dec 1st, 2004, 11:00 PM
DTD33inc

In PHP you could have just put an i after the final / and it would take care of the case stuff. As for the ' and ", I don't know.
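
A quick sketch of what the i modifier does, using a made-up tag and a deliberately tiny pattern (not the full meta expression):

```php
<?php
// Hypothetical upper-case tag.
$tag = '<META HTTP-EQUIV="Refresh" CONTENT="5">';

// Without the i modifier the lower-case pattern does not match.
$caseSensitive = preg_match('/<meta\s/', $tag);

// With i after the closing delimiter the match ignores case.
$caseInsensitive = preg_match('/<meta\s/i', $tag);

var_dump($caseSensitive, $caseInsensitive);
```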
Dec 2nd, 2004, 02:52 AM
CornedBee

DTD and visualAd are right; here's a new version. There was also an error in the old one.

Watch out for double backslashes removed by the board. Going through the pattern from left to right, the backslash counts are 2, 1, 2, 2, 1, 2, 2.
It seems the single backslashes before the single quote signs were removed.

Code:

'|<meta\\s+([^=]+)=(["\'])(.*?)\\2\\s+content=(["\'])([^"]*)\\4\\s*/?>|is'

Should capture pretty much everything now, with $1 being 'http-equiv' or 'name', $3 its value and $5 the content.
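
A minimal sketch of how the pattern might be used with preg_match_all. The pattern is written as a single-quoted PHP string with balanced parentheses and every backslash in place; the sample markup and the $metas array are assumptions for illustration only.

```php
<?php
// Single-quoted string: \\ becomes one backslash, \' becomes a quote.
// Group 3 closes before the \\2 backreference.
$pattern = '|<meta\\s+([^=]+)=(["\'])(.*?)\\2\\s+content=(["\'])([^"]*)\\4\\s*/?>|is';

// Hypothetical sample markup with both quoting styles.
$html = '<meta name="description" content="A test page" />' . "\n"
      . "<META http-equiv='Refresh' content='5'>";

// $metas[attribute name][attribute value] = content
$metas = [];
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // $m[1] is 'name' or 'http-equiv', $m[3] its value, $m[5] the content.
        $metas[strtolower($m[1])][$m[3]] = $m[5];
    }
}

print_r($metas);
```

Note that this still requires quoted attribute values and the content attribute in second place, so unquoted or reversed tags will not be matched.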