dom


slice
Apr 7th, 2006, 07:29 AM
I am back again :)
Today I want to learn how to analyze a webpage for its anchor links and record them for spidering. I also want to respect the "nofollow" attribute.
But the problem is that I don't have much experience with DOM in PHP, so I need your help with some code samples or examples.

I tried using regular expressions after reading
http://blogs.worldnomads.com.au/matthewb/archive/2004/04/06/215.aspx
but that looks a bit more difficult and less complete than a DOM solution.
I would appreciate your comments :)

Thank You.

visualAd
Apr 7th, 2006, 02:00 PM
Firstly, DOM will only work with valid XHTML. Unfortunately, a lot of websites do not use it. If you know the site you will be using has valid XHTML, then you should use DOM.

Both PHP 4 (http://www.php.net/domxml) and PHP 5 (http://www.php.net/dom) have support for XML, the latter being W3C compliant and the better option if you have PHP 5 available. What problems were you encountering with DOM?

I had quite a good PCRE which matches links in web pages. I'll have a look and see if I can dig it out.

slice
Apr 8th, 2006, 12:00 AM
If DOM only works with XHTML documents, then I certainly should not use it, because I could be asked to spider any webpage, which may or may not be XHTML compliant.
So for now my project is to spider any given webpage, store its anchor links, spider each anchor link just 1 level deep, and not spider anchor links with rel="nofollow".

I have started with correctly analyzing the webpage for anchor links.
What would be the best way to check a webpage for all anchor links and then extract their attributes (most importantly rel="nofollow")?
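The "1 level deep" rule described above can be sketched independently of how the links are extracted. This is only an illustration under stated assumptions: spider_one_level() is a hypothetical name, $fetch and $extract are callables supplied by the caller, and each link is assumed to be an array with 'href' and 'nofollow' keys. Relative URLs are not resolved here.

```php
<?php
// Hypothetical sketch, not code from this thread.
// $fetch:   callable, takes a URL and returns the page body
// $extract: callable, takes a page body and returns a list of
//           array('href' => string, 'nofollow' => bool)
function spider_one_level($start_url, $fetch, $extract)
{
    // url => list of links found on that page
    $pages = array();
    $pages[$start_url] = call_user_func($extract, call_user_func($fetch, $start_url));

    foreach ($pages[$start_url] as $link) {
        $url = $link['href'];

        // Respect rel="nofollow" and never fetch the same URL twice.
        if ($link['nofollow'] || isset($pages[$url])) {
            continue;
        }

        // One level deep: links found on this page are stored, not fetched.
        $pages[$url] = call_user_func($extract, call_user_func($fetch, $url));
    }

    return $pages;
}
```

In real use $fetch could simply be 'file_get_contents', with $extract wrapping whichever link parser ends up being used.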

slice
Apr 9th, 2006, 01:20 AM
Any help/comment please?

visualAd
Apr 9th, 2006, 02:17 AM
The best option, as shown in the example above, is to use two regular expressions: one to match the anchor tags and the other to match their attributes. The regular expression below does a standard match on an anchor with an href attribute.

/<a.*href=((\"(.+)\".*>)|((\S+)((\s.*>)|>)))(.+)<\/a>/sU


But, if you want more attributes such as rel matched, I suggest you go for something like the function I've posted below. I use this to extract information about HTML forms and it seems to work OK on all but the most shoddy HTML code. To get all anchors on a page, use the following:

$anchors = get_html_tags($htmlCode, 'a');

// loop through and find anchors with rel="nofollow" attribute
foreach ($anchors as $anchor) {
    // link text, note: this may contain HTML
    $text = $anchor['text'];

    // look for rel in attributes array
    if (array_key_exists('rel', $anchor['attributes']) &&
        $anchor['attributes']['rel'] == 'nofollow') {
        /* do stuff here */
    }
}


The code for the get_html_tags() function is posted below:

/**
 * Matches HTML tags in a string and their attributes.
 *
 * Returns array in the format:
 * Array ([0] => Array ([text] => "text in between tags"
 *                      [attributes] => Array ([attribute1] => [value],
 *                                             [attribute2] => [value])));
 * @param $text string The string to find tags in.
 * @param $tagname string The tag name of the tag to match, e.g. form
 * @param $closetag boolean Does the tag have a corresponding close tag, i.e. </$tagname>
 * @param $close_optional If a close tag may be missing, it is closed implicitly when another tag of the same name is found.
 *                        Include the name of an additional tag that closes the tag implicitly here.
 * @return array An indexed array, one element for each tag.
 *
 * @author Adam Delves <codedv @ sccode . com >
 */
function get_html_tags($text, $tagname, $closetag=true, $close_optional=false)
{
    /* escape PCRE characters in tag name */
    $tagname = preg_quote($tagname);

    $ret = array();

    /* regular expression to match attributes in a tag */
    $attrib_match = "/((?i)[a-z]+) (\s*=\s* ( ((?U)(\"(.*)\")) |(\S+) ) |\s+) /sXi";

    if ($closetag) {
        if ($close_optional !== false) {
            $close_optional = preg_quote($close_optional);
            $regex = "/<$tagname(.*)>(.*)(<\/$tagname>|(?=<$tagname>)|(?=<\/$close_optional>))/Uis";
        } else {
            $regex = "/<$tagname(.*)>(.*)<\/$tagname>/Uis";
        }
    } else {
        $regex = "/<$tagname((.+))\/?>/Uis";
    }

    // regex now matches the tag appropriately

    preg_match_all($regex, $text, $tags, PREG_SET_ORDER);

    $tag_count = count($tags);

    for ($t = 0; $t < $tag_count; $t++) {
        $tag = array();

        /* get attributes */
        preg_match_all($attrib_match, $tags[$t][1], $attributes, PREG_SET_ORDER);

        $attribs = array();

        $attrib_count = count($attributes);
        for ($a = 0; $a < $attrib_count; $a++) {
            $name = strtolower(trim($attributes[$a][1]));

            if (isset($attributes[$a][4])) { // name value pair found
                $value = trim(isset($attributes[$a][7]) ? $attributes[$a][7] : $attributes[$a][6]);
            } else {
                $value = $name;
            }

            $attribs[$name] = $value;
        }

        $tag['attributes'] = $attribs;

        if ($closetag) {
            $tag['text'] = $tags[$t][2];
        }

        $ret[] = $tag;
    }

    return $ret;
}

slice
Apr 19th, 2006, 10:11 AM
It's really superb code. But
$anchor['attributes']['href']

remains empty all the time.

visualAd
Apr 19th, 2006, 12:47 PM
What does the HTML code that you are trying to parse look like?

slice
Apr 20th, 2006, 07:21 AM
I am using it this way:
$htmlCode = file_get_contents("http://www.vbforums.com/");
$anchors = get_html_tags($htmlCode, 'a');
foreach($anchors as $anchor) {
    $text = $anchor['text'];
    if (array_key_exists('rel', $anchor['attirbutes']) &&
        $anchor['attributes']['rel'] == 'nofollow') {
        echo $text."<br>";
    }
}


But it gives me these errors:
Undefined index: attirbutes
array_key_exists(): The second argument should be either an array or an object


I have also tried:

$htmlCode = file_get_contents("http://www.vbforums.com");
$anchors = get_html_tags($htmlCode, 'a');
foreach($anchors as $anchor) {
    $text = $anchor['text'];
    $urlhref = $anchor['attributes']['href'];
    echo ("<a href=".$urlhref.">".$text."</a><br>");
}


But I get the same errors, so I think there is a problem with the function.

I would appreciate your help. :)

slice
Apr 20th, 2006, 07:30 AM
Here I should mention that

$anchor['text']

always gets the correct value, but the problem is with the

$anchor['attributes']['rel']
and $anchor['attributes']['href'] array values :)

visualAd
Apr 20th, 2006, 10:22 AM
If you show me the HTML you are giving the function, I will be able to have a look at it. So far the HTML I have tried works. So, like I asked in my previous post, post the HTML you are putting into the function and I'll take a look :)

slice
Apr 20th, 2006, 10:10 PM
The HTML comes from this code:
$htmlCode = file_get_contents("http://www.vbforums.com");

Isn't that the right way?

PlaGuE
Apr 21st, 2006, 12:32 AM
uhhhhh.

slice
Apr 21st, 2006, 01:02 AM
Plague, is anything wrong with my code? :confused:

slice
Apr 21st, 2006, 07:38 AM
I have found where the problem is in the code.


preg_match_all($attrib_match, $tags[$t][1], $attributes, PREG_SET_ORDER);

$attribs = array();

$attrib_count = count($attributes);
echo $attrib_count."<br>";
for($a = 0; $a < $attrib_count; $a++) {
$name = strtolower(trim($attributes[$a][1]));
echo $name."<br>";

if (isset($attributes[$a][4])) { // name value pair found
$value = trim(isset($attributes[$a][7])?$attributes[$a][7]:$attributes[$a][6]);
} else {
$value = $name;
}




Here $name is always empty.
So I think the problem is there :)

And it may be because of this regular expression.


$attrib_match = "/((?i)[a-z]+) (\s*=\s* ( ((?U)(\"(.*)\")) |(\S+) ) |\s+) /sXi";


I am not a master of RE, so kindly help me with this last problem. :)

Thank You.

visualAd
Apr 21st, 2006, 03:54 PM
Yes, the problem is with the regular expression. I wrote it using PHP 5 (where it works). In PHP 4 the X modifier (which, among other things, is meant to cause the compiler to ignore whitespace in the expression) is being ignored and the expression is not matching.

Remove the X modifier and the white space and it will work:

$attrib_match = "/((?i)[a-z]+)(\s*=\s*(((?U)(\"(.*)\"))|(\S+))|\s+)/si";
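A quick way to check an attribute pattern like this is to run it over a sample attribute string. The sketch below is an illustration, not visualAd's code: parse_attributes() is a hypothetical helper, and the sample string in the usage note is made up. It keeps the spaced-out layout of the pattern by using the lowercase x modifier, which the PHP manual lists as PCRE_EXTENDED (whitespace in the pattern is ignored); the uppercase X modifier is PCRE_EXTRA and does not do that.

```php
<?php
// Hypothetical helper wrapping the attribute loop from get_html_tags().
function parse_attributes($attribute_string)
{
    // Same groups as the thread's pattern; lowercase x (PCRE_EXTENDED)
    // lets the pattern contain layout whitespace.
    $attrib_match = '/((?i)[a-z]+) (\s*=\s* ( ((?U)("(.*)")) | (\S+) ) | \s+)/six';

    preg_match_all($attrib_match, $attribute_string, $matches, PREG_SET_ORDER);

    $attribs = array();
    foreach ($matches as $m) {
        $name = strtolower($m[1]);

        if (isset($m[7]) && $m[7] !== '') {
            $value = $m[7];          // group 7: unquoted value, e.g. rel=nofollow
        } elseif (isset($m[5]) && $m[5] !== '') {
            $value = $m[6];          // group 6: text inside the double quotes
        } else {
            $value = $name;          // bare attribute such as "checked"
        }

        $attribs[$name] = trim($value);
    }

    return $attribs;
}
```

For example, parse_attributes(' href="http://www.vbforums.com/" rel=nofollow') should give an array with href and rel keys. Like the original pattern, this sketch only recognises attribute names made of letters and values quoted with double quotes.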

slice
Sep 30th, 2006, 04:27 AM
Hi VisualAd,

Would you help me refine this code? Right now it extracts everything between <a> and </a>. For example, for

<a href="http://www.vbforums.com"><span id="testing">World</span></a>


it will give <span id="testing">World</span>


and any font tags are also included:


<a href="http://www.vbforums.com"><span id="testing"><font color="#ffffff">Hello</font></span></a>


It will give <span id="testing"><font color="#ffffff">Hello</font></span> as the anchor text.


What changes would help to extract only "World" in the first case and "Hello" in the second case?

I am really dumb at RE. :o

Please help.
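One possible approach, which does not need a change to the regular expression at all, is to pass the extracted anchor text through PHP's built-in strip_tags(). This is a minimal sketch rather than anything posted in the thread; anchor_text() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: reduce the markup inside an anchor to its visible text.
function anchor_text($inner_html)
{
    // strip_tags() removes HTML tags such as <span> and <font>
    $text = strip_tags($inner_html);

    // collapse runs of whitespace left behind by removed markup
    return trim(preg_replace('/\s+/', ' ', $text));
}
```

With the function from earlier in the thread this would be used as anchor_text($anchor['text']).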

slice
Oct 9th, 2006, 10:47 AM
Can anybody please help? :)

kows
Oct 9th, 2006, 04:28 PM
I looked but I'm not great with complex regular expressions, sorry.

you might be able to PM VisualAd and ask for their help directly, in case they just haven't read this forum lately.

slice
Oct 10th, 2006, 07:21 AM
Well, I will PM him :)


VisualAd ... when searching for the "a" tag, it also matches the "area" tag. How can I avoid that?
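The "area" match happens because a pattern that starts with <a followed by "anything" also matches <area, <abbr and <address. A common fix is to require a word boundary (\b) right after the tag name. The sketch below only shows the opening-tag part; find_open_tags() is a hypothetical name and not part of the function posted above.

```php
<?php
// Hypothetical helper: return the attribute strings of all opening tags
// with exactly the given name.
function find_open_tags($html, $tagname)
{
    $tagname = preg_quote($tagname, '/');

    // \b stops "a" from matching the start of "area", "abbr" or "address".
    $regex = '/<' . $tagname . '\b([^>]*)>/i';

    preg_match_all($regex, $html, $matches);

    return $matches[1];
}
```

In get_html_tags() the equivalent change would presumably be a \b directly after $tagname in each $regex, although that has not been tested against the function here.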

Matt_T_hat
Oct 11th, 2006, 03:08 AM
How are you at pulling apart code?

I once wrote a plugin that indexes a href with respect to rel="tag". As you want to respect rel="nofollow", this should be close (I indexed rel="tag", whereas you want to ignore rel="nofollow").

I wanted to be light on the CPU and so used no RegEx at all, which I think you might approve of.

I must warn you that it looks a bit complex but is easier to read than pages of RegEx (for me).

The code was a plugin but it should be obvious what is what.

The file is NP_realtags_0.0.1.zip and is found here:
http://freestuff.lordmatt.co.uk/my_downloads/NucleusCMS%20Stuff/

You are welcome to use what you find should you need to.

CornedBee
Oct 11th, 2006, 03:17 AM
I just noticed something in visualAd's very first post. DOM works with HTML, too, but if the HTML is not valid, the resulting tree will be rather unpredictable.
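As a minimal sketch of that point, assuming PHP 5's DOM extension: DOMDocument::loadHTML() parses ordinary HTML, and libxml_use_internal_errors() keeps the warnings about invalid markup quiet. extract_links() is a hypothetical helper name, and how well the resulting tree matches a badly broken page is not guaranteed.

```php
<?php
// Hypothetical helper, not code from this thread.
function extract_links($html)
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect libxml's parse warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        if (!$a->hasAttribute('href')) {
            continue; // named anchors have nothing to spider
        }

        // rel is a space-separated list, e.g. rel="external nofollow"
        $rel = preg_split('/\s+/', strtolower($a->getAttribute('rel')), -1, PREG_SPLIT_NO_EMPTY);
        if (in_array('nofollow', $rel)) {
            continue;
        }

        $links[] = array(
            'href' => $a->getAttribute('href'),
            'text' => trim($a->textContent), // text only, nested tags dropped
        );
    }

    return $links;
}
```

Because getElementsByTagName('a') matches on the element name, area tags are not picked up, and textContent returns the link text without any nested span or font markup.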