Thread: dom

**slice** · Apr 7th, 2006, 07:29 AM

I am back again.

Today I want to learn how to analyze a web page for its anchor links and record them for spidering. I also want to respect the "nofollow" attribute.
The problem is that I don't have much experience with DOM in PHP, so I would appreciate some code samples or examples.

I have tried using regular expressions after reading
http://blogs.worldnomads.com.au/matt...04/06/215.aspx
but that looks more difficult and less complete than a DOM solution.
I would welcome your comments.

Thank you.

**visualAd** · Apr 7th, 2006, 02:00 PM

Firstly, DOM will only work with valid XHTML. Unfortunately, a lot of websites do not use it. If you know the site you will be using has valid XHTML, then you should use DOM.

Both PHP 4 and PHP 5 have support for XML. The latter is W3C compliant and is the better option if you have PHP 5 available. What problems were you encountering with DOM?

I had quite a good PCRE which matches links in web pages. I'll have a look and see if I can dig it out.

**slice** · Apr 8th, 2006, 12:00 AM

If DOM only works with XHTML documents, then I certainly should not use it, because I could be asked to spider any web page, which may or may not be XHTML compliant.
For now, my project is to spider any given web page, store its anchor links, and spider those links just one level deep, skipping any anchor link with rel="nofollow".

I have started with correctly analyzing the web page for anchor links.
What would be the best way to find all anchor links in a web page and then extract their attributes (most importantly rel="nofollow")?

**slice** · Apr 9th, 2006, 01:20 AM

Any help or comments, please?

**visualAd** · Apr 9th, 2006, 02:17 AM

The best option, as shown in the example above, is to use two regular expressions: one to match the anchor tags and the other to match their attributes. The regular expression below does a standard match on an anchor with an href attribute.

Code:

/<a.*href=((\"(.+)\".*>)|((\S+)((\s.*>)|>)))(.+)<\/a>/sU

But, if you want more attributes such as rel matched, I suggest you go for something like the function I've posted below. I use this to extract information about HTML forms and it seems to work OK on all but the most shoddy HTML code. To get all anchors on a page, use the following:

PHP Code:

$anchors = get_html_tags($htmlCode, 'a');

// loop through and find anchors with rel="nofollow" attribute
foreach ($anchors as $anchor) {
    // link text, note: this may contain HTML
    $text = $anchor['text'];

    // look for rel in attributes array
    if (array_key_exists('rel', $anchor['attributes']) &&
        $anchor['attributes']['rel'] == 'nofollow') {
        /* do stuff here */
    }
}
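Since the link text may contain HTML, one possible way to reduce it to plain text is PHP's built-in strip_tags(). The snippet below is only a sketch: the two strings are hypothetical values of $anchor['text'], not output taken from a real page.

```php
<?php
// Hypothetical link texts, shaped like the 'text' element that
// get_html_tags() returns for each anchor.
$texts = array(
    '<span id="testing">World</span>',
    '<span id="testing"><font color="#ffffff">Hello</font></span>',
);

$plain = array();
foreach ($texts as $text) {
    // strip_tags() removes the markup; trim() drops surrounding whitespace.
    $plain[] = trim(strip_tags($text));
}

echo implode("\n", $plain), "\n";
```

This leaves only the visible words of each link. Entities such as &amp;amp; are not decoded by strip_tags(), so html_entity_decode() may be needed as a further step.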

The code for the get_html_tags() function is posted below:

PHP Code:

    /**
     * Matches HTML tags in a string and their attributes.
     *
     * Returns array in the format:
     * Array ([0] => Array ([text] => "text in between tags"
     *                      [attributes] => Array ([attribute1] => [value],
     *                                             [attribute2] => [value])));
     * @param $text string The string to find tags in.
     * @param $tagname string The tag name of the tag to match e.g: form
     * @param $closetag boolean Does the tag have a corresponding close tag, i.e: </$tagname>
     * @param $close_optional If a close tag may be missing, the tag is closed implicitly when another tag of the same name is found.
     *                        Include the name of an additional tag that closes the tag implicitly here.
     * @return array An indexed array, one element for each tag.
     *
     * @author Adam Delves <codedv @ sccode . com >
     */
    function get_html_tags($text, $tagname, $closetag=true, $close_optional=false)
    {
        /* escape PCRE characters in tag name */
        $tagname = preg_quote($tagname);

        $ret = array();

        /* regular expression to match attributes in a tag */
        $attrib_match = "/((?i)[a-z]+) (\s*=\s* ( ((?U)(\"(.*)\")) |(\S+) ) |\s+) /sXi";

        if ($closetag) {
            if ($close_optional !== false) {
                $close_optional = preg_quote($close_optional);
                $regex = "/<$tagname(.*)>(.*)(<\/$tagname>|(?=<$tagname>)|(?=<\/$close_optional>))/Uis";
            } else {
                $regex = "/<$tagname(.*)>(.*)<\/$tagname>/Uis";
            }
        } else {
            $regex = "/<$tagname((.+))\/?>/Uis";
        }

        // regex now matches the tag appropriately
        preg_match_all($regex, $text, $tags, PREG_SET_ORDER);

        $tag_count = count($tags);

        for($t = 0; $t < $tag_count; $t++) {
            $tag = array();

            /* get attributes */
            preg_match_all($attrib_match, $tags[$t][1], $attributes, PREG_SET_ORDER);

            $attribs = array();

            $attrib_count = count($attributes);
            for($a = 0; $a < $attrib_count; $a++) {
                $name = strtolower(trim($attributes[$a][1]));

                if (isset($attributes[$a][4])) { // name value pair found
                    $value = trim(isset($attributes[$a][7])?$attributes[$a][7]:$attributes[$a][6]);
                } else {
                    $value = $name;
                }

                $attribs[$name] = $value;
            }

            $tag['attributes'] = $attribs;

            if ($closetag) {
                $tag['text'] = $tags[$t][2];
            }

            $ret[] = $tag;
        }

        return $ret;
    }
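
One thing to watch for with patterns of the form `<$tagname(.*)>`: nothing requires the tag name to end right after `$tagname`, so a search for `a` can also pick up tags such as `<area>`. A possible adjustment, sketched here as an assumption and not as part of the original function, is to put a `\b` word boundary after the tag name. The sample HTML is hypothetical.

```php
<?php
// Hypothetical sample: two anchors with an <area> tag between them.
$html = '<a href="/one">One</a> <area href="/map" alt="map"> <a href="/two">Two</a>';

// Pattern shape the function builds for the tag name "a".
$loose = '/<a(.*)>(.*)<\/a>/Uis';

// Assumed adjustment: \b means the tag name must end right after "a".
$strict = '/<a\b(.*)>(.*)<\/a>/Uis';

preg_match_all($loose, $html, $loose_tags, PREG_SET_ORDER);
preg_match_all($strict, $html, $strict_tags, PREG_SET_ORDER);

// Collect the captured link text (group 2) from each match.
$loose_text = array();
foreach ($loose_tags as $tag) {
    $loose_text[] = $tag[2];
}

$strict_text = array();
foreach ($strict_tags as $tag) {
    $strict_text[] = $tag[2];
}

print_r($loose_text);  // second entry starts at <area> and swallows the next anchor
print_r($strict_text); // only the text of the two real anchors
```

Inside get_html_tags() the same idea would mean appending `\b` after `$tagname` in each of the three patterns.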

**slice** · Apr 19th, 2006, 10:11 AM

It's really superb code, but
$anchor['attributes']['href']

remains empty all the time.

**visualAd** · Apr 19th, 2006, 12:47 PM

What does the HTML code that you are trying to parse look like?

**slice** · Apr 20th, 2006, 07:21 AM

I am using it this way:

Code:

$htmlCode = file_get_contents("http://www.vbforums.com/");
$anchors = get_html_tags($htmlCode, 'a');
foreach ($anchors as $anchor) {
    $text = $anchor['text'];
    if (array_key_exists('rel', $anchor['attirbutes']) &&
        $anchor['attributes']['rel'] == 'nofollow') {
        echo $text."<br>";
    }
}

But it gives me these errors:
Undefined index: attirbutes
array_key_exists(): The second argument should be either an array or an object

I have also tried:

Code:

$htmlCode = file_get_contents("http://www.vbforums.com");
$anchors = get_html_tags($htmlCode, 'a');
foreach ($anchors as $anchor) {
    $text = $anchor['text'];
    $urlhref = $anchor['attributes']['href'];
    echo ("<a href=".$urlhref.">".$text."</a><br>");
}

But I get the same errors, so I think there is a problem with the function.

I would appreciate your help.

**slice** · Apr 20th, 2006, 07:30 AM

I should mention here that

$anchor['text']

always gets the correct value, but the problem is with the

$anchor['attributes']['rel']
and $anchor['attributes']['href'] array values.

**visualAd** · Apr 20th, 2006, 10:22 AM

If you show me the HTML you are giving the function, I will be able to have a look at it. So far, the HTML I have tried works. So, as I asked in my previous post, post the HTML you are putting into the function and I'll take a look.
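
Separately from the HTML itself, the notice quoted earlier names the index `attirbutes`, which points at a misspelled key in the calling code. A minimal sketch of a more defensive lookup follows. The `$anchor` array here is hypothetical, shaped like one element of what get_html_tags() is documented to return.

```php
<?php
// Hypothetical element with the documented 'text' / 'attributes' layout.
$anchor = array(
    'text'       => 'Example',
    'attributes' => array(
        'href' => 'http://www.example.com/',
        'rel'  => 'nofollow',
    ),
);

// isset() on the full path avoids "Undefined index" notices when an
// attribute is missing; note the spelling 'attributes'.
$rel  = isset($anchor['attributes']['rel'])
      ? strtolower($anchor['attributes']['rel'])
      : '';
$href = isset($anchor['attributes']['href'])
      ? $anchor['attributes']['href']
      : '';

if ($rel === 'nofollow') {
    $decision = 'skip';   // respect rel="nofollow"
} else {
    $decision = 'follow';
}

echo $decision . ' ' . $href . "\n";
```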

**slice** · Apr 20th, 2006, 10:10 PM

The HTML comes from this code:

Code:

$htmlCode = file_get_contents("http://www.vbforums.com");

Isn't that the right way?

**PlaGuE** · Apr 21st, 2006, 12:32 AM

uhhhhh.

**slice** · Apr 21st, 2006, 01:02 AM

PlaGuE, is anything wrong with my code?

**slice** · Apr 21st, 2006, 07:38 AM

I have found where the problem is in the code.

PHP Code:

preg_match_all($attrib_match, $tags[$t][1], $attributes, PREG_SET_ORDER);

$attribs = array();

$attrib_count = count($attributes);
echo $attrib_count."<br>";
for($a = 0; $a < $attrib_count; $a++) {
    $name = strtolower(trim($attributes[$a][1]));
    echo $name."<br>";

    if (isset($attributes[$a][4])) { // name value pair found
        $value = trim(isset($attributes[$a][7])?$attributes[$a][7]:$attributes[$a][6]);
    } else {
        $value = $name;
    }

Here $name is always empty, so I think this is where the problem lies.

It may be because of this regular expression:

$attrib_match = "/((?i)[a-z]+) (\s*=\s* ( ((?U)(\"(.*)\")) |(\S+) ) |\s+) /sXi";

I am no master of regular expressions, so kindly help me with this last problem.

Thank you.

**visualAd** · Apr 21st, 2006, 03:54 PM

Yes, the problem is with the regular expression. I wrote it using PHP 5 (where it works). In PHP 4 the X modifier (which, among other things, is meant to cause the compiler to ignore whitespace in the expression) is being ignored and the expression is not matching.

Remove the X modifier and the whitespace and it will work:

Code:

$attrib_match = "/((?i)[a-z]+)(\s*=\s*(((?U)(\"(.*)\"))|(\S+))|\s+)/si";
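
As a quick check, here is a sketch that runs the whitespace-free pattern against a hypothetical attribute string. It assumes the unquoted-value branch is `\S+` (non-whitespace), as in the original function. For reference, the PHP manual lists lowercase `x` (PCRE_EXTENDED) as the modifier that ignores whitespace in a pattern, while uppercase `X` is PCRE_EXTRA, so the spaced-out pattern is also tried with lowercase `x`.

```php
<?php
// Whitespace-free pattern; the unquoted-value branch is assumed to be \S+.
$compact = "/((?i)[a-z]+)(\s*=\s*(((?U)(\"(.*)\"))|(\S+))|\s+)/si";

// Spaced-out pattern with lowercase x (PCRE_EXTENDED) instead of X.
$spaced = "/((?i)[a-z]+) (\s*=\s* ( ((?U)(\"(.*)\")) |(\S+) ) |\s+) /sxi";

// Hypothetical attribute string, as captured from inside an <a ...> tag.
$sample = ' href="http://www.example.com/" rel=nofollow';

// Same extraction logic as the attribute loop in get_html_tags().
function parse_attribs($pattern, $subject)
{
    preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);

    $attribs = array();
    foreach ($matches as $m) {
        $name = strtolower(trim($m[1]));
        if (isset($m[4])) { // name value pair found
            // group 7 holds an unquoted value, group 6 a quoted one
            $value = trim(isset($m[7]) ? $m[7] : $m[6]);
        } else {
            $value = $name;
        }
        $attribs[$name] = $value;
    }
    return $attribs;
}

$from_compact = parse_attribs($compact, $sample);
$from_spaced  = parse_attribs($spaced, $sample);

print_r($from_compact);
```

Both patterns should yield the same href and rel entries for this sample.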

**slice** · Sep 30th, 2006, 04:27 AM

Hi VisualAd,

Would you help me refine this code? At the moment it extracts everything between <a> and </a>. For example, with

Code:

<a href="http://www.vbforums.com"><span id="testing">World</span></a>

it will give <span id="testing">World</span>

and any font tags are also included:

Code:

<a href="http://www.vbforums.com"><span id="testing"><font color="#ffffff">Hello</font></span></a>

it will give <span id="testing"><font color="#ffffff">Hello</font></span> as the anchor text.

What changes would help to extract only "World" in the first case and only "Hello" in the second?

I am really weak at regular expressions.

Please help.

**slice** · Oct 9th, 2006, 10:47 AM

Please, can anybody help?

**kows** · Oct 9th, 2006, 04:28 PM

I looked but I'm not great with complex regular expressions, sorry.

You might be able to PM VisualAd and ask for their help directly, in case they just haven't read this forum lately.

**slice** · Oct 10th, 2006, 07:21 AM

Well, I will PM him.

VisualAd ... when I search for the "a" tag, it also includes the "area" tag. How can I avoid that?

**Matt_T_hat** · Oct 11th, 2006, 03:08 AM

How are you at pulling apart code?

I once wrote a plugin that indexes a href with respect to rel="tag". Since you want to respect rel="nofollow", this should be close (I indexed rel="tag", whereas you want to ignore rel="nofollow").

I wanted to be light on the CPU and so used no RegEx at all, which I think you might approve of.

I must warn you that it looks a bit complex, but it is easier to read than pages of RegEx (for me).

The code was a plugin, but it should be obvious what is what.

The file is NP_realtags_0.0.1.zip and is found here:
http://freestuff.lordmatt.co.uk/my_d...usCMS%20Stuff/

You are welcome to use what you find should you need to.

**CornedBee** · Oct 11th, 2006, 03:17 AM

I just noticed something in visualAd's very first post. DOM works with HTML, too, but if the HTML is not valid, the resulting tree will be rather unpredictable.
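
For anyone who wants to try the DOM route on ordinary HTML, here is a minimal sketch, assuming PHP 5's DOM extension is available. The sample markup is hypothetical and deliberately a little sloppy (an unclosed p). libxml_use_internal_errors() is used only to keep parser warnings out of the output.

```php
<?php
// Hypothetical sample page: unclosed <p>, nested markup in a link,
// one rel="nofollow" link, and an <area> that should not be collected.
$html = '<html><body><p>Links:'
      . '<a href="http://www.example.com/a"><span id="testing">World</span></a>'
      . '<a href="http://www.example.com/b" rel="nofollow">Skip me</a>'
      . '<map name="m"><area href="http://www.example.com/c" alt="c"></map>'
      . '</body></html>';

// Collect parse warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$links = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    // rel may hold several space-separated tokens, e.g. "external nofollow"
    $tokens = preg_split('/\s+/', strtolower($a->getAttribute('rel')), -1, PREG_SPLIT_NO_EMPTY);
    if (in_array('nofollow', $tokens, true)) {
        continue; // respect rel="nofollow"
    }
    $links[] = array(
        'href' => $a->getAttribute('href'),
        'text' => trim($a->textContent), // text only, nested tags dropped
    );
}

print_r($links);
```

getElementsByTagName('a') matches only a elements, so area is not picked up, and textContent gives the link text without nested span or font markup. How the parser repairs badly broken pages can vary, so results on real sites should be checked.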
