
Thread: dom

  1. #1

    Thread Starter
    Addicted Member
    Join Date
    Jan 2006
    Location
    Osaka
    Posts
    200

    dom

    I am back again.
    Today I want to learn how to analyze a webpage for its anchor links and record them for spidering. I also want to respect the "nofollow" attribute.
    The problem is that I don't have much experience with DOM in PHP, so I would appreciate some code samples or examples.

    I have tried using regular expressions after reading
    http://blogs.worldnomads.com.au/matt...04/06/215.aspx
    but that looks more difficult and less complete than a DOM solution.
    Any comments would help.

    Thank You.

  2. #2
    VBA Nutter visualAd's Avatar
    Join Date
    Apr 2002
    Location
    Ickenham, UK
    Posts
    4,906

    Re: dom

    Firstly, DOM will only work with valid XHTML. Unfortunately, a lot of websites do not use it. If you know the site you will be parsing has valid XHTML, then you should use DOM.

    Both PHP 4 and PHP 5 have support for XML. The latter is W3C compliant and the better option if you have PHP 5 available. What problems were you encountering with DOM?

    I had quite a good PCRE which matches links in web pages. I'll have a look and see if I can dig it out.
    PHP || MySql || Apache || Get Firefox || OpenOffice.org || Click || Slap ILMV || 1337 c0d || GotoMyPc For FREE! Part 1, Part 2

    | PHP Session --> Database Handler * Custom Error Handler * Installing PHP * HTML Form Handler * PHP 5 OOP * Using XML * Ajax * Xslt | VB6 Winsock - HTTP POST / GET * Winsock - HTTP File Upload

    Latest quote: crptcblade - VB6 executables can't be decompiled, only disassembled. And the disassembled code is even less useful than I am.

    Random VisualAd: Blog - Latest Post: When the Internet becomes Electricity!!


    Spread happiness and joy. Rate good posts.

  3. #3

    Thread Starter
    Addicted Member
    Join Date
    Jan 2006
    Location
    Osaka
    Posts
    200

    Re: dom

    If DOM only works with XHTML documents, then I should not use it, because I could be asked to spider any webpage, which may or may not be XHTML compliant.
    For now my project is to spider any given webpage, store its anchor links, follow each anchor link just one level deep, and skip any anchor link with rel="nofollow".

    I have to start by correctly analyzing the webpage for anchor links.
    What would be the best way to find all anchor links in a webpage and then extract their attributes (most importantly rel="nofollow")?
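The crawl described above (record the links on the start page, follow each one a single level, skip nofollow links) could be sketched roughly as follows. This is only an outline under stated assumptions: crawl_one_level(), $fetch and $extract are hypothetical names, the pages are stubbed rather than downloaded, and link extraction is left to whichever method is chosen later in the thread.

```php
<?php
// Hypothetical names throughout: crawl_one_level(), $fetch and $extract are
// illustrative and not part of any library.
//   $fetch($url)    returns the page HTML as a string, or false on failure
//   $extract($html) returns a list of array('href' => string, 'nofollow' => bool)
function crawl_one_level($startUrl, $fetch, $extract)
{
    $result = array();               // page URL => links recorded on that page
    $html = $fetch($startUrl);
    if ($html === false) {
        return $result;
    }
    $result[$startUrl] = $extract($html);

    foreach ($result[$startUrl] as $link) {
        $url = $link['href'];
        if ($link['nofollow'] || isset($result[$url])) {
            continue;                // respect nofollow, avoid fetching a page twice
        }
        $childHtml = $fetch($url);
        if ($childHtml !== false) {
            // One level deep: record the child's links but do not follow them.
            $result[$url] = $extract($childHtml);
        }
    }
    return $result;
}

// Usage with stubbed pages instead of real HTTP requests.
$pages = array(
    'http://example.test/'  => 'home',
    'http://example.test/a' => 'page-a',
);
$links = array(
    'home' => array(
        array('href' => 'http://example.test/a', 'nofollow' => false),
        array('href' => 'http://example.test/b', 'nofollow' => true),
    ),
    'page-a' => array(
        array('href' => 'http://example.test/c', 'nofollow' => false),
    ),
);
$fetch = function ($url) use ($pages) {
    return isset($pages[$url]) ? $pages[$url] : false;
};
$extract = function ($html) use ($links) {
    return $links[$html];
};

$crawled = crawl_one_level('http://example.test/', $fetch, $extract);
```

In a real spider $fetch would probably wrap file_get_contents(), and relative href values would need to be resolved against the page URL before being fetched; neither is handled in this sketch.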

  4. #4

    Thread Starter
    Addicted Member
    Join Date
    Jan 2006
    Location
    Osaka
    Posts
    200

    Re: dom

    Any help or comments, please?

  5. #5
    VBA Nutter visualAd's Avatar
    Join Date
    Apr 2002
    Location
    Ickenham, UK
    Posts
    4,906

    Re: dom

    The best option, as shown in the example above, is to use two regular expressions: one to match the anchor tags and the other to match their attributes. The regular expression below does a standard match on an anchor tag with an href attribute.
    Code:
    /<a.*href=((\"(.+)\".*>)|((\S+)((\s.*>)|>)))(.+)<\/a>/sU
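A minimal sketch of how the expression above might be applied with preg_match_all(). The sample HTML and the variable names are made up for illustration; the group numbers come from counting the parentheses in the pattern.

```php
<?php
// The pattern from the post, written as a single-quoted PHP string.
$pattern = '/<a.*href=((\"(.+)\".*>)|((\S+)((\s.*>)|>)))(.+)<\/a>/sU';

// Made-up sample input.
$html = '<p><a href="/one">One</a> and <a href="/two" rel="nofollow">Two</a></p>';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$found = array();
foreach ($matches as $m) {
    // Group 3 holds a quoted URL, group 5 an unquoted one, group 8 the link text.
    $url = $m[3] !== '' ? $m[3] : $m[5];
    $found[] = array('url' => $url, 'text' => $m[8]);
}
```

The pattern only captures the URL and the link text, so other attributes such as rel would still need a second pass.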
    But, if you want more attributes such as rel matched, I suggest you go for something like the function I've posted below. I use this to extract information about HTML forms and it seems to work OK on all but the most shoddy HTML code. To get all anchors on a page, use the following:
    PHP Code:
    $anchors = get_html_tags($htmlCode, 'a');

    // loop through and find anchors with rel="nofollow" attribute
    foreach ($anchors as $anchor) {
        // link text, note: this may contain HTML
        $text = $anchor['text'];

        // look for rel in attributes array
        if (array_key_exists('rel', $anchor['attributes']) &&
            $anchor['attributes']['rel'] == 'nofollow') {
            /* do stuff here */
        }
    }

    The code for the get_html_tags() function is posted below:
    PHP Code:
    /**
     * Matches HTML tags in a string and their attributes.
     *
     * Returns array in the format:
     * Array ([0] => Array ([text] => "text in between tags"
     *                      [attributes] => Array ([attribute1] => [value],
     *                                             [attribute2] => [value])));
     * @param $text string The string to find tags in.
     * @param $tagname string The tag name of the tag to match e.g: form
     * @param $closetag boolean Does the tag have corresponding close tags. i.e: </$tagname>
     * @param $close_optional If a close tag may be missing, it is closed implicitly when another tag of the same name is found.
     *                        Include the name of an additional tag that closes the tag implicitly, here.
     * @return array An indexed array, one element for each tag.
     *
     * @author Adam Delves <codedv @ sccode . com >
     */
    function get_html_tags($text, $tagname, $closetag=true, $close_optional=false)
    {
        /* escape PCRE characters in tag name */
        $tagname = preg_quote($tagname);

        $ret = array();

        /* regular expression to match attributes in a tag */
        $attrib_match = "/((?i)[a-z]+) (\s*=\s* ( ((?U)(\"(.*)\")) |(\S+) ) |\s+) /sXi";

        if ($closetag) {
            if ($close_optional !== false) {
                $close_optional = preg_quote($close_optional);
                $regex = "/<$tagname(.*)>(.*)(<\/$tagname>|(?=<$tagname>)|(?=<\/$close_optional>))/Uis";
            } else {
                $regex = "/<$tagname(.*)>(.*)<\/$tagname>/Uis";
            }
        } else {
            $regex = "/<$tagname((.+))\/?>/Uis";
        }

        // regex now matches the tag appropriately
        preg_match_all($regex, $text, $tags, PREG_SET_ORDER);

        $tag_count = count($tags);

        for ($t = 0; $t < $tag_count; $t++) {
            $tag = array();

            /* get attributes */
            preg_match_all($attrib_match, $tags[$t][1], $attributes, PREG_SET_ORDER);

            $attribs = array();

            $attrib_count = count($attributes);
            for ($a = 0; $a < $attrib_count; $a++) {
                $name = strtolower(trim($attributes[$a][1]));

                if (isset($attributes[$a][4])) { // name value pair found
                    $value = trim(isset($attributes[$a][7]) ? $attributes[$a][7] : $attributes[$a][6]);
                } else {
                    $value = $name;
                }

                $attribs[$name] = $value;
            }

            $tag['attributes'] = $attribs;

            if ($closetag) {
                $tag['text'] = $tags[$t][2];
            }

            $ret[] = $tag;
        }

        return $ret;
    }

  6. #6

    Thread Starter
    Addicted Member
    Join Date
    Jan 2006
    Location
    Osaka
    Posts
    200

    Re: dom

    It's really superb code, but
    $anchor['attributes']['href']

    always remains empty.

  7. #7
    VBA Nutter visualAd's Avatar
    Join Date
    Apr 2002
    Location
    Ickenham, UK
    Posts
    4,906

    Re: dom

    What does the HTML code that you are trying to parse look like?

  8. #8

    Thread Starter
    Addicted Member
    Join Date
    Jan 2006
    Location
    Osaka
    Posts
    200

    Re: dom

    I am using it this way:
    Code:
    $htmlCode = file_get_contents("http://www.vbforums.com/");
    $anchors = get_html_tags($htmlCode, 'a');
    foreach ($anchors as $anchor) {
        $text = $anchor['text'];
        if (array_key_exists('rel', $anchor['attirbutes']) &&
            $anchor['attributes']['rel'] == 'nofollow') {
            echo $text."<br>";
        }
    }
    But it gives me these errors:
    Undefined index: attirbutes
    array_key_exists(): The second argument should be either an array or an object


    I have also tried

    Code:
    $htmlCode = file_get_contents("http://www.vbforums.com");
    $anchors = get_html_tags($htmlCode, 'a');
    foreach ($anchors as $anchor) {
        $text = $anchor['text'];
        $urlhref = $anchor['attributes']['href'];
        echo ("<a href=".$urlhref.">".$text."</a><br>");
    }
    But I get the same errors, so I think there is a problem with the function.

    I would appreciate your help.
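The "Undefined index: attirbutes" notice above points at the misspelled array key in the first snippet: the function returns an 'attributes' entry, so 'attirbutes' is never set and array_key_exists() receives null. A guarded lookup along these lines would avoid both messages. It is only a sketch: is_nofollow() is a hypothetical helper, and the array shape is taken from the doc comment of get_html_tags().

```php
<?php
// Hypothetical helper name; the array shape follows the doc comment of get_html_tags().
function is_nofollow($anchor)
{
    // Guard against a missing 'attributes' entry instead of assuming it exists.
    $attributes = isset($anchor['attributes']) && is_array($anchor['attributes'])
        ? $anchor['attributes']
        : array();

    if (!isset($attributes['rel'])) {
        return false;
    }
    // rel can list several tokens, so compare token by token, ignoring case.
    $tokens = preg_split('/\s+/', strtolower($attributes['rel']), -1, PREG_SPLIT_NO_EMPTY);
    return in_array('nofollow', $tokens, true);
}

// Made-up sample record in the shape the function is documented to return.
$anchor = array(
    'text'       => 'Forum rules',
    'attributes' => array('href' => '/rules.php', 'rel' => 'nofollow'),
);
echo is_nofollow($anchor) ? "skip\n" : "follow\n";
```

This does not explain why href stays empty in the second snippet; that part is tracked down further in the thread.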
    Last edited by slice; Apr 20th, 2006 at 07:25 AM.

  9. #9

    Thread Starter
    Addicted Member
    Join Date
    Jan 2006
    Location
    Osaka
    Posts
    200

    Re: dom

    I should mention that

    $anchor['text']

    always gets the correct value; the problem is with the

    $anchor['attributes']['rel']
    and $anchor['attributes']['href'] array values.

  10. #10
    VBA Nutter visualAd's Avatar
    Join Date
    Apr 2002
    Location
    Ickenham, UK
    Posts
    4,906

    Re: dom

    If you show me the HTML you are giving the function, I will be able to have a look at it. So far the HTML I have tried works. So, as I asked in my previous post, post the HTML you are putting into the function and I'll take a look.

  11. #11

    Thread Starter
    Addicted Member
    Join Date
    Jan 2006
    Location
    Osaka
    Posts
    200

    Re: dom

    The HTML comes from this code:
    Code:
    $htmlCode = file_get_contents("http://www.vbforums.com");
    Isn't that the right way?

  12. #12
    Hyperactive Member PlaGuE's Avatar
    Join Date
    Jun 2005
    Location
    in ur mind.
    Posts
    445

    Re: dom

    uhhhhh.
    Without balance, there could only be chaos.
    Without chaos, there could be no balance.
    I live with karma. Eat with destiny. Dream of life without shackles....
    Yet. If life had no consequences, life could not exist, nor could it flourish.


    If at first you dont succeed.You're screwed.

    C++/Java NOOB.

    I aint a professional at PHP, but if i can help i will.

  13. #13

    Thread Starter
    Addicted Member
    Join Date
    Jan 2006
    Location
    Osaka
    Posts
    200

    Re: dom

    PlaGuE, is anything wrong with my code?

  14. #14

    Thread Starter
    Addicted Member
    Join Date
    Jan 2006
    Location
    Osaka
    Posts
    200

    Re: dom

    I have found where the problem is in the code.

    PHP Code:
    preg_match_all($attrib_match, $tags[$t][1], $attributes, PREG_SET_ORDER);

    $attribs = array();

    $attrib_count = count($attributes);
    echo $attrib_count."<br>";
    for ($a = 0; $a < $attrib_count; $a++) {
        $name = strtolower(trim($attributes[$a][1]));
        echo $name."<br>";

        if (isset($attributes[$a][4])) { // name value pair found
            $value = trim(isset($attributes[$a][7]) ? $attributes[$a][7] : $attributes[$a][6]);
        } else {
            $value = $name;
        }
    }

    Here $name is always empty, so I think the problem is there.

    It may be caused by this regular expression:

    $attrib_match = "/((?i)[a-z]+) (\s*=\s* ( ((?U)(\"(.*)\")) |(\S+) ) |\s+) /sXi";
    I am no master of regular expressions, so kindly help me with this last problem.

    Thank You.

  15. #15
    VBA Nutter visualAd's Avatar
    Join Date
    Apr 2002
    Location
    Ickenham, UK
    Posts
    4,906

    Re: dom

    Yes, the problem is with the regular expression. I wrote it using PHP 5 (where it works). In PHP 4 the X modifier (which, among other things, is meant to cause the compiler to ignore whitespace in the expression) is being ignored and the expression is not matching.

    Remove the X modifier and the white space and it will work:
    Code:
    $attrib_match = "/((?i)[a-z]+)(\s*=\s*(((?U)(\"(.*)\"))|(\S+))|\s+)/si";
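A side note on the modifiers: in PHP's PCRE functions the lowercase x modifier (PCRE_EXTENDED) is the one that ignores whitespace inside a pattern, while uppercase X is PCRE_EXTRA, a different option. That may be why the spaced-out pattern failed to match. The sketch below illustrates the difference with a simplified attribute pattern that is not the one from the thread; the variable names are made up.

```php
<?php
// Same simplified attribute pattern written two ways (quoted values only).
$spaced  = '/([a-z]+) \s*=\s* "([^"]*)"/i';   // no x: the spaces must match literally
$spacedX = '/([a-z]+) \s*=\s* "([^"]*)"/ix';  // lowercase x: the spaces are ignored

$tagBody = 'href="/index.php"';

// Fails, because the subject has no space between the name and "=".
$literal = preg_match($spaced, $tagBody, $m1);

// Succeeds, because the whitespace in the pattern is only layout.
$extended = preg_match($spacedX, $tagBody, $m2);
```

Either keeping the spaces and switching to lowercase x, or removing the spaces as in the post, should give an equivalent pattern.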

  16. #16

    Thread Starter
    Addicted Member
    Join Date
    Jan 2006
    Location
    Osaka
    Posts
    200

    Re: dom

    Hi VisualAd,

    Would you help me refine this code? At the moment it extracts everything between <a> and </a>. For example,
    Code:
    <a href="http://www.vbforums.com"><span id="testing">World</span></a>
    gives <span id="testing">World</span>

    and any font tags are included too:

    Code:
    <a href="http://www.vbforums.com"><span id="testing"><font color="#ffffff">Hello</font></span></a>
    gives <span id="testing"><font color="#ffffff">Hello</font></span> as the anchor text.

    What changes would extract only "World" in the first case and "Hello" in the second?

    I am really weak at regular expressions.

    Please help.
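One way to get the bare text without touching the regular expression is to run the extracted anchor text through PHP's built-in strip_tags(). A minimal sketch using the two examples above ($samples and $plain are illustrative names):

```php
<?php
// The two anchor texts from the examples above.
$samples = array(
    '<span id="testing">World</span>',
    '<span id="testing"><font color="#ffffff">Hello</font></span>',
);

$plain = array();
foreach ($samples as $text) {
    // strip_tags() removes the markup and keeps the text between the tags.
    $plain[] = trim(strip_tags($text));
}
```

Inside the loop over get_html_tags() results this would amount to something like trim(strip_tags($anchor['text'])). Entities such as &amp; are left as they are and would need html_entity_decode() if plain characters are wanted.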

  17. #17

    Thread Starter
    Addicted Member
    Join Date
    Jan 2006
    Location
    Osaka
    Posts
    200

    Re: dom

    Can anybody help, please?

  18. #18
    PowerPoster
    Join Date
    Sep 2003
    Location
    Edmonton, AB, Canada
    Posts
    2,629

    Re: dom

    I looked, but I'm not great with complex regular expressions, sorry.

    You might be able to PM VisualAd and ask for their help directly, in case they just haven't read this forum lately.
    Like Archer? Check out some Sterling Archer quotes.

  19. #19

    Thread Starter
    Addicted Member
    Join Date
    Jan 2006
    Location
    Osaka
    Posts
    200

    Re: dom

    Well, I will PM him.


    VisualAd ... when I search for the "a" tag, it also matches the "area" tag. How can I avoid that?
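A likely cause is that the pattern only checks that the tag starts with "<a", which "<area" also does. Requiring a word boundary (\b) right after the tag name is one possible fix. The sketch below compares the two pattern shapes on a made-up snippet; it has not been tried against every branch of get_html_tags().

```php
<?php
// Made-up sample with an <area> tag before a real anchor.
$html = '<map><area href="/map"></map><a href="/y">Y</a>';

// Pattern shape used by get_html_tags() for $tagname = 'a'.
$loose = '/<a(.*)>(.*)<\/a>/Uis';
// \b requires the tag name to end right after "a".
$strict = '/<a\b(.*)>(.*)<\/a>/Uis';

preg_match_all($loose, $html, $looseTags, PREG_SET_ORDER);
preg_match_all($strict, $html, $strictTags, PREG_SET_ORDER);

// $looseTags starts matching at "<area", $strictTags only at "<a href".
```

In get_html_tags() this would presumably mean adding \b after $tagname in each of the three patterns.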

  20. #20
    Fanatic Member Matt_T_hat's Avatar
    Join Date
    Dec 2001
    Location
    '76 Male Body Evil-Errors: 666
    Posts
    774

    Re: dom

    How are you at pulling apart code?

    I once wrote a plugin that indexes a href with respect to rel="tag". Since you want to respect rel="nofollow", this should be close (I indexed rel="tag", whereas you want to ignore rel="nofollow").

    I wanted to be light on the CPU and so used no RegEx at all, which I think you might approve of.

    I must warn you that it looks a bit complex, but it is easier to read than pages of RegEx (for me).

    The code was a plugin, but it should be obvious what is what.

    The file is NP_realtags_0.0.1.zip and is found here:
    http://freestuff.lordmatt.co.uk/my_d...usCMS%20Stuff/

    You are welcome to use what you find should you need to.
    ?
    'What's this bit for anyway?
    For Jono

  21. #21
    Kitten CornedBee's Avatar
    Join Date
    Aug 2001
    Location
    In a microchip!
    Posts
    11,594

    Re: dom

    I just noticed something in visualAd's very first post. DOM works with HTML, too, but if the HTML is not valid, the resulting tree will be rather unpredictable.
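For anyone who wants to try the DOM route on real-world markup, a minimal sketch with PHP 5's DOM extension might look like the following. DOMDocument::loadHTML() uses libxml's HTML parser, which tolerates unclosed tags and unquoted attributes; the sample markup and variable names are made up for illustration.

```php
<?php
// Deliberately sloppy markup: unquoted attribute, unclosed <p> and <li>, nested tags.
$html = '<html><body><p>Intro'
      . '<ul><li><a href=/one>One'
      . '</a><li><a href="/two" rel="nofollow"><span>Two</span></a></ul>'
      . '</body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$links = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    $links[] = array(
        'href' => $a->getAttribute('href'),
        'rel'  => $a->getAttribute('rel'),    // empty string when the attribute is absent
        'text' => trim($a->textContent),      // text only, nested tags are not included
    );
}
```

Because elements are selected by name, area tags are not picked up, and textContent returns the link text without the nested span or font markup. How the parser repairs badly broken pages can still vary, as noted above.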
    All the buzzt
    CornedBee

    "Writing specifications is like writing a novel. Writing code is like writing poetry."
    - Anonymous, published by Raymond Chen

    Don't PM me with your problems, I scan most of the forums daily. If you do PM me, I will not answer your question.
