Thread: dom
-
Apr 7th, 2006, 07:29 AM
#1
Thread Starter
Addicted Member
dom
I am back again.
Today I want to learn how to analyze a web page for its anchor links and record them for spidering. I also want to respect the "nofollow" attribute.
The problem is that I don't have much experience using DOM with PHP, so I would appreciate some code samples or examples.
I have tried using regular expressions after reading
http://blogs.worldnomads.com.au/matt...04/06/215.aspx
but that looks more difficult and less complete than a DOM solution.
Any comments would help.
Thank you.
-
Apr 7th, 2006, 02:00 PM
#2
Re: dom
Firstly, DOM will only work with valid XHTML. Unfortunately, a lot of websites do not use it. If you know the site you will be parsing has valid XHTML, then you should use DOM.
Both PHP 4 and PHP 5 have support for XML. The latter is W3C compliant and the better option if you have PHP 5 available. What problems were you encountering with DOM?
I had quite a good PCRE which matches links in web pages. I'll have a look and see if I can dig it out.
-
Apr 8th, 2006, 12:00 AM
#3
Thread Starter
Addicted Member
Re: dom
If DOM only works with XHTML documents then I certainly should not use it, because I could be asked to spider any web page, which may or may not be XHTML compliant.
For now my project is to spider any given web page, store its anchor links, follow each anchor link just one level deep, and skip anchor links with rel="nofollow".
I have started with correctly analyzing the web page for anchor links.
What would be the best way to find all anchor links in a web page and then extract their attributes (most importantly rel="nofollow")?
-
Apr 9th, 2006, 01:20 AM
#4
Thread Starter
Addicted Member
Re: dom
Any help or comments, please?
-
Apr 9th, 2006, 02:17 AM
#5
Re: dom
The best option, as shown in the example above, is to use two regular expressions: one to match the anchor tags and the other to match their attributes. The regular expression below does a standard match on an anchor tag with an href attribute.
Code:
/<a.*href=((\"(.+)\".*>)|((\S+)((\s.*>)|>)))(.+)<\/a>/sU
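As a rough illustration of how that expression could be applied, here is a minimal sketch. The sample HTML is made up for the example, and the group numbers (3 for a quoted URL, 5 for an unquoted one, 8 for the link text) are read off the pattern rather than stated in the post.

```php
<?php
// The link-matching expression from the post above, unchanged.
$link_regex = '/<a.*href=((\"(.+)\".*>)|((\S+)((\s.*>)|>)))(.+)<\/a>/sU';

// Hypothetical sample input, not taken from the thread.
$html = 'See <a href="http://example.com/">Example</a> and <a href="/about">About</a>.';

// PREG_SET_ORDER gives one array per matched anchor.
preg_match_all($link_regex, $html, $links, PREG_SET_ORDER);

foreach ($links as $link) {
    // Group 3 holds a quoted URL, group 5 an unquoted one, group 8 the link text.
    $url  = ($link[3] !== '') ? $link[3] : $link[5];
    $text = $link[8];
    echo $url, ' => ', $text, "\n";
}
```

This only pulls out the URL and the link text; other attributes such as rel are not captured by this expression.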
But, if you want more attributes such as rel matched, I suggest you go for something like the function I've posted below. I use this to extract information about HTML forms and it seems to work OK on all but the most shoddy HTML code. To get all anchors on a page, use the following:
PHP Code:
$anchors = get_html_tags($htmlCode, 'a');
// loop through and find anchors with the rel="nofollow" attribute
foreach ($anchors as $anchor) {
    // link text, note: this may contain HTML
    $text = $anchor['text'];
    // look for rel in the attributes array
    if (array_key_exists('rel', $anchor['attributes']) &&
        $anchor['attributes']['rel'] == 'nofollow') {
        /* do stuff here */
    }
}
The code for the get_html_tags() function is posted below:
PHP Code:
/**
 * Matches HTML tags in a string and their attributes.
 *
 * Returns array in the format:
 * Array ([0] => Array ([text] => "text in between tags"
 *                      [attributes] => Array ([attribute1] => [value],
 *                                             [attribute2] => [value])));
 * @param $text string The string to find tags in.
 * @param $tagname string The tag name of the tag to match, e.g. form
 * @param $closetag boolean Does the tag have a corresponding close tag, i.e. </$tagname>?
 * @param $close_optional If the close tag may be missing, the tag is closed implicitly when another tag of the same name is found.
 *                        Include the name of an additional tag that closes the tag implicitly here.
 * @return array An indexed array, one element for each tag.
 *
 * @author Adam Delves <codedv @ sccode . com >
 */
function get_html_tags($text, $tagname, $closetag=true, $close_optional=false)
{
    /* escape PCRE characters (and the / delimiter) in the tag name */
    $tagname = preg_quote($tagname, '/');
    $ret = array();
    /* regular expression to match attributes in a tag */
    $attrib_match = "/((?i)[a-z]+) (\s*=\s* ( ((?U)(\"(.*)\")) |(\S+) ) |\s+) /sXi";
    if ($closetag) {
        if ($close_optional !== false) {
            $close_optional = preg_quote($close_optional, '/');
            $regex = "/<$tagname(.*)>(.*)(<\/$tagname>|(?=<$tagname>)|(?=<\/$close_optional>))/Uis";
        } else {
            $regex = "/<$tagname(.*)>(.*)<\/$tagname>/Uis";
        }
    } else {
        $regex = "/<$tagname((.+))\/?>/Uis";
    }
    // regex now matches the tag appropriately
    preg_match_all($regex, $text, $tags, PREG_SET_ORDER);
    $tag_count = count($tags);
    for ($t = 0; $t < $tag_count; $t++) {
        $tag = array();
        /* get attributes */
        preg_match_all($attrib_match, $tags[$t][1], $attributes, PREG_SET_ORDER);
        $attribs = array();
        $attrib_count = count($attributes);
        for ($a = 0; $a < $attrib_count; $a++) {
            $name = strtolower(trim($attributes[$a][1]));
            if (isset($attributes[$a][4])) { // name value pair found
                // group 7 is an unquoted value, group 6 a quoted one
                $value = trim(isset($attributes[$a][7])?$attributes[$a][7]:$attributes[$a][6]);
            } else {
                // attribute without a value: use its name as the value
                $value = $name;
            }
            $attribs[$name] = $value;
        }
        $tag['attributes'] = $attribs;
        if ($closetag) {
            $tag['text'] = $tags[$t][2];
        }
        $ret[] = $tag;
    }
    return $ret;
}
-
Apr 19th, 2006, 10:11 AM
#6
Thread Starter
Addicted Member
Re: dom
It's really superb code. But
$anchor['attributes']['href']
remains empty all the time.
-
Apr 19th, 2006, 12:47 PM
#7
Re: dom
What does the HTML code that you are trying to parse look like?
-
Apr 20th, 2006, 07:21 AM
#8
Thread Starter
Addicted Member
Re: dom
I am using it this way:
Code:
$htmlCode = file_get_contents("http://www.vbforums.com/");
$anchors = get_html_tags($htmlCode, 'a');
foreach($anchors as $anchor) {
$text = $anchor['text'];
if (array_key_exists('rel', $anchor['attirbutes']) &&
$anchor['attributes']['rel'] == 'nofollow') {
echo $text."<br>";
}
}
But it gives me these errors:
Undefined index: attirbutes
array_key_exists(): The second argument should be either an array or an object
I have also tried
Code:
$htmlCode = file_get_contents("http://www.vbforums.com");
$anchors = get_html_tags($htmlCode, 'a');
foreach($anchors as $anchor) {
$text = $anchor['text'];
$urlhref = $anchor['attributes']['href'];
echo ("<a href=".$urlhref.">".$text."</a><br>");
}
But I get the same errors. I think there is a problem with the function.
I would appreciate your help.
Last edited by slice; Apr 20th, 2006 at 07:25 AM.
-
Apr 20th, 2006, 07:30 AM
#9
Thread Starter
Addicted Member
Re: dom
Here I should mention that
$anchor['text']
always gets the correct value, but the problem is with the
$anchor['attributes']['rel']
and $anchor['attributes']['href'] array values.
-
Apr 20th, 2006, 10:22 AM
#10
Re: dom
If you show me the HTML you are giving the function, I will be able to have a look at it. So far the HTML I have tried works. So, as I asked in my previous post, post the HTML you are putting into the function and I'll take a look.
-
Apr 20th, 2006, 10:10 PM
#11
Thread Starter
Addicted Member
Re: dom
The HTML comes from this code:
Code:
$htmlCode = file_get_contents("http://www.vbforums.com");
Isn't that the right way?
-
Apr 21st, 2006, 12:32 AM
#12
Hyperactive Member
-
Apr 21st, 2006, 01:02 AM
#13
Thread Starter
Addicted Member
Re: dom
Plague, is anything wrong with my code?
-
Apr 21st, 2006, 07:38 AM
#14
Thread Starter
Addicted Member
Re: dom
I have found where the problem is in the code.
PHP Code:
preg_match_all($attrib_match, $tags[$t][1], $attributes, PREG_SET_ORDER);
$attribs = array();
$attrib_count = count($attributes);
echo $attrib_count."<br>";
for($a = 0; $a < $attrib_count; $a++) {
$name = strtolower(trim($attributes[$a][1]));
echo $name."<br>";
if (isset($attributes[$a][4])) { // name value pair found
$value = trim(isset($attributes[$a][7])?$attributes[$a][7]:$attributes[$a][6]);
} else {
$value = $name;
}
Here $name is always empty.
So I think there is a problem,
and it may be because of this regular expression:
$attrib_match = "/((?i)[a-z]+) (\s*=\s* ( ((?U)(\"(.*)\")) |(\S+) ) |\s+) /sXi";
I am no master of regular expressions, so kindly help me with this last problem.
Thank you.
-
Apr 21st, 2006, 03:54 PM
#15
Re: dom
Yes, the problem is with the regular expression. I wrote it using PHP 5 (where it works). In PHP 4 the X modifier (which, among other things, is meant to cause the compiler to ignore whitespace in the expression) is being ignored and the expression is not matching.
Remove the X modifier and the whitespace and it will work:
Code:
$attrib_match = "/((?i)[a-z]+)(\s*=\s*(((?U)(\"(.*)\"))|(\S+))|\s+)/si";
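To check an expression like this in isolation, it can be run against the attribute part of a single tag. The sketch below assumes the unquoted-value group is (\S+), as in the get_html_tags() function posted earlier; parse_attributes() and the sample string are hypothetical names made up for the example. For reference, in PHP's PCRE it is the lowercase x modifier that makes the engine ignore whitespace in a pattern, so a pattern written without any whitespace sidesteps the question.

```php
<?php
// Attribute pattern without the X modifier or embedded whitespace.
// Assumption: the unquoted-value group is (\S+), as in the original function.
$attrib_match = '/((?i)[a-z]+)(\s*=\s*(((?U)("(.*)"))|(\S+))|\s+)/si';

// Hypothetical helper: turns the inside of a tag into a name => value array.
function parse_attributes($attr_string, $pattern)
{
    $attribs = array();
    preg_match_all($pattern, $attr_string, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $name = strtolower($m[1]);
        if (isset($m[7]) && $m[7] !== '') {
            $value = $m[7];   // unquoted value, e.g. rel=nofollow
        } elseif (isset($m[6])) {
            $value = $m[6];   // quoted value, e.g. href="..."
        } else {
            $value = $name;   // attribute without a value
        }
        $attribs[$name] = trim($value);
    }
    return $attribs;
}

// Sample input: what get_html_tags() would pass on for one anchor tag.
$attribs = parse_attributes(' href="http://example.com/" rel=nofollow', $attrib_match);
print_r($attribs);
```

If $name comes back empty for input like this, the pattern itself is the place to look, rather than the loop around it.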
-
Sep 30th, 2006, 04:27 AM
#16
Thread Starter
Addicted Member
Re: dom
Hi VisualAd,
would you help me refine this code? At the moment it extracts everything between <a> and </a>. For example,
Code:
<a href="http://www.vbforums.com"><span id="testing">World</span></a>
gives <span id="testing">World</span>,
and any font tags are included as well:
Code:
<a href="http://www.vbforums.com"><span id="testing"><font color="#ffffff">Hello</font></span></a>
gives <span id="testing"><font color="#ffffff">Hello</font></span> as the anchor text.
What changes would extract only "World" in the first case and "Hello" in the second?
I am really bad at regular expressions.
Please help.
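One possible approach, offered only as a sketch, is to leave the regular expression alone and strip the nested markup from the captured text afterwards with PHP's built-in strip_tags(). The sample value below stands in for what get_html_tags() returns in $anchor['text'].

```php
<?php
// Sample value, as get_html_tags() would return it in $anchor['text'].
$text = '<span id="testing"><font color="#ffffff">Hello</font></span>';

// strip_tags() removes HTML tags and keeps the text between them;
// trim() drops any whitespace left at the ends.
$plain = trim(strip_tags($text));

echo $plain, "\n";
```

Inside the foreach loop this would amount to using trim(strip_tags($anchor['text'])) wherever the plain anchor text is needed.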
-
Oct 9th, 2006, 10:47 AM
#17
Thread Starter
Addicted Member
Re: dom
Can anybody help, please?
-
Oct 9th, 2006, 04:28 PM
#18
Re: dom
I looked, but I'm not great with complex regular expressions, sorry.
You might be able to PM VisualAd and ask for their help directly, in case they just haven't read this forum lately.
-
Oct 10th, 2006, 07:21 AM
#19
Thread Starter
Addicted Member
Re: dom
Well, I will PM him.
VisualAd ... when searching for the "a" tag, it also matches the "area" tag. How can I avoid that?
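A likely cause is that the tag pattern starts with <a followed directly by (.*), so any tag whose name merely begins with "a" also matches. A minimal sketch of one way around it is to add a word boundary (\b) after the tag name. The two patterns below are simplified copies of the close-tag pattern from get_html_tags(), and the sample HTML is made up.

```php
<?php
// Hypothetical sample with an <area> tag ahead of a real anchor.
$html = '<map><area href="/map.html"></map> <a href="/page.html">Page</a>';

$tagname = preg_quote('a', '/');

// Pattern as built in get_html_tags(): "<a" also matches the start of "<area".
$loose  = "/<{$tagname}(.*)>(.*)<\/{$tagname}>/Uis";
// Same pattern with \b, so the tag name must end right after "a".
$strict = "/<{$tagname}\b(.*)>(.*)<\/{$tagname}>/Uis";

preg_match_all($loose, $html, $loose_tags, PREG_SET_ORDER);
preg_match_all($strict, $html, $strict_tags, PREG_SET_ORDER);

echo "loose attributes:  ", $loose_tags[0][1], "\n";  // taken from the <area> tag
echo "strict attributes: ", $strict_tags[0][1], "\n"; // taken from the real anchor
echo "strict text:       ", $strict_tags[0][2], "\n";
```

The same \b would need to go into each of the three patterns the function builds.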
-
Oct 11th, 2006, 03:08 AM
#20
Fanatic Member
Re: dom
How are you at pulling apart code?
I once wrote a plugin that indexes a href with respect to rel="tag". Since you want to respect rel="nofollow", this should be close (I indexed rel="tag", whereas you want to ignore rel="nofollow").
I wanted to be light on the CPU and so used no RegEx at all, which I think you might approve of.
I must warn you that it looks a bit complex, but it is easier to read than pages of RegEx (for me).
The code was a plugin, but it should be obvious what is what.
The file is NP_realtags_0.0.1.zip and is found here:
http://freestuff.lordmatt.co.uk/my_d...usCMS%20Stuff/
You are welcome to use what you find should you need to.
-
Oct 11th, 2006, 03:17 AM
#21
Re: dom
I just noticed something in visualAd's very first post. DOM works with HTML, too, but if the HTML is not valid, the resulting tree will be rather unpredictable.
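For what it's worth, a minimal sketch of the DOM route in PHP 5 could look like the following. It assumes the dom extension is available; the sample HTML and the $follow array are made up for the example. DOMDocument::loadHTML() accepts plain HTML, and libxml_use_internal_errors() keeps the parser's warnings about invalid markup from being printed.

```php
<?php
// Hypothetical sample page fragment.
$html = '<p><a href="/a.html">One</a> '
      . '<a href="/b.html" rel="nofollow">Two</a> '
      . '<a href="/c.html"><span>Three</span></a></p>';

// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$follow = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    // rel may hold several space-separated tokens, in any letter case.
    $tokens = preg_split('/\s+/', strtolower($a->getAttribute('rel')), -1, PREG_SPLIT_NO_EMPTY);
    if (in_array('nofollow', $tokens)) {
        continue; // respect rel="nofollow"
    }
    if ($a->hasAttribute('href')) {
        // textContent is the anchor text with nested markup removed.
        $follow[$a->getAttribute('href')] = trim($a->textContent);
    }
}

print_r($follow);
```

Because getElementsByTagName('a') matches on the element name, tags such as area are not picked up, and textContent avoids the nested span and font markup in the anchor text.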