-
PHP Crawler?
Hello,
I'm trying to make a PHP crawler. It should crawl a web page, gather all the images and links, store them in MySQL, and then move on to another link.
This is how far I have got:
Code:
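<?php
// Quote the array key; an unquoted url is read as a constant name.
$site = isset($_GET['url']) ? $_GET['url'] : '';
// Read the whole page (remote URLs need allow_url_fopen); a single
// fread() call on a network stream may return only part of it.
$inputStream = ($site !== '') ? @file_get_contents($site) : false;
// $matches[1] holds the href values, $matches[2] the link contents.
if ($inputStream !== false && preg_match_all("/<a.*? href=\"(.*?)\".*?>(.*?)<\/a>/i", $inputStream, $matches)) {
    // strip_tags() works on strings, so apply it to each link's contents.
    $linkTexts = array_map('strip_tags', $matches[2]);
    print_r($matches[1]);
    print_r($linkTexts);
}
?>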
Could someone help me add the image crawling part and the code to store the results?
Thank you.
-
Re: PHP Crawler?
For the route you have taken, you have to be really good with regular expressions. The regex you have assumes that every href attribute is enclosed in double quotes ("), never in single quotes ('), which is not always the case. If you are using PHP 5, you could make use of the DOM extension instead. Take a look here. It has a lot of functions that make life easier, such as getElementsByTagName.
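If you do want to stay with regular expressions, here is a rough sketch of a pattern that accepts either quote style by capturing the opening quote and requiring the same one to close the value. The sample markup and the variable names are made up for illustration, and the pattern still misses unquoted attribute values, so treat it as a starting point rather than a complete solution.

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<a href="one.html">One</a> <a class=\'x\' href=\'two.html\'>Two</a>';

// (["\']) captures the opening quote; \1 requires the same quote to close the value.
$pattern = '/<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1[^>]*>(.*?)<\/a>/is';

$links = array();
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // $m[2] is the href value, $m[3] is the link's inner HTML.
        $links[] = array('href' => $m[2], 'text' => strip_tags($m[3]));
    }
}
print_r($links);
```

PREG_SET_ORDER groups the captures per link, which keeps each href together with its text.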
-
Re: PHP Crawler?
More specifically, look at the DOMDocument::loadHTML method, which loads HTML 4 documents (which need not be well-formed XML) into a DOM document.
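A minimal sketch of that approach, assuming the page source is already in a string: loadHTML parses it, and getElementsByTagName then returns the link and image elements regardless of how their attributes are quoted. The sample markup and the $links / $images arrays are only illustrative, and storing the results is left out.

```php
<?php
// Hypothetical sample markup; a crawler would pass the fetched page instead.
$html = '<html><body><a href="page.html">Next <b>page</b></a>'
      . "<a href='other.html'>Other</a><img src=\"logo.png\" alt=\"Logo\"></body></html>";

$doc = new DOMDocument();
// Real-world HTML is often malformed; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Collect every link that actually has an href attribute.
$links = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    if ($a->hasAttribute('href')) {
        $links[] = array('href' => $a->getAttribute('href'), 'text' => trim($a->textContent));
    }
}

// Collect the src of every image.
$images = array();
foreach ($doc->getElementsByTagName('img') as $img) {
    if ($img->hasAttribute('src')) {
        $images[] = $img->getAttribute('src');
    }
}

print_r($links);
print_r($images);
```

The href and src values come back exactly as written in the page, so relative URLs would still need to be resolved against the page's address before the crawler follows them.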