Parsing URLs from a file

**VBlee** · May 7th, 2008, 02:26 PM

Hey,

I am having a problem finding a way to get URLs from a file. I have no problems getting the HTML file or saving the information; it's just getting the information from inside the file.

All I want to do is search through the file to find links, e.g. find /tutorials/Maya/1 from a line that may be:
<a href="/tutorials/Maya/1">Maya</a>

I am also trying to find a way so that when it gets a list of links, it then checks whether they contain the phrase:
/tutorials/ (I would be able to do that if I knew how to do the first bit, I think!).

Can anyone help at all?

Thanks,
Lee.

**dclamp** · May 7th, 2008, 11:53 PM

this is what you are looking for:

preg_match
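
A minimal sketch of the regular expression route, assuming the page is already in a string and that every href value is quoted. The `$html` sample and the pattern are illustrative assumptions, not a complete HTML matcher; `preg_match_all` is used instead of `preg_match` so that every link is collected rather than only the first:

```php
<?php
// Hypothetical sample input; in practice this could come from file_get_contents().
$html = '<a href="/tutorials/Maya/1">Maya</a> <a href=\'/about\'>About</a>';

// Capture the value of each quoted href attribute (single or double quotes).
// Unquoted attribute values are NOT handled by this pattern.
preg_match_all('/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1/is', $html, $matches);

// Group 2 holds the URL text between the quotes.
$links = $matches[2];
print_r($links);
```

The `i` flag makes the match case-insensitive (so `<A HREF=...>` also matches) and `s` lets the pattern span line breaks.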

**visualAd** · May 8th, 2008, 03:26 AM

Although a regular expression can be used, it would be rather large because of the many different ways in which HTML can be written.

HTML Code:

<P><A HREF=www.vbforums.com>My Link</p>

<P><A HREF="www.vbforums.com" >My Link<p>

<P><A HREF='http://www.vbforums.com' >My Link<p>

<P><A HREF="www.vbforums.com" >My Link</a><p>

<p><a href=

"www.vbforums.com" 

>My Link</a><p>

And any combination of the above. The best way of doing this is to use the loadHTML() method of the DOMDocument object. It tolerates poorly formed HTML and inconsistencies in the markup, so you can be relatively sure that every link has been captured.

If the document is well-formed XHTML, you can just load it using DOMDocument->load().

You can then use the getElementsByTagName() and getAttribute() methods to get the values of the links' attributes.

PHP Code:


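// Assumes $html already holds the page markup as a string.
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ hides the warnings that malformed markup triggers

$anchors = $dom->getElementsByTagName('a');

foreach ($anchors as $anchor) {
  echo $anchor->getAttribute('href'), "\n";
}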
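
For the second part of the original question (keeping only the links that contain /tutorials/), one possible sketch follows. The sample markup and the `$needle` and `$tutorialLinks` names are assumptions for illustration; the substring check uses `strpos()`:

```php
<?php
// Hypothetical sample markup; a real script would load the saved HTML file instead.
$html = '<p><a href="/tutorials/Maya/1">Maya</a> <a href="/forum/index">Forum</a>'
      . '<a href="/tutorials/PHP/2">PHP</a></p>';
$needle = '/tutorials/';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$tutorialLinks = array();
foreach ($dom->getElementsByTagName('a') as $anchor) {
    $href = $anchor->getAttribute('href');
    // strpos() returns false when the needle is absent; 0 is a valid match position.
    if (strpos($href, $needle) !== false) {
        $tutorialLinks[] = $href;
    }
}

print_r($tutorialLinks);
```

The strict `!== false` comparison matters because a link that starts with /tutorials/ matches at position 0, which a loose comparison would treat as "not found".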

**VBlee** · May 8th, 2008, 10:46 AM

Thank you both so much. I had been looking at regular expressions and am just getting my head round them, but I was looking for another option such as visualAd's method.

Thanks again, I will test and investigate them.

Lee.
