[RESOLVED] Finding URLs in a String using Regular Expressions
Hello.
I am trying to write an app that will read in an HTML document and extract all the URLs from it.
I currently read the HTML document in line by line, and I need to be able to identify whether there are any URLs in each line.
Someone suggested using regular expressions.
I am having a bit of trouble doing this.
I am trying something like this but it doesn't work. Any help would be great! Thanks.
Code:
String test = new String("bla bla bla http://somesite.com/tmp/page.html bla bla");
String regex = "@\"http(s)?://([\\w-]+\\.)+[\\w-]+(/[\\w- ./?%&=]*)?\\b\")";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(test);
if (m.find()){
System.out.println(m.group(1));
}
else{
System.out.println("Not found!");
}
Re: Finding URLs in a String using Regular Expressions
Sorted. I've worked it out. Thanks anyway.
Code:
URL url = new URL("http://www.bla.com");
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String strLine = "";
String URLregex = "http(s)?://([\\w-]+\\.)+[\\w-]+(/[\\w-./?%&=]*)?\\b";
Pattern p = Pattern.compile(URLregex);
while ((strLine = in.readLine()) != null){
Matcher m = p.matcher(strLine);
if (m.find()){
System.out.println(m.group(0));
}
}
Re: [RESOLVED] Finding URLs in a String using Regular Expressions
I don't know if you've noticed, but your code will only read the first URL in each line.
So if the whole page were formatted as a single line, you'd only get one result.
That's a logic error.
Re: [RESOLVED] Finding URLs in a String using Regular Expressions
Yeah I did notice that it only read one URL per line.
Thanks for pointing it out.
How do you suggest I change this?
Re: [RESOLVED] Finding URLs in a String using Regular Expressions
Try this code:
Code:
URL url = new URL("http://www.bla.com");
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String strLine = null;
String URLregex = "http(s)?://([\\w-]+\\.)+[\\w-]+(/[\\w-./?%&=]*)?\\b";
Pattern p = Pattern.compile(URLregex);
while ((strLine = in.readLine()) != null) {
Matcher m = p.matcher(strLine);
while (m.find()) {
System.out.println(m.group(0));
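// No need to strip the match out of strLine: find() resumes after the previous match.
}
}
Just replaced the if with a while. Each call to m.find() carries on from where the last match ended, so every URL on the line gets printed. (Calling strLine.replaceFirst(URLregex, "") without using its return value would do nothing anyway, because Java strings are immutable.)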
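If it helps, here is a minimal self-contained sketch of the same while (m.find()) idea that runs against a plain string, so you can try it without a network connection. The class name UrlFinder, the method findUrls and the sample URLs are made up for illustration, not something from the posts above. I have also moved the hyphen to the end of the last character class so that it is unambiguously a literal hyphen; otherwise the pattern is the one used in this thread, and it is only a rough heuristic for spotting URLs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
class UrlFinder {

    // Same idea as the thread's regex; hyphen placed last in the class so it is literal.
    private static final Pattern URL_PATTERN =
            Pattern.compile("https?://([\\w-]+\\.)+[\\w-]+(/[\\w./?%&=-]*)?\\b");

    // Collects every match in the text, not just the first one.
    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(text);
        // Each find() call resumes after the end of the previous match.
        while (m.find()) {
            urls.add(m.group()); // group() with no argument is the whole match
        }
        return urls;
    }

    public static void main(String[] args) {
        String line = "see http://a.com/x.html and https://b.org/y?z=1 end";
        for (String url : findUrls(line)) {
            System.out.println(url);
        }
    }
}
```

Returning a list rather than printing inside the loop makes it easy to call the method once per line from the readLine() loop and do whatever you want with the results.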
Re: [RESOLVED] Finding URLs in a String using Regular Expressions
Thanks mate, I'll give it a go.
I just sent you a PM the second before you posted that!
Re: [RESOLVED] Finding URLs in a String using Regular Expressions
Sorted. Nice one geeza! You're the man ;)