webcrawler for images

**sagey** · May 21st, 2004, 10:42 AM

I work for a company that provides product images to retailers.

I've been asked to look into watermarking images and how we would discover whether someone had reused an image without our consent.

I've looked into several ways of applying visible and invisible watermarks to images, and I've written some very basic apps that do both, but now I need to think about how to discover websites that are using the images without consent.

I figure it would be good if I could write a webcrawler in C# that I could point at a specific website (www.boots.com, for example) and which would then crawl through the site and save any images to a folder on my PC. I could then use my watermarking program to analyze each image for my watermark.

I've seen a couple of seemingly complex examples of webcrawlers, but what I'm after is an explanation of the basics (I am rather dense). For example, I'm unsure how I would get images from a website and put them in a local folder. Can I do that with an HTTP request?

If anyone can either explain the basics or point me towards a good basic tutorial, I would be very grateful.

**MrPolite** · May 21st, 2004, 01:51 PM

Well, wouldn't it be more efficient and less costly for your company to just use a webcrawler that's already written (I dunno, WebZIP or whatever), get the images, and THEN just use your program to analyze them? It'd be much simpler too.

But tell me, how do you apply an invisible watermark? And how do you go about analyzing the image? I'm sorta interested in image processing, you know.

Do you do edge detection or anything?

**Tewl** · May 22nd, 2004, 09:27 AM

Well, I wrote one of these to get images and information off a webpage a while back. Basically, I used the WebRequest object to retrieve the source, then a regular expression to find the images on the page, then the WebClient to download the images.

Example getting source:

Code:

// Needs: using System.IO; using System.Net;
string source = "";
HttpWebRequest hwr = (HttpWebRequest)WebRequest.Create("http://somesite.com");
HttpWebResponse hwrsp = (HttpWebResponse)hwr.GetResponse();
Stream s = hwrsp.GetResponseStream();
StreamReader sr = new StreamReader(s);
source += sr.ReadToEnd();   // read the whole page into one string
sr.Close();
s.Close();
hwrsp.Close();              // release the connection

Example getting image:

Code:

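// Needs: using System.Net; using System.Text.RegularExpressions;
WebClient dl = new WebClient();
string flink = "", fpath = "";
string[] f = null;
// Group 2 captures the src value; [^>]*? allows other attributes before src.
Match mMatch = Regex.Match(source, "<img([^>]*?)src=['\"](.*?)['\"].*?>", RegexOptions.IgnoreCase);
while (mMatch.Success)
{
	// Assumes every src is relative to the site root.
	flink = "http://somesite.com/" + mMatch.Groups[2].ToString();
	f = Regex.Split(flink,"/");
	fpath = "c:\\pathtosave\\" + f[f.Length - 1];   // file name = last path segment
	dl.DownloadFile(flink,fpath);
	mMatch = mMatch.NextMatch();
}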

**sagey** · May 23rd, 2004, 03:44 PM

Cheers for the replies.

I didn't realise there were existing crawlers I could use.
But since Tewl has provided some code, I think I'll try and knock up my own.

The invisible watermark is done by hiding some retrievable information within the binary data of the image.

codeproject.com has some steganography examples that spread data as a pattern within the pixel data.

I've basically taken this as my starting point and made a much simpler version. My version will create an image object from the file, analyse a specific pattern of pixels and verify whether a company-specific key is present in the image.

That's the plan anyway. I'm a long way off from getting it all sorted.
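To make the "pattern of pixels" idea concrete, here is a minimal sketch of one common approach: hiding the key in the least significant bit of every `step`-th pixel byte. This is an assumption for illustration, not the CodeProject scheme or the program described above. `LsbWatermark`, `Embed`, `Contains` and `step` are hypothetical names, and the pixel data is taken as a plain byte array rather than an image object. Note that a least-significant-bit mark like this does not survive lossy recompression such as re-saving as JPEG.

```csharp
using System;

public static class LsbWatermark
{
    // Writes the bits of key into the lowest bit of every step-th byte of pixels.
    public static void Embed(byte[] pixels, byte[] key, int step)
    {
        if (step < 1) throw new ArgumentOutOfRangeException("step");
        if ((long)key.Length * 8 * step > pixels.Length)
            throw new ArgumentException("Image too small for this key and step.");
        for (int bit = 0; bit < key.Length * 8; bit++)
        {
            // Take the key bits most significant first.
            int keyBit = (key[bit / 8] >> (7 - bit % 8)) & 1;
            int pos = bit * step;
            pixels[pos] = (byte)((pixels[pos] & 0xFE) | keyBit);
        }
    }

    // Reads the same positions back and reports whether they spell out key.
    public static bool Contains(byte[] pixels, byte[] key, int step)
    {
        if (step < 1 || (long)key.Length * 8 * step > pixels.Length) return false;
        for (int bit = 0; bit < key.Length * 8; bit++)
        {
            int keyBit = (key[bit / 8] >> (7 - bit % 8)) & 1;
            if ((pixels[bit * step] & 1) != keyBit) return false;
        }
        return true;
    }
}
```

Each marked byte changes by at most 1, which is why the mark is invisible to the eye. A short key can turn up by chance in an unmarked image, so a longer key gives fewer false positives.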

I have got the visible watermark pretty much sorted, though. It sticks transparent text on top of the image, and also sticks a transparent logo over the image. Obviously these could easily be removed with a bit of photoshopping, but I think the point of the visible watermarks is to make the potential thief think 'oh f**k, I've got to spend ages trimming the image'.

**sagey** · May 24th, 2004, 11:47 AM

I've tried using the code that Tewl kindly provided, and it seems to fail when trying to get the image using the WebClient.

Code:

		private void Crawl()
		{
			//get source 
			string source = "";
			HttpWebRequest hwr = (HttpWebRequest)WebRequest.Create("http://intranet");
			HttpWebResponse hwrsp = (HttpWebResponse)hwr.GetResponse();
			Stream s = hwrsp.GetResponseStream();
			StreamReader sr = new StreamReader(s);
			source += sr.ReadToEnd();
			sr.Close();
			s.Close();
	
			//MessageBox.Show(source);
			
			//get image stuff
			WebClient dl = new WebClient();
			string flink = "", fpath = "";
			string[] f = null;
			Match mMatch = Regex.Match(source, "<img([^>])src=[',\"](.*?)[',\"].*?>", RegexOptions.IgnoreCase);
			while (mMatch.Success)
			{
				flink = "http://intranet" + mMatch.Groups[2].ToString();
				f = Regex.Split(flink,"/");
				fpath = @"D:\Watermark\WebCrawler\CrawledImages\" + f[f.Length - 1];
				dl.DownloadFile(flink,fpath);
				mMatch = mMatch.NextMatch();
			}		
		}

I'm trying to test this against my work's intranet home page; I don't know whether that's got anything to do with it.
Here is the output from VS:

'DefaultDomain': Loaded 'c:\windows\microsoft.net\framework\v1.1.4322\mscorlib.dll', No symbols loaded.
'WebCrawler': Loaded 'D:\Watermark\WebCrawler\WebCrawler\bin\Debug\WebCrawler.exe', Symbols loaded.
'WebCrawler.exe': Loaded 'c:\windows\assembly\gac\system.windows.forms\1.0.5000.0__b77a5c561934e089\system.windows.forms.dll' , No symbols loaded.
'WebCrawler.exe': Loaded 'c:\windows\assembly\gac\system\1.0.5000.0__b77a5c561934e089\system.dll', No symbols loaded.
'WebCrawler.exe': Loaded 'c:\windows\assembly\gac\system.drawing\1.0.5000.0__b03f5f7f11d50a3a\system.drawing.dll', No symbols loaded.
'WebCrawler.exe': Loaded 'c:\windows\assembly\gac\system.xml\1.0.5000.0__b77a5c561934e089\system.xml.dll', No symbols loaded.
An unhandled exception of type 'System.Net.WebException' occurred in system.dll

Additional information: The underlying connection was closed: The remote name could not be resolved.

The program '[2996] WebCrawler.exe' has exited with code 0 (0x0).

**sagey** · May 24th, 2004, 11:55 AM

Found the problem. As the image was a relative link, it was coming out as:
http://intranetchart.gif

whereas it should have come out as:
http://intranet/chart.gif

So I've just changed this line of code:

Code:

	flink = "http://intranet/" + mMatch.Groups[2].ToString();
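Adding the slash by hand works for a page at the site root, but links found on deeper pages (or links that are already absolute) need proper resolution. A minimal sketch using `System.Uri`, which resolves a link against the page it was found on; `LinkResolver` and `Resolve` are hypothetical names chosen for illustration:

```csharp
using System;

public static class LinkResolver
{
    // Resolves an <img src> or <a href> value against the page it was found on.
    // Returns null when the value cannot be parsed as a URI.
    public static string Resolve(string pageUrl, string link)
    {
        Uri baseUri = new Uri(pageUrl);
        Uri resolved;
        return Uri.TryCreate(baseUri, link, out resolved) ? resolved.AbsoluteUri : null;
    }
}
```

With this, `Resolve("http://intranet", "chart.gif")` gives `http://intranet/chart.gif`, and a link that is already absolute comes back unchanged.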

**sagey** · May 25th, 2004, 02:54 AM

My next question is: how would I go about requesting pages that are linked from the main page?

I assume I would analyse the source of the first page, look for URLs, then request those.

Can I use a regex to search for URLs, like the code that searches for images?

**Tewl** · May 25th, 2004, 09:46 AM

Yes, a regular expression would be used to retrieve URLs in the source. My suggestion would be to have a hash table or some other collection to add the URLs to, where the URL is the key and whether you have searched the page or not is the value. I would also suggest using threads to go through the pages faster.

I am at work right now, but I will post another example later using threads if you need one.
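In the meantime, a rough single-threaded sketch of the URL table idea might look like the following. The names `CrawlFrontier`, `Add`, `Next` and `ExtractLinks` are made up for illustration, and it uses the generic `Dictionary` rather than `Hashtable`, so it assumes a newer framework than 1.1. Fetching the page source is left out; it would be the same WebRequest code as before.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class CrawlFrontier
{
    // URL -> whether the page has been searched yet.
    private readonly Dictionary<string, bool> pages = new Dictionary<string, bool>();
    private readonly Queue<string> pending = new Queue<string>();

    // Same style of pattern as the <img> one, aimed at <a href="..."> instead.
    private static readonly Regex HrefPattern = new Regex(
        "<a\\s[^>]*?href\\s*=\\s*['\"]([^'\"]*)['\"]", RegexOptions.IgnoreCase);

    // Records a URL once; repeats are ignored so no page is queued twice.
    public bool Add(string url)
    {
        if (pages.ContainsKey(url)) return false;
        pages[url] = false;
        pending.Enqueue(url);
        return true;
    }

    // Hands out the next unsearched URL and marks it searched; null when done.
    public string Next()
    {
        if (pending.Count == 0) return null;
        string url = pending.Dequeue();
        pages[url] = true;
        return url;
    }

    // Pulls every href out of a page's source and resolves it against the page URL.
    public static List<string> ExtractLinks(string pageUrl, string source)
    {
        List<string> links = new List<string>();
        Uri baseUri = new Uri(pageUrl);
        foreach (Match m in HrefPattern.Matches(source))
        {
            Uri resolved;
            if (Uri.TryCreate(baseUri, m.Groups[1].Value, out resolved)
                && (resolved.Scheme == Uri.UriSchemeHttp || resolved.Scheme == Uri.UriSchemeHttps))
            {
                // Drop any #fragment so the same page is not recorded twice.
                links.Add(resolved.GetLeftPart(UriPartial.Query));
            }
        }
        return links;
    }
}
```

The crawl loop is then: `Add` the start page, call `Next` until it returns null, and for each page fetch the source and `Add` everything `ExtractLinks` returns. Non-HTTP links such as mailto: are skipped.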

**sagey** · May 25th, 2004, 10:57 AM

Thanks Tewl, that would be great if you could.

**hellswraith** · May 25th, 2004, 11:46 AM

I just want to remind you, that although your intentions are good, there is potentially a HUGE flaw in the way you are going about it.

I have created an application exactly like you are doing, yet it doesn't check watermarks, just downloads files from pages and spiders out.

The problem is, on my computer, even with a high speed connection, I am only able to pull in about 5000-15000 images an hour. Now, that sounds like a lot, but look at how many images are out on the Internet. Millions, possibly billions. Think about how long it would take to spider the Internet and download each one to check for copyright infringements.

Google has a huge (thousands) farm of computers that spider the internet. If you are going to run your app constantly, you might want to get a good farm of computers if you want to do this effectively.

**sagey** · May 25th, 2004, 11:55 AM

Yep, cheers, I accept your point. Ideally I would like to be able to trawl loads of sites, but to start off with I'm just going to try and write a targetable crawler, so that I could maybe have a table in a DB full of domain names where I think infringement might take place, and then set the crawler off against those specific sites.
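As a rough sketch of what "targetable" could mean in code: keep the list of domains in a set and only follow or download URLs whose host is on it. `TargetList` and `IsTarget` are hypothetical names, and the domains are passed in as plain strings here; loading them from a database table is left out.

```csharp
using System;
using System.Collections.Generic;

public class TargetList
{
    private readonly HashSet<string> hosts;

    // domains: the sites to watch, e.g. rows read from a database table.
    public TargetList(IEnumerable<string> domains)
    {
        hosts = new HashSet<string>(domains, StringComparer.OrdinalIgnoreCase);
    }

    // True when the URL's host is a listed domain or a subdomain of one.
    public bool IsTarget(string url)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri)) return false;
        string host = uri.Host;
        while (true)
        {
            if (hosts.Contains(host)) return true;
            int dot = host.IndexOf('.');
            if (dot < 0) return false;
            host = host.Substring(dot + 1);   // strip the leftmost label and retry
        }
    }
}
```

With "boots.com" in the list, a URL on www.boots.com counts as a target, while one on a different domain that merely ends in the same letters does not.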

The company I work for takes product images for supermarkets, so the number of people who would want to nick a picture of a can of baked beans is quite limited (unless I'm underestimating the number of baked bean fetishists out there).

**hellswraith** · May 25th, 2004, 06:46 PM

Fair enough, if you target it, that would be much better.

**sagey** · May 26th, 2004, 02:39 AM

Yep, I really only want to target specific sites.

Actually, hellswraith, your app's figures seem pretty good compared to some of the 'professional' solutions out there. My maths is poor, but I calculate that if your app can grab 15000 images an hour, 24 hours a day, 7 days a week, 4 weeks a month, that works out to about 10 million images a month, which ain't bad at all compared to the blurb I got from a salesperson at Digimarc (one of the biggest companies offering image watermarking/tracking):

"MarcSpider Reporting - this is our crawling service for the internet. We search for your watermarked photos across the internet and report back to you with the locations that we find them, a copy of the located image and the URL that we found it on. Digimarc continually spiders the web, looking at over 50 million images per month to find and report images that contain ImageBridge watermarks"
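For anyone checking the sums, the estimate is just the hourly rate times 24 x 7 x 4 = 672 hours. A tiny sketch (the `Throughput` class is a made-up name for illustration):

```csharp
public static class Throughput
{
    // Images per four-week month at a constant hourly rate, running non-stop.
    public static long PerMonth(long imagesPerHour)
    {
        const int hoursPerMonth = 24 * 7 * 4; // 672 hours
        return imagesPerHour * hoursPerMonth;
    }
}
```

`PerMonth(15000)` is 10,080,000 and `PerMonth(5000)` is 3,360,000, so the 5000-15000 an hour range works out to roughly 3.4 to 10 million images a month on one machine, against the 50 million quoted above.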

**hellswraith** · Jun 16th, 2004, 09:05 PM

Yes, but it all depends on image size, bandwidth, etc. Then on top of that, you are going to have to be processing these images at the same time, so that is also going to take some system resources.

I can definitely see how to set this up to be scalable if you wanted. You would use one or two machines running a bunch of threads, each downloading their own set of images. There will be some synchronization involved. Then have another machine or two running the watermark scanning. If you have this in mind when building your app/apps, you can plan for scalability. This means you could start on one machine, but scale up as you need more horsepower.
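On a single machine, the same split can be sketched as a producer/consumer pipeline: some workers download, others scan, with a bounded queue between them doing the synchronization. This is only an illustration under assumptions. `CrawlPipeline` and `Run` are made-up names, the `download` and `hasWatermark` delegates stand in for the real WebClient and watermark code, and it uses `Task` and `BlockingCollection` from newer .NET versions rather than raw threads.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class CrawlPipeline
{
    // download: fetches one image URL and returns its bytes (hypothetical delegate).
    // hasWatermark: checks the bytes for the watermark (hypothetical delegate).
    // Returns the URLs whose images tested positive.
    public static List<string> Run(
        IEnumerable<string> imageUrls,
        Func<string, byte[]> download,
        Func<byte[], bool> hasWatermark,
        int downloaders,
        int scanners)
    {
        var urls = new ConcurrentQueue<string>(imageUrls);
        // Bounded so downloads pause when scanning falls behind.
        var images = new BlockingCollection<KeyValuePair<string, byte[]>>(100);
        var hits = new ConcurrentBag<string>();

        var downloadTasks = new Task[downloaders];
        for (int i = 0; i < downloaders; i++)
        {
            downloadTasks[i] = Task.Run(() =>
            {
                string url;
                while (urls.TryDequeue(out url))
                    images.Add(new KeyValuePair<string, byte[]>(url, download(url)));
            });
        }

        var scanTasks = new Task[scanners];
        for (int i = 0; i < scanners; i++)
        {
            scanTasks[i] = Task.Run(() =>
            {
                foreach (var item in images.GetConsumingEnumerable())
                    if (hasWatermark(item.Value)) hits.Add(item.Key);
            });
        }

        try { Task.WaitAll(downloadTasks); }   // wait until every URL has been fetched
        finally { images.CompleteAdding(); }   // lets the scanners drain and finish
        Task.WaitAll(scanTasks);
        return new List<string>(hits);
    }
}
```

Moving to several machines would mean replacing the in-memory queue with something shared, but the division of work stays the same.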
