File Type Identification

**sciguyryan** · Apr 25th, 2010, 10:21 AM

Hey there guys!

I have a question for anyone. Maybe some code already exists for this or a good algorithm is already written down somewhere.

Basically, what I'm trying to do is write a tool that identifies files of unknown type. I've been looking for a .NET implementation of a file-type matching algorithm, but I can't seem to find one.

If there are none then can anyone point me in the general direction of writing one? I'm looking for help with the project so if anyone else is interested in this let me know.

Cheers

**DeanMc** · Apr 25th, 2010, 07:30 PM

I don't understand what you mean?

You would know the filetype by knowing the path: C:\MyFile.TYPE

It would simply be a case of using a regex to read everything after the . and check it against a list of some sort.

**jmcilhinney** · Apr 25th, 2010, 10:54 PM

I assume you mean that you would read the data of a file and determine whether it's a Word document, a PDF document, an AutoCAD drawing, an HTML file, etc. For a start, you would have to know the binary format of every file type you want to be able to identify. You would then have to read the bytes of the file and compare them against each of the known formats until one matches. Your aim is fairly unrealistic unless you are prepared to study all those different binary formats and write code to identify them in an arbitrary set of bytes.
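To make the byte-comparison idea concrete, here is a minimal sketch in Python (the thread asks about .NET, but the logic carries over directly). The `SIGNATURES` table and the `identify` and `identify_file` names are made up for illustration, and the table only lists a handful of well-known signatures. It is nowhere near a complete identifier.

```python
from typing import Optional

# Illustrative table only: (offset, magic bytes, label).
# A real identifier would need far more entries than this.
SIGNATURES = [
    (0, b"%PDF-", "PDF document"),
    (0, b"\x89PNG\r\n\x1a\n", "PNG image"),
    (0, b"GIF87a", "GIF image"),
    (0, b"GIF89a", "GIF image"),
    (0, b"PK\x03\x04", "ZIP container (also used by .docx, .xlsx, .jar)"),
    (0, b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (legacy .doc, .xls)"),
]


def identify(data: bytes) -> Optional[str]:
    """Return the label of the first signature found in data, or None."""
    for offset, magic, label in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return label
    return None


def identify_file(path: str, probe_size: int = 64) -> Optional[str]:
    """Read only the first probe_size bytes of the file and identify them."""
    with open(path, "rb") as f:
        return identify(f.read(probe_size))
```

Note that a match on a container signature such as ZIP or OLE2 only narrows things down. Telling a Word document from a spreadsheet would mean looking inside the container.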

**sciguyryan** · Apr 26th, 2010, 02:10 AM

> Originally Posted by DeanMc
>
> I don't understand what you mean?
>
> You would know the filetype by knowing the path: C:\MyFile.TYPE
>
> It would simply be a case of using a regex to read everything after the . and check it against a list of some sort.

File extensions are sometimes wrong, whether intentionally or by accident, so identifying a file by its extension isn't considered accurate.

> Originally Posted by jmcilhinney
>
> I assume that you mean that you would read the data of a file and determine whether it's a Word document, a PDF document, an AutoCAD drawing, an HTML file, etc. For a start, you would have to know the binary format of all the file types you want to be able to identify.

Once I actually figure out a good algorithm, I'll generate these format descriptions automatically from a large and varied sample set for each file type, to get the best accuracy I can.
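As a rough illustration of what generating a signature from samples could look like, the Python sketch below keeps every offset in the first few bytes where all samples of a type agree. The `common_header` and `matches` helpers are hypothetical names, and this is only one naive approach under the assumption that the type has fixed bytes near the start. It is not an established algorithm.

```python
def common_header(samples, probe_size=16):
    """Return {offset: byte value} for offsets where every sample agrees.

    Only the first probe_size bytes (or the shortest sample) are compared.
    """
    if not samples:
        return {}
    limit = min(probe_size, min(len(s) for s in samples))
    first = samples[0]
    return {
        i: first[i]
        for i in range(limit)
        if all(s[i] == first[i] for s in samples)
    }


def matches(data, pattern):
    """True if data has the expected byte at every offset in pattern.

    An empty pattern never matches, since it carries no information.
    """
    return bool(pattern) and all(
        offset < len(data) and data[offset] == value
        for offset, value in pattern.items()
    )
```

For example, feeding it several GIF files should leave only the offsets of the shared leading bytes. The more varied the sample set, the fewer accidental agreements survive.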

> Originally Posted by jmcilhinney
>
> You would then have to read the bytes of the file and compare the format to each of the known file types. When you find a match, you've found a match. Your aim is fairly unrealistic unless you are prepared to study all those different binary formats and write code to identify them in an arbitrary set of bytes.

You're talking about header byte matching, correct? I thought about that, but it will need to be used in conjunction with other methods, since not all files use byte headers for identification. One of the most prominent examples is ISO disk image files.
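A note on the ISO case: ISO 9660 images do carry a signature, just not at the start of the file. The first 16 sectors of 2048 bytes are a system area, and the volume descriptor that follows has the ASCII identifier `CD001` right after its one-byte type field, which puts it at byte offset 0x8001. So a matcher needs to support signatures at arbitrary offsets. A minimal Python sketch, with `looks_like_iso` being a made-up helper name:

```python
import io

ISO_MAGIC = b"CD001"
# 16 sectors of 2048 bytes of system area, plus the 1-byte descriptor type.
ISO_MAGIC_OFFSET = 16 * 2048 + 1  # 0x8001


def looks_like_iso(stream):
    """Check a seekable binary stream for the ISO 9660 identifier."""
    stream.seek(ISO_MAGIC_OFFSET)
    return stream.read(len(ISO_MAGIC)) == ISO_MAGIC


# Usage with an in-memory stand-in for a disk image:
fake_image = io.BytesIO(bytes(ISO_MAGIC_OFFSET) + ISO_MAGIC + bytes(100))
is_iso = looks_like_iso(fake_image)
```

With a real file you would pass `open(path, "rb")` instead of the `BytesIO` object. Images in other disk formats would still need their own checks.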
