Re: File Type Identification
I don't understand what you mean?
You would know the filetype by knowing the path: C:\MyFile.TYPE
It would simply be a case of using a regex to read everything after the . and check it against a list of some sort.
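A rough sketch of that idea in Python, in case it helps. The lookup table and the function name are made up for illustration, and the list of types is nowhere near complete:

```python
import re

# Hypothetical lookup table; the entries are illustrative only.
KNOWN_EXTENSIONS = {
    "pdf": "PDF document",
    "doc": "Word document",
    "dwg": "AutoCAD drawing",
    "html": "HTML file",
}

# Capture everything after the last dot in the final path component.
# Path separators are excluded so a dot in a folder name is not picked up.
_EXT_RE = re.compile(r"\.([^.\\/]+)$")


def type_from_extension(path):
    """Return a description for the path's extension, or None if unknown."""
    match = _EXT_RE.search(path)
    if match is None:
        return None
    return KNOWN_EXTENSIONS.get(match.group(1).lower())
```

Something like `type_from_extension(r"C:\MyFile.PDF")` would then look up "pdf" in the table. This only tells you what the name claims, not what the file actually contains.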
Re: File Type Identification
I assume that you mean that you would read the data of a file and determine whether it's a Word document, a PDF document, an AutoCAD drawing, an HTML file, etc. For a start, you would have to know the binary format of all the file types you want to be able to identify. You would then have to read the bytes of the file and compare the format to each of the known file types. When the bytes fit one of those formats, you've found your match. Your aim is fairly unrealistic unless you are prepared to study all those different binary formats and write code to identify them in an arbitrary set of bytes.
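To make the compare-against-known-formats step concrete, here is a minimal sketch that only checks the leading bytes. The table and function names are assumptions for illustration, the three signatures are just well-known examples, and real identification of each format would need far more than this:

```python
# Leading-byte signatures for a few well-known formats (illustrative subset).
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", "OLE2 compound file (legacy Word .doc, among others)"),
    (b"PK\x03\x04", "ZIP container (.docx, among others)"),
]


def identify_bytes(data):
    """Return the name of the first signature that data starts with, or None."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None


def identify_file(path):
    """Read only the first few bytes of the file and try to identify them."""
    with open(path, "rb") as handle:
        return identify_bytes(handle.read(16))
```

Note that a match on the OLE2 or ZIP signature only narrows things down to a container; telling a Word document apart from other files in the same container means reading further into the format. Text formats such as HTML have no fixed signature at all.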
Re: File Type Identification
Quote: Originally Posted by DeanMc
I don't understand what you mean?
You would know the filetype by knowing the path: C:\MyFile.TYPE
It would simply be a case of using a regex to read everything after the . and check it against a list of some sort.
File extensions are sometimes wrong, intentionally or otherwise, so that method of identification is not considered accurate.
Quote: Originally Posted by jmcilhinney
I assume that you mean that you would read the data of a file and determine whether it's a Word document, a PDF document, an AutoCAD drawing, an HTML file, etc. For a start, you would have to know the binary format of all the file types you want to be able to identify.
Once I actually figure out a good algorithm, I'll generate these format signatures automatically, using a large and varied sample set for each file type to get the best accuracy I can.
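As a starting point only, one naive way to derive a signature from samples is to take the longest run of leading bytes that every sample of a type shares. This is a sketch of that single idea, not the algorithm I'm still looking for; the function name and the `min_length` cut-off are assumptions:

```python
def common_prefix_signature(samples, min_length=2):
    """Return the longest byte prefix shared by all samples.

    Returns None when there are no samples or the shared prefix is
    shorter than min_length (too short to be a useful signature).
    """
    if not samples:
        return None
    shortest = min(len(sample) for sample in samples)
    first = samples[0]
    length = 0
    # Extend the prefix while every sample agrees on the next byte.
    while length < shortest and all(s[length] == first[length] for s in samples):
        length += 1
    if length < min_length:
        return None
    return first[:length]
```

With a small sample set this tends to produce a prefix that is too long (for example, picking up a version number that all the samples happen to share), which is why the sample set needs to be large and varied.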
Quote: Originally Posted by jmcilhinney
You would then have to read the bytes of the file and compare the format to each of the known file types. When the bytes fit one of those formats, you've found your match. Your aim is fairly unrealistic unless you are prepared to study all those different binary formats and write code to identify them in an arbitrary set of bytes.
You are talking about header byte matching, correct? I thought about that, but it will need to be used in conjunction with other methods, since not all files carry identifying bytes in a header. One of the most prominent examples is ISO disk image files.
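One of those other methods could be matching at an offset instead of at the start of the file. A sketch, assuming the image follows ISO 9660, where the "CD001" identifier sits in the volume descriptor at sector 16 (2048-byte sectors) rather than in a header. The table layout and names here are my own assumptions:

```python
# ISO 9660: the first volume descriptor starts at sector 16; its first byte
# is the descriptor type, followed by the 5-byte identifier "CD001".
ISO_9660_OFFSET = 16 * 2048 + 1

# (offset, expected bytes, description); illustrative subset only.
OFFSET_SIGNATURES = [
    (0, b"%PDF-", "PDF document"),
    (ISO_9660_OFFSET, b"CD001", "ISO 9660 disk image"),
]


def identify_with_offsets(data):
    """Return the first type whose signature appears at its expected offset."""
    for offset, magic, name in OFFSET_SIGNATURES:
        # Slicing past the end yields a shorter result, so short inputs
        # simply fail the comparison instead of raising.
        if data[offset:offset + len(magic)] == magic:
            return name
    return None
```

The cost is that more of the file has to be read (a little over 32 KB here) before a type can be ruled out.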