Easiest and quick way to check if files are identical?
What would be the easiest way to check whether two files are identical? Mind you, this is for server/client checks. Basically, the server sends some code for the file and something about the file (let's say the file size), and then the client checks whether its copy is the same; if it is different, it downloads the file. But this isn't about how the file gets sent over.
Simply put, what would be the best way to detect whether a file is different? I have been trying modified dates, but from what I've seen many installers change the modified date depending on the time zone, and I've also had one change it by mere seconds!
I've used file size before, but I've seen that the file size sometimes doesn't change if the change is something minor (like a couple of edits to an image).
I've heard of hash checks spitting out a 16-number code or something like that, but I've not been able to find anything on them; Google isn't my friend today.
Re: Easiest and quick way to check if files are identical?
CRC is another option. This is an extremely fast CRC32 class and, in my opinion, it will provide the fastest way to see whether one file is the same as another.
A quick file length check before checking the CRC32 can speed things up (in case you're comparing, say, a 4GB file with a 1MB file, it would be a waste of time to do the CRC).
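A minimal sketch of the size-then-CRC idea, in Python rather than VB6 so it can be run as-is. It uses the standard library's `zlib.crc32`, not the CRC32 class mentioned above; the function names and the 64 KiB chunk size are assumptions made for illustration.

```python
import os
import zlib

CHUNK_SIZE = 64 * 1024  # assumed chunk size, not taken from the thread


def file_crc32(path):
    """Return the CRC-32 of a file, reading it in chunks."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)  # continue the running CRC
    return crc & 0xFFFFFFFF


def probably_same(path_a, path_b):
    """Cheap size check first, then CRC-32. A match is probable, not certain."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    return file_crc32(path_a) == file_crc32(path_b)
```

In the server/client case, the server would send the size and the CRC, and the client would only need to compute them for its local copy.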
Re: Easiest and quick way to check if files are identical?
This might work:
Code:
Dim FirstFileData As String, SecondFileData As String
Open "FirstFile" For Binary As #1
FirstFileData = Space(LOF(1))
Get #1, , FirstFileData
Open "SecondFile" For Binary As #2
SecondFileData = Space(LOF(2))
Get #2, , SecondFileData
If FirstFileData = SecondFileData Then
    MsgBox "Files are Identical."
Else
    MsgBox "Files are NOT Identical."
End If
Close
Before FileCopy showed up, I used to copy files like this:
Code:
DIM FirstFileData as String
OPEN "FirstFile" for Binary as #1
FirstFileData = SPACE(LOF(1))
GET 1, , FirstFileData
OPEN "SecondFile" for Binary as #2
Put 2, , FirstFileData
Close
Last edited by Code Doc; Mar 20th, 2007 at 09:06 AM.
Re: Easiest and quick way to check if files are identical?
That won't make a bit of difference if those are not the places the file was changed at. It only takes one bit changed in a file to make it not the same...
Re: Easiest and quick way to check if files are identical?
Originally Posted by randem
That won't make a bit of difference if those are not the places the file was changed at. It only takes one bit changed in a file to make it not the same...
If even one byte of the file is different, doing a CRC on the whole file will produce different results.
Re: Easiest and quick way to check if files are identical?
That's my point exactly... It is not reliable. It only takes one BIT of information, one binary digit, to throw the whole thing off. That will determine that they are not identical, but take a file with "AB", change it to "BA", and do a CRC, and it will say that the files are identical when they are not.
Re: Easiest and quick way to check if files are identical?
Originally Posted by randem
That's my point exactly... It is not reliable. It only takes one BIT of information, one binary digit, to throw the whole thing off. That will determine that they are not identical, but take a file with "AB", change it to "BA", and do a CRC, and it will say that the files are identical when they are not.
Uh, no. What CRC code are you using that gives you that result?
Re: Easiest and quick way to check if files are identical?
Basic CRC is done by adding a byte to a word and letting it roll over, or by adding a word to a dword and letting it roll over. Either way, if you exchange bytes in the first case or words in the second, the result will remain the same.
Re: Easiest and quick way to check if files are identical?
Originally Posted by randem
Basic CRC is done by adding a byte to a word and letting it roll over, or by adding a word to a dword and letting it roll over. Either way, if you exchange bytes in the first case or words in the second, the result will remain the same.
Not with (a good) CRC algorithm. The one I'm using (CRC32) does not produce those results.
Re: Easiest and quick way to check if files are identical?
It has to; that is the meaning of Cyclic Redundancy Check. In CRC32, all that means is that you are operating on 32-bit words at a time. Try switching binary information on a word boundary.
Re: Easiest and quick way to check if files are identical?
Originally Posted by randem
It has to; that is the meaning of Cyclic Redundancy Check. In CRC32, all that means is that you are operating on 32-bit words at a time. Try switching binary information on a word boundary.
I'm not going to try and claim I'm an expert at CRC algorithms but all I'm saying is I've never encountered one where if you simply switch the bytes it will produce the same result. I even tested that several times on the CRC class I'm using now.
I would imagine if just switching the byte order of the data produced the same results, CRC wouldn't be of much good for anything (checking for packet loss/corruption in a network program, for example).
CRC is not perfect and might not be the best option for checking whether a file is the same as another, although it seems accurate enough for me and is definitely faster than comparing every byte of the file. Maybe you can suggest a better method.
Re: Easiest and quick way to check if files are identical?
Switching a byte in CRC32 would generate a "not the same" result, but switching a word (two bytes on a word boundary) would generate "the same". It is basically good for determining whether a file has been changed, but if you want to be certain, CRC will only produce good results most of the time. It's the times when it doesn't that will cause the most problems. CRC was mainly used in communications, where one side would send data to the other and it needed to be verified that the same data was received. That was for transmission purposes, in which byte or word swapping was not an issue. The issue then was dropped or scrambled bits.
Re: Easiest and quick way to check if files are identical?
Originally Posted by randem
Switching a byte in CRC32 would generate a "not the same" result, but switching a word (two bytes on a word boundary) would generate "the same". It is basically good for determining whether a file has been changed, but if you want to be certain, CRC will only produce good results most of the time. It's the times when it doesn't that will cause the most problems. CRC was mainly used in communications, where one side would send data to the other and it needed to be verified that the same data was received. That was for transmission purposes, in which byte or word swapping was not an issue. The issue then was dropped or scrambled bits.
So you're saying that some CRC algorithms use 2-byte word boundaries (meaning they work with 2 bytes at a time), which is why simply reversing the 2 bytes (AB to BA) will produce the same checksum?
But if the CRC algorithm worked with 1 byte at a time (a 1-byte word boundary?), then switching 2 bytes in the file would produce different checksums.
I ask because I tried the AB/BA test on the CRC algorithm I'm using and it produced different checksums.
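The AB/BA test is easy to reproduce. Here is a small sketch in Python (not VB6, and not the CRC32 class used in this thread) that compares a plain additive byte sum with the standard library's `zlib.crc32`. The `additive_checksum` helper is a made-up name for illustration.

```python
import zlib


def additive_checksum(data):
    """Sum of all bytes, wrapping at 32 bits. Byte order does not affect it."""
    return sum(data) & 0xFFFFFFFF


# Swap two bytes, then swap two 2-byte words, and see which check notices.
for original, swapped in ((b"AB", b"BA"), (b"ABCD", b"CDAB")):
    for label, func in (("additive sum", additive_checksum), ("CRC-32", zlib.crc32)):
        verdict = "same" if func(original) == func(swapped) else "different"
        print(f"{label:12} {original!r} vs {swapped!r}: {verdict}")
```

The additive sum reports both swapped pairs as the same, while CRC-32 reports them as different. A CRC is a polynomial division rather than an addition, so the position of each byte matters.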
Re: Easiest and quick way to check if files are identical?
The Qualities of the CRC-32
CRCMAN uses the CRC-32 algorithm to generate a 32 bit number for any given file. We then treat this 32 bit number as a somewhat unique "fingerprint" for that file. This fingerprint differs somewhat from the human fingerprint. It is often said that no two people have identical fingerprints. This can't be the case for our CRC fingerprint. Since there are more than 4,294,967,296 different files in the world, it is a foregone conclusion that some of them must have identical checksums.
However, the CRC-32 does have attributes that make it very attractive for the verification of files. These include the following:
Re: Easiest and quick way to check if files are identical?
Originally Posted by randem
Basic CRC is done by adding a byte to a word and letting it roll over, or by adding a word to a dword and letting it roll over.
That's a checksum, not a CRC.
Re: Easiest and quick way to check if files are identical?
What CRC32 produces is exactly a checksum; it's just a different way of achieving the same thing. The basic point is that it is still not 100% reliable for comparing files. A text file and a pure binary file could have the same checksum and be totally different.
Re: Easiest and quick way to check if files are identical?
Now, this could be most embarrassing in a production environment: you get the same checksum from two different files and declare them the same, when checking just the first byte of each file would have told you they were different.
Re: Easiest and quick way to check if files are identical?
Well, to be honest CRC seems to work in MOST cases. And I really can't think of a better solution. I don't know of a better/more reliable algorithm out there. MD5 might be an option.
The only sure-fire way is a byte-by-byte comparison which can be extremely slow if the files are large and they are the same. If there was an ASM/C DLL written for that it might be faster.
Either way, AFAIK, CRC32 seems to be the best option. Maybe you know a better way?
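For the MD5 option, here is a rough Python sketch of how the server/client check from the first post could look (Python instead of VB6 so it can be run directly). It uses the standard `hashlib` module. The names `file_fingerprint` and `needs_download`, and the idea of the server advertising a (size, digest) pair, are assumptions for illustration only. Note that MD5 is fine against accidental changes, but it is known to be breakable by someone crafting collisions on purpose; passing `"sha256"` as the algorithm avoids that.

```python
import hashlib
import os


def file_fingerprint(path, algorithm="md5", chunk_size=64 * 1024):
    """Return (size_in_bytes, hex_digest) for a file, hashing it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()


def needs_download(local_path, server_fingerprint, algorithm="md5"):
    """True if the local file is missing or differs from what the server advertises."""
    if not os.path.isfile(local_path):
        return True
    return file_fingerprint(local_path, algorithm) != tuple(server_fingerprint)
```

An MD5 digest is 16 bytes (32 hex characters), which may be the "16 number code" the original question had in mind.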
Re: Easiest and quick way to check if files are identical?
Originally Posted by randem
I use byte by byte comparison and it is not slow at all. It may be possible that the routine you would use could be tweaked but I use it all the time.
I personally don't have any use for an algorithm like this. I could write an optimized byte-by-byte comparison routine, but if 2 large files (~2GB) are the same, it would be slow as hell.
If they were different, the loop could just exit on the first byte that is not the same, and it would be pretty fast. But looping through 2GB worth of data (if the files are the same) would take a while.
Re: Easiest and quick way to check if files are identical?
Here's one I wrote real quick. If you're comparing 2 files that are the same and they are really large (~2GB), it will take a long time because it has to scan through the whole file.
It might be possible to make it faster by loading more than 1 byte from the file at a time, but this is just an example:
vb Code:
Option Explicit
'Check if 2 files are the same.
Private Function FilesSame(ByVal FilePath1 As String, ByVal FilePath2 As String) As Boolean
Dim intFF1 As Integer, intFF2 As Integer
Dim byt1 As Byte, byt2 As Byte
Dim lonL1 As Long, lonL2 As Long
Dim lonCurByte As Long, bolDiff As Boolean
If Len(FilePath1) = 0 Or Len(FilePath2) = 0 Then Exit Function
Re: Easiest and quick way to check if files are identical?
Originally Posted by randem
You would never attempt to load one byte at a time... That would take a very long time.
Either way, even if you load say, 1KB at a time, you still need to compare that 1KB packet. You could convert it to string with StrConv() and compare it that way, but it's better to keep it as a byte array when comparing binary files.
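To make the chunked approach concrete, here is a minimal sketch in Python (not VB6, and not a completion of the function posted above). It keeps the data as raw bytes, checks the sizes first, and exits on the first chunk that differs. The function name and the 64 KiB default are assumptions.

```python
import os


def files_identical(path_a, path_b, chunk_size=64 * 1024):
    """Exact comparison: sizes first, then chunk by chunk with early exit."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # first differing chunk ends the scan
            if not chunk_a:
                return True  # both files exhausted without a mismatch
```

Only two chunks are held in memory at any time, so the file size does not matter for memory use.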
Re: Easiest and quick way to check if files are identical?
Does anyone recall my earlier post on this thread?
Code:
Dim FirstFileData As String, SecondFileData As String
Private Sub Form_Load()
    Open "SAT 1" For Binary As #1
    FirstFileData = Space(LOF(1))
    Get #1, , FirstFileData
    Open "SAT 1.Bak" For Binary As #2
    SecondFileData = Space(LOF(2))
    Get #2, , SecondFileData
    If FirstFileData = SecondFileData Then
        MsgBox "Files are Identical."
    Else
        MsgBox "Files are NOT Identical."
    End If
    Close
End Sub
To test it, I used two files, one a copy of the other, each 14 MB in size. Execution time < 0.2 seconds. Result: "Files are Identical".
Then I changed the very last visible character of the backup file. Both files were still the same size in bytes. Execution time < 0.2 seconds. Result: "Files are NOT Identical".
Re: Easiest and quick way to check if files are identical?
Originally Posted by Code Doc
Does anyone recall my earlier post on this thread?
Am I missing something? My case rests.
Try testing that on a 2GB file.
1. You're using a string variable (at least you're buffering it with Space() but it's still slower).
2. You're loading the entire file into memory! If comparing 2 files, each of them 2GB, that's 4GB being loaded into memory.
Re: Easiest and quick way to check if files are identical?
"If he's not working with large files, either of our methods will work."
--------------
Agreed. You can still use my method by reading the files in chunks of 30 MB or so and comparing each chunk. Move the pointer on each iteration and exit the loop when a chunk fails to compare. Done.
If my timer is correct, it will cost you at most 2 seconds on average for a pair of 2 GB files. The worst case is my example, where the last chunk fails. I'm only running at 1.7 GHz.
Re: Easiest and quick way to check if files are identical?
Here is an alternative method using the rather quick native InStr function (quick, considering that it compares every single byte). It also reads in bigger chunks to improve performance, although admittedly strings aren't the best data type to use. I used a chunk size of 50,000 bytes.
I also added some more error checks (compared to DigiRev's solution) so you can throw pretty much anything at it and it'll do its job without raising an error.
Code:
Public Function IsFilesSame(ByVal File1 As String, ByVal File2 As String) As Boolean
    Dim intFF1 As Integer, intFF2 As Integer, blnIsSame As Boolean
    Dim lngLen1 As Long, lngLen2 As Long
    Dim str1 As String, str2 As String
    ' ensure strings contain something
    If LenB(File1) = 0 Or LenB(File2) = 0 Then Exit Function
    ' identical path: trivially the same file
    If File1 = File2 Then IsFilesSame = True: Exit Function
    ' ensure files exist
    If LenB(Dir$(File1, vbHidden Or vbSystem)) = 0 Then Exit Function
    If LenB(Dir$(File2, vbHidden Or vbSystem)) = 0 Then Exit Function
    ' get file lengths
    lngLen1 = FileLen(File1)
    lngLen2 = FileLen(File2)
    ' compare file lengths
    If lngLen1 = lngLen2 Then
        ' see if zero length
        blnIsSame = (lngLen1 = 0)
        ' if not zero length
        If Not blnIsSame Then
            blnIsSame = True
            ' read files in chunks
            intFF1 = FreeFile
            Open File1 For Binary Access Read As #intFF1
            intFF2 = FreeFile
            Open File2 For Binary Access Read As #intFF2
            Do While blnIsSame And lngLen1 > 50000
                str1 = Input$(50000, #intFF1)
                str2 = Input$(50000, #intFF2)
                lngLen1 = lngLen1 - 50000
                ' compare: equal-length strings match only if str2 is found at position 1
                blnIsSame = (InStr(str1, str2) = 1)
            Loop
            If blnIsSame Then
                ' read and compare the remaining 1 to 50000 bytes
                str1 = Input$(lngLen1, #intFF1)
                str2 = Input$(lngLen1, #intFF2)
                blnIsSame = (InStr(str1, str2) = 1)
            End If
            Close #intFF1
            Close #intFF2
        End If
        ' what is the result...
        IsFilesSame = blnIsSame
    End If
End Function
Oh, and the reason optimizing for something other than strings doesn't gain much in this case is that you're reading from disk into memory anyway, which is many times slower than processing in memory alone. A good chunk size can make a surprisingly big difference in some cases, but the optimal size often varies from computer to computer.
Aww heck, and I'm awfully tired now... why oh why I spent my time on this...
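Since the best chunk size varies by machine, it is easy enough to measure. Below is a rough timing sketch in Python (not VB6). The helper names, the list of chunk sizes, and the 1 MB scratch files are all assumptions, and the numbers will be skewed by the operating system's file cache, so treat them as relative at best.

```python
import os
import tempfile
import time


def compare_chunked(path_a, path_b, chunk_size):
    """Chunked equality test; assumes the caller already checked the sizes."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a, b = fa.read(chunk_size), fb.read(chunk_size)
            if a != b:
                return False
            if not a:
                return True


def time_chunk_sizes(path_a, path_b, sizes=(1, 1024, 50_000, 1024 * 1024)):
    """Return {chunk_size: seconds} for one comparison pass per chunk size."""
    timings = {}
    for size in sizes:
        start = time.perf_counter()
        compare_chunked(path_a, path_b, size)
        timings[size] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    # Two identical 1 MB scratch files: the worst case, every byte gets read.
    payload = os.urandom(1024 * 1024)
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, name) for name in ("a.bin", "b.bin")]
        for p in paths:
            with open(p, "wb") as f:
                f.write(payload)
        for size, seconds in time_chunk_sizes(*paths).items():
            print(f"{size:>9} bytes per read: {seconds:.4f} s")
```

The 1-byte row is included only to show how much per-read overhead costs compared with larger chunks.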
Re: Easiest and quick way to check if files are identical?
Byte-by-byte comparison is the way to do it, and it is relatively fast too.
Re: Easiest and quick way to check if files are identical?
But make sure the files are read into a buffer and then processed, rather than loading the whole file into RAM in one hit.
Re: Easiest and quick way to check if files are identical?
Originally Posted by questioner
byte by byte comparison is the way to do it and relatively fast too.
Another way would be to divide the file into x parts, byte-sum each part, and do that for both files; however, I don't think you can beat the byte-by-byte approach for accuracy.
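A quick sketch of the per-part byte sum idea, in Python rather than VB6. It works on bytes already in memory to keep it short; the `part_sums` name and the choice of 4 parts are assumptions. It also shows why this is weaker than a byte-by-byte check: a swap inside one part goes unnoticed.

```python
def part_sums(data, parts=4):
    """Split data into up to `parts` slices and sum the bytes of each slice."""
    step = -(-len(data) // parts) or 1  # ceiling division, at least 1
    return [sum(data[i:i + step]) for i in range(0, len(data), step)]


original = b"ABCDEFGH"
swap_inside_part = b"BACDEFGH"   # A and B swapped within the first part
swap_across_parts = b"ACBDEFGH"  # B and C swapped across a part boundary

print(part_sums(original) == part_sums(swap_inside_part))   # True: change missed
print(part_sums(original) == part_sums(swap_across_parts))  # False: change caught
```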
Re: Easiest and quick way to check if files are identical?
Just for improving the speed for InStr method in what randem posted:
Code:
Private Function InstrCompareFiles() As Boolean
    ' Fnum, Fnum1, StartTime, EndTime and Msg come from randem's sample (not shown here);
    ' they are assumed to be declared at module level, with both files already open.
    Dim FirstFileData As String, SecondFileData As String
    Dim bytBuffer1(65535) As Byte
    Dim bytBuffer2(65535) As Byte
    Seek #Fnum, 1
    Seek #Fnum1, 1
    InstrCompareFiles = True
    StartTime = GetTickCount
    Do While Not EOF(Fnum)
        ' read the next 64 KB block from each file into the fixed buffers
        Get Fnum, , bytBuffer1
        Get Fnum1, , bytBuffer2
        If InStr(bytBuffer1, bytBuffer2) = 0 Then
            InstrCompareFiles = False
            Exit Do
        End If
    Loop
    EndTime = GetTickCount
    Msg = Msg & vbCrLf & vbCrLf & "InstrCompareFiles Elapsed time - " & EndTime - StartTime & " ms"
End Function
In this form it takes less than one third of the time it did before (on my computer, about three seconds versus about ten). All I did was replace the string processing with byte arrays and remove the allocation and freeing of memory blocks inside the loop...
Taking this into account, randem's sample is flawed: too much time goes into something that shouldn't be happening in a benchmark.
Edit!
I also turned on all the advanced optimizations, after which ByteArrayCompareFiles and ByteArrayXorXCompareFiles became the two fastest solutions in randem's original sample (but they are slower than the chunked InStr above, because reading in chunks is more efficient).