I am currently using the code below to get the MD5 hash of a given file. However, it is very slow! Is there a faster way to do it?
Code:
Using md5 As MD5 = MD5.Create()
    Using stream = File.OpenRead(filename)
        Return BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", String.Empty)
    End Using
End Using
The files are MP4 video files over 500 MB each, and each one runs about 25 minutes. There will eventually be larger and longer files.
I forgot to mention that the files are not local; they are on another computer on the same network. Later today I am going to see whether running the program on the system where the files are saved makes any difference and, if so, how much.
So I computed the MD5 for all of the music and video files on my system.
00:00:32.5008616 - 32 seconds
5,502 - total files
18,486,754,664 - total bytes
64,663,162 - largest file
3,360,006.3 - average file
I read the entire file and then did the MD5.
Code:
Private fct As Integer = 0
Private maxF As Long = 0
Private totF As Long = 0
Private md5Obj As New System.Security.Cryptography.MD5CryptoServiceProvider

Private Function MD5Enc(path As String) As String
    Dim bytesToHash() As Byte = IO.File.ReadAllBytes(path)
    fct += 1 'count
    If bytesToHash.Length > maxF Then maxF = bytesToHash.Length 'largest file
    totF += bytesToHash.Length 'total for average
    Dim hash() As Byte = md5Obj.ComputeHash(bytesToHash)
    Dim rv As String = BitConverter.ToString(hash).Replace("-", String.Empty)
    bytesToHash = Nothing
    hash = Nothing
    Return rv
End Function
Public Class Form1
    Private stpw As New Stopwatch

    Private Async Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
        Button1.Enabled = False
        Dim tsk As Task
        tsk = Task.Run(Sub()
                           stpw.Restart() 'reset first so repeated clicks do not accumulate time
                           Debug.WriteLine("---")
                           fct = 0
                           maxF = 0L
                           totF = 0L
                           Dim p As String = Environment.GetFolderPath(Environment.SpecialFolder.MyMusic)
                           DoFolders(p)
                           p = Environment.GetFolderPath(Environment.SpecialFolder.MyVideos)
                           DoFolders(p)
                           stpw.Stop()
                           Debug.WriteLine(stpw.Elapsed) 'elapsed time
                           Debug.WriteLine(fct.ToString("n0")) 'total files
                           Debug.WriteLine(totF.ToString("n0")) 'total bytes
                           Debug.WriteLine(maxF.ToString("n0")) 'largest file
                           Debug.WriteLine((totF / fct).ToString("n1")) 'average file
                       End Sub)
        Await tsk
        Button1.Enabled = True
        ' Stop
    End Sub

    'hash every file in the folder, then recurse into its subfolders
    Private Sub DoFolders(folder As String)
        Dim p() As String = IO.Directory.GetFiles(folder)
        For Each f As String In p
            Dim s As String = MD5Enc(f)
            ' Debug.WriteLine(s)
        Next
        p = IO.Directory.GetDirectories(folder)
        For Each d As String In p
            DoFolders(d)
        Next
    End Sub

    Private fct As Integer = 0
    Private maxF As Long = 0
    Private totF As Long = 0
    Private md5Obj As New System.Security.Cryptography.MD5CryptoServiceProvider

    Private Function MD5Enc(path As String) As String
        Dim bytesToHash() As Byte = IO.File.ReadAllBytes(path)
        Dim rv As String = ""
        fct += 1 'count
        If bytesToHash.Length > maxF Then maxF = bytesToHash.Length 'largest file
        totF += bytesToHash.Length 'total for average
        Dim hash() As Byte = md5Obj.ComputeHash(bytesToHash)
        rv = BitConverter.ToString(hash).Replace("-", String.Empty)
        hash = Nothing
        bytesToHash = Nothing
        Return rv
    End Function
End Class
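One caveat with the ReadAllBytes approach: it loads the whole file into memory and it throws on files of 2 GB or more, which matters if the videos keep growing. Below is a minimal sketch of a streamed alternative. The function name MD5Stream, the 1 MB buffer and the SequentialScan hint are my own assumptions, not something measured in this thread, so treat it as a starting point for your own timing tests.

```vbnet
Imports System.IO
Imports System.Security.Cryptography

Module StreamingHashSketch
    ' Hypothetical helper (not from the code above): hashes a file through a
    ' stream, so memory use stays flat no matter how large the file is.
    Public Function MD5Stream(filePath As String) As String
        ' Assumed value: a 1 MB read buffer. Measure before settling on a size.
        Const BufferSize As Integer = 1024 * 1024
        Using hasher As MD5 = MD5.Create()
            ' SequentialScan hints to the OS that the file is read front to back.
            Using stream As New FileStream(filePath, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, BufferSize, FileOptions.SequentialScan)
                Return BitConverter.ToString(hasher.ComputeHash(stream)).Replace("-", String.Empty)
            End Using
        End Using
    End Function
End Module
```

This will not make a slow disk or network any faster; it only removes the memory ceiling and lets the reads happen in larger chunks.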
Last edited by dbasnett; Jul 26th, 2022 at 12:01 PM.
A standard home or small-office network runs at 1 Gbit/s, which is 125 MB/s in theory and about 100 MB/s in practice.
5400 RPM HDD sequential reads are about 130-140 MB/s.
7200 RPM HDD sequential reads are about 220-230 MB/s.
So a 500 MB file needs about 5 seconds for the network transfer alone, even at full speed.
But the network is shared with other computers and devices, and with other apps on the same computer, so the actual time to pull the file over the network will be longer.
Hard drives are slow and become painfully slow on random access. Our computers also run many apps and services at the same time, and any of them can hit the same drive and reduce the read speed.
SSDs are much better at random access and have much higher sequential speeds. But even with PCIe Gen 4 SSDs reading gigabytes per second, we are still limited by the network.
The obvious upgrade is a faster network: 5 or 10 Gbit/s, or even faster. But that can become quite expensive if it is not planned well.
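To put rough numbers on the faster-network option, here is a small sketch that computes the best-case transfer time for one file at a given link speed. It uses the theoretical line rate (1 Gbit/s = 125 MB/s, a bit above the practical figure of about 100 MB/s used above) and ignores protocol overhead and other traffic. The function name BestCaseSeconds is made up for this example.

```vbnet
Module TransferTimeSketch
    ' Best-case seconds to move sizeMB megabytes over a link of linkGbit Gbit/s.
    ' Ignores protocol overhead and competing traffic, so real times are longer.
    Public Function BestCaseSeconds(sizeMB As Double, linkGbit As Double) As Double
        Dim linkMBPerSec As Double = linkGbit * 1000.0 / 8.0 ' 1 Gbit/s = 125 MB/s
        Return sizeMB / linkMBPerSec
    End Function

    Sub Main()
        ' The 500 MB file from the question at 1, 5 and 10 Gbit/s.
        For Each gbit As Double In {1.0, 5.0, 10.0}
            Console.WriteLine($"{gbit} Gbit/s: {BestCaseSeconds(500.0, gbit):n1} s")
        Next
    End Sub
End Module
```

For a 500 MB file that works out to 4 seconds at 1 Gbit/s, 0.8 seconds at 5 Gbit/s and 0.4 seconds at 10 Gbit/s, at which point a single HDD on the other end becomes the limit again.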
----
Some benchmarks:
174 video files, 57.5 GB in total (~340 MB each)
Hashing from a local hard drive (7200 RPM DC grade with about 220-240 MB/s sequential read): 4 min 15 sec
Hashing the same files over a 1 Gbit/s network (a different computer with a similar HDD): 15 min 08 sec
* NOTE: all "benchmarks" were timed with a mobile phone stopwatch :-)
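The two timings above can be turned into average throughput with a few lines. This is only the arithmetic, assuming 1 GB = 1000 MB; the function name MBPerSecond is made up for the example.

```vbnet
Module ThroughputSketch
    ' Average throughput in MB/s for totalGB gigabytes read in the given time.
    ' Assumption: 1 GB is counted as 1000 MB.
    Public Function MBPerSecond(totalGB As Double, minutes As Integer, seconds As Integer) As Double
        Return totalGB * 1000.0 / (minutes * 60 + seconds)
    End Function

    Sub Main()
        Console.WriteLine(MBPerSecond(57.5, 4, 15).ToString("n1")) ' local HDD run
        Console.WriteLine(MBPerSecond(57.5, 15, 8).ToString("n1")) ' network run
    End Sub
End Module
```

That gives roughly 225 MB/s for the local run, which is right at the sequential read speed of the drive, and roughly 63 MB/s for the network run.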
So what is the conclusion? My network is slow (I know, I know). My hard drives are not that slow, since they reach their maximum sequential read speed.
What if I use an SSD instead of an HDD? For this "simple" file hash operation, transferring the data over the network wipes out the benefit of the SSD. For working with these files locally, an SSD is much, much better than an HDD, but with the known limitations: size and price.
What is the solution for hashing large files on a remote computer? It should be obvious that the hash should be computed locally (where the data is) and then retrieved through a network service that returns only the hash (a few bytes) instead of gigabytes of data.
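As a rough illustration of that idea, here is a minimal sketch of such a service built on HttpListener, meant to run on the machine that holds the files. Everything specific in it is an assumption for the example: the URL prefix, port 8080, the D:\Videos folder and the "file" query parameter. It handles one request at a time and has no error handling or authentication. Note also that a localhost prefix only accepts local connections; listening for other machines usually needs a prefix like http://+:8080/md5/ plus a URL reservation or admin rights.

```vbnet
Imports System.IO
Imports System.Net
Imports System.Security.Cryptography
Imports System.Text

Module HashServiceSketch
    ' Assumed values, for illustration only.
    Private Const Prefix As String = "http://localhost:8080/md5/"
    Private Const MediaRoot As String = "D:\Videos"

    ' Hashes the file where it lives, so only the result crosses the network.
    Public Function HashFile(filePath As String) As String
        Using hasher As MD5 = MD5.Create()
            Using stream As FileStream = File.OpenRead(filePath)
                Return BitConverter.ToString(hasher.ComputeHash(stream)).Replace("-", String.Empty)
            End Using
        End Using
    End Function

    Sub Main()
        Dim listener As New HttpListener()
        listener.Prefixes.Add(Prefix)
        listener.Start()
        While True
            ' Blocks until a request such as /md5/?file=movie.mp4 arrives.
            Dim ctx As HttpListenerContext = listener.GetContext()
            ' Keep only the file name so a request cannot escape MediaRoot.
            Dim name As String = Path.GetFileName(If(ctx.Request.QueryString("file"), ""))
            Dim fullPath As String = Path.Combine(MediaRoot, name)
            Dim body As String
            If name <> "" AndAlso File.Exists(fullPath) Then
                body = HashFile(fullPath)
            Else
                ctx.Response.StatusCode = 404
                body = "not found"
            End If
            ' The reply is a 32-character hash instead of the file itself.
            Dim bytes() As Byte = Encoding.UTF8.GetBytes(body)
            ctx.Response.OutputStream.Write(bytes, 0, bytes.Length)
            ctx.Response.Close()
        End While
    End Sub
End Module
```

The client then only has to request the URL and read back the short text reply.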
Or compute the hash when the files are added to the storage and save it as metadata somewhere (a file, a local SQLite db, a network SQL database).
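A minimal sketch of the "save as metadata" variant, using the simplest of those stores: a small text file next to each video. The ".md5" sidecar naming and the function names are assumptions for this example, and the same idea carries over to a SQLite or SQL table keyed by file path.

```vbnet
Imports System.IO
Imports System.Security.Cryptography

Module HashCacheSketch
    Private Function HashFile(filePath As String) As String
        Using hasher As MD5 = MD5.Create()
            Using stream As FileStream = File.OpenRead(filePath)
                Return BitConverter.ToString(hasher.ComputeHash(stream)).Replace("-", String.Empty)
            End Using
        End Using
    End Function

    ' Hypothetical scheme: keep the hash in "<file>.md5" next to the file.
    Public Function GetCachedMD5(filePath As String) As String
        Dim sidecar As String = filePath & ".md5"
        ' Reuse the stored hash only if it is at least as new as the file.
        If File.Exists(sidecar) AndAlso
           File.GetLastWriteTimeUtc(sidecar) >= File.GetLastWriteTimeUtc(filePath) Then
            Return File.ReadAllText(sidecar).Trim()
        End If
        ' First request, or the file changed: hash once and remember the result.
        Dim hash As String = HashFile(filePath)
        File.WriteAllText(sidecar, hash)
        Return hash
    End Function
End Module
```

Reading the sidecar over the network costs a few bytes, so the slow full read happens only once per file, ideally on the machine that stores it.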