Here's my code for testing the speed of various memory copy functions. Each test calls one function 100 times, and the value printed afterwards is the average time (in milliseconds) that a single call took. The VB6 source code below has comments that show how it works.

Code:
Private Declare Sub CopyBytes Lib "FastMemCopy.dll" (ByRef Dest As Any, ByRef Src As Any, ByVal ByteCount As Long)
Private Declare Sub CopyWords Lib "FastMemCopy.dll" (ByRef Dest As Any, ByRef Src As Any, ByVal WordCount As Long)
Private Declare Sub CopyDWords Lib "FastMemCopy.dll" (ByRef Dest As Any, ByRef Src As Any, ByVal DWordCount As Long)
Private Declare Sub CopyBytesFast Lib "FastMemCopy.dll" (ByRef Dest As Any, ByRef Src As Any, ByVal ByteCount As Long)
Private Declare Sub CopyWordsFast Lib "FastMemCopy.dll" (ByRef Dest As Any, ByRef Src As Any, ByVal WordCount As Long)

Private Declare Sub CopyMemory Lib "kernel32.dll" Alias "RtlMoveMemory" (ByRef Destination As Any, ByRef Source As Any, ByVal Length As Long)

Private Declare Function timeBeginPeriod Lib "winmm.dll" (ByVal uPeriod As Long) As Long
Private Declare Function timeEndPeriod Lib "winmm.dll" (ByVal uPeriod As Long) As Long
Private Declare Function timeGetTime Lib "winmm.dll" () As Long


Private Sub Form_Load()
    Dim Mem1(100000000 - 1) As Byte
    Dim Mem2(100000000 - 1) As Byte
    Dim TimeStart As Long
    Dim TimeEnd As Long
    Dim TimePassed As Double
    Dim TimePassedAvg As Double
    Dim i As Long
    
    
    
    timeBeginPeriod 1
    
    
    'Perform 100 iterations of copying 100 million bytes, 1 byte at a time
    TimePassedAvg = 0
    For i = 1 To 100
        TimeStart = timeGetTime
        CopyBytes Mem2(0), Mem1(0), 100000000
        TimeEnd = timeGetTime
        TimePassed = TimeEnd - TimeStart
        TimePassedAvg = TimePassedAvg + TimePassed / 100
    Next i
    Print TimePassedAvg
    
    'Perform 100 iterations of copying 100 million bytes, 2 bytes at a time
    TimePassedAvg = 0
    For i = 1 To 100
        TimeStart = timeGetTime
        CopyWords Mem2(0), Mem1(0), 50000000
        TimeEnd = timeGetTime
        TimePassed = TimeEnd - TimeStart
        TimePassedAvg = TimePassedAvg + TimePassed / 100
    Next i
    Print TimePassedAvg
    
    'Perform 100 iterations of copying 100 million bytes, 4 bytes at a time
    TimePassedAvg = 0
    For i = 1 To 100
        TimeStart = timeGetTime
        CopyDWords Mem2(0), Mem1(0), 25000000
        TimeEnd = timeGetTime
        TimePassed = TimeEnd - TimeStart
        TimePassedAvg = TimePassedAvg + TimePassed / 100
    Next i
    Print TimePassedAvg
    
    
    
    'Should be identical to the third test (CopyDWords), as 100000000 is an exact multiple of 4 bytes
    TimePassedAvg = 0
    For i = 1 To 100
        TimeStart = timeGetTime
        CopyBytesFast Mem2(0), Mem1(0), 100000000 'Copy as many 4byte blocks as possible and then copy remaining data 1 byte at a time
        TimeEnd = timeGetTime
        TimePassed = TimeEnd - TimeStart
        TimePassedAvg = TimePassedAvg + TimePassed / 100
    Next i
    Print TimePassedAvg
    
    'Should also be identical to the third test (CopyDWords), as 50000000 words is an exact multiple of 2 words (4 bytes)
    TimePassedAvg = 0
    For i = 1 To 100
        TimeStart = timeGetTime
        CopyWordsFast Mem2(0), Mem1(0), 50000000 'Copy as many 4byte blocks as possible and then copy remaining data 2 bytes at a time
        TimeEnd = timeGetTime
        TimePassed = TimeEnd - TimeStart
        TimePassedAvg = TimePassedAvg + TimePassed / 100
    Next i
    Print TimePassedAvg
    
    
    'Perform 100 iterations of copying 100 million bytes using CopyMemory
    'Not sure what method CopyMemory uses, but it is supposed to work on overlapping memory regions, so it must use an advanced technique
    TimePassedAvg = 0
    For i = 1 To 100
        TimeStart = timeGetTime
        CopyMemory Mem2(0), Mem1(0), 100000000
        TimeEnd = timeGetTime
        TimePassed = TimeEnd - TimeStart
        TimePassedAvg = TimePassedAvg + TimePassed / 100
    Next i
    Print TimePassedAvg
    
    
    
    timeEndPeriod 1
    
End Sub
When the program is actually run, I find that there is really no speed difference between the different functions. I'm not sure why this is, but maybe on modern CPUs it takes the same amount of time to copy a given number of bytes, regardless of whether they are copied by Byte, Word, or DWord. So copying 4 bytes one at a time takes the same amount of time as copying 2 words or 1 dword. Unlike on older CPUs, maybe you no longer get a speed boost by optimizing your program to copy dwords or words instead of bytes.
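
A rough way to look at this is to turn an average time into a copy rate. The sketch below is only arithmetic on numbers like the ones I measured, and the names CopyRateMBps and ShowCopyRate are made up for illustration. At roughly 26 ms per 100,000,000 bytes it comes out at about 3,800 MB per second copied, and since every byte is read once and written once, the memory traffic is about double that. I have not measured what the RAM in this machine can sustain, but if the figure is close to that limit, the copy would be held back by memory rather than by the instruction used, which would fit all of the functions finishing in about the same time.

```vb
'Hypothetical helper (not part of the test program):
'converts a byte count and an average time in milliseconds into megabytes per second
Private Function CopyRateMBps(ByVal ByteCount As Double, ByVal AvgMs As Double) As Double
    '1 MB is taken as 1,000,000 bytes here
    CopyRateMBps = (ByteCount / 1000000#) / (AvgMs / 1000#)
End Function

Private Sub ShowCopyRate()
    '26 ms is roughly the typical average from my runs
    Debug.Print CopyRateMBps(100000000, 26)
End Sub
```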

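Here are the results of running this program 3 different times. In each run the six values are printed in the order the tests execute: CopyBytes, CopyWords, CopyDWords, CopyBytesFast, CopyWordsFast, CopyMemory.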
First time I ran the program:
25.66
26.17
25.90
25.83
26.29
25.71

Second time I ran the program:
27.36
30.50
30.17
26.73
26.88
26.18

Third time I ran the program:
25.58
25.98
25.64
25.44
25.86
25.73

As you can see, there is no consistency between the different times I ran the tester program, nor is there any consistency regarding which function is faster. Sometimes one function was faster, and sometimes another was. The only consistent thing is that the times hover around 26ms, and every once in a while a function (for no apparent reason) runs slower, taking about 30ms to complete. I'm not sure what caused those 30ms outliers. All of these inconsistencies are present despite each printed value being an average over 100 runs of the function. I hope somebody can explain these inconsistencies.
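
One thing that may be hiding real differences is the timer. timeGetTime only returns whole milliseconds, so each 26ms measurement can be off by up to a millisecond, and an average alone does not show whether a slow result came from one bad iteration or from all of them. Below is a minimal sketch of a finer-grained measurement using QueryPerformanceCounter, shown for CopyMemory only so that it does not need my DLL. The routine name TimeCopyMemory, the untimed warm-up copy, and the use of ReDim for the buffers are my own choices for this sketch, not something taken from the program above.

```vb
Private Declare Function QueryPerformanceCounter Lib "kernel32.dll" (ByRef lpPerformanceCount As Currency) As Long
Private Declare Function QueryPerformanceFrequency Lib "kernel32.dll" (ByRef lpFrequency As Currency) As Long
Private Declare Sub CopyMemory Lib "kernel32.dll" Alias "RtlMoveMemory" (ByRef Destination As Any, ByRef Source As Any, ByVal Length As Long)

'Hypothetical routine: times 100 calls to CopyMemory and reports min, average and max
Private Sub TimeCopyMemory()
    Dim Mem1() As Byte
    Dim Mem2() As Byte
    Dim Freq As Currency
    Dim T0 As Currency
    Dim T1 As Currency
    Dim Ms As Double
    Dim MinMs As Double
    Dim MaxMs As Double
    Dim SumMs As Double
    Dim i As Long

    ReDim Mem1(100000000 - 1)
    ReDim Mem2(100000000 - 1)

    'Counter ticks per second. Currency scales the 64-bit value by 10000,
    'but the same scaling applies to T0 and T1, so it cancels out in the division below
    QueryPerformanceFrequency Freq

    'One untimed copy so that both buffers have been touched before measuring
    CopyMemory Mem2(0), Mem1(0), 100000000

    For i = 1 To 100
        QueryPerformanceCounter T0
        CopyMemory Mem2(0), Mem1(0), 100000000
        QueryPerformanceCounter T1
        Ms = (T1 - T0) / Freq * 1000#
        If i = 1 Or Ms < MinMs Then MinMs = Ms
        If i = 1 Or Ms > MaxMs Then MaxMs = Ms
        SumMs = SumMs + Ms
    Next i

    Debug.Print "min"; MinMs; "avg"; SumMs / 100; "max"; MaxMs
End Sub
```

If the minimum stays steady from run to run while the maximum jumps around, the outliers are more likely coming from whatever else the machine is doing than from the copy function itself.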


The first 5 Copy functions are in a DLL that I wrote myself in assembly language and assembled with FASM. Below is the source code for that DLL file. It also has comments so you can see how it works.
Code:
format PE GUI 4.0 DLL
entry dllmain
include "macro\export.inc"

Arg1 equ ebp+8
Arg2 equ Arg1+4
Arg3 equ Arg2+4


section ".text" code readable executable
        dllmain:
        mov eax,1
        ret 12

        CopyBytes:
        push ebp
        mov ebp,esp
        push esi
        push edi
        push ecx
        mov edi,[Arg1]
        mov esi,[Arg2]
        mov ecx,[Arg3] ;Number of bytes to copy
        rep movsb ;Copy data 1 byte at a time
        pop ecx
        pop edi
        pop esi
        leave
        ret 12

        CopyWords:
        push ebp
        mov ebp,esp
        push esi
        push edi
        push ecx
        mov edi,[Arg1]
        mov esi,[Arg2]
        mov ecx,[Arg3] ;Number of words (2 byte blocks) to copy
        rep movsw ;Copy data 1 word at a time
        pop ecx
        pop edi
        pop esi
        leave
        ret 12

        CopyDWords:
        push ebp
        mov ebp,esp
        push esi
        push edi
        push ecx
        mov edi,[Arg1]
        mov esi,[Arg2]
        mov ecx,[Arg3] ;Number of dwords (4 byte blocks) to copy
        rep movsd ;Copy data 1 dword at a time
        pop ecx
        pop edi
        pop esi
        leave
        ret 12


        CopyBytesFast:
        push ebp
        mov ebp,esp
        push esi
        push edi
        push ecx
        mov edi,[Arg1]
        mov esi,[Arg2]
        mov eax,[Arg3] ;Number of bytes to copy
        xor edx,edx
        mov ecx,4
        div ecx
        mov ecx,eax
        rep movsd ;First, copy as much data as possible 4 bytes at a time
        mov ecx,edx
        rep movsb ;Then, copy remaining data 1 byte at a time
        pop ecx
        pop edi
        pop esi
        leave
        ret 12

        CopyWordsFast:
        push ebp
        mov ebp,esp
        push esi
        push edi
        push ecx
        mov edi,[Arg1]
        mov esi,[Arg2]
        mov eax,[Arg3] ;Number of words to copy
        xor edx,edx
        mov ecx,2
        div ecx
        mov ecx,eax
        rep movsd ;First, copy as much data as possible 2 words at a time
        mov ecx,edx
        rep movsw ;Then, copy remaining data 1 word at a time
        pop ecx
        pop edi
        pop esi
        leave
        ret 12


section ".edata" export readable
        export "FastMemCopy.dll",\
               CopyBytes, "CopyBytes",\
               CopyWords, "CopyWords",\
               CopyDWords, "CopyDWords",\
               CopyBytesFast, "CopyBytesFast",\
               CopyWordsFast, "CopyWordsFast"

section ".reloc" fixups readable
        dq 0
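
Since the timing tests only ever use sizes that are exact multiples of 4 bytes, the leftover-bytes path in CopyBytesFast (the final rep movsb) never actually does anything in them. Here is a small sketch of how that path could be checked from VB6, assuming FastMemCopy.dll is somewhere the program can find it. The function name CopyBytesFastWorks and the size 1003 are made up for illustration.

```vb
Private Declare Sub CopyBytesFast Lib "FastMemCopy.dll" (ByRef Dest As Any, ByRef Src As Any, ByVal ByteCount As Long)

'Hypothetical check routine, not part of the timing program.
'Returns True if every byte of the source arrived in the destination
Private Function CopyBytesFastWorks() As Boolean
    Const N As Long = 1003 'Deliberately not a multiple of 4, so 3 bytes are left for rep movsb
    Dim Src(N - 1) As Byte
    Dim Dst(N - 1) As Byte
    Dim i As Long

    'Fill the source with a repeating pattern whose period (251) is not a multiple of 4
    For i = 0 To N - 1
        Src(i) = i Mod 251
    Next i

    CopyBytesFast Dst(0), Src(0), N

    'Any mismatch leaves the function result at its default value of False
    For i = 0 To N - 1
        If Dst(i) <> Src(i) Then Exit Function
    Next i

    CopyBytesFastWorks = True
End Function
```

The same idea should work for CopyWordsFast by passing an odd number of words, so that one word is left over for the final rep movsw.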