
Thread: Which is the faster way to ZeroExtend 2 bytes to 4?

  1. #1

    Thread Starter
    Frenzied Member
    Join Date
    Oct 2008
    Posts
    1,186

    Which is the faster way to ZeroExtend 2 bytes to 4?

    <<Note that the code shown in this post is using the NASM syntax (first operand is destination).>>

    I can either zero the destination 4-byte register and then copy in the two bytes like this:
    Code:
    xor eax,eax
    mov ax,[mem_ptr]    ; 16-bit load into the low half (a 32-bit "mov eax" would overwrite all 4 bytes)

    or I can do it in one operation like this:
    Code:
    movzx eax,word [mem_ptr]    ; NASM needs the source size spelled out for a memory operand
    While the MOVZX instruction does both things in a single line of assembly code, my question is whether it's actually faster for the CPU to execute. Oftentimes in programming, code that takes more lines to write turns out to be faster to execute. It all depends on the number of clock ticks needed to execute the MOVZX instruction compared to the number of clock ticks needed to execute the 2 instructions XOR and then MOV (the sum of the clock ticks for each of these 2 added together).
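
    To make sure both variants really produce the same value, here is a minimal sketch of their semantics in C rather than NASM, so it can be compiled anywhere. The function names zext_two_step and zext_movzx are made up for illustration, and the two-step version assumes the MOV is a 16-bit load into the low half of the zeroed register. It says nothing about timing, only about the result.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: models "xor eax,eax" followed by a 16-bit "mov ax,[mem_ptr]". */
uint32_t zext_two_step(const void *mem_ptr) {
    uint32_t eax = 0;                    /* xor eax,eax: clear all 32 bits */
    uint16_t ax;
    memcpy(&ax, mem_ptr, sizeof ax);     /* read exactly 2 bytes from memory */
    eax = (eax & 0xFFFF0000u) | ax;      /* write only the low 16 bits, keep the (zero) upper half */
    return eax;
}

/* Hypothetical helper: models "movzx eax,word [mem_ptr]". */
uint32_t zext_movzx(const void *mem_ptr) {
    uint16_t ax;
    memcpy(&ax, mem_ptr, sizeof ax);     /* read exactly 2 bytes from memory */
    return (uint32_t)ax;                 /* widening an unsigned 16-bit value zero-extends it */
}
```

    Both functions return the same 32-bit value for any 2 input bytes, so the only open question is which instruction sequence the CPU gets through faster.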

  2. #2
    Frenzied Member 2kaud's Avatar
    Join Date
    May 2014
    Location
    England
    Posts
    1,096

    Re: Which is the faster way to ZeroExtend 2 bytes to 4?

    Intel provides detailed info re the various instructions that include clock cycles used. See:
    https://www.intel.com/content/www/us...intel-sdm.html (vol 2)
    All advice is offered in good faith only. You are ultimately responsible for the effects of your programs and the integrity of the machines they run on. Anything I post, code snippets, advice, etc is licensed as Public Domain https://creativecommons.org/publicdomain/zero/1.0/

    C++23 Compiler: Microsoft VS2022 (17.6.5)

  3. #3
    PowerPoster wqweto's Avatar
    Join Date
    May 2011
    Location
    Sofia, Bulgaria
    Posts
    5,413

    Re: Which is the faster way to ZeroExtend 2 bytes to 4?

    Quote Originally Posted by Ben321 View Post
    It all depends on the number of clock ticks that are needed to execute the MOVZX instruction compared to the number of clock ticks that are needed to execute the 2 instructions XOR and then MOV (the sum of the clock ticks for each of these 2 added together).
    In your sample, XOR and MOV are probably pipelined, i.e. executed at the same time, so effectively the second instruction takes 0 ticks to execute.

    That is the reason you'll often see several (for instance 4) XOR + MOV combos stacked together: this way they are executed simultaneously on multiple pipelines.

    The funny thing is that the question in the OP is hard to even test properly. You don't execute these instructions on their own, and usually they get pipelined, i.e. you get a lot of XOR + MOVs for free around other, more expensive instructions.
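
    If you want to try timing it anyway, the following is only a rough sketch in C, not a proper benchmark. The names sum_zero_extended and time_sum are made up for illustration. It times a whole loop of zero-extending loads with the standard clock() function, because a single instruction cannot be timed on its own. An optimizing compiler is free to vectorize the loop or hoist the repeated work, so treat any number it prints with suspicion.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: zero-extends each 16-bit element to 32 bits and sums the results. */
uint64_t sum_zero_extended(const uint16_t *src, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += (uint32_t)src[i];         /* the zero-extending load under test */
    }
    return sum;
}

/* Hypothetical helper: runs the loop `repeats` times and returns elapsed CPU seconds.
   The accumulated total is written to *out so the work has an observable result. */
double time_sum(const uint16_t *src, size_t n, int repeats, uint64_t *out) {
    uint64_t total = 0;
    clock_t start = clock();
    for (int r = 0; r < repeats; r++) {
        total += sum_zero_extended(src, n);
    }
    clock_t end = clock();
    *out = total;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

    Whatever you measure this way is the throughput of the whole loop on one particular CPU, with the surrounding instructions included, which is exactly the problem described above.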

    cheers,
    </wqw>
