
Byte vs halfword vs word comparison

Hi Experts,

unsigned int var1_32;
unsigned int var2_32;

unsigned short int var1_16;
unsigned short int var2_16;

unsigned char var1_8;
unsigned char var2_8;

In the above declarations which is faster,

if(var1_32 == var2_32)
{

}

or

if(var1_16 == var2_16)
{

}

or

if(var1_8 == var2_8)
{

}
  • I agree. Even if you're reading the values from memory, the speed would be the same.

    -But I imagine that the variables do not always contain the same values, so imagining the code...

         var1_8++;

         if(var1_8 == var2_8)

         {

         }

    ...something will happen that makes using an 8-bit value slower than using a 32-bit value. This is something that chrisshore is an expert on...

    var1_8 is loaded into a register, and since it now sits in a 32-bit register, incrementing it will not automatically wrap it to 0 when the value 255 is incremented.

    Instead, the register will hold the value 256, so the compiler must wrap it back to 0 by clearing the upper 24 bits.

    On some architectures (for instance, Cortex-M), this will take extra time.

    I recommend using uint8_t, uint16_t, uint32_t and uint64_t, plus their signed siblings int8_t, int16_t, int32_t and int64_t instead of unsigned char/short/long/long long, etc.

    To use those, include <stdint.h>; it has been part of the C standard library since C99.

    The reason is that you don't really know the size of a long or short or int or long long. Most people assume that a long is 32-bit, but it does not have to be.

  • Using a full word instead of bytes for arithmetic isn't always a good idea (besides, I think one should ignore this unless a real speed problem turns up, and just design what seems sensible). If you have a large array, it is often better to make its items as small as possible to avoid extra memory accesses. For a single counter that only goes to 100, yes, putting it in a full 32-bit word is faster as a program variable. But if there are ten thousand of them and your cache is 32K, then as bytes they occupy about a third of the cache, while as 32-bit integers they cause the cache to thrash. So using bytes could easily be ten times faster.

  • What you have here is the tension between storage size and computation size.

    You are right that there are good reasons for carrying out computations using the natural word size of the machine - as you say, this avoids extra instructions to deal with truncation, rounding, normalization etc.  But there are also good reasons for, in some circumstances, wanting to store variables in the smallest possible container (reduces memory footprint, reduces cache pollution etc.)

    The ARM architecture is actually very good for this. Values can be automatically converted from byte/halfword to word on loading (LDRB/LDRSB/LDRH/LDRSH carry out the zero or sign extension as part of the load); values can be automatically converted back to byte/halfword on storing (STRB/STRH automatically discard the upper bits). Computation can then be carried out at natural size while memory footprint can be managed as necessary with very little (if any) overhead.

    As Jens points out above (very kind of you to say so, but I'm not sure I'm the world's greatest expert on this!) the extra overhead is required when carrying out computations on sub-word quantities. Then extra instructions are required. In many cases, these can be avoided though. For example, if you are simply going to store the result of a computation to memory, it may be that the necessary truncation can be carried out (for free) by the appropriate STRB/STRH instruction. So the cost is not always incurred.

    Hope this helps.

    Chris

  • Absolutely true both daith and Chris. -One thing that I had in mind when I wrote the reply above, was that the declared variables will not always be placed on the stack. If your routine uses only very few local variables, CPU registers will be used directly, and no memory read/write will occur. In such cases, you'll need truncation, as Chris mentions, when performing operations on the values.

    It is a bit difficult to say exactly what the best thing to do is, as daith mentioned, because in some situations you might benefit from using 32-bit values, and in others from 8-bit... You might start out using 32-bit and later, after modifying your code, find that your array has grown so large that it pays to switch to 8-bit.

    However, if you're declaring the variables as non-arrays (e.g. you have very few of them), I'd generally recommend using 32-bit values.

  • As a follow-on to Chris' comment about type conversions often coming for free, it's worth pointing out that compilers also know that there is no need to do type conversions for intermediate results.  For example, in the following code:

         extern unsigned char c[4];
    
         unsigned char sum(void)
         {
              return c[0] + c[1] + c[2] + c[3];
         }
    

    ... the compiler may generate code like this for the core of sum():

         ldrb    r1, [r3]
         ldrb    r2, [r3, #2]
         add     r0, r0, r1
         ldrb    r3, [r3, #3]
         add     r0, r0, r2
         add     r0, r0, r3
         uxtb    r0, r0
    

    The upper bits of the intermediate result in r0 contain garbage in the form of overflowed bits, but the compiler knows that this doesn't affect the bits that matter for the result.  Only one truncation is needed, at the end - and it is only needed because the procedure call standard requires the spare bits to be zero when returning a value of type unsigned char.

    If the function is inlined, the compiler doesn't need to follow the procedure call standard for this value and the uxtb will likely disappear.

    On the whole, you should not worry about which types are "more efficient" - the CPU architecture and implementation and the compiler between them will generally do a pretty good job.  Good choice of algorithms and data representation, or using appropriate pre-optimized libraries for your program, have a much bigger impact on performance.  This is the part the compiler can't do for you. Focusing on the code design also keeps your code more portable - important if you want it to perform well on both AArch32 and AArch64 for example.

    It's definitely worth getting into the habit of disassembling the code coming out of the compiler - the optimisations the compiler applies (or fails to apply) can be very surprising, especially at high optimization levels.

  • Dave's comments are spot on here. In general, the compiler will do a very good job with what you give it. But some thought into the most efficient/appropriate/suitable data types will give it a lot of help.

    One other reason which occurs to me for using "small" containers is the possibility of getting much more value out of SIMD instructions. The NEON architecture (and to a lesser extent the v6 SIMD extensions) is capable of handling a number of individual data items packed into wide vector registers. The smaller the items, the more of them fit into a vector. This can pay huge dividends if used correctly.

    Chris