c - Why is the generated assembly reordered when using intrinsics?


I was playing around a bit with intrinsics, as I needed an O(1) complexity function similar to memcmp() for a fixed input size. I ended up writing this:

    #include <stdint.h>
    #include <emmintrin.h>

    int64_t f (int64_t a[4], int64_t b[4]) {
        __m128i *x = (void *) a, *y = (void *) b, r[2], t;
        int64_t *ret = (void *) &t;

        r[0] = _mm_xor_si128(x[0], y[0]);
        r[1] = _mm_xor_si128(x[1], y[1]);
        t = _mm_or_si128(r[0], r[1]);

        return (ret[0] | ret[1]);
    }

which, when compiled, turns into this:

    f:
        movdqa  xmm0, xmmword ptr [rdi]
        movdqa  xmm1, xmmword ptr [rdi+16]
        pxor    xmm0, xmmword ptr [rsi]
        pxor    xmm1, xmmword ptr [rsi+16]
        por     xmm0, xmm1
        movq    rdx, xmm0
        pextrq  rax, xmm0, 1
        or      rax, rdx
        ret

http://goo.gl/etovja (Godbolt Compiler Explorer)
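For reference, the value this function computes can also be written without SSE. The sketch below is only an illustration of the XOR-then-OR reduction used above; the name `f_scalar` is hypothetical and not part of the original code. It returns zero exactly when the two 32-byte inputs are equal, so unlike memcmp() it reports equality only, not ordering.

```c
#include <stdint.h>

/* Hypothetical scalar equivalent of f(): XOR each pair of 64-bit
 * words and OR the four results together. The result is 0 if and
 * only if all four words of a and b are equal. */
int64_t f_scalar(const int64_t a[4], const int64_t b[4])
{
    int64_t acc = 0;

    for (int i = 0; i < 4; i++)
        acc |= a[i] ^ b[i];

    return acc;
}
```

The loop has a fixed trip count of four, so the work does not depend on the data, which is the sense in which the function is O(1).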


Afterwards, though, I became curious whether I needed to use the intrinsic functions at all, or whether I only needed the types and could use normal operators. I then modified the above code (only the three SSE lines, really) and ended up with this:

    #include <stdint.h>
    #include <emmintrin.h>

    int64_t f (int64_t a[4], int64_t b[4]) {
        __m128i *x = (void *) a, *y = (void *) b, r[2], t;
        int64_t *ret = (void *) &t;

        r[0] = x[0] ^ y[0];
        r[1] = x[1] ^ y[1];
        t = r[0] | r[1];

        return (ret[0] | ret[1]);
    }

which instead compiles to this:

    f:
        movdqa  xmm0, xmmword ptr [rdi+16]
        movdqa  xmm1, xmmword ptr [rdi]
        pxor    xmm0, xmmword ptr [rsi+16]
        pxor    xmm1, xmmword ptr [rsi]
        por     xmm0, xmm1
        movq    rdx, xmm0
        pextrq  rax, xmm0, 1
        or      rax, rdx
        ret

http://goo.gl/odhf3z (Godbolt Compiler Explorer)


Now, functionally (AFAICT), the two compiled assembly outputs are identical. In fact, they appear to take the exact same amount of time and resources, so they should execute identically. However, I am curious why the operands in the first four instructions have been moved around. Is there a particular reason why one way might be chosen over the other?

Note: both of the functions were compiled with GCC, using identical flags.

TL;DR: From the compiler's point of view, the input code is different, so it might go through different places and hit different tests on the way through, which makes the output different.

You won't see this in (a current) clang, since the intrinsics disappear when the code is lowered to IR (an intermediate representation of the code that LLVM uses). It is the IR that gets transformed into instructions, and the IR for both cases is the same.

If you check out the code with clang or with different versions of GCC, you'll see slight changes in instruction scheduling. These changes are due to changes in the CPU scheduler or the register allocator from version to version.

To try it out, put the two functions in the same file. Try different versions of GCC, and try different versions of clang. clang changes the ordering of the movd instruction, and emits both functions with the same instructions, since the LLVM backend gets the same IR in both cases.
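A minimal sketch of such a file could look like the following. The names `f_intrin`, `f_ops` and `fold` are hypothetical. As an assumption that differs from the question's code, the sketch uses unaligned loads and a store instead of pointer casts, so that it is safe to call with arbitrarily aligned arrays; the three SSE lines under comparison are otherwise the same. The `^` and `|` operators on `__m128i` rely on the GCC/clang vector extensions.

```c
#include <stdint.h>
#include <emmintrin.h>

/* Hypothetical helper: OR the two 64-bit halves of a vector. */
static int64_t fold(__m128i t)
{
    int64_t out[2];

    _mm_storeu_si128((__m128i *) out, t);
    return out[0] | out[1];
}

/* Variant 1: the three SSE operations written with intrinsics. */
int64_t f_intrin(const int64_t a[4], const int64_t b[4])
{
    __m128i x0 = _mm_loadu_si128((const __m128i *) a);
    __m128i x1 = _mm_loadu_si128((const __m128i *) (a + 2));
    __m128i y0 = _mm_loadu_si128((const __m128i *) b);
    __m128i y1 = _mm_loadu_si128((const __m128i *) (b + 2));

    __m128i r0 = _mm_xor_si128(x0, y0);
    __m128i r1 = _mm_xor_si128(x1, y1);
    return fold(_mm_or_si128(r0, r1));
}

/* Variant 2: the same operations written with plain operators
 * (a GCC/clang vector extension, not standard C). */
int64_t f_ops(const int64_t a[4], const int64_t b[4])
{
    __m128i x0 = _mm_loadu_si128((const __m128i *) a);
    __m128i x1 = _mm_loadu_si128((const __m128i *) (a + 2));
    __m128i y0 = _mm_loadu_si128((const __m128i *) b);
    __m128i y1 = _mm_loadu_si128((const __m128i *) (b + 2));

    __m128i r0 = x0 ^ y0;
    __m128i r1 = x1 ^ y1;
    return fold(r0 | r1);
}
```

Compiling this with something like `gcc -O2 -S` (the flags used in the question are not stated, so these are only an example) and diffing the bodies of the two functions shows whether a given compiler version orders the loads differently.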

I don't know the internals of GCC, but I suppose the two functions happen not to hit the exact same places in the code scheduler and end up emitting the loads in a different order. This could happen because one of the calls to the intrinsics might not be lowered to the intermediate representation in one case, and would stay as intrinsic (not function) calls.

