c - Why is the generated assembly reordered when using intrinsics?
I was playing around a bit with intrinsics, as I needed an O(1)
complexity function similar to memcmp() for a
fixed input size. I ended up writing this:
    #include <stdint.h>
    #include <emmintrin.h>

    int64_t f (int64_t a[4], int64_t b[4]) {
        __m128i *x = (void *) a, *y = (void *) b, r[2], t;
        int64_t *ret = (void *) &t;

        r[0] = _mm_xor_si128(x[0], y[0]);
        r[1] = _mm_xor_si128(x[1], y[1]);
        t = _mm_or_si128(r[0], r[1]);

        return (ret[0] | ret[1]);
    }
which, when compiled, turns into this:
    f:
        movdqa  xmm0, xmmword ptr [rdi]
        movdqa  xmm1, xmmword ptr [rdi+16]
        pxor    xmm0, xmmword ptr [rsi]
        pxor    xmm1, xmmword ptr [rsi+16]
        por     xmm0, xmm1
        movq    rdx, xmm0
        pextrq  rax, xmm0, 1
        or      rax, rdx
        ret
http://goo.gl/etovja (godbolt compiler explorer)
After this, though, I became curious whether I needed to use the intrinsic functions at all, or whether I only needed the types and could use the normal operators. I modified the above code (only the 3 SSE lines, really) and ended up with this:
    #include <stdint.h>
    #include <emmintrin.h>

    int64_t f (int64_t a[4], int64_t b[4]) {
        __m128i *x = (void *) a, *y = (void *) b, r[2], t;
        int64_t *ret = (void *) &t;

        r[0] = x[0] ^ y[0];
        r[1] = x[1] ^ y[1];
        t = r[0] | r[1];

        return (ret[0] | ret[1]);
    }
which instead compiles to this:
    f:
        movdqa  xmm0, xmmword ptr [rdi+16]
        movdqa  xmm1, xmmword ptr [rdi]
        pxor    xmm0, xmmword ptr [rsi+16]
        pxor    xmm1, xmmword ptr [rsi]
        por     xmm0, xmm1
        movq    rdx, xmm0
        pextrq  rax, xmm0, 1
        or      rax, rdx
        ret
http://goo.gl/odhf3z (godbolt compiler explorer)
Now, functionally (AFAICT), the 2 compiled assembly outputs are identical. In fact, they appear to take the exact same amount of time and resources; they execute identically. However, I am curious why the operands in the first 4 instructions have been moved around: the intrinsics version loads [rdi] first, while the operator version loads [rdi+16] first. Is there a particular reason why one way might be chosen over the other?
Note: both functions were compiled with gcc, using identical flags.
TL;DR: From the compiler's point of view, the input code is different, so it might go through different places and hit different tests on the way through, which can make the output different.
You won't see this in (a current) clang, since the intrinsics disappear when the code is lowered to IR (the intermediate representation of the code that LLVM uses). The IR is what gets transformed into instructions, and the IR for both cases is the same.
If you check out the code with clang or with different versions of gcc, you'll see slight changes in instruction scheduling. These changes are due to changes in the compiler's instruction scheduler or register allocator from version to version.
To try this out, put the 2 functions provided in the same file. Try it with different versions of gcc, and try it with different versions of clang. Clang changes the ordering of the movd instruction, but emits both functions with the same instructions, since the LLVM backend gets the same IR for both cases.
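As a minimal sketch of that experiment, the two functions from the question can sit side by side in one file. The names f_intrinsics and f_operators are hypothetical (the question calls both functions f); the bodies are otherwise unchanged. This assumes a compiler that supports the GCC vector extensions (gcc or clang), and both functions assume the input arrays are 16-byte aligned, as the movdqa loads in the listings require.

```c
#include <stdint.h>
#include <emmintrin.h>

/* Variant 1: explicit SSE2 intrinsics.
   The name f_intrinsics is hypothetical; the question calls it f. */
int64_t f_intrinsics(int64_t a[4], int64_t b[4]) {
    __m128i *x = (void *) a, *y = (void *) b, r[2], t;
    int64_t *ret = (void *) &t;

    r[0] = _mm_xor_si128(x[0], y[0]);   /* differences in a[0..1] vs b[0..1] */
    r[1] = _mm_xor_si128(x[1], y[1]);   /* differences in a[2..3] vs b[2..3] */
    t = _mm_or_si128(r[0], r[1]);       /* merge both halves */

    return (ret[0] | ret[1]);           /* zero only if all 4 elements match */
}

/* Variant 2: same logic with plain operators on the vector type
   (a GCC/clang vector extension). The name f_operators is hypothetical. */
int64_t f_operators(int64_t a[4], int64_t b[4]) {
    __m128i *x = (void *) a, *y = (void *) b, r[2], t;
    int64_t *ret = (void *) &t;

    r[0] = x[0] ^ y[0];
    r[1] = x[1] ^ y[1];
    t = r[0] | r[1];

    return (ret[0] | ret[1]);
}
```

Compiling this file with something like `gcc -O2 -S -masm=intel` (and the equivalent clang invocation) should let you compare the two function bodies directly. Whether the loads come out in the same order depends on the compiler and version, so treat any particular ordering as an observation about that build rather than a guarantee.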
I don't know the internals of gcc, but I suppose the two functions happen not to hit the exact same places in the scheduler code and end up emitting the loads in a different order. This can happen because the calls to intrinsics might not be lowered to the same intermediate representation in one case, and instead stay as intrinsic (not function) calls.