Convert array of eight bytes to eight integers
I am working with the Xeon Phi Knights Landing. I need to do a gather operation from an array of doubles, where the list of indices comes from an array of chars. The gather operations are either `_mm512_i32gather_pd` or `_mm512_i64gather_pd`. As I understand it, I either need to convert eight chars to eight 32-bit integers or eight chars to eight 64-bit integers. I have gone with the first choice, for `_mm512_i32gather_pd`.
I have created two functions, `get_index` and `get_index2`, to convert eight chars to a `__m256i`. The assembly for `get_index` is simpler than for `get_index2` (see https://godbolt.org/z/lhg9fX), yet in my code `get_index2` is significantly faster. Why is this? I am using ICC 18. Maybe there is a better solution than either of these two functions?
#include <x86intrin.h>
#include <inttypes.h>

// Broadcast the 8 index bytes to both 128-bit lanes via a 64-bit scalar
// load + set1, then zero-extend each byte to 32 bits with a byte shuffle
// (a 0x80 control byte writes a zero into that destination byte).
__m256i get_index(char *index) {
    int64_t x = *(int64_t *)&index[0];
    const __m256i t3 = _mm256_setr_epi8(
        0,0x80,0x80,0x80,
        1,0x80,0x80,0x80,
        2,0x80,0x80,0x80,
        3,0x80,0x80,0x80,
        4,0x80,0x80,0x80,
        5,0x80,0x80,0x80,
        6,0x80,0x80,0x80,
        7,0x80,0x80,0x80);
    __m256i t2 = _mm256_set1_epi64x(x);
    __m256i t4 = _mm256_shuffle_epi8(t2, t3);
    return t4;
}

// Same shuffle, but broadcast the 8 bytes with a 64-bit vector load + insert
// instead of going through a scalar register.
__m256i get_index2(char *index) {
    const __m256i t3 = _mm256_setr_epi8(
        0,0x80,0x80,0x80,
        1,0x80,0x80,0x80,
        2,0x80,0x80,0x80,
        3,0x80,0x80,0x80,
        4,0x80,0x80,0x80,
        5,0x80,0x80,0x80,
        6,0x80,0x80,0x80,
        7,0x80,0x80,0x80);
    __m128i t1 = _mm_loadl_epi64((__m128i*)index);
    __m256i t2 = _mm256_inserti128_si256(_mm256_castsi128_si256(t1), t1, 1);
    __m256i t4 = _mm256_shuffle_epi8(t2, t3);
    return t4;
}
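For reference, here is a minimal sketch of how either function's result would feed the gather; the `src` and `idx` names are illustrative, not part of my actual code:

```c
// Illustrative only: gather eight doubles from src at the positions named by
// the eight byte indices in idx. _mm512_i32gather_pd takes a __m256i of eight
// 32-bit indices and a scale of 8 (sizeof(double)) to form the byte offsets.
static inline __m512d gather8(const double *src, char *idx) {
    __m256i vindex = get_index(idx);             // eight zero-extended 32-bit indices
    return _mm512_i32gather_pd(vindex, src, 8);  // scale = sizeof(double)
}
```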
x86 avx2 xeon-phi avx512 knights-landing
asked Nov 24 '18 at 14:25 by Z boson
KNL has very slow 256-bit `vpshufb ymm` (12 uops, 23c latency, 12c throughput), and the 128-bit XMM version is slow, too. (MMX is fast :P). See Agner Fog's tables. Why can't you use `vpmovzxbd` or `bq` like a normal person? `__m512i _mm512_cvtepu8_epi32(__m128i a)` or `_mm256_cvtepu8_epi32`. Those are all single-uop with 2c throughput.
– Peter Cordes, Nov 24 '18 at 18:28
That doesn't explain your results, though. What loop did these functions inline into? Are you sure they didn't optimize differently somehow given different surrounding code? Otherwise IDK why a load + insert would be faster than a qword broadcast-load. Maybe some kind of front-end effect? Again, we'd need to see the whole loop to guess about the front-end.
– Peter Cordes, Nov 24 '18 at 18:34
@PeterCordes, thank you for pointing out `_mm256_cvtepu8_epi32`; that's exactly what I want, though the result is no faster than `get_index2` in my code. Maybe ICC converts `get_index2` to `vpmovzxbd` in my code anyway. I did not think of this because I'm a bit rusty with vectorization. But now I get about a 4x improvement with manual vectorization compared to ICC auto-vectorization (with `#pragma ivdep`). I'm vectorizing stencil code.
– Z boson, Nov 26 '18 at 12:10
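For completeness, a minimal sketch of the `vpmovzxbd` approach suggested in the comments (the name `get_index3` is only for illustration); whether the compiler folds the 8-byte load into the `vpmovzxbd` memory operand depends on the compiler:

```c
#include <x86intrin.h>

// Sketch: vpmovzxbd zero-extends eight bytes directly to eight 32-bit
// integers, avoiding the slow 256-bit vpshufb on KNL.
static inline __m256i get_index3(char *index) {
    __m128i bytes = _mm_loadl_epi64((__m128i*)index);  // load 8 bytes into the low qword
    return _mm256_cvtepu8_epi32(bytes);                // vpmovzxbd xmm -> ymm
}
```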