Does using SIMD load the main CPU registers?












-1














Let's imagine we have a software developer whose goal is to achieve the absolute maximum of a CPU's performance.
In today's CPUs we have many cores, we can load data into cache for faster processing, and we also have SIMD instructions (AVX, for example) that allow us to sum/multiply/do other operations on an array of items (multiply 8 integers per CPU clock). The disadvantage of these instructions is the cost of sending data and instructions to the SIMD module, plus the overhead of converting vector types to primitive types (sorry, I'm familiar only with C#'s Vector). (We're not looking at code complexity for now.)
As far as I understand, while we are using SIMD, the main registers of the CPU are used only for sending and receiving data to and from the SIMD registers, and the main ALU blocks used for general-purpose calculations are idle during this time.
And here is my question: will using SIMD instructions load the main CPU blocks? For example, if we have a huge number of different calculations (let's imagine 40% of them are best run on SIMD and 60% of them are better run as usual), will SIMD allow us to gain a performance boost in this way: 100% of all cores' performance + n% of SIMD's performance boost?

I'm asking this question because, for example, with GPGPU we can use the GPU for parallel calculations and the CPU is used in this case only for sending and receiving data, so it's idle all the time and we can use its performance for latency-sensitive tasks.
































  • That is the point of SIMD. But it is not just the processor that plays a role, you also have to consider the need to get the data and the result read and written fast enough. The memory bus quickly turns into a bottleneck. No different for GPU computation, other than that bus speed is easily more of a bottleneck.
    – Hans Passant
    Nov 20 at 14:54










  • CPU SIMD doesn't compete for integer registers: most modern ISAs (including x86) have separate architectural registers for SIMD, and SIMD loads/stores go directly to those registers without going through integer registers. But SIMD does compete for CPU core clock cycles, if that's what you're really trying to ask.
    – Peter Cordes
    Nov 20 at 17:04










  • @HansPassant, you probably didn't understand my question. In the case of mixed tasks (for example matrix multiplication, calculating factorials, rendering) we can have different kinds of loads. Some of them could be compute-bound, some cache-bound and some memory-bound; we can determine which using TopDown analysis. The question is: will using SIMD instructions load the main registers of the CPU or not, i.e. can we achieve more performance for other tasks?
    – Jack
    Nov 21 at 10:54












  • Nobody seems to understand the question; I don't see any helpful answers. I'm just a schmuck who tried to point out that it is of course designed to give you more perf. Very hard to guess why you'd assume it is not.
    – Hans Passant
    Nov 21 at 11:08










  • Sorry guys, looks like I can't explain my question clearly enough. The main idea is maximum CPU utilization when mixed types of calculations are in the queue. As far as I understand from Peter's comment, the performance improvement from using SIMD would not be huge, as SIMD competes for each clock (so we can't use something like ILP to run general instructions and SIMD at the same time in each CPU core clock). Thanks for your replies!
    – Jack
    Nov 27 at 13:37

























performance simd
















asked Nov 20 at 14:12









Jack

612






























1 Answer


















0














Looks like this is a question about out-of-order execution? Modern x64 CPUs have a number of execution ports, and each can start a new instruction per clock cycle (so about 8 CPU ops can run in parallel on an Intel Skylake). Some of those ports handle memory loads/stores, some handle integer arithmetic, and some handle the SIMD instructions.



So for example, you may be able to dispatch 2 AVX float multiplies, an AVX bitwise op, 2 AVX loads, a single AVX store, and a couple of bits of pointer arithmetic on the general-purpose registers in a single cycle [you will still have to wait for each operation to complete: the latency]. So in theory, as long as there aren't horrific dependency chains in the code, with some care you should be able to keep each of those ports busy (or at least, that's the basic aim!).



Simple Rule 1: The busier you can keep the execution ports, the faster your code goes. This should be self-evident. If you can keep 8 ports busy, you're doing 8 times more than if you can only keep 1 busy. In general though, it's mostly not worth worrying about (yes, there are always exceptions to the rule).
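To make the point about dependency chains concrete, here is a minimal sketch in plain C (the function names are made up for illustration and are not from the question). Both functions sum an array; the first forms one long dependency chain, the second splits the work over four independent accumulators so that the out-of-order core has independent adds it can overlap. Whether you actually see a speed-up depends on the CPU and the compiler flags, so treat it as an illustration rather than a benchmark.

```c
#include <stddef.h>

/* One accumulator: every add needs the result of the previous add,
   so the adds form a single serial dependency chain. */
float sum_one_chain(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

/* Four accumulators: the four adds in one iteration are independent
   of each other, so they can be in flight at the same time. */
float sum_four_chains(const float *x, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += x[i + 0];
        acc1 += x[i + 1];
        acc2 += x[i + 2];
        acc3 += x[i + 3];
    }
    for (; i < n; ++i)          /* leftover elements */
        acc0 += x[i];
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Note that floating-point addition is not associative, so the two versions can differ in the last bits for general inputs. That is also why a compiler will not normally make this transformation for you unless you relax the floating-point rules (e.g. with -ffast-math).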



Simple Rule 2: When the SIMD execution ports are in use, the ALU doesn't suddenly become idle [A slight terminology error on your part here: the ALU is simply the bit of the CPU that does arithmetic. The computation for general-purpose ops is done on an ALU, but it's also correct to call a SIMD unit an ALU. What you meant to ask is: do the general-purpose parts of the CPU power down when SIMD units are in use? To which the answer is no...]. Consider this AVX-optimised method (which does nothing interesting!):



#include <immintrin.h>
typedef __m256 float8;
#define mul8f _mm256_mul_ps

// a, b and c each point to `count` float8 values (pointers, so a[i] is a whole vector)
void computeThing(float8* a, float8* b, const float8* c, int count)
{
    for(int i = 0; i < count; ++i)
    {
        a[i] = mul8f(a[i], b[i]);
        b[i] = mul8f(b[i], c[i]);
    }
}


Since there are no dependencies between a, b, and c (which I should really make explicit by specifying __restrict), the two SIMD multiply instructions can both be dispatched in a single clock cycle (since there are two execution ports that can handle floating-point multiplies).
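As a rough sketch of what being explicit would look like, here is a scalar version of the same loop using the C99 restrict qualifier (__restrict is the compiler-specific spelling accepted by GCC, Clang and MSVC). The function name is hypothetical, and I've used plain float arrays instead of float8 so that it compiles without any AVX flags; the qualifier works the same way on float8 pointers.

```c
/* Scalar stand-in for the loop above. `restrict` promises the compiler
   that a, b and c never overlap, so the store to a[i] cannot change
   b[i] or c[i] and no reload is needed between the two multiplies. */
void compute_thing_restrict(float *restrict a, float *restrict b,
                            const float *restrict c, int count)
{
    for (int i = 0; i < count; ++i) {
        a[i] = a[i] * b[i];
        b[i] = b[i] * c[i];
    }
}
```

The promise is on you: if the arrays do overlap, the behaviour is undefined, so only add the qualifier when you really know the buffers are distinct.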



The general-purpose ALU doesn't suddenly power down here. The general-purpose registers and instructions are still being used:
1. to compute memory addresses (for a[i], b[i], c[i])
2. to load/store into those memory locations
3. to increment the loop counter
4. to test whether the count has been reached
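The four points above can be sketched by writing the bookkeeping out by hand. This assumes the GCC/Clang vector_size extension, with a 4-float type (float4, a made-up name) standing in for float8 so that it builds without AVX flags, and it assumes a, b and c don't overlap. The comments mark which statements are scalar integer work and which are SIMD work.

```c
/* Assumption: GCC/Clang vector extension; float4 is an illustrative name. */
typedef float float4 __attribute__((vector_size(16)));

void compute_thing_explicit(float4 *a, float4 *b, const float4 *c, int count)
{
    float4 *end = a + count;   /* scalar integer: address arithmetic      */
    while (a != end) {         /* scalar integer: compare and branch      */
        float4 va = *a;        /* loads: addresses held in GP registers   */
        float4 vb = *b;
        float4 vc = *c;
        *a = va * vb;          /* SIMD multiply, then a store             */
        *b = vb * vc;          /* SIMD multiply, then a store             */
        ++a; ++b; ++c;         /* scalar integer: pointer increments      */
    }
}
```

A compiler is free to turn the three pointer increments into a single index, but some scalar integer work per iteration always remains alongside the vector multiplies.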



It just so happens that we are also making use of the SIMD units to do a couple of multiplications...



Simple Rule 3: For floating-point operations, using 'float' or '__m256' makes next to no difference. Exactly the same CPU hardware is used to compute either float or float8 types. There are simply a couple of bits in the machine code encoding that specify the choice between float/__m128/__m256.



i.e. https://godbolt.org/z/xTcLrf
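A small sketch of the same idea, again assuming the GCC/Clang vector_size extension (mul_scalar, mul_vector and float4 are names I made up): the two functions below differ only in operand type. On x86-64 I would expect a compiler to emit a scalar multiply (mulss) for the first and a packed multiply (mulps) for the second, but check the generated assembly for your own compiler and flags rather than taking my word for it.

```c
/* Assumption: GCC/Clang vector extension; float4 roughly plays the role of __m128. */
typedef float float4 __attribute__((vector_size(16)));

/* One float multiplied by one float. */
float mul_scalar(float x, float y)
{
    return x * y;
}

/* Four floats multiplied element-wise by four floats. */
float4 mul_vector(float4 x, float4 y)
{
    return x * y;
}
```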



























  • Front-end throughput limits you to a max of 4 fused-domain uops per clock. (With micro-fused loads as memory operands for ALU instructions, you can sustain 6 or 7 unfused-domain uops per clock, e.g. like this microbenchmark on Skylake agner.org/optimize/blog/read.php?i=415#857). Also, the integer ALU execution units are on the same ports as the SIMD ALU execution units. (Except that port 6 has non-SIMD only.) So even for burst throughput when catching up after an input becomes available, you can have 1 scalar-integer ALU operation in the same cycle as 3 AVX uops.
    – Peter Cordes
    Nov 27 at 8:27










  • Thanks a lot! That's exactly what I asked for! And even more :) @PeterCordes, the limitation that you describe: does it happen because of hyperthreading? So we have 2 logical processors per core, and in the end only 1 SIMD/non-SIMD ALU can be used per clock?
    – Jack
    Nov 27 at 14:09










  • @Jack: no, back-end execution port resources are competitively shared with hyperthreading. If one logical thread spends a lot of time in branchy or cache-miss-heavy code, the other thread won't see much competition for its uops to execute and can come close to the same throughput as if it were running alone. (But with a smaller out-of-order window to find ILP, because the ROB is partitioned and the scheduler is competitively shared.) If both threads have high uop throughput, then yes, on average the front-end bottleneck is 2 fused-domain uops per thread, and they can fill gaps to avoid idle ALUs.
    – Peter Cordes
    Nov 27 at 20:24




































1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









0














Looks like this is a question about Out-Of-Order-Execution? Modern x64 have a number of execution ports on the CPU, and each can dispatch a new instruction per clock cycle (so about 8 CPU ops can run in parallel on an Intel SkyLake). Some of those ports handle memory loads/stores, some handle integer arithmetic, and some handle the SIMD instructions.



So for example, you may be able to displatch 2 AVX float mults, an AVX bitwise op, 2 AVX loads, a single AVX store, and a couple of bits of pointer arithmetic on the general purpose registers in a single cycle [you will have to wait for the operation to complete - the latency]. So in theory, as long as there aren't horrific dependency chains in the code, with some care you should able to keep each of those ports busy (or at least, that's the basic aim!).



Simple Rule 1: The busier you can keep the execution ports, the faster your code goes. This should be self evident. If you can keep 8 ports busy, you're doing 8 times more than if you can only keep 1 busy. In general though, it's mostly not worth worrying about (yes, there are always exceptions to the rule)



Simple Rule 2: When the SIMD execution ports are in use, the ALU doesn't suddenly become idle [A slight terminology error on your part here: The ALU is simply the bit of the CPU that does arithmetic. The computation for general purpose ops is done on an ALU, but it's also correct to call a SIMD unit an ALU. What you meant to ask is: do the general purpose parts of the CPU power down when SIMD units are in use? To which the answer is no... ]. Consider this AVX2 optimised method (which does nothing interesting!)



#include <immintrin.h>
typedef __m256 float8;
#define mul8f _mm256_mul_ps

void computeThing(float8 a, float8 b, float8 c, int count)
{
for(int i = 0; i < count; ++i)
{
a[i] = mul8f(a[i], b[i]);
b[i] = mul8f(b[i], c[i]);
}
}


Since there are no dependencies between a, b, and c (which I should really be explicit about by specifying __restrict), then the two SIMD multiply instructions can both be dispatched in a single clock cycle (since there are two execution ports that can handle floating point multiply).



The General Purpose ALU doesn't suddenly power down here - The general purpose registers & instructions are still being used!
1. to compute memory addresses (for: a[i], b[i], c[i], d[i])
2. to load/store into those memory locations
3. to increment the loop counter
4. to test if the count has been reached?



It just so happens that we are also making use of the SIMD units to do a couple of multiplications...



Simple Rule 3: For floating point operations, using 'float' or '__m256' makes next to no difference. The same CPU hardware used to compute either float or float8 types is exactly the same. There are simply a couple of bits in the machine code encoding that specifies the choice between float/__m128/__m256.



i.e. https://godbolt.org/z/xTcLrf






share|improve this answer





















  • Front-end throughput limits you to a max of 4 fused-domain uops per clock. (With micro-fused loads as memory operands for ALU instructions, you can sustain 6 or 7 unfused-domain uops per clock, e.g. like this microbenchmark on Skylake agner.org/optimize/blog/read.php?i=415#857). Also, the integer ALU execution units are on the same ports as the SIMD ALU execution units. (Except that port 6 has non-SIMD only.) So even for burst throughput when catching up after an input becomes available, you can have 1 scalar-integer ALU operation in the same cycle as 3 AVX uops.
    – Peter Cordes
    Nov 27 at 8:27










  • Thanks a lot! That's exactly what I ask for! And even more :) @PeterCordes, limitation that you describe - it's happen because of hyperthreading? So we have 2 logical processors per 1 core and at the end only 1 SIMDnon-SIMD ALU can be used at one clock?
    – Jack
    Nov 27 at 14:09










  • @Jack: no, back-end execution port resources are competitively shared with hyperthreading. If one logical thread spends a lot of time in branchy or cache-miss heavy code, the other thread won't see much competition for its uops to execute and can come close to the same throughput as if it was running alone. (But with a smaller out-of-order window to find ILP, because the ROB is partitioned and the scheduler is competitively shared). If both threads are high uop throughput, then on average yeah 2 fused-domain uops per thread front-end bottleneck, and they can fill gaps to avoid idle ALUs.
    – Peter Cordes
    Nov 27 at 20:24
















0














Looks like this is a question about Out-Of-Order-Execution? Modern x64 have a number of execution ports on the CPU, and each can dispatch a new instruction per clock cycle (so about 8 CPU ops can run in parallel on an Intel SkyLake). Some of those ports handle memory loads/stores, some handle integer arithmetic, and some handle the SIMD instructions.



So for example, you may be able to displatch 2 AVX float mults, an AVX bitwise op, 2 AVX loads, a single AVX store, and a couple of bits of pointer arithmetic on the general purpose registers in a single cycle [you will have to wait for the operation to complete - the latency]. So in theory, as long as there aren't horrific dependency chains in the code, with some care you should able to keep each of those ports busy (or at least, that's the basic aim!).



Simple Rule 1: The busier you can keep the execution ports, the faster your code goes. This should be self evident. If you can keep 8 ports busy, you're doing 8 times more than if you can only keep 1 busy. In general though, it's mostly not worth worrying about (yes, there are always exceptions to the rule)



Simple Rule 2: When the SIMD execution ports are in use, the ALU doesn't suddenly become idle [A slight terminology error on your part here: The ALU is simply the bit of the CPU that does arithmetic. The computation for general purpose ops is done on an ALU, but it's also correct to call a SIMD unit an ALU. What you meant to ask is: do the general purpose parts of the CPU power down when SIMD units are in use? To which the answer is no... ]. Consider this AVX2 optimised method (which does nothing interesting!)



#include <immintrin.h>
typedef __m256 float8;
#define mul8f _mm256_mul_ps

void computeThing(float8 a, float8 b, float8 c, int count)
{
for(int i = 0; i < count; ++i)
{
a[i] = mul8f(a[i], b[i]);
b[i] = mul8f(b[i], c[i]);
}
}


Since there are no dependencies between a, b, and c (which I should really be explicit about by specifying __restrict), then the two SIMD multiply instructions can both be dispatched in a single clock cycle (since there are two execution ports that can handle floating point multiply).



The General Purpose ALU doesn't suddenly power down here - The general purpose registers & instructions are still being used!
1. to compute memory addresses (for: a[i], b[i], c[i], d[i])
2. to load/store into those memory locations
3. to increment the loop counter
4. to test if the count has been reached?



It just so happens that we are also making use of the SIMD units to do a couple of multiplications...



Simple Rule 3: For floating point operations, using 'float' or '__m256' makes next to no difference. The same CPU hardware used to compute either float or float8 types is exactly the same. There are simply a couple of bits in the machine code encoding that specifies the choice between float/__m128/__m256.



i.e. https://godbolt.org/z/xTcLrf






share|improve this answer





















  • Front-end throughput limits you to a max of 4 fused-domain uops per clock. (With micro-fused loads as memory operands for ALU instructions, you can sustain 6 or 7 unfused-domain uops per clock, e.g. like this microbenchmark on Skylake agner.org/optimize/blog/read.php?i=415#857). Also, the integer ALU execution units are on the same ports as the SIMD ALU execution units. (Except that port 6 has non-SIMD only.) So even for burst throughput when catching up after an input becomes available, you can have 1 scalar-integer ALU operation in the same cycle as 3 AVX uops.
    – Peter Cordes
    Nov 27 at 8:27










  • Thanks a lot! That's exactly what I asked for! And even more :) @PeterCordes, the limitation that you describe - does it happen because of hyperthreading? So we have 2 logical processors per 1 core, and in the end only 1 SIMD/non-SIMD ALU can be used per clock?
    – Jack
    Nov 27 at 14:09










  • @Jack: no, back-end execution port resources are competitively shared with hyperthreading. If one logical thread spends a lot of time in branchy or cache-miss heavy code, the other thread won't see much competition for its uops to execute and can come close to the same throughput as if it was running alone. (But with a smaller out-of-order window to find ILP, because the ROB is partitioned and the scheduler is competitively shared). If both threads are high uop throughput, then on average yeah 2 fused-domain uops per thread front-end bottleneck, and they can fill gaps to avoid idle ALUs.
    – Peter Cordes
    Nov 27 at 20:24














answered Nov 27 at 8:16









robthebloke
