Compiler alias assumptions, Raw ptrs, unique ptrs, and __restrict - Strange results












0















After looking at the asm produced, adding a third function using vector<int>, and timing them when the ptrs were the same or differing values, all 3 functions work pretty much optimally without using __restrict. See the answer I added which goes into this including the fact that unique_ptr and vector versions produced identical code.



Question Is there some way to use __restrict or some other technique to get rid of slow execution and allow normal use of multiple unique_ptrs instead of having to use the get() method to send a raw pointer. Shouldn't compilers assume that unique_ptrs don't alias since you can't have partial overlaps and full overlaps are obvious? Does this vary with other compilers?



I was exploring whether unique_ptr might be better optimized in certain situations where functions were passed raw pointers. The MSVC compiler at max optimization still assumes that a function called with two 1unique_ptrs1 to arrays of the same type may alias. But I thought that two unique ptrs would offer better optimization since it's not possible for two unique ptrs that don't have the same address to have overlapping arrays. So not only would unique ptrs be as fast as raw ptrs but possibly faster.



The test functions take 2 "pointers" and a length of array being pointed to. The functions are forcibly instantiated and called through a function pointer because the compiler does recognize aliasing is occurring when it in-lines and optimizes.



These are the two functions:



#define TYPEMOD// __restrict // Uncomment to add __restrict
void f1(int * TYPEMOD p1, int * TYPEMOD p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i-1] + p1[i+1];
}

void f2(std::unique_ptr<int>& TYPEMOD p1, std::unique_ptr<int>& TYPEMOD p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i-1] + p1[i+1];
}


For reference, when the data is 0246 the compiler is assuming the two ptrs do not alias (overlapping arrays). When the data is 0259 the compiler is assuming aliasing and so will reread previous elements if fears may have changed.



Here's the results:



Raw pointer data 0259  time 0.190027
Unique_ptr data 0259 time 0.198208


Both functions are assuming aliasing with this compiler and so not optimizing and the unique ptr function is very slightly slower.



So then I took a look at MSVC's __restrict C++ extension. thinking that would help. Applied to both raw ptrs and unique ptrs here were the results:



Raw pointer data 0246  time 0.0594369
Unique_ptr data 0259 time 0.192284


OK, unique_ptr is slower under all conditions though quite close to raw pointers without using __restrict. And when the __restrict modifier is used the raw pointer function version kicks into gear. The unique_ptr function ignores the __restrict. The gears may grind if the pointers alias but very little (or none) of my production code does that.



Conclusion: Looks like I'm going to review some critical parts of my code for functions with multiple pointers, raw and unique. That difference is way to big to ignore. Looks like using the unique ptrs get() method together with using __restrict raw pointers in the called functions is quite effective.



Version VC++ 15.9.2, Compiler options:
/permissive- /GS /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"x64Releasevc141.pdb" /Zc:inline /fp:precise /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /Gd /Oi /MD /std:c++17 /FC /Fa"x64Release" /EHsc /nologo /Fo"x64Release" /diagnostics:classic



// Full Code
#include <iostream>
#include <memory>
#include <chrono>

class Timer {
using clock = std::chrono::system_clock;
double cumulative_time{};
double interval_time{};
clock::time_point snapshot_time{ clock::now() }, tmp;
public:
void start() { snapshot_time = clock::now(); }
void reset() { cumulative_time = 0; start(); }
double get_interval_time() {
cumulative_time += (interval_time = std::chrono::duration<double>((tmp = clock::now()) - snapshot_time).count());
snapshot_time = tmp;
return interval_time;
}
double get_cumulative_time() {
cumulative_time += std::chrono::duration<double>((tmp = clock::now()) - snapshot_time).count();
snapshot_time = tmp;
return cumulative_time;
}
};

template<typename T>
void fill(T &v, int len) {
int i = 0;
for (int i = 0; i < len; i++)
v[i] = i;
}


using namespace std;

#define TYPEMOD //__restrict // Uncomment to add __restrict
void f1(int * TYPEMOD p1, int * TYPEMOD p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f2(std::unique_ptr<int>& TYPEMOD p1, std::unique_ptr<int>& TYPEMOD p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}


auto xf1 = f1; // avoid inlining
auto xf2 = f2;


int main() {
const int N = 100'000'000;
auto pa = new int[N]; fill(pa, N);
auto ptra = std::make_unique<int>(N); fill(ptra, N);

Timer timer;
xf1(pa, pa, N);
auto snap1 = timer.get_interval_time();
xf2(ptra, ptra, N);
auto snap2 = timer.get_interval_time();
std::cout << "Raw pointer data " << pa[0] << pa[1] << pa[2] << pa[3] << " time " << snap1 << "n";
std::cout << "Unique_ptr data " << ptra[0] << ptra[1] << ptra[2] << ptra[3] << " time " << snap2 << "n";
std::cout << "n";
}









share|improve this question




















  • 2





    std::unique_ptr<int>& __restrict p1 is just useless, it would need to be std::unique_ptr<int * __restrict>& p1. Try int * __restrict p_p1 = p1.get() and do computations with that.

    – Kamil Cuk
    Nov 25 '18 at 8:01













  • Turns out MSVC puts code in the functions without __restrict that detects the two raw array ptrs are the same and, if so assumes aliasing and computes the correct answer by re-reading aliased locations. With __restrict, it bypasses the ptr equality check and goes directly a code segment that assumes no aliasing. Smart compiler. Same with a vector<int> implementation and the unique_ptr implementation. So it's effectively highly optimized in all these array versions. Those segments of code are the same for all three with and without __restrict. Amazing.

    – doug
    Nov 25 '18 at 17:08


















0















After looking at the asm produced, adding a third function using vector<int>, and timing them when the ptrs were the same or differing values, all 3 functions work pretty much optimally without using __restrict. See the answer I added which goes into this including the fact that unique_ptr and vector versions produced identical code.



Question Is there some way to use __restrict or some other technique to get rid of slow execution and allow normal use of multiple unique_ptrs instead of having to use the get() method to send a raw pointer. Shouldn't compilers assume that unique_ptrs don't alias since you can't have partial overlaps and full overlaps are obvious? Does this vary with other compilers?



I was exploring whether unique_ptr might be better optimized in certain situations where functions were passed raw pointers. The MSVC compiler at max optimization still assumes that a function called with two 1unique_ptrs1 to arrays of the same type may alias. But I thought that two unique ptrs would offer better optimization since it's not possible for two unique ptrs that don't have the same address to have overlapping arrays. So not only would unique ptrs be as fast as raw ptrs but possibly faster.



The test functions take 2 "pointers" and a length of array being pointed to. The functions are forcibly instantiated and called through a function pointer because the compiler does recognize aliasing is occurring when it in-lines and optimizes.



These are the two functions:



#define TYPEMOD// __restrict // Uncomment to add __restrict
void f1(int * TYPEMOD p1, int * TYPEMOD p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i-1] + p1[i+1];
}

void f2(std::unique_ptr<int>& TYPEMOD p1, std::unique_ptr<int>& TYPEMOD p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i-1] + p1[i+1];
}


For reference, when the data is 0246 the compiler is assuming the two ptrs do not alias (overlapping arrays). When the data is 0259 the compiler is assuming aliasing and so will reread previous elements if fears may have changed.



Here's the results:



Raw pointer data 0259  time 0.190027
Unique_ptr data 0259 time 0.198208


Both functions are assuming aliasing with this compiler and so not optimizing and the unique ptr function is very slightly slower.



So then I took a look at MSVC's __restrict C++ extension. thinking that would help. Applied to both raw ptrs and unique ptrs here were the results:



Raw pointer data 0246  time 0.0594369
Unique_ptr data 0259 time 0.192284


OK, unique_ptr is slower under all conditions though quite close to raw pointers without using __restrict. And when the __restrict modifier is used the raw pointer function version kicks into gear. The unique_ptr function ignores the __restrict. The gears may grind if the pointers alias but very little (or none) of my production code does that.



Conclusion: Looks like I'm going to review some critical parts of my code for functions with multiple pointers, raw and unique. That difference is way to big to ignore. Looks like using the unique ptrs get() method together with using __restrict raw pointers in the called functions is quite effective.



Version VC++ 15.9.2, Compiler options:
/permissive- /GS /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"x64Releasevc141.pdb" /Zc:inline /fp:precise /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /Gd /Oi /MD /std:c++17 /FC /Fa"x64Release" /EHsc /nologo /Fo"x64Release" /diagnostics:classic



// Full Code
#include <iostream>
#include <memory>
#include <chrono>

class Timer {
using clock = std::chrono::system_clock;
double cumulative_time{};
double interval_time{};
clock::time_point snapshot_time{ clock::now() }, tmp;
public:
void start() { snapshot_time = clock::now(); }
void reset() { cumulative_time = 0; start(); }
double get_interval_time() {
cumulative_time += (interval_time = std::chrono::duration<double>((tmp = clock::now()) - snapshot_time).count());
snapshot_time = tmp;
return interval_time;
}
double get_cumulative_time() {
cumulative_time += std::chrono::duration<double>((tmp = clock::now()) - snapshot_time).count();
snapshot_time = tmp;
return cumulative_time;
}
};

template<typename T>
void fill(T &v, int len) {
int i = 0;
for (int i = 0; i < len; i++)
v[i] = i;
}


using namespace std;

#define TYPEMOD //__restrict // Uncomment to add __restrict
void f1(int * TYPEMOD p1, int * TYPEMOD p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f2(std::unique_ptr<int>& TYPEMOD p1, std::unique_ptr<int>& TYPEMOD p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}


auto xf1 = f1; // avoid inlining
auto xf2 = f2;


int main() {
const int N = 100'000'000;
auto pa = new int[N]; fill(pa, N);
auto ptra = std::make_unique<int>(N); fill(ptra, N);

Timer timer;
xf1(pa, pa, N);
auto snap1 = timer.get_interval_time();
xf2(ptra, ptra, N);
auto snap2 = timer.get_interval_time();
std::cout << "Raw pointer data " << pa[0] << pa[1] << pa[2] << pa[3] << " time " << snap1 << "n";
std::cout << "Unique_ptr data " << ptra[0] << ptra[1] << ptra[2] << ptra[3] << " time " << snap2 << "n";
std::cout << "n";
}









share|improve this question




















  • 2





    std::unique_ptr<int>& __restrict p1 is just useless, it would need to be std::unique_ptr<int * __restrict>& p1. Try int * __restrict p_p1 = p1.get() and do computations with that.

    – Kamil Cuk
    Nov 25 '18 at 8:01













  • Turns out MSVC puts code in the functions without __restrict that detects the two raw array ptrs are the same and, if so assumes aliasing and computes the correct answer by re-reading aliased locations. With __restrict, it bypasses the ptr equality check and goes directly a code segment that assumes no aliasing. Smart compiler. Same with a vector<int> implementation and the unique_ptr implementation. So it's effectively highly optimized in all these array versions. Those segments of code are the same for all three with and without __restrict. Amazing.

    – doug
    Nov 25 '18 at 17:08
















0












0








0








After looking at the asm produced, adding a third function using vector<int>, and timing them when the ptrs were the same or differing values, all 3 functions work pretty much optimally without using __restrict. See the answer I added which goes into this including the fact that unique_ptr and vector versions produced identical code.



Question Is there some way to use __restrict or some other technique to get rid of slow execution and allow normal use of multiple unique_ptrs instead of having to use the get() method to send a raw pointer. Shouldn't compilers assume that unique_ptrs don't alias since you can't have partial overlaps and full overlaps are obvious? Does this vary with other compilers?



I was exploring whether unique_ptr might be better optimized in certain situations where functions were passed raw pointers. The MSVC compiler at max optimization still assumes that a function called with two 1unique_ptrs1 to arrays of the same type may alias. But I thought that two unique ptrs would offer better optimization since it's not possible for two unique ptrs that don't have the same address to have overlapping arrays. So not only would unique ptrs be as fast as raw ptrs but possibly faster.



The test functions take 2 "pointers" and a length of array being pointed to. The functions are forcibly instantiated and called through a function pointer because the compiler does recognize aliasing is occurring when it in-lines and optimizes.



These are the two functions:



#define TYPEMOD// __restrict // Uncomment to add __restrict
void f1(int * TYPEMOD p1, int * TYPEMOD p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i-1] + p1[i+1];
}

void f2(std::unique_ptr<int>& TYPEMOD p1, std::unique_ptr<int>& TYPEMOD p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i-1] + p1[i+1];
}


For reference, when the data is 0246 the compiler is assuming the two ptrs do not alias (overlapping arrays). When the data is 0259 the compiler is assuming aliasing and so will reread previous elements if fears may have changed.



Here's the results:



Raw pointer data 0259  time 0.190027
Unique_ptr data 0259 time 0.198208


Both functions are assuming aliasing with this compiler and so not optimizing and the unique ptr function is very slightly slower.



So then I took a look at MSVC's __restrict C++ extension. thinking that would help. Applied to both raw ptrs and unique ptrs here were the results:



Raw pointer data 0246  time 0.0594369
Unique_ptr data 0259 time 0.192284


OK, unique_ptr is slower under all conditions though quite close to raw pointers without using __restrict. And when the __restrict modifier is used the raw pointer function version kicks into gear. The unique_ptr function ignores the __restrict. The gears may grind if the pointers alias but very little (or none) of my production code does that.



Conclusion: Looks like I'm going to review some critical parts of my code for functions with multiple pointers, raw and unique. That difference is way to big to ignore. Looks like using the unique ptrs get() method together with using __restrict raw pointers in the called functions is quite effective.



Version VC++ 15.9.2, Compiler options:
/permissive- /GS /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"x64Releasevc141.pdb" /Zc:inline /fp:precise /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /Gd /Oi /MD /std:c++17 /FC /Fa"x64Release" /EHsc /nologo /Fo"x64Release" /diagnostics:classic



// Full Code
#include <iostream>
#include <memory>
#include <chrono>

class Timer {
using clock = std::chrono::system_clock;
double cumulative_time{};
double interval_time{};
clock::time_point snapshot_time{ clock::now() }, tmp;
public:
void start() { snapshot_time = clock::now(); }
void reset() { cumulative_time = 0; start(); }
double get_interval_time() {
cumulative_time += (interval_time = std::chrono::duration<double>((tmp = clock::now()) - snapshot_time).count());
snapshot_time = tmp;
return interval_time;
}
double get_cumulative_time() {
cumulative_time += std::chrono::duration<double>((tmp = clock::now()) - snapshot_time).count();
snapshot_time = tmp;
return cumulative_time;
}
};

template<typename T>
void fill(T &v, int len) {
int i = 0;
for (int i = 0; i < len; i++)
v[i] = i;
}


using namespace std;

#define TYPEMOD //__restrict // Uncomment to add __restrict
void f1(int * TYPEMOD p1, int * TYPEMOD p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f2(std::unique_ptr<int>& TYPEMOD p1, std::unique_ptr<int>& TYPEMOD p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}


auto xf1 = f1; // avoid inlining
auto xf2 = f2;


int main() {
const int N = 100'000'000;
auto pa = new int[N]; fill(pa, N);
auto ptra = std::make_unique<int>(N); fill(ptra, N);

Timer timer;
xf1(pa, pa, N);
auto snap1 = timer.get_interval_time();
xf2(ptra, ptra, N);
auto snap2 = timer.get_interval_time();
std::cout << "Raw pointer data " << pa[0] << pa[1] << pa[2] << pa[3] << " time " << snap1 << "n";
std::cout << "Unique_ptr data " << ptra[0] << ptra[1] << ptra[2] << ptra[3] << " time " << snap2 << "n";
std::cout << "n";
}









share|improve this question
















After looking at the asm produced, adding a third function using vector<int>, and timing them when the ptrs were the same or differing values, all 3 functions work pretty much optimally without using __restrict. See the answer I added which goes into this including the fact that unique_ptr and vector versions produced identical code.



Question Is there some way to use __restrict or some other technique to get rid of slow execution and allow normal use of multiple unique_ptrs instead of having to use the get() method to send a raw pointer. Shouldn't compilers assume that unique_ptrs don't alias since you can't have partial overlaps and full overlaps are obvious? Does this vary with other compilers?



I was exploring whether unique_ptr might be better optimized in certain situations where functions were passed raw pointers. The MSVC compiler at max optimization still assumes that a function called with two 1unique_ptrs1 to arrays of the same type may alias. But I thought that two unique ptrs would offer better optimization since it's not possible for two unique ptrs that don't have the same address to have overlapping arrays. So not only would unique ptrs be as fast as raw ptrs but possibly faster.



The test functions take 2 "pointers" and a length of array being pointed to. The functions are forcibly instantiated and called through a function pointer because the compiler does recognize aliasing is occurring when it in-lines and optimizes.



These are the two functions:



#define TYPEMOD// __restrict // Uncomment to add __restrict
void f1(int * TYPEMOD p1, int * TYPEMOD p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i-1] + p1[i+1];
}

void f2(std::unique_ptr<int>& TYPEMOD p1, std::unique_ptr<int>& TYPEMOD p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i-1] + p1[i+1];
}


For reference, when the data is 0246 the compiler is assuming the two ptrs do not alias (overlapping arrays). When the data is 0259 the compiler is assuming aliasing and so will reread previous elements if fears may have changed.



Here's the results:



Raw pointer data 0259  time 0.190027
Unique_ptr data 0259 time 0.198208


Both functions are assuming aliasing with this compiler and so not optimizing and the unique ptr function is very slightly slower.



So then I took a look at MSVC's __restrict C++ extension. thinking that would help. Applied to both raw ptrs and unique ptrs here were the results:



Raw pointer data 0246  time 0.0594369
Unique_ptr data 0259 time 0.192284


OK, unique_ptr is slower under all conditions though quite close to raw pointers without using __restrict. And when the __restrict modifier is used the raw pointer function version kicks into gear. The unique_ptr function ignores the __restrict. The gears may grind if the pointers alias but very little (or none) of my production code does that.



Conclusion: Looks like I'm going to review some critical parts of my code for functions with multiple pointers, raw and unique. That difference is way to big to ignore. Looks like using the unique ptrs get() method together with using __restrict raw pointers in the called functions is quite effective.



Version VC++ 15.9.2, Compiler options:
/permissive- /GS /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"x64Releasevc141.pdb" /Zc:inline /fp:precise /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /Gd /Oi /MD /std:c++17 /FC /Fa"x64Release" /EHsc /nologo /Fo"x64Release" /diagnostics:classic



// Full Code
#include <iostream>
#include <memory>
#include <chrono>

class Timer {
using clock = std::chrono::system_clock;
double cumulative_time{};
double interval_time{};
clock::time_point snapshot_time{ clock::now() }, tmp;
public:
void start() { snapshot_time = clock::now(); }
void reset() { cumulative_time = 0; start(); }
double get_interval_time() {
cumulative_time += (interval_time = std::chrono::duration<double>((tmp = clock::now()) - snapshot_time).count());
snapshot_time = tmp;
return interval_time;
}
double get_cumulative_time() {
cumulative_time += std::chrono::duration<double>((tmp = clock::now()) - snapshot_time).count();
snapshot_time = tmp;
return cumulative_time;
}
};

template<typename T>
void fill(T &v, int len) {
int i = 0;
for (int i = 0; i < len; i++)
v[i] = i;
}


using namespace std;

#define TYPEMOD //__restrict // Uncomment to add __restrict
void f1(int * TYPEMOD p1, int * TYPEMOD p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f2(std::unique_ptr<int>& TYPEMOD p1, std::unique_ptr<int>& TYPEMOD p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}


auto xf1 = f1; // avoid inlining
auto xf2 = f2;


int main() {
const int N = 100'000'000;
auto pa = new int[N]; fill(pa, N);
auto ptra = std::make_unique<int>(N); fill(ptra, N);

Timer timer;
xf1(pa, pa, N);
auto snap1 = timer.get_interval_time();
xf2(ptra, ptra, N);
auto snap2 = timer.get_interval_time();
std::cout << "Raw pointer data " << pa[0] << pa[1] << pa[2] << pa[3] << " time " << snap1 << "n";
std::cout << "Unique_ptr data " << ptra[0] << ptra[1] << ptra[2] << ptra[3] << " time " << snap2 << "n";
std::cout << "n";
}






c++






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 25 '18 at 21:52







doug

















asked Nov 25 '18 at 6:32









dougdoug

7871410




7871410








  • 2





    std::unique_ptr<int>& __restrict p1 is just useless, it would need to be std::unique_ptr<int * __restrict>& p1. Try int * __restrict p_p1 = p1.get() and do computations with that.

    – Kamil Cuk
    Nov 25 '18 at 8:01













  • Turns out MSVC puts code in the functions without __restrict that detects the two raw array ptrs are the same and, if so assumes aliasing and computes the correct answer by re-reading aliased locations. With __restrict, it bypasses the ptr equality check and goes directly a code segment that assumes no aliasing. Smart compiler. Same with a vector<int> implementation and the unique_ptr implementation. So it's effectively highly optimized in all these array versions. Those segments of code are the same for all three with and without __restrict. Amazing.

    – doug
    Nov 25 '18 at 17:08
















  • 2





    std::unique_ptr<int>& __restrict p1 is just useless, it would need to be std::unique_ptr<int * __restrict>& p1. Try int * __restrict p_p1 = p1.get() and do computations with that.

    – Kamil Cuk
    Nov 25 '18 at 8:01













  • Turns out MSVC puts code in the functions without __restrict that detects the two raw array ptrs are the same and, if so assumes aliasing and computes the correct answer by re-reading aliased locations. With __restrict, it bypasses the ptr equality check and goes directly a code segment that assumes no aliasing. Smart compiler. Same with a vector<int> implementation and the unique_ptr implementation. So it's effectively highly optimized in all these array versions. Those segments of code are the same for all three with and without __restrict. Amazing.

    – doug
    Nov 25 '18 at 17:08










2




2





std::unique_ptr<int>& __restrict p1 is just useless, it would need to be std::unique_ptr<int * __restrict>& p1. Try int * __restrict p_p1 = p1.get() and do computations with that.

– Kamil Cuk
Nov 25 '18 at 8:01







std::unique_ptr<int>& __restrict p1 is just useless, it would need to be std::unique_ptr<int * __restrict>& p1. Try int * __restrict p_p1 = p1.get() and do computations with that.

– Kamil Cuk
Nov 25 '18 at 8:01















Turns out MSVC puts code in the functions without __restrict that detects the two raw array ptrs are the same and, if so assumes aliasing and computes the correct answer by re-reading aliased locations. With __restrict, it bypasses the ptr equality check and goes directly a code segment that assumes no aliasing. Smart compiler. Same with a vector<int> implementation and the unique_ptr implementation. So it's effectively highly optimized in all these array versions. Those segments of code are the same for all three with and without __restrict. Amazing.

– doug
Nov 25 '18 at 17:08







Turns out MSVC puts code in the functions without __restrict that detects the two raw array ptrs are the same and, if so assumes aliasing and computes the correct answer by re-reading aliased locations. With __restrict, it bypasses the ptr equality check and goes directly a code segment that assumes no aliasing. Smart compiler. Same with a vector<int> implementation and the unique_ptr implementation. So it's effectively highly optimized in all these array versions. Those segments of code are the same for all three with and without __restrict. Amazing.

– doug
Nov 25 '18 at 17:08














2 Answers
2






active

oldest

votes


















0














I've delved into how MSVC optimizes with and without aliasing and have included a vector<int> version of the test functions named f3(). The set is now:



void f1(int *p1, int *p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f2(std::unique_ptr<int>& p1, std::unique_ptr<int>& p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f3(vector<int>& p1, vector<int>& p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

auto xf1 = f1; // avoid inlining
auto xf2 = f2;
auto xf3 = f3;


As before, xf1, xf2, and xf3 forces instantiation of the functions and provides pointers to call them as the compiler wants to inline them.



Turns out the unique_ptr and vector versions (f2() and f3()) produce code that tests whether p1 and p2 point to the same memory and, if so, aliasing is assumed producing the slow, but correct, code.



What is interesting is that unique_ptr<int> and vector<int> produce identical code. When COMDAT code folding is enabled in the linker optimizer the duplicate is deleted and the function pointer xf3 set to the same address as xf2 which can be seen in debugging. So when f3 is called it's actually the f2 code that is executed.



When these functions are executed they first test whether p1 and p2 are the same. If so they assume aliasing and generate correct code otherwise they assume no aliasing and generate faster code. If __restrict is used in f1() the code does not first test if p1 and p2 are the same and goes straight into the code that assumes no aliasing.



Summarizing, __restrict didn't really speed the raw ptr function up except in the case where p1 and p2 pointed to the same address. When p1 and p2 differed, it was just as fast as the unique_ptr and vector versions.



The compiler produces fast code even when the raw pointer function was called at the cost of initial tests of pointer equivalence slowing down by assuming aliasing when the pointers are equal.






share|improve this answer


























  • they first test whether p1 and p2 are the same no. You should test if the range <p1, p1 + count) is disjoint with <p2, p2 + count). Still std::unique_ptr<int>& TYPEMOD p1 is doing nothing, it maks std::addressof(p1) as __restrict. You want to mark the underlying pointer as restrict.

    – Kamil Cuk
    Nov 25 '18 at 22:00











  • @KamilCuk Yes, I haven't done that with f1() and am now curious too. UB would be expected? It doesn't apply to f2() and f3() since neither of those can overlap other than 100% overlap. I'll edit the answer to remove TYPEMOD I left it in by accident since it is now clear that __restrict is irrelevant to f2.

    – doug
    Nov 25 '18 at 22:14





















0















Is there some way to use __restrict or some other technique to get rid of slow execution




Yes. Just cast the pointers into being restrict, thus giving the compiler information that they are restricted with each other.



#include <memory>
#include <vector>

#if defined(__cplusplus)
#if defined(_MSC_VER)
#define restrict __restrict
#elif defined(__GNUC__)
#define restrict __restrict__
#endif
#endif

void f1(int * restrict p1, int * restrict p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f2(std::unique_ptr<int>& pp1, std::unique_ptr<int>& pp2, int count)
{
int * const restrict p1 = pp1.get();
int * const restrict p2 = pp2.get();
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f3(std::vector<int>& pp1, std::vector<int>& pp2, int count)
{
int * const restrict p1 = &pp1[0];
int * const restrict p2 = &pp2[0];
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}


But code duplication is the worst:



void f1(int * restrict p1, int * restrict p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f2(std::unique_ptr<int>& pp1, std::unique_ptr<int>& pp2, int count)
{
f1(pp1.get(), pp2.get(), count);
}

void f3(std::vector<int>& pp1, std::vector<int>& pp2, int count)
{
f1(&pp1[0], &pp2[0], count);
}



Shouldn't compilers assume that unique_ptrs don't alias since you can't have partial overlaps and full overlaps are obvious?




No. As shown, we can get the pointer using std::unique_ptr::get() function. So doing:



std::unique_ptr p1;
int *a = p1.get();
int *b = p1.get();
f1(a, b, 5);



would create three pointers that point to the same memory.




Does this vary with other compilers?




Certainly yes. restrict is not supported in C++. And it's only a hint for the compiler, the compiler may ignore it.




That difference is way to big to ignore




The only way to compare is to compare the generated assembly code. I don't have visual studio, so I can't do it.




class Timer { using clock = std::chrono::system_clock;




system_clock is a system-wide real time wall clock. Wall clock is used to display user-nice looking time on the desktop (wall). That's why it's called "wall clock", it should be used to be displayed on a wall. Use a monotonic clock, like high_resolution_clock, for measuring intervals. Or best, don't compare execution speeds, which depend on environment. Compare instruction count, ex. measured by the debugger with sample data or best, count the number of assembly instructions generated by the compiler for specific architecture and compiler options. Often sites like godbolt come in handy.



restrict qualifier is something the programmer takes care of. You must be sure, that you didn't pass pointers to a region that overlap.



You code compiles under gcc -O2 with or without restrict to the same assembly instructions, see here. If you wish to speed up the execution speed, start using platform specific instructions.






share|improve this answer
























  • Agree re high_resolution_clock. For MSVC it's the same as system_clock but portability is best using high_resolution_clock. Sure you can get pull the raw ptrs out of unique_ptr but you can't have multiple unique_ptrs pointing to the same things without duplicate destructors-argh. BTW, gotbolt also has VSC++ and the generated code is horrid. Loop unrolling and no rereading of aliased locations conditioned on ptr's not matching. gcc generates highly readable code but all the versions clearly show the compiler is assuming aliasing in -O2. Is there some option to presume pointers don't alias?

    – doug
    Nov 26 '18 at 0:10













Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53465220%2fcompiler-alias-assumptions-raw-ptrs-unique-ptrs-and-restrict-strange-resu%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes









0














I've delved into how MSVC optimizes with and without aliasing and have included a vector<int> version of the test functions named f3(). The set is now:



void f1(int *p1, int *p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f2(std::unique_ptr<int>& p1, std::unique_ptr<int>& p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f3(vector<int>& p1, vector<int>& p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

auto xf1 = f1; // avoid inlining
auto xf2 = f2;
auto xf3 = f3;


As before, xf1, xf2, and xf3 forces instantiation of the functions and provides pointers to call them as the compiler wants to inline them.



Turns out the unique_ptr and vector versions (f2() and f3()) produce code that tests whether p1 and p2 point to the same memory and, if so, aliasing is assumed producing the slow, but correct, code.



What is interesting is that unique_ptr<int> and vector<int> produce identical code. When COMDAT code folding is enabled in the linker optimizer the duplicate is deleted and the function pointer xf3 set to the same address as xf2 which can be seen in debugging. So when f3 is called it's actually the f2 code that is executed.



When these functions are executed they first test whether p1 and p2 are the same. If so they assume aliasing and generate correct code otherwise they assume no aliasing and generate faster code. If __restrict is used in f1() the code does not first test if p1 and p2 are the same and goes straight into the code that assumes no aliasing.



Summarizing, __restrict didn't really speed the raw ptr function up except in the case where p1 and p2 pointed to the same address. When p1 and p2 differed, it was just as fast as the unique_ptr and vector versions.



The compiler produces fast code even when the raw pointer function was called at the cost of initial tests of pointer equivalence slowing down by assuming aliasing when the pointers are equal.






share|improve this answer


























  • they first test whether p1 and p2 are the same no. You should test if the range <p1, p1 + count) is disjoint with <p2, p2 + count). Still std::unique_ptr<int>& TYPEMOD p1 is doing nothing, it maks std::addressof(p1) as __restrict. You want to mark the underlying pointer as restrict.

    – Kamil Cuk
    Nov 25 '18 at 22:00











  • @KamilCuk Yes, I haven't done that with f1() and am now curious too. UB would be expected? It doesn't apply to f2() and f3() since neither of those can overlap other than 100% overlap. I'll edit the answer to remove TYPEMOD I left it in by accident since it is now clear that __restrict is irrelevant to f2.

    – doug
    Nov 25 '18 at 22:14


















0














I've delved into how MSVC optimizes with and without aliasing and have included a vector<int> version of the test functions named f3(). The set is now:



void f1(int *p1, int *p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f2(std::unique_ptr<int>& p1, std::unique_ptr<int>& p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f3(vector<int>& p1, vector<int>& p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

auto xf1 = f1; // avoid inlining
auto xf2 = f2;
auto xf3 = f3;


As before, xf1, xf2, and xf3 forces instantiation of the functions and provides pointers to call them as the compiler wants to inline them.



Turns out the unique_ptr and vector versions (f2() and f3()) produce code that tests whether p1 and p2 point to the same memory and, if so, aliasing is assumed producing the slow, but correct, code.



What is interesting is that unique_ptr<int> and vector<int> produce identical code. When COMDAT code folding is enabled in the linker optimizer the duplicate is deleted and the function pointer xf3 set to the same address as xf2 which can be seen in debugging. So when f3 is called it's actually the f2 code that is executed.



When these functions are executed they first test whether p1 and p2 are the same. If so they assume aliasing and generate correct code otherwise they assume no aliasing and generate faster code. If __restrict is used in f1() the code does not first test if p1 and p2 are the same and goes straight into the code that assumes no aliasing.



Summarizing, __restrict didn't really speed the raw ptr function up except in the case where p1 and p2 pointed to the same address. When p1 and p2 differed, it was just as fast as the unique_ptr and vector versions.



The compiler produces fast code even when the raw pointer function was called at the cost of initial tests of pointer equivalence slowing down by assuming aliasing when the pointers are equal.






share|improve this answer


























  • they first test whether p1 and p2 are the same no. You should test if the range <p1, p1 + count) is disjoint with <p2, p2 + count). Still std::unique_ptr<int>& TYPEMOD p1 is doing nothing, it maks std::addressof(p1) as __restrict. You want to mark the underlying pointer as restrict.

    – Kamil Cuk
    Nov 25 '18 at 22:00











  • @KamilCuk Yes, I haven't done that with f1() and am now curious too. UB would be expected? It doesn't apply to f2() and f3() since neither of those can overlap other than 100% overlap. I'll edit the answer to remove TYPEMOD I left it in by accident since it is now clear that __restrict is irrelevant to f2.

    – doug
    Nov 25 '18 at 22:14
















0












0








0







I've delved into how MSVC optimizes with and without aliasing and have included a vector<int> version of the test functions named f3(). The set is now:



void f1(int *p1, int *p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f2(std::unique_ptr<int>& p1, std::unique_ptr<int>& p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f3(vector<int>& p1, vector<int>& p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

auto xf1 = f1; // avoid inlining
auto xf2 = f2;
auto xf3 = f3;


As before, xf1, xf2, and xf3 forces instantiation of the functions and provides pointers to call them as the compiler wants to inline them.



Turns out the unique_ptr and vector versions (f2() and f3()) produce code that tests whether p1 and p2 point to the same memory and, if so, aliasing is assumed producing the slow, but correct, code.



What is interesting is that unique_ptr<int> and vector<int> produce identical code. When COMDAT code folding is enabled in the linker optimizer the duplicate is deleted and the function pointer xf3 set to the same address as xf2 which can be seen in debugging. So when f3 is called it's actually the f2 code that is executed.



When these functions are executed they first test whether p1 and p2 are the same. If so they assume aliasing and generate correct code otherwise they assume no aliasing and generate faster code. If __restrict is used in f1() the code does not first test if p1 and p2 are the same and goes straight into the code that assumes no aliasing.



Summarizing, __restrict didn't really speed the raw ptr function up except in the case where p1 and p2 pointed to the same address. When p1 and p2 differed, it was just as fast as the unique_ptr and vector versions.



The compiler produces fast code even when the raw pointer function was called at the cost of initial tests of pointer equivalence slowing down by assuming aliasing when the pointers are equal.






share|improve this answer















I've delved into how MSVC optimizes with and without aliasing and have included a vector<int> version of the test functions named f3(). The set is now:



void f1(int *p1, int *p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f2(std::unique_ptr<int>& p1, std::unique_ptr<int>& p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f3(vector<int>& p1, vector<int>& p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

auto xf1 = f1; // avoid inlining
auto xf2 = f2;
auto xf3 = f3;


As before, xf1, xf2, and xf3 forces instantiation of the functions and provides pointers to call them as the compiler wants to inline them.



Turns out the unique_ptr and vector versions (f2() and f3()) produce code that tests whether p1 and p2 point to the same memory and, if so, aliasing is assumed producing the slow, but correct, code.



What is interesting is that unique_ptr<int> and vector<int> produce identical code. When COMDAT code folding is enabled in the linker optimizer the duplicate is deleted and the function pointer xf3 set to the same address as xf2 which can be seen in debugging. So when f3 is called it's actually the f2 code that is executed.



When these functions are executed they first test whether p1 and p2 are the same. If so they assume aliasing and generate correct code otherwise they assume no aliasing and generate faster code. If __restrict is used in f1() the code does not first test if p1 and p2 are the same and goes straight into the code that assumes no aliasing.



Summarizing, __restrict didn't really speed the raw ptr function up except in the case where p1 and p2 pointed to the same address. When p1 and p2 differed, it was just as fast as the unique_ptr and vector versions.



The compiler produces fast code even when the raw pointer function was called at the cost of initial tests of pointer equivalence slowing down by assuming aliasing when the pointers are equal.







share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 25 '18 at 22:16

























answered Nov 25 '18 at 21:41









dougdoug

7871410




7871410













  • they first test whether p1 and p2 are the same no. You should test if the range <p1, p1 + count) is disjoint with <p2, p2 + count). Still std::unique_ptr<int>& TYPEMOD p1 is doing nothing, it maks std::addressof(p1) as __restrict. You want to mark the underlying pointer as restrict.

    – Kamil Cuk
    Nov 25 '18 at 22:00











  • @KamilCuk Yes, I haven't done that with f1() and am now curious too. UB would be expected? It doesn't apply to f2() and f3() since neither of those can overlap other than 100% overlap. I'll edit the answer to remove TYPEMOD I left it in by accident since it is now clear that __restrict is irrelevant to f2.

    – doug
    Nov 25 '18 at 22:14





















  • they first test whether p1 and p2 are the same no. You should test if the range <p1, p1 + count) is disjoint with <p2, p2 + count). Still std::unique_ptr<int>& TYPEMOD p1 is doing nothing, it maks std::addressof(p1) as __restrict. You want to mark the underlying pointer as restrict.

    – Kamil Cuk
    Nov 25 '18 at 22:00











  • @KamilCuk Yes, I haven't done that with f1() and am now curious too. UB would be expected? It doesn't apply to f2() and f3() since neither of those can overlap other than 100% overlap. I'll edit the answer to remove TYPEMOD I left it in by accident since it is now clear that __restrict is irrelevant to f2.

    – doug
    Nov 25 '18 at 22:14



















they first test whether p1 and p2 are the same no. You should test if the range <p1, p1 + count) is disjoint with <p2, p2 + count). Still std::unique_ptr<int>& TYPEMOD p1 is doing nothing, it maks std::addressof(p1) as __restrict. You want to mark the underlying pointer as restrict.

– Kamil Cuk
Nov 25 '18 at 22:00





they first test whether p1 and p2 are the same no. You should test if the range <p1, p1 + count) is disjoint with <p2, p2 + count). Still std::unique_ptr<int>& TYPEMOD p1 is doing nothing, it maks std::addressof(p1) as __restrict. You want to mark the underlying pointer as restrict.

– Kamil Cuk
Nov 25 '18 at 22:00













@KamilCuk Yes, I haven't done that with f1() and am now curious too. UB would be expected? It doesn't apply to f2() and f3() since neither of those can overlap other than 100% overlap. I'll edit the answer to remove TYPEMOD I left it in by accident since it is now clear that __restrict is irrelevant to f2.

– doug
Nov 25 '18 at 22:14







@KamilCuk Yes, I haven't done that with f1() and am now curious too. UB would be expected? It doesn't apply to f2() and f3() since neither of those can overlap other than 100% overlap. I'll edit the answer to remove TYPEMOD I left it in by accident since it is now clear that __restrict is irrelevant to f2.

– doug
Nov 25 '18 at 22:14















0















Is there some way to use __restrict or some other technique to get rid of slow execution




Yes. Just cast the pointers into being restrict, thus giving the compiler information that they are restricted with each other.



#include <memory>
#include <vector>

#if defined(__cplusplus)
#if defined(_MSC_VER)
#define restrict __restrict
#elif defined(__GNUC__)
#define restrict __restrict__
#endif
#endif

void f1(int * restrict p1, int * restrict p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f2(std::unique_ptr<int>& pp1, std::unique_ptr<int>& pp2, int count)
{
int * const restrict p1 = pp1.get();
int * const restrict p2 = pp2.get();
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f3(std::vector<int>& pp1, std::vector<int>& pp2, int count)
{
int * const restrict p1 = &pp1[0];
int * const restrict p2 = &pp2[0];
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}


But code duplication is the worst:



void f1(int * restrict p1, int * restrict p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f2(std::unique_ptr<int>& pp1, std::unique_ptr<int>& pp2, int count)
{
f1(pp1.get(), pp2.get(), count);
}

void f3(std::vector<int>& pp1, std::vector<int>& pp2, int count)
{
f1(&pp1[0], &pp2[0], count);
}



Shouldn't compilers assume that unique_ptrs don't alias since you can't have partial overlaps and full overlaps are obvious?




No. As shown, we can get the pointer using std::unique_ptr::get() function. So doing:



std::unique_ptr p1;
int *a = p1.get();
int *b = p1.get();
f1(a, b, 5);



would create three pointers that point to the same memory.




Does this vary with other compilers?




Certainly yes. restrict is not supported in C++. And it's only a hint for the compiler, the compiler may ignore it.




That difference is way to big to ignore




The only way to compare is to compare the generated assembly code. I don't have visual studio, so I can't do it.




class Timer { using clock = std::chrono::system_clock;




system_clock is a system-wide real time wall clock. Wall clock is used to display user-nice looking time on the desktop (wall). That's why it's called "wall clock", it should be used to be displayed on a wall. Use a monotonic clock, like high_resolution_clock, for measuring intervals. Or best, don't compare execution speeds, which depend on environment. Compare instruction count, ex. measured by the debugger with sample data or best, count the number of assembly instructions generated by the compiler for specific architecture and compiler options. Often sites like godbolt come in handy.



restrict qualifier is something the programmer takes care of. You must be sure, that you didn't pass pointers to a region that overlap.



You code compiles under gcc -O2 with or without restrict to the same assembly instructions, see here. If you wish to speed up the execution speed, start using platform specific instructions.






share|improve this answer
























  • Agree re high_resolution_clock. For MSVC it's the same as system_clock but portability is best using high_resolution_clock. Sure you can get pull the raw ptrs out of unique_ptr but you can't have multiple unique_ptrs pointing to the same things without duplicate destructors-argh. BTW, gotbolt also has VSC++ and the generated code is horrid. Loop unrolling and no rereading of aliased locations conditioned on ptr's not matching. gcc generates highly readable code but all the versions clearly show the compiler is assuming aliasing in -O2. Is there some option to presume pointers don't alias?

    – doug
    Nov 26 '18 at 0:10


















0















Is there some way to use __restrict or some other technique to get rid of slow execution




Yes. Just cast the pointers into being restrict, thus giving the compiler information that they are restricted with each other.



#include <memory>
#include <vector>

#if defined(__cplusplus)
#if defined(_MSC_VER)
#define restrict __restrict
#elif defined(__GNUC__)
#define restrict __restrict__
#endif
#endif

void f1(int * restrict p1, int * restrict p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f2(std::unique_ptr<int>& pp1, std::unique_ptr<int>& pp2, int count)
{
int * const restrict p1 = pp1.get();
int * const restrict p2 = pp2.get();
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f3(std::vector<int>& pp1, std::vector<int>& pp2, int count)
{
int * const restrict p1 = &pp1[0];
int * const restrict p2 = &pp2[0];
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}


But code duplication is the worst:



void f1(int * restrict p1, int * restrict p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f2(std::unique_ptr<int>& pp1, std::unique_ptr<int>& pp2, int count)
{
f1(pp1.get(), pp2.get(), count);
}

void f3(std::vector<int>& pp1, std::vector<int>& pp2, int count)
{
f1(&pp1[0], &pp2[0], count);
}



Shouldn't compilers assume that unique_ptrs don't alias since you can't have partial overlaps and full overlaps are obvious?




No. As shown, we can get the pointer using std::unique_ptr::get() function. So doing:



std::unique_ptr p1;
int *a = p1.get();
int *b = p1.get();
f1(a, b, 5);



would create three pointers that point to the same memory.




Does this vary with other compilers?




Certainly yes. restrict is not supported in C++. And it's only a hint for the compiler, the compiler may ignore it.




That difference is way to big to ignore




The only way to compare is to compare the generated assembly code. I don't have visual studio, so I can't do it.




class Timer { using clock = std::chrono::system_clock;




system_clock is a system-wide real time wall clock. Wall clock is used to display user-nice looking time on the desktop (wall). That's why it's called "wall clock", it should be used to be displayed on a wall. Use a monotonic clock, like high_resolution_clock, for measuring intervals. Or best, don't compare execution speeds, which depend on environment. Compare instruction count, ex. measured by the debugger with sample data or best, count the number of assembly instructions generated by the compiler for specific architecture and compiler options. Often sites like godbolt come in handy.



restrict qualifier is something the programmer takes care of. You must be sure, that you didn't pass pointers to a region that overlap.



You code compiles under gcc -O2 with or without restrict to the same assembly instructions, see here. If you wish to speed up the execution speed, start using platform specific instructions.






share|improve this answer
























  • Agree re high_resolution_clock. For MSVC it's the same as system_clock but portability is best using high_resolution_clock. Sure you can get pull the raw ptrs out of unique_ptr but you can't have multiple unique_ptrs pointing to the same things without duplicate destructors-argh. BTW, gotbolt also has VSC++ and the generated code is horrid. Loop unrolling and no rereading of aliased locations conditioned on ptr's not matching. gcc generates highly readable code but all the versions clearly show the compiler is assuming aliasing in -O2. Is there some option to presume pointers don't alias?

    – doug
    Nov 26 '18 at 0:10
















0












0








0








Is there some way to use __restrict or some other technique to get rid of slow execution




Yes. Just cast the pointers into being restrict, thus giving the compiler information that they are restricted with each other.



#include <memory>
#include <vector>

#if defined(__cplusplus)
#if defined(_MSC_VER)
#define restrict __restrict
#elif defined(__GNUC__)
#define restrict __restrict__
#endif
#endif

void f1(int * restrict p1, int * restrict p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f2(std::unique_ptr<int>& pp1, std::unique_ptr<int>& pp2, int count)
{
int * const restrict p1 = pp1.get();
int * const restrict p2 = pp2.get();
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f3(std::vector<int>& pp1, std::vector<int>& pp2, int count)
{
int * const restrict p1 = &pp1[0];
int * const restrict p2 = &pp2[0];
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}


But code duplication is the worst:



void f1(int * restrict p1, int * restrict p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f2(std::unique_ptr<int>& pp1, std::unique_ptr<int>& pp2, int count)
{
f1(pp1.get(), pp2.get(), count);
}

void f3(std::vector<int>& pp1, std::vector<int>& pp2, int count)
{
f1(&pp1[0], &pp2[0], count);
}



Shouldn't compilers assume that unique_ptrs don't alias since you can't have partial overlaps and full overlaps are obvious?




No. As shown, we can get the pointer using std::unique_ptr::get() function. So doing:



std::unique_ptr p1;
int *a = p1.get();
int *b = p1.get();
f1(a, b, 5);



would create three pointers that point to the same memory.




Does this vary with other compilers?




Certainly yes. restrict is not supported in C++. And it's only a hint for the compiler, the compiler may ignore it.




That difference is way to big to ignore




The only way to compare is to compare the generated assembly code. I don't have visual studio, so I can't do it.




class Timer { using clock = std::chrono::system_clock;




system_clock is a system-wide real time wall clock. Wall clock is used to display user-nice looking time on the desktop (wall). That's why it's called "wall clock", it should be used to be displayed on a wall. Use a monotonic clock, like high_resolution_clock, for measuring intervals. Or best, don't compare execution speeds, which depend on environment. Compare instruction count, ex. measured by the debugger with sample data or best, count the number of assembly instructions generated by the compiler for specific architecture and compiler options. Often sites like godbolt come in handy.



restrict qualifier is something the programmer takes care of. You must be sure, that you didn't pass pointers to a region that overlap.



You code compiles under gcc -O2 with or without restrict to the same assembly instructions, see here. If you wish to speed up the execution speed, start using platform specific instructions.






share|improve this answer














Is there some way to use __restrict or some other technique to get rid of slow execution




Yes. Just cast the pointers into being restrict, thus giving the compiler information that they are restricted with each other.



#include <memory>
#include <vector>

#if defined(__cplusplus)
#if defined(_MSC_VER)
#define restrict __restrict
#elif defined(__GNUC__)
#define restrict __restrict__
#endif
#endif

void f1(int * restrict p1, int * restrict p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f2(std::unique_ptr<int>& pp1, std::unique_ptr<int>& pp2, int count)
{
int * const restrict p1 = pp1.get();
int * const restrict p2 = pp2.get();
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f3(std::vector<int>& pp1, std::vector<int>& pp2, int count)
{
int * const restrict p1 = &pp1[0];
int * const restrict p2 = &pp2[0];
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}


But code duplication is the worst:



void f1(int * restrict p1, int * restrict p2, int count)
{
for (int i = 1; i < count - 1; i++)
p2[i] = p1[i - 1] + p1[i + 1];
}

void f2(std::unique_ptr<int>& pp1, std::unique_ptr<int>& pp2, int count)
{
f1(pp1.get(), pp2.get(), count);
}

void f3(std::vector<int>& pp1, std::vector<int>& pp2, int count)
{
f1(&pp1[0], &pp2[0], count);
}



Shouldn't compilers assume that unique_ptrs don't alias since you can't have partial overlaps and full overlaps are obvious?




No. As shown, we can get the pointer using std::unique_ptr::get() function. So doing:



std::unique_ptr p1;
int *a = p1.get();
int *b = p1.get();
f1(a, b, 5);



would create three pointers that point to the same memory.




Does this vary with other compilers?




Certainly yes. restrict is not supported in C++. And it's only a hint for the compiler, the compiler may ignore it.




That difference is way to big to ignore




The only way to compare is to compare the generated assembly code. I don't have visual studio, so I can't do it.




class Timer { using clock = std::chrono::system_clock;




system_clock is a system-wide real time wall clock. Wall clock is used to display user-nice looking time on the desktop (wall). That's why it's called "wall clock", it should be used to be displayed on a wall. Use a monotonic clock, like high_resolution_clock, for measuring intervals. Or best, don't compare execution speeds, which depend on environment. Compare instruction count, ex. measured by the debugger with sample data or best, count the number of assembly instructions generated by the compiler for specific architecture and compiler options. Often sites like godbolt come in handy.



restrict qualifier is something the programmer takes care of. You must be sure, that you didn't pass pointers to a region that overlap.



You code compiles under gcc -O2 with or without restrict to the same assembly instructions, see here. If you wish to speed up the execution speed, start using platform specific instructions.







share|improve this answer












share|improve this answer



share|improve this answer










answered Nov 25 '18 at 22:44









Kamil CukKamil Cuk

11.9k1529




11.9k1529













  • Agree re high_resolution_clock. For MSVC it's the same as system_clock but portability is best using high_resolution_clock. Sure you can get pull the raw ptrs out of unique_ptr but you can't have multiple unique_ptrs pointing to the same things without duplicate destructors-argh. BTW, gotbolt also has VSC++ and the generated code is horrid. Loop unrolling and no rereading of aliased locations conditioned on ptr's not matching. gcc generates highly readable code but all the versions clearly show the compiler is assuming aliasing in -O2. Is there some option to presume pointers don't alias?

    – doug
    Nov 26 '18 at 0:10





















  • Agree re high_resolution_clock. For MSVC it's the same as system_clock but portability is best using high_resolution_clock. Sure you can get pull the raw ptrs out of unique_ptr but you can't have multiple unique_ptrs pointing to the same things without duplicate destructors-argh. BTW, gotbolt also has VSC++ and the generated code is horrid. Loop unrolling and no rereading of aliased locations conditioned on ptr's not matching. gcc generates highly readable code but all the versions clearly show the compiler is assuming aliasing in -O2. Is there some option to presume pointers don't alias?

    – doug
    Nov 26 '18 at 0:10



















Agree re high_resolution_clock. For MSVC it's the same as system_clock but portability is best using high_resolution_clock. Sure you can get pull the raw ptrs out of unique_ptr but you can't have multiple unique_ptrs pointing to the same things without duplicate destructors-argh. BTW, gotbolt also has VSC++ and the generated code is horrid. Loop unrolling and no rereading of aliased locations conditioned on ptr's not matching. gcc generates highly readable code but all the versions clearly show the compiler is assuming aliasing in -O2. Is there some option to presume pointers don't alias?

– doug
Nov 26 '18 at 0:10







Agree re high_resolution_clock. For MSVC it's the same as system_clock but portability is best using high_resolution_clock. Sure you can get pull the raw ptrs out of unique_ptr but you can't have multiple unique_ptrs pointing to the same things without duplicate destructors-argh. BTW, gotbolt also has VSC++ and the generated code is horrid. Loop unrolling and no rereading of aliased locations conditioned on ptr's not matching. gcc generates highly readable code but all the versions clearly show the compiler is assuming aliasing in -O2. Is there some option to presume pointers don't alias?

– doug
Nov 26 '18 at 0:10




















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53465220%2fcompiler-alias-assumptions-raw-ptrs-unique-ptrs-and-restrict-strange-resu%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Create new schema in PostgreSQL using DBeaver

Deepest pit of an array with Javascript: test on Codility

Costa Masnaga