CUDA loop on matlab






















I have been playing around with parallelization using both OpenACC and OpenMP in Fortran. I am now trying to do the same in MATLAB. I find it very interesting that it seems to be very hard to parallelize a loop using GPUs in MATLAB. Apparently the only way to do it is by using the arrayfun function. But I might be wrong.



At a conceptual level, I am wondering why GPU usage in MATLAB is not more straightforward than in Fortran. At a more practical level, I am wondering how to use GPUs in the simple code below.



Below, I am sharing four codes and benchmarks:




  1. Fortran OpenMP code

  2. Fortran ACC code

  3. Matlab parfor code

  4. Matlab CUDA (?): this is the one I don't know how to do.


Fortran OpenMP:



program rbc

use omp_lib ! For timing
use tools
implicit none

real, parameter :: beta = 0.984, eta = 2, alpha = 0.35, delta = 0.01, &
rho = 0.95, sigma = 0.005, zmin=-0.0480384, zmax=0.0480384;
integer, parameter :: nz = 4, nk=4800;
real :: zgrid(nz), kgrid(nk), t_tran_z(nz,nz), tran_z(nz,nz);
real :: kmax, kmin, tol, dif, c(nk), r(nk), w(nk);
real, dimension(nk,nz) :: v=0., v0=0., ev=0., c0=0.;
integer :: i, iz, ik, cnt;
logical :: ind(nk);
real(kind=8) :: start, finish ! For timing
real :: tmpmax, c1


call omp_set_num_threads(12)


!Grid for productivity z

! [1 x 4] grid of values for z
call linspace(zmin,zmax,nz,zgrid)
zgrid = exp(zgrid)
! [4 x 4] Markov transition matrix of z
tran_z(1,1) = 0.996757
tran_z(1,2) = 0.00324265
tran_z(1,3) = 0
tran_z(1,4) = 0
tran_z(2,1) = 0.000385933
tran_z(2,2) = 0.998441
tran_z(2,3) = 0.00117336
tran_z(2,4) = 0
tran_z(3,1) = 0
tran_z(3,2) = 0.00117336
tran_z(3,3) = 0.998441
tran_z(3,4) = 0.000385933
tran_z(4,1) = 0
tran_z(4,2) = 0
tran_z(4,3) = 0.00324265
tran_z(4,4) = 0.996757

! Grid for capital k

kmin = 0.95*(1/(alpha*zgrid(1)))*((1/beta)-1+delta)**(1/(alpha-1));
kmax = 1.05*(1/(alpha*zgrid(nz)))*((1/beta)-1+delta)**(1/(alpha-1));

! [1 x 4800] grid of possible values of k
call linspace(kmin, kmax, nk, kgrid)


! Compute initial wealth c0(k,z)
do iz=1,nz
c0(:,iz) = zgrid(iz)*kgrid**alpha + (1-delta)*kgrid;
end do

dif = 10000
tol = 1e-8
cnt = 1

do while(dif>tol)
!$omp parallel do default(shared) private(ik,iz,i,tmpmax,c1)
do ik=1,nk;
do iz = 1,nz;
tmpmax = -huge(0.)

do i = 1,nk
c1 = c0(ik,iz) - kgrid(i)
if(c1<0) exit
c1 = c1**(1-eta)/(1-eta)+ev(i,iz)
if(tmpmax<c1) tmpmax = c1
end do
v(ik,iz) = tmpmax
end do

end do
!$omp end parallel do
ev = beta*matmul(v,tran_z)
dif = maxval(abs(v-v0))
v0 = v
if(mod(cnt,1)==0) write(*,*) cnt, ':', dif
cnt = cnt+1
end do


end program


Fortran ACC:



Just replace the main loop in the code above with:



do while(dif>tol)
!$acc kernels
!$acc loop gang
do ik=1,nk;
!$acc loop gang
do iz = 1,nz;
tmpmax = -huge(0.)

do i = 1,nk
c1 = c0(ik,iz) - kgrid(i)
if(c1<0) exit
c1 = c1**(1-eta)/(1-eta)+ev(i,iz)
if(tmpmax<c1) tmpmax = c1
end do
v(ik,iz) = tmpmax
end do

end do

!$acc end kernels
ev = beta*matmul(v,tran_z)
dif = maxval(abs(v-v0))
v0 = v
if(mod(cnt,1)==0) write(*,*) cnt, ':', dif
cnt = cnt+1
end do


Matlab parfor:
(I know the code below could be made faster by using vectorized syntax, but the whole point of the exercise is to compare loop speeds).



tic;
beta = 0.984;
eta = 2;
alpha = 0.35;
delta = 0.01;
rho = 0.95;
sigma = 0.005;
zmin=-0.0480384;
zmax=0.0480384;
nz = 4;
nk=4800;

v=zeros(nk,nz);
v0=zeros(nk,nz);
ev=zeros(nk,nz);
c0=zeros(nk,nz);

%Grid for productivity z

%[1 x 4] grid of values for z
zgrid = linspace(zmin,zmax,nz);
zgrid = exp(zgrid);
% [4 x 4] Markov transition matrix of z
tran_z(1,1) = 0.996757;
tran_z(1,2) = 0.00324265;
tran_z(1,3) = 0;
tran_z(1,4) = 0;
tran_z(2,1) = 0.000385933;
tran_z(2,2) = 0.998441;
tran_z(2,3) = 0.00117336;
tran_z(2,4) = 0;
tran_z(3,1) = 0;
tran_z(3,2) = 0.00117336;
tran_z(3,3) = 0.998441;
tran_z(3,4) = 0.000385933;
tran_z(4,1) = 0;
tran_z(4,2) = 0;
tran_z(4,3) = 0.00324265;
tran_z(4,4) = 0.996757;

% Grid for capital k

kmin = 0.95*(1/(alpha*zgrid(1)))*((1/beta)-1+delta)^(1/(alpha-1));
kmax = 1.05*(1/(alpha*zgrid(nz)))*((1/beta)-1+delta)^(1/(alpha-1));

% [1 x 4800] grid of possible values of k
kgrid = linspace(kmin, kmax, nk);

% Compute initial wealth c0(k,z)
for iz=1:nz
c0(:,iz) = zgrid(iz)*kgrid.^alpha + (1-delta)*kgrid;
end

dif = 10000;
tol = 1e-8;
cnt = 1;

while dif>tol

parfor ik=1:nk
for iz = 1:nz
tmpmax = -intmax;

for i = 1:nk
c1 = c0(ik,iz) - kgrid(i);
if (c1<0)
continue
end
c1 = c1^(1-eta)/(1-eta)+ev(i,iz);
if tmpmax<c1
tmpmax = c1;
end
end
v(ik,iz) = tmpmax;
end

end
ev = beta*v*tran_z;
dif = max(max(abs(v-v0)));
v0 = v;
if mod(cnt,1)==0
fprintf('%1.5f : %1.5f \n', [cnt dif])
end
cnt = cnt+1;
end


toc


Matlab CUDA:



This is the part I have no clue how to code. Is using arrayfun the only way of doing this? In Fortran it is so simple to move from OpenMP to OpenACC. Isn't there an easy way in MATLAB to go from parfor to GPU loops?
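
For reference, the arrayfun route suggested in the comments below would look roughly like the sketch that follows. This is an untested illustration only: it assumes a GPU supported by the Parallel Computing Toolbox and a MATLAB release whose GPU-enabled arrayfun lets a nested function index into variables of the enclosing function (kgrid, ev, c0 below), and the helper name bellman_update_gpu is made up for the example.

function v = bellman_update_gpu(c0, ev, kgrid, eta)
% Hypothetical helper (name made up): one Bellman update on the GPU, with
% each (ik,iz) pair handled by one element-wise arrayfun task.
[nk, nz] = size(c0);
c0    = gpuArray(c0);
ev    = gpuArray(ev);
kgrid = gpuArray(kgrid);
[IK, IZ] = ndgrid(1:nk, 1:nz);                 % index pair for every entry of v
v = gather(arrayfun(@one_state, gpuArray(IK), gpuArray(IZ)));

    function vmax = one_state(ik, iz)
    % Same inner loop as the parfor version, executed per GPU thread.
    % kgrid, ev, c0, nk and eta are read from the enclosing workspace.
    vmax = -Inf;
    for i = 1:nk
        c1 = c0(ik,iz) - kgrid(i);
        if c1 < 0
            break                              % kgrid is increasing, nothing feasible remains
        end
        u = c1^(1-eta)/(1-eta) + ev(i,iz);
        if u > vmax
            vmax = u;
        end
    end
    end
end

As the comments and the answer below point out, though, simply vectorizing the update on gpuArray data is usually the easier route.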



The time comparison between codes:



Fortran OpenMP: 83.1 seconds 
Fortran ACC: 2.4 seconds
Matlab parfor: 1182 seconds


As a final remark, I should say the codes above solve a simple Real Business Cycle Model and were written based on this.










matlab openmp openacc

edited Nov 23 at 10:23 – talonmies
asked Nov 19 at 11:50 – phdstudent
























  • As for "easy" ways, there's the GPU coder, but it requires a toolbox. Other than that, there's an example in the MATLAB documentation that compares these things.
    – Dev-iL
    Nov 19 at 12:02






  • You are thinking about it wrongly. Vectorized syntax is MATLAB's way of optimizing a loop - whether it is on the CPU or the GPU. So the easiest way to use the GPU is to use gpuArray() to put everything on the GPU and then use the classical vectorized syntax. Then arrayfun is the more tedious alternative if you cannot write it in a vectorized manner.
    – Nicky Mattsson
    Nov 19 at 12:20












  • You cannot code in MATLAB and CUDA until 2018b. The newer, very specialized toolbox allows you to write kernels in MATLAB, but prior editions only allowed very specific functions to be run using CUDA. I personally write the code in CUDA and mex it.
    – Ander Biguri
    Nov 19 at 12:51












  • I am using 2018b. Any reference that I can check?
    – phdstudent
    Nov 19 at 12:52






  • There is no point doing performance checks with Matlab while refusing to use the most basic optimization, which is vectorization. There is a big overhead with each operation, incurred at each iteration with a for loop but only once with a vectorized op (which hides a loop anyway).
    – Brice
    Nov 19 at 13:50















1 Answer

So, this constraint (comparing raw loop speeds rather than vectorized code) is what is going to mess you up on this project. MATLAB stands for Matrix Laboratory. Vectors and matrices are kind of its thing. The number one way to optimize anything in MATLAB is to vectorize it. For this reason, when using performance-enhancing tools like CUDA, MATLAB assumes that you are going to vectorize your inputs if possible. Given the primacy of vectorizing inputs in the MATLAB coding style, it is not a fair comparison to assess its performance using only loops. It would be like assessing the performance of C++ while refusing to use pointers. If you want to use CUDA with MATLAB, the main way to go about it is to vectorize your inputs and use gpuArray. Honestly, I haven't looked too hard at your code, but it kind of looks like your inputs are already mostly vectorized. You may be able to get away with something as simple as gpuArray(1:nk) or kgrid = gpuArray(linspace(...)).






answered 5 hours ago, edited 5 hours ago – Jeremiah
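
To make that concrete, here is a minimal, untested sketch of what a vectorized gpuArray version of the question's parfor loop might look like. It reuses the variable names from the question, assumes implicit expansion (R2016b or newer) and a GPU supported by the Parallel Computing Toolbox, and is meant to illustrate the approach rather than serve as a definitive implementation.

% Sketch: same Bellman iteration as the parfor version, but each update is
% a single vectorized statement operating on gpuArray data.
kgrid = gpuArray(linspace(kmin, kmax, nk));    % 1 x nk, lives on the GPU
c0    = gpuArray(c0);                          % nk x nz
ev    = gpuArray(ev);
v     = zeros(nk, nz, 'gpuArray');
v0    = zeros(nk, nz, 'gpuArray');

dif = 10000;
while dif > tol
    for iz = 1:nz
        c1 = c0(:,iz) - kgrid;                 % nk x nk: consumption for every (ik,i) pair
        u  = c1.^(1-eta)/(1-eta) + ev(:,iz)';  % utility plus continuation value ev(i,iz)
        % (with eta = 2 as in the question, negative c1 stays real before being masked out)
        u(c1 < 0) = -Inf;                      % rule out infeasible capital choices
        v(:,iz) = max(u, [], 2);               % maximize over i, as the inner loop did
    end
    ev  = beta*v*tran_z;
    dif = max(max(abs(v - v0)));
    v0  = v;
end
v = gather(v);                                 % copy the result back to host memory

Whether this gets close to the OpenACC timing will depend on the GPU and on whether single or double precision is used, so it is worth timing it against the parfor version.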






















