How to speed up Numpy array filtering/selection?
I have around 40k rows and I want to test all kinds of selection combinations on the rows. By selection I mean boolean masks. The number of masks/filters is around 250MM.
The current simplified code:
np_arr = np.random.randint(1, 40000, 40000)
results = np.empty(250000000)
filters = np.random.randint(1, size=(250000000, 40000))
for i in range(250000000):
row_selection = np_arr[filters[i].astype(np.bool_)] # Select rows based on next filter
# Performing simple calculations such as sum, prod, count on selected rows and saving to result
results[i] = row_selection.sum() # Save simple calculation result to results array
I tried Numba and Multiprocessing, but since most of the processing is in the filter selection rather than the computation, that doesn't help much.
What would be the most efficient way to solve this? Is there any way to parallelize this? As far as I see I need to loop through each filter to then individually calculate the sum, prod, count etc because I can't apply filters in parallel (even though the calculations after applying the filters are very simple).
Appreciate any suggestions on performance improvement/speedup.
python performance numpy parallel-processing multiprocessing
add a comment |
I have around 40k rows and I want to test all kinds of selection combinations on the rows. By selection I mean boolean masks. The number of masks/filters is around 250MM.
The current simplified code:
np_arr = np.random.randint(1, 40000, 40000)
results = np.empty(250000000)
filters = np.random.randint(1, size=(250000000, 40000))
for i in range(250000000):
row_selection = np_arr[filters[i].astype(np.bool_)] # Select rows based on next filter
# Performing simple calculations such as sum, prod, count on selected rows and saving to result
results[i] = row_selection.sum() # Save simple calculation result to results array
I tried Numba and Multiprocessing, but since most of the processing is in the filter selection rather than the computation, that doesn't help much.
What would be the most efficient way to solve this? Is there any way to parallelize this? As far as I see I need to loop through each filter to then individually calculate the sum, prod, count etc because I can't apply filters in parallel (even though the calculations after applying the filters are very simple).
Appreciate any suggestions on performance improvement/speedup.
python performance numpy parallel-processing multiprocessing
Are all functions you want to apply available in Numba, or at least easy to implement?
– max9111
Nov 22 '18 at 16:49
for all i,j filters[i,j] ==0. use randint(2, ...) instead.
– B. M.
Nov 22 '18 at 18:26
Hi, yes calculations are easy to implement in Numba, but the tricky part is the loop which applies the filter 250MM times.
– Franc Weser
Nov 23 '18 at 2:51
Where do you get the filter array from in your calculation? A boolean array of size (250000000, 40000) has 10TB and would not fit into RAM. Or do you want to create some random numbers in the loop which applies the filter?
– max9111
Nov 23 '18 at 9:00
add a comment |
I have around 40k rows and I want to test all kinds of selection combinations on the rows. By selection I mean boolean masks. The number of masks/filters is around 250MM.
The current simplified code:
np_arr = np.random.randint(1, 40000, 40000)
results = np.empty(250000000)
filters = np.random.randint(1, size=(250000000, 40000))
for i in range(250000000):
row_selection = np_arr[filters[i].astype(np.bool_)] # Select rows based on next filter
# Performing simple calculations such as sum, prod, count on selected rows and saving to result
results[i] = row_selection.sum() # Save simple calculation result to results array
I tried Numba and Multiprocessing, but since most of the processing is in the filter selection rather than the computation, that doesn't help much.
What would be the most efficient way to solve this? Is there any way to parallelize this? As far as I see I need to loop through each filter to then individually calculate the sum, prod, count etc because I can't apply filters in parallel (even though the calculations after applying the filters are very simple).
Appreciate any suggestions on performance improvement/speedup.
python performance numpy parallel-processing multiprocessing
I have around 40k rows and I want to test all kinds of selection combinations on the rows. By selection I mean boolean masks. The number of masks/filters is around 250MM.
The current simplified code:
np_arr = np.random.randint(1, 40000, 40000)
results = np.empty(250000000)
filters = np.random.randint(1, size=(250000000, 40000))
for i in range(250000000):
row_selection = np_arr[filters[i].astype(np.bool_)] # Select rows based on next filter
# Performing simple calculations such as sum, prod, count on selected rows and saving to result
results[i] = row_selection.sum() # Save simple calculation result to results array
I tried Numba and Multiprocessing, but since most of the processing is in the filter selection rather than the computation, that doesn't help much.
What would be the most efficient way to solve this? Is there any way to parallelize this? As far as I see I need to loop through each filter to then individually calculate the sum, prod, count etc because I can't apply filters in parallel (even though the calculations after applying the filters are very simple).
Appreciate any suggestions on performance improvement/speedup.
python performance numpy parallel-processing multiprocessing
python performance numpy parallel-processing multiprocessing
asked Nov 22 '18 at 15:56
Franc WeserFranc Weser
16417
16417
Are all functions you want to apply available in Numba, or at least easy to implement?
– max9111
Nov 22 '18 at 16:49
for all i,j filters[i,j] ==0. use randint(2, ...) instead.
– B. M.
Nov 22 '18 at 18:26
Hi, yes calculations are easy to implement in Numba, but the tricky part is the loop which applies the filter 250MM times.
– Franc Weser
Nov 23 '18 at 2:51
Where do you get the filter array from in your calculation? A boolean array of size (250000000, 40000) has 10TB and would not fit into RAM. Or do you want to create some random numbers in the loop which applies the filter?
– max9111
Nov 23 '18 at 9:00
add a comment |
Are all functions you want to apply available in Numba, or at least easy to implement?
– max9111
Nov 22 '18 at 16:49
for all i,j filters[i,j] ==0. use randint(2, ...) instead.
– B. M.
Nov 22 '18 at 18:26
Hi, yes calculations are easy to implement in Numba, but the tricky part is the loop which applies the filter 250MM times.
– Franc Weser
Nov 23 '18 at 2:51
Where do you get the filter array from in your calculation? A boolean array of size (250000000, 40000) has 10TB and would not fit into RAM. Or do you want to create some random numbers in the loop which applies the filter?
– max9111
Nov 23 '18 at 9:00
Are all functions you want to apply available in Numba, or at least easy to implement?
– max9111
Nov 22 '18 at 16:49
Are all functions you want to apply available in Numba, or at least easy to implement?
– max9111
Nov 22 '18 at 16:49
for all i,j filters[i,j] ==0. use randint(2, ...) instead.
– B. M.
Nov 22 '18 at 18:26
for all i,j filters[i,j] ==0. use randint(2, ...) instead.
– B. M.
Nov 22 '18 at 18:26
Hi, yes calculations are easy to implement in Numba, but the tricky part is the loop which applies the filter 250MM times.
– Franc Weser
Nov 23 '18 at 2:51
Hi, yes calculations are easy to implement in Numba, but the tricky part is the loop which applies the filter 250MM times.
– Franc Weser
Nov 23 '18 at 2:51
Where do you get the filter array from in your calculation? A boolean array of size (250000000, 40000) has 10TB and would not fit into RAM. Or do you want to create some random numbers in the loop which applies the filter?
– max9111
Nov 23 '18 at 9:00
Where do you get the filter array from in your calculation? A boolean array of size (250000000, 40000) has 10TB and would not fit into RAM. Or do you want to create some random numbers in the loop which applies the filter?
– max9111
Nov 23 '18 at 9:00
add a comment |
2 Answers
2
active
oldest
votes
To get good performane within Numba
simply avoid masking and therefore very costly array copies. You have to implement the filters yourself, but that shouldn't be any problem with the filters you mentioned.
Parallelization is also very easy to do.
Example
import numpy as np
import numba as nb
max_num = 250000 #250000000
max_num2 = 4000#40000
np_arr = np.random.randint(1, max_num2, max_num2)
filters = np.random.randint(low=0,high=2, size=(max_num, max_num2)).astype(np.bool_)
#Implement your functions like this, avoid masking
#Sum Filter
@nb.njit(fastmath=True)
def sum_filter(filter,arr):
sum=0.
for i in range(filter.shape[0]):
if filter[i]==True:
sum+=arr[i]
return sum
#Implement your functions like this, avoid masking
#Prod Filter
@nb.njit(fastmath=True)
def prod_filter(filter,arr):
prod=1.
for i in range(filter.shape[0]):
if filter[i]==True:
prod*=arr[i]
return sum
@nb.njit(parallel=True)
def main_func(np_arr,filters):
results = np.empty(filters.shape[0])
for i in nb.prange(max_num):
results[i]=sum_filter(filters[i],np_arr)
#results[i]=prod_filter(filters[i],np_arr)
return results
add a comment |
One way to improve is to move the as_type outside the loop. In my tests it reduced the execution time by more than half.
For comparison, check the two codes below:
import numpy as np
import time
max_num = 250000 #250000000
max_num2 = 4000#40000
np_arr = np.random.randint(1, max_num2, max_num2)
results = np.empty(max_num)
filters = np.random.randint(1, size=(max_num, max_num2))
start = time.time()
for i in range(max_num):
row_selection = np_arr[filters[i].astype(np.bool_)] # Select rows based on next filter
# Performing simple calculations such as sum, prod, count on selected rows and saving to result
results[i] = row_selection.sum() # Save simple calculation result to results array
end = time.time()
print(end - start)
takes 2.12
while
import numpy as np
import time
max_num = 250000 #250000000
max_num2 = 4000#40000
np_arr = np.random.randint(1, max_num2, max_num2)
results = np.empty(max_num)
filters = np.random.randint(1, size=(max_num, max_num2)).astype(np.bool_)
start = time.time()
for i in range(max_num):
row_selection = np_arr[filters[i]] # Select rows based on next filter
# Performing simple calculations such as sum, prod, count on selected rows and saving to result
results[i] = row_selection.sum() # Save simple calculation result to results array
end = time.time()
print(end - start)
takes 0.940
Good idea, thanks!
– Franc Weser
Nov 22 '18 at 16:47
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53434568%2fhow-to-speed-up-numpy-array-filtering-selection%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
To get good performane within Numba
simply avoid masking and therefore very costly array copies. You have to implement the filters yourself, but that shouldn't be any problem with the filters you mentioned.
Parallelization is also very easy to do.
Example
import numpy as np
import numba as nb
max_num = 250000 #250000000
max_num2 = 4000#40000
np_arr = np.random.randint(1, max_num2, max_num2)
filters = np.random.randint(low=0,high=2, size=(max_num, max_num2)).astype(np.bool_)
#Implement your functions like this, avoid masking
#Sum Filter
@nb.njit(fastmath=True)
def sum_filter(filter,arr):
sum=0.
for i in range(filter.shape[0]):
if filter[i]==True:
sum+=arr[i]
return sum
#Implement your functions like this, avoid masking
#Prod Filter
@nb.njit(fastmath=True)
def prod_filter(filter,arr):
prod=1.
for i in range(filter.shape[0]):
if filter[i]==True:
prod*=arr[i]
return sum
@nb.njit(parallel=True)
def main_func(np_arr,filters):
results = np.empty(filters.shape[0])
for i in nb.prange(max_num):
results[i]=sum_filter(filters[i],np_arr)
#results[i]=prod_filter(filters[i],np_arr)
return results
add a comment |
To get good performane within Numba
simply avoid masking and therefore very costly array copies. You have to implement the filters yourself, but that shouldn't be any problem with the filters you mentioned.
Parallelization is also very easy to do.
Example
import numpy as np
import numba as nb
max_num = 250000 #250000000
max_num2 = 4000#40000
np_arr = np.random.randint(1, max_num2, max_num2)
filters = np.random.randint(low=0,high=2, size=(max_num, max_num2)).astype(np.bool_)
#Implement your functions like this, avoid masking
#Sum Filter
@nb.njit(fastmath=True)
def sum_filter(filter,arr):
sum=0.
for i in range(filter.shape[0]):
if filter[i]==True:
sum+=arr[i]
return sum
#Implement your functions like this, avoid masking
#Prod Filter
@nb.njit(fastmath=True)
def prod_filter(filter,arr):
prod=1.
for i in range(filter.shape[0]):
if filter[i]==True:
prod*=arr[i]
return sum
@nb.njit(parallel=True)
def main_func(np_arr,filters):
results = np.empty(filters.shape[0])
for i in nb.prange(max_num):
results[i]=sum_filter(filters[i],np_arr)
#results[i]=prod_filter(filters[i],np_arr)
return results
add a comment |
To get good performane within Numba
simply avoid masking and therefore very costly array copies. You have to implement the filters yourself, but that shouldn't be any problem with the filters you mentioned.
Parallelization is also very easy to do.
Example
import numpy as np
import numba as nb
max_num = 250000 #250000000
max_num2 = 4000#40000
np_arr = np.random.randint(1, max_num2, max_num2)
filters = np.random.randint(low=0,high=2, size=(max_num, max_num2)).astype(np.bool_)
#Implement your functions like this, avoid masking
#Sum Filter
@nb.njit(fastmath=True)
def sum_filter(filter,arr):
sum=0.
for i in range(filter.shape[0]):
if filter[i]==True:
sum+=arr[i]
return sum
#Implement your functions like this, avoid masking
#Prod Filter
@nb.njit(fastmath=True)
def prod_filter(filter,arr):
prod=1.
for i in range(filter.shape[0]):
if filter[i]==True:
prod*=arr[i]
return sum
@nb.njit(parallel=True)
def main_func(np_arr,filters):
results = np.empty(filters.shape[0])
for i in nb.prange(max_num):
results[i]=sum_filter(filters[i],np_arr)
#results[i]=prod_filter(filters[i],np_arr)
return results
To get good performane within Numba
simply avoid masking and therefore very costly array copies. You have to implement the filters yourself, but that shouldn't be any problem with the filters you mentioned.
Parallelization is also very easy to do.
Example
import numpy as np
import numba as nb
max_num = 250000 #250000000
max_num2 = 4000#40000
np_arr = np.random.randint(1, max_num2, max_num2)
filters = np.random.randint(low=0,high=2, size=(max_num, max_num2)).astype(np.bool_)
#Implement your functions like this, avoid masking
#Sum Filter
@nb.njit(fastmath=True)
def sum_filter(filter,arr):
sum=0.
for i in range(filter.shape[0]):
if filter[i]==True:
sum+=arr[i]
return sum
#Implement your functions like this, avoid masking
#Prod Filter
@nb.njit(fastmath=True)
def prod_filter(filter,arr):
prod=1.
for i in range(filter.shape[0]):
if filter[i]==True:
prod*=arr[i]
return sum
@nb.njit(parallel=True)
def main_func(np_arr,filters):
results = np.empty(filters.shape[0])
for i in nb.prange(max_num):
results[i]=sum_filter(filters[i],np_arr)
#results[i]=prod_filter(filters[i],np_arr)
return results
answered Nov 22 '18 at 17:09
max9111max9111
2,2761612
2,2761612
add a comment |
add a comment |
One way to improve is to move the as_type outside the loop. In my tests it reduced the execution time by more than half.
For comparison, check the two codes below:
import numpy as np
import time
max_num = 250000 #250000000
max_num2 = 4000#40000
np_arr = np.random.randint(1, max_num2, max_num2)
results = np.empty(max_num)
filters = np.random.randint(1, size=(max_num, max_num2))
start = time.time()
for i in range(max_num):
row_selection = np_arr[filters[i].astype(np.bool_)] # Select rows based on next filter
# Performing simple calculations such as sum, prod, count on selected rows and saving to result
results[i] = row_selection.sum() # Save simple calculation result to results array
end = time.time()
print(end - start)
takes 2.12
while
import numpy as np
import time
max_num = 250000 #250000000
max_num2 = 4000#40000
np_arr = np.random.randint(1, max_num2, max_num2)
results = np.empty(max_num)
filters = np.random.randint(1, size=(max_num, max_num2)).astype(np.bool_)
start = time.time()
for i in range(max_num):
row_selection = np_arr[filters[i]] # Select rows based on next filter
# Performing simple calculations such as sum, prod, count on selected rows and saving to result
results[i] = row_selection.sum() # Save simple calculation result to results array
end = time.time()
print(end - start)
takes 0.940
Good idea, thanks!
– Franc Weser
Nov 22 '18 at 16:47
add a comment |
One way to improve is to move the as_type outside the loop. In my tests it reduced the execution time by more than half.
For comparison, check the two codes below:
import numpy as np
import time
max_num = 250000 #250000000
max_num2 = 4000#40000
np_arr = np.random.randint(1, max_num2, max_num2)
results = np.empty(max_num)
filters = np.random.randint(1, size=(max_num, max_num2))
start = time.time()
for i in range(max_num):
row_selection = np_arr[filters[i].astype(np.bool_)] # Select rows based on next filter
# Performing simple calculations such as sum, prod, count on selected rows and saving to result
results[i] = row_selection.sum() # Save simple calculation result to results array
end = time.time()
print(end - start)
takes 2.12
while
import numpy as np
import time
max_num = 250000 #250000000
max_num2 = 4000#40000
np_arr = np.random.randint(1, max_num2, max_num2)
results = np.empty(max_num)
filters = np.random.randint(1, size=(max_num, max_num2)).astype(np.bool_)
start = time.time()
for i in range(max_num):
row_selection = np_arr[filters[i]] # Select rows based on next filter
# Performing simple calculations such as sum, prod, count on selected rows and saving to result
results[i] = row_selection.sum() # Save simple calculation result to results array
end = time.time()
print(end - start)
takes 0.940
Good idea, thanks!
– Franc Weser
Nov 22 '18 at 16:47
add a comment |
One way to improve is to move the as_type outside the loop. In my tests it reduced the execution time by more than half.
For comparison, check the two codes below:
import numpy as np
import time
max_num = 250000 #250000000
max_num2 = 4000#40000
np_arr = np.random.randint(1, max_num2, max_num2)
results = np.empty(max_num)
filters = np.random.randint(1, size=(max_num, max_num2))
start = time.time()
for i in range(max_num):
row_selection = np_arr[filters[i].astype(np.bool_)] # Select rows based on next filter
# Performing simple calculations such as sum, prod, count on selected rows and saving to result
results[i] = row_selection.sum() # Save simple calculation result to results array
end = time.time()
print(end - start)
takes 2.12
while
import numpy as np
import time
max_num = 250000 #250000000
max_num2 = 4000#40000
np_arr = np.random.randint(1, max_num2, max_num2)
results = np.empty(max_num)
filters = np.random.randint(1, size=(max_num, max_num2)).astype(np.bool_)
start = time.time()
for i in range(max_num):
row_selection = np_arr[filters[i]] # Select rows based on next filter
# Performing simple calculations such as sum, prod, count on selected rows and saving to result
results[i] = row_selection.sum() # Save simple calculation result to results array
end = time.time()
print(end - start)
takes 0.940
One way to improve is to move the as_type outside the loop. In my tests it reduced the execution time by more than half.
For comparison, check the two codes below:
import numpy as np
import time
max_num = 250000 #250000000
max_num2 = 4000#40000
np_arr = np.random.randint(1, max_num2, max_num2)
results = np.empty(max_num)
filters = np.random.randint(1, size=(max_num, max_num2))
start = time.time()
for i in range(max_num):
row_selection = np_arr[filters[i].astype(np.bool_)] # Select rows based on next filter
# Performing simple calculations such as sum, prod, count on selected rows and saving to result
results[i] = row_selection.sum() # Save simple calculation result to results array
end = time.time()
print(end - start)
takes 2.12
while
import numpy as np
import time
max_num = 250000 #250000000
max_num2 = 4000#40000
np_arr = np.random.randint(1, max_num2, max_num2)
results = np.empty(max_num)
filters = np.random.randint(1, size=(max_num, max_num2)).astype(np.bool_)
start = time.time()
for i in range(max_num):
row_selection = np_arr[filters[i]] # Select rows based on next filter
# Performing simple calculations such as sum, prod, count on selected rows and saving to result
results[i] = row_selection.sum() # Save simple calculation result to results array
end = time.time()
print(end - start)
takes 0.940
answered Nov 22 '18 at 16:40
Pedro TorresPedro Torres
683413
683413
Good idea, thanks!
– Franc Weser
Nov 22 '18 at 16:47
add a comment |
Good idea, thanks!
– Franc Weser
Nov 22 '18 at 16:47
Good idea, thanks!
– Franc Weser
Nov 22 '18 at 16:47
Good idea, thanks!
– Franc Weser
Nov 22 '18 at 16:47
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53434568%2fhow-to-speed-up-numpy-array-filtering-selection%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Are all functions you want to apply available in Numba, or at least easy to implement?
– max9111
Nov 22 '18 at 16:49
for all i,j filters[i,j] ==0. use randint(2, ...) instead.
– B. M.
Nov 22 '18 at 18:26
Hi, yes calculations are easy to implement in Numba, but the tricky part is the loop which applies the filter 250MM times.
– Franc Weser
Nov 23 '18 at 2:51
Where do you get the filter array from in your calculation? A boolean array of size (250000000, 40000) has 10TB and would not fit into RAM. Or do you want to create some random numbers in the loop which applies the filter?
– max9111
Nov 23 '18 at 9:00