Efficiently dropping rows from a DataFrame if some row-values are identical with row-values in a second...
import pandas as pd
df1 = pd.DataFrame({'id': [ 1, 1, 1, 2, 2, 2, 3, 3, 3],
'nr': [91, 92, 93, 91, 92, 93, 91, 92, 93],
'val_a':[22, 23, 24, 33, 34, 35, 44, 43, 42]})
df2 = pd.DataFrame({'id': [ 1, 1, 2, 3, 4, 4, 3, 5],
'nr': [91, 92, 91, 99, 92, 93, 92, 99],
'val_a':[72, 27, 74, 83, 84, 85, 84, 83]})
def eliminate1 ():
for i1, row1 in df1.iterrows():
for i2, row2 in df2.iterrows():
if row1['id'] == row2['id'] and row1['nr'] == row2['nr']:
df1.drop(i1, inplace=True)
df1.reset_index(drop=True, inplace=True)
print(df1)
eliminate1()
I want to drop all rows from df1, where 'id' AND 'nr' have equal values in any row of df2. eliminate1() works well, see result below, but is very slow in case of large data sets.
Here are df1 and df2:
id nr val_a
0 1 91 22
1 1 92 23
2 1 93 24
3 2 91 33
4 2 92 34
5 2 93 35
6 3 91 44
7 3 92 43
8 3 93 42
id nr val_a
0 1 91 72
1 1 92 27
2 2 91 74
3 3 99 83
4 4 92 84
5 4 93 85
6 3 92 84
7 5 99 83
And here the result as it should look like:
id nr val_a
0 1 93 24
1 2 92 34
2 2 93 35
3 3 91 44
4 3 93 42
Does anyone know how to write a faster code and/or use an already existing function?
python pandas performance dataframe
add a comment |
import pandas as pd
df1 = pd.DataFrame({'id': [ 1, 1, 1, 2, 2, 2, 3, 3, 3],
'nr': [91, 92, 93, 91, 92, 93, 91, 92, 93],
'val_a':[22, 23, 24, 33, 34, 35, 44, 43, 42]})
df2 = pd.DataFrame({'id': [ 1, 1, 2, 3, 4, 4, 3, 5],
'nr': [91, 92, 91, 99, 92, 93, 92, 99],
'val_a':[72, 27, 74, 83, 84, 85, 84, 83]})
def eliminate1 ():
for i1, row1 in df1.iterrows():
for i2, row2 in df2.iterrows():
if row1['id'] == row2['id'] and row1['nr'] == row2['nr']:
df1.drop(i1, inplace=True)
df1.reset_index(drop=True, inplace=True)
print(df1)
eliminate1()
I want to drop all rows from df1, where 'id' AND 'nr' have equal values in any row of df2. eliminate1() works well, see result below, but is very slow in case of large data sets.
Here are df1 and df2:
id nr val_a
0 1 91 22
1 1 92 23
2 1 93 24
3 2 91 33
4 2 92 34
5 2 93 35
6 3 91 44
7 3 92 43
8 3 93 42
id nr val_a
0 1 91 72
1 1 92 27
2 2 91 74
3 3 99 83
4 4 92 84
5 4 93 85
6 3 92 84
7 5 99 83
And here the result as it should look like:
id nr val_a
0 1 93 24
1 2 92 34
2 2 93 35
3 3 91 44
4 3 93 42
Does anyone know how to write a faster code and/or use an already existing function?
python pandas performance dataframe
add a comment |
import pandas as pd
df1 = pd.DataFrame({'id': [ 1, 1, 1, 2, 2, 2, 3, 3, 3],
'nr': [91, 92, 93, 91, 92, 93, 91, 92, 93],
'val_a':[22, 23, 24, 33, 34, 35, 44, 43, 42]})
df2 = pd.DataFrame({'id': [ 1, 1, 2, 3, 4, 4, 3, 5],
'nr': [91, 92, 91, 99, 92, 93, 92, 99],
'val_a':[72, 27, 74, 83, 84, 85, 84, 83]})
def eliminate1 ():
for i1, row1 in df1.iterrows():
for i2, row2 in df2.iterrows():
if row1['id'] == row2['id'] and row1['nr'] == row2['nr']:
df1.drop(i1, inplace=True)
df1.reset_index(drop=True, inplace=True)
print(df1)
eliminate1()
I want to drop all rows from df1, where 'id' AND 'nr' have equal values in any row of df2. eliminate1() works well, see result below, but is very slow in case of large data sets.
Here are df1 and df2:
id nr val_a
0 1 91 22
1 1 92 23
2 1 93 24
3 2 91 33
4 2 92 34
5 2 93 35
6 3 91 44
7 3 92 43
8 3 93 42
id nr val_a
0 1 91 72
1 1 92 27
2 2 91 74
3 3 99 83
4 4 92 84
5 4 93 85
6 3 92 84
7 5 99 83
And here the result as it should look like:
id nr val_a
0 1 93 24
1 2 92 34
2 2 93 35
3 3 91 44
4 3 93 42
Does anyone know how to write a faster code and/or use an already existing function?
python pandas performance dataframe
import pandas as pd
df1 = pd.DataFrame({'id': [ 1, 1, 1, 2, 2, 2, 3, 3, 3],
'nr': [91, 92, 93, 91, 92, 93, 91, 92, 93],
'val_a':[22, 23, 24, 33, 34, 35, 44, 43, 42]})
df2 = pd.DataFrame({'id': [ 1, 1, 2, 3, 4, 4, 3, 5],
'nr': [91, 92, 91, 99, 92, 93, 92, 99],
'val_a':[72, 27, 74, 83, 84, 85, 84, 83]})
def eliminate1 ():
for i1, row1 in df1.iterrows():
for i2, row2 in df2.iterrows():
if row1['id'] == row2['id'] and row1['nr'] == row2['nr']:
df1.drop(i1, inplace=True)
df1.reset_index(drop=True, inplace=True)
print(df1)
eliminate1()
I want to drop all rows from df1, where 'id' AND 'nr' have equal values in any row of df2. eliminate1() works well, see result below, but is very slow in case of large data sets.
Here are df1 and df2:
id nr val_a
0 1 91 22
1 1 92 23
2 1 93 24
3 2 91 33
4 2 92 34
5 2 93 35
6 3 91 44
7 3 92 43
8 3 93 42
id nr val_a
0 1 91 72
1 1 92 27
2 2 91 74
3 3 99 83
4 4 92 84
5 4 93 85
6 3 92 84
7 5 99 83
And here the result as it should look like:
id nr val_a
0 1 93 24
1 2 92 34
2 2 93 35
3 3 91 44
4 3 93 42
Does anyone know how to write a faster code and/or use an already existing function?
python pandas performance dataframe
python pandas performance dataframe
edited Nov 21 '18 at 15:12
jpp
95.1k2157108
95.1k2157108
asked Nov 21 '18 at 15:03
UweDUweD
1495
1495
add a comment |
add a comment |
3 Answers
3
active
oldest
votes
merge
You can merge
with indicator=True
and include only those rows marked 'left_only'
.
res = df1.merge(df2.drop('val_a', 1), how='left', on=['id', 'nr'], indicator=True)
res = res.loc[res['_merge'] == 'left_only'].drop('_merge', 1)
print(res)
id nr val_a
2 1 93 24
4 2 92 34
5 2 93 35
6 3 91 44
8 3 93 42
The solution is easily adaptable to any condition depending on 'left_only'
, 'right_only'
or 'both'
.
add a comment |
Method 1 isin
after zip the merge
column into tuple
df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]
Out[43]:
id nr val_a
2 1 93 24
4 2 92 34
5 2 93 35
6 3 91 44
8 3 93 42
Method 2 numpy
broadcast
s1=df1[['id','nr']].values
s2=df2[['id','nr']].values
df1[~np.any(np.all(s1==s2[:,None],-1),0)]
Out[64]:
id nr val_a
2 1 93 24
4 2 92 34
5 2 93 35
6 3 91 44
8 3 93 42
My method timing
%timeit df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]
100 loops, best of 3: 3.67 ms per loop
def m2():
s1 = df1[['id', 'nr']].values
s2 = df2[['id', 'nr']].values
return df1[~np.any(np.all(s1 == s2[:, None], -1), 0)]
%timeit m2()
1000 loops, best of 3: 926 µs per loop
Amazingly fast as compared to my original code!
– UweD
Nov 22 '18 at 9:11
add a comment |
Would an inner join
solve your problem? Get the index of params that match the condition then filter it out. You'll just have to reset_index()
afterwards if you wish to do so.
df3 = df1.merge(df2, how = 'inner', on = ['id','nr']).reset_index()
id_list = df3['id'].tolist()
df4 = df1[~df1['id'].isin(id_list)]
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53414888%2fefficiently-dropping-rows-from-a-dataframe-if-some-row-values-are-identical-with%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
merge
You can merge
with indicator=True
and include only those rows marked 'left_only'
.
res = df1.merge(df2.drop('val_a', 1), how='left', on=['id', 'nr'], indicator=True)
res = res.loc[res['_merge'] == 'left_only'].drop('_merge', 1)
print(res)
id nr val_a
2 1 93 24
4 2 92 34
5 2 93 35
6 3 91 44
8 3 93 42
The solution is easily adaptable to any condition depending on 'left_only'
, 'right_only'
or 'both'
.
add a comment |
merge
You can merge
with indicator=True
and include only those rows marked 'left_only'
.
res = df1.merge(df2.drop('val_a', 1), how='left', on=['id', 'nr'], indicator=True)
res = res.loc[res['_merge'] == 'left_only'].drop('_merge', 1)
print(res)
id nr val_a
2 1 93 24
4 2 92 34
5 2 93 35
6 3 91 44
8 3 93 42
The solution is easily adaptable to any condition depending on 'left_only'
, 'right_only'
or 'both'
.
add a comment |
merge
You can merge
with indicator=True
and include only those rows marked 'left_only'
.
res = df1.merge(df2.drop('val_a', 1), how='left', on=['id', 'nr'], indicator=True)
res = res.loc[res['_merge'] == 'left_only'].drop('_merge', 1)
print(res)
id nr val_a
2 1 93 24
4 2 92 34
5 2 93 35
6 3 91 44
8 3 93 42
The solution is easily adaptable to any condition depending on 'left_only'
, 'right_only'
or 'both'
.
merge
You can merge
with indicator=True
and include only those rows marked 'left_only'
.
res = df1.merge(df2.drop('val_a', 1), how='left', on=['id', 'nr'], indicator=True)
res = res.loc[res['_merge'] == 'left_only'].drop('_merge', 1)
print(res)
id nr val_a
2 1 93 24
4 2 92 34
5 2 93 35
6 3 91 44
8 3 93 42
The solution is easily adaptable to any condition depending on 'left_only'
, 'right_only'
or 'both'
.
answered Nov 21 '18 at 15:10
jppjpp
95.1k2157108
95.1k2157108
add a comment |
add a comment |
Method 1 isin
after zip the merge
column into tuple
df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]
Out[43]:
id nr val_a
2 1 93 24
4 2 92 34
5 2 93 35
6 3 91 44
8 3 93 42
Method 2 numpy
broadcast
s1=df1[['id','nr']].values
s2=df2[['id','nr']].values
df1[~np.any(np.all(s1==s2[:,None],-1),0)]
Out[64]:
id nr val_a
2 1 93 24
4 2 92 34
5 2 93 35
6 3 91 44
8 3 93 42
My method timing
%timeit df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]
100 loops, best of 3: 3.67 ms per loop
def m2():
s1 = df1[['id', 'nr']].values
s2 = df2[['id', 'nr']].values
return df1[~np.any(np.all(s1 == s2[:, None], -1), 0)]
%timeit m2()
1000 loops, best of 3: 926 µs per loop
Amazingly fast as compared to my original code!
– UweD
Nov 22 '18 at 9:11
add a comment |
Method 1 isin
after zip the merge
column into tuple
df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]
Out[43]:
id nr val_a
2 1 93 24
4 2 92 34
5 2 93 35
6 3 91 44
8 3 93 42
Method 2 numpy
broadcast
s1=df1[['id','nr']].values
s2=df2[['id','nr']].values
df1[~np.any(np.all(s1==s2[:,None],-1),0)]
Out[64]:
id nr val_a
2 1 93 24
4 2 92 34
5 2 93 35
6 3 91 44
8 3 93 42
My method timing
%timeit df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]
100 loops, best of 3: 3.67 ms per loop
def m2():
s1 = df1[['id', 'nr']].values
s2 = df2[['id', 'nr']].values
return df1[~np.any(np.all(s1 == s2[:, None], -1), 0)]
%timeit m2()
1000 loops, best of 3: 926 µs per loop
Amazingly fast as compared to my original code!
– UweD
Nov 22 '18 at 9:11
add a comment |
Method 1 isin
after zip the merge
column into tuple
df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]
Out[43]:
id nr val_a
2 1 93 24
4 2 92 34
5 2 93 35
6 3 91 44
8 3 93 42
Method 2 numpy
broadcast
s1=df1[['id','nr']].values
s2=df2[['id','nr']].values
df1[~np.any(np.all(s1==s2[:,None],-1),0)]
Out[64]:
id nr val_a
2 1 93 24
4 2 92 34
5 2 93 35
6 3 91 44
8 3 93 42
My method timing
%timeit df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]
100 loops, best of 3: 3.67 ms per loop
def m2():
s1 = df1[['id', 'nr']].values
s2 = df2[['id', 'nr']].values
return df1[~np.any(np.all(s1 == s2[:, None], -1), 0)]
%timeit m2()
1000 loops, best of 3: 926 µs per loop
Method 1 isin
after zip the merge
column into tuple
df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]
Out[43]:
id nr val_a
2 1 93 24
4 2 92 34
5 2 93 35
6 3 91 44
8 3 93 42
Method 2 numpy
broadcast
s1=df1[['id','nr']].values
s2=df2[['id','nr']].values
df1[~np.any(np.all(s1==s2[:,None],-1),0)]
Out[64]:
id nr val_a
2 1 93 24
4 2 92 34
5 2 93 35
6 3 91 44
8 3 93 42
My method timing
%timeit df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]
100 loops, best of 3: 3.67 ms per loop
def m2():
s1 = df1[['id', 'nr']].values
s2 = df2[['id', 'nr']].values
return df1[~np.any(np.all(s1 == s2[:, None], -1), 0)]
%timeit m2()
1000 loops, best of 3: 926 µs per loop
edited Nov 21 '18 at 15:29
answered Nov 21 '18 at 15:24
W-BW-B
104k73165
104k73165
Amazingly fast as compared to my original code!
– UweD
Nov 22 '18 at 9:11
add a comment |
Amazingly fast as compared to my original code!
– UweD
Nov 22 '18 at 9:11
Amazingly fast as compared to my original code!
– UweD
Nov 22 '18 at 9:11
Amazingly fast as compared to my original code!
– UweD
Nov 22 '18 at 9:11
add a comment |
Would an inner join
solve your problem? Get the index of params that match the condition then filter it out. You'll just have to reset_index()
afterwards if you wish to do so.
df3 = df1.merge(df2, how = 'inner', on = ['id','nr']).reset_index()
id_list = df3['id'].tolist()
df4 = df1[~df1['id'].isin(id_list)]
add a comment |
Would an inner join
solve your problem? Get the index of params that match the condition then filter it out. You'll just have to reset_index()
afterwards if you wish to do so.
df3 = df1.merge(df2, how = 'inner', on = ['id','nr']).reset_index()
id_list = df3['id'].tolist()
df4 = df1[~df1['id'].isin(id_list)]
add a comment |
Would an inner join
solve your problem? Get the index of params that match the condition then filter it out. You'll just have to reset_index()
afterwards if you wish to do so.
df3 = df1.merge(df2, how = 'inner', on = ['id','nr']).reset_index()
id_list = df3['id'].tolist()
df4 = df1[~df1['id'].isin(id_list)]
Would an inner join
solve your problem? Get the index of params that match the condition then filter it out. You'll just have to reset_index()
afterwards if you wish to do so.
df3 = df1.merge(df2, how = 'inner', on = ['id','nr']).reset_index()
id_list = df3['id'].tolist()
df4 = df1[~df1['id'].isin(id_list)]
answered Nov 21 '18 at 21:35
Matías RomoMatías Romo
463
463
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53414888%2fefficiently-dropping-rows-from-a-dataframe-if-some-row-values-are-identical-with%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown