Efficiently dropping rows from a DataFrame if some row-values are identical with row-values in a second...

import pandas as pd



df1 = pd.DataFrame({'id':   [ 1,  1,  1,  2,  2,  2,  3,  3,  3], 

                    'nr':   [91, 92, 93, 91, 92, 93, 91, 92, 93], 

                    'val_a':[22, 23, 24, 33, 34, 35, 44, 43, 42]})



df2 = pd.DataFrame({'id':   [ 1,  1,  2,  3,  4,  4,  3,  5], 

                    'nr':   [91, 92, 91, 99, 92, 93, 92, 99], 

                    'val_a':[72, 27, 74, 83, 84, 85, 84, 83]})



def eliminate1 (): 

    for i1, row1 in df1.iterrows():

        for i2, row2 in df2.iterrows():

            if row1['id'] == row2['id'] and row1['nr'] == row2['nr']:

                df1.drop(i1, inplace=True)

    df1.reset_index(drop=True, inplace=True)

    print(df1)



eliminate1()

I want to drop all rows from df1, where 'id' AND 'nr' have equal values in any row of df2. eliminate1() works well, see result below, but is very slow in case of large data sets.

Here are df1 and df2:

   id  nr  val_a

0   1  91     22

1   1  92     23

2   1  93     24

3   2  91     33

4   2  92     34

5   2  93     35

6   3  91     44

7   3  92     43

8   3  93     42 



   id  nr  val_a

0   1  91     72

1   1  92     27

2   2  91     74

3   3  99     83

4   4  92     84

5   4  93     85

6   3  92     84

7   5  99     83

And here the result as it should look like:

   id  nr  val_a

0   1  93     24

1   2  92     34

2   2  93     35

3   3  91     44

4   3  93     42

Does anyone know how to write a faster code and/or use an already existing function?

edited Nov 21 '18 at 15:12

jpp

95.1k2157108

asked Nov 21 '18 at 15:03

UweD

1495

add a comment |

import pandas as pd



df1 = pd.DataFrame({'id':   [ 1,  1,  1,  2,  2,  2,  3,  3,  3], 

                    'nr':   [91, 92, 93, 91, 92, 93, 91, 92, 93], 

                    'val_a':[22, 23, 24, 33, 34, 35, 44, 43, 42]})



df2 = pd.DataFrame({'id':   [ 1,  1,  2,  3,  4,  4,  3,  5], 

                    'nr':   [91, 92, 91, 99, 92, 93, 92, 99], 

                    'val_a':[72, 27, 74, 83, 84, 85, 84, 83]})



def eliminate1 (): 

    for i1, row1 in df1.iterrows():

        for i2, row2 in df2.iterrows():

            if row1['id'] == row2['id'] and row1['nr'] == row2['nr']:

                df1.drop(i1, inplace=True)

    df1.reset_index(drop=True, inplace=True)

    print(df1)



eliminate1()

I want to drop all rows from df1, where 'id' AND 'nr' have equal values in any row of df2. eliminate1() works well, see result below, but is very slow in case of large data sets.

Here are df1 and df2:

   id  nr  val_a

0   1  91     22

1   1  92     23

2   1  93     24

3   2  91     33

4   2  92     34

5   2  93     35

6   3  91     44

7   3  92     43

8   3  93     42 



   id  nr  val_a

0   1  91     72

1   1  92     27

2   2  91     74

3   3  99     83

4   4  92     84

5   4  93     85

6   3  92     84

7   5  99     83

And here the result as it should look like:

   id  nr  val_a

0   1  93     24

1   2  92     34

2   2  93     35

3   3  91     44

4   3  93     42

Does anyone know how to write a faster code and/or use an already existing function?

edited Nov 21 '18 at 15:12

jpp

95.1k2157108

asked Nov 21 '18 at 15:03

UweD

1495

add a comment |

import pandas as pd



df1 = pd.DataFrame({'id':   [ 1,  1,  1,  2,  2,  2,  3,  3,  3], 

                    'nr':   [91, 92, 93, 91, 92, 93, 91, 92, 93], 

                    'val_a':[22, 23, 24, 33, 34, 35, 44, 43, 42]})



df2 = pd.DataFrame({'id':   [ 1,  1,  2,  3,  4,  4,  3,  5], 

                    'nr':   [91, 92, 91, 99, 92, 93, 92, 99], 

                    'val_a':[72, 27, 74, 83, 84, 85, 84, 83]})



def eliminate1 (): 

    for i1, row1 in df1.iterrows():

        for i2, row2 in df2.iterrows():

            if row1['id'] == row2['id'] and row1['nr'] == row2['nr']:

                df1.drop(i1, inplace=True)

    df1.reset_index(drop=True, inplace=True)

    print(df1)



eliminate1()

I want to drop all rows from df1, where 'id' AND 'nr' have equal values in any row of df2. eliminate1() works well, see result below, but is very slow in case of large data sets.

Here are df1 and df2:

   id  nr  val_a

0   1  91     22

1   1  92     23

2   1  93     24

3   2  91     33

4   2  92     34

5   2  93     35

6   3  91     44

7   3  92     43

8   3  93     42 



   id  nr  val_a

0   1  91     72

1   1  92     27

2   2  91     74

3   3  99     83

4   4  92     84

5   4  93     85

6   3  92     84

7   5  99     83

And here the result as it should look like:

   id  nr  val_a

0   1  93     24

1   2  92     34

2   2  93     35

3   3  91     44

4   3  93     42

Does anyone know how to write a faster code and/or use an already existing function?

edited Nov 21 '18 at 15:12

jpp

95.1k2157108

asked Nov 21 '18 at 15:03

UweD

1495

import pandas as pd



df1 = pd.DataFrame({'id':   [ 1,  1,  1,  2,  2,  2,  3,  3,  3], 

                    'nr':   [91, 92, 93, 91, 92, 93, 91, 92, 93], 

                    'val_a':[22, 23, 24, 33, 34, 35, 44, 43, 42]})



df2 = pd.DataFrame({'id':   [ 1,  1,  2,  3,  4,  4,  3,  5], 

                    'nr':   [91, 92, 91, 99, 92, 93, 92, 99], 

                    'val_a':[72, 27, 74, 83, 84, 85, 84, 83]})



def eliminate1 (): 

    for i1, row1 in df1.iterrows():

        for i2, row2 in df2.iterrows():

            if row1['id'] == row2['id'] and row1['nr'] == row2['nr']:

                df1.drop(i1, inplace=True)

    df1.reset_index(drop=True, inplace=True)

    print(df1)



eliminate1()

I want to drop all rows from df1, where 'id' AND 'nr' have equal values in any row of df2. eliminate1() works well, see result below, but is very slow in case of large data sets.

Here are df1 and df2:

   id  nr  val_a

0   1  91     22

1   1  92     23

2   1  93     24

3   2  91     33

4   2  92     34

5   2  93     35

6   3  91     44

7   3  92     43

8   3  93     42 



   id  nr  val_a

0   1  91     72

1   1  92     27

2   2  91     74

3   3  99     83

4   4  92     84

5   4  93     85

6   3  92     84

7   5  99     83

And here the result as it should look like:

   id  nr  val_a

0   1  93     24

1   2  92     34

2   2  93     35

3   3  91     44

4   3  93     42

Does anyone know how to write a faster code and/or use an already existing function?

python pandas performance dataframe

edited Nov 21 '18 at 15:12

jpp

95.1k2157108

asked Nov 21 '18 at 15:03

UweD

1495

edited Nov 21 '18 at 15:12

jpp

95.1k2157108

asked Nov 21 '18 at 15:03

UweD

1495

edited Nov 21 '18 at 15:12

jpp

95.1k2157108

edited Nov 21 '18 at 15:12

jpp

95.1k2157108

edited Nov 21 '18 at 15:12

jpp

95.1k2157108

asked Nov 21 '18 at 15:03

UweD

1495

asked Nov 21 '18 at 15:03

UweD

1495

asked Nov 21 '18 at 15:03

UweD

1495

add a comment |

3 Answers
3

active

oldest

votes

`merge`

You can merge with indicator=True and include only those rows marked 'left_only'.

res = df1.merge(df2.drop('val_a', 1), how='left', on=['id', 'nr'], indicator=True)

res = res.loc[res['_merge'] == 'left_only'].drop('_merge', 1)



print(res)



   id  nr  val_a

2   1  93     24

4   2  92     34

5   2  93     35

6   3  91     44

8   3  93     42

The solution is easily adaptable to any condition depending on 'left_only', 'right_only' or 'both'.

answered Nov 21 '18 at 15:10

jpp

95.1k2157108

add a comment |

Method 1 isin after zip the merge column into tuple

df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]

Out[43]: 

   id  nr  val_a

2   1  93     24

4   2  92     34

5   2  93     35

6   3  91     44

8   3  93     42

Method 2 numpy broadcast

s1=df1[['id','nr']].values

s2=df2[['id','nr']].values

df1[~np.any(np.all(s1==s2[:,None],-1),0)]

Out[64]: 

   id  nr  val_a

2   1  93     24

4   2  92     34

5   2  93     35

6   3  91     44

8   3  93     42

My method timing

%timeit df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]

100 loops, best of 3: 3.67 ms per loop

def m2():

    s1 = df1[['id', 'nr']].values

    s2 = df2[['id', 'nr']].values

    return df1[~np.any(np.all(s1 == s2[:, None], -1), 0)]

%timeit m2()

1000 loops, best of 3: 926 µs per loop

edited Nov 21 '18 at 15:29

answered Nov 21 '18 at 15:24

W-B

104k73165

Amazingly fast as compared to my original code!

– UweD
Nov 22 '18 at 9:11

add a comment |

Would an inner join solve your problem? Get the index of params that match the condition then filter it out. You'll just have to reset_index() afterwards if you wish to do so.

df3 = df1.merge(df2, how = 'inner', on = ['id','nr']).reset_index()

id_list = df3['id'].tolist()

df4 = df1[~df1['id'].isin(id_list)]

answered Nov 21 '18 at 21:35

Matías Romo

463

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53414888%2fefficiently-dropping-rows-from-a-dataframe-if-some-row-values-are-identical-with%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

`merge`

You can merge with indicator=True and include only those rows marked 'left_only'.

res = df1.merge(df2.drop('val_a', 1), how='left', on=['id', 'nr'], indicator=True)

res = res.loc[res['_merge'] == 'left_only'].drop('_merge', 1)



print(res)



   id  nr  val_a

2   1  93     24

4   2  92     34

5   2  93     35

6   3  91     44

8   3  93     42

The solution is easily adaptable to any condition depending on 'left_only', 'right_only' or 'both'.

answered Nov 21 '18 at 15:10

jpp

95.1k2157108

add a comment |

`merge`

You can merge with indicator=True and include only those rows marked 'left_only'.

res = df1.merge(df2.drop('val_a', 1), how='left', on=['id', 'nr'], indicator=True)

res = res.loc[res['_merge'] == 'left_only'].drop('_merge', 1)



print(res)



   id  nr  val_a

2   1  93     24

4   2  92     34

5   2  93     35

6   3  91     44

8   3  93     42

The solution is easily adaptable to any condition depending on 'left_only', 'right_only' or 'both'.

answered Nov 21 '18 at 15:10

jpp

95.1k2157108

add a comment |

`merge`

You can merge with indicator=True and include only those rows marked 'left_only'.

res = df1.merge(df2.drop('val_a', 1), how='left', on=['id', 'nr'], indicator=True)

res = res.loc[res['_merge'] == 'left_only'].drop('_merge', 1)



print(res)



   id  nr  val_a

2   1  93     24

4   2  92     34

5   2  93     35

6   3  91     44

8   3  93     42

The solution is easily adaptable to any condition depending on 'left_only', 'right_only' or 'both'.

answered Nov 21 '18 at 15:10

jpp

95.1k2157108

`merge`

You can merge with indicator=True and include only those rows marked 'left_only'.

res = df1.merge(df2.drop('val_a', 1), how='left', on=['id', 'nr'], indicator=True)

res = res.loc[res['_merge'] == 'left_only'].drop('_merge', 1)



print(res)



   id  nr  val_a

2   1  93     24

4   2  92     34

5   2  93     35

6   3  91     44

8   3  93     42

The solution is easily adaptable to any condition depending on 'left_only', 'right_only' or 'both'.

answered Nov 21 '18 at 15:10

jpp

95.1k2157108

answered Nov 21 '18 at 15:10

jpp

95.1k2157108

answered Nov 21 '18 at 15:10

jpp

95.1k2157108

answered Nov 21 '18 at 15:10

jpp

95.1k2157108

add a comment |

Method 1 isin after zip the merge column into tuple

df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]

Out[43]: 

   id  nr  val_a

2   1  93     24

4   2  92     34

5   2  93     35

6   3  91     44

8   3  93     42

Method 2 numpy broadcast

s1=df1[['id','nr']].values

s2=df2[['id','nr']].values

df1[~np.any(np.all(s1==s2[:,None],-1),0)]

Out[64]: 

   id  nr  val_a

2   1  93     24

4   2  92     34

5   2  93     35

6   3  91     44

8   3  93     42

My method timing

%timeit df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]

100 loops, best of 3: 3.67 ms per loop

def m2():

    s1 = df1[['id', 'nr']].values

    s2 = df2[['id', 'nr']].values

    return df1[~np.any(np.all(s1 == s2[:, None], -1), 0)]

%timeit m2()

1000 loops, best of 3: 926 µs per loop

edited Nov 21 '18 at 15:29

answered Nov 21 '18 at 15:24

W-B

104k73165

Amazingly fast as compared to my original code!

– UweD
Nov 22 '18 at 9:11

add a comment |

Method 1 isin after zip the merge column into tuple

df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]

Out[43]: 

   id  nr  val_a

2   1  93     24

4   2  92     34

5   2  93     35

6   3  91     44

8   3  93     42

Method 2 numpy broadcast

s1=df1[['id','nr']].values

s2=df2[['id','nr']].values

df1[~np.any(np.all(s1==s2[:,None],-1),0)]

Out[64]: 

   id  nr  val_a

2   1  93     24

4   2  92     34

5   2  93     35

6   3  91     44

8   3  93     42

My method timing

%timeit df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]

100 loops, best of 3: 3.67 ms per loop

def m2():

    s1 = df1[['id', 'nr']].values

    s2 = df2[['id', 'nr']].values

    return df1[~np.any(np.all(s1 == s2[:, None], -1), 0)]

%timeit m2()

1000 loops, best of 3: 926 µs per loop

edited Nov 21 '18 at 15:29

answered Nov 21 '18 at 15:24

W-B

104k73165

Amazingly fast as compared to my original code!

– UweD
Nov 22 '18 at 9:11

add a comment |

Method 1 isin after zip the merge column into tuple

df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]

Out[43]: 

   id  nr  val_a

2   1  93     24

4   2  92     34

5   2  93     35

6   3  91     44

8   3  93     42

Method 2 numpy broadcast

s1=df1[['id','nr']].values

s2=df2[['id','nr']].values

df1[~np.any(np.all(s1==s2[:,None],-1),0)]

Out[64]: 

   id  nr  val_a

2   1  93     24

4   2  92     34

5   2  93     35

6   3  91     44

8   3  93     42

My method timing

%timeit df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]

100 loops, best of 3: 3.67 ms per loop

def m2():

    s1 = df1[['id', 'nr']].values

    s2 = df2[['id', 'nr']].values

    return df1[~np.any(np.all(s1 == s2[:, None], -1), 0)]

%timeit m2()

1000 loops, best of 3: 926 µs per loop

edited Nov 21 '18 at 15:29

answered Nov 21 '18 at 15:24

W-B

104k73165

Method 1 isin after zip the merge column into tuple

df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]

Out[43]: 

   id  nr  val_a

2   1  93     24

4   2  92     34

5   2  93     35

6   3  91     44

8   3  93     42

Method 2 numpy broadcast

s1=df1[['id','nr']].values

s2=df2[['id','nr']].values

df1[~np.any(np.all(s1==s2[:,None],-1),0)]

Out[64]: 

   id  nr  val_a

2   1  93     24

4   2  92     34

5   2  93     35

6   3  91     44

8   3  93     42

My method timing

%timeit df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]

100 loops, best of 3: 3.67 ms per loop

def m2():

    s1 = df1[['id', 'nr']].values

    s2 = df2[['id', 'nr']].values

    return df1[~np.any(np.all(s1 == s2[:, None], -1), 0)]

%timeit m2()

1000 loops, best of 3: 926 µs per loop

edited Nov 21 '18 at 15:29

answered Nov 21 '18 at 15:24

W-B

104k73165

edited Nov 21 '18 at 15:29

answered Nov 21 '18 at 15:24

W-B

104k73165

answered Nov 21 '18 at 15:24

W-B

104k73165

answered Nov 21 '18 at 15:24

W-B

104k73165

Amazingly fast as compared to my original code!

– UweD
Nov 22 '18 at 9:11

add a comment |

Amazingly fast as compared to my original code!

– UweD
Nov 22 '18 at 9:11

Amazingly fast as compared to my original code!

– UweD
Nov 22 '18 at 9:11

add a comment |

Would an inner join solve your problem? Get the index of params that match the condition then filter it out. You'll just have to reset_index() afterwards if you wish to do so.

df3 = df1.merge(df2, how = 'inner', on = ['id','nr']).reset_index()

id_list = df3['id'].tolist()

df4 = df1[~df1['id'].isin(id_list)]

answered Nov 21 '18 at 21:35

Matías Romo

463

add a comment |

Would an inner join solve your problem? Get the index of params that match the condition then filter it out. You'll just have to reset_index() afterwards if you wish to do so.

df3 = df1.merge(df2, how = 'inner', on = ['id','nr']).reset_index()

id_list = df3['id'].tolist()

df4 = df1[~df1['id'].isin(id_list)]

answered Nov 21 '18 at 21:35

Matías Romo

463

add a comment |

Would an inner join solve your problem? Get the index of params that match the condition then filter it out. You'll just have to reset_index() afterwards if you wish to do so.

df3 = df1.merge(df2, how = 'inner', on = ['id','nr']).reset_index()

id_list = df3['id'].tolist()

df4 = df1[~df1['id'].isin(id_list)]

answered Nov 21 '18 at 21:35

Matías Romo

463

Would an inner join solve your problem? Get the index of params that match the condition then filter it out. You'll just have to reset_index() afterwards if you wish to do so.

df3 = df1.merge(df2, how = 'inner', on = ['id','nr']).reset_index()

id_list = df3['id'].tolist()

df4 = df1[~df1['id'].isin(id_list)]

answered Nov 21 '18 at 21:35

Matías Romo

463

answered Nov 21 '18 at 21:35

Matías Romo

463

answered Nov 21 '18 at 21:35

Matías Romo

463

answered Nov 21 '18 at 21:35

Matías Romo

463

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

Zbgx5Q,a,lRRRE2lEEn7JZ6VJz8QsotHRO

搜尋此網誌

Nsryjdtyk