Efficiently dropping rows from a DataFrame if some row-values are identical with row-values in a second...












3















import pandas as pd

df1 = pd.DataFrame({'id': [ 1, 1, 1, 2, 2, 2, 3, 3, 3],
'nr': [91, 92, 93, 91, 92, 93, 91, 92, 93],
'val_a':[22, 23, 24, 33, 34, 35, 44, 43, 42]})

df2 = pd.DataFrame({'id': [ 1, 1, 2, 3, 4, 4, 3, 5],
'nr': [91, 92, 91, 99, 92, 93, 92, 99],
'val_a':[72, 27, 74, 83, 84, 85, 84, 83]})

def eliminate1 ():
for i1, row1 in df1.iterrows():
for i2, row2 in df2.iterrows():
if row1['id'] == row2['id'] and row1['nr'] == row2['nr']:
df1.drop(i1, inplace=True)
df1.reset_index(drop=True, inplace=True)
print(df1)

eliminate1()


I want to drop all rows from df1, where 'id' AND 'nr' have equal values in any row of df2. eliminate1() works well, see result below, but is very slow in case of large data sets.



Here are df1 and df2:



   id  nr  val_a
0 1 91 22
1 1 92 23
2 1 93 24
3 2 91 33
4 2 92 34
5 2 93 35
6 3 91 44
7 3 92 43
8 3 93 42

id nr val_a
0 1 91 72
1 1 92 27
2 2 91 74
3 3 99 83
4 4 92 84
5 4 93 85
6 3 92 84
7 5 99 83


And here the result as it should look like:



   id  nr  val_a
0 1 93 24
1 2 92 34
2 2 93 35
3 3 91 44
4 3 93 42


Does anyone know how to write a faster code and/or use an already existing function?










share|improve this question





























    3















    import pandas as pd

    df1 = pd.DataFrame({'id': [ 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'nr': [91, 92, 93, 91, 92, 93, 91, 92, 93],
    'val_a':[22, 23, 24, 33, 34, 35, 44, 43, 42]})

    df2 = pd.DataFrame({'id': [ 1, 1, 2, 3, 4, 4, 3, 5],
    'nr': [91, 92, 91, 99, 92, 93, 92, 99],
    'val_a':[72, 27, 74, 83, 84, 85, 84, 83]})

    def eliminate1 ():
    for i1, row1 in df1.iterrows():
    for i2, row2 in df2.iterrows():
    if row1['id'] == row2['id'] and row1['nr'] == row2['nr']:
    df1.drop(i1, inplace=True)
    df1.reset_index(drop=True, inplace=True)
    print(df1)

    eliminate1()


    I want to drop all rows from df1, where 'id' AND 'nr' have equal values in any row of df2. eliminate1() works well, see result below, but is very slow in case of large data sets.



    Here are df1 and df2:



       id  nr  val_a
    0 1 91 22
    1 1 92 23
    2 1 93 24
    3 2 91 33
    4 2 92 34
    5 2 93 35
    6 3 91 44
    7 3 92 43
    8 3 93 42

    id nr val_a
    0 1 91 72
    1 1 92 27
    2 2 91 74
    3 3 99 83
    4 4 92 84
    5 4 93 85
    6 3 92 84
    7 5 99 83


    And here the result as it should look like:



       id  nr  val_a
    0 1 93 24
    1 2 92 34
    2 2 93 35
    3 3 91 44
    4 3 93 42


    Does anyone know how to write a faster code and/or use an already existing function?










    share|improve this question



























      3












      3








      3








      import pandas as pd

      df1 = pd.DataFrame({'id': [ 1, 1, 1, 2, 2, 2, 3, 3, 3],
      'nr': [91, 92, 93, 91, 92, 93, 91, 92, 93],
      'val_a':[22, 23, 24, 33, 34, 35, 44, 43, 42]})

      df2 = pd.DataFrame({'id': [ 1, 1, 2, 3, 4, 4, 3, 5],
      'nr': [91, 92, 91, 99, 92, 93, 92, 99],
      'val_a':[72, 27, 74, 83, 84, 85, 84, 83]})

      def eliminate1 ():
      for i1, row1 in df1.iterrows():
      for i2, row2 in df2.iterrows():
      if row1['id'] == row2['id'] and row1['nr'] == row2['nr']:
      df1.drop(i1, inplace=True)
      df1.reset_index(drop=True, inplace=True)
      print(df1)

      eliminate1()


      I want to drop all rows from df1, where 'id' AND 'nr' have equal values in any row of df2. eliminate1() works well, see result below, but is very slow in case of large data sets.



      Here are df1 and df2:



         id  nr  val_a
      0 1 91 22
      1 1 92 23
      2 1 93 24
      3 2 91 33
      4 2 92 34
      5 2 93 35
      6 3 91 44
      7 3 92 43
      8 3 93 42

      id nr val_a
      0 1 91 72
      1 1 92 27
      2 2 91 74
      3 3 99 83
      4 4 92 84
      5 4 93 85
      6 3 92 84
      7 5 99 83


      And here the result as it should look like:



         id  nr  val_a
      0 1 93 24
      1 2 92 34
      2 2 93 35
      3 3 91 44
      4 3 93 42


      Does anyone know how to write a faster code and/or use an already existing function?










      share|improve this question
















      import pandas as pd

      df1 = pd.DataFrame({'id': [ 1, 1, 1, 2, 2, 2, 3, 3, 3],
      'nr': [91, 92, 93, 91, 92, 93, 91, 92, 93],
      'val_a':[22, 23, 24, 33, 34, 35, 44, 43, 42]})

      df2 = pd.DataFrame({'id': [ 1, 1, 2, 3, 4, 4, 3, 5],
      'nr': [91, 92, 91, 99, 92, 93, 92, 99],
      'val_a':[72, 27, 74, 83, 84, 85, 84, 83]})

      def eliminate1 ():
      for i1, row1 in df1.iterrows():
      for i2, row2 in df2.iterrows():
      if row1['id'] == row2['id'] and row1['nr'] == row2['nr']:
      df1.drop(i1, inplace=True)
      df1.reset_index(drop=True, inplace=True)
      print(df1)

      eliminate1()


      I want to drop all rows from df1, where 'id' AND 'nr' have equal values in any row of df2. eliminate1() works well, see result below, but is very slow in case of large data sets.



      Here are df1 and df2:



         id  nr  val_a
      0 1 91 22
      1 1 92 23
      2 1 93 24
      3 2 91 33
      4 2 92 34
      5 2 93 35
      6 3 91 44
      7 3 92 43
      8 3 93 42

      id nr val_a
      0 1 91 72
      1 1 92 27
      2 2 91 74
      3 3 99 83
      4 4 92 84
      5 4 93 85
      6 3 92 84
      7 5 99 83


      And here the result as it should look like:



         id  nr  val_a
      0 1 93 24
      1 2 92 34
      2 2 93 35
      3 3 91 44
      4 3 93 42


      Does anyone know how to write a faster code and/or use an already existing function?







      python pandas performance dataframe






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 21 '18 at 15:12









      jpp

      95.1k2157108




      95.1k2157108










      asked Nov 21 '18 at 15:03









      UweDUweD

      1495




      1495
























          3 Answers
          3






          active

          oldest

          votes


















          4














          merge



          You can merge with indicator=True and include only those rows marked 'left_only'.



          res = df1.merge(df2.drop('val_a', 1), how='left', on=['id', 'nr'], indicator=True)
          res = res.loc[res['_merge'] == 'left_only'].drop('_merge', 1)

          print(res)

          id nr val_a
          2 1 93 24
          4 2 92 34
          5 2 93 35
          6 3 91 44
          8 3 93 42


          The solution is easily adaptable to any condition depending on 'left_only', 'right_only' or 'both'.






          share|improve this answer































            4














            Method 1 isin after zip the merge column into tuple



            df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]
            Out[43]:
            id nr val_a
            2 1 93 24
            4 2 92 34
            5 2 93 35
            6 3 91 44
            8 3 93 42


            Method 2 numpy broadcast



            s1=df1[['id','nr']].values
            s2=df2[['id','nr']].values
            df1[~np.any(np.all(s1==s2[:,None],-1),0)]
            Out[64]:
            id nr val_a
            2 1 93 24
            4 2 92 34
            5 2 93 35
            6 3 91 44
            8 3 93 42


            My method timing



            %timeit df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]
            100 loops, best of 3: 3.67 ms per loop
            def m2():
            s1 = df1[['id', 'nr']].values
            s2 = df2[['id', 'nr']].values
            return df1[~np.any(np.all(s1 == s2[:, None], -1), 0)]
            %timeit m2()
            1000 loops, best of 3: 926 µs per loop





            share|improve this answer


























            • Amazingly fast as compared to my original code!

              – UweD
              Nov 22 '18 at 9:11



















            1














            Would an inner join solve your problem? Get the index of params that match the condition then filter it out. You'll just have to reset_index() afterwards if you wish to do so.



            df3 = df1.merge(df2, how = 'inner', on = ['id','nr']).reset_index()
            id_list = df3['id'].tolist()
            df4 = df1[~df1['id'].isin(id_list)]





            share|improve this answer























              Your Answer






              StackExchange.ifUsing("editor", function () {
              StackExchange.using("externalEditor", function () {
              StackExchange.using("snippets", function () {
              StackExchange.snippets.init();
              });
              });
              }, "code-snippets");

              StackExchange.ready(function() {
              var channelOptions = {
              tags: "".split(" "),
              id: "1"
              };
              initTagRenderer("".split(" "), "".split(" "), channelOptions);

              StackExchange.using("externalEditor", function() {
              // Have to fire editor after snippets, if snippets enabled
              if (StackExchange.settings.snippets.snippetsEnabled) {
              StackExchange.using("snippets", function() {
              createEditor();
              });
              }
              else {
              createEditor();
              }
              });

              function createEditor() {
              StackExchange.prepareEditor({
              heartbeatType: 'answer',
              autoActivateHeartbeat: false,
              convertImagesToLinks: true,
              noModals: true,
              showLowRepImageUploadWarning: true,
              reputationToPostImages: 10,
              bindNavPrevention: true,
              postfix: "",
              imageUploader: {
              brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
              contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
              allowUrls: true
              },
              onDemand: true,
              discardSelector: ".discard-answer"
              ,immediatelyShowMarkdownHelp:true
              });


              }
              });














              draft saved

              draft discarded


















              StackExchange.ready(
              function () {
              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53414888%2fefficiently-dropping-rows-from-a-dataframe-if-some-row-values-are-identical-with%23new-answer', 'question_page');
              }
              );

              Post as a guest















              Required, but never shown

























              3 Answers
              3






              active

              oldest

              votes








              3 Answers
              3






              active

              oldest

              votes









              active

              oldest

              votes






              active

              oldest

              votes









              4














              merge



              You can merge with indicator=True and include only those rows marked 'left_only'.



              res = df1.merge(df2.drop('val_a', 1), how='left', on=['id', 'nr'], indicator=True)
              res = res.loc[res['_merge'] == 'left_only'].drop('_merge', 1)

              print(res)

              id nr val_a
              2 1 93 24
              4 2 92 34
              5 2 93 35
              6 3 91 44
              8 3 93 42


              The solution is easily adaptable to any condition depending on 'left_only', 'right_only' or 'both'.






              share|improve this answer




























                4














                merge



                You can merge with indicator=True and include only those rows marked 'left_only'.



                res = df1.merge(df2.drop('val_a', 1), how='left', on=['id', 'nr'], indicator=True)
                res = res.loc[res['_merge'] == 'left_only'].drop('_merge', 1)

                print(res)

                id nr val_a
                2 1 93 24
                4 2 92 34
                5 2 93 35
                6 3 91 44
                8 3 93 42


                The solution is easily adaptable to any condition depending on 'left_only', 'right_only' or 'both'.






                share|improve this answer


























                  4












                  4








                  4







                  merge



                  You can merge with indicator=True and include only those rows marked 'left_only'.



                  res = df1.merge(df2.drop('val_a', 1), how='left', on=['id', 'nr'], indicator=True)
                  res = res.loc[res['_merge'] == 'left_only'].drop('_merge', 1)

                  print(res)

                  id nr val_a
                  2 1 93 24
                  4 2 92 34
                  5 2 93 35
                  6 3 91 44
                  8 3 93 42


                  The solution is easily adaptable to any condition depending on 'left_only', 'right_only' or 'both'.






                  share|improve this answer













                  merge



                  You can merge with indicator=True and include only those rows marked 'left_only'.



                  res = df1.merge(df2.drop('val_a', 1), how='left', on=['id', 'nr'], indicator=True)
                  res = res.loc[res['_merge'] == 'left_only'].drop('_merge', 1)

                  print(res)

                  id nr val_a
                  2 1 93 24
                  4 2 92 34
                  5 2 93 35
                  6 3 91 44
                  8 3 93 42


                  The solution is easily adaptable to any condition depending on 'left_only', 'right_only' or 'both'.







                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered Nov 21 '18 at 15:10









                  jppjpp

                  95.1k2157108




                  95.1k2157108

























                      4














                      Method 1 isin after zip the merge column into tuple



                      df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]
                      Out[43]:
                      id nr val_a
                      2 1 93 24
                      4 2 92 34
                      5 2 93 35
                      6 3 91 44
                      8 3 93 42


                      Method 2 numpy broadcast



                      s1=df1[['id','nr']].values
                      s2=df2[['id','nr']].values
                      df1[~np.any(np.all(s1==s2[:,None],-1),0)]
                      Out[64]:
                      id nr val_a
                      2 1 93 24
                      4 2 92 34
                      5 2 93 35
                      6 3 91 44
                      8 3 93 42


                      My method timing



                      %timeit df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]
                      100 loops, best of 3: 3.67 ms per loop
                      def m2():
                      s1 = df1[['id', 'nr']].values
                      s2 = df2[['id', 'nr']].values
                      return df1[~np.any(np.all(s1 == s2[:, None], -1), 0)]
                      %timeit m2()
                      1000 loops, best of 3: 926 µs per loop





                      share|improve this answer


























                      • Amazingly fast as compared to my original code!

                        – UweD
                        Nov 22 '18 at 9:11
















                      4














                      Method 1 isin after zip the merge column into tuple



                      df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]
                      Out[43]:
                      id nr val_a
                      2 1 93 24
                      4 2 92 34
                      5 2 93 35
                      6 3 91 44
                      8 3 93 42


                      Method 2 numpy broadcast



                      s1=df1[['id','nr']].values
                      s2=df2[['id','nr']].values
                      df1[~np.any(np.all(s1==s2[:,None],-1),0)]
                      Out[64]:
                      id nr val_a
                      2 1 93 24
                      4 2 92 34
                      5 2 93 35
                      6 3 91 44
                      8 3 93 42


                      My method timing



                      %timeit df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]
                      100 loops, best of 3: 3.67 ms per loop
                      def m2():
                      s1 = df1[['id', 'nr']].values
                      s2 = df2[['id', 'nr']].values
                      return df1[~np.any(np.all(s1 == s2[:, None], -1), 0)]
                      %timeit m2()
                      1000 loops, best of 3: 926 µs per loop





                      share|improve this answer


























                      • Amazingly fast as compared to my original code!

                        – UweD
                        Nov 22 '18 at 9:11














                      4












                      4








                      4







                      Method 1 isin after zip the merge column into tuple



                      df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]
                      Out[43]:
                      id nr val_a
                      2 1 93 24
                      4 2 92 34
                      5 2 93 35
                      6 3 91 44
                      8 3 93 42


                      Method 2 numpy broadcast



                      s1=df1[['id','nr']].values
                      s2=df2[['id','nr']].values
                      df1[~np.any(np.all(s1==s2[:,None],-1),0)]
                      Out[64]:
                      id nr val_a
                      2 1 93 24
                      4 2 92 34
                      5 2 93 35
                      6 3 91 44
                      8 3 93 42


                      My method timing



                      %timeit df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]
                      100 loops, best of 3: 3.67 ms per loop
                      def m2():
                      s1 = df1[['id', 'nr']].values
                      s2 = df2[['id', 'nr']].values
                      return df1[~np.any(np.all(s1 == s2[:, None], -1), 0)]
                      %timeit m2()
                      1000 loops, best of 3: 926 µs per loop





                      share|improve this answer















                      Method 1 isin after zip the merge column into tuple



                      df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]
                      Out[43]:
                      id nr val_a
                      2 1 93 24
                      4 2 92 34
                      5 2 93 35
                      6 3 91 44
                      8 3 93 42


                      Method 2 numpy broadcast



                      s1=df1[['id','nr']].values
                      s2=df2[['id','nr']].values
                      df1[~np.any(np.all(s1==s2[:,None],-1),0)]
                      Out[64]:
                      id nr val_a
                      2 1 93 24
                      4 2 92 34
                      5 2 93 35
                      6 3 91 44
                      8 3 93 42


                      My method timing



                      %timeit df1[~df1[['id','nr']].apply(tuple,1).isin(df2[['id','nr']].apply(tuple,1))]
                      100 loops, best of 3: 3.67 ms per loop
                      def m2():
                      s1 = df1[['id', 'nr']].values
                      s2 = df2[['id', 'nr']].values
                      return df1[~np.any(np.all(s1 == s2[:, None], -1), 0)]
                      %timeit m2()
                      1000 loops, best of 3: 926 µs per loop






                      share|improve this answer














                      share|improve this answer



                      share|improve this answer








                      edited Nov 21 '18 at 15:29

























                      answered Nov 21 '18 at 15:24









                      W-BW-B

                      104k73165




                      104k73165













                      • Amazingly fast as compared to my original code!

                        – UweD
                        Nov 22 '18 at 9:11



















                      • Amazingly fast as compared to my original code!

                        – UweD
                        Nov 22 '18 at 9:11

















                      Amazingly fast as compared to my original code!

                      – UweD
                      Nov 22 '18 at 9:11





                      Amazingly fast as compared to my original code!

                      – UweD
                      Nov 22 '18 at 9:11











                      1














                      Would an inner join solve your problem? Get the index of params that match the condition then filter it out. You'll just have to reset_index() afterwards if you wish to do so.



                      df3 = df1.merge(df2, how = 'inner', on = ['id','nr']).reset_index()
                      id_list = df3['id'].tolist()
                      df4 = df1[~df1['id'].isin(id_list)]





                      share|improve this answer




























                        1














                        Would an inner join solve your problem? Get the index of params that match the condition then filter it out. You'll just have to reset_index() afterwards if you wish to do so.



                        df3 = df1.merge(df2, how = 'inner', on = ['id','nr']).reset_index()
                        id_list = df3['id'].tolist()
                        df4 = df1[~df1['id'].isin(id_list)]





                        share|improve this answer


























                          1












                          1








                          1







                          Would an inner join solve your problem? Get the index of params that match the condition then filter it out. You'll just have to reset_index() afterwards if you wish to do so.



                          df3 = df1.merge(df2, how = 'inner', on = ['id','nr']).reset_index()
                          id_list = df3['id'].tolist()
                          df4 = df1[~df1['id'].isin(id_list)]





                          share|improve this answer













                          Would an inner join solve your problem? Get the index of params that match the condition then filter it out. You'll just have to reset_index() afterwards if you wish to do so.



                          df3 = df1.merge(df2, how = 'inner', on = ['id','nr']).reset_index()
                          id_list = df3['id'].tolist()
                          df4 = df1[~df1['id'].isin(id_list)]






                          share|improve this answer












                          share|improve this answer



                          share|improve this answer










                          answered Nov 21 '18 at 21:35









                          Matías RomoMatías Romo

                          463




                          463






























                              draft saved

                              draft discarded




















































                              Thanks for contributing an answer to Stack Overflow!


                              • Please be sure to answer the question. Provide details and share your research!

                              But avoid



                              • Asking for help, clarification, or responding to other answers.

                              • Making statements based on opinion; back them up with references or personal experience.


                              To learn more, see our tips on writing great answers.




                              draft saved


                              draft discarded














                              StackExchange.ready(
                              function () {
                              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53414888%2fefficiently-dropping-rows-from-a-dataframe-if-some-row-values-are-identical-with%23new-answer', 'question_page');
                              }
                              );

                              Post as a guest















                              Required, but never shown





















































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown

































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown







                              Popular posts from this blog

                              Costa Masnaga

                              Fotorealismo

                              Sidney Franklin