Pandas Dataframe selecting groups with minimal cardinality























I have a problem where I need to take groups of rows from a DataFrame where the number of rows in a group exceeds a certain cutoff. For those groups, I need to take some head rows and the tail row.



I am using the code below:



# Keep only groups with more than headRows rows (this is the slow line)
train = train[train.groupby('id').id.transform(len) > headRows]
# Take the first headRows rows and the last row of each remaining group
groups = pd.concat([train.groupby('id').head(headRows), train.groupby('id').tail(1)]).sort_index()


This works, but the first line is very slow: 30 minutes or more.



Is there any way to make the first line faster? If I skip the first line, the second line produces duplicate indices (for groups no larger than headRows, the tail row is also returned by head), which messes things up.



Thanks in advance
Regards



Note:
My train DataFrame has around 70,000 groups of varying size, spread over around 700,000 rows. It follows on from my other question, Data processing with adding columns dynamically in Python Pandas Dataframe.
Jeff gave a great answer there, but it fails if the group size is less than or equal to the parameter I pass to head(parameter) when concatenating my rows, as in Jeff's answer: In [31]: groups = concat.....
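
To make the failure mode concrete, here is a small runnable illustration (with toy id/val data of my own, not from the question) of why skipping the filtering line creates duplicate indices: for any group with headRows rows or fewer, the tail row is also one of the head rows.

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2], 'val': [10, 20, 30, 40, 50]})
headRows = 2

# Group 1 has 3 rows (> headRows), so its tail row is not among its head rows.
# Group 2 has only 2 rows, so head(2) already contains its last row and
# tail(1) returns that same row again, duplicating index 4.
g = df.groupby('id')
out = pd.concat([g.head(headRows), g.tail(1)]).sort_index()
print(out.index)  # index 4 appears twice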










      python pandas dataframe data-processing






asked May 18 '14 at 6:28 by Run2
edited May 23 '17 at 10:28 by Community

          1 Answer



























          Use groupby/filter:



          >>> df.groupby('id').filter(lambda x: len(x) > cutoff)


This returns just the rows of your DataFrame where the size of the group is greater than your cutoff, and it should perform quite a bit better. I timed filter on a DataFrame with 30,039 'id' groups and a little over 4 million observations:



          In [9]: %timeit df.groupby('id').filter(lambda x: len(x) > 12)
          1 loops, best of 3: 12.6 s per loop
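
For completeness, here is a minimal self-contained sketch of the whole pipeline with filter replacing the slow transform(len) line (the id/val columns and the headRows value are made up for illustration):

import pandas as pd

# Hypothetical stand-in for the question's train frame
train = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
                      'val': range(11)})
headRows = 2

# Keep only groups with more than headRows rows
train = train.groupby('id').filter(lambda x: len(x) > headRows)

# Every surviving group has more than headRows rows, so head() and tail()
# can never return the same row and no duplicate indices appear
g = train.groupby('id')
groups = pd.concat([g.head(headRows), g.tail(1)]).sort_index()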





edited May 18 '14 at 7:07
answered May 18 '14 at 6:51 by Karl D.

• Hey thanks Karl. Let me evaluate that. I will get back. Thanks
  – Run2, May 18 '14 at 6:54

• Hey Karl, worked. 4 mins now. Thanks.
  – Run2, May 18 '14 at 7:04

• @Karl D If you take this question (and the original) and make a combined entry for the cookbook, that would be great (maybe add a little explanation of the problem). Thanks.
  – Jeff, May 18 '14 at 10:57

• Yeah, I can do that @Jeff.
  – Karl D., May 18 '14 at 18:35
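
A side note, and my own suggestion rather than anything from this thread: if you prefer the boolean-mask style of the original code, the slow per-group len call can be replaced with the vectorized transform('size'), which builds the same mask without invoking a Python function per group:

import pandas as pd

train = pd.DataFrame({'id': [1, 1, 1, 2], 'val': [10, 20, 30, 40]})  # toy data
headRows = 2

# transform('size') broadcasts each group's size back onto its rows,
# so the comparison produces the same mask as transform(len), only faster
mask = train.groupby('id')['id'].transform('size') > headRows
train = train[mask]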










