Pandas Dataframe selecting groups with minimal cardinality
I have a problem where I need to take groups of rows from a data frame where the number of rows in a group exceeds a certain cutoff. For those groups, I need to take some head rows and the tail row.
I am using the code below:
train = train[train.groupby('id').id.transform(len) > headRows]
groups = pd.concat([train.groupby('id').head(headRows), train.groupby('id').tail(1)]).sort_index()
This works, but the first line is very slow: 30 minutes or more.
Is there any way to make the first line faster? If I skip the first line, the result of the second line contains duplicate indices, which messes things up.
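For reference, here is a sketch of the same size filter written without calling len on every group: count the rows per id once with value_counts and map the counts back onto the rows. The toy frame and the headRows value below are illustrative, not the real data.

```python
import pandas as pd

# Toy frame standing in for `train`; `headRows` is the cutoff from the question.
train = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3], 'val': range(6)})
headRows = 2

# Count rows per id once, then map those counts back onto each row.
# This avoids invoking a Python callable per group via transform(len).
counts = train['id'].map(train['id'].value_counts())
train = train[counts > headRows]

print(train['id'].unique())  # only ids whose group size exceeds headRows
```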
Thanks in advance.
Note:
My train data frame has around 70,000 groups of varying size over around 700,000 rows. It follows from my other question, Data processing with adding columns dynamically in Python Pandas Dataframe. Jeff gave a great answer there, but it fails when a group's size is less than or equal to the parameter I pass to head(parameter) when concatenating my rows (In [31]: groups = concat... in Jeff's answer).
python pandas dataframe data-processing
asked May 18 '14 at 6:28 by Run2
edited May 23 '17 at 10:28 by Community♦
1 Answer
Accepted answer (4 votes), answered May 18 '14 at 6:51 by Karl D., edited May 18 '14 at 7:07
Use groupby/filter:
>>> df.groupby('id').filter(lambda x: len(x) > cutoff)
This will return just the rows of your dataframe where the size of the group is greater than your cutoff. It should also perform quite a bit better. I timed filter here with a dataframe with 30,039 'id' groups and a little over 4 million observations:
In [9]: %timeit df.groupby('id').filter(lambda x: len(x) > 12)
1 loops, best of 3: 12.6 s per loop
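For completeness, a sketch combining filter with the head/tail concatenation from the question. The toy data is illustrative; headRows plays the role of the cutoff, as in the original code.

```python
import pandas as pd

# Toy data: group sizes are 5, 3, and 1 for ids 1, 2, and 3.
df = pd.DataFrame({'id': [1]*5 + [2]*3 + [3]*1, 'val': range(9)})
headRows = 2

# Drop groups whose size does not exceed the cutoff, then take the
# head rows plus the tail row of each surviving group.
kept = df.groupby('id').filter(lambda x: len(x) > headRows)
grouped = kept.groupby('id')
groups = pd.concat([grouped.head(headRows), grouped.tail(1)]).sort_index()
```

Because every surviving group has more than headRows rows, head(headRows) and tail(1) never overlap, so the concatenated result has no duplicate indices.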
Hey thanks Karl. Let me evaluate that. I will get back. Thanks. – Run2, May 18 '14 at 6:54
Hey Karl, it worked. 4 mins now. Thanks. – Run2, May 18 '14 at 7:04
@Karl D If you take this question (and the original) and make a combined entry for the cookbook, that would be great (maybe add a little explanation of the problem). Thanks. – Jeff, May 18 '14 at 10:57
Yeah, I can do that @Jeff. – Karl D., May 18 '14 at 18:35