Calling roundrobin on result of itertools' groupby
I'm looking for a more efficient and Pythonic way of applying itertools' roundrobin recipe to the groups formed by itertools.groupby().

Specifically, I have a list of URLs (not sorted) and want to re-order them so that the result places the maximum "distance" (or diversification, maybe) between each unique netloc (host), as given by the .netloc attribute from urllib.parse.urlsplit(). A reproducible example is below.
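For reference, a quick illustration of the attribute in question (not part of the original example):

>>> import urllib.parse
>>> urllib.parse.urlsplit('https://docs.scipy.org/0').netloc
'docs.scipy.org'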
I'm currently using itertools.groupby() plus the roundrobin recipe, but because of the nature of groupby():

    The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list.

...this seems to necessitate forming an intermediate list out of each group.
Sample data:
import itertools as it
import urllib.parse
from pprint import pprint

bases = ('https://www.google.com', 'https://www.youtube.com',
         'https://docs.scipy.org', 'https://www.group.me')
counts = (1, 5, 10, 15)

urls = []
for c, b in zip(counts, bases):
    for i in range(c):
        urls.append(f'{b}/{i}')
pprint(urls)
# ['https://www.google.com/0',
# 'https://www.youtube.com/0',
# 'https://www.youtube.com/1',
# 'https://www.youtube.com/2',
# 'https://www.youtube.com/3',
# 'https://www.youtube.com/4',
# 'https://docs.scipy.org/0',
# 'https://docs.scipy.org/1',
# 'https://docs.scipy.org/2',
# 'https://docs.scipy.org/3',
# 'https://docs.scipy.org/4',
# 'https://docs.scipy.org/5',
# 'https://docs.scipy.org/6',
# 'https://docs.scipy.org/7',
# 'https://docs.scipy.org/8',
# 'https://docs.scipy.org/9',
# 'https://www.group.me/0',
# 'https://www.group.me/1',
# 'https://www.group.me/2',
# 'https://www.group.me/3',
# 'https://www.group.me/4',
# 'https://www.group.me/5',
# 'https://www.group.me/6',
# 'https://www.group.me/7',
# 'https://www.group.me/8',
# 'https://www.group.me/9',
# 'https://www.group.me/10',
# 'https://www.group.me/11',
# 'https://www.group.me/12',
# 'https://www.group.me/13',
# 'https://www.group.me/14']
Current solution (take 1 from each group, skipping a group once it is exhausted, until all groups raise StopIteration):
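(For completeness, roundrobin here is the recipe from the itertools documentation, lightly adapted to the `it` alias used above:)

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # Recipe credited to George Sakkis.
    num_active = len(iterables)
    nexts = it.cycle(iter(itr).__next__ for itr in iterables)
    while num_active:
        try:
            for nxt in nexts:
                yield nxt()
        except StopIteration:
            # Remove the exhausted iterator from the cycle.
            num_active -= 1
            nexts = it.cycle(it.islice(nexts, num_active))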
grp = it.groupby(sorted(urls), key=lambda u: urllib.parse.urlsplit(u).netloc)
shuffled = list(roundrobin(*(list(g) for _, g in grp)))
# ^^ Each group is otherwise lost because
# groupby() itself is an iterator
The expected output for the sample is as follows:
['https://docs.scipy.org/0',
'https://www.google.com/0',
'https://www.group.me/0',
'https://www.youtube.com/0',
'https://docs.scipy.org/1',
'https://www.group.me/1',
'https://www.youtube.com/1',
'https://docs.scipy.org/2',
'https://www.group.me/10',
'https://www.youtube.com/2',
'https://docs.scipy.org/3',
'https://www.group.me/11',
'https://www.youtube.com/3',
'https://docs.scipy.org/4',
'https://www.group.me/12',
'https://www.youtube.com/4',
'https://docs.scipy.org/5',
'https://www.group.me/13',
'https://docs.scipy.org/6',
'https://www.group.me/14',
'https://docs.scipy.org/7',
'https://www.group.me/2',
'https://docs.scipy.org/8',
'https://www.group.me/3',
'https://docs.scipy.org/9',
'https://www.group.me/4',
'https://www.group.me/5',
'https://www.group.me/6',
'https://www.group.me/7',
'https://www.group.me/8',
'https://www.group.me/9']
What is a more efficient way of going about this?
python itertools
asked Nov 18 at 1:13 by Brad Solomon · edited Nov 18 at 3:53 by Idlehands
1 Answer
Not a huge improvement, but you could use itertools.zip_longest to achieve the same effect with a little tweak:

shuffled = list(x for i in it.zip_longest(*(list(g) for _, g in grp))
                for x in i if x is not None)
# flatten the sublists and drop the None padding that zip_longest adds
# to the shorter groups (`is not None` rather than a bare truthiness
# test, so any falsy items would survive)
The benefit is that you don't have to define the roundrobin recipe. The time saving is negligible, however (timed for n=10000):
# 3.7466756048055094 # zip_longest
# 4.077965201903506 # roundrobin
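A minimal harness along these lines (an editor's sketch; the exact benchmark behind the numbers above isn't shown, and this assumes urls and roundrobin from the question are defined):

import timeit

def groups():
    # Rebuild the groups each run, so both approaches pay the same cost.
    grp = it.groupby(sorted(urls), key=lambda u: urllib.parse.urlsplit(u).netloc)
    return [list(g) for _, g in grp]

print(timeit.timeit(
    lambda: [x for i in it.zip_longest(*groups()) for x in i if x is not None],
    number=10000))
print(timeit.timeit(lambda: list(roundrobin(*groups())), number=10000))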
I feel like there's another solution that could use collections.Counter or sort(key=...) on the sorted list, but I haven't cracked that case yet. The time complexity feels like it might be more severe than your implementation's, since it might rely on more Python code than compiled modules. This is an interesting problem, though; I'll probably revisit it later.

answered Nov 18 at 6:13 by Idlehands
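One concrete possibility along those lines (an editor's sketch, not from the answer; diversify is a made-up name, and it assumes the question's imports plus that per-host output order need not be sorted):

from collections import defaultdict

def diversify(urls):
    # Group by netloc in a single pass, avoiding the O(n log n) sort;
    # hosts come out in first-seen rather than alphabetical order.
    groups = defaultdict(list)
    for u in urls:
        groups[urllib.parse.urlsplit(u).netloc].append(u)
    # Round-robin across the groups, dropping zip_longest's None padding.
    return [x for x in it.chain.from_iterable(it.zip_longest(*groups.values()))
            if x is not None]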
It's a tough one from a time-complexity perspective; it seems difficult to do in less than N^2. Likewise, I wonder if there's some magic formula given the counts. – Brad Solomon, Nov 18 at 13:33