Calling roundrobin on the result of itertools' groupby























I'm looking for a more efficient and Pythonic way of using itertools' roundrobin recipe on the groups formed by itertools.groupby().



Specifically, I have a list of URLs (not sorted), and I want to re-order them so that the resulting order places the maximum "distance" (or diversification, maybe) between each unique netloc (host), as given by the .netloc attribute from urllib.parse.urlsplit(). A reproducible example is below.



I'm currently using itertools.groupby() plus its roundrobin recipe, but because of the nature of groupby(),




The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list.




...this seems to necessitate forming an intermediate list out of each group.
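A minimal demonstration of that behavior (an illustrative snippet, not from the original post):

import itertools as it

data = ['aa', 'ab', 'bc']
grouped = it.groupby(data, key=lambda s: s[0])
key_a, group_a = next(grouped)
key_b, group_b = next(grouped)  # advancing groupby consumes group_a's items
print(list(group_a))            # [] -- the earlier group is no longer visible
print(list(group_b))            # ['bc']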



Sample data:



import itertools as it
import urllib.parse
from pprint import pprint

bases = ('https://www.google.com', 'https://www.youtube.com',
         'https://docs.scipy.org', 'https://www.group.me')
urls = []
counts = (1, 5, 10, 15)
for c, b in zip(counts, bases):
    for i in range(c):
        urls.append(f'{b}/{i}')

pprint(urls)
# ['https://www.google.com/0',
# 'https://www.youtube.com/0',
# 'https://www.youtube.com/1',
# 'https://www.youtube.com/2',
# 'https://www.youtube.com/3',
# 'https://www.youtube.com/4',
# 'https://docs.scipy.org/0',
# 'https://docs.scipy.org/1',
# 'https://docs.scipy.org/2',
# 'https://docs.scipy.org/3',
# 'https://docs.scipy.org/4',
# 'https://docs.scipy.org/5',
# 'https://docs.scipy.org/6',
# 'https://docs.scipy.org/7',
# 'https://docs.scipy.org/8',
# 'https://docs.scipy.org/9',
# 'https://www.group.me/0',
# 'https://www.group.me/1',
# 'https://www.group.me/2',
# 'https://www.group.me/3',
# 'https://www.group.me/4',
# 'https://www.group.me/5',
# 'https://www.group.me/6',
# 'https://www.group.me/7',
# 'https://www.group.me/8',
# 'https://www.group.me/9',
# 'https://www.group.me/10',
# 'https://www.group.me/11',
# 'https://www.group.me/12',
# 'https://www.group.me/13',
# 'https://www.group.me/14']


Current solution (take 1 from each group, or skip the group if it is empty, until all groups raise StopIteration):



grp = it.groupby(sorted(urls), key=lambda u: urllib.parse.urlsplit(u).netloc)
shuffled = list(roundrobin(*(list(g) for _, g in grp)))
# ^^ Each group is otherwise lost because
# groupby() itself is an iterator
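(For reference, roundrobin above is the recipe from the itertools documentation, lightly adapted here so its loop variable doesn't shadow the it module alias:)

import itertools as it

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D B E C F"
    # Pull one item from each iterable in turn, dropping iterables
    # from the rotation as they are exhausted.
    num_active = len(iterables)
    nexts = it.cycle(iter(i).__next__ for i in iterables)
    while num_active:
        try:
            for nxt in nexts:
                yield nxt()
        except StopIteration:
            # Remove the iterator we just exhausted from the cycle.
            num_active -= 1
            nexts = it.cycle(it.islice(nexts, num_active))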


The expected output for the sample is as follows:



['https://docs.scipy.org/0',
'https://www.google.com/0',
'https://www.group.me/0',
'https://www.youtube.com/0',
'https://docs.scipy.org/1',
'https://www.group.me/1',
'https://www.youtube.com/1',
'https://docs.scipy.org/2',
'https://www.group.me/10',
'https://www.youtube.com/2',
'https://docs.scipy.org/3',
'https://www.group.me/11',
'https://www.youtube.com/3',
'https://docs.scipy.org/4',
'https://www.group.me/12',
'https://www.youtube.com/4',
'https://docs.scipy.org/5',
'https://www.group.me/13',
'https://docs.scipy.org/6',
'https://www.group.me/14',
'https://docs.scipy.org/7',
'https://www.group.me/2',
'https://docs.scipy.org/8',
'https://www.group.me/3',
'https://docs.scipy.org/9',
'https://www.group.me/4',
'https://www.group.me/5',
'https://www.group.me/6',
'https://www.group.me/7',
'https://www.group.me/8',
'https://www.group.me/9']


What is a more efficient way of going about this?










Tags: python, itertools
asked Nov 18 at 1:13 by Brad Solomon; edited Nov 18 at 3:53 by Idlehands
























1 Answer
          Not a huge improvement, but you could use itertools.zip_longest to achieve the same effect with a little tweak:



shuffled = list(x for i in it.zip_longest(*(list(g) for _, g in grp)) for x in i if x is not None)
# flatten the sublists, dropping the None padding added by zip_longest
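Since the groups here contain only non-empty URL strings, filtering out the None padding is safe; for data that could legitimately contain None, a unique sentinel as the fillvalue sidesteps the ambiguity. A minimal sketch (roundrobin_zip is a hypothetical name):

import itertools as it

_SENTINEL = object()  # padding value that cannot collide with real data

def roundrobin_zip(*iterables):
    # Interleave the iterables column by column, then strip the padding.
    return (x
            for row in it.zip_longest(*iterables, fillvalue=_SENTINEL)
            for x in row
            if x is not _SENTINEL)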


The benefit is that you don't have to define the roundrobin recipe. The time saving is negligible, however (timed for n = 10000):



          # 3.7466756048055094 # zip_longest
          # 4.077965201903506 # roundrobin


I feel like there's another solution that could use collections.Counter or sort(key=...) on the sorted list, but I haven't cracked that case yet. The time complexity might be worse than your implementation's, since it would likely lean on more Python-level code rather than compiled modules. This is an interesting problem, though; I'll probably revisit it later.
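For what it's worth, one hedged sketch of that sort-based idea (the name diversify and the helper structure are illustrative, not from the answer): tag each URL with its occurrence rank within its host group, then sort by (rank, host). On the sample data this reproduces the expected output above.

import urllib.parse
from collections import defaultdict

def diversify(urls):
    # Rank each URL within its host group (0 for a host's first URL,
    # 1 for its second, ...), then sort by (rank, host, url): rank 0
    # of every host comes first, then rank 1, and so on.
    seen = defaultdict(int)
    keyed = []
    for u in sorted(urls):
        host = urllib.parse.urlsplit(u).netloc
        keyed.append((seen[host], host, u))
        seen[host] += 1
    return [u for _, _, u in sorted(keyed)]

shuffled = diversify(urls)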






answered Nov 18 at 6:13 by Idlehands
• It’s a tough one from a time-complexity perspective; it seems difficult to do in less than N^2. Likewise, I wonder if there’s some magic formula given the counts.
  – Brad Solomon
  Nov 18 at 13:33










