Calling roundrobin on the result of itertools' groupby























I'm looking for a more efficient and Pythonic way of using itertools' roundrobin recipe on the groups formed by itertools.groupby().



Specifically, I have a list of URLs (not sorted), and I want to re-order them so that the resulting order places the maximum "distance" (or diversification, maybe) between each unique netloc (host), as given by the .netloc attribute from urllib.parse.urlsplit(). A reproducible example is below.



I'm currently using itertools.groupby() plus its roundrobin recipe, but because of the nature of groupby(),




The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list.




...this seems to necessitate forming an intermediate list out of each group.
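A minimal demonstration of that behavior (an illustrative snippet, not from the original post):

import itertools as it

data = ['aa', 'ab', 'bc']
grouped = it.groupby(data, key=lambda s: s[0])
key_a, group_a = next(grouped)
key_b, group_b = next(grouped)  # advancing groupby consumes group_a's items
print(list(group_a))            # [] -- the earlier group is no longer visible
print(list(group_b))            # ['bc']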



Sample data:



import itertools as it
import urllib.parse
from pprint import pprint

bases = ('https://www.google.com', 'https://www.youtube.com',
         'https://docs.scipy.org', 'https://www.group.me')
urls = []
counts = (1, 5, 10, 15)
for c, b in zip(counts, bases):
    for i in range(c):
        urls.append(f'{b}/{i}')

pprint(urls)
# ['https://www.google.com/0',
# 'https://www.youtube.com/0',
# 'https://www.youtube.com/1',
# 'https://www.youtube.com/2',
# 'https://www.youtube.com/3',
# 'https://www.youtube.com/4',
# 'https://docs.scipy.org/0',
# 'https://docs.scipy.org/1',
# 'https://docs.scipy.org/2',
# 'https://docs.scipy.org/3',
# 'https://docs.scipy.org/4',
# 'https://docs.scipy.org/5',
# 'https://docs.scipy.org/6',
# 'https://docs.scipy.org/7',
# 'https://docs.scipy.org/8',
# 'https://docs.scipy.org/9',
# 'https://www.group.me/0',
# 'https://www.group.me/1',
# 'https://www.group.me/2',
# 'https://www.group.me/3',
# 'https://www.group.me/4',
# 'https://www.group.me/5',
# 'https://www.group.me/6',
# 'https://www.group.me/7',
# 'https://www.group.me/8',
# 'https://www.group.me/9',
# 'https://www.group.me/10',
# 'https://www.group.me/11',
# 'https://www.group.me/12',
# 'https://www.group.me/13',
# 'https://www.group.me/14']


Current solution (take 1 from each group, or skip the group if it is empty, until all groups raise StopIteration):



grp = it.groupby(sorted(urls), key=lambda u: urllib.parse.urlsplit(u).netloc)
shuffled = list(roundrobin(*(list(g) for _, g in grp)))
# ^^ Each group is otherwise lost because
# groupby() itself is an iterator
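(For reference, roundrobin above is the recipe from the itertools documentation, lightly adapted here so its loop variable doesn't shadow the it module alias:)

import itertools as it

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D B E C F"
    # Pull one item from each iterable in turn, dropping iterables
    # from the rotation as they are exhausted.
    num_active = len(iterables)
    nexts = it.cycle(iter(i).__next__ for i in iterables)
    while num_active:
        try:
            for nxt in nexts:
                yield nxt()
        except StopIteration:
            # Remove the iterator we just exhausted from the cycle.
            num_active -= 1
            nexts = it.cycle(it.islice(nexts, num_active))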


The expected output for the sample is as follows:



['https://docs.scipy.org/0',
'https://www.google.com/0',
'https://www.group.me/0',
'https://www.youtube.com/0',
'https://docs.scipy.org/1',
'https://www.group.me/1',
'https://www.youtube.com/1',
'https://docs.scipy.org/2',
'https://www.group.me/10',
'https://www.youtube.com/2',
'https://docs.scipy.org/3',
'https://www.group.me/11',
'https://www.youtube.com/3',
'https://docs.scipy.org/4',
'https://www.group.me/12',
'https://www.youtube.com/4',
'https://docs.scipy.org/5',
'https://www.group.me/13',
'https://docs.scipy.org/6',
'https://www.group.me/14',
'https://docs.scipy.org/7',
'https://www.group.me/2',
'https://docs.scipy.org/8',
'https://www.group.me/3',
'https://docs.scipy.org/9',
'https://www.group.me/4',
'https://www.group.me/5',
'https://www.group.me/6',
'https://www.group.me/7',
'https://www.group.me/8',
'https://www.group.me/9']


What is a more efficient way of going about this?










Tags: python, itertools
asked Nov 18 at 1:13 by Brad Solomon; edited Nov 18 at 3:53 by Idlehands
























1 Answer
          Not a huge improvement, but you could use itertools.zip_longest to achieve the same effect with a little tweak:



shuffled = list(x for i in it.zip_longest(*(list(g) for _, g in grp)) for x in i if x is not None)
# flatten the sublists, dropping the None padding added by zip_longest
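Since the groups here contain only non-empty URL strings, filtering out the None padding is safe; for data that could legitimately contain None, a unique sentinel as the fillvalue sidesteps the ambiguity. A minimal sketch (roundrobin_zip is a hypothetical name):

import itertools as it

_SENTINEL = object()  # padding value that cannot collide with real data

def roundrobin_zip(*iterables):
    # Interleave the iterables column by column, then strip the padding.
    return (x
            for row in it.zip_longest(*iterables, fillvalue=_SENTINEL)
            for x in row
            if x is not _SENTINEL)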


The benefit is that you don't have to define the roundrobin recipe. The time saving is negligible, however (timed for n = 10000):



          # 3.7466756048055094 # zip_longest
          # 4.077965201903506 # roundrobin


I feel like there's another solution that could use collections.Counter or sort(key=...) on the sorted list, but I haven't cracked that case yet. The time complexity might be worse than your implementation's, since it would likely lean on more Python-level code rather than compiled modules. This is an interesting problem, though; I'll probably revisit it later.
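For what it's worth, one hedged sketch of that sort-based idea (the name diversify and the helper structure are illustrative, not from the answer): tag each URL with its occurrence rank within its host group, then sort by (rank, host). On the sample data this reproduces the expected output above.

import urllib.parse
from collections import defaultdict

def diversify(urls):
    # Rank each URL within its host group (0 for a host's first URL,
    # 1 for its second, ...), then sort by (rank, host, url): rank 0
    # of every host comes first, then rank 1, and so on.
    seen = defaultdict(int)
    keyed = []
    for u in sorted(urls):
        host = urllib.parse.urlsplit(u).netloc
        keyed.append((seen[host], host, u))
        seen[host] += 1
    return [u for _, _, u in sorted(keyed)]

shuffled = diversify(urls)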






answered Nov 18 at 6:13 by Idlehands
• It’s a tough one from a time-complexity perspective; it seems difficult to do in less than N^2. Likewise, I wonder if there’s some magic formula given the counts.
  – Brad Solomon
  Nov 18 at 13:33










