Performant cartesian product (CROSS JOIN) of two pandas DataFrames











up vote
6
down vote

favorite
2













The contents of this post were originally meant to be a part of
Pandas Merging 101,
but due to the nature and size of the content required to fully do
justice to this topic, it has been moved to its own QnA.




Given two simple DataFrames;



left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]})
right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]})

left

col1 col2
0 A 1
1 B 2
2 C 3

right

col1 col2
0 X 20
1 Y 30
2 Z 50


The cross product of these frames can be computed, and will look something like:



A       1      X      20
A 1 Y 30
A 1 Z 50
B 2 X 20
B 2 Y 30
B 2 Z 50
C 3 X 20
C 3 Y 30
C 3 Z 50


What is the most performant method of computing this result?










share|improve this question


























    up vote
    6
    down vote

    favorite
    2













    The contents of this post were originally meant to be a part of
    Pandas Merging 101,
    but due to the nature and size of the content required to fully do
    justice to this topic, it has been moved to its own QnA.




    Given two simple DataFrames;



    left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]})
    right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]})

    left

    col1 col2
    0 A 1
    1 B 2
    2 C 3

    right

    col1 col2
    0 X 20
    1 Y 30
    2 Z 50


    The cross product of these frames can be computed, and will look something like:



    A       1      X      20
    A 1 Y 30
    A 1 Z 50
    B 2 X 20
    B 2 Y 30
    B 2 Z 50
    C 3 X 20
    C 3 Y 30
    C 3 Z 50


    What is the most performant method of computing this result?










    share|improve this question
























      up vote
      6
      down vote

      favorite
      2









      up vote
      6
      down vote

      favorite
      2






      2






      The contents of this post were originally meant to be a part of
      Pandas Merging 101,
      but due to the nature and size of the content required to fully do
      justice to this topic, it has been moved to its own QnA.




      Given two simple DataFrames;



      left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]})
      right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]})

      left

      col1 col2
      0 A 1
      1 B 2
      2 C 3

      right

      col1 col2
      0 X 20
      1 Y 30
      2 Z 50


      The cross product of these frames can be computed, and will look something like:



      A       1      X      20
      A 1 Y 30
      A 1 Z 50
      B 2 X 20
      B 2 Y 30
      B 2 Z 50
      C 3 X 20
      C 3 Y 30
      C 3 Z 50


      What is the most performant method of computing this result?










      share|improve this question














      The contents of this post were originally meant to be a part of
      Pandas Merging 101,
      but due to the nature and size of the content required to fully do
      justice to this topic, it has been moved to its own QnA.




      Given two simple DataFrames;



      left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]})
      right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]})

      left

      col1 col2
      0 A 1
      1 B 2
      2 C 3

      right

      col1 col2
      0 X 20
      1 Y 30
      2 Z 50


      The cross product of these frames can be computed, and will look something like:



      A       1      X      20
      A 1 Y 30
      A 1 Z 50
      B 2 X 20
      B 2 Y 30
      B 2 Z 50
      C 3 X 20
      C 3 Y 30
      C 3 Z 50


      What is the most performant method of computing this result?







      python pandas numpy dataframe merge






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked 2 hours ago









      coldspeed

      113k18104177




      113k18104177
























          2 Answers
          2






          active

          oldest

          votes

















          up vote
          6
          down vote













          Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:



          def cartesian_product_basic(left, right):
          return (
          left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))

          cartesian_product_basic(left, right)

          col1_x col2_x col1_y col2_y
          0 A 1 X 20
          1 A 1 Y 30
          2 A 1 Z 50
          3 B 2 X 20
          4 B 2 Y 30
          5 B 2 Z 50
          6 C 3 X 20
          7 C 3 Y 30
          8 C 3 Z 50


          How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".



          While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.



          A faster implementation will require NumPy. Here are some famous NumPy implementations of 1D cartesian product. We can build on some of these performant solutions to get our desired output. My favourite, however, is @senderle's first implementation.



          def cartesian_product(*arrays):
          la = len(arrays)
          dtype = np.result_type(*arrays)
          arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
          for i, a in enumerate(np.ix_(*arrays)):
          arr[...,i] = a
          return arr.reshape(-1, la)


          Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames



          This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product, use this to reindex the DataFrames, and



          def cartesian_product_generalized(left, right):
          la, lb = len(left), len(right)
          idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
          return pd.DataFrame(
          np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))

          cartesian_product_generalized(left, right)

          0 1 2 3
          0 A 1 X 20
          1 A 1 Y 30
          2 A 1 Z 50
          3 B 2 X 20
          4 B 2 Y 30
          5 B 2 Z 50
          6 C 3 X 20
          7 C 3 Y 30
          8 C 3 Z 50

          np.array_equal(cartesian_product_generalized(left, right),
          cartesian_product_basic(left, right))
          True


          And, along similar lines,



          left2 = left.copy()
          left2.index = ['s1', 's2', 's1']

          right2 = right.copy()
          right2.index = ['x', 'y', 'y']


          left2
          col1 col2
          s1 A 1
          s2 B 2
          s1 C 3

          right2
          col1 col2
          x X 20
          y Y 30
          y Z 50

          np.array_equal(cartesian_product_generalized(left, right),
          cartesian_product_basic(left2, right2))
          True


          This solution can generalise to multiple DataFrames. For example,



          def cartesian_product_multi(*dfs):
          idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
          return pd.DataFrame(
          np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))

          cartesian_product_multi(*[left, right, left]).head()

          0 1 2 3 4 5
          0 A 1 X 20 A 1
          1 A 1 X 20 B 2
          2 A 1 X 20 C 3
          3 A 1 X 20 D 4
          4 A 1 Y 30 A 1


          Further Simplification



          A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.



          def cartesian_product_simplified(left, right):
          la, lb = len(left), len(right)
          ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])

          return pd.DataFrame(
          np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))

          np.array_equal(cartesian_product_simplified(left, right),
          cartesian_product_basic(left2, right2))
          True




          Performance Comparison



          Benchmarking these solutions on some contrived DataFrames with unique indices, we have



          enter image description here



          Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.



          Functions from Other Answers



          # Wen's answer: https://stackoverflow.com/a/53699198/4909087
          # I've put my own spin on this to make it as fast as possible.
          def cartesian_product_itertools(left, right):
          return pd.DataFrame([
          [*x, *y] for x, y in itertools.product(
          left.values.tolist(), right.values.tolist())])


          Performance Benchmarking Code

          This is the timing script. All functions called here are defined above.



          from timeit import timeit
          import pandas as pd
          import matplotlib.pyplot as plt

          res = pd.DataFrame(
          index=['cartesian_product_basic', 'cartesian_product_generalized',
          'cartesian_product_multi', 'cartesian_product_simplified',
          'cartesian_product_itertools'],
          columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
          dtype=float
          )

          for f in res.index:
          for c in res.columns:
          # print(f,c)
          if f in {'cartesian_product_itertools'} and c > 600:
          continue
          left2 = pd.concat([left] * c, ignore_index=True)
          right2 = pd.concat([right] * c, ignore_index=True)
          stmt = '{}(left2, right2)'.format(f)
          setp = 'from __main__ import left2, right2, {}'.format(f)
          res.at[f, c] = timeit(stmt, setp, number=5)

          ax = res.div(res.min()).T.plot(loglog=True)
          ax.set_xlabel("N");
          ax.set_ylabel("time (relative)");

          plt.show()





          share|improve this answer



















          • 1




            Happy to to upvote twice (actually 1.5 times) :-), :D, lol :-)
            – U9-Forward
            1 hour ago








          • 1




            @U9-Forward Thanks, appreciate your support!
            – coldspeed
            1 hour ago










          • Good post. Upvoted 1.5 too hehe. Would be nice to see the performance of itertools.product as suggested by @W-B in your post as well.
            – b-fg
            1 hour ago










          • @b-fg Fair enough, done.
            – coldspeed
            1 hour ago










          • Wow, I didn't expect that! Thanks, appreciate the effort.
            – b-fg
            57 mins ago


















          up vote
          2
          down vote













          Using itertools product and recreate the value in dataframe



          import itertools
          l=list(itertools.product(left.values.tolist(),right.values.tolist()))
          pd.DataFrame(list(map(lambda x : sum(x,),l)))
          0 1 2 3
          0 A 1 X 20
          1 A 1 Y 30
          2 A 1 Z 50
          3 B 2 X 20
          4 B 2 Y 30
          5 B 2 Z 50
          6 C 3 X 20
          7 C 3 Y 30
          8 C 3 Z 50





          share|improve this answer





















            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53699012%2fperformant-cartesian-product-cross-join-of-two-pandas-dataframes%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            2 Answers
            2






            active

            oldest

            votes








            2 Answers
            2






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes








            up vote
            6
            down vote













            Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:



            def cartesian_product_basic(left, right):
            return (
            left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))

            cartesian_product_basic(left, right)

            col1_x col2_x col1_y col2_y
            0 A 1 X 20
            1 A 1 Y 30
            2 A 1 Z 50
            3 B 2 X 20
            4 B 2 Y 30
            5 B 2 Z 50
            6 C 3 X 20
            7 C 3 Y 30
            8 C 3 Z 50


            How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".



            While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.



            A faster implementation will require NumPy. Here are some famous NumPy implementations of 1D cartesian product. We can build on some of these performant solutions to get our desired output. My favourite, however, is @senderle's first implementation.



            def cartesian_product(*arrays):
            la = len(arrays)
            dtype = np.result_type(*arrays)
            arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
            for i, a in enumerate(np.ix_(*arrays)):
            arr[...,i] = a
            return arr.reshape(-1, la)


            Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames



            This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product, use this to reindex the DataFrames, and



            def cartesian_product_generalized(left, right):
            la, lb = len(left), len(right)
            idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
            return pd.DataFrame(
            np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))

            cartesian_product_generalized(left, right)

            0 1 2 3
            0 A 1 X 20
            1 A 1 Y 30
            2 A 1 Z 50
            3 B 2 X 20
            4 B 2 Y 30
            5 B 2 Z 50
            6 C 3 X 20
            7 C 3 Y 30
            8 C 3 Z 50

            np.array_equal(cartesian_product_generalized(left, right),
            cartesian_product_basic(left, right))
            True


            And, along similar lines,



            left2 = left.copy()
            left2.index = ['s1', 's2', 's1']

            right2 = right.copy()
            right2.index = ['x', 'y', 'y']


            left2
            col1 col2
            s1 A 1
            s2 B 2
            s1 C 3

            right2
            col1 col2
            x X 20
            y Y 30
            y Z 50

            np.array_equal(cartesian_product_generalized(left, right),
            cartesian_product_basic(left2, right2))
            True


            This solution can generalise to multiple DataFrames. For example,



            def cartesian_product_multi(*dfs):
            idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
            return pd.DataFrame(
            np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))

            cartesian_product_multi(*[left, right, left]).head()

            0 1 2 3 4 5
            0 A 1 X 20 A 1
            1 A 1 X 20 B 2
            2 A 1 X 20 C 3
            3 A 1 X 20 D 4
            4 A 1 Y 30 A 1


            Further Simplification



            A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.



            def cartesian_product_simplified(left, right):
            la, lb = len(left), len(right)
            ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])

            return pd.DataFrame(
            np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))

            np.array_equal(cartesian_product_simplified(left, right),
            cartesian_product_basic(left2, right2))
            True




            Performance Comparison



            Benchmarking these solutions on some contrived DataFrames with unique indices, we have



            enter image description here



            Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.



            Functions from Other Answers



            # Wen's answer: https://stackoverflow.com/a/53699198/4909087
            # I've put my own spin on this to make it as fast as possible.
            def cartesian_product_itertools(left, right):
            return pd.DataFrame([
            [*x, *y] for x, y in itertools.product(
            left.values.tolist(), right.values.tolist())])


            Performance Benchmarking Code

            This is the timing script. All functions called here are defined above.



            from timeit import timeit
            import pandas as pd
            import matplotlib.pyplot as plt

            res = pd.DataFrame(
            index=['cartesian_product_basic', 'cartesian_product_generalized',
            'cartesian_product_multi', 'cartesian_product_simplified',
            'cartesian_product_itertools'],
            columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
            dtype=float
            )

            for f in res.index:
            for c in res.columns:
            # print(f,c)
            if f in {'cartesian_product_itertools'} and c > 600:
            continue
            left2 = pd.concat([left] * c, ignore_index=True)
            right2 = pd.concat([right] * c, ignore_index=True)
            stmt = '{}(left2, right2)'.format(f)
            setp = 'from __main__ import left2, right2, {}'.format(f)
            res.at[f, c] = timeit(stmt, setp, number=5)

            ax = res.div(res.min()).T.plot(loglog=True)
            ax.set_xlabel("N");
            ax.set_ylabel("time (relative)");

            plt.show()





            share|improve this answer



















            • 1




              Happy to to upvote twice (actually 1.5 times) :-), :D, lol :-)
              – U9-Forward
              1 hour ago








            • 1




              @U9-Forward Thanks, appreciate your support!
              – coldspeed
              1 hour ago










            • Good post. Upvoted 1.5 too hehe. Would be nice to see the performance of itertools.product as suggested by @W-B in your post as well.
              – b-fg
              1 hour ago










            • @b-fg Fair enough, done.
              – coldspeed
              1 hour ago










            • Wow, I didn't expect that! Thanks, appreciate the effort.
              – b-fg
              57 mins ago















            up vote
            6
            down vote













            Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:



            def cartesian_product_basic(left, right):
            return (
            left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))

            cartesian_product_basic(left, right)

            col1_x col2_x col1_y col2_y
            0 A 1 X 20
            1 A 1 Y 30
            2 A 1 Z 50
            3 B 2 X 20
            4 B 2 Y 30
            5 B 2 Z 50
            6 C 3 X 20
            7 C 3 Y 30
            8 C 3 Z 50


            How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".



            While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.



            A faster implementation will require NumPy. Here are some famous NumPy implementations of 1D cartesian product. We can build on some of these performant solutions to get our desired output. My favourite, however, is @senderle's first implementation.



            def cartesian_product(*arrays):
            la = len(arrays)
            dtype = np.result_type(*arrays)
            arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
            for i, a in enumerate(np.ix_(*arrays)):
            arr[...,i] = a
            return arr.reshape(-1, la)


            Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames



            This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product, use this to reindex the DataFrames, and



            def cartesian_product_generalized(left, right):
            la, lb = len(left), len(right)
            idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
            return pd.DataFrame(
            np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))

            cartesian_product_generalized(left, right)

            0 1 2 3
            0 A 1 X 20
            1 A 1 Y 30
            2 A 1 Z 50
            3 B 2 X 20
            4 B 2 Y 30
            5 B 2 Z 50
            6 C 3 X 20
            7 C 3 Y 30
            8 C 3 Z 50

            np.array_equal(cartesian_product_generalized(left, right),
            cartesian_product_basic(left, right))
            True


            And, along similar lines,



            left2 = left.copy()
            left2.index = ['s1', 's2', 's1']

            right2 = right.copy()
            right2.index = ['x', 'y', 'y']


            left2
            col1 col2
            s1 A 1
            s2 B 2
            s1 C 3

            right2
            col1 col2
            x X 20
            y Y 30
            y Z 50

            np.array_equal(cartesian_product_generalized(left, right),
            cartesian_product_basic(left2, right2))
            True


            This solution can generalise to multiple DataFrames. For example,



            def cartesian_product_multi(*dfs):
            idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
            return pd.DataFrame(
            np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))

            cartesian_product_multi(*[left, right, left]).head()

            0 1 2 3 4 5
            0 A 1 X 20 A 1
            1 A 1 X 20 B 2
            2 A 1 X 20 C 3
            3 A 1 X 20 D 4
            4 A 1 Y 30 A 1


            Further Simplification



            A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.



            def cartesian_product_simplified(left, right):
            la, lb = len(left), len(right)
            ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])

            return pd.DataFrame(
            np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))

            np.array_equal(cartesian_product_simplified(left, right),
            cartesian_product_basic(left2, right2))
            True




            Performance Comparison



            Benchmarking these solutions on some contrived DataFrames with unique indices, we have



            enter image description here



            Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.



            Functions from Other Answers



            # Wen's answer: https://stackoverflow.com/a/53699198/4909087
            # I've put my own spin on this to make it as fast as possible.
            def cartesian_product_itertools(left, right):
            return pd.DataFrame([
            [*x, *y] for x, y in itertools.product(
            left.values.tolist(), right.values.tolist())])


            Performance Benchmarking Code

            This is the timing script. All functions called here are defined above.



            from timeit import timeit
            import pandas as pd
            import matplotlib.pyplot as plt

            res = pd.DataFrame(
            index=['cartesian_product_basic', 'cartesian_product_generalized',
            'cartesian_product_multi', 'cartesian_product_simplified',
            'cartesian_product_itertools'],
            columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
            dtype=float
            )

            for f in res.index:
            for c in res.columns:
            # print(f,c)
            if f in {'cartesian_product_itertools'} and c > 600:
            continue
            left2 = pd.concat([left] * c, ignore_index=True)
            right2 = pd.concat([right] * c, ignore_index=True)
            stmt = '{}(left2, right2)'.format(f)
            setp = 'from __main__ import left2, right2, {}'.format(f)
            res.at[f, c] = timeit(stmt, setp, number=5)

            ax = res.div(res.min()).T.plot(loglog=True)
            ax.set_xlabel("N");
            ax.set_ylabel("time (relative)");

            plt.show()





            share|improve this answer



















            • 1




              Happy to to upvote twice (actually 1.5 times) :-), :D, lol :-)
              – U9-Forward
              1 hour ago








            • 1




              @U9-Forward Thanks, appreciate your support!
              – coldspeed
              1 hour ago










            • Good post. Upvoted 1.5 too hehe. Would be nice to see the performance of itertools.product as suggested by @W-B in your post as well.
              – b-fg
              1 hour ago










            • @b-fg Fair enough, done.
              – coldspeed
              1 hour ago










            • Wow, I didn't expect that! Thanks, appreciate the effort.
              – b-fg
              57 mins ago













            up vote
            6
            down vote










            up vote
            6
            down vote









            Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:



            def cartesian_product_basic(left, right):
            return (
            left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))

            cartesian_product_basic(left, right)

            col1_x col2_x col1_y col2_y
            0 A 1 X 20
            1 A 1 Y 30
            2 A 1 Z 50
            3 B 2 X 20
            4 B 2 Y 30
            5 B 2 Z 50
            6 C 3 X 20
            7 C 3 Y 30
            8 C 3 Z 50


            How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".



            While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.



            A faster implementation will require NumPy. Here are some famous NumPy implementations of 1D cartesian product. We can build on some of these performant solutions to get our desired output. My favourite, however, is @senderle's first implementation.



            def cartesian_product(*arrays):
            la = len(arrays)
            dtype = np.result_type(*arrays)
            arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
            for i, a in enumerate(np.ix_(*arrays)):
            arr[...,i] = a
            return arr.reshape(-1, la)


            Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames



            This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product, use this to reindex the DataFrames, and



            def cartesian_product_generalized(left, right):
            la, lb = len(left), len(right)
            idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
            return pd.DataFrame(
            np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))

            cartesian_product_generalized(left, right)

            0 1 2 3
            0 A 1 X 20
            1 A 1 Y 30
            2 A 1 Z 50
            3 B 2 X 20
            4 B 2 Y 30
            5 B 2 Z 50
            6 C 3 X 20
            7 C 3 Y 30
            8 C 3 Z 50

            np.array_equal(cartesian_product_generalized(left, right),
            cartesian_product_basic(left, right))
            True


            And, along similar lines,



            left2 = left.copy()
            left2.index = ['s1', 's2', 's1']

            right2 = right.copy()
            right2.index = ['x', 'y', 'y']


            left2
            col1 col2
            s1 A 1
            s2 B 2
            s1 C 3

            right2
            col1 col2
            x X 20
            y Y 30
            y Z 50

            np.array_equal(cartesian_product_generalized(left, right),
            cartesian_product_basic(left2, right2))
            True


            This solution can generalise to multiple DataFrames. For example,



            def cartesian_product_multi(*dfs):
            idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
            return pd.DataFrame(
            np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))

            cartesian_product_multi(*[left, right, left]).head()

            0 1 2 3 4 5
            0 A 1 X 20 A 1
            1 A 1 X 20 B 2
            2 A 1 X 20 C 3
            3 A 1 X 20 D 4
            4 A 1 Y 30 A 1


            Further Simplification



            A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.



            def cartesian_product_simplified(left, right):
            la, lb = len(left), len(right)
            ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])

            return pd.DataFrame(
            np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))

            np.array_equal(cartesian_product_simplified(left, right),
            cartesian_product_basic(left2, right2))
            True




            Performance Comparison



            Benchmarking these solutions on some contrived DataFrames with unique indices, we have



            enter image description here



            Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.



            Functions from Other Answers



            # Wen's answer: https://stackoverflow.com/a/53699198/4909087
            # I've put my own spin on this to make it as fast as possible.
            def cartesian_product_itertools(left, right):
            return pd.DataFrame([
            [*x, *y] for x, y in itertools.product(
            left.values.tolist(), right.values.tolist())])


            Performance Benchmarking Code

            This is the timing script. All functions called here are defined above.



            from timeit import timeit
            import pandas as pd
            import matplotlib.pyplot as plt

            res = pd.DataFrame(
            index=['cartesian_product_basic', 'cartesian_product_generalized',
            'cartesian_product_multi', 'cartesian_product_simplified',
            'cartesian_product_itertools'],
            columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
            dtype=float
            )

            for f in res.index:
            for c in res.columns:
            # print(f,c)
            if f in {'cartesian_product_itertools'} and c > 600:
            continue
            left2 = pd.concat([left] * c, ignore_index=True)
            right2 = pd.concat([right] * c, ignore_index=True)
            stmt = '{}(left2, right2)'.format(f)
            setp = 'from __main__ import left2, right2, {}'.format(f)
            res.at[f, c] = timeit(stmt, setp, number=5)

            ax = res.div(res.min()).T.plot(loglog=True)
            ax.set_xlabel("N");
            ax.set_ylabel("time (relative)");

            plt.show()





            share|improve this answer














            Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:



            def cartesian_product_basic(left, right):
            return (
            left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))

            cartesian_product_basic(left, right)

            col1_x col2_x col1_y col2_y
            0 A 1 X 20
            1 A 1 Y 30
            2 A 1 Z 50
            3 B 2 X 20
            4 B 2 Y 30
            5 B 2 Z 50
            6 C 3 X 20
            7 C 3 Y 30
            8 C 3 Z 50


            How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".



            While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.



            A faster implementation will require NumPy. Here are some famous NumPy implementations of 1D cartesian product. We can build on some of these performant solutions to get our desired output. My favourite, however, is @senderle's first implementation.



            def cartesian_product(*arrays):
            la = len(arrays)
            dtype = np.result_type(*arrays)
            arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
            for i, a in enumerate(np.ix_(*arrays)):
            arr[...,i] = a
            return arr.reshape(-1, la)


            Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames



            This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product, use this to reindex the DataFrames, and



            def cartesian_product_generalized(left, right):
            la, lb = len(left), len(right)
            idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
            return pd.DataFrame(
            np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))

            cartesian_product_generalized(left, right)

            0 1 2 3
            0 A 1 X 20
            1 A 1 Y 30
            2 A 1 Z 50
            3 B 2 X 20
            4 B 2 Y 30
            5 B 2 Z 50
            6 C 3 X 20
            7 C 3 Y 30
            8 C 3 Z 50

            np.array_equal(cartesian_product_generalized(left, right),
            cartesian_product_basic(left, right))
            True


            And, along similar lines,



            left2 = left.copy()
            left2.index = ['s1', 's2', 's1']

            right2 = right.copy()
            right2.index = ['x', 'y', 'y']


            left2
            col1 col2
            s1 A 1
            s2 B 2
            s1 C 3

            right2
            col1 col2
            x X 20
            y Y 30
            y Z 50

            np.array_equal(cartesian_product_generalized(left, right),
            cartesian_product_basic(left2, right2))
            True


            This solution can generalise to multiple DataFrames. For example,



            def cartesian_product_multi(*dfs):
            idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
            return pd.DataFrame(
            np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))

            cartesian_product_multi(*[left, right, left]).head()

            0 1 2 3 4 5
            0 A 1 X 20 A 1
            1 A 1 X 20 B 2
            2 A 1 X 20 C 3
            3 A 1 X 20 D 4
            4 A 1 Y 30 A 1


            Further Simplification



            A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.



            def cartesian_product_simplified(left, right):
            la, lb = len(left), len(right)
            ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])

            return pd.DataFrame(
            np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))

            np.array_equal(cartesian_product_simplified(left, right),
            cartesian_product_basic(left2, right2))
            True




            Performance Comparison



            Benchmarking these solutions on some contrived DataFrames with unique indices, we have



            enter image description here



            Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.



            Functions from Other Answers



            # Wen's answer: https://stackoverflow.com/a/53699198/4909087
            # I've put my own spin on this to make it as fast as possible.
            def cartesian_product_itertools(left, right):
            return pd.DataFrame([
            [*x, *y] for x, y in itertools.product(
            left.values.tolist(), right.values.tolist())])


            Performance Benchmarking Code

            This is the timing script. All functions called here are defined above.



            from timeit import timeit
            import pandas as pd
            import matplotlib.pyplot as plt

            res = pd.DataFrame(
            index=['cartesian_product_basic', 'cartesian_product_generalized',
            'cartesian_product_multi', 'cartesian_product_simplified',
            'cartesian_product_itertools'],
            columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
            dtype=float
            )

            for f in res.index:
            for c in res.columns:
            # print(f,c)
            if f in {'cartesian_product_itertools'} and c > 600:
            continue
            left2 = pd.concat([left] * c, ignore_index=True)
            right2 = pd.concat([right] * c, ignore_index=True)
            stmt = '{}(left2, right2)'.format(f)
            setp = 'from __main__ import left2, right2, {}'.format(f)
            res.at[f, c] = timeit(stmt, setp, number=5)

            ax = res.div(res.min()).T.plot(loglog=True)
            ax.set_xlabel("N");
            ax.set_ylabel("time (relative)");

            plt.show()






            share|improve this answer














            share|improve this answer



            share|improve this answer








            edited 1 hour ago

























            answered 2 hours ago









            coldspeed

            113k18104177




            113k18104177








            • 1




              Happy to to upvote twice (actually 1.5 times) :-), :D, lol :-)
              – U9-Forward
              1 hour ago








            • 1




              @U9-Forward Thanks, appreciate your support!
              – coldspeed
              1 hour ago










            • Good post. Upvoted 1.5 too hehe. Would be nice to see the performance of itertools.product as suggested by @W-B in your post as well.
              – b-fg
              1 hour ago










            • @b-fg Fair enough, done.
              – coldspeed
              1 hour ago










            • Wow, I didn't expect that! Thanks, appreciate the effort.
              – b-fg
              57 mins ago














            • 1




              Happy to to upvote twice (actually 1.5 times) :-), :D, lol :-)
              – U9-Forward
              1 hour ago








            • 1




              @U9-Forward Thanks, appreciate your support!
              – coldspeed
              1 hour ago










            • Good post. Upvoted 1.5 too hehe. Would be nice to see the performance of itertools.product as suggested by @W-B in your post as well.
              – b-fg
              1 hour ago










            • @b-fg Fair enough, done.
              – coldspeed
              1 hour ago










            • Wow, I didn't expect that! Thanks, appreciate the effort.
              – b-fg
              57 mins ago








            1




            1




            Happy to to upvote twice (actually 1.5 times) :-), :D, lol :-)
            – U9-Forward
            1 hour ago






            Happy to to upvote twice (actually 1.5 times) :-), :D, lol :-)
            – U9-Forward
            1 hour ago






            1




            1




            @U9-Forward Thanks, appreciate your support!
            – coldspeed
            1 hour ago




            @U9-Forward Thanks, appreciate your support!
            – coldspeed
            1 hour ago












            Good post. Upvoted 1.5 too hehe. Would be nice to see the performance of itertools.product as suggested by @W-B in your post as well.
            – b-fg
            1 hour ago




            Good post. Upvoted 1.5 too hehe. Would be nice to see the performance of itertools.product as suggested by @W-B in your post as well.
            – b-fg
            1 hour ago












            @b-fg Fair enough, done.
            – coldspeed
            1 hour ago




            @b-fg Fair enough, done.
            – coldspeed
            1 hour ago












            Wow, I didn't expect that! Thanks, appreciate the effort.
            – b-fg
            57 mins ago




            Wow, I didn't expect that! Thanks, appreciate the effort.
            – b-fg
            57 mins ago












            up vote
            2
            down vote













            Using itertools product and recreate the value in dataframe



            import itertools
            l=list(itertools.product(left.values.tolist(),right.values.tolist()))
            pd.DataFrame(list(map(lambda x : sum(x,),l)))
            0 1 2 3
            0 A 1 X 20
            1 A 1 Y 30
            2 A 1 Z 50
            3 B 2 X 20
            4 B 2 Y 30
            5 B 2 Z 50
            6 C 3 X 20
            7 C 3 Y 30
            8 C 3 Z 50





            share|improve this answer

























              up vote
              2
              down vote













              Using itertools product and recreate the value in dataframe



              import itertools
              l=list(itertools.product(left.values.tolist(),right.values.tolist()))
              pd.DataFrame(list(map(lambda x : sum(x,),l)))
              0 1 2 3
              0 A 1 X 20
              1 A 1 Y 30
              2 A 1 Z 50
              3 B 2 X 20
              4 B 2 Y 30
              5 B 2 Z 50
              6 C 3 X 20
              7 C 3 Y 30
              8 C 3 Z 50





              share|improve this answer























                up vote
                2
                down vote










                up vote
                2
                down vote









                Using itertools product and recreate the value in dataframe



                import itertools
                l=list(itertools.product(left.values.tolist(),right.values.tolist()))
                pd.DataFrame(list(map(lambda x : sum(x,),l)))
                0 1 2 3
                0 A 1 X 20
                1 A 1 Y 30
                2 A 1 Z 50
                3 B 2 X 20
                4 B 2 Y 30
                5 B 2 Z 50
                6 C 3 X 20
                7 C 3 Y 30
                8 C 3 Z 50





                share|improve this answer












                Using itertools product and recreate the value in dataframe



                import itertools
                l=list(itertools.product(left.values.tolist(),right.values.tolist()))
                pd.DataFrame(list(map(lambda x : sum(x,),l)))
                0 1 2 3
                0 A 1 X 20
                1 A 1 Y 30
                2 A 1 Z 50
                3 B 2 X 20
                4 B 2 Y 30
                5 B 2 Z 50
                6 C 3 X 20
                7 C 3 Y 30
                8 C 3 Z 50






                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered 1 hour ago









                W-B

                96.7k73162




                96.7k73162






























                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Stack Overflow!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    To learn more, see our tips on writing great answers.





                    Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


                    Please pay close attention to the following guidance:


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53699012%2fperformant-cartesian-product-cross-join-of-two-pandas-dataframes%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    Costa Masnaga

                    Fotorealismo

                    Sidney Franklin