Evaluating association strengths of items


Say I have a list of animals with their counts:



import numpy as np
import pandas as pd
from random import randint

animal_list = ['Monkey', 'Tiger', 'Cat', 'Dog', 'Lion']

# Random count between 10 and 20 for each animal
table = np.zeros(5, dtype=int)
for i in range(5):
    table[i] = randint(10, 20)

df1 = pd.DataFrame(columns=['Animal', 'Count'])
df1['Animal'] = animal_list
df1['Count'] = table
df1





And I have a matrix of how many times they appear together:



# Random co-occurrence count between 0 and 9 for each ordered pair
table = np.zeros((5, 5), dtype=int)
for i in range(5):
    for j in range(5):
        table[i][j] = randint(0, 9)

df2 = pd.DataFrame(table, columns=animal_list, index=animal_list)
df2





I want to find the animals' association strength, which is defined as follows: if Lion and Cat appear together 5 times, Lion's count is 10 and Cat's count is 15, then the Lion -> Cat association strength is 5/10 = 0.5 and the Cat -> Lion association strength is 5/15 ≈ 0.33.
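For concreteness, here is the same worked example in code, using just the numbers from the sentence above:

lion_cat_together = 5              # times Lion and Cat appear together
lion_count, cat_count = 10, 15     # individual counts

lion_to_cat = lion_cat_together / lion_count   # 5/10 = 0.5
cat_to_lion = lion_cat_together / cat_count    # 5/15 = 0.333...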



I do it like so:



assoc_df = pd.DataFrame(columns=['Animal 1', 'Animal 2', 'Association Strength'])
for row_word in df2:
    for col_word in df2:
        if row_word != col_word:
            assoc_df = assoc_df.append(
                {'Animal 1': row_word,
                 'Animal 2': col_word,
                 'Association Strength': df2[col_word][row_word] / df1[df1.Animal == row_word]['Count'].values[0]},
                ignore_index=True)

assoc_df





The problem is that, with the two nested for loops, the complexity is O(n^2). In my real dataset there are ~1000 animals to loop over, so computing the association strength table takes hours.



So, how do I better optimize the generation of this association table?
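This is roughly the direction I imagine a vectorized solution taking, but I am not sure it is correct or idiomatic (a minimal, untested sketch that assumes df1 and df2 are built as above):

# Divide each row of the co-occurrence matrix by that row's animal count,
# then reshape to long format and drop the diagonal.
counts = df1.set_index('Animal')['Count']
strength = df2.div(counts, axis=0)                 # row for animal X divided by X's count
long_form = strength.stack().reset_index()
long_form.columns = ['Animal 1', 'Animal 2', 'Association Strength']
long_form = long_form[long_form['Animal 1'] != long_form['Animal 2']]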



P.S.: In most practical use cases, df2 is a symmetric matrix, since "X appears together with Y" generally means the same as "Y appears together with X". So I am OK with solutions that assume df2 is symmetric and cut the running time in half. In the example above, df2 is not symmetric; the non-symmetric case applies to situations where we want to express meanings such as "X appears after Y" and "Y appears after X", which are not the same.
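To make the symmetric case concrete, here is a minimal sketch (assuming df2 is symmetric, and reusing df1, df2 and animal_list from above) that visits each unordered pair only once and emits both directions from it:

import itertools

counts = df1.set_index('Animal')['Count']
rows = []
for a, b in itertools.combinations(animal_list, 2):   # each unordered pair once
    together = df2.loc[a, b]                           # equals df2.loc[b, a] by assumption
    rows.append((a, b, together / counts[a]))
    rows.append((b, a, together / counts[b]))

assoc_sym = pd.DataFrame(rows, columns=['Animal 1', 'Animal 2', 'Association Strength'])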










Tags: python, performance, python-3.x, community-challenge, data-mining





