PySpark: Join a table column with one of two columns from another table
My problem is as follows:

Table 1
ID1 ID2
1   2
3   4

Table 2
C1  VALUE
1   London
4   Texas

Table 3
C3  VALUE
2   Paris
3   Arizona

Table 1 holds a primary and a secondary ID per row. I need to produce a final output that aggregates the values from Table 2 and Table 3 based on the ID mapping in Table 1: if a value in Table 2 or Table 3 is mapped to either of the two IDs, it should be aggregated under the same row.

My final output should look like this:

ID  Aggregated
1   [2, London, Paris]    // Paris is mapped to 2, which in turn is mapped to 1
3   [4, Texas, Arizona]   // Texas is mapped to 4, which in turn is mapped to 3

Any suggestions on how to achieve this in PySpark? I am not sure whether joining the tables will help with this problem. I was thinking a paired RDD might work, but I have not been able to come up with a proper solution.

Thanks
      apache-spark pyspark apache-spark-sql pyspark-sql






asked Nov 20 at 15:32 by Alok
1 Answer
Below is a straightforward approach:

spark.sql("""
select 1 as id1, 2 as id2
union
select 3 as id1, 4 as id2
""").createOrReplaceTempView("table1")

spark.sql("""
select 1 as c1, 'london' as city
union
select 4 as c1, 'texas' as city
""").createOrReplaceTempView("table2")

spark.sql("""
select 2 as c1, 'paris' as city
union
select 3 as c1, 'arizona' as city
""").createOrReplaceTempView("table3")

spark.table("table1").show()
spark.table("table2").show()
spark.table("table3").show()

# For simplicity, union table2 and table3
spark.sql("select * from table2 union all select * from table3").createOrReplaceTempView("city_mappings")
spark.table("city_mappings").show()

# Now join the mappings to the IDs
spark.sql("""
select id1, id2, city from table1
join city_mappings on c1 = id1 or c1 = id2
""").createOrReplaceTempView("id_to_city")

# And finally aggregate
spark.sql("""
select id1, id2, collect_list(city)
from id_to_city
group by id1, id2
""").createOrReplaceTempView("result")

spark.table("result").show()

# The result looks like this; you can reshape it to better suit your needs:
+---+---+------------------+
|id1|id2|collect_list(city)|
+---+---+------------------+
|  1|  2|   [london, paris]|
|  3|  4|  [texas, arizona]|
+---+---+------------------+





answered Nov 20 at 16:41 by Vitaliy





























