PySpark: Join a table column with one of two columns from another table
My problem is as follows:

Table 1
ID1 ID2
1   2
3   4

Table 2
C1  VALUE
1   London
4   Texas

Table 3
C3  VALUE
2   Paris
3   Arizona

Table 1 holds a primary and a secondary ID per row. I need to produce a final output that aggregates the values from Table 2 and Table 3 based on the ID mapping in Table 1: if a value in Table 2 or Table 3 is mapped to either of the two IDs, it should be aggregated under the same row.

My final output should look like this:

ID  Aggregated
1   [2, London, Paris]    // Paris is mapped to 2, which in turn is mapped to 1
3   [4, Texas, Arizona]   // Texas is mapped to 4, which in turn is mapped to 3

Any suggestions on how to achieve this in PySpark? I am not sure whether joining the tables will help with this problem. I was thinking a paired RDD might work, but I have not been able to come up with a proper solution.

Thanks
      apache-spark pyspark apache-spark-sql pyspark-sql






asked Nov 20 at 15:32 by Alok
1 Answer
Below is a straightforward approach:

spark.sql("""
select 1 as id1, 2 as id2
union
select 3 as id1, 4 as id2
""").createOrReplaceTempView("table1")

spark.sql("""
select 1 as c1, 'london' as city
union
select 4 as c1, 'texas' as city
""").createOrReplaceTempView("table2")

spark.sql("""
select 2 as c1, 'paris' as city
union
select 3 as c1, 'arizona' as city
""").createOrReplaceTempView("table3")

spark.table("table1").show()
spark.table("table2").show()
spark.table("table3").show()

# For simplicity, union table2 and table3
spark.sql("select * from table2 union all select * from table3").createOrReplaceTempView("city_mappings")
spark.table("city_mappings").show()

# Now join the mappings to the IDs
spark.sql("""
select id1, id2, city from table1
join city_mappings on c1 = id1 or c1 = id2
""").createOrReplaceTempView("id_to_city")

# And finally aggregate
spark.sql("""
select id1, id2, collect_list(city)
from id_to_city
group by id1, id2
""").createOrReplaceTempView("result")

spark.table("result").show()

# The result looks like this; you can reshape it to better suit your needs:
+---+---+------------------+
|id1|id2|collect_list(city)|
+---+---+------------------+
|  1|  2|   [london, paris]|
|  3|  4|  [texas, arizona]|
+---+---+------------------+





answered Nov 20 at 16:41 by Vitaliy





























