PySpark: Join a table column with one of two columns from another table
My problem is as follows:
Table 1
ID1  ID2
1    2
3    4

Table 2
C1  VALUE
1   London
4   Texas

Table 3
C3  VALUE
2   Paris
3   Arizona
Table 1 has primary and secondary IDs. I need to create a final output that aggregates the values from Table 2 and Table 3 based on the ID mapping in Table 1: if a value in Table 2 or Table 3 is mapped to either of the two IDs in a row, it should be aggregated under a single key.
My final output should look like this:
ID  Aggregated
1   [2, London, Paris]    // Paris is mapped to 2, which in turn is mapped to 1
3   [4, Texas, Arizona]   // Texas is mapped to 4, which in turn is mapped to 3
Any suggestions on how to achieve this in PySpark?
I am not sure whether joining the tables will help here.
I was thinking a paired RDD might help, but I have not been able to come up with a proper solution.
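To make the setup concrete, the sample tables could be created roughly like this (a minimal sketch with illustrative DataFrame names, not my actual loading code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toy versions of the three tables, using the column names shown above
table1 = spark.createDataFrame([(1, 2), (3, 4)], ["ID1", "ID2"])
table2 = spark.createDataFrame([(1, "London"), (4, "Texas")], ["C1", "VALUE"])
table3 = spark.createDataFrame([(2, "Paris"), (3, "Arizona")], ["C3", "VALUE"])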
Thanks
apache-spark pyspark apache-spark-sql pyspark-sql
asked Nov 20 at 15:32
Alok
1 Answer
Below is a very straightforward approach:
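# set up the three example tables as temporary views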
spark.sql(
"""
select 1 as id1,2 as id2
union
select 3 as id1,4 as id2
""").createOrReplaceTempView("table1")
spark.sql(
"""
select 1 as c1, 'london' as city
union
select 4 as c1, 'texas' as city
""").createOrReplaceTempView("table2")
spark.sql(
"""
select 2 as c1, 'paris' as city
union
select 3 as c1, 'arizona' as city
""").createOrReplaceTempView("table3")
spark.table("table1").show()
spark.table("table2").show()
spark.table("table3").show()
# for simplicity, union table2 and table3 into one mapping view
spark.sql(""" select * from table2 union all select * from table3 """).createOrReplaceTempView("city_mappings")
spark.table("city_mappings").show()
# now join to the ids:
spark.sql("""
select id1, id2, city from table1
join city_mappings on c1 = id1 or c1 = id2
""").createOrReplaceTempView("id_to_city")
# and finally you can aggregate:
spark.sql("""
select id1, id2, collect_list(city)
from id_to_city
group by id1, id2
""").createOrReplaceTempView("result")
table("result").show()
# the result looks like this; you can reshape it to better suit your needs:
+---+---+------------------+
|id1|id2|collect_list(city)|
+---+---+------------------+
| 1| 2| [london, paris]|
| 3| 4| [texas, arizona]|
+---+---+------------------+
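If you prefer the DataFrame API over SQL, the same idea can be sketched like this (a sketch assuming the temp views above and Spark 2.4+ for concat on arrays; the final select just prepends the secondary ID so the shape matches the requested output):

from pyspark.sql import functions as F

table1 = spark.table("table1")                                       # columns: id1, id2
mappings = spark.table("table2").unionByName(spark.table("table3"))  # columns: c1, city

result = (
    table1
    # join a row whenever the mapping key matches either of the two IDs
    .join(mappings, (mappings.c1 == table1.id1) | (mappings.c1 == table1.id2))
    .groupBy("id1", "id2")
    .agg(F.collect_list("city").alias("cities"))
    # reshape to ID -> [secondary id, city, city, ...] as in the question
    .select(
        F.col("id1").alias("ID"),
        F.concat(F.array(F.col("id2").cast("string")), F.col("cities")).alias("Aggregated"),
    )
)
result.show(truncate=False)

Note that the join condition uses an OR, so Spark cannot use an equi-join here; for small mapping tables this is fine, but for large ones you may want to union two separate equi-joins instead.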
answered Nov 20 at 16:41
Vitaliy