Transpose a dataframe in Pyspark
how can I do to transpose the following data frame in Pyspark?
The idea is to achieve the result that appears below.
import pandas as pd
d = {'id' : pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
'place' : pd.Series(['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
'value' : pd.Series([10, 30, 20, 10, 30, 20, 10, 30, 20], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
'attribute' : pd.Series(['size', 'height', 'weigth', 'size', 'height', 'weigth','size', 'height', 'weigth'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'])}
id place value attribute
a 1 A 10 size
b 1 A 30 height
c 1 A 20 weigth
d 2 A 10 size
e 2 A 30 height
f 2 A 20 weigth
g 3 A 10 size
h 3 A 30 height
i 3 A 20 weigth
d = {'id' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'place' : pd.Series(['A', 'A', 'A'], index=['a', 'b', 'c']),
'size' : pd.Series([10, 30, 20], index=['a', 'b', 'c']),
'height' : pd.Series([10, 30, 20], index=['a', 'b', 'c']),
'weigth' : pd.Series([10, 30, 20], index=['a', 'b', 'c'])}
df = pd.DataFrame(d)
print(df)
id place size height weigth
a 1 A 10 10 10
b 2 A 30 30 30
c 3 A 20 20 20
Any help is welcome. From already thank you very much
apache-spark pyspark apache-spark-sql
add a comment |
how can I do to transpose the following data frame in Pyspark?
The idea is to achieve the result that appears below.
import pandas as pd
d = {'id' : pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
'place' : pd.Series(['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
'value' : pd.Series([10, 30, 20, 10, 30, 20, 10, 30, 20], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
'attribute' : pd.Series(['size', 'height', 'weigth', 'size', 'height', 'weigth','size', 'height', 'weigth'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'])}
id place value attribute
a 1 A 10 size
b 1 A 30 height
c 1 A 20 weigth
d 2 A 10 size
e 2 A 30 height
f 2 A 20 weigth
g 3 A 10 size
h 3 A 30 height
i 3 A 20 weigth
d = {'id' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'place' : pd.Series(['A', 'A', 'A'], index=['a', 'b', 'c']),
'size' : pd.Series([10, 30, 20], index=['a', 'b', 'c']),
'height' : pd.Series([10, 30, 20], index=['a', 'b', 'c']),
'weigth' : pd.Series([10, 30, 20], index=['a', 'b', 'c'])}
df = pd.DataFrame(d)
print(df)
id place size height weigth
a 1 A 10 10 10
b 2 A 30 30 30
c 3 A 20 20 20
Any help is welcome. From already thank you very much
apache-spark pyspark apache-spark-sql
Possible duplicate of How to pivot DataFrame?
– user10465355
Nov 24 '18 at 10:55
add a comment |
how can I do to transpose the following data frame in Pyspark?
The idea is to achieve the result that appears below.
import pandas as pd
d = {'id' : pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
'place' : pd.Series(['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
'value' : pd.Series([10, 30, 20, 10, 30, 20, 10, 30, 20], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
'attribute' : pd.Series(['size', 'height', 'weigth', 'size', 'height', 'weigth','size', 'height', 'weigth'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'])}
id place value attribute
a 1 A 10 size
b 1 A 30 height
c 1 A 20 weigth
d 2 A 10 size
e 2 A 30 height
f 2 A 20 weigth
g 3 A 10 size
h 3 A 30 height
i 3 A 20 weigth
d = {'id' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'place' : pd.Series(['A', 'A', 'A'], index=['a', 'b', 'c']),
'size' : pd.Series([10, 30, 20], index=['a', 'b', 'c']),
'height' : pd.Series([10, 30, 20], index=['a', 'b', 'c']),
'weigth' : pd.Series([10, 30, 20], index=['a', 'b', 'c'])}
df = pd.DataFrame(d)
print(df)
id place size height weigth
a 1 A 10 10 10
b 2 A 30 30 30
c 3 A 20 20 20
Any help is welcome. From already thank you very much
apache-spark pyspark apache-spark-sql
how can I do to transpose the following data frame in Pyspark?
The idea is to achieve the result that appears below.
import pandas as pd
d = {'id' : pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
'place' : pd.Series(['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
'value' : pd.Series([10, 30, 20, 10, 30, 20, 10, 30, 20], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
'attribute' : pd.Series(['size', 'height', 'weigth', 'size', 'height', 'weigth','size', 'height', 'weigth'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'])}
id place value attribute
a 1 A 10 size
b 1 A 30 height
c 1 A 20 weigth
d 2 A 10 size
e 2 A 30 height
f 2 A 20 weigth
g 3 A 10 size
h 3 A 30 height
i 3 A 20 weigth
d = {'id' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'place' : pd.Series(['A', 'A', 'A'], index=['a', 'b', 'c']),
'size' : pd.Series([10, 30, 20], index=['a', 'b', 'c']),
'height' : pd.Series([10, 30, 20], index=['a', 'b', 'c']),
'weigth' : pd.Series([10, 30, 20], index=['a', 'b', 'c'])}
df = pd.DataFrame(d)
print(df)
id place size height weigth
a 1 A 10 10 10
b 2 A 30 30 30
c 3 A 20 20 20
Any help is welcome. From already thank you very much
apache-spark pyspark apache-spark-sql
apache-spark pyspark apache-spark-sql
edited Nov 25 '18 at 9:40
user10465355
1,9452418
1,9452418
asked Nov 23 '18 at 21:28
lolololo
208211
208211
Possible duplicate of How to pivot DataFrame?
– user10465355
Nov 24 '18 at 10:55
add a comment |
Possible duplicate of How to pivot DataFrame?
– user10465355
Nov 24 '18 at 10:55
Possible duplicate of How to pivot DataFrame?
– user10465355
Nov 24 '18 at 10:55
Possible duplicate of How to pivot DataFrame?
– user10465355
Nov 24 '18 at 10:55
add a comment |
2 Answers
2
active
oldest
votes
First of all I don't think your sample output is correct. Your input data has size set to 10, height set to 30 and weigth set to 20 for every id, but the desired output set's everything to 10 for id 1. If this is really what you, please explain it a bit more. If this was a mistake, then you want to use the pivot function. Example:
from pyspark.sql.functions import first
l =[( 1 ,'A', 10, 'size' ),
( 1 , 'A', 30, 'height' ),
( 1 , 'A', 20, 'weigth' ),
( 2 , 'A', 10, 'size' ),
( 2 , 'A', 30, 'height' ),
( 2 , 'A', 20, 'weigth' ),
( 3 , 'A', 10, 'size' ),
( 3 , 'A', 30, 'height' ),
( 3 , 'A', 20, 'weigth' )]
df = spark.createDataFrame(l, ['id','place', 'value', 'attribute'])
df.groupBy(df.id, df.place).pivot('attribute').agg(first("value")).show()
+---+-----+------+----+------+
| id|place|height|size|weigth|
+---+-----+------+----+------+
| 2| A| 30| 10| 20|
| 3| A| 30| 10| 20|
| 1| A| 30| 10| 20|
+---+-----+------+----+------+
Thank you! that is what I was looking for!
– lolo
Nov 26 '18 at 0:36
add a comment |
Refer to the documentation. Pivoting
is always done in context to aggregation, and I have chosen sum
here. So, if for same id, place or attribute, there are multiple values, then their sum will be taken. You could use min,max or mean as well, depending upon what you need.
df = df.groupBy(["id","place"]).pivot("attribute").sum("value")
This link also addresses the same question.
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53453118%2ftranspose-a-dataframe-in-pyspark%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
First of all I don't think your sample output is correct. Your input data has size set to 10, height set to 30 and weigth set to 20 for every id, but the desired output set's everything to 10 for id 1. If this is really what you, please explain it a bit more. If this was a mistake, then you want to use the pivot function. Example:
from pyspark.sql.functions import first
l =[( 1 ,'A', 10, 'size' ),
( 1 , 'A', 30, 'height' ),
( 1 , 'A', 20, 'weigth' ),
( 2 , 'A', 10, 'size' ),
( 2 , 'A', 30, 'height' ),
( 2 , 'A', 20, 'weigth' ),
( 3 , 'A', 10, 'size' ),
( 3 , 'A', 30, 'height' ),
( 3 , 'A', 20, 'weigth' )]
df = spark.createDataFrame(l, ['id','place', 'value', 'attribute'])
df.groupBy(df.id, df.place).pivot('attribute').agg(first("value")).show()
+---+-----+------+----+------+
| id|place|height|size|weigth|
+---+-----+------+----+------+
| 2| A| 30| 10| 20|
| 3| A| 30| 10| 20|
| 1| A| 30| 10| 20|
+---+-----+------+----+------+
Thank you! that is what I was looking for!
– lolo
Nov 26 '18 at 0:36
add a comment |
First of all I don't think your sample output is correct. Your input data has size set to 10, height set to 30 and weigth set to 20 for every id, but the desired output set's everything to 10 for id 1. If this is really what you, please explain it a bit more. If this was a mistake, then you want to use the pivot function. Example:
from pyspark.sql.functions import first
l =[( 1 ,'A', 10, 'size' ),
( 1 , 'A', 30, 'height' ),
( 1 , 'A', 20, 'weigth' ),
( 2 , 'A', 10, 'size' ),
( 2 , 'A', 30, 'height' ),
( 2 , 'A', 20, 'weigth' ),
( 3 , 'A', 10, 'size' ),
( 3 , 'A', 30, 'height' ),
( 3 , 'A', 20, 'weigth' )]
df = spark.createDataFrame(l, ['id','place', 'value', 'attribute'])
df.groupBy(df.id, df.place).pivot('attribute').agg(first("value")).show()
+---+-----+------+----+------+
| id|place|height|size|weigth|
+---+-----+------+----+------+
| 2| A| 30| 10| 20|
| 3| A| 30| 10| 20|
| 1| A| 30| 10| 20|
+---+-----+------+----+------+
Thank you! that is what I was looking for!
– lolo
Nov 26 '18 at 0:36
add a comment |
First of all I don't think your sample output is correct. Your input data has size set to 10, height set to 30 and weigth set to 20 for every id, but the desired output set's everything to 10 for id 1. If this is really what you, please explain it a bit more. If this was a mistake, then you want to use the pivot function. Example:
from pyspark.sql.functions import first
l =[( 1 ,'A', 10, 'size' ),
( 1 , 'A', 30, 'height' ),
( 1 , 'A', 20, 'weigth' ),
( 2 , 'A', 10, 'size' ),
( 2 , 'A', 30, 'height' ),
( 2 , 'A', 20, 'weigth' ),
( 3 , 'A', 10, 'size' ),
( 3 , 'A', 30, 'height' ),
( 3 , 'A', 20, 'weigth' )]
df = spark.createDataFrame(l, ['id','place', 'value', 'attribute'])
df.groupBy(df.id, df.place).pivot('attribute').agg(first("value")).show()
+---+-----+------+----+------+
| id|place|height|size|weigth|
+---+-----+------+----+------+
| 2| A| 30| 10| 20|
| 3| A| 30| 10| 20|
| 1| A| 30| 10| 20|
+---+-----+------+----+------+
First of all I don't think your sample output is correct. Your input data has size set to 10, height set to 30 and weigth set to 20 for every id, but the desired output set's everything to 10 for id 1. If this is really what you, please explain it a bit more. If this was a mistake, then you want to use the pivot function. Example:
from pyspark.sql.functions import first
l =[( 1 ,'A', 10, 'size' ),
( 1 , 'A', 30, 'height' ),
( 1 , 'A', 20, 'weigth' ),
( 2 , 'A', 10, 'size' ),
( 2 , 'A', 30, 'height' ),
( 2 , 'A', 20, 'weigth' ),
( 3 , 'A', 10, 'size' ),
( 3 , 'A', 30, 'height' ),
( 3 , 'A', 20, 'weigth' )]
df = spark.createDataFrame(l, ['id','place', 'value', 'attribute'])
df.groupBy(df.id, df.place).pivot('attribute').agg(first("value")).show()
+---+-----+------+----+------+
| id|place|height|size|weigth|
+---+-----+------+----+------+
| 2| A| 30| 10| 20|
| 3| A| 30| 10| 20|
| 1| A| 30| 10| 20|
+---+-----+------+----+------+
answered Nov 24 '18 at 1:27
cronoikcronoik
425314
425314
Thank you! that is what I was looking for!
– lolo
Nov 26 '18 at 0:36
add a comment |
Thank you! that is what I was looking for!
– lolo
Nov 26 '18 at 0:36
Thank you! that is what I was looking for!
– lolo
Nov 26 '18 at 0:36
Thank you! that is what I was looking for!
– lolo
Nov 26 '18 at 0:36
add a comment |
Refer to the documentation. Pivoting
is always done in context to aggregation, and I have chosen sum
here. So, if for same id, place or attribute, there are multiple values, then their sum will be taken. You could use min,max or mean as well, depending upon what you need.
df = df.groupBy(["id","place"]).pivot("attribute").sum("value")
This link also addresses the same question.
add a comment |
Refer to the documentation. Pivoting
is always done in context to aggregation, and I have chosen sum
here. So, if for same id, place or attribute, there are multiple values, then their sum will be taken. You could use min,max or mean as well, depending upon what you need.
df = df.groupBy(["id","place"]).pivot("attribute").sum("value")
This link also addresses the same question.
add a comment |
Refer to the documentation. Pivoting
is always done in context to aggregation, and I have chosen sum
here. So, if for same id, place or attribute, there are multiple values, then their sum will be taken. You could use min,max or mean as well, depending upon what you need.
df = df.groupBy(["id","place"]).pivot("attribute").sum("value")
This link also addresses the same question.
Refer to the documentation. Pivoting
is always done in context to aggregation, and I have chosen sum
here. So, if for same id, place or attribute, there are multiple values, then their sum will be taken. You could use min,max or mean as well, depending upon what you need.
df = df.groupBy(["id","place"]).pivot("attribute").sum("value")
This link also addresses the same question.
answered Nov 25 '18 at 10:10
cph_stocph_sto
2,3392421
2,3392421
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53453118%2ftranspose-a-dataframe-in-pyspark%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Possible duplicate of How to pivot DataFrame?
– user10465355
Nov 24 '18 at 10:55