Transpose a dataframe in Pyspark
How can I transpose the following data frame in PySpark?



The idea is to achieve the result that appears below.



import pandas as pd

d = {'id' : pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
     'place' : pd.Series(['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
     'value' : pd.Series([10, 30, 20, 10, 30, 20, 10, 30, 20], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
     'attribute' : pd.Series(['size', 'height', 'weigth', 'size', 'height', 'weigth', 'size', 'height', 'weigth'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'])}

df = pd.DataFrame(d)
print(df)

   id place  value attribute
a   1     A     10      size
b   1     A     30    height
c   1     A     20    weigth
d   2     A     10      size
e   2     A     30    height
f   2     A     20    weigth
g   3     A     10      size
h   3     A     30    height
i   3     A     20    weigth

d = {'id' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'place' : pd.Series(['A', 'A', 'A'], index=['a', 'b', 'c']),
     'size' : pd.Series([10, 30, 20], index=['a', 'b', 'c']),
     'height' : pd.Series([10, 30, 20], index=['a', 'b', 'c']),
     'weigth' : pd.Series([10, 30, 20], index=['a', 'b', 'c'])}

df = pd.DataFrame(d)
print(df)

   id place  size  height  weigth
a   1     A    10      10      10
b   2     A    30      30      30
c   3     A    20      20      20
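

For reference, the same long-to-wide reshape can be sketched in plain pandas as below; this assumes df still holds the first, long-format frame above and that each (id, place, attribute) combination occurs once:

# long -> wide: one column per distinct value of 'attribute'
wide = df.pivot_table(index=['id', 'place'], columns='attribute',
                      values='value', aggfunc='first').reset_index()
print(wide)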


Any help is welcome. Thanks in advance.










apache-spark pyspark apache-spark-sql

asked Nov 23 '18 at 21:28 by lolo, edited Nov 25 '18 at 9:40 by user10465355

  • Possible duplicate of How to pivot DataFrame?

    – user10465355
    Nov 24 '18 at 10:55
2 Answers




















First of all, I don't think your sample output is correct. Your input data has size set to 10, height set to 30 and weigth set to 20 for every id, but the desired output sets everything to 10 for id 1. If this is really what you want, please explain it a bit more. If it was a mistake, then you want to use the pivot function. Example:



from pyspark.sql.functions import first

l = [(1, 'A', 10, 'size'),
     (1, 'A', 30, 'height'),
     (1, 'A', 20, 'weigth'),
     (2, 'A', 10, 'size'),
     (2, 'A', 30, 'height'),
     (2, 'A', 20, 'weigth'),
     (3, 'A', 10, 'size'),
     (3, 'A', 30, 'height'),
     (3, 'A', 20, 'weigth')]

df = spark.createDataFrame(l, ['id', 'place', 'value', 'attribute'])

df.groupBy(df.id, df.place).pivot('attribute').agg(first('value')).show()

+---+-----+------+----+------+
| id|place|height|size|weigth|
+---+-----+------+----+------+
| 2| A| 30| 10| 20|
| 3| A| 30| 10| 20|
| 1| A| 30| 10| 20|
+---+-----+------+----+------+
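

If the set of attributes is known in advance, a common refinement (sketched here, assuming the same df as above) is to pass the values to pivot explicitly, which spares Spark an extra pass over the data to collect the distinct attribute values:

from pyspark.sql.functions import first

# Listing the pivot values up front avoids the separate job that
# would otherwise compute the distinct values of 'attribute'.
df.groupBy('id', 'place') \
  .pivot('attribute', ['size', 'height', 'weigth']) \
  .agg(first('value')) \
  .show()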





answered Nov 24 '18 at 1:27 by cronoik

  • Thank you! That is what I was looking for!

    – lolo
    Nov 26 '18 at 0:36

































Refer to the documentation. Pivoting is always done in the context of an aggregation, and I have chosen sum here. So if there are multiple values for the same id, place and attribute, their sum is taken. You could use min, max or mean as well, depending on what you need.



df = df.groupBy(["id","place"]).pivot("attribute").sum("value")


This link also addresses the same question.
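

For instance, a minimal sketch of swapping the aggregation (assuming df is the long-format frame built in the other answer):

from pyspark.sql import functions as F

# Same pivot, but duplicate (id, place, attribute) rows are now
# resolved by their mean value rather than their sum.
df.groupBy(['id', 'place']).pivot('attribute').agg(F.avg('value')).show()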






answered Nov 25 '18 at 10:10 by cph_sto