Does calling coalesce(1) on a DataFrame before writing have any impact on performance?

Before I write a DataFrame to HDFS, I call coalesce(1) so that it writes only one file. That makes the output easier to handle manually when copying it around, fetching it from HDFS, and so on.



I write the output like this:



outputData.coalesce(1).write.parquet(outputPath)


(outputData is an org.apache.spark.sql.DataFrame.)



I would like to know whether this has any impact on performance compared with writing without coalesce:



outputData.write.parquet(outputPath)

Tags: apache-spark, dataframe, hdfs, parquet
asked Nov 19 at 4:31 by Haha TTpro

1 Answer

I would not recommend doing that. The whole purpose of distributed computing is to keep data and processing spread across multiple machines and to take advantage of the CPU and memory of many worker nodes.



In your case, you are trying to put everything in one place. Why do you need a distributed file system if you want to write a single file from just one partition? Performance can be an issue, but you can only assess it by measuring the write with and without coalesce on a large amount of data that is spread across multiple nodes in the cluster.
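One way to make that before/after comparison concrete is a small timing harness. The following is only a sketch under stated assumptions: the object name `CoalesceWriteTiming`, the helper `timeMs`, the synthetic `outputData`, the `local[4]` master, and the temporary output directory are all made up for illustration and are not from the question. On a real cluster you would use your own DataFrame and an HDFS path, and repeat each measurement several times.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}

object CoalesceWriteTiming {

  // Hypothetical helper: runs `block` once and returns the elapsed
  // wall-clock time in milliseconds.
  def timeMs(block: => Unit): Long = {
    val start = System.nanoTime()
    block
    (System.nanoTime() - start) / 1000000L
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session with 4 cores, for demonstration only.
    val spark = SparkSession.builder()
      .appName("coalesce-write-timing")
      .master("local[4]")
      .getOrCreate()

    // Assumption: synthetic data standing in for the real outputData,
    // spread over 8 partitions.
    val outputData: DataFrame =
      spark.range(0L, 10000000L).toDF("id").repartition(8)

    // Assumption: a local temporary directory instead of an HDFS path.
    val base = Files.createTempDirectory("coalesce-demo").toString

    // Write with the existing partitioning: one output file per partition.
    val manyFilesMs = timeMs {
      outputData.write.mode("overwrite").parquet(s"$base/many")
    }

    // Write after coalesce(1): a single task produces a single file.
    val oneFileMs = timeMs {
      outputData.coalesce(1).write.mode("overwrite").parquet(s"$base/one")
    }

    println(s"without coalesce: $manyFilesMs ms")
    println(s"with coalesce(1): $oneFileMs ms")

    spark.stop()
  }
}
```

A single run like this is noisy, so treat the numbers as a rough indication only. Note also that coalesce(1) does not add a shuffle, so depending on the query plan it may cause the upstream computation to run in a single task as well; the Spark API documentation for coalesce mentions this and points to repartition as the alternative.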

answered Nov 19 at 7:38 by BDA
edited Nov 19 at 11:48 by user6910411