Does coalesce(1) on a DataFrame before write have any impact on performance?
Before I write a DataFrame to HDFS, I call coalesce(1) so that it writes only one file. That makes the output easier to handle manually when copying it around, fetching it from HDFS, and so on.
I write the output like this:
outputData.coalesce(1).write.parquet(outputPath)
(outputData is an org.apache.spark.sql.DataFrame)
Is there any impact on performance compared to not coalescing?
outputData.write.parquet(outputPath)
apache-spark dataframe hdfs parquet
asked Nov 19 at 4:31
Haha TTpro
1 Answer
I would not recommend doing that. The whole purpose of distributed computing is to have data and processing spread across multiple machines and to take advantage of the CPU and memory of many machines (worker nodes).
In your case, you are trying to put everything in one place. Why do you need a distributed file system if you want to write a single file with just one partition? Performance can be an issue, but it can only be assessed by comparing runs with and without coalesce on a large amount of data spread across multiple nodes of the cluster.
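To make that before/after comparison concrete, here is a minimal timing sketch. It is not a benchmark of your job: the generated `outputData`, the `local[4]` master, the `/tmp/coalesce-timing` paths and the `timeMs` helper are all assumptions made for illustration, and you would swap in your real DataFrame and an HDFS path on the cluster.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CoalesceWriteTiming {
  // Hypothetical helper: runs a block and returns elapsed wall-clock milliseconds.
  def timeMs(block: => Unit): Long = {
    val start = System.nanoTime()
    block
    (System.nanoTime() - start) / 1000000L
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("coalesce-write-timing")
      .master("local[4]") // assumption: local run; drop this when submitting to a cluster
      .getOrCreate()

    // Assumption: synthetic stand-in for the real outputData, spread over 8 partitions.
    val outputData: DataFrame = spark.range(0L, 5000000L)
      .selectExpr("id", "id % 100 AS bucket")
      .repartition(8)

    val basePath = "/tmp/coalesce-timing" // hypothetical output location

    // Default write: one output file per partition, written by parallel tasks.
    val multiMs = timeMs {
      outputData.write.mode("overwrite").parquet(s"$basePath/multi")
    }

    // coalesce(1): a single task writes a single output file.
    val singleMs = timeMs {
      outputData.coalesce(1).write.mode("overwrite").parquet(s"$basePath/single")
    }

    println(s"write without coalesce: $multiMs ms")
    println(s"write with coalesce(1): $singleMs ms")

    spark.stop()
  }
}
```

On small local data the difference may be negligible, so the numbers only become meaningful on a realistic data volume. One more thing to keep in mind: the Spark API documentation for coalesce notes that a drastic coalesce such as numPartitions = 1 may cause the computation to take place on fewer nodes than you would like, and suggests repartition instead, which adds a shuffle step but keeps the upstream partitions executing in parallel. If you still need a single file, it may be worth timing `outputData.repartition(1)` as a third variant.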
edited Nov 19 at 11:48
user6910411
answered Nov 19 at 7:38
BDA