How to efficiently wait for an external task to finish while Spark is working?
I have the following problem: there is a large CSV file which we read with Spark. We need to transform each line of the file and write the result to another text file. The transformation needs data from an external service, which we invoke via REST, and we must have the service's answer before we write the output file. In the meantime, we can do other work in Spark.
A naive implementation would look like this:
val keys = spark.read.csv("/path/to/myfile.csv")
  .map(row => {
    val result = new Param(row(0), row(3), row(7))
    result
  })
  .collect()

val enrichedData: Map[String,String] = keys.map(key =>
  externalService.getValue(key)) // unpredictable response time

val finalResult = spark.read.csv("/path/to/myfile.csv") // read same file twice
  .map(row => doSomeTransformation(row))
  .map(row => doSomeMoreTransformation(row))
  .map(row => andAnotherCostlyOperation(row))
  .map(row => {
    val key = row(0)
    val myData = enrichedData(key) // only here we use the data from ext.serv.
    enrichRow(row, myData)
  })
  .write.csv("/path/to/output.csv")
The problem here is that I'm reading the same input file twice. First I need to collect all the keys from the file and send them to the external service. Once the external service responds, I can go over the same file again and use the data I've obtained to finish processing the data.
How can I do this more efficiently? I don't need the data from the external service until the very last step, so it would be nice if Spark could run the other transformations while the service is working, and only wait (potentially 0 seconds) for the service's data when it reaches the last map() function.
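To illustrate the overlap I have in mind, here is a minimal sketch in plain Scala with no Spark involved. It assumes the external call can be started on a separate thread with a Future. The names slowLookup, costlyTransform and OverlapSketch are hypothetical stand-ins, not part of my real code or of any library:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object OverlapSketch {
  // Hypothetical stand-in for the REST call; the sleep simulates its latency.
  def slowLookup(keys: Seq[String]): Map[String, String] = {
    Thread.sleep(200)
    keys.map(k => k -> s"value-for-$k").toMap
  }

  // Hypothetical stand-in for the costly steps that do not need the service.
  def costlyTransform(row: Seq[String]): Seq[String] = row.map(_.toUpperCase)

  def run(rows: Seq[Seq[String]]): Seq[Seq[String]] = {
    val keys = rows.map(_.head)

    // Start the external call first; it runs on another thread.
    val enrichedF: Future[Map[String, String]] = Future(slowLookup(keys))

    // Meanwhile, do the work that does not depend on the service.
    val transformed = rows.map(costlyTransform)

    // Block only at the last step; if the call has already finished, the wait is ~0.
    val enriched = Await.result(enrichedF, 30.seconds)

    // Append the looked-up value to each transformed row.
    transformed.zip(keys).map { case (row, key) => row :+ enriched(key) }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(Seq("a", "x"), Seq("b", "y"))
    run(rows).foreach(r => println(r.mkString(",")))
  }
}
```

Since Spark transformations are lazy, the real job would also need some action to actually run the intermediate steps while the Future is pending. That part is not shown here, and it is what I am unsure how to do.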
apache-spark
How about invoking the external service while creating "finalResult", i.e. without preparing "enrichedData" beforehand?
– mangusta
Nov 25 '18 at 12:44
The problem is that if I invoke the external service, it can take a long time to get back a response. So if I invoke it in the end when creating finalResult, it could block the whole process. It would be a more efficient use of time to invoke the service at the beginning, do all the other work in the meantime, and in the end collect the result from the external service.
– Csaba
Nov 25 '18 at 20:43
asked Nov 25 '18 at 12:06
Csaba