How to efficiently wait for an external task to finish while Spark is working?












0















I have the following problem: there is a large CSV file which we read with spark. We need to transform each line of the file and write it back to another text file. During this transformation, we need to invoke an external service via REST and before we write the file, we need to get the answer back from the service. In the meantime, we can do other actions on Spark.



A naive implementation would look like this:



val keys = spark.read.csv("/path/to/myfile.csv")
.map(row => {
val result = new Param(row(0), row(3), row(7))
result
})
.collect()

val enrichedData: Map[String,String] = keys.map(key =>
externalService.getValue(key)) // unpredictable response time

val finalResult = spark.read.csv("/path/to/myfile.csv") // read same file twice
.map(row => doSomeTransformation(row))
.map(row => doSomeMoreTransformation(row))
.map(row => andAnotherCostlyOperation(row))
.map(row => {
val key = row(0)
val myData = enrichedData(key) // only here we use the data from ext.serv.
enrichRow(row, myData)
})
.write.csv("/path/to/output.csv")


The problem here is that I'm reading the same input file twice. First I need to collect all the keys from the file and send them to the external service. Once the external service responds, I can go over the same file again and use the data I've obtained to finish processing the data.



How to do this more efficiently? I don't really need the data from the external service until the very last step, so it would be nice if Spark could do the other transformations in parallel, and then when it reaches the last map() function, it would wait (potentially 0 sec) for the data from the external service.










share|improve this question























  • how about invoking an external service while creating "finalResult", i.e. without preparing "enrichedData" beforehand?

    – mangusta
    Nov 25 '18 at 12:44











  • The problem is that if I invoke the external service, it can take a long time to get back a response. So if I invoke it in the end when creating finalResult, it could block the whole process. It would be a more efficient use of time to invoke the service at the beginning, do all the other work in the meantime, and in the end collect the result from the external service.

    – Csaba
    Nov 25 '18 at 20:43


















0















I have the following problem: there is a large CSV file which we read with spark. We need to transform each line of the file and write it back to another text file. During this transformation, we need to invoke an external service via REST and before we write the file, we need to get the answer back from the service. In the meantime, we can do other actions on Spark.



A naive implementation would look like this:



val keys = spark.read.csv("/path/to/myfile.csv")
.map(row => {
val result = new Param(row(0), row(3), row(7))
result
})
.collect()

val enrichedData: Map[String,String] = keys.map(key =>
externalService.getValue(key)) // unpredictable response time

val finalResult = spark.read.csv("/path/to/myfile.csv") // read same file twice
.map(row => doSomeTransformation(row))
.map(row => doSomeMoreTransformation(row))
.map(row => andAnotherCostlyOperation(row))
.map(row => {
val key = row(0)
val myData = enrichedData(key) // only here we use the data from ext.serv.
enrichRow(row, myData)
})
.write.csv("/path/to/output.csv")


The problem here is that I'm reading the same input file twice. First I need to collect all the keys from the file and send them to the external service. Once the external service responds, I can go over the same file again and use the data I've obtained to finish processing the data.



How to do this more efficiently? I don't really need the data from the external service until the very last step, so it would be nice if Spark could do the other transformations in parallel, and then when it reaches the last map() function, it would wait (potentially 0 sec) for the data from the external service.










share|improve this question























  • how about invoking an external service while creating "finalResult", i.e. without preparing "enrichedData" beforehand?

    – mangusta
    Nov 25 '18 at 12:44











  • The problem is that if I invoke the external service, it can take a long time to get back a response. So if I invoke it in the end when creating finalResult, it could block the whole process. It would be a more efficient use of time to invoke the service at the beginning, do all the other work in the meantime, and in the end collect the result from the external service.

    – Csaba
    Nov 25 '18 at 20:43
















0












0








0








I have the following problem: there is a large CSV file which we read with spark. We need to transform each line of the file and write it back to another text file. During this transformation, we need to invoke an external service via REST and before we write the file, we need to get the answer back from the service. In the meantime, we can do other actions on Spark.



A naive implementation would look like this:



val keys = spark.read.csv("/path/to/myfile.csv")
.map(row => {
val result = new Param(row(0), row(3), row(7))
result
})
.collect()

val enrichedData: Map[String,String] = keys.map(key =>
externalService.getValue(key)) // unpredictable response time

val finalResult = spark.read.csv("/path/to/myfile.csv") // read same file twice
.map(row => doSomeTransformation(row))
.map(row => doSomeMoreTransformation(row))
.map(row => andAnotherCostlyOperation(row))
.map(row => {
val key = row(0)
val myData = enrichedData(key) // only here we use the data from ext.serv.
enrichRow(row, myData)
})
.write.csv("/path/to/output.csv")


The problem here is that I'm reading the same input file twice. First I need to collect all the keys from the file and send them to the external service. Once the external service responds, I can go over the same file again and use the data I've obtained to finish processing the data.



How to do this more efficiently? I don't really need the data from the external service until the very last step, so it would be nice if Spark could do the other transformations in parallel, and then when it reaches the last map() function, it would wait (potentially 0 sec) for the data from the external service.










share|improve this question














I have the following problem: there is a large CSV file which we read with spark. We need to transform each line of the file and write it back to another text file. During this transformation, we need to invoke an external service via REST and before we write the file, we need to get the answer back from the service. In the meantime, we can do other actions on Spark.



A naive implementation would look like this:



val keys = spark.read.csv("/path/to/myfile.csv")
.map(row => {
val result = new Param(row(0), row(3), row(7))
result
})
.collect()

val enrichedData: Map[String,String] = keys.map(key =>
externalService.getValue(key)) // unpredictable response time

val finalResult = spark.read.csv("/path/to/myfile.csv") // read same file twice
.map(row => doSomeTransformation(row))
.map(row => doSomeMoreTransformation(row))
.map(row => andAnotherCostlyOperation(row))
.map(row => {
val key = row(0)
val myData = enrichedData(key) // only here we use the data from ext.serv.
enrichRow(row, myData)
})
.write.csv("/path/to/output.csv")


The problem here is that I'm reading the same input file twice. First I need to collect all the keys from the file and send them to the external service. Once the external service responds, I can go over the same file again and use the data I've obtained to finish processing the data.



How to do this more efficiently? I don't really need the data from the external service until the very last step, so it would be nice if Spark could do the other transformations in parallel, and then when it reaches the last map() function, it would wait (potentially 0 sec) for the data from the external service.







apache-spark






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 25 '18 at 12:06









CsabaCsaba

1331211




1331211













  • how about invoking an external service while creating "finalResult", i.e. without preparing "enrichedData" beforehand?

    – mangusta
    Nov 25 '18 at 12:44











  • The problem is that if I invoke the external service, it can take a long time to get back a response. So if I invoke it in the end when creating finalResult, it could block the whole process. It would be a more efficient use of time to invoke the service at the beginning, do all the other work in the meantime, and in the end collect the result from the external service.

    – Csaba
    Nov 25 '18 at 20:43





















  • how about invoking an external service while creating "finalResult", i.e. without preparing "enrichedData" beforehand?

    – mangusta
    Nov 25 '18 at 12:44











  • The problem is that if I invoke the external service, it can take a long time to get back a response. So if I invoke it in the end when creating finalResult, it could block the whole process. It would be a more efficient use of time to invoke the service at the beginning, do all the other work in the meantime, and in the end collect the result from the external service.

    – Csaba
    Nov 25 '18 at 20:43



















how about invoking an external service while creating "finalResult", i.e. without preparing "enrichedData" beforehand?

– mangusta
Nov 25 '18 at 12:44





how about invoking an external service while creating "finalResult", i.e. without preparing "enrichedData" beforehand?

– mangusta
Nov 25 '18 at 12:44













The problem is that if I invoke the external service, it can take a long time to get back a response. So if I invoke it in the end when creating finalResult, it could block the whole process. It would be a more efficient use of time to invoke the service at the beginning, do all the other work in the meantime, and in the end collect the result from the external service.

– Csaba
Nov 25 '18 at 20:43







The problem is that if I invoke the external service, it can take a long time to get back a response. So if I invoke it in the end when creating finalResult, it could block the whole process. It would be a more efficient use of time to invoke the service at the beginning, do all the other work in the meantime, and in the end collect the result from the external service.

– Csaba
Nov 25 '18 at 20:43














0






active

oldest

votes











Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53467259%2fhow-to-efficiently-wait-for-an-external-task-to-finish-while-spark-is-working%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























0






active

oldest

votes








0






active

oldest

votes









active

oldest

votes






active

oldest

votes
















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53467259%2fhow-to-efficiently-wait-for-an-external-task-to-finish-while-spark-is-working%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Costa Masnaga

Fotorealismo

Sidney Franklin