How RDDs are created in Structured streaming in spark?












0















How RDDs are created in Structured streaming in Spark? In DStream, we have for every batch, does it create as soon as Data is available or trigger happens? How does it physically distributes RDDs across executors?










share|improve this question

























  • Is this about Spark Structured Streaming (Dataset API) or Spark Streaming (DStream API)?

    – Jacek Laskowski
    Nov 25 '18 at 19:33
















0















How RDDs are created in Structured streaming in Spark? In DStream, we have for every batch, does it create as soon as Data is available or trigger happens? How does it physically distributes RDDs across executors?










share|improve this question

























  • Is this about Spark Structured Streaming (Dataset API) or Spark Streaming (DStream API)?

    – Jacek Laskowski
    Nov 25 '18 at 19:33














0












0








0








How RDDs are created in Structured streaming in Spark? In DStream, we have for every batch, does it create as soon as Data is available or trigger happens? How does it physically distributes RDDs across executors?










share|improve this question
















How RDDs are created in Structured streaming in Spark? In DStream, we have for every batch, does it create as soon as Data is available or trigger happens? How does it physically distributes RDDs across executors?







spark-structured-streaming






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 25 '18 at 19:31









Jacek Laskowski

45k18132270




45k18132270










asked Nov 24 '18 at 1:40









AmanAman

242




242













  • Is this about Spark Structured Streaming (Dataset API) or Spark Streaming (DStream API)?

    – Jacek Laskowski
    Nov 25 '18 at 19:33



















  • Is this about Spark Structured Streaming (Dataset API) or Spark Streaming (DStream API)?

    – Jacek Laskowski
    Nov 25 '18 at 19:33

















Is this about Spark Structured Streaming (Dataset API) or Spark Streaming (DStream API)?

– Jacek Laskowski
Nov 25 '18 at 19:33





Is this about Spark Structured Streaming (Dataset API) or Spark Streaming (DStream API)?

– Jacek Laskowski
Nov 25 '18 at 19:33












1 Answer
1






active

oldest

votes


















-1














Internally, a DStream is represented as a sequence of RDDs,
which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval



IN the word count example:-



import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()


So, an RDD is created on the driver for the blocks created during the batchInterval. The blocks generated during the batchInterval are partitions of the RDD. Each partition is a task in spark. blockInterval== batchinterval would mean that a single partition is created and probably it is processed locally.



DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions.



DStream will execute as soon as the trigger happens, if your time interval is 2 seconds, job will trigger for each and every 2 seconds, basically the triggering point is not the data availability it is batch duration, if the data present at the time the DStream contains the data otherwise it will be empty.



DStream is actually a sequence of RDD from the code of DStream:-



  // RDDs generated, marked as private[streaming] so that testsuites can access it
@transient
private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()


number of executors generated depends upon partition as well as configuration provided.



There are normally two types of allocation in the configuration static allocation and dynamic allocation.
you can read about them here:-



http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/






share|improve this answer


























  • How does it create RDD with structured streaming, I mean how many executors are created and how partitions are distributed?

    – Aman
    Nov 24 '18 at 5:22













Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53454474%2fhow-rdds-are-created-in-structured-streaming-in-spark%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









-1














Internally, a DStream is represented as a sequence of RDDs,
which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval



IN the word count example:-



import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()


So, an RDD is created on the driver for the blocks created during the batchInterval. The blocks generated during the batchInterval are partitions of the RDD. Each partition is a task in spark. blockInterval== batchinterval would mean that a single partition is created and probably it is processed locally.



DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions.



DStream will execute as soon as the trigger happens, if your time interval is 2 seconds, job will trigger for each and every 2 seconds, basically the triggering point is not the data availability it is batch duration, if the data present at the time the DStream contains the data otherwise it will be empty.



DStream is actually a sequence of RDD from the code of DStream:-



  // RDDs generated, marked as private[streaming] so that testsuites can access it
@transient
private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()


number of executors generated depends upon partition as well as configuration provided.



There are normally two types of allocation in the configuration static allocation and dynamic allocation.
you can read about them here:-



http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/






share|improve this answer


























  • How does it create RDD with structured streaming, I mean how many executors are created and how partitions are distributed?

    – Aman
    Nov 24 '18 at 5:22


















-1














Internally, a DStream is represented as a sequence of RDDs,
which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval



IN the word count example:-



import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()


So, an RDD is created on the driver for the blocks created during the batchInterval. The blocks generated during the batchInterval are partitions of the RDD. Each partition is a task in spark. blockInterval== batchinterval would mean that a single partition is created and probably it is processed locally.



DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions.



DStream will execute as soon as the trigger happens, if your time interval is 2 seconds, job will trigger for each and every 2 seconds, basically the triggering point is not the data availability it is batch duration, if the data present at the time the DStream contains the data otherwise it will be empty.



DStream is actually a sequence of RDD from the code of DStream:-



  // RDDs generated, marked as private[streaming] so that testsuites can access it
@transient
private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()


number of executors generated depends upon partition as well as configuration provided.



There are normally two types of allocation in the configuration static allocation and dynamic allocation.
you can read about them here:-



http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/






share|improve this answer


























  • How does it create RDD with structured streaming, I mean how many executors are created and how partitions are distributed?

    – Aman
    Nov 24 '18 at 5:22
















-1












-1








-1







Internally, a DStream is represented as a sequence of RDDs,
which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval



IN the word count example:-



import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()


So, an RDD is created on the driver for the blocks created during the batchInterval. The blocks generated during the batchInterval are partitions of the RDD. Each partition is a task in spark. blockInterval== batchinterval would mean that a single partition is created and probably it is processed locally.



DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions.



DStream will execute as soon as the trigger happens, if your time interval is 2 seconds, job will trigger for each and every 2 seconds, basically the triggering point is not the data availability it is batch duration, if the data present at the time the DStream contains the data otherwise it will be empty.



DStream is actually a sequence of RDD from the code of DStream:-



  // RDDs generated, marked as private[streaming] so that testsuites can access it
@transient
private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()


number of executors generated depends upon partition as well as configuration provided.



There are normally two types of allocation in the configuration static allocation and dynamic allocation.
you can read about them here:-



http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/






share|improve this answer















Internally, a DStream is represented as a sequence of RDDs,
which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval



IN the word count example:-



import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()


So, an RDD is created on the driver for the blocks created during the batchInterval. The blocks generated during the batchInterval are partitions of the RDD. Each partition is a task in spark. blockInterval== batchinterval would mean that a single partition is created and probably it is processed locally.



DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions.



DStream will execute as soon as the trigger happens, if your time interval is 2 seconds, job will trigger for each and every 2 seconds, basically the triggering point is not the data availability it is batch duration, if the data present at the time the DStream contains the data otherwise it will be empty.



DStream is actually a sequence of RDD from the code of DStream:-



  // RDDs generated, marked as private[streaming] so that testsuites can access it
@transient
private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()


number of executors generated depends upon partition as well as configuration provided.



There are normally two types of allocation in the configuration static allocation and dynamic allocation.
you can read about them here:-



http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/







share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 24 '18 at 17:09

























answered Nov 24 '18 at 4:59









GarvitGarvit

154




154













  • How does it create RDD with structured streaming, I mean how many executors are created and how partitions are distributed?

    – Aman
    Nov 24 '18 at 5:22





















  • How does it create RDD with structured streaming, I mean how many executors are created and how partitions are distributed?

    – Aman
    Nov 24 '18 at 5:22



















How does it create RDD with structured streaming, I mean how many executors are created and how partitions are distributed?

– Aman
Nov 24 '18 at 5:22







How does it create RDD with structured streaming, I mean how many executors are created and how partitions are distributed?

– Aman
Nov 24 '18 at 5:22






















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53454474%2fhow-rdds-are-created-in-structured-streaming-in-spark%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Ottavio Pratesi

Tricia Helfer

15 giugno