How RDDs are created in Structured streaming in spark?
How RDDs are created in Structured streaming in Spark? In DStream, we have for every batch, does it create as soon as Data is available or trigger happens? How does it physically distributes RDDs across executors?
spark-structured-streaming
add a comment |
How RDDs are created in Structured streaming in Spark? In DStream, we have for every batch, does it create as soon as Data is available or trigger happens? How does it physically distributes RDDs across executors?
spark-structured-streaming
Is this about Spark Structured Streaming (Dataset API) or Spark Streaming (DStream API)?
– Jacek Laskowski
Nov 25 '18 at 19:33
add a comment |
How RDDs are created in Structured streaming in Spark? In DStream, we have for every batch, does it create as soon as Data is available or trigger happens? How does it physically distributes RDDs across executors?
spark-structured-streaming
How RDDs are created in Structured streaming in Spark? In DStream, we have for every batch, does it create as soon as Data is available or trigger happens? How does it physically distributes RDDs across executors?
spark-structured-streaming
spark-structured-streaming
edited Nov 25 '18 at 19:31
Jacek Laskowski
45k18132270
45k18132270
asked Nov 24 '18 at 1:40
AmanAman
242
242
Is this about Spark Structured Streaming (Dataset API) or Spark Streaming (DStream API)?
– Jacek Laskowski
Nov 25 '18 at 19:33
add a comment |
Is this about Spark Structured Streaming (Dataset API) or Spark Streaming (DStream API)?
– Jacek Laskowski
Nov 25 '18 at 19:33
Is this about Spark Structured Streaming (Dataset API) or Spark Streaming (DStream API)?
– Jacek Laskowski
Nov 25 '18 at 19:33
Is this about Spark Structured Streaming (Dataset API) or Spark Streaming (DStream API)?
– Jacek Laskowski
Nov 25 '18 at 19:33
add a comment |
1 Answer
1
active
oldest
votes
Internally, a DStream is represented as a sequence of RDDs,
which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval
IN the word count example:-
import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()
So, an RDD is created on the driver for the blocks created during the batchInterval. The blocks generated during the batchInterval are partitions of the RDD. Each partition is a task in spark. blockInterval== batchinterval would mean that a single partition is created and probably it is processed locally.
DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions.
DStream will execute as soon as the trigger happens, if your time interval is 2 seconds, job will trigger for each and every 2 seconds, basically the triggering point is not the data availability it is batch duration, if the data present at the time the DStream contains the data otherwise it will be empty.
DStream is actually a sequence of RDD from the code of DStream:-
// RDDs generated, marked as private[streaming] so that testsuites can access it
@transient
private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()
number of executors generated depends upon partition as well as configuration provided.
There are normally two types of allocation in the configuration static allocation and dynamic allocation.
you can read about them here:-
http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/
How does it create RDD with structured streaming, I mean how many executors are created and how partitions are distributed?
– Aman
Nov 24 '18 at 5:22
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53454474%2fhow-rdds-are-created-in-structured-streaming-in-spark%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Internally, a DStream is represented as a sequence of RDDs,
which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval
IN the word count example:-
import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()
So, an RDD is created on the driver for the blocks created during the batchInterval. The blocks generated during the batchInterval are partitions of the RDD. Each partition is a task in spark. blockInterval== batchinterval would mean that a single partition is created and probably it is processed locally.
DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions.
DStream will execute as soon as the trigger happens, if your time interval is 2 seconds, job will trigger for each and every 2 seconds, basically the triggering point is not the data availability it is batch duration, if the data present at the time the DStream contains the data otherwise it will be empty.
DStream is actually a sequence of RDD from the code of DStream:-
// RDDs generated, marked as private[streaming] so that testsuites can access it
@transient
private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()
number of executors generated depends upon partition as well as configuration provided.
There are normally two types of allocation in the configuration static allocation and dynamic allocation.
you can read about them here:-
http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/
How does it create RDD with structured streaming, I mean how many executors are created and how partitions are distributed?
– Aman
Nov 24 '18 at 5:22
add a comment |
Internally, a DStream is represented as a sequence of RDDs,
which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval
IN the word count example:-
import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()
So, an RDD is created on the driver for the blocks created during the batchInterval. The blocks generated during the batchInterval are partitions of the RDD. Each partition is a task in spark. blockInterval== batchinterval would mean that a single partition is created and probably it is processed locally.
DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions.
DStream will execute as soon as the trigger happens, if your time interval is 2 seconds, job will trigger for each and every 2 seconds, basically the triggering point is not the data availability it is batch duration, if the data present at the time the DStream contains the data otherwise it will be empty.
DStream is actually a sequence of RDD from the code of DStream:-
// RDDs generated, marked as private[streaming] so that testsuites can access it
@transient
private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()
number of executors generated depends upon partition as well as configuration provided.
There are normally two types of allocation in the configuration static allocation and dynamic allocation.
you can read about them here:-
http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/
How does it create RDD with structured streaming, I mean how many executors are created and how partitions are distributed?
– Aman
Nov 24 '18 at 5:22
add a comment |
Internally, a DStream is represented as a sequence of RDDs,
which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval
IN the word count example:-
import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()
So, an RDD is created on the driver for the blocks created during the batchInterval. The blocks generated during the batchInterval are partitions of the RDD. Each partition is a task in spark. blockInterval== batchinterval would mean that a single partition is created and probably it is processed locally.
DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions.
DStream will execute as soon as the trigger happens, if your time interval is 2 seconds, job will trigger for each and every 2 seconds, basically the triggering point is not the data availability it is batch duration, if the data present at the time the DStream contains the data otherwise it will be empty.
DStream is actually a sequence of RDD from the code of DStream:-
// RDDs generated, marked as private[streaming] so that testsuites can access it
@transient
private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()
number of executors generated depends upon partition as well as configuration provided.
There are normally two types of allocation in the configuration static allocation and dynamic allocation.
you can read about them here:-
http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/
Internally, a DStream is represented as a sequence of RDDs,
which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval
IN the word count example:-
import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()
So, an RDD is created on the driver for the blocks created during the batchInterval. The blocks generated during the batchInterval are partitions of the RDD. Each partition is a task in spark. blockInterval== batchinterval would mean that a single partition is created and probably it is processed locally.
DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions.
DStream will execute as soon as the trigger happens, if your time interval is 2 seconds, job will trigger for each and every 2 seconds, basically the triggering point is not the data availability it is batch duration, if the data present at the time the DStream contains the data otherwise it will be empty.
DStream is actually a sequence of RDD from the code of DStream:-
// RDDs generated, marked as private[streaming] so that testsuites can access it
@transient
private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()
number of executors generated depends upon partition as well as configuration provided.
There are normally two types of allocation in the configuration static allocation and dynamic allocation.
you can read about them here:-
http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/
edited Nov 24 '18 at 17:09
answered Nov 24 '18 at 4:59
GarvitGarvit
154
154
How does it create RDD with structured streaming, I mean how many executors are created and how partitions are distributed?
– Aman
Nov 24 '18 at 5:22
add a comment |
How does it create RDD with structured streaming, I mean how many executors are created and how partitions are distributed?
– Aman
Nov 24 '18 at 5:22
How does it create RDD with structured streaming, I mean how many executors are created and how partitions are distributed?
– Aman
Nov 24 '18 at 5:22
How does it create RDD with structured streaming, I mean how many executors are created and how partitions are distributed?
– Aman
Nov 24 '18 at 5:22
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53454474%2fhow-rdds-are-created-in-structured-streaming-in-spark%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Is this about Spark Structured Streaming (Dataset API) or Spark Streaming (DStream API)?
– Jacek Laskowski
Nov 25 '18 at 19:33