How RDDs are created in Structured streaming in spark?

How RDDs are created in Structured streaming in Spark? In DStream, we have for every batch, does it create as soon as Data is available or trigger happens? How does it physically distributes RDDs across executors?

edited Nov 25 '18 at 19:31

Jacek Laskowski

45k18132270

asked Nov 24 '18 at 1:40

Aman

242

Is this about Spark Structured Streaming (Dataset API) or Spark Streaming (DStream API)?

– Jacek Laskowski
Nov 25 '18 at 19:33

add a comment |

edited Nov 25 '18 at 19:31

Jacek Laskowski

45k18132270

asked Nov 24 '18 at 1:40

Aman

242

Is this about Spark Structured Streaming (Dataset API) or Spark Streaming (DStream API)?

– Jacek Laskowski
Nov 25 '18 at 19:33

add a comment |

edited Nov 25 '18 at 19:31

Jacek Laskowski

45k18132270

asked Nov 24 '18 at 1:40

Aman

242

spark-structured-streaming

edited Nov 25 '18 at 19:31

Jacek Laskowski

45k18132270

asked Nov 24 '18 at 1:40

Aman

242

edited Nov 25 '18 at 19:31

Jacek Laskowski

45k18132270

asked Nov 24 '18 at 1:40

Aman

242

edited Nov 25 '18 at 19:31

Jacek Laskowski

45k18132270

edited Nov 25 '18 at 19:31

Jacek Laskowski

45k18132270

edited Nov 25 '18 at 19:31

Jacek Laskowski

45k18132270

asked Nov 24 '18 at 1:40

Aman

242

asked Nov 24 '18 at 1:40

Aman

242

asked Nov 24 '18 at 1:40

Aman

242

Is this about Spark Structured Streaming (Dataset API) or Spark Streaming (DStream API)?

– Jacek Laskowski
Nov 25 '18 at 19:33

add a comment |

Is this about Spark Structured Streaming (Dataset API) or Spark Streaming (DStream API)?

– Jacek Laskowski
Nov 25 '18 at 19:33

Is this about Spark Structured Streaming (Dataset API) or Spark Streaming (DStream API)?

– Jacek Laskowski
Nov 25 '18 at 19:33

add a comment |

1 Answer
1

active

oldest

votes

-1

Internally, a DStream is represented as a sequence of RDDs,
which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval

IN the word count example:-

import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3

// Count each word in each batch

val pairs = words.map(word => (word, 1))

val wordCounts = pairs.reduceByKey(_ + _)



// Print the first ten elements of each RDD generated in this DStream to the console

wordCounts.print()

So, an RDD is created on the driver for the blocks created during the batchInterval. The blocks generated during the batchInterval are partitions of the RDD. Each partition is a task in spark. blockInterval== batchinterval would mean that a single partition is created and probably it is processed locally.

DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions.

DStream will execute as soon as the trigger happens, if your time interval is 2 seconds, job will trigger for each and every 2 seconds, basically the triggering point is not the data availability it is batch duration, if the data present at the time the DStream contains the data otherwise it will be empty.

DStream is actually a sequence of RDD from the code of DStream:-

  // RDDs generated, marked as private[streaming] so that testsuites can access it

  @transient

  private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()

number of executors generated depends upon partition as well as configuration provided.

There are normally two types of allocation in the configuration static allocation and dynamic allocation.
you can read about them here:-

http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/

edited Nov 24 '18 at 17:09

answered Nov 24 '18 at 4:59

Garvit

154

How does it create RDD with structured streaming, I mean how many executors are created and how partitions are distributed?

– Aman
Nov 24 '18 at 5:22

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53454474%2fhow-rdds-are-created-in-structured-streaming-in-spark%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

-1

Internally, a DStream is represented as a sequence of RDDs,
which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval

IN the word count example:-

import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3

// Count each word in each batch

val pairs = words.map(word => (word, 1))

val wordCounts = pairs.reduceByKey(_ + _)



// Print the first ten elements of each RDD generated in this DStream to the console

wordCounts.print()

DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions.

DStream is actually a sequence of RDD from the code of DStream:-

  // RDDs generated, marked as private[streaming] so that testsuites can access it

  @transient

  private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()

number of executors generated depends upon partition as well as configuration provided.

There are normally two types of allocation in the configuration static allocation and dynamic allocation.
you can read about them here:-

http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/

edited Nov 24 '18 at 17:09

answered Nov 24 '18 at 4:59

Garvit

154

How does it create RDD with structured streaming, I mean how many executors are created and how partitions are distributed?

– Aman
Nov 24 '18 at 5:22

add a comment |

-1

Internally, a DStream is represented as a sequence of RDDs,
which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval

IN the word count example:-

import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3

// Count each word in each batch

val pairs = words.map(word => (word, 1))

val wordCounts = pairs.reduceByKey(_ + _)



// Print the first ten elements of each RDD generated in this DStream to the console

wordCounts.print()

DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions.

DStream is actually a sequence of RDD from the code of DStream:-

  // RDDs generated, marked as private[streaming] so that testsuites can access it

  @transient

  private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()

number of executors generated depends upon partition as well as configuration provided.

There are normally two types of allocation in the configuration static allocation and dynamic allocation.
you can read about them here:-

http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/

edited Nov 24 '18 at 17:09

answered Nov 24 '18 at 4:59

Garvit

154

How does it create RDD with structured streaming, I mean how many executors are created and how partitions are distributed?

– Aman
Nov 24 '18 at 5:22

add a comment |

-1

Internally, a DStream is represented as a sequence of RDDs,
which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval

IN the word count example:-

import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3

// Count each word in each batch

val pairs = words.map(word => (word, 1))

val wordCounts = pairs.reduceByKey(_ + _)



// Print the first ten elements of each RDD generated in this DStream to the console

wordCounts.print()

DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions.

DStream is actually a sequence of RDD from the code of DStream:-

  // RDDs generated, marked as private[streaming] so that testsuites can access it

  @transient

  private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()

number of executors generated depends upon partition as well as configuration provided.

There are normally two types of allocation in the configuration static allocation and dynamic allocation.
you can read about them here:-

http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/

edited Nov 24 '18 at 17:09

answered Nov 24 '18 at 4:59

Garvit

154

Internally, a DStream is represented as a sequence of RDDs,
which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval

IN the word count example:-

import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3

// Count each word in each batch

val pairs = words.map(word => (word, 1))

val wordCounts = pairs.reduceByKey(_ + _)



// Print the first ten elements of each RDD generated in this DStream to the console

wordCounts.print()

DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions.

DStream is actually a sequence of RDD from the code of DStream:-

  // RDDs generated, marked as private[streaming] so that testsuites can access it

  @transient

  private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()

number of executors generated depends upon partition as well as configuration provided.

There are normally two types of allocation in the configuration static allocation and dynamic allocation.
you can read about them here:-

http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/

edited Nov 24 '18 at 17:09

answered Nov 24 '18 at 4:59

Garvit

154

edited Nov 24 '18 at 17:09

answered Nov 24 '18 at 4:59

Garvit

154

answered Nov 24 '18 at 4:59

Garvit

154

answered Nov 24 '18 at 4:59

Garvit

154

How does it create RDD with structured streaming, I mean how many executors are created and how partitions are distributed?

– Aman
Nov 24 '18 at 5:22

add a comment |

How does it create RDD with structured streaming, I mean how many executors are created and how partitions are distributed?

– Aman
Nov 24 '18 at 5:22

How does it create RDD with structured streaming, I mean how many executors are created and how partitions are distributed?

– Aman
Nov 24 '18 at 5:22

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Nsryjdtyk