Readstream on Apache Spark with a bad schema is retrying 1830 times











In Spark Structured Streaming, when an incoming record from S3 doesn't match the schema I enforce with .schema(..) and the record is large (mine is 397 KB), that record is retried exactly 1830 times. I have reproduced this several times. Has anyone else seen this behaviour?

















      apache-spark apache-spark-sql spark-structured-streaming














      edited Nov 25 at 19:31









      Jacek Laskowski











      asked Nov 19 at 18:01









      Naveen Cotha

























          1 Answer



          accepted










          In my case the S3 object was a JSON array, and it turns out that Spark's JSON reader treats each entry of the array as an individual record in the DataFrame. The S3 object contained 1830 items, so what looked like 1830 retries of one record was actually the same object being processed once per item, with each item failing the schema. However, I could not find any official documentation for this behaviour.
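          To check whether this explains a given retry count, the top-level elements of the object can be counted outside Spark. The following is a minimal plain-Python sketch, not Spark code: the function name and the sample payload are made up for illustration, and it only models the one-record-per-array-element behaviour described above.

```python
import json


def count_row_candidates(payload: str) -> int:
    """Count how many records a JSON document would yield, assuming each
    element of a top-level array is treated as its own record."""
    doc = json.loads(payload)
    # A top-level array contributes one record per element;
    # any other top-level value counts as a single record.
    return len(doc) if isinstance(doc, list) else 1


# Hypothetical stand-in for the S3 object: one array holding 1830 items.
sample = json.dumps([{"id": i} for i in range(1830)])
print(count_row_candidates(sample))  # prints 1830
```

          If the count for the real object matches the number of errors in the logs, the "retries" are most likely per-item schema failures rather than Spark retrying one record.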















                answered Nov 22 at 20:53









                Naveen Cotha
