Split Spark DataFrame column



























I'm using Spark 2.3.

I have a DataFrame like this (in other situations _c0 may contain 20 inner fields):



_c0            | _c1
-----------------------------
1.1 1.2 4.55   | a
4.44 3.1 9.99  | b
1.2 99.88 10.1 | x


I want to split _c0 and create a new DataFrame like this:



col1 | col2  | col3 | col4
-----------------------------
1.1  | 1.2   | 4.55 | a
4.44 | 3.1   | 9.99 | b
1.2  | 99.88 | 10.1 | x


I know how to solve this using getItem():



import re

# split _c0 on runs of spaces; each row becomes (list of strings, string)
df = originalDf.rdd.map(lambda x: (re.split(" +", x[0]), x[1])).toDF()
# now df[0] is an array of strings, and df[1] is a string
df = df.select(df[0].getItem(0), df[0].getItem(1), df[0].getItem(2), df[1])


But I hoped to find a different way to solve this, because _c0 may contain more than 3 inner columns.



  • Is there a way to use flatMap to generate the df?

  • Is there a way to insert df[1] as an inner field of df[0]?

  • Is there a way to use df[0].getItem() so that it returns all inner fields?

  • Is there a simpler way to generate the DataFrame?



Any help will be appreciated. Thanks.

apache-spark dataframe pyspark rdd

asked Nov 25 '18 at 10:11 by Nir, edited Nov 27 '18 at 7:10










  • pls share the structure of your dataframe

    – thebluephantom
    Nov 25 '18 at 11:35











  • Possible duplicate of Split Spark Dataframe string column into multiple columns

    – pault
    Nov 26 '18 at 15:47











  • pault, I hoped to find a simple way to do it without using getItem() because I have many inner fields

    – Nir
    Nov 27 '18 at 7:22
















1 Answer
































Use the pyspark.sql.functions.split function with a regex pattern for whitespace ("\s+").
Docs: https://spark.apache.org/docs/2.3.1/api/python/_modules/pyspark/sql/functions.html



def split(str, pattern):
    """
    Splits str around pattern (pattern is a regular expression).

    .. note:: pattern is a string represent the regular expression.

    >>> df = spark.createDataFrame([('ab12cd',)], ['s',])
    >>> df.select(split(df.s, '[0-9]+').alias('s')).collect()
    [Row(s=[u'ab', u'cd'])]
    """
    sc = SparkContext._active_spark_context
    return Column(sc._jvm.functions.split(_to_java_column(str), pattern))
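
For the question's data, the call might look like this (a minimal sketch; originalDf and the column names _c0/_c1 are taken from the question):

from pyspark.sql import functions as F

# _c0 becomes an array-of-strings column; "\s+" matches runs of whitespace
arr_df = originalDf.select(F.split("_c0", r"\s+").alias("parts"), "_c1")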


Then you can use getItem on the array column to pull out each field's value.
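
Because _c0 may hold an arbitrary number of inner fields, the getItem calls do not have to be written out by hand; the select list can be built in a comprehension. A minimal sketch building on arr_df above, assuming every row has the same number of inner fields (here read from the first row):

# count the inner fields by inspecting the first row (assumes a uniform,
# non-empty DataFrame)
n = len(arr_df.first()["parts"])

# one getItem per inner field, then _c1 appended as the last column
cols = [arr_df["parts"].getItem(i).alias("col%d" % (i + 1)) for i in range(n)]
cols.append(arr_df["_c1"].alias("col%d" % (n + 1)))
arr_df.select(cols).show()

This yields col1 through coln for the split values plus a final column for _c1, matching the layout the question asks for, without enumerating each getItem call.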






answered Nov 25 '18 at 11:56 by morsik, edited Nov 25 '18 at 12:03








