Split Spark DataFrame column



























I'm using Spark 2.3.

I have a DataFrame like this (in other situations _c0 may contain 20 inner fields):



_c0            | _c1
-----------------------------
1.1 1.2 4.55   | a
4.44 3.1 9.99  | b
1.2 99.88 10.1 | x


I want to split _c0 and create a new DataFrame like this:



col1 | col2  | col3 | col4
-----------------------------
1.1  | 1.2   | 4.55 | a
4.44 | 3.1   | 9.99 | b
1.2  | 99.88 | 10.1 | x


I know how to solve this using getItem():



import re

# split _c0 on runs of spaces; each row becomes (list of strings, string)
df = originalDf.rdd.map(lambda x: (re.split(" +", x[0]), x[1])).toDF()
# now df[0] is an array of strings, and df[1] is a string
df = df.select(df[0].getItem(0), df[0].getItem(1), df[0].getItem(2), df[1])


But I hoped to find a different way to solve this, because _c0 may contain more than 3 inner columns.



  • Is there a way to use flatMap to generate the df?

  • Is there a way to insert df[1] as an inner field of df[0]?

  • Is there a way to use df[0].getItem() so that it returns all inner fields?

  • Is there a simpler way to generate the DataFrame?



Any help will be appreciated. Thanks.

apache-spark dataframe pyspark rdd

asked Nov 25 '18 at 10:11 by Nir, edited Nov 27 '18 at 7:10










  • pls share the structure of your dataframe

    – thebluephantom
    Nov 25 '18 at 11:35











  • Possible duplicate of Split Spark Dataframe string column into multiple columns

    – pault
    Nov 26 '18 at 15:47











  • pault, I hoped to find a simple way to do it without using getItem() because I have many inner fields

    – Nir
    Nov 27 '18 at 7:22
















1 Answer
































Use the pyspark.sql.functions.split function with a regex pattern for whitespace ("\s+").
Docs: https://spark.apache.org/docs/2.3.1/api/python/_modules/pyspark/sql/functions.html



def split(str, pattern):
    """
    Splits str around pattern (pattern is a regular expression).

    .. note:: pattern is a string represent the regular expression.

    >>> df = spark.createDataFrame([('ab12cd',)], ['s',])
    >>> df.select(split(df.s, '[0-9]+').alias('s')).collect()
    [Row(s=[u'ab', u'cd'])]
    """
    sc = SparkContext._active_spark_context
    return Column(sc._jvm.functions.split(_to_java_column(str), pattern))
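
For the question's data, the call might look like this (a minimal sketch; originalDf and the column names _c0/_c1 are taken from the question):

from pyspark.sql import functions as F

# _c0 becomes an array-of-strings column; "\s+" matches runs of whitespace
arr_df = originalDf.select(F.split("_c0", r"\s+").alias("parts"), "_c1")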


Then you can use getItem on the array column to pull out each field's value.
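
Because _c0 may hold an arbitrary number of inner fields, the getItem calls do not have to be written out by hand; the select list can be built in a comprehension. A minimal sketch building on arr_df above, assuming every row has the same number of inner fields (here read from the first row):

# count the inner fields by inspecting the first row (assumes a uniform,
# non-empty DataFrame)
n = len(arr_df.first()["parts"])

# one getItem per inner field, then _c1 appended as the last column
cols = [arr_df["parts"].getItem(i).alias("col%d" % (i + 1)) for i in range(n)]
cols.append(arr_df["_c1"].alias("col%d" % (n + 1)))
arr_df.select(cols).show()

This yields col1 through coln for the split values plus a final column for _c1, matching the layout the question asks for, without enumerating each getItem call.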






answered Nov 25 '18 at 11:56 by morsik, edited Nov 25 '18 at 12:03








