Split Spark DataFrame column
I'm using Spark 2.3.
I have a DataFrame like this (in other situations _c0 may contain 20 inner fields):
_c0            | _c1
--------------------
1.1 1.2 4.55   | a
4.44 3.1 9.99  | b
1.2 99.88 10.1 | x
I want to split _c0 and create a new DataFrame like this:
col1 | col2  | col3 | col4
--------------------------
1.1  | 1.2   | 4.55 | a
4.44 | 3.1   | 9.99 | b
1.2  | 99.88 | 10.1 | x
I know how to solve this using getItem():
import re

df = originalDf.rdd.map(lambda x: (re.split(" +", x[0]), x[1])).toDF()
# now df[0] is an array of strings, and df[1] is a string
df = df.select(df[0].getItem(0), df[0].getItem(1), df[0].getItem(2), df[1])
But I hoped to find a different way to solve this, because _c0 may contain more than 3 inner columns.
Is there a way to use flatMap to generate the DataFrame?
Is there a way to insert df[1] as an inner field of df[0]?
Is there a way to use df[0].getItem() so that it returns all inner fields?
Is there a simpler way to generate the DataFrame?
Any help will be appreciated. Thanks!
apache-spark dataframe pyspark rdd
asked Nov 25 '18 at 10:11 by Nir, edited Nov 27 '18 at 7:10

pls share the structure of your dataframe
– thebluephantom, Nov 25 '18 at 11:35

Possible duplicate of Split Spark Dataframe string column into multiple columns
– pault, Nov 26 '18 at 15:47

pault, I hoped to find a simple way to do it without using getItem() because I have many inner fields
– Nir, Nov 27 '18 at 7:22
1 Answer
Use the DataFrame split function with a regex pattern for whitespace ("\s+").
Docs: https://spark.apache.org/docs/2.3.1/api/python/_modules/pyspark/sql/functions.html

def split(str, pattern):
    """
    Splits str around pattern (pattern is a regular expression).

    .. note:: pattern is a string representing the regular expression.

    >>> df = spark.createDataFrame([('ab12cd',)], ['s',])
    >>> df.select(split(df.s, '[0-9]+').alias('s')).collect()
    [Row(s=[u'ab', u'cd'])]
    """
    sc = SparkContext._active_spark_context
    return Column(sc._jvm.functions.split(_to_java_column(str), pattern))

Then you can use getItem on the array column to get a particular field's value.
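For completeness, a minimal sketch of how this could generalize to any number of inner fields, assuming originalDf from the question. The field count is read from the first row rather than hard-coded, and the col1..colN names are illustrative:

from pyspark.sql.functions import col, split

# Split _c0 on runs of whitespace into an array column.
df = originalDf.withColumn("parts", split(col("_c0"), r"\s+"))

# Determine the number of inner fields by inspecting one row
# (assumes every row has the same number of fields).
n = len(df.first()["parts"])

# One output column per array element, then _c1 appended at the end.
cols = [col("parts").getItem(i).alias("col%d" % (i + 1)) for i in range(n)]
cols.append(col("_c1").alias("col%d" % (n + 1)))

df.select(cols).show()

This avoids the RDD round-trip and the hand-written chain of getItem() calls, at the cost of one first() action to discover the field count.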
answered Nov 25 '18 at 11:56 by morsik, edited Nov 25 '18 at 12:03