Splitting pandas dataframe column (into two) after the first letter in the cell

The problem

I would like to split one column of a pandas DataFrame into two. Each entry in the 'Percentage' column (see below) starts with a single capital letter. I would like to split the column immediately after this letter and put the letter in a new column labelled 'Amino Acid'.

Current Code:

import pandas as pd

df = pd.read_csv('foo.csv')

df['Amino Acid'], df['Percentage'] = zip(*df['Percentage'].map(lambda x: x.split('[^a-zA-Z]')))

df.to_csv('bar.csv', index=False)

Example of input data

+-----------------------------+-------+-----+-----------+---------------------------------------------------------------------------------------------+
|           Species           |  ID   | OGT |    DB     |                                         Percentage                                          |
+-----------------------------+-------+-----+-----------+---------------------------------------------------------------------------------------------+
| Halogeometricum borinquense | 60847 |  37 | ATCC/DSMZ | E is 8.333003365670164% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
| Halogeometricum borinquense | 60847 |  37 | ATCC/DSMZ | R is 6.310991522830762% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
| Halogeometricum borinquense | 60847 |  37 | ATCC/DSMZ | A is 10.22668778459711% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
+-----------------------------+-------+-----+-----------+---------------------------------------------------------------------------------------------+

Example of desired output

+-----------------------------+-------+-----+-----------+------------+--------------------------------------------------------------------------------------------+
|           Species           |  ID   | OGT |    DB     | Amino Acid |                                         Percentage                                         |
+-----------------------------+-------+-----+-----------+------------+--------------------------------------------------------------------------------------------+
| Halogeometricum borinquense | 60847 |  37 | ATCC/DSMZ | E          |  is 8.333003365670164% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
| Halogeometricum borinquense | 60847 |  37 | ATCC/DSMZ | R          | is 6.310991522830762% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa  |
| Halogeometricum borinquense | 60847 |  37 | ATCC/DSMZ | A          | is 10.22668778459711% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa  |
+-----------------------------+-------+-----+-----------+------------+--------------------------------------------------------------------------------------------+

asked Jul 9 at 10:44 by Biomage

edited Nov 19 at 7:13 by Cœur

python pandas


2 Answers

Accepted answer (score 4)

Split on the first whitespace:

df[['Amino Acid', 'Percentage']] = df['Percentage'].str.split(n=1, expand=True)
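
As an illustration of the line above (not part of the original answer): the attempt in the question calls Python's built-in str.split, which treats '[^a-zA-Z]' as a literal separator rather than a regular expression, so nothing is split. The pandas .str.split accessor with no pattern splits on whitespace, n=1 limits it to one split, and expand=True returns the pieces as separate columns. A minimal sketch, assuming a small hand-built frame that reuses two rows from the question instead of reading foo.csv:

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv('foo.csv'): two 'Percentage' entries copied from the question.
df = pd.DataFrame({
    "Percentage": [
        "E is 8.333003365670164% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa",
        "R is 6.310991522830762% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa",
    ]
})

# No pattern: split on whitespace. n=1: at most one split. expand=True: return a two-column DataFrame.
df[["Amino Acid", "Percentage"]] = df["Percentage"].str.split(n=1, expand=True)

print(df["Amino Acid"].tolist())
print(df["Percentage"].iloc[0])
```

Because the split consumes the whitespace, the remaining 'Percentage' text starts at "is" with no leading space.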

answered Jul 9 at 10:47 by jezrael

Do you know how I could also split the string 'archaea' out of the input CSV into another column? My input has either archaea or bacteria in that entry.
– Biomage
Jul 9 at 12:27

@Biomage - I think you need df[['before_arch', 'after_arch']] = df['Percentage'].str.split('archaea', n=1, expand=True)
– jezrael
Jul 9 at 12:46

Hmm, in my actual input data I also have some 'bacteria' instead of 'archaea'. Is there a solution that puts either result in its own column?
– Biomage
Jul 9 at 12:48

@Biomage - do you mean splitting by the word archaea or bacteria? Then you need .str.split('archaea|bacteria', n=1, expand=True)
– jezrael
Jul 9 at 12:49

If I try df['Domain'] = df['Percentage'].str.split('archaea|bacteria', n=1, expand=True), I get a "wrong number of items passed" error.
– Biomage
Jul 9 at 12:57
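
The error in the last comment probably comes from assigning the two-column result of expand=True (the text before and after the separator) to a single column. If the goal is only a column that records which domain appears in the path, .str.extract with one capture group may be a simpler fit. A minimal sketch, assuming shortened made-up rows (both paths are hypothetical) and the column name 'Domain' taken from the comment:

```python
import pandas as pd

# Hypothetical rows: one archaea path and one bacteria path, shortened for illustration.
df = pd.DataFrame({
    "Percentage": [
        "E is 8.3% in ./archaea/sample_one_protein.faa",
        "A is 10.2% in ./bacteria/sample_two_protein.faa",
    ]
})

# One capture group plus expand=False yields a Series, which fits a single column.
# Rows that contain neither word would get NaN.
df["Domain"] = df["Percentage"].str.extract(r"(archaea|bacteria)", expand=False)

print(df["Domain"].tolist())
```

Unlike split, extract keeps the matched word itself rather than the text on either side of it.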

Answer (score 2)

You can extract the first letter directly:

df['Amino Acid'] = df['Percentage'].str[0]

df['Percentage'] = df['Percentage'].str[1:]
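
For illustration (not part of the original answer), here is the same slicing applied to two rows from the question. It relies on the assumption, stated in the question, that every entry starts with exactly one letter. Note that slicing keeps the space after the letter; the extra `stripped` variable is a hypothetical addition showing one way to drop it with .str.lstrip().

```python
import pandas as pd

# Hypothetical stand-in for the CSV: two 'Percentage' entries copied from the question.
df = pd.DataFrame({
    "Percentage": [
        "E is 8.333003365670164% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa",
        "A is 10.22668778459711% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa",
    ]
})

# .str[0] takes the first character of each entry; .str[1:] keeps everything after it.
df["Amino Acid"] = df["Percentage"].str[0]
df["Percentage"] = df["Percentage"].str[1:]

# Optional: remove the leading space that the slice leaves behind.
stripped = df["Percentage"].str.lstrip()

print(df["Amino Acid"].tolist())
print(repr(df["Percentage"].iloc[0][:4]))
```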

answered Jul 9 at 10:47 by jpp
