How to handle categorical data for preprocessing in Machine Learning

This may be a basic question. I have categorical data that I want to feed into my machine learning model, but the model accepts only numerical data. What is the correct way to convert categorical data into numerical data?

My sample DataFrame:

  T-size Gender  Label
0      L      M      1
1      L      M      1
2      M      F      1
3      S      F      0
4      M      M      1
5      L      M      0
6      S      F      1
7      S      F      0
8      M      M      1

I know the following code converts my categorical data into numerical data.

Type-1:

df['T-size'] = df['T-size'].cat.codes

The line above simply maps the categories to the integers 0 to N-1. It does not respect any relationship between them.

For this example I know that S < M < L. What should I do when I want to convert data like this?

Type-2:

In this case there is no ordering between M and F, but I can tell that M has a higher probability than F of having Label 1, i.e. (number of samples with Label 1) / (total number of samples):

for Male, (4/5)

for Female, (2/4)

We know that (4/5) > (2/4).

How should I encode this kind of column?

Can I replace M with (4/5) and F with (2/4) for this problem?

What is the proper way of dealing with such a column?

Please help me understand this better.

edited Nov 26 '18 at 14:32 by Joe
asked Nov 26 '18 at 9:26 by Mohamed Thasin ah

python pandas dataframe machine-learning feature-selection


4 Answers

There are many ways to encode categorical data, and some of them depend on exactly what you plan to do with it. For example, one-hot encoding, which is easily the most popular choice, is an extremely poor choice if you're planning on using a decision tree / random forest / GBM.

Regarding your t-shirts above, you can give a pandas categorical type an order:

df['T-size'] = df['T-size'].astype(pd.api.types.CategoricalDtype(['S','M','L'], ordered=True))

If you set up your T-shirt categorical like that, then your .cat.codes method would work perfectly. It also means you can easily use scikit-learn's LabelEncoder, which fits neatly into pipelines.
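As a minimal sketch of the difference, assuming the T-size column from the question's sample DataFrame, the following compares the codes of a default categorical with those of the ordered one:

```python
import pandas as pd

# T-size column from the question's sample DataFrame.
df = pd.DataFrame({"T-size": ["L", "L", "M", "S", "M", "L", "S", "S", "M"]})

# Default categorical: categories are sorted lexically, so L=0, M=1, S=2.
default_codes = df["T-size"].astype("category").cat.codes

# Explicit category list: codes follow the given order, so S=0, M=1, L=2.
size_dtype = pd.api.types.CategoricalDtype(["S", "M", "L"], ordered=True)
ordered_codes = df["T-size"].astype(size_dtype).cat.codes

print(default_codes.tolist())  # [0, 0, 1, 2, 1, 0, 2, 2, 1]
print(ordered_codes.tolist())  # [2, 2, 1, 0, 1, 2, 0, 0, 1]
```

Note that the codes come from the position in the category list; ordered=True additionally allows comparisons such as df['T-size'] > 'S'. For feature columns in scikit-learn, OrdinalEncoder with an explicit categories list is probably the closer fit, since LabelEncoder is documented for target labels and sorts its classes.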

Regarding your encoding of gender, you need to be very careful when using your target variable (your Label). You don't want to do this encoding before your train-test split, otherwise you're using knowledge of your unseen data, making it not truly unseen. This gets even more complicated if you're using cross-validation, as you'll need to do the encoding within each CV iteration (i.e. a new encoding per fold). If you want to do this, I recommend you check out TargetEncoder from scikit-learn-contrib's Category Encoders, but again, be sure to use it within an sklearn Pipeline or you will mess up the train-test splits and leak information from your test set into your training set.
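To make the leakage point concrete, here is a rough pandas-only sketch (not the Category Encoders implementation) that learns the per-gender label mean from training rows only. The 6/3 split and the fallback to the overall training mean are assumptions made purely for illustration:

```python
import pandas as pd

# Gender and Label columns from the question's sample DataFrame.
df = pd.DataFrame({
    "Gender": ["M", "M", "F", "F", "M", "M", "F", "F", "M"],
    "Label":  [1, 1, 1, 0, 1, 0, 1, 0, 1],
})

# Using every row reproduces the question's ratios (M: 4/5, F: 2/4),
# but it also uses labels that should stay unseen.
leaky_means = df.groupby("Gender")["Label"].mean()

# Hypothetical split for illustration: first six rows train, last three test.
train = df.iloc[:6].copy()
test = df.iloc[6:].copy()

# Learn the encoding from the training rows only.
gender_means = train.groupby("Gender")["Label"].mean()
# Assumed fallback for categories never seen in training.
global_mean = train["Label"].mean()

# Apply the learned mapping to both splits; test labels are never touched.
train["Gender_enc"] = train["Gender"].map(gender_means)
test["Gender_enc"] = test["Gender"].map(gender_means).fillna(global_mean)

print(gender_means.to_dict())      # {'F': 0.5, 'M': 0.75}
print(test["Gender_enc"].tolist())  # [0.5, 0.5, 0.75]
```

With cross-validation the same fit-on-train, apply-to-held-out step would have to be repeated inside every fold, which is what putting the encoder in a Pipeline takes care of.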

edited Nov 29 '18 at 5:52 by Mohamed Thasin ah
answered Nov 26 '18 at 10:19 by Dan

What is the importance of ordered? I.e., what is the difference between df['T-size'].astype('category') and df['T-size'].astype(pd.api.types.CategoricalDtype(['S','M','L'],ordered=True))?

– Mohamed Thasin ah
Nov 26 '18 at 10:39

Well, you said you wanted to be sure that S<M<L, so that's what ordered does for you

– Dan
Nov 26 '18 at 10:39

yeah got it ;-)

– Mohamed Thasin ah
Nov 26 '18 at 10:40

A clarification on "You don't want to do this encoding before your train-test split, otherwise you're using knowledge of your unseen data, making it not truly unseen": which encoding does "this encoding" refer to? Do you mean the encoding I proposed, or something else?

– Mohamed Thasin ah
Nov 26 '18 at 10:58

@MohamedThasinah technically, any encoding, but particularly any encoding that uses the target variable or the distribution of the variable being encoded.

– Dan
Nov 26 '18 at 11:08


For the first question, if you have a small number of categories, you could map the column with a dictionary. In this way you can set an order:

d = {'L':2, 'M':1, 'S':0}
df['T-size'] = df['T-size'].map(d)

Output:

   T-size Gender  Label
0       2      M      1
1       2      M      1
2       1      F      1
3       0      F      0
4       1      M      1
5       2      M      0
6       0      F      1
7       0      F      0
8       1      M      1

For the second question, you can use the same method, but I would simply encode male and female as 0 and 1. If you only need the category and do not perform arithmetic on the values, one value is as good as another.

edited Nov 26 '18 at 9:58
answered Nov 26 '18 at 9:51 by Joe

Thanks for the answer. Here you used the mapping S->0, M->1, L->2. May I know why you chose 0, 1, 2? If the values represent a weight, can I use 10, 20, 30 instead? Will it make any difference to my model?

– Mohamed Thasin ah
Nov 26 '18 at 9:55

I would start from the smallest value 0, and increase till the biggest

– Joe
Nov 26 '18 at 9:56

I'm really confused about what value to assign. Is replacing each category with a scalar value sufficient to deal with categorical data?

– Mohamed Thasin ah
Nov 26 '18 at 10:02

Yes it is enough

– Joe
Nov 26 '18 at 10:03

Sorry for the repeated question, but I'm really wondering: don't I need to give the weights explicitly?

– Mohamed Thasin ah
Nov 26 '18 at 10:04


If you want to have a hierarchy in your size parameter, you may consider using a linear mapping for it. This would be:

size_mapping = {"S": 1, "M": 2, "L": 3}

# mapping to the DataFrame
df['T-size_num'] = df['T-size'].map(size_mapping)

This allows you to treat the input as numerical data while preserving the hierarchy.

As for the gender, you are conflating the class distribution with the preprocessing. If you feed the distribution in as an input, you will introduce a bias into your data. You must treat male and female as two distinct categories regardless of their observed distribution. You should map them to two different numbers, but without introducing proportions.

df['Gender_num'] = df['Gender'].map({'M': 0, 'F': 1})

For a more detailed explanation that covers more cases than your question, I suggest reading this article regarding categorical data in machine learning.

answered Nov 26 '18 at 9:54 by SantiStSupery

Thanks for the answer. For the first question, you assigned 1, 2, 3. May I know the reason behind that?

– Mohamed Thasin ah
Nov 26 '18 at 10:04

The relationship between the values of your size parameter is unknown, so I assume it to be linear. And since sizes are positive quantities, I took the minimum value to be greater than 0. This makes sure that the M size is a multiple of the S size (which is also true for the L size, of course), which would not be true for S=0, M=1 and L=3.

– SantiStSupery
Nov 26 '18 at 10:20

I'm curious to know: instead of 1, 2, 3, can I use 10, 20, 30 or 5, 10, 15?

– Mohamed Thasin ah
Nov 26 '18 at 10:22

Yes, you could. The most important part is the way you model the relationship between your elements. If it is set to be linear with c=3a and b=2a, the patterns you will find in your data will remain the same whether you use 1, 2 and 3 or any other triplet that preserves that same linear relationship.

– SantiStSupery
Nov 26 '18 at 10:29
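One way to check this claim on the question's sample data is to look at the Pearson correlation between the encoded size and the Label, taken here as one concrete example of a "pattern". This is only a sketch: the third mapping (1, 2, 10) is a made-up counterexample, and models that are sensitive to feature scale may still need standardized inputs.

```python
import pandas as pd

# T-size and Label columns from the question's sample DataFrame.
sizes = pd.Series(["L", "L", "M", "S", "M", "L", "S", "S", "M"])
label = pd.Series([1, 1, 1, 0, 1, 0, 1, 0, 1])

# Two triplets with the same linear relationship (b = 2a, c = 3a).
enc_a = sizes.map({"S": 1, "M": 2, "L": 3})
enc_b = sizes.map({"S": 10, "M": 20, "L": 30})
# Hypothetical uneven spacing, which changes the assumed relationship.
enc_c = sizes.map({"S": 1, "M": 2, "L": 10})

corr_a = enc_a.corr(label)
corr_b = enc_b.corr(label)
corr_c = enc_c.corr(label)

print(corr_a, corr_b)  # identical up to floating-point rounding
print(corr_c)          # differs, because the spacing is no longer linear
```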

It might be overkill for the M/F example, since it's binary - but if you are ever concerned about mapping a categorical into a numerical form, then consider one hot encoding. It basically stretches your single column containing n categories, into n binary columns.

So a dataset of:

Gender
M
F
M
M
F

would become:

Gender_M    Gender_F
1           0
0           1
1           0
1           0
0           1

This takes away any notion of one thing being more "positive" than another - an absolute must for categorical data with more than 2 options, where there's no transitive A > B > C relationship and you don't want to smear your results by forcing one into your encoding scheme.
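In pandas, a minimal sketch of this transformation, assuming the five-row Gender column shown above, could use pd.get_dummies:

```python
import pandas as pd

# The five-row Gender example from the table above.
df = pd.DataFrame({"Gender": ["M", "F", "M", "M", "F"]})

# One binary column per category; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(df["Gender"], prefix="Gender", dtype=int)

# get_dummies sorts the new columns; reorder to match the table above.
one_hot = one_hot[["Gender_M", "Gender_F"]]
print(one_hot)
```

For a binary column like this one the two outputs are redundant (each is 1 minus the other), which is why a single 0/1 column is usually enough for M/F.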

answered Nov 26 '18 at 10:23 by Thomas Kimber

Thanks for the answer. I understood the first part, but I failed to understand this point: "an absolute must for categorical data with more than 2 options, where there's no transitive A > B > C relationship and you don't want to smear your results by forcing one into your encoding scheme." Can you clarify it?

– Mohamed Thasin ah
Nov 26 '18 at 10:31

So you've outlined two types of categorical variable in your example, {S,M,L} is transitive, since you can assign them an order. {M,F} isn't transitive, but it is binary, so coding to {0,1} is probably ok. Another type of categorical variable is one where there are more than 2 options, but which are not orderable (i.e. not transitive) - so {Vanilla, Strawberry, Grape} might be the options in a categorical variable, but if transformed to {0,1,2} a skew is introduced, suggesting that Strawberry exists "between" Vanilla and Grape when no such relationship exists outside of the arbitrary coding.

– Thomas Kimber
Nov 26 '18 at 10:44

Great explanation, it's really appreciated. For {Vanilla, Strawberry, Grape}, one-hot encoding solves the problem, right?

– Mohamed Thasin ah
Nov 26 '18 at 10:52

Yes, exactly. There's nothing stopping you from using it for {M,F} too. One-hot encoding has the disadvantage of expanding the dimensionality of your data, which can be a problem when the number of categories in a field gets large, but bearing that in mind, it's usually a good general-use option.

– Thomas Kimber
Nov 26 '18 at 11:02

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53478046%2fhow-to-handle-categorical-data-for-preprocessing-in-machine-learning%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

4 Answers
4

active

oldest

votes

4 Answers
4

active

oldest

votes

Regarding your t-shirts above, you can give a pandas categorical type an order:

df['T-size'].astype(pd.api.types.CategoricalDtype(['S','M','L'],ordered=True)).

edited Nov 29 '18 at 5:52

Mohamed Thasin ah

4,10132041

answered Nov 26 '18 at 10:19

Dan

37.1k1056102

What is the importance of ordered. i.e., what is the difference between df['T-size'].astype('category') and df['T-size'].astype(pd.api.types.CategoricalDtype(['S','M','L'],ordered=True))

– Mohamed Thasin ah
Nov 26 '18 at 10:39

1

Well, you said you wanted to be sure that S<M<L, so that's what ordered does for you

– Dan
Nov 26 '18 at 10:39

yeah got it ;-)

– Mohamed Thasin ah
Nov 26 '18 at 10:40

For a clarification in You don't want to do this encoding before your train-test split otherwise you're using knowledge of your unseen data making it not truly unseen. this statement this encoding represents which encoding? you mean my assumption of encoding right? or you mean something other?

– Mohamed Thasin ah
Nov 26 '18 at 10:58

1

@MohamedThasinah technically, any encoding, But particularly any encoding that uses the target variable or the distribution of the variable being encoded.

– Dan
Nov 26 '18 at 11:08

|
show 2 more comments

Regarding your t-shirts above, you can give a pandas categorical type an order:

df['T-size'].astype(pd.api.types.CategoricalDtype(['S','M','L'],ordered=True)).

edited Nov 29 '18 at 5:52

Mohamed Thasin ah

4,10132041

answered Nov 26 '18 at 10:19

Dan

37.1k1056102

What is the importance of ordered. i.e., what is the difference between df['T-size'].astype('category') and df['T-size'].astype(pd.api.types.CategoricalDtype(['S','M','L'],ordered=True))

– Mohamed Thasin ah
Nov 26 '18 at 10:39

1

Well, you said you wanted to be sure that S<M<L, so that's what ordered does for you

– Dan
Nov 26 '18 at 10:39

yeah got it ;-)

– Mohamed Thasin ah
Nov 26 '18 at 10:40

For a clarification in You don't want to do this encoding before your train-test split otherwise you're using knowledge of your unseen data making it not truly unseen. this statement this encoding represents which encoding? you mean my assumption of encoding right? or you mean something other?

– Mohamed Thasin ah
Nov 26 '18 at 10:58

1

@MohamedThasinah technically, any encoding, But particularly any encoding that uses the target variable or the distribution of the variable being encoded.

– Dan
Nov 26 '18 at 11:08

|
show 2 more comments

Regarding your t-shirts above, you can give a pandas categorical type an order:

df['T-size'].astype(pd.api.types.CategoricalDtype(['S','M','L'],ordered=True)).

edited Nov 29 '18 at 5:52

Mohamed Thasin ah

4,10132041

answered Nov 26 '18 at 10:19

Dan

37.1k1056102

Regarding your t-shirts above, you can give a pandas categorical type an order:

df['T-size'].astype(pd.api.types.CategoricalDtype(['S','M','L'],ordered=True)).

edited Nov 29 '18 at 5:52

Mohamed Thasin ah

4,10132041

answered Nov 26 '18 at 10:19

Dan

37.1k1056102

edited Nov 29 '18 at 5:52

Mohamed Thasin ah

4,10132041

edited Nov 29 '18 at 5:52

Mohamed Thasin ah

4,10132041

edited Nov 29 '18 at 5:52

Mohamed Thasin ah

4,10132041

answered Nov 26 '18 at 10:19

Dan

37.1k1056102

answered Nov 26 '18 at 10:19

Dan

37.1k1056102

answered Nov 26 '18 at 10:19

Dan

37.1k1056102

What is the importance of ordered. i.e., what is the difference between df['T-size'].astype('category') and df['T-size'].astype(pd.api.types.CategoricalDtype(['S','M','L'],ordered=True))

– Mohamed Thasin ah
Nov 26 '18 at 10:39

1

Well, you said you wanted to be sure that S<M<L, so that's what ordered does for you

– Dan
Nov 26 '18 at 10:39

yeah got it ;-)

– Mohamed Thasin ah
Nov 26 '18 at 10:40

For a clarification in You don't want to do this encoding before your train-test split otherwise you're using knowledge of your unseen data making it not truly unseen. this statement this encoding represents which encoding? you mean my assumption of encoding right? or you mean something other?

– Mohamed Thasin ah
Nov 26 '18 at 10:58

1

@MohamedThasinah technically, any encoding, But particularly any encoding that uses the target variable or the distribution of the variable being encoded.

– Dan
Nov 26 '18 at 11:08

|
show 2 more comments

What is the importance of ordered. i.e., what is the difference between df['T-size'].astype('category') and df['T-size'].astype(pd.api.types.CategoricalDtype(['S','M','L'],ordered=True))

– Mohamed Thasin ah
Nov 26 '18 at 10:39

1

Well, you said you wanted to be sure that S<M<L, so that's what ordered does for you

– Dan
Nov 26 '18 at 10:39

yeah got it ;-)

– Mohamed Thasin ah
Nov 26 '18 at 10:40

For a clarification in You don't want to do this encoding before your train-test split otherwise you're using knowledge of your unseen data making it not truly unseen. this statement this encoding represents which encoding? you mean my assumption of encoding right? or you mean something other?

– Mohamed Thasin ah
Nov 26 '18 at 10:58

1

@MohamedThasinah technically, any encoding, But particularly any encoding that uses the target variable or the distribution of the variable being encoded.

– Dan
Nov 26 '18 at 11:08

What is the importance of ordered. i.e., what is the difference between df['T-size'].astype('category') and df['T-size'].astype(pd.api.types.CategoricalDtype(['S','M','L'],ordered=True))

– Mohamed Thasin ah
Nov 26 '18 at 10:39

Well, you said you wanted to be sure that S<M<L, so that's what ordered does for you

– Dan
Nov 26 '18 at 10:39

yeah got it ;-)

– Mohamed Thasin ah
Nov 26 '18 at 10:40

For a clarification in

You don't want to do this encoding before your train-test split otherwise you're using knowledge of your unseen data making it not truly unseen.

this statement this encoding represents which encoding? you mean my assumption of encoding right? or you mean something other?

– Mohamed Thasin ah
Nov 26 '18 at 10:58

For a clarification in

You don't want to do this encoding before your train-test split otherwise you're using knowledge of your unseen data making it not truly unseen.

this statement this encoding represents which encoding? you mean my assumption of encoding right? or you mean something other?

– Mohamed Thasin ah
Nov 26 '18 at 10:58

@MohamedThasinah technically, any encoding, But particularly any encoding that uses the target variable or the distribution of the variable being encoded.

– Dan
Nov 26 '18 at 11:08

|
show 2 more comments

For the first question, if you have a small number of categories, you could map the column with a dictionary. In this way you can set an order:

d = {'L':2, 'M':1, 'S':0}

df['T-size'] = df['T-size'].map(d)

Output:

   T-size Gender  Label

0       2      M      1

1       2      M      1

2       1      F      1

3       0      F      0 

4       1      M      1

5       2      M      0

6       0      F      1

7       0      F      0

8       1      M      1

edited Nov 26 '18 at 9:58

answered Nov 26 '18 at 9:51

Joe

6,12421630

thanks for the answer, here you used mapping values for s->0, M->2, L->2 May I know why did you choose 0,1,2. If it represents weight can I use 10, 20, 30 respectively. will it make any difference in my model

– Mohamed Thasin ah
Nov 26 '18 at 9:55

I would start from the smallest value 0, and increase till the biggest

– Joe
Nov 26 '18 at 9:56

I'm really confused with what value to assign. Is replacing a scalar value to the category sufficient to deal with categorical data?

– Mohamed Thasin ah
Nov 26 '18 at 10:02

Yes it is enough

– Joe
Nov 26 '18 at 10:03

sorry for the repeated question, I'm really wondering, Don't I need to give weight explicitly?

– Mohamed Thasin ah
Nov 26 '18 at 10:04

|
show 4 more comments

For the first question, if you have a small number of categories, you could map the column with a dictionary. In this way you can set an order:

d = {'L':2, 'M':1, 'S':0}

df['T-size'] = df['T-size'].map(d)

Output:

   T-size Gender  Label

0       2      M      1

1       2      M      1

2       1      F      1

3       0      F      0 

4       1      M      1

5       2      M      0

6       0      F      1

7       0      F      0

8       1      M      1

edited Nov 26 '18 at 9:58

answered Nov 26 '18 at 9:51

Joe

6,12421630

thanks for the answer, here you used mapping values for s->0, M->2, L->2 May I know why did you choose 0,1,2. If it represents weight can I use 10, 20, 30 respectively. will it make any difference in my model

– Mohamed Thasin ah
Nov 26 '18 at 9:55

I would start from the smallest value 0, and increase till the biggest

– Joe
Nov 26 '18 at 9:56

I'm really confused with what value to assign. Is replacing a scalar value to the category sufficient to deal with categorical data?

– Mohamed Thasin ah
Nov 26 '18 at 10:02

Yes it is enough

– Joe
Nov 26 '18 at 10:03

sorry for the repeated question, I'm really wondering, Don't I need to give weight explicitly?

– Mohamed Thasin ah
Nov 26 '18 at 10:04

|
show 4 more comments

For the first question, if you have a small number of categories, you could map the column with a dictionary. In this way you can set an order:

d = {'L':2, 'M':1, 'S':0}

df['T-size'] = df['T-size'].map(d)

Output:

   T-size Gender  Label

0       2      M      1

1       2      M      1

2       1      F      1

3       0      F      0 

4       1      M      1

5       2      M      0

6       0      F      1

7       0      F      0

8       1      M      1

edited Nov 26 '18 at 9:58

answered Nov 26 '18 at 9:51

Joe

6,12421630

For the first question, if you have a small number of categories, you could map the column with a dictionary. In this way you can set an order:

d = {'L':2, 'M':1, 'S':0}

df['T-size'] = df['T-size'].map(d)

Output:

   T-size Gender  Label

0       2      M      1

1       2      M      1

2       1      F      1

3       0      F      0 

4       1      M      1

5       2      M      0

6       0      F      1

7       0      F      0

8       1      M      1

edited Nov 26 '18 at 9:58

answered Nov 26 '18 at 9:51

Joe

6,12421630

edited Nov 26 '18 at 9:58

answered Nov 26 '18 at 9:51

Joe

6,12421630

answered Nov 26 '18 at 9:51

Joe

6,12421630

answered Nov 26 '18 at 9:51

Joe

6,12421630

thanks for the answer, here you used mapping values for s->0, M->2, L->2 May I know why did you choose 0,1,2. If it represents weight can I use 10, 20, 30 respectively. will it make any difference in my model

– Mohamed Thasin ah
Nov 26 '18 at 9:55

I would start from the smallest value 0, and increase till the biggest

– Joe
Nov 26 '18 at 9:56

I'm really confused with what value to assign. Is replacing a scalar value to the category sufficient to deal with categorical data?

– Mohamed Thasin ah
Nov 26 '18 at 10:02

Yes it is enough

– Joe
Nov 26 '18 at 10:03

sorry for the repeated question, I'm really wondering, Don't I need to give weight explicitly?

– Mohamed Thasin ah
Nov 26 '18 at 10:04

|
show 4 more comments

thanks for the answer, here you used mapping values for s->0, M->2, L->2 May I know why did you choose 0,1,2. If it represents weight can I use 10, 20, 30 respectively. will it make any difference in my model

– Mohamed Thasin ah
Nov 26 '18 at 9:55

I would start from the smallest value 0, and increase till the biggest

– Joe
Nov 26 '18 at 9:56

I'm really confused with what value to assign. Is replacing a scalar value to the category sufficient to deal with categorical data?

– Mohamed Thasin ah
Nov 26 '18 at 10:02

Yes it is enough

– Joe
Nov 26 '18 at 10:03

sorry for the repeated question, I'm really wondering, Don't I need to give weight explicitly?

– Mohamed Thasin ah
Nov 26 '18 at 10:04

thanks for the answer, here you used mapping values for s->0, M->2, L->2 May I know why did you choose 0,1,2. If it represents weight can I use 10, 20, 30 respectively. will it make any difference in my model

– Mohamed Thasin ah
Nov 26 '18 at 9:55

I would start from the smallest value 0, and increase till the biggest

– Joe
Nov 26 '18 at 9:56

I'm really confused with what value to assign. Is replacing a scalar value to the category sufficient to deal with categorical data?

– Mohamed Thasin ah
Nov 26 '18 at 10:02

Yes it is enough

– Joe
Nov 26 '18 at 10:03

sorry for the repeated question, I'm really wondering, Don't I need to give weight explicitly?

– Mohamed Thasin ah
Nov 26 '18 at 10:04

|
show 4 more comments

If you want to have a hierarchy in your size parameter, you may consider using a linear mapping for it. This would be :

size_mapping = {"S": 1, "M":2 , "L":3}



#mapping to the DataFrame

df['T-size_num'] = df['T-size'].map(size_mapping)

This allows you to treat the input as numerical data while preserving the hierarchy

df['Gender_num'] = df['Gender'].map({'M':0 , 'F':1})

For a more detailed explanation and a coverage of more specificities than your question, I suggest reading this article regarding categorical data in Machine Learning

answered Nov 26 '18 at 9:54

SantiStSupery

117112

Thanks for the answer, for first question, you assigns 1, 2, 3 may I know the reson behind that?

– Mohamed Thasin ah
Nov 26 '18 at 10:04

The relationship of the values of your size parameters are unknown, so I assume it to be linear. And since sizes are positive entities, I took your minimum value to be greater than 0. It makes sure that the M size is a multiple of the S size (which is also true for the L size of course), and it would not be true for S=0, M=1 and L=3

– SantiStSupery
Nov 26 '18 at 10:20

I'm curious to know instead of 1, 2, 3 Can I use 10, 20, 30 or 5, 10, 15 ?

– Mohamed Thasin ah
Nov 26 '18 at 10:22

1

Yes you could. The most important part is the way you model the relationship between your elements. If it is set to be linear with c=3a and b=2a, the patterns you will find in your data will remain the same whether you use 1, 2 and 3 or any other triplet that preserve that same linear relationship.

– SantiStSupery
Nov 26 '18 at 10:29

add a comment |

If you want to have a hierarchy in your size parameter, you may consider using a linear mapping for it. This would be :

size_mapping = {"S": 1, "M":2 , "L":3}



#mapping to the DataFrame

df['T-size_num'] = df['T-size'].map(size_mapping)

This allows you to treat the input as numerical data while preserving the hierarchy

df['Gender_num'] = df['Gender'].map({'M':0 , 'F':1})

For a more detailed explanation and a coverage of more specificities than your question, I suggest reading this article regarding categorical data in Machine Learning

answered Nov 26 '18 at 9:54

SantiStSupery

117112

Thanks for the answer, for first question, you assigns 1, 2, 3 may I know the reson behind that?

– Mohamed Thasin ah
Nov 26 '18 at 10:04

The relationship of the values of your size parameters are unknown, so I assume it to be linear. And since sizes are positive entities, I took your minimum value to be greater than 0. It makes sure that the M size is a multiple of the S size (which is also true for the L size of course), and it would not be true for S=0, M=1 and L=3

– SantiStSupery
Nov 26 '18 at 10:20

I'm curious to know instead of 1, 2, 3 Can I use 10, 20, 30 or 5, 10, 15 ?

– Mohamed Thasin ah
Nov 26 '18 at 10:22

1

Yes you could. The most important part is the way you model the relationship between your elements. If it is set to be linear with c=3a and b=2a, the patterns you will find in your data will remain the same whether you use 1, 2 and 3 or any other triplet that preserve that same linear relationship.

– SantiStSupery
Nov 26 '18 at 10:29

add a comment |

If you want to have a hierarchy in your size parameter, you may consider using a linear mapping for it. This would be :

size_mapping = {"S": 1, "M":2 , "L":3}



#mapping to the DataFrame

df['T-size_num'] = df['T-size'].map(size_mapping)

This allows you to treat the input as numerical data while preserving the hierarchy

df['Gender_num'] = df['Gender'].map({'M':0 , 'F':1})

For a more detailed explanation and a coverage of more specificities than your question, I suggest reading this article regarding categorical data in Machine Learning

answered Nov 26 '18 at 9:54

SantiStSupery

117112

Thanks for the answer. For the first question, you assigned 1, 2, 3; may I know the reason behind that?

– Mohamed Thasin ah
Nov 26 '18 at 10:04

The relationship between the values of your size parameter is unknown, so I assumed it to be linear. And since sizes are positive quantities, I took the minimum value to be greater than 0. This makes sure that the M size is a multiple of the S size (which is also true for the L size, of course); that would not hold for S=0, M=1 and L=3.

– SantiStSupery
Nov 26 '18 at 10:20

I'm curious: instead of 1, 2, 3, can I use 10, 20, 30 or 5, 10, 15?

– Mohamed Thasin ah
Nov 26 '18 at 10:22

Yes, you could. The most important part is the way you model the relationship between your elements. If it is set to be linear with c=3a and b=2a, the patterns you will find in your data will remain the same whether you use 1, 2 and 3 or any other triplet that preserves that same linear relationship.

– SantiStSupery
Nov 26 '18 at 10:29
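The claim in the comment above can be checked with a small sketch. It assumes an ordinary least-squares fit on the T-size and Label columns from the question, which is just one possible model; the helper `fit_predict` is hypothetical. Regularized or distance-based models are sensitive to feature scale, so the result should not be read as holding for every model.

```python
import numpy as np

# T-size and Label columns copied from the question
sizes = ["L", "L", "M", "S", "M", "L", "S", "S", "M"]
labels = np.array([1, 1, 1, 0, 1, 0, 1, 0, 1], dtype=float)

def fit_predict(mapping):
    """Fit y = w*x + b by least squares and return the fitted values."""
    x = np.array([mapping[s] for s in sizes], dtype=float)
    design = np.column_stack([x, np.ones_like(x)])  # columns: x, intercept
    coef, *_ = np.linalg.lstsq(design, labels, rcond=None)
    return design @ coef

pred_small = fit_predict({"S": 1, "M": 2, "L": 3})
pred_large = fit_predict({"S": 10, "M": 20, "L": 30})
pred_uneven = fit_predict({"S": 1, "M": 2, "L": 4})  # spacing no longer linear

# Rescaling changes the slope but not the fitted values
print(np.allclose(pred_small, pred_large))
# Changing the spacing between sizes does change the fit
print(np.allclose(pred_small, pred_uneven))
```

The rescaled mapping yields the same fitted values because only the slope changes, while the unevenly spaced mapping encodes a different relationship between the sizes and therefore a different fit.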

So a dataset of:

Gender
M
F
M
M
F

Would become

Gender_M    Gender_F
1           0
0           1
1           0
1           0
0           1
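The transformation in the table above can be reproduced with pandas. This is a minimal sketch using `pd.get_dummies`, which is one common way to one-hot encode a column; the answer itself does not say which tool was intended.

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["M", "F", "M", "M", "F"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(df["Gender"], prefix="Gender", dtype=int)

# get_dummies orders the new columns alphabetically (Gender_F, Gender_M),
# so reorder them to match the table above
one_hot = one_hot[["Gender_M", "Gender_F"]]
print(one_hot)
```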

answered Nov 26 '18 at 10:23

Thomas Kimber


Thanks for the answer. I understood the first part, but I failed to understand this point: "an absolute must for categorical data with more than 2 options, where there's no transitive A > B > C relationship and you don't want to smear your results by forcing one into your encoding scheme." Can you clarify it for me?

– Mohamed Thasin ah
Nov 26 '18 at 10:31

So you've outlined two types of categorical variable in your example. {S,M,L} is transitive, since you can assign them an order. {M,F} isn't transitive, but it is binary, so coding to {0,1} is probably ok. Another type of categorical variable is one where there are more than 2 options, but which are not orderable (i.e. not transitive). For example, {Vanilla, Strawberry, Grape} might be the options in a categorical variable, but if transformed to {0,1,2} a skew is introduced, suggesting that Strawberry exists "between" Vanilla and Grape when no such relationship exists outside of the arbitrary coding.

– Thomas Kimber
Nov 26 '18 at 10:44
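The "between" effect described in the comment above can be made concrete by comparing distances under the two encodings. This is an illustrative sketch only: the integer codes are an arbitrary choice, and the helper `distance` is hypothetical.

```python
import numpy as np

flavors = ["Vanilla", "Strawberry", "Grape"]

# Arbitrary integer codes: Vanilla=0, Strawberry=1, Grape=2
int_codes = {name: float(i) for i, name in enumerate(flavors)}

# One-hot vectors: each flavor gets its own axis (rows of the identity matrix)
one_hot = {name: row for name, row in zip(flavors, np.eye(len(flavors)))}

def distance(encoding, a, b):
    """Euclidean distance between two encoded categories."""
    diff = np.atleast_1d(encoding[a]) - np.atleast_1d(encoding[b])
    return float(np.linalg.norm(diff))

pairs = [("Vanilla", "Strawberry"), ("Strawberry", "Grape"), ("Vanilla", "Grape")]
for a, b in pairs:
    print(a, b, distance(int_codes, a, b), round(distance(one_hot, a, b), 3))
```

With integer codes, Vanilla ends up twice as far from Grape as from Strawberry, while with one-hot vectors every pair of flavors is equally far apart.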

Great explanation, it's really appreciated. For {Vanilla, Strawberry, Grape}, one-hot encoding solves the problem, right?

– Mohamed Thasin ah
Nov 26 '18 at 10:52

Yes, exactly. There's nothing stopping you from using it for {M,F} too. One-hot encoding has the disadvantage of expanding the dimensionality of your data, which can be a problem when the number of categories in a field gets large, but bearing that in mind, it's usually a good general-use option.

– Thomas Kimber
Nov 26 '18 at 11:02
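To see the dimensionality cost mentioned in the comment above, here is a small sketch with a made-up column of 50 distinct categories. The column name `item` and its values are hypothetical and not part of the question's data.

```python
import pandas as pd

# Hypothetical column with 50 distinct categories, each appearing twice
df = pd.DataFrame({"item": [f"item_{i}" for i in range(50)] * 2})

# One-hot encoding turns the single column into one column per category
one_hot = pd.get_dummies(df["item"], dtype=int)
print(df.shape, "->", one_hot.shape)

# drop_first=True omits one redundant indicator column per field
reduced = pd.get_dummies(df["item"], drop_first=True, dtype=int)
print(reduced.shape)
```

Dropping one indicator only saves a single column per field, so for fields with very many categories a different encoding may be worth considering.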
