How to handle categorical data for preprocessing in Machine Learning



























This may be a basic question. I have categorical data that I want to feed into my machine learning model, but the model accepts only numerical data. What is the correct way to convert categorical data into numerical data?



My Sample DF:



  T-size Gender  Label
0      L      M      1
1      L      M      1
2      M      F      1
3      S      F      0
4      M      M      1
5      L      M      0
6      S      F      1
7      S      F      0
8      M      M      1


I know the following code converts my categorical data into numerical data:



Type-1:



df['T-size'] = df['T-size'].cat.codes


The line above simply maps the categories to the integers 0 to N-1. It does not follow any relationship between them.

For this example I know that S < M < L. What should I do when I want to convert data like this?



Type-2:



In this type there is no relationship between M and F, but I can tell that M has a higher probability than F of having Label 1, i.e. (number of samples with Label 1) / (total number of samples).

For male: (4/5)

For female: (2/4)

We know that (4/5) > (2/4).

How should I encode this kind of column?

Can I replace M with (4/5) and F with (2/4) for this problem?

What is the proper way to deal with such a column?

Please help me understand this better.

















      python pandas dataframe machine-learning feature-selection














      edited Nov 26 '18 at 14:32 by Joe

      asked Nov 26 '18 at 9:26 by Mohamed Thasin ah
























          4 Answers
































          There are many ways to encode categorical data, and the right one depends on exactly what you plan to do with it. For example, one-hot encoding, which is easily the most popular choice, can be a very poor choice if you're planning on using a decision tree / random forest / GBM.

          Regarding your t-shirts above, you can give a pandas categorical type an order:

          df['T-size'] = df['T-size'].astype(pd.api.types.CategoricalDtype(['S','M','L'], ordered=True))

          If you had set up your t-shirt categorical like that, your .cat.codes method would work as intended (S=0, M=1, L=2). If you want the same thing inside a scikit-learn pipeline, note that LabelEncoder is intended for target labels and sorts classes alphabetically; OrdinalEncoder with an explicit categories list is the encoder that respects a custom order.
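To make the effect of the category order concrete, here is a minimal sketch using the T-size column from the question. The variable names (`size_dtype`, `unordered_codes`, `ordered_codes`) are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"T-size": ["L", "L", "M", "S", "M", "L", "S", "S", "M"]})

# Without an explicit category list, pandas sorts the categories
# alphabetically, so the codes are L=0, M=1, S=2.
unordered_codes = df["T-size"].astype("category").cat.codes

# With an explicit list, the codes follow the declared order: S=0, M=1, L=2.
size_dtype = pd.api.types.CategoricalDtype(["S", "M", "L"], ordered=True)
ordered_codes = df["T-size"].astype(size_dtype).cat.codes

print(unordered_codes.tolist())  # [0, 0, 1, 2, 1, 0, 2, 2, 1]
print(ordered_codes.tolist())    # [2, 2, 1, 0, 1, 2, 0, 0, 1]
```

The codes themselves come from the position of each value in the category list. The `ordered=True` flag additionally lets pandas compare values (for example with `<` or `min()`), which an unordered categorical does not allow.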



          Regarding your encoding of gender, you need to be very careful when using your target variable (your Label). You don't want to do this encoding before your train-test split, otherwise you're using knowledge of your unseen data, making it not truly unseen. This gets even more complicated if you're using cross-validation, as you'll need to do the encoding within each CV iteration (i.e. a new encoding per fold). If you want to do this, I recommend you check out TargetEncoder from scikit-learn-contrib's Category Encoders, but again, be sure to use it within an sklearn Pipeline, or you will mess up the train-test splits and leak information from your test set into your training set.
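The leakage point can be illustrated with a hand-rolled sketch. This is not the TargetEncoder from Category Encoders, just a plain pandas approximation of the idea. The split (first six rows for training, last three held out) and the column name `Gender_te` are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["M", "M", "F", "F", "M", "M", "F", "F", "M"],
    "Label":  [1, 1, 1, 0, 1, 0, 1, 0, 1],
})

# Hypothetical split: the first six rows are training data, the rest is held out.
train, test = df.iloc[:6].copy(), df.iloc[6:].copy()

# Learn the per-category mean of the target from the training rows only.
category_means = train.groupby("Gender")["Label"].mean()
global_mean = train["Label"].mean()

# Apply the learned mapping to both splits; a category that never appeared
# in training falls back to the overall training mean.
train["Gender_te"] = train["Gender"].map(category_means)
test["Gender_te"] = test["Gender"].map(category_means).fillna(global_mean)

print(category_means.to_dict())    # {'F': 0.5, 'M': 0.75}
print(test["Gender_te"].tolist())  # [0.5, 0.5, 0.75]
```

On the full table the value for M would be 4/5, but the training rows alone give 3/4. That gap is exactly the information that would leak from the held-out rows if the encoding were computed before the split. Production encoders typically also smooth these means, which this sketch does not attempt.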
































          • What is the importance of ordered? That is, what is the difference between df['T-size'].astype('category') and df['T-size'].astype(pd.api.types.CategoricalDtype(['S','M','L'],ordered=True))?

            – Mohamed Thasin ah
            Nov 26 '18 at 10:39

          • Well, you said you wanted to be sure that S<M<L, so that's what ordered does for you.

            – Dan
            Nov 26 '18 at 10:39

          • Yeah, got it ;-)

            – Mohamed Thasin ah
            Nov 26 '18 at 10:40

          • A clarification on "You don't want to do this encoding before your train-test split, otherwise you're using knowledge of your unseen data, making it not truly unseen": which encoding does "this encoding" refer to? Do you mean the encoding I proposed, or something else?

            – Mohamed Thasin ah
            Nov 26 '18 at 10:58

          • @MohamedThasinah Technically, any encoding, but particularly any encoding that uses the target variable or the distribution of the variable being encoded.

            – Dan
            Nov 26 '18 at 11:08

































          For the first question, if you have a small number of categories, you could map the column with a dictionary. In this way you can set an order:



          d = {'L':2, 'M':1, 'S':0}
          df['T-size'] = df['T-size'].map(d)


          Output:



            T-size Gender  Label
          0      2      M      1
          1      2      M      1
          2      1      F      1
          3      0      F      0
          4      1      M      1
          5      2      M      0
          6      0      F      1
          7      0      F      0
          8      1      M      1


          For the second question, you can use the same method, but I would simply map the two values, male and female, to 0 and 1. If you only need to tell the categories apart and don't have to do arithmetic on the values, one value is as good as another.
































          • Thanks for the answer. Here you mapped S->0, M->1, L->2. May I know why you chose 0, 1, 2? If the values represent a weight, can I use 10, 20, 30 instead? Will it make any difference to my model?

            – Mohamed Thasin ah
            Nov 26 '18 at 9:55

          • I would start from the smallest value, 0, and increase up to the biggest.

            – Joe
            Nov 26 '18 at 9:56

          • I'm really confused about what value to assign. Is replacing each category with a scalar value sufficient to deal with categorical data?

            – Mohamed Thasin ah
            Nov 26 '18 at 10:02

          • Yes, it is enough.

            – Joe
            Nov 26 '18 at 10:03

          • Sorry for the repeated question, but I'm really wondering: don't I need to give the weight explicitly?

            – Mohamed Thasin ah
            Nov 26 '18 at 10:04

































          If you want to have a hierarchy in your size parameter, you may consider using a linear mapping for it. This would be:

          size_mapping = {"S": 1, "M": 2, "L": 3}

          # map the sizes onto a new numeric column of the DataFrame
          df['T-size_num'] = df['T-size'].map(size_mapping)

          This allows you to treat the input as numerical data while preserving the hierarchy.

          As for the gender, you are confusing the distribution of the labels with the preprocessing. If you feed that distribution in as an input, you introduce a bias into your data. You must treat male and female as two distinct categories regardless of their observed distribution. Map them to two different numbers, but without introducing proportions.

          df['Gender_num'] = df['Gender'].map({'M': 0, 'F': 1})

          For a more detailed explanation that covers more cases than your question, I suggest reading this article regarding categorical data in Machine Learning






























          • Thanks for the answer. For the first question you assign 1, 2, 3; may I know the reason behind that?

            – Mohamed Thasin ah
            Nov 26 '18 at 10:04

          • The relationship between the values of your size parameter is unknown, so I assume it to be linear. And since sizes are positive quantities, I took the minimum value to be greater than 0. This makes sure that the M size is a multiple of the S size (which is also true for the L size, of course), which would not be true for S=0, M=1 and L=3.

            – SantiStSupery
            Nov 26 '18 at 10:20

          • I'm curious to know: instead of 1, 2, 3, can I use 10, 20, 30 or 5, 10, 15?

            – Mohamed Thasin ah
            Nov 26 '18 at 10:22

          • Yes, you could. The most important part is the way you model the relationship between your elements. If it is set to be linear, with c=3a and b=2a, the patterns you will find in your data will remain the same whether you use 1, 2 and 3 or any other triplet that preserves that same linear relationship.

            – SantiStSupery
            Nov 26 '18 at 10:29
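The claim about 1, 2, 3 versus 10, 20, 30 can be checked with a small sketch. It assumes an unregularized one-feature linear fit, written out by hand so that no library is needed; `fit_line` and `predictions` are hypothetical helper names. The equivalence is not guaranteed for every model: regularized or distance-based models, for instance, are sensitive to the scale of a feature.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x with one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# T-size and Label columns from the question's sample data.
sizes = ["L", "L", "M", "S", "M", "L", "S", "S", "M"]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 1]


def predictions(mapping):
    """Encode the sizes with the given mapping, fit a line, and predict each row."""
    xs = [mapping[s] for s in sizes]
    intercept, slope = fit_line(xs, labels)
    return [intercept + slope * x for x in xs]


preds_small = predictions({"S": 1, "M": 2, "L": 3})
preds_large = predictions({"S": 10, "M": 20, "L": 30})

# The fitted slope shrinks by the same factor the inputs grew, so the
# predictions agree up to floating-point error.
same = all(abs(p - q) < 1e-9 for p, q in zip(preds_small, preds_large))
print(same)  # True
```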

































          It might be overkill for the M/F example, since it's binary, but if you are ever concerned about mapping a categorical into a numerical form, then consider one-hot encoding. It basically stretches your single column containing n categories into n binary columns.

          So a dataset of:

          Gender
          M
          F
          M
          M
          F

          Would become

          Gender_M    Gender_F
          1           0
          0           1
          1           0
          1           0
          0           1

          This takes away any notion of one thing being more "positive" than another - an absolute must for categorical data with more than 2 options, where there's no transitive A > B > C relationship and you don't want to smear your results by forcing one into your encoding scheme.
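As a rough illustration of the transformation above, pandas can produce the indicator columns with `get_dummies`. This sketch uses the five-row Gender example; the variable name `encoded` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["M", "F", "M", "M", "F"]})

# One binary column per category; dtype=int yields 0/1 instead of booleans.
# pandas orders the new columns alphabetically, so Gender_F precedes Gender_M.
encoded = pd.get_dummies(df, columns=["Gender"], dtype=int)

print(encoded.columns.tolist())      # ['Gender_F', 'Gender_M']
print(encoded["Gender_M"].tolist())  # [1, 0, 1, 1, 0]
```

For a binary column the two indicators are redundant (one is always 1 minus the other), so passing `drop_first=True` to `get_dummies` keeps a single column.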






























          • Thanks for the answer. I understood the first part, but I failed to understand this point: "an absolute must for categorical data with more than 2 options, where there's no transitive A > B > C relationship and you don't want to smear your results by forcing one into your encoding scheme." Can you clarify it for me?

            – Mohamed Thasin ah
            Nov 26 '18 at 10:31

          • So you've outlined two types of categorical variable in your example. {S,M,L} is transitive, since you can assign them an order. {M,F} isn't transitive, but it is binary, so coding to {0,1} is probably OK. Another type of categorical variable is one where there are more than 2 options that are not orderable (i.e. not transitive). For example, {Vanilla, Strawberry, Grape} might be the options in a categorical variable, but if transformed to {0,1,2} a skew is introduced, suggesting that Strawberry exists "between" Vanilla and Grape when no such relationship exists outside of the arbitrary coding.

            – Thomas Kimber
            Nov 26 '18 at 10:44

          • Great explanation, it's really appreciated. For {Vanilla, Strawberry, Grape}, one-hot encoding solves the problem, right?

            – Mohamed Thasin ah
            Nov 26 '18 at 10:52

          • Yes, exactly. There's nothing stopping you from using it for {M,F} too. One-hot encoding has the disadvantage of expanding the dimensionality of your data, which can be a problem when the number of categories in a field gets large, but bearing that in mind, it's usually a good general-use option.

            – Thomas Kimber
            Nov 26 '18 at 11:02



















          edited Nov 29 '18 at 5:52 by Mohamed Thasin ah

          answered Nov 26 '18 at 10:19 by Dan













          • What is the importance of ordered. i.e., what is the difference between df['T-size'].astype('category') and df['T-size'].astype(pd.api.types.CategoricalDtype(['S','M','L'],ordered=True))

            – Mohamed Thasin ah
            Nov 26 '18 at 10:39






          • 1





            Well, you said you wanted to be sure that S<M<L, so that's what ordered does for you

            – Dan
            Nov 26 '18 at 10:39











          • yeah got it ;-)

            – Mohamed Thasin ah
            Nov 26 '18 at 10:40











          • For a clarification in You don't want to do this encoding before your train-test split otherwise you're using knowledge of your unseen data making it not truly unseen. this statement this encoding represents which encoding? you mean my assumption of encoding right? or you mean something other?

            – Mohamed Thasin ah
            Nov 26 '18 at 10:58






          • 1





            @MohamedThasinah technically, any encoding, But particularly any encoding that uses the target variable or the distribution of the variable being encoded.

            – Dan
            Nov 26 '18 at 11:08



















          • What is the importance of ordered. i.e., what is the difference between df['T-size'].astype('category') and df['T-size'].astype(pd.api.types.CategoricalDtype(['S','M','L'],ordered=True))

            – Mohamed Thasin ah
            Nov 26 '18 at 10:39






          • 1





            Well, you said you wanted to be sure that S<M<L, so that's what ordered does for you

            – Dan
            Nov 26 '18 at 10:39











          • yeah got it ;-)

            – Mohamed Thasin ah
            Nov 26 '18 at 10:40











          • For a clarification in You don't want to do this encoding before your train-test split otherwise you're using knowledge of your unseen data making it not truly unseen. this statement this encoding represents which encoding? you mean my assumption of encoding right? or you mean something other?

            – Mohamed Thasin ah
            Nov 26 '18 at 10:58






          • 1





            @MohamedThasinah technically, any encoding, But particularly any encoding that uses the target variable or the distribution of the variable being encoded.

            – Dan
            Nov 26 '18 at 11:08

















          What is the importance of ordered. i.e., what is the difference between df['T-size'].astype('category') and df['T-size'].astype(pd.api.types.CategoricalDtype(['S','M','L'],ordered=True))

          – Mohamed Thasin ah
          Nov 26 '18 at 10:39





          What is the importance of ordered. i.e., what is the difference between df['T-size'].astype('category') and df['T-size'].astype(pd.api.types.CategoricalDtype(['S','M','L'],ordered=True))

          – Mohamed Thasin ah
          Nov 26 '18 at 10:39




          1




          1





          Well, you said you wanted to be sure that S<M<L, so that's what ordered does for you

          – Dan
          Nov 26 '18 at 10:39





          Well, you said you wanted to be sure that S<M<L, so that's what ordered does for you

          – Dan
          Nov 26 '18 at 10:39













          yeah got it ;-)

          – Mohamed Thasin ah
          Nov 26 '18 at 10:40





          yeah got it ;-)

          – Mohamed Thasin ah
          Nov 26 '18 at 10:40













          For a clarification in You don't want to do this encoding before your train-test split otherwise you're using knowledge of your unseen data making it not truly unseen. this statement this encoding represents which encoding? you mean my assumption of encoding right? or you mean something other?

          – Mohamed Thasin ah
          Nov 26 '18 at 10:58





          For a clarification in You don't want to do this encoding before your train-test split otherwise you're using knowledge of your unseen data making it not truly unseen. this statement this encoding represents which encoding? you mean my assumption of encoding right? or you mean something other?

          – Mohamed Thasin ah
          Nov 26 '18 at 10:58




          1




          1





          @MohamedThasinah technically, any encoding, But particularly any encoding that uses the target variable or the distribution of the variable being encoded.

          – Dan
          Nov 26 '18 at 11:08





          @MohamedThasinah technically, any encoding, But particularly any encoding that uses the target variable or the distribution of the variable being encoded.

          – Dan
          Nov 26 '18 at 11:08













          2














          For the first question, if you have a small number of categories, you could map the column with a dictionary. In this way you can set an order:



          d = {'L':2, 'M':1, 'S':0}
          df['T-size'] = df['T-size'].map(d)


          Output:



             T-size Gender  Label
          0 2 M 1
          1 2 M 1
          2 1 F 1
          3 0 F 0
          4 1 M 1
          5 2 M 0
          6 0 F 1
          7 0 F 0
          8 1 M 1


          For the second question, you can use the same method, but i would leave the 2 values for males and females 0 and 1. If you need just the category and you dont have to make operations with the values, a values is equal to another.






          share|improve this answer


























          • thanks for the answer, here you used mapping values for s->0, M->2, L->2 May I know why did you choose 0,1,2. If it represents weight can I use 10, 20, 30 respectively. will it make any difference in my model

            – Mohamed Thasin ah
            Nov 26 '18 at 9:55













          • I would start from 0 for the smallest size and increase up to the largest.

            – Joe
            Nov 26 '18 at 9:56











          • I'm really confused about which values to assign. Is replacing each category with a scalar value sufficient to handle categorical data?

            – Mohamed Thasin ah
            Nov 26 '18 at 10:02











          • Yes, it is enough.

            – Joe
            Nov 26 '18 at 10:03











          • Sorry for the repeated question, but I'm really wondering: don't I need to assign the weights explicitly?

            – Mohamed Thasin ah
            Nov 26 '18 at 10:04
























          edited Nov 26 '18 at 9:58

























          answered Nov 26 '18 at 9:51









          Joe



























          If you want to preserve an order in your size feature, you can consider a linear mapping for it, for example:



          size_mapping = {"S": 1, "M": 2, "L": 3}

          # Apply the mapping to the DataFrame
          df['T-size_num'] = df['T-size'].map(size_mapping)


          This allows you to treat the input as numerical data while preserving the order.



          As for gender, you are conflating the distribution of the labels with the preprocessing. If you feed the per-category label proportions in as an input, you introduce a bias into your data. Treat male and female as two distinct categories, regardless of how the labels are distributed across them. Map them to two different numbers, without introducing proportions.



          df['Gender_num'] = df['Gender'].map({'M':0 , 'F':1})


          For a more detailed explanation that covers more cases than your question, I suggest reading this article on categorical data in machine learning.






























          • Thanks for the answer. For the first question you assigned 1, 2, 3. May I know the reason behind that?

            – Mohamed Thasin ah
            Nov 26 '18 at 10:04











          • The relationship between the values of your size feature is unknown, so I assumed it to be linear. Since sizes are positive quantities, I took the minimum value to be greater than 0. This ensures that the M size is a multiple of the S size (which is also true for the L size, of course); that would not hold for S=0, M=1 and L=3.

            – SantiStSupery
            Nov 26 '18 at 10:20













          • I'm curious: instead of 1, 2, 3, can I use 10, 20, 30 or 5, 10, 15?

            – Mohamed Thasin ah
            Nov 26 '18 at 10:22






          • Yes, you could. The most important part is how you model the relationship between your elements. If it is set to be linear, with b=2a and c=3a, the patterns you find in your data will remain the same whether you use 1, 2 and 3 or any other triplet that preserves that linear relationship.

            – SantiStSupery
            Nov 26 '18 at 10:29
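A quick way to check the claim in the comment above is to fit a plain least-squares line with both encodings and compare the predictions. This is a sketch under stated assumptions: the target values `y` are invented for illustration, and the result applies to an unregularized linear fit. Models that depend on feature scale (for example distance-based or regularized ones) may behave differently unless the feature is rescaled.

```python
import numpy as np

# Toy target values, invented for illustration only.
y = np.array([1.0, 2.5, 2.0, 4.0, 3.5, 5.0])

sizes_a = np.array([1, 2, 3, 1, 2, 3], dtype=float)  # S, M, L -> 1, 2, 3
sizes_b = sizes_a * 10                                # S, M, L -> 10, 20, 30


def fit_predict(x, y):
    """Fit the least-squares line y ~ slope * x + intercept and predict on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope * x + intercept


pred_a = fit_predict(sizes_a, y)
pred_b = fit_predict(sizes_b, y)

# The slope shrinks by a factor of 10, but the fitted values are unchanged.
same = np.allclose(pred_a, pred_b)
```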


























          answered Nov 26 '18 at 9:54









          SantiStSupery



























          It might be overkill for the M/F example, since it's binary, but if you are ever concerned about mapping a categorical variable into numerical form, consider one-hot encoding. It stretches a single column containing n categories into n binary columns.



          So a dataset of:



          Gender
          M
          F
          M
          M
          F


          Would become



          Gender_M    Gender_F
          1           0
          0           1
          1           0
          1           0
          0           1


          This takes away any notion of one thing being more "positive" than another - an absolute must for categorical data with more than 2 options, where there's no transitive A > B > C relationship and you don't want to smear your results by forcing one into your encoding scheme.
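In pandas, one way to get this is `pd.get_dummies`. The sketch below assumes a DataFrame with a single `Gender` column; note that pandas orders the new columns alphabetically, so `Gender_F` comes before `Gender_M`, unlike the table above.

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["M", "F", "M", "M", "F"]})

# One binary column per category; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(df["Gender"], prefix="Gender", dtype=int)
```

The same call works for a column with more than two unordered categories, at the cost of one extra column per category.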






























          • Thanks for the answer. I understood the first part, but I failed to understand this point: "an absolute must for categorical data with more than 2 options, where there's no transitive A > B > C relationship and you don't want to smear your results by forcing one into your encoding scheme". Can you clarify it for me?

            – Mohamed Thasin ah
            Nov 26 '18 at 10:31






          • You've outlined two types of categorical variable in your example. {S,M,L} is transitive, since you can assign the values an order. {M,F} isn't transitive, but it is binary, so coding it to {0,1} is probably OK. Another type of categorical variable has more than 2 options that are not orderable (i.e. not transitive). For example, {Vanilla, Strawberry, Grape} might be the options in a categorical variable, but if they are transformed to {0,1,2} a skew is introduced, suggesting that Strawberry lies "between" Vanilla and Grape when no such relationship exists outside of the arbitrary coding.

            – Thomas Kimber
            Nov 26 '18 at 10:44











          • Great explanation, it's really appreciated. For {Vanilla, Strawberry, Grape}, one-hot encoding solves the problem, right?

            – Mohamed Thasin ah
            Nov 26 '18 at 10:52






          • Yes, exactly. There's nothing stopping you from using it for {M,F} too. One-hot encoding has the disadvantage of expanding the dimensionality of your data, which can be a problem when the number of categories in a field gets large, but bearing that in mind, it's usually a good general-use option.

            – Thomas Kimber
            Nov 26 '18 at 11:02


























          answered Nov 26 '18 at 10:23









          Thomas Kimber













          • Thanks for the answer, I understood the first part, But I failed to understand an absolute must for categorical data with more than 2 options, where there's no transitive A > B > C relationship and you don't want to smear your results by forcing one into your encoding scheme. this point. can you make me clear.

            – Mohamed Thasin ah
            Nov 26 '18 at 10:31






          • 1





            So you've outlined two types of categorical variable in your example, {S,M,L} is transitive, since you can assign them an order. {M,F} isn't transitive, but it is binary, so coding to {0,1} is probably ok. Another type of categorical variable is one where there are more than 2 options, but which are not orderable (i.e. not transitive) - so {Vanilla, Strawberry, Grape} might be the options in a categorical variable, but if transformed to {0,1,2} a skew is introduced, suggesting that Strawberry exists "between" Vanilla and Grape when no such relationship exists outside of the arbitrary coding.

            – Thomas Kimber
            Nov 26 '18 at 10:44











          • Great explanation, It's really appreciated. For this {Vanilla, Strawberry, Grape} one hot encoding solve the problem right?

            – Mohamed Thasin ah
            Nov 26 '18 at 10:52






          • 1





            Yes, exactly. There's nothing stopping you from using it for {M,F} too. One-hot encoding has the disadvantage of expanding the dimensionality of your data, which can be a problem when the number of categories in a field gets large, but bearing that in mind, it's usually a good general-use option.

            – Thomas Kimber
            Nov 26 '18 at 11:02



















          • Thanks for the answer, I understood the first part, But I failed to understand an absolute must for categorical data with more than 2 options, where there's no transitive A > B > C relationship and you don't want to smear your results by forcing one into your encoding scheme. this point. can you make me clear.

            – Mohamed Thasin ah
            Nov 26 '18 at 10:31






          • 1





            So you've outlined two types of categorical variable in your example, {S,M,L} is transitive, since you can assign them an order. {M,F} isn't transitive, but it is binary, so coding to {0,1} is probably ok. Another type of categorical variable is one where there are more than 2 options, but which are not orderable (i.e. not transitive) - so {Vanilla, Strawberry, Grape} might be the options in a categorical variable, but if transformed to {0,1,2} a skew is introduced, suggesting that Strawberry exists "between" Vanilla and Grape when no such relationship exists outside of the arbitrary coding.

            – Thomas Kimber
            Nov 26 '18 at 10:44











          • Great explanation, It's really appreciated. For this {Vanilla, Strawberry, Grape} one hot encoding solve the problem right?

            – Mohamed Thasin ah
            Nov 26 '18 at 10:52






          • 1





            Yes, exactly. There's nothing stopping you from using it for {M,F} too. One-hot encoding has the disadvantage of expanding the dimensionality of your data, which can be a problem when the number of categories in a field gets large, but bearing that in mind, it's usually a good general-use option.

            – Thomas Kimber
            Nov 26 '18 at 11:02

















          Thanks for the answer, I understood the first part, But I failed to understand an absolute must for categorical data with more than 2 options, where there's no transitive A > B > C relationship and you don't want to smear your results by forcing one into your encoding scheme. this point. can you make me clear.

          – Mohamed Thasin ah
          Nov 26 '18 at 10:31





          Thanks for the answer, I understood the first part, But I failed to understand an absolute must for categorical data with more than 2 options, where there's no transitive A > B > C relationship and you don't want to smear your results by forcing one into your encoding scheme. this point. can you make me clear.

          – Mohamed Thasin ah
          Nov 26 '18 at 10:31




          1




          1





          So you've outlined two types of categorical variable in your example, {S,M,L} is transitive, since you can assign them an order. {M,F} isn't transitive, but it is binary, so coding to {0,1} is probably ok. Another type of categorical variable is one where there are more than 2 options, but which are not orderable (i.e. not transitive) - so {Vanilla, Strawberry, Grape} might be the options in a categorical variable, but if transformed to {0,1,2} a skew is introduced, suggesting that Strawberry exists "between" Vanilla and Grape when no such relationship exists outside of the arbitrary coding.

          – Thomas Kimber
          Nov 26 '18 at 10:44


















          Great explanation, it's really appreciated. For {Vanilla, Strawberry, Grape}, one-hot encoding solves the problem, right?

          – Mohamed Thasin ah
          Nov 26 '18 at 10:52










          Yes, exactly. There's nothing stopping you from using it for {M,F} too. One-hot encoding has the disadvantage of expanding the dimensionality of your data, which can be a problem when the number of categories in a field gets large, but bearing that in mind, it's usually a good general-use option.

          – Thomas Kimber
          Nov 26 '18 at 11:02
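
As a rough illustration of one-hot encoding for the unordered case, here is a small sketch using `pd.get_dummies`. The frame `flavours` and its `Flavour` column are made-up example data, not something from the question.

```python
import pandas as pd

# Hypothetical example data for an unordered categorical column.
flavours = pd.DataFrame({"Flavour": ["Vanilla", "Strawberry", "Grape", "Vanilla"]})

# One indicator column per category; each row has exactly one 1.
one_hot = pd.get_dummies(flavours, columns=["Flavour"], dtype=int)

print(one_hot)
```

Each category becomes its own column, so no ordering between flavours is implied. The cost is one extra column per category, which is the dimensionality concern mentioned above; passing `drop_first=True` to `get_dummies` removes one of those columns if that matters.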






















