Handling categorical variables in sklearn with one-hot encoding



Is there an existing Python class for categorical encoding in sklearn that ticks all of the following boxes?

  1. pandas friendly - has an option to return a DataFrame

  2. can drop one column per feature in the one-hot encoding

  3. handles unseen categories in the test data

  4. is compatible with the sklearn Pipeline object









  • Such a thing does not exist natively in pandas or sklearn. However, with a little coding, you can wrap OneHotEncoder to do what you want.

    – gmds
    Mar 22 at 5:02












  • Yes, I couldn't find anything along these lines.

    – solver149
    Mar 22 at 5:04

















python pandas dataframe machine-learning scikit-learn






edited Mar 22 at 5:46







solver149

















asked Mar 22 at 3:58









solver149












1 Answer
I think you're looking for pandas.get_dummies.

See the following example.

import pandas as pd

df = pd.DataFrame({'col_a': ['cat','dog','cat','mouse','mouse','cat'], 'col_b': [10,14,16,18,20,22], 'col_c': ['a','a','a','b','b','a']})

# `drop_first=True` drops the first level of each encoded column;
# `dtype=int` keeps the 0/1 output shown below (pandas 2.x defaults to bool)
df = pd.get_dummies(df, columns=['col_a','col_c'], drop_first=True, dtype=int)
print(df)


Output:



   col_b  col_a_dog  col_a_mouse  col_c_b
0     10          0            0        0
1     14          1            0        0
2     16          0            0        0
3     18          0            1        1
4     20          0            1        1
5     22          0            0        0


This covers the first two conditions that you mentioned.

For the third condition, you can do the following.



  • create the dummies on the training data
    dummy_train = pd.get_dummies(train)

  • create the dummies on the new (unseen) data
    dummy_new = pd.get_dummies(new_data)

  • re-index the new data to the columns of the training data, filling the missing columns with 0
    dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)

Effectively, any category that appears only in the new data is dropped and never reaches the classifier, but I think that should not cause problems, as the classifier would not know what to do with it anyway.
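The three steps above can be tried end to end on a small example. The frames `train` and `new_data` below are made up for illustration, and `dtype=int` is only there so the dummies print as 0/1 on recent pandas versions.

```python
import pandas as pd

# Made-up data: "bird" appears only in the new data, "cat" and "mouse" only in training.
train = pd.DataFrame({"col_a": ["cat", "dog", "mouse"], "col_b": [10, 14, 18]})
new_data = pd.DataFrame({"col_a": ["dog", "bird"], "col_b": [12, 20]})

dummy_train = pd.get_dummies(train, columns=["col_a"], dtype=int)
dummy_new = pd.get_dummies(new_data, columns=["col_a"], dtype=int)

# Align the new data to the training columns: columns missing from the new data
# are added and filled with 0, columns unknown to the training data are dropped.
aligned_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)
print(aligned_new)
```

After the re-index, `aligned_new` has exactly the training columns, so the row containing "bird" ends up with zeros in every `col_a_*` column.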






  • Sorry. I am aware of this. Looking for something in sklearn standards that can fit into pipelines.

    – solver149
    Mar 22 at 4:44






  • @AkshayNevrekar I believe OP means a sklearn.pipeline.Pipeline object.

    – gmds
    Mar 22 at 5:01











  • Yes, you are right.

    – solver149
    Mar 22 at 5:05











  • @solver149 you should add that info in your question.

    – AkshayNevrekar
    Mar 22 at 5:14
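For the `sklearn.pipeline.Pipeline` requirement raised in these comments, one possible approach (not taken from the answer above) is to put sklearn's own `OneHotEncoder` inside a `ColumnTransformer`. The sketch below reuses the example columns from the answer; the labels `y_train`, the choice of `LogisticRegression`, and the `new_data` row are assumptions made purely for illustration. It returns arrays rather than a DataFrame, so it addresses the Pipeline and unseen-category points only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({
    "col_a": ["cat", "dog", "cat", "mouse", "mouse", "cat"],
    "col_b": [10, 14, 16, 18, 20, 22],
    "col_c": ["a", "a", "a", "b", "b", "a"],
})
y_train = [0, 1, 0, 1, 1, 0]  # made-up labels, for illustration only

# Encode the categorical columns and pass col_b through unchanged.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
preprocess = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["col_a", "col_c"]),
    ],
    remainder="passthrough",
)

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(train, y_train)

# "bird" was never seen during fit; prediction still works.
new_data = pd.DataFrame({"col_a": ["bird"], "col_b": [15], "col_c": ["a"]})
predictions = model.predict(new_data)
print(predictions)
```

Keeping the encoder inside the pipeline means the categories are learned once during `fit` and reused at prediction time, so no manual column alignment is needed.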











answered Mar 22 at 4:35









AkshayNevrekar











