Handling categorical variables in sklearn with one-hot encoding
Can someone point me to an existing Python class for a categorical encoder for sklearn
that ticks the following checkboxes?
- pandas friendly: option to return a dataframe
- should be able to drop one column in one-hot encoding
- handles unseen categories in test data
- compatible with the sklearn Pipeline object
python pandas dataframe machine-learning scikit-learn
Such a thing does not exist natively in pandas or sklearn. However, with a little coding, you can wrap OneHotEncoder to do what you want.
– gmds
Mar 22 at 5:02
Yes, I couldn't find anything along these lines.
– solver149
Mar 22 at 5:04
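A minimal sketch of the wrapper idea from the comment above, not a definitive implementation. The class name `DataFrameOneHotEncoder` and its `columns` / `drop_first` parameters are hypothetical, made up for illustration. It builds the column names itself from `categories_` and drops the first level by hand, so it only relies on `OneHotEncoder(handle_unknown="ignore")`. One caveat of this design: with `drop_first=True`, a category unseen during fit is encoded as all zeros, which is indistinguishable from the dropped first level.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder


class DataFrameOneHotEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: one-hot encode selected columns, return a DataFrame."""

    def __init__(self, columns, drop_first=True):
        self.columns = columns
        self.drop_first = drop_first

    def fit(self, X, y=None):
        # Categories not seen here are encoded as all zeros at transform time.
        self.encoder_ = OneHotEncoder(handle_unknown="ignore")
        self.encoder_.fit(X[self.columns])
        return self

    def transform(self, X):
        encoded = self.encoder_.transform(X[self.columns])
        if hasattr(encoded, "toarray"):  # the encoder returns a sparse matrix by default
            encoded = encoded.toarray()
        # Name the columns "<column>_<category>", like pandas.get_dummies does.
        names = [
            f"{col}_{cat}"
            for col, cats in zip(self.columns, self.encoder_.categories_)
            for cat in cats
        ]
        dummies = pd.DataFrame(encoded, columns=names, index=X.index).astype(int)
        if self.drop_first:
            first = [
                f"{col}_{cats[0]}"
                for col, cats in zip(self.columns, self.encoder_.categories_)
            ]
            dummies = dummies.drop(columns=first)
        # Keep the untouched columns and append the encoded ones.
        return pd.concat([X.drop(columns=self.columns), dummies], axis=1)


# Made-up data; "bird" never appears in the training frame.
train = pd.DataFrame({"col_a": ["cat", "dog", "mouse"], "col_b": [10, 14, 16]})
new = pd.DataFrame({"col_a": ["dog", "bird"], "col_b": [18, 20]})

encoder = DataFrameOneHotEncoder(columns=["col_a"]).fit(train)
encoded_new = encoder.transform(new)
print(encoded_new)
```

Because the class implements `fit` and `transform` and stores its constructor arguments unchanged, it should be usable as a step in a `sklearn.pipeline.Pipeline`.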
asked Mar 22 at 3:58 by solver149, edited Mar 22 at 5:46
1 Answer
I think you're looking for pandas.get_dummies. See the following example.
import pandas as pd

df = pd.DataFrame({'col_a': ['cat','dog','cat','mouse','mouse','cat'], 'col_b': [10,14,16,18,20,22], 'col_c': ['a','a','a','b','b','a']})
# `drop_first=True` drops the first level of each encoded column;
# `dtype=int` gives 0/1 output on pandas versions where the default dtype is bool
df = pd.get_dummies(df, columns=['col_a','col_c'], drop_first=True, dtype=int)
print(df)
Output:
   col_b  col_a_dog  col_a_mouse  col_c_b
0     10          0            0        0
1     14          1            0        0
2     16          0            0        0
3     18          0            1        1
4     20          0            1        1
5     22          0            0        0
This covers the first two conditions you mentioned.
For the third condition, you can do the following.
- Create the dummies on the training data:
dummy_train = pd.get_dummies(train)
- Create the dummies on the new (unseen) data:
dummy_new = pd.get_dummies(new_data)
- Re-index the new data to the columns of the training data, filling the missing columns with 0:
dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)
Effectively, any category that appears only in the new data is dropped and will not go into the classifier, but I think that should not cause problems, as the classifier would not know what to do with it.
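The three steps above can be sketched end to end as follows. The `train` and `new_data` frames are made up for illustration, and `dtype=int` is only there to get 0/1 columns on any pandas version.

```python
import pandas as pd

# Made-up data: "bird" appears only in the new data, "cat" and "mouse" only in training.
train = pd.DataFrame({"col_a": ["cat", "dog", "mouse"], "col_b": [10, 14, 16]})
new_data = pd.DataFrame({"col_a": ["dog", "bird"], "col_b": [18, 20]})

dummy_train = pd.get_dummies(train, columns=["col_a"], dtype=int)
dummy_new = pd.get_dummies(new_data, columns=["col_a"], dtype=int)

# Align the new data to the training columns:
# - columns missing from the new data (col_a_cat, col_a_mouse) are added and filled with 0
# - columns unknown to the training data (col_a_bird) are dropped
dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)
print(dummy_new)
```

Note that `reindex` returns a new DataFrame, so the result has to be assigned back.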
Sorry, I am aware of this. I'm looking for something in line with sklearn standards that can fit into pipelines.
– solver149
Mar 22 at 4:44
@AkshayNevrekar I believe OP means a sklearn.pipeline.Pipeline object.
– gmds
Mar 22 at 5:01
Yes, you are right.
– solver149
Mar 22 at 5:05
@solver149 you should add that info to your question.
– AkshayNevrekar
Mar 22 at 5:14
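For the Pipeline requirement raised in these comments, one possible sketch (not taken from the answer itself) is to put scikit-learn's own `OneHotEncoder` inside a `ColumnTransformer` and use that as the first step of a `sklearn.pipeline.Pipeline`. The data, the labels in `y_train`, and the choice of `DecisionTreeClassifier` are assumptions for illustration only. This variant handles unseen categories through `handle_unknown="ignore"`, but it returns arrays rather than a DataFrame and does not drop a column.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Made-up training data and labels.
train = pd.DataFrame({
    "col_a": ["cat", "dog", "cat", "mouse", "mouse", "cat"],
    "col_b": [10, 14, 16, 18, 20, 22],
})
y_train = [0, 1, 0, 1, 1, 0]

# One-hot encode col_a; pass the remaining columns through unchanged.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["col_a"])],
    remainder="passthrough",
)
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", DecisionTreeClassifier(random_state=0)),
])
pipe.fit(train, y_train)

# "bird" was not seen during fit; it is encoded as all zeros instead of raising.
new_data = pd.DataFrame({"col_a": ["dog", "bird"], "col_b": [15, 30]})
transformed = pipe.named_steps["preprocess"].transform(new_data)
predictions = pipe.predict(new_data)
print(transformed.shape, predictions)
```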
answered Mar 22 at 4:35 by AkshayNevrekar