pandas get mapping of categories to integer value
I can transform categorical columns to their category codes, but how do I get an accurate picture of the mapping? Example:

    df_labels = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': list('abcab')})
    df_labels['col2'] = df_labels['col2'].astype('category')

df_labels looks like this:

       col1 col2
    0     1    a
    1     2    b
    2     3    c
    3     4    a
    4     5    b

How do I get an accurate mapping of the cat codes to the cat categories?
The Stack Overflow answer below says to enumerate the categories. However, I'm not sure whether enumerating is how cat.codes generated the integer values. Is there a more accurate way?
Get mapping of categorical variables in pandas

    >>> dict(enumerate(df.five.cat.categories))
    {0: 'bad', 1: 'good'}

What is a good way to get the mapping in the above format, but accurate?
python pandas
edited Apr 7 at 13:52 by JohnE
asked Feb 13 '17 at 23:27 by jxn
FYI, I have since updated my answer (which you linked to) and added some explanation/verification. I believe it is accurate although I'm happy to improve it if you can elaborate about what you think is inaccurate about it.
– JohnE
Aug 25 '17 at 18:09
4 Answers
Edited answer (removed cat.categories and changed list to dict):

    >>> dict(zip(df_labels.col2.cat.codes, df_labels.col2))
    {0: 'a', 1: 'b', 2: 'c'}

The original answer, which some of the comments refer to:

    >>> list(zip(df_labels.col2.cat.codes, df_labels.col2.cat.categories))
    [(0, 'a'), (1, 'b'), (2, 'c')]

As the comments note, the original answer works in this example only because the first three values happened to be [a,b,c]; it would fail if they were instead [c,b,a] or [b,c,a].
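The failure mode described in the comments can be reproduced with a small sketch. It assumes the same three letters as the question but in a different row order; the variable names `s`, `by_row` and `by_position` are made up for illustration.

```python
import pandas as pd

# Same categories as in the question, but the rows start with c, b, a.
s = pd.Series(list("cbaab")).astype("category")

# Pair each row's code with that row's own value: correct for any row order.
by_row = {int(code): value for code, value in zip(s.cat.codes, s)}

# Pair row codes with the sorted category index: only right by coincidence.
by_position = {int(code): cat for code, cat in zip(s.cat.codes, s.cat.categories)}

print(by_row)       # {2: 'c', 1: 'b', 0: 'a'}
print(by_position)  # {2: 'a', 1: 'b', 0: 'c'}  (0 and 2 are swapped)
```

Here the categories are still a, b, c (they are sorted when the column is converted), so code 0 means 'a'; only the row-wise pairing reflects that.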
edited Apr 8 at 13:28 by JohnE
answered Feb 13 '17 at 23:44 by Boud
Yes thanks! I needed to put set in front, as I just want the unique mappings: set(list(zip(df_labels.col2.cat.codes, df_labels.col2.cat.categories)))
– jxn
Feb 15 '17 at 0:11
I think this answer only works because of the way col2 is ordered. len(cat.categories) is 3 while len(cat.codes) is 5.
– pomber
Jul 9 '17 at 1:22
This is an incorrect answer, because ser.cat.categories will return all the unique values in the category but not the corresponding label of the items in the series.
– Woods Chen
Jan 14 at 9:47
@JohnE feel free to edit. I cannot delete my answer because it is the accepted one.
– Boud
Apr 7 at 5:12
Thanks, @boud, I edited it (while preserving the original with a note). Please add additional edits as you see fit.
– JohnE
Apr 7 at 14:02
I use:

    dict([(category, code) for code, category in enumerate(df_labels.col2.cat.categories)])
    # {'a': 0, 'b': 1, 'c': 2}
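One way to check that such a dictionary really matches what cat.codes produced is to rebuild the codes from the raw values and compare. This is only a sketch on the question's sample data; `category_to_code`, `rebuilt` and `matches` are illustrative names.

```python
import pandas as pd

df_labels = pd.DataFrame({"col1": [1, 2, 3, 4, 5], "col2": list("abcab")})
df_labels["col2"] = df_labels["col2"].astype("category")
col = df_labels["col2"]

# Category -> code, built from the position of each category in the index.
category_to_code = {category: code for code, category in enumerate(col.cat.categories)}

# Translate every row through the dictionary and compare with the real codes.
rebuilt = [category_to_code[value] for value in col]
matches = rebuilt == col.cat.codes.tolist()

print(category_to_code)  # {'a': 0, 'b': 1, 'c': 2}
print(matches)           # True
```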
edited Mar 23 at 15:16 by JohnE
answered Jul 9 '17 at 1:23 by pomber
Note that this is roughly equivalent to the answer rejected by the OP: dict(enumerate(df.five.cat.categories)), except that it switches keys and values from e.g. 0:'a' to 'a':0, which is a minor difference, as both keys and values here are unique, so the key/value order is in some sense irrelevant and it's also easy enough to reverse. (I think the answer (mine!) rejected by the OP is actually correct, so I think this one is correct too!)
– JohnE
Mar 23 at 15:24
If you want to convert each column/data series from categorical back to the original, you just need to reverse what you did in the for loop over the dataframe. There are two ways to do that:
1. To get back to the original Series or numpy array, use Series.astype(original_dtype) or np.asarray(categorical).
2. If you already have codes and categories, you can use the from_codes() constructor to save the factorize step during normal constructor mode.
See pandas: Categorical Data
Usage of from_codes
According to the official documentation, it makes a Categorical type from codes and categories arrays.

    import numpy as np
    import pandas as pd

    splitter = np.random.choice([0, 1], 5, p=[0.5, 0.5])
    s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
    print(splitter)
    print(s)

gives, for one random draw:

    [0 1 1 0 0]
    0    train
    1     test
    2     test
    3    train
    4    train
    dtype: category
    Categories (2, object): [train, test]

For your codes:

    # after your previous conversion, df['col2'] holds the integer codes
    print(df['col2'])
    # apply from_codes; the 2nd argument is the list of categories from the mapping dict
    s = pd.Series(pd.Categorical.from_codes(df['col2'], list('abcde')))
    print(s)

gives

    0    0
    1    1
    2    2
    3    0
    4    1
    Name: col2, dtype: int8
    0    a
    1    b
    2    c
    3    a
    4    b
    dtype: category
    Categories (5, object): [a, b, c, d, e]
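As a rough check of the round trip described above, the sketch below takes the codes and categories from a categorical column and feeds them back into from_codes(). It assumes the question's sample values; `original`, `restored` and `round_trip_ok` are illustrative names.

```python
import pandas as pd

original = pd.Series(list("abcab")).astype("category")

codes = original.cat.codes.to_numpy()   # integer codes, one per row
categories = original.cat.categories    # the unique labels; code i refers to categories[i]

# Rebuild the categorical column from codes + categories alone.
restored = pd.Series(pd.Categorical.from_codes(codes, categories=categories))
round_trip_ok = restored.tolist() == original.tolist()

print(restored.tolist())  # ['a', 'b', 'c', 'a', 'b']
print(round_trip_ok)      # True
```

Because from_codes() interprets each code as a position in the categories, a successful round trip is consistent with the positional mapping that enumerating the categories gives.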
edited Feb 14 '17 at 0:11
answered Feb 13 '17 at 23:41 by Neo X
There is not much documentation about using from_codes(). Can you show me how I can apply it?
– jxn
Feb 13 '17 at 23:48
As updated, hope it helps.
– Neo X
Feb 14 '17 at 0:11
I see, I just want the unique mapping values though, not the full mapping. For example 0 : 'a', 1 : 'b', 2 : 'c'
– jxn
Feb 14 '17 at 0:35
Then you can easily construct the map by yourself using codes and categories. Yet you cannot maintain the order by a Python dictionary, use two lists or a list of tuples in @Boud answer instead.
– Neo X
Feb 14 '17 at 0:41
OP asks for something "accurate" relative to the answer in the linked question:
dict(enumerate(df_labels.col2.cat.categories))
# {0: 'a', 1: 'b', 2: 'c'}
I believe that the above answer is indeed accurate (full disclosure: it is my answer in the other question that I'm defending). Note also that it is roughly equivalent to @pomber's answer, except that the ordering of the keys and values is reversed. (Since both keys and values are unique, the ordering is in some sense irrelevant, and easy enough to reverse as a consequence).
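To illustrate the two orderings mentioned above, here is a minimal sketch. The small df_labels frame is a hypothetical stand-in for the question's data, and the names code_to_cat and cat_to_code are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for df_labels: col2 is a categorical column
df_labels = pd.DataFrame({"col2": pd.Categorical(["a", "b", "c", "a", "b"])})

# code -> category, as in the one-liner above
code_to_cat = dict(enumerate(df_labels.col2.cat.categories))

# category -> code, i.e. the reversed ordering of keys and values
cat_to_code = {cat: code for code, cat in code_to_cat.items()}

print(code_to_cat)
print(cat_to_code)
```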
However, the following way is arguably safer, or at least more transparent as to how it works:
dict(zip(df_labels.col2.cat.codes, df_labels.col2))
# {0: 'a', 1: 'b', 2: 'c'}
This is similar in spirit to @boud's answer, but corrects an error by replacing df_labels.col2.cat.categories with df_labels.col2. It also replaces list() with dict(), which seems more appropriate for a mapping and automatically gets rid of duplicates.
Note that the length of both arguments to zip() is len(df), whereas the length of df_labels.col2.cat.categories is the count of unique values, which will generally be much smaller than len(df).
Also note that this method is quite inefficient, as it maps 0 to 'a' twice, and similarly for 'b'. In large dataframes the difference in speed could be substantial. It won't cause any error, because dict() discards redundancies like this; it is just much less efficient than the other method.
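The length and redundancy points can be checked with a small sketch. It assumes a hypothetical five-row df_labels with no missing values; the names n_rows, n_categories, from_rows and from_categories are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for df_labels, with no missing values
df_labels = pd.DataFrame({"col2": pd.Categorical(["a", "b", "c", "a", "b"])})
col = df_labels.col2

# One code per row, but only one category per unique value
n_rows = len(col.cat.codes)
n_categories = len(col.cat.categories)

# Row-wise approach: visits every row, dict() drops the repeated pairs
from_rows = dict(zip(col.cat.codes.tolist(), col))

# Category-wise approach: visits each unique category once
from_categories = dict(enumerate(col.cat.categories))

print(n_rows, n_categories)
print(from_rows == from_categories)
```

On this toy data both approaches produce the same dictionary; the row-wise one simply does redundant work in proportion to the number of rows.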
edited Mar 23 at 15:32
answered Mar 22 at 16:51
JohnE
FYI, I have since updated my answer (which you linked to) and added some explanation/verification. I believe it is accurate although I'm happy to improve it if you can elaborate about what you think is inaccurate about it.
– JohnE
Aug 25 '17 at 18:09