


What is a good heuristic to detect if a column in a pandas.DataFrame is categorical?


I've been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data differently. In particular, I want to be able to apply, e.g., a OneHotEncoder to only the categorical data.



Now, let's assume that we're provided a pandas.DataFrame and have no other information about the data in the DataFrame. What is a good heuristic to use to determine whether a column in the pandas.DataFrame is categorical?



My initial thoughts are:



1) If there are strings in the column (e.g., the column data type is object), then the column very likely contains categorical data



2) If some percentage of the values in the column is unique (e.g., >=20%), then the column very likely contains continuous data



I've found 1) to work fine, but 2) hasn't panned out very well. I need better heuristics. How would you solve this problem?



Edit: Someone requested that I explain why 2) didn't work well. There were some test cases where a column still contained continuous values but had only a few unique values, and the heuristic in 2) obviously failed there. There were also cases where a categorical column had many, many unique values, e.g., passenger names in the Titanic data set. That produced the same column-type misclassification problem, just in the other direction.
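For concreteness, a minimal sketch of the two heuristics as described might look like the following (the function name guess_categorical and the 20% threshold are purely illustrative, not an established recipe):

    import pandas as pd

    def guess_categorical(df: pd.DataFrame, unique_ratio_threshold: float = 0.2) -> dict:
        """Map each column name to True if it looks categorical under the two heuristics above."""
        guesses = {}
        for col in df.columns:
            if pd.api.types.is_object_dtype(df[col]):
                # heuristic 1: string/object columns are very likely categorical
                guesses[col] = True
            else:
                # heuristic 2: a high ratio of unique values suggests continuous data
                guesses[col] = df[col].nunique() / df[col].count() < unique_ratio_threshold
        return guesses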







Tags: python, pandas, scikit-learn






asked Mar 6 '16 at 12:38 by Randy Olson, last edited Mar 6 '16 at 13:45 (question score: 18)












  • I believe this question is nearly completely undefined. What is the distribution over all the datasets in the world? Your rule 1 fails miserably for the postal service or phone book, for example. – Ami Tavory, Mar 6 '16 at 13:45

  • Try Benford's law to discern numerical data from categorical data. – Artem Sobolev, Mar 6 '16 at 14:47

  • @Barmaley.exe Can you elaborate on that idea please? – Randy Olson, Mar 13 '16 at 3:43

  • @RandyOlson, well, I'm not sure if it'd work, but the idea is that "natural" numbers tend to obey Benford's law, while categorical values (ids) don't have to: indeed, you can permute ids arbitrarily and nothing would change. So you can try to derive some kind of test from that law. – Artem Sobolev, Mar 14 '16 at 8:12

  • Do you have any improvements on this? – ayhan, Jun 4 '17 at 12:05
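The Benford's-law idea from the comments above can be turned into a rough test by comparing a column's leading-digit distribution against the Benford frequencies with a chi-square goodness-of-fit test. This is only a sketch: the helper name benford_pvalue is made up, scipy is assumed to be available, and a comment on one of the answers further down reports that such a discriminator did not work on at least one real dataset.

    import numpy as np
    import pandas as pd
    from scipy.stats import chisquare

    def benford_pvalue(series: pd.Series) -> float:
        """Chi-square p-value of the leading-digit distribution against Benford's law.
        A very low p-value means the column does not look like 'naturally occurring' numbers."""
        values = pd.to_numeric(series, errors="coerce").abs()
        values = values[values > 0].dropna()
        if len(values) < 50:  # too little data for a meaningful test
            return float("nan")
        # leading digit = first digit of the mantissa, clipped to 1..9 for safety
        leading = (values / 10 ** np.floor(np.log10(values))).astype(int).clip(1, 9)
        observed = leading.value_counts().reindex(range(1, 10), fill_value=0)
        expected = np.log10(1 + 1 / np.arange(1, 10)) * len(values)
        return chisquare(observed, expected).pvalue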

















7 Answers
































Here are a couple of approaches:

  1. Find the ratio of the number of unique values to the total number of values. Something like the following:

     likely_cat = {}
     for var in df.columns:
         likely_cat[var] = 1.*df[var].nunique()/df[var].count() < 0.05  # or some other threshold

  2. Check whether the top n unique values account for more than a certain proportion of all values:

     top_n = 10
     likely_cat = {}
     for var in df.columns:
         likely_cat[var] = 1.*df[var].value_counts(normalize=True).head(top_n).sum() > 0.8  # or some other threshold

Approach 1) has generally worked better for me than Approach 2). But Approach 2) is better if there is a 'long-tailed distribution', where a small number of categories occur with high frequency while a large number of categories occur with low frequency.






answered Mar 6 '16 at 13:50 by Rishabh Srivastava, edited Dec 13 '18 at 12:36 (score: 19)
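As a quick illustration of the unique-value-ratio idea from this answer (the toy DataFrame below is made up):

    import pandas as pd

    df = pd.DataFrame({
        "sex": ["m", "f", "f", "m"] * 25,   # 2 unique values out of 100 rows
        "income": range(100),               # 100 unique values out of 100 rows
    })

    likely_cat = {var: 1.*df[var].nunique()/df[var].count() < 0.05 for var in df.columns}
    print(likely_cat)   # {'sex': True, 'income': False}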

























  • May I kindly check if approach 2 is missing a summation operation? When I tested it on my code, it seemed to return a series of booleans, each representing whether that particular unique value has relative frequency > threshold. Was the intention to sum the total relative frequencies for the top_n rows? (1.*dff['test'].value_counts(normalize=True).head(3)).sum() > 0.8 – AiRiFiEd, Dec 13 '18 at 5:28

  • @AiRiFiEd: Yes - it was missing a summation operation. Thanks very much for pointing that out. Have updated the answer. – Rishabh Srivastava, Dec 13 '18 at 12:38

  • Thanks for updating the answer despite this being a very old post! May I kindly check, from your experience, what would be a reasonable heuristic to use as the threshold for approach 2? For example, I am thinking of assigning top_n as x percent of the total number of unique values (thereby resulting in something along the lines of "20% of unique values account for 80% of all values"): top_n = round(0.8 * (1.*dff[var].value_counts(normalize=True).head(3)).shape[0]) – AiRiFiEd, Dec 14 '18 at 14:31

































There are many places where you could "steal" the definitions of formats that can be cast as numbers; ##,#e-# would be one such format, just to illustrate. Maybe you'll be able to find a library to do so.
I try to cast everything to numbers first, and whatever is left over has no other option but to be kept as categorical.






answered Mar 6 '16 at 14:01 by Diego (score: 2)
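A minimal sketch of this cast-everything-first idea, using pandas' built-in pd.to_numeric rather than a dedicated format library (the helper name split_numeric_categorical is illustrative):

    import pandas as pd

    def split_numeric_categorical(df: pd.DataFrame):
        """Columns where every non-null value survives pd.to_numeric are treated as numeric;
        whatever is left over is kept as categorical."""
        numeric_cols, categorical_cols = [], []
        for col in df.columns:
            coerced = pd.to_numeric(df[col], errors="coerce")
            if coerced.notna().sum() == df[col].notna().sum():
                numeric_cols.append(col)
            else:
                categorical_cols.append(col)
        return df[numeric_cols], df[categorical_cols]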























  • I like this idea. Does anyone know of such a library? – Randy Olson, Mar 6 '16 at 14:04

  • If you like the idea consider upvoting the answer so it will be more visible to others and they might suggest the library. – Diego, Mar 11 '16 at 19:52
































I think the real question here is whether you'd like to bother the user once in a while or silently fail once in a while.



If you don't mind bothering the user, maybe detecting ambiguity and raising an error is the way to go.



If you don't mind failing silently, then your heuristics are ok. I don't think you'll find anything that's significantly better. I guess you could make this into a learning problem if you really want to. Download a bunch of datasets, assume they are collectively a decent representation of all data sets in the world, and train based on features over each data set / column to predict categorical vs. continuous.



But of course in the end nothing can be perfect. E.g. is the column [1, 8, 22, 8, 9, 8] referring to hours of the day or to dog breeds?






answered Mar 6 '16 at 14:31 by rd11 (score: 1)












































I've been thinking about a similar problem, and the more I consider it, the more it seems that this is itself a classification problem that could benefit from training a model.

I bet if you examined a bunch of datasets and extracted these features for each column / pandas.Series:

  • % floats: percentage of values that are floats
  • % int: percentage of values that are whole numbers
  • % string: percentage of values that are strings
  • % unique string: number of unique string values / total number
  • % unique integers: number of unique integer values / total number
  • mean numerical value (non-numerical values considered 0 for this)
  • std deviation of numerical values

and trained a model, it could get pretty good at inferring column types, where the possible output values are: categorical, ordinal, quantitative.

Side note: as far as a Series with a limited number of numerical values goes, it seems like the interesting problem is determining categorical vs ordinal; it doesn't hurt to treat a variable as ordinal if it turns out to be quantitative, right? The preprocessing steps would encode the ordinal values numerically anyway without one-hot encoding.

A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? E.g. in the forest-cover-type-prediction Kaggle contest, you would automatically know that soil type is a single categorical variable.






answered Jun 29 '16 at 19:53 by Karl Rosaen (score: 1)
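A sketch of what the per-column feature extraction described in this answer could look like (the function name and exact feature definitions are illustrative; a real meta-model would still need labelled example columns to train on):

    import pandas as pd

    def column_features(s: pd.Series) -> dict:
        """Per-column features along the lines listed above (illustrative only)."""
        s = s.dropna()
        n = max(len(s), 1)
        numeric = pd.to_numeric(s, errors="coerce")
        is_float = numeric.notna() & (numeric % 1 != 0)
        is_int = numeric.notna() & (numeric % 1 == 0)
        is_string = s.apply(lambda v: isinstance(v, str))
        return {
            "pct_float": is_float.sum() / n,
            "pct_int": is_int.sum() / n,
            "pct_string": is_string.sum() / n,
            "pct_unique_string": s[is_string].nunique() / n,
            "pct_unique_int": numeric[is_int].nunique() / n,
            "mean_numeric": numeric.fillna(0).mean(),   # non-numeric values counted as 0
            "std_numeric": numeric.fillna(0).std(),
        }

Applying column_features to every column of a collection of labelled datasets would then give the training matrix for such a meta-classifier.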























  • > A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? || This would require columnar metadata. See github.com/pandas-dev/pandas/issues/3402 – Wes Turner, Dec 14 '16 at 9:27

































IMO the opposite strategy, identifying categoricals, is better, because it depends on what the data is about. Technically, address data can be thought of as unordered categorical data, but usually I wouldn't use it that way.

For survey data, an idea would be to look for Likert scales, e.g. 5-8 values, either strings (which would probably need hardcoded (and translated) levels to look for, such as "good", "bad", ".agree.", "very .*", ...) or int values in the 0-8 range + NA.

Countries and such things might also be identifiable...

Age groups (".-.") might also work.






community wiki answer by Jan Schulz (2 revisions), last edited Jun 3 '17 at 8:31 (score: 1)
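A rough sketch of the Likert-scale check hinted at above (the word list and the 5-8 level range are illustrative, and real survey data would need translated and hardcoded level names, as the answer notes):

    import pandas as pd

    LIKERT_WORDS = ("agree", "good", "bad", "very", "neutral")  # illustrative, would need translation

    def looks_like_likert(s: pd.Series) -> bool:
        """Heuristic check: 5-8 levels that are either small integers (0-8) or Likert-style wording."""
        levels = pd.Series(s.dropna().unique())
        if not 5 <= len(levels) <= 8:
            return False
        numeric = pd.to_numeric(levels, errors="coerce")
        if numeric.notna().all():
            # numeric levels: expect whole numbers in the 0-8 range
            return bool(((numeric % 1 == 0) & numeric.between(0, 8)).all())
        # string levels: expect typical Likert wording
        return all(isinstance(v, str) and any(w in v.lower() for w in LIKERT_WORDS) for v in levels)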














































You could define which datatypes count as numerics and then exclude the corresponding variables.

If the initial dataframe is df:

    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    dataframe = df.select_dtypes(exclude=numerics)





answered Mar 25 at 10:24 by VicKat, edited Mar 25 at 12:15 (score: 1)

























  • Feels like the above is a great strategy. This is how I landed up implementing it: def is_numeric(input_frame:pd.core.frame.DataFrame, clmn_names:Optional[list]=None): numerics_types = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64'] return [True if input_frame[clmn_names].dtypes.name in numerics_types else False] – Pramit, May 16 at 23:21

































I've been looking at this and thought it may be useful to share what I have. This builds on @Rishabh Srivastava's answer.

    import pandas as pd

    def remove_cat_features(X, method='fraction_unique', cat_cols=None, min_fraction_unique=0.05):
        """Removes categorical features using a given method.
        X: pd.DataFrame, dataframe to remove categorical features from."""

        if method == 'fraction_unique':
            # keep columns whose fraction of unique values is above the threshold
            unique_fraction = X.apply(lambda col: len(pd.unique(col)) / len(col))
            reduced_X = X.loc[:, unique_fraction > min_fraction_unique]

        if method == 'named_columns':
            # keep columns that are not in the user-supplied list of categorical columns
            non_cat_cols = [col not in cat_cols for col in X.columns]
            reduced_X = X.loc[:, non_cat_cols]

        return reduced_X

You can then call this function, giving a pandas df as X, and either remove explicitly named categorical columns or choose to remove columns with a low fraction of unique values (controlled by min_fraction_unique).
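For example (hypothetical usage, not from the original answer; the toy data is made up):

    import pandas as pd

    df = pd.DataFrame({
        "age": [23, 35, 46, 23, 52, 35, 41, 29, 60, 33],
        "sex": ["m", "f", "f", "m", "m", "f", "m", "f", "m", "f"],
    })

    # drop the explicitly named categorical column
    print(remove_cat_features(df, method="named_columns", cat_cols=["sex"]).columns.tolist())
    # ['age']

    # or drop columns whose fraction of unique values is below the threshold
    print(remove_cat_features(df, method="fraction_unique", min_fraction_unique=0.5).columns.tolist())
    # ['age']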





























  • I should add: I also tried a Benford's law discriminator for my dataset (physical properties of materials) and it was not successful. – FChm, Apr 29 at 10:22













      Your Answer






      StackExchange.ifUsing("editor", function ()
      StackExchange.using("externalEditor", function ()
      StackExchange.using("snippets", function ()
      StackExchange.snippets.init();
      );
      );
      , "code-snippets");

      StackExchange.ready(function()
      var channelOptions =
      tags: "".split(" "),
      id: "1"
      ;
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function()
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled)
      StackExchange.using("snippets", function()
      createEditor();
      );

      else
      createEditor();

      );

      function createEditor()
      StackExchange.prepareEditor(
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: true,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: 10,
      bindNavPrevention: true,
      postfix: "",
      imageUploader:
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      ,
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      );



      );













      draft saved

      draft discarded


















      StackExchange.ready(
      function ()
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f35826912%2fwhat-is-a-good-heuristic-to-detect-if-a-column-in-a-pandas-dataframe-is-categori%23new-answer', 'question_page');

      );

      Post as a guest















      Required, but never shown

























      7 Answers
      7






      active

      oldest

      votes








      7 Answers
      7






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      19














      Here are a couple of approaches:




      1. Find the ratio of number of unique values to the total number of unique values. Something like the following




        likely_cat = 
        for var in df.columns:
        likely_cat[var] = 1.*df[var].nunique()/df[var].count() < 0.05 #or some other threshold




      2. Check if the top n unique values account for more than a certain proportion of all values




        top_n = 10 
        likely_cat =
        for var in df.columns:
        likely_cat[var] = 1.*df[var].value_counts(normalize=True).head(top_n).sum() > 0.8 #or some other threshold



      Approach 1) has generally worked better for me than Approach 2). But approach 2) is better if there is a 'long-tailed distribution', where a small number of categorical variables have high frequency while a large number of categorical variables have low frequency.






      share|improve this answer

























      • May I kindly check if approach 2 is missing a summation operation? when I tested it on my code, it seems that it will return a series of booleans, with each representing if that particular unique value has relative frequency > threshold. Was the intention to sum the total relative frequencies for top_n rows? (1.*dff['test'].value_counts(normalize=True).head(3)).sum() > 0.8

        – AiRiFiEd
        Dec 13 '18 at 5:28












      • @AiRiFiEd: Yes - it was missing a summation operation. Thanks very much for pointing that out. Have updated the answer.

        – Rishabh Srivastava
        Dec 13 '18 at 12:38











      • thanks for updating the answer despite this being a very old post! May i kindly check, from your experience, what would be a reasonable heuristic to use as threshold for approach 2? For example, i am thinking of assigning top_n as x percent of total number of unique values (thereby resulting in something along of the lines of "20% of unique values account for 80% of all values" - top_n = round(0.8 * (1.*dff[var].value_counts(normalize=True).head(3)).shape[0])

        – AiRiFiEd
        Dec 14 '18 at 14:31
















      19














      Here are a couple of approaches:




      1. Find the ratio of number of unique values to the total number of unique values. Something like the following




        likely_cat = 
        for var in df.columns:
        likely_cat[var] = 1.*df[var].nunique()/df[var].count() < 0.05 #or some other threshold




      2. Check if the top n unique values account for more than a certain proportion of all values




        top_n = 10 
        likely_cat =
        for var in df.columns:
        likely_cat[var] = 1.*df[var].value_counts(normalize=True).head(top_n).sum() > 0.8 #or some other threshold



      Approach 1) has generally worked better for me than Approach 2). But approach 2) is better if there is a 'long-tailed distribution', where a small number of categorical variables have high frequency while a large number of categorical variables have low frequency.






      share|improve this answer

























      • May I kindly check if approach 2 is missing a summation operation? when I tested it on my code, it seems that it will return a series of booleans, with each representing if that particular unique value has relative frequency > threshold. Was the intention to sum the total relative frequencies for top_n rows? (1.*dff['test'].value_counts(normalize=True).head(3)).sum() > 0.8

        – AiRiFiEd
        Dec 13 '18 at 5:28












      • @AiRiFiEd: Yes - it was missing a summation operation. Thanks very much for pointing that out. Have updated the answer.

        – Rishabh Srivastava
        Dec 13 '18 at 12:38











      • thanks for updating the answer despite this being a very old post! May i kindly check, from your experience, what would be a reasonable heuristic to use as threshold for approach 2? For example, i am thinking of assigning top_n as x percent of total number of unique values (thereby resulting in something along of the lines of "20% of unique values account for 80% of all values" - top_n = round(0.8 * (1.*dff[var].value_counts(normalize=True).head(3)).shape[0])

        – AiRiFiEd
        Dec 14 '18 at 14:31














      19












      19








      19







      Here are a couple of approaches:




      1. Find the ratio of number of unique values to the total number of unique values. Something like the following




        likely_cat = 
        for var in df.columns:
        likely_cat[var] = 1.*df[var].nunique()/df[var].count() < 0.05 #or some other threshold




      2. Check if the top n unique values account for more than a certain proportion of all values




        top_n = 10 
        likely_cat =
        for var in df.columns:
        likely_cat[var] = 1.*df[var].value_counts(normalize=True).head(top_n).sum() > 0.8 #or some other threshold



      Approach 1) has generally worked better for me than Approach 2). But approach 2) is better if there is a 'long-tailed distribution', where a small number of categorical variables have high frequency while a large number of categorical variables have low frequency.






      share|improve this answer















      Here are a couple of approaches:




      1. Find the ratio of number of unique values to the total number of unique values. Something like the following




        likely_cat = 
        for var in df.columns:
        likely_cat[var] = 1.*df[var].nunique()/df[var].count() < 0.05 #or some other threshold




      2. Check if the top n unique values account for more than a certain proportion of all values




        top_n = 10 
        likely_cat =
        for var in df.columns:
        likely_cat[var] = 1.*df[var].value_counts(normalize=True).head(top_n).sum() > 0.8 #or some other threshold



      Approach 1) has generally worked better for me than Approach 2). But approach 2) is better if there is a 'long-tailed distribution', where a small number of categorical variables have high frequency while a large number of categorical variables have low frequency.







      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Dec 13 '18 at 12:36

























      answered Mar 6 '16 at 13:50









      Rishabh SrivastavaRishabh Srivastava

      5492 silver badges11 bronze badges




      5492 silver badges11 bronze badges












      • May I kindly check if approach 2 is missing a summation operation? when I tested it on my code, it seems that it will return a series of booleans, with each representing if that particular unique value has relative frequency > threshold. Was the intention to sum the total relative frequencies for top_n rows? (1.*dff['test'].value_counts(normalize=True).head(3)).sum() > 0.8

        – AiRiFiEd
        Dec 13 '18 at 5:28












      • @AiRiFiEd: Yes - it was missing a summation operation. Thanks very much for pointing that out. Have updated the answer.

        – Rishabh Srivastava
        Dec 13 '18 at 12:38











      • thanks for updating the answer despite this being a very old post! May i kindly check, from your experience, what would be a reasonable heuristic to use as threshold for approach 2? For example, i am thinking of assigning top_n as x percent of total number of unique values (thereby resulting in something along of the lines of "20% of unique values account for 80% of all values" - top_n = round(0.8 * (1.*dff[var].value_counts(normalize=True).head(3)).shape[0])

        – AiRiFiEd
        Dec 14 '18 at 14:31


















      • May I kindly check if approach 2 is missing a summation operation? when I tested it on my code, it seems that it will return a series of booleans, with each representing if that particular unique value has relative frequency > threshold. Was the intention to sum the total relative frequencies for top_n rows? (1.*dff['test'].value_counts(normalize=True).head(3)).sum() > 0.8

        – AiRiFiEd
        Dec 13 '18 at 5:28












      • @AiRiFiEd: Yes - it was missing a summation operation. Thanks very much for pointing that out. Have updated the answer.

        – Rishabh Srivastava
        Dec 13 '18 at 12:38











      • thanks for updating the answer despite this being a very old post! May i kindly check, from your experience, what would be a reasonable heuristic to use as threshold for approach 2? For example, i am thinking of assigning top_n as x percent of total number of unique values (thereby resulting in something along of the lines of "20% of unique values account for 80% of all values" - top_n = round(0.8 * (1.*dff[var].value_counts(normalize=True).head(3)).shape[0])

        – AiRiFiEd
        Dec 14 '18 at 14:31

















      May I kindly check if approach 2 is missing a summation operation? when I tested it on my code, it seems that it will return a series of booleans, with each representing if that particular unique value has relative frequency > threshold. Was the intention to sum the total relative frequencies for top_n rows? (1.*dff['test'].value_counts(normalize=True).head(3)).sum() > 0.8

      – AiRiFiEd
      Dec 13 '18 at 5:28






      May I kindly check if approach 2 is missing a summation operation? when I tested it on my code, it seems that it will return a series of booleans, with each representing if that particular unique value has relative frequency > threshold. Was the intention to sum the total relative frequencies for top_n rows? (1.*dff['test'].value_counts(normalize=True).head(3)).sum() > 0.8

      – AiRiFiEd
      Dec 13 '18 at 5:28














      @AiRiFiEd: Yes - it was missing a summation operation. Thanks very much for pointing that out. Have updated the answer.

      – Rishabh Srivastava
      Dec 13 '18 at 12:38





      @AiRiFiEd: Yes - it was missing a summation operation. Thanks very much for pointing that out. Have updated the answer.

      – Rishabh Srivastava
      Dec 13 '18 at 12:38













      thanks for updating the answer despite this being a very old post! May i kindly check, from your experience, what would be a reasonable heuristic to use as threshold for approach 2? For example, i am thinking of assigning top_n as x percent of total number of unique values (thereby resulting in something along of the lines of "20% of unique values account for 80% of all values" - top_n = round(0.8 * (1.*dff[var].value_counts(normalize=True).head(3)).shape[0])

      – AiRiFiEd
      Dec 14 '18 at 14:31






      thanks for updating the answer despite this being a very old post! May i kindly check, from your experience, what would be a reasonable heuristic to use as threshold for approach 2? For example, i am thinking of assigning top_n as x percent of total number of unique values (thereby resulting in something along of the lines of "20% of unique values account for 80% of all values" - top_n = round(0.8 * (1.*dff[var].value_counts(normalize=True).head(3)).shape[0])

      – AiRiFiEd
      Dec 14 '18 at 14:31














      2














      There's are many places where you could "steal" the definitions of formats that can be cast as "number". ##,#e-# would be one of such format, just to illustrate. Maybe you'll be able to find a library to do so.
      I try to cast everything to numbers first and what is left, well, there's no other way left but to keep them as categorical.






      share|improve this answer























      • I like this idea. Does anyone know of such a library?

        – Randy Olson
        Mar 6 '16 at 14:04











      • If you like the idea consider upvoting the answer so it will be more visible to others and they might suggest the library.

        – Diego
        Mar 11 '16 at 19:52















      2














      There's are many places where you could "steal" the definitions of formats that can be cast as "number". ##,#e-# would be one of such format, just to illustrate. Maybe you'll be able to find a library to do so.
      I try to cast everything to numbers first and what is left, well, there's no other way left but to keep them as categorical.






      share|improve this answer























      • I like this idea. Does anyone know of such a library?

        – Randy Olson
        Mar 6 '16 at 14:04











      • If you like the idea consider upvoting the answer so it will be more visible to others and they might suggest the library.

        – Diego
        Mar 11 '16 at 19:52













      2












      2








      2







      There's are many places where you could "steal" the definitions of formats that can be cast as "number". ##,#e-# would be one of such format, just to illustrate. Maybe you'll be able to find a library to do so.
      I try to cast everything to numbers first and what is left, well, there's no other way left but to keep them as categorical.






      share|improve this answer













      There's are many places where you could "steal" the definitions of formats that can be cast as "number". ##,#e-# would be one of such format, just to illustrate. Maybe you'll be able to find a library to do so.
      I try to cast everything to numbers first and what is left, well, there's no other way left but to keep them as categorical.







      share|improve this answer












      share|improve this answer



      share|improve this answer










      answered Mar 6 '16 at 14:01









      DiegoDiego

      4684 silver badges13 bronze badges




      4684 silver badges13 bronze badges












      • I like this idea. Does anyone know of such a library?

        – Randy Olson
        Mar 6 '16 at 14:04











      • If you like the idea consider upvoting the answer so it will be more visible to others and they might suggest the library.

        – Diego
        Mar 11 '16 at 19:52

















      • I like this idea. Does anyone know of such a library?

        – Randy Olson
        Mar 6 '16 at 14:04











      • If you like the idea consider upvoting the answer so it will be more visible to others and they might suggest the library.

        – Diego
        Mar 11 '16 at 19:52
















      I like this idea. Does anyone know of such a library?

      – Randy Olson
      Mar 6 '16 at 14:04





      I like this idea. Does anyone know of such a library?

      – Randy Olson
      Mar 6 '16 at 14:04













      If you like the idea consider upvoting the answer so it will be more visible to others and they might suggest the library.

      – Diego
      Mar 11 '16 at 19:52





      If you like the idea consider upvoting the answer so it will be more visible to others and they might suggest the library.

      – Diego
      Mar 11 '16 at 19:52











      1














      I think the real question here is whether you'd like to bother the user once in a while or silently fail once in a while.



      If you don't mind bothering the user, maybe detecting ambiguity and raising an error is the way to go.



      If you don't mind failing silently, then your heuristics are ok. I don't think you'll find anything that's significantly better. I guess you could make this into a learning problem if you really want to. Download a bunch of datasets, assume they are collectively a decent representation of all data sets in the world, and train based on features over each data set / column to predict categorical vs. continuous.



      But of course in the end nothing can be perfect. E.g. is the column [1, 8, 22, 8, 9, 8] referring to hours of the day or to dog breeds?






      share|improve this answer



























        1














        I think the real question here is whether you'd like to bother the user once in a while or silently fail once in a while.



        If you don't mind bothering the user, maybe detecting ambiguity and raising an error is the way to go.



        If you don't mind failing silently, then your heuristics are ok. I don't think you'll find anything that's significantly better. I guess you could make this into a learning problem if you really want to. Download a bunch of datasets, assume they are collectively a decent representation of all data sets in the world, and train based on features over each data set / column to predict categorical vs. continuous.



        But of course in the end nothing can be perfect. E.g. is the column [1, 8, 22, 8, 9, 8] referring to hours of the day or to dog breeds?






        share|improve this answer

























          1












          1








          1







          I think the real question here is whether you'd like to bother the user once in a while or silently fail once in a while.



          If you don't mind bothering the user, maybe detecting ambiguity and raising an error is the way to go.



          If you don't mind failing silently, then your heuristics are ok. I don't think you'll find anything that's significantly better. I guess you could make this into a learning problem if you really want to. Download a bunch of datasets, assume they are collectively a decent representation of all data sets in the world, and train based on features over each data set / column to predict categorical vs. continuous.



          But of course in the end nothing can be perfect. E.g. is the column [1, 8, 22, 8, 9, 8] referring to hours of the day or to dog breeds?






          share|improve this answer













          I think the real question here is whether you'd like to bother the user once in a while or silently fail once in a while.



          If you don't mind bothering the user, maybe detecting ambiguity and raising an error is the way to go.



          If you don't mind failing silently, then your heuristics are ok. I don't think you'll find anything that's significantly better. I guess you could make this into a learning problem if you really want to. Download a bunch of datasets, assume they are collectively a decent representation of all data sets in the world, and train based on features over each data set / column to predict categorical vs. continuous.



          But of course in the end nothing can be perfect. E.g. is the column [1, 8, 22, 8, 9, 8] referring to hours of the day or to dog breeds?







          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Mar 6 '16 at 14:31









          rd11rd11

          1,7473 gold badges16 silver badges26 bronze badges




          1,7473 gold badges16 silver badges26 bronze badges





















              1














              I've been thinking about a similar problem and the more that I consider it, it seems that this itself is a classification problem that could benefit from training a model.



              I bet if you examined a bunch of datasets and extracted these features for each column / pandas.Series:



              • % floats: percentage of values that are float

              • % int: percentage of values that are whole numbers

              • % string: percentage of values that are strings

              • % unique string: number of unique string values / total number

              • % unique integers: number of unique integer values / total number

              • mean numerical value (non numerical values considered 0 for this)

              • std deviation of numerical values

              and trained a model, it could get pretty good at inferring column types, where the possible output values are: categorical, ordinal, quantitative.



              Side note: as far as a Series with a limited number of numerical values goes, it seems like the interesting problem would be determining categorical vs ordinal; it doesn't hurt to think a variable is ordinal if it turns out to be quantitative right? The preprocessing steps would encode the ordinal values numerically anyways without one-hot encoding.



              A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? E.g in the forest-cover-type-prediction kaggle contest, you would automatically know that soil type is a single categorical variable.






              share|improve this answer























              • > A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? || This would require columnar metadata. See github.com/pandas-dev/pandas/issues/3402

                – Wes Turner
                Dec 14 '16 at 9:27
















              1














              I've been thinking about a similar problem and the more that I consider it, it seems that this itself is a classification problem that could benefit from training a model.



              I bet if you examined a bunch of datasets and extracted these features for each column / pandas.Series:



              • % floats: percentage of values that are float

              • % int: percentage of values that are whole numbers

              • % string: percentage of values that are strings

              • % unique string: number of unique string values / total number

              • % unique integers: number of unique integer values / total number

              • mean numerical value (non numerical values considered 0 for this)

              • std deviation of numerical values

              and trained a model, it could get pretty good at inferring column types, where the possible output values are: categorical, ordinal, quantitative.



              Side note: as far as a Series with a limited number of numerical values goes, it seems like the interesting problem would be determining categorical vs ordinal; it doesn't hurt to think a variable is ordinal if it turns out to be quantitative right? The preprocessing steps would encode the ordinal values numerically anyways without one-hot encoding.



              A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? E.g in the forest-cover-type-prediction kaggle contest, you would automatically know that soil type is a single categorical variable.






              share|improve this answer























              • > A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? || This would require columnar metadata. See github.com/pandas-dev/pandas/issues/3402

                – Wes Turner
                Dec 14 '16 at 9:27














              1












              1








              1







              I've been thinking about a similar problem and the more that I consider it, it seems that this itself is a classification problem that could benefit from training a model.



              I bet if you examined a bunch of datasets and extracted these features for each column / pandas.Series:



              • % floats: percentage of values that are float

              • % int: percentage of values that are whole numbers

              • % string: percentage of values that are strings

              • % unique string: number of unique string values / total number

              • % unique integers: number of unique integer values / total number

              • mean numerical value (non numerical values considered 0 for this)

              • std deviation of numerical values

              and trained a model, it could get pretty good at inferring column types, where the possible output values are: categorical, ordinal, quantitative.



              Side note: as far as a Series with a limited number of numerical values goes, it seems like the interesting problem would be determining categorical vs ordinal; it doesn't hurt to think a variable is ordinal if it turns out to be quantitative right? The preprocessing steps would encode the ordinal values numerically anyways without one-hot encoding.



              A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? E.g in the forest-cover-type-prediction kaggle contest, you would automatically know that soil type is a single categorical variable.






              share|improve this answer













              I've been thinking about a similar problem and the more that I consider it, it seems that this itself is a classification problem that could benefit from training a model.



              I bet if you examined a bunch of datasets and extracted these features for each column / pandas.Series:



              • % floats: percentage of values that are float

              • % int: percentage of values that are whole numbers

              • % string: percentage of values that are strings

              • % unique string: number of unique string values / total number

              • % unique integers: number of unique integer values / total number

              • mean numerical value (non numerical values considered 0 for this)

              • std deviation of numerical values

              and trained a model, it could get pretty good at inferring column types, where the possible output values are: categorical, ordinal, quantitative.



              Side note: as far as a Series with a limited number of numerical values goes, it seems like the interesting problem would be determining categorical vs ordinal; it doesn't hurt to think a variable is ordinal if it turns out to be quantitative right? The preprocessing steps would encode the ordinal values numerically anyways without one-hot encoding.



              A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? E.g in the forest-cover-type-prediction kaggle contest, you would automatically know that soil type is a single categorical variable.







              share|improve this answer












              share|improve this answer



              share|improve this answer










              answered Jun 29 '16 at 19:53









              Karl RosaenKarl Rosaen

              3,2641 gold badge22 silver badges28 bronze badges




              3,2641 gold badge22 silver badges28 bronze badges












              • > A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? || This would require columnar metadata. See github.com/pandas-dev/pandas/issues/3402

                – Wes Turner
                Dec 14 '16 at 9:27


















              • > A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? || This would require columnar metadata. See github.com/pandas-dev/pandas/issues/3402

                – Wes Turner
                Dec 14 '16 at 9:27

















              > A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? || This would require columnar metadata. See github.com/pandas-dev/pandas/issues/3402

              – Wes Turner
              Dec 14 '16 at 9:27






              > A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? || This would require columnar metadata. See github.com/pandas-dev/pandas/issues/3402

              – Wes Turner
              Dec 14 '16 at 9:27












              1














              IMO the opposite strategy, identifying categoricals is better because it depends on what the data is about. Technically address data can be thought of as unordered categorical data, but usually I wouldn't use it that way.



              For survey data, an idea would be to look for Likert scales, e.g. 5-8 values, either strings (which might probably need hardcoded (and translated) levels to look for "good", "bad", ".agree.", "very .*",...) or int values in the 0-8 range + NA.



              Countries and such things might also be identifiable...



              Age groups (".-.") might also work.






              share|improve this answer





























                1














                IMO the opposite strategy, identifying categoricals is better because it depends on what the data is about. Technically address data can be thought of as unordered categorical data, but usually I wouldn't use it that way.



                For survey data, an idea would be to look for Likert scales, e.g. 5-8 values, either strings (which might probably need hardcoded (and translated) levels to look for "good", "bad", ".agree.", "very .*",...) or int values in the 0-8 range + NA.



                Countries and such things might also be identifiable...



                Age groups (".-.") might also work.






                share|improve this answer



























                  1












                  1








                  1







                  IMO the opposite strategy, identifying categoricals is better because it depends on what the data is about. Technically address data can be thought of as unordered categorical data, but usually I wouldn't use it that way.



                  For survey data, an idea would be to look for Likert scales, e.g. 5-8 values, either strings (which might probably need hardcoded (and translated) levels to look for "good", "bad", ".agree.", "very .*",...) or int values in the 0-8 range + NA.



                  Countries and such things might also be identifiable...



                  Age groups (".-.") might also work.






                  share|improve this answer















                  IMO the opposite strategy, identifying categoricals is better because it depends on what the data is about. Technically address data can be thought of as unordered categorical data, but usually I wouldn't use it that way.



                  For survey data, an idea would be to look for Likert scales, e.g. 5-8 values, either strings (which might probably need hardcoded (and translated) levels to look for "good", "bad", ".agree.", "very .*",...) or int values in the 0-8 range + NA.



                  Countries and such things might also be identifiable...



                  Age groups (".-.") might also work.







                  share|improve this answer














                  share|improve this answer



                  share|improve this answer








                  edited Jun 3 '17 at 8:31


























                  community wiki





                  2 revs, 2 users 92%
                  Jan Schulz






















                      1














                      You could define which datatypes count as numerics and then exclude the corresponding variables



                      If initial dataframe is df:



                      numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
                      dataframe = df.select_dtypes(exclude=numerics)





answered Mar 25 at 10:24 by VicKat, edited Mar 25 at 12:15












Feels like the above is a great strategy. This is how I ended up implementing it: def is_numeric(input_frame: pd.core.frame.DataFrame, clmn_names: Optional[list] = None): numerics_types = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']; return [True if input_frame[clmn_names].dtypes.name in numerics_types else False]

– Pramit
May 16 at 23:21
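
As quoted, that one-liner is not quite runnable: it indexes the frame with the whole column list at once and returns a single-element list rather than one flag per column. A corrected sketch of the same idea (the typing imports and the per-column loop are my additions):

from typing import List, Optional
import pandas as pd

NUMERIC_TYPES = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

def is_numeric(input_frame: pd.DataFrame,
               clmn_names: Optional[List[str]] = None) -> List[bool]:
    """Return one boolean per column: True if that column's dtype is numeric."""
    clmn_names = list(input_frame.columns) if clmn_names is None else clmn_names
    return [input_frame[name].dtype.name in NUMERIC_TYPES for name in clmn_names]
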






I've been looking at this and thought it may be useful to share what I have. This builds on @Rishabh Srivastava's answer.

import pandas as pd

def remove_cat_features(X, method='fraction_unique', cat_cols=None, min_fraction_unique=0.05):
    """Removes categorical features using a given method.
    X: pd.DataFrame, dataframe to remove categorical features from."""

    if method == 'fraction_unique':
        # Keep columns where the fraction of unique values exceeds the threshold.
        unique_fraction = X.apply(lambda col: len(pd.unique(col)) / len(col))
        reduced_X = X.loc[:, unique_fraction > min_fraction_unique]

    if method == 'named_columns':
        # Keep every column that is not in the explicit list of categorical columns.
        non_cat_cols = [col not in cat_cols for col in X.columns]
        reduced_X = X.loc[:, non_cat_cols]

    return reduced_X

You can then call this function, giving a pandas df as X, and either remove named categorical columns or remove columns with a low fraction of unique values (controlled by min_fraction_unique); a usage example follows below.
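
For illustration, here is how it might be called on a made-up frame (the column names, the random data and the expected outcome are my assumptions, not part of the answer):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.normal(size=100),                    # ~100 unique values, fraction ~1.0
    "colour": rng.choice(["red", "blue"], size=100),  # 2 unique values, fraction 0.02
})

# 'colour' is dropped because 0.02 < 0.05; 'price' is kept.
numeric_only = remove_cat_features(df, method='fraction_unique', min_fraction_unique=0.05)

# Alternatively, drop explicitly named categorical columns.
numeric_only = remove_cat_features(df, method='named_columns', cat_cols=['colour'])
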






answered Apr 29 at 10:09 by FChm












I should add: I also tried a Benford's law discriminator for my dataset (physical properties of materials) and it was not successful.

– FChm
Apr 29 at 10:22




