What is a good heuristic to detect if a column in a pandas.DataFrame is categorical?
I've been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data differently. In particular, I want to be able to apply, e.g., a OneHotEncoder to only the categorical data.
Now, let's assume that we're provided a pandas.DataFrame and have no other information about the data in the DataFrame. What is a good heuristic to use to determine whether a column in the pandas.DataFrame is categorical?
My initial thoughts are:
1) If there are strings in the column (e.g., the column data type is object), then the column very likely contains categorical data.
2) If some percentage of the values in the column is unique (e.g., >= 20%), then the column very likely contains continuous data.
I've found 1) to work fine, but 2) hasn't panned out very well. I need better heuristics. How would you solve this problem?
Edit: Someone requested that I explain why 2) didn't work well. There were some test cases where a column still held continuous values but had few unique values, so the heuristic in 2) obviously failed there. There were also cases where a categorical column had many, many unique values, e.g., passenger names in the Titanic data set. Same column-type misclassification problem there.
python pandas scikit-learn
asked Mar 6 '16 at 12:38 by Randy Olson (edited Mar 6 '16 at 13:45)
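For concreteness, a minimal sketch of the two heuristics above as code; the helper name and the 20% cut-off are only the illustrative assumptions stated in the question, not a recommended rule:
import pandas as pd

def guess_is_categorical(s: pd.Series, unique_frac_threshold=0.2) -> bool:
    # Heuristic 1: string/object dtype -> very likely categorical.
    if s.dtype == object:
        return True
    # Heuristic 2: a high fraction of unique values -> very likely continuous.
    return s.nunique() / len(s) < unique_frac_threshold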
I believe this question is nearly completely undefined. What is the distribution over all the datasets in the world? Your rule 1 fails miserably for the postal service or phone book, for example.
– Ami Tavory
Mar 6 '16 at 13:45
Try Benford's law to discern numerical data from categorical data.
– Artem Sobolev
Mar 6 '16 at 14:47
@Barmaley.exe Can you elaborate on that idea please?
– Randy Olson
Mar 13 '16 at 3:43
@RandyOlson, well, I'm not sure if it'd work, but the idea is that "natural" numbers tend to obey Benford's law, while categorical values (ids) don't have to: indeed, you can permute ids arbitrarily and nothing would change. So you can try to derive some kind of a test from that law.
– Artem Sobolev
Mar 14 '16 at 8:12
Do you have any improvements on this?
– ayhan
Jun 4 '17 at 12:05
7 Answers
Here are a couple of approaches:
1) Find the ratio of the number of unique values to the total number of values. Something like the following:
likely_cat = {}
for var in df.columns:
    likely_cat[var] = 1. * df[var].nunique() / df[var].count() < 0.05  # or some other threshold
2) Check whether the top n unique values account for more than a certain proportion of all values:
top_n = 10
likely_cat = {}
for var in df.columns:
    likely_cat[var] = 1. * df[var].value_counts(normalize=True).head(top_n).sum() > 0.8  # or some other threshold
Approach 1) has generally worked better for me than Approach 2). But Approach 2) is better if there is a 'long-tailed distribution', where a small number of category values occur with high frequency while a large number of category values occur with low frequency.
answered Mar 6 '16 at 13:50 by Rishabh Srivastava (edited Dec 13 '18 at 12:36)
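To connect this back to the question, a sketch of how the likely_cat dict could feed a OneHotEncoder via scikit-learn's (later-added) ColumnTransformer; the column split below is an assumption, not part of the original answer:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

cat_cols = [c for c, is_cat in likely_cat.items() if is_cat]

preprocessor = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), cat_cols)],
    remainder='passthrough'  # leave the (assumed) continuous columns as-is
)
X = preprocessor.fit_transform(df)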
May I kindly check if approach 2 is missing a summation operation? When I tested it in my code, it seems that it returns a series of booleans, each representing whether that particular unique value has relative frequency > threshold. Was the intention to sum the total relative frequencies for the top_n rows?
(1.*dff['test'].value_counts(normalize=True).head(3)).sum() > 0.8
– AiRiFiEd
Dec 13 '18 at 5:28
@AiRiFiEd: Yes - it was missing a summation operation. Thanks very much for pointing that out. Have updated the answer.
– Rishabh Srivastava
Dec 13 '18 at 12:38
Thanks for updating the answer despite this being a very old post! May I kindly check, from your experience, what would be a reasonable heuristic to use as a threshold for approach 2? For example, I am thinking of assigning top_n as x percent of the total number of unique values (thereby resulting in something along the lines of "20% of unique values account for 80% of all values"):
top_n = round(0.8 * (1.*dff[var].value_counts(normalize=True).head(3)).shape[0])
– AiRiFiEd
Dec 14 '18 at 14:31
There are many places where you could "steal" the definitions of formats that can be cast as "number". ##,#e-# would be one such format, just to illustrate. Maybe you'll be able to find a library to do so.
I try to cast everything to numbers first; whatever is left has no other choice but to stay categorical.
answered Mar 6 '16 at 14:01 by Diego
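A minimal sketch of this cast-everything-first idea using pandas itself, with pd.to_numeric(errors='coerce') standing in for the hypothetical format library:
import pandas as pd

def split_numeric_categorical(df: pd.DataFrame):
    numeric_cols, cat_cols = [], []
    for col in df.columns:
        coerced = pd.to_numeric(df[col], errors='coerce')
        failed = coerced.isna() & df[col].notna()  # values that did not survive the cast
        if failed.any():
            cat_cols.append(col)
        else:
            numeric_cols.append(col)
    return numeric_cols, cat_cols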
I like this idea. Does anyone know of such a library?
– Randy Olson
Mar 6 '16 at 14:04
If you like the idea consider upvoting the answer so it will be more visible to others and they might suggest the library.
– Diego
Mar 11 '16 at 19:52
I think the real question here is whether you'd like to bother the user once in a while or silently fail once in a while.
If you don't mind bothering the user, maybe detecting ambiguity and raising an error is the way to go.
If you don't mind failing silently, then your heuristics are ok. I don't think you'll find anything that's significantly better. I guess you could make this into a learning problem if you really want to. Download a bunch of datasets, assume they are collectively a decent representation of all data sets in the world, and train based on features over each data set / column to predict categorical vs. continuous.
But of course in the end nothing can be perfect. E.g. is the column [1, 8, 22, 8, 9, 8] referring to hours of the day or to dog breeds?
answered Mar 6 '16 at 14:31 by rd11
I've been thinking about a similar problem, and the more I consider it, the more it seems that this is itself a classification problem that could benefit from training a model.
I bet if you examined a bunch of datasets and extracted these features for each column / pandas.Series:
- % floats: percentage of values that are float
- % int: percentage of values that are whole numbers
- % string: percentage of values that are strings
- % unique string: number of unique string values / total number
- % unique integers: number of unique integer values / total number
- mean numerical value (non numerical values considered 0 for this)
- std deviation of numerical values
and trained a model, it could get pretty good at inferring column types, where the possible output values are: categorical, ordinal, quantitative.
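A minimal sketch of extracting such per-column features; the exact feature definitions here are illustrative guesses, not the answerer's implementation:
import numpy as np
import pandas as pd

def column_features(s: pd.Series) -> dict:
    n = len(s)
    is_str = s.apply(lambda v: isinstance(v, str))
    is_int = s.apply(lambda v: isinstance(v, (int, np.integer)) and not isinstance(v, bool))
    is_float = s.apply(lambda v: isinstance(v, (float, np.floating)))
    numeric = pd.to_numeric(s, errors='coerce').fillna(0)  # non-numeric values counted as 0
    return {
        'pct_float': is_float.mean(),
        'pct_int': is_int.mean(),
        'pct_string': is_str.mean(),
        'pct_unique_string': s[is_str].nunique() / n if n else 0,
        'pct_unique_int': s[is_int].nunique() / n if n else 0,
        'mean_numeric': numeric.mean(),
        'std_numeric': numeric.std(),
    }
Stacking these dicts across many labelled columns would give the training matrix for the column-type classifier the answer describes.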
Side note: as far as a Series with a limited number of numerical values goes, it seems like the interesting problem would be determining categorical vs. ordinal; it doesn't hurt to treat a variable as ordinal if it turns out to be quantitative, right? The preprocessing steps would encode the ordinal values numerically anyway, without one-hot encoding.
A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? E.g., in the forest-cover-type-prediction Kaggle contest, you would automatically know that soil type is a single categorical variable.
answered Jun 29 '16 at 19:53 by Karl Rosaen
> A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? || This would require columnar metadata. See github.com/pandas-dev/pandas/issues/3402
– Wes Turner
Dec 14 '16 at 9:27
IMO the opposite strategy, identifying categoricals directly, is better, because what counts as categorical depends on what the data is about. Technically, address data can be thought of as unordered categorical data, but usually I wouldn't use it that way.
For survey data, an idea would be to look for Likert scales, e.g. 5-8 values, either strings (which would probably need hardcoded (and translated) levels to look for "good", "bad", ".agree.", "very .*", ...) or int values in the 0-8 range + NA.
Countries and such things might also be identifiable...
Age groups (".-.") might also work.
community wiki, last edited Jun 3 '17 at 8:31 by Jan Schulz
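A rough sketch of the Likert detection idea; the pattern list and the 0-8 range are the answer's own examples, and the levels would need hardcoding and translation in practice:
import re
import pandas as pd

LIKERT_PATTERNS = [r'good', r'bad', r'.*agree.*', r'very .*']  # illustrative, per the answer

def looks_like_likert(s: pd.Series, max_levels=8) -> bool:
    vals = s.dropna().unique()
    if not 2 <= len(vals) <= max_levels:
        return False
    if all(isinstance(v, str) for v in vals):
        # Every level must match one of the known Likert wordings.
        return all(any(re.fullmatch(p, v.strip().lower()) for p in LIKERT_PATTERNS) for v in vals)
    try:
        nums = [float(v) for v in vals]
    except (TypeError, ValueError):
        return False
    return all(n.is_integer() and 0 <= n <= 8 for n in nums)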
You could define which datatypes count as numerics and then exclude the corresponding variables.
If the initial dataframe is df:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
dataframe = df.select_dtypes(exclude=numerics)
answered Mar 25 at 10:24 by VicKat (edited Mar 25 at 12:15)
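Under the same dtype-based assumption, the complementary numeric selection lets you split the frame for different preprocessing; note this relies purely on dtype, so integer-coded categoricals will land on the numeric side:
categorical_df = df.select_dtypes(exclude=numerics)
numeric_df = df.select_dtypes(include=numerics)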
Feels like the above is a great strategy. This is how I landed up implementing it:
from typing import Optional
def is_numeric(input_frame: pd.DataFrame, clmn_names: Optional[list] = None):
    numerics_types = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    return [input_frame[c].dtype.name in numerics_types for c in clmn_names]
– Pramit
May 16 at 23:21
I've been looking at this and thought it may be useful to share what I have. This builds on @Rishabh Srivastava's answer.
import pandas as pd

def remove_cat_features(X, method='fraction_unique', cat_cols=None, min_fraction_unique=0.05):
    """Removes categorical features using a given method.
    X: pd.DataFrame, dataframe to remove categorical features from."""
    if method == 'fraction_unique':
        unique_fraction = X.apply(lambda col: len(pd.unique(col)) / len(col))
        reduced_X = X.loc[:, unique_fraction > min_fraction_unique]
    if method == 'named_columns':
        non_cat_cols = [col not in cat_cols for col in X.columns]
        reduced_X = X.loc[:, non_cat_cols]
    return reduced_X
You can then call this function, giving a pandas df as X, and you can either remove named categorical columns or you can choose to remove columns with a low number of unique values (specified by min_fraction_unique).
answered Apr 29 at 10:09 by FChm
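A small usage sketch, building directly on the function above with an invented toy frame:
df = pd.DataFrame({
    'colour': ['red', 'blue'] * 4,                       # 2 unique values out of 8 rows
    'length': [1.2, 3.4, 5.6, 7.8, 9.1, 2.2, 4.4, 6.6],  # all unique
})
numeric_only = remove_cat_features(df, min_fraction_unique=0.5)                      # drops 'colour'
numeric_only = remove_cat_features(df, method='named_columns', cat_cols=['colour'])  # same result by name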
I should add: I also tried a Benford's law discriminator for my dataset (physical properties of materials) and it was not successful.
– FChm
Apr 29 at 10:22