


What is a good heuristic to detect if a column in a pandas.DataFrame is categorical?


I've been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data differently. In particular, I want to be able to apply, e.g., a OneHotEncoder to only the categorical data.



Now, let's assume that we're provided a pandas.DataFrame and have no other information about the data in the DataFrame. What is a good heuristic to use to determine whether a column in the pandas.DataFrame is categorical?



My initial thoughts are:



1) If there are strings in the column (e.g., the column data type is object), then the column very likely contains categorical data



2) If some percentage of the values in the column is unique (e.g., >=20%), then the column very likely contains continuous data



I've found 1) to work fine, but 2) hasn't panned out very well. I need better heuristics. How would you solve this problem?



Edit: Someone requested that I explain why 2) didn't work well. There were some test cases where a column still contained continuous values but had only a few unique values, and the heuristic in 2) obviously failed there. There were also cases where a categorical column had many, many unique values, e.g., passenger names in the Titanic data set. That produced the same column-type misclassification problem, just in the other direction.
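For concreteness, a minimal sketch of the two heuristics as described might look like the following (the function name guess_categorical and the 20% threshold are purely illustrative, not an established recipe):

    import pandas as pd

    def guess_categorical(df: pd.DataFrame, unique_ratio_threshold: float = 0.2) -> dict:
        """Map each column name to True if it looks categorical under the two heuristics above."""
        guesses = {}
        for col in df.columns:
            if pd.api.types.is_object_dtype(df[col]):
                # heuristic 1: string/object columns are very likely categorical
                guesses[col] = True
            else:
                # heuristic 2: a high ratio of unique values suggests continuous data
                guesses[col] = df[col].nunique() / df[col].count() < unique_ratio_threshold
        return guesses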







Tags: python, pandas, scikit-learn






asked Mar 6 '16 at 12:38 by Randy Olson, last edited Mar 6 '16 at 13:45 (question score: 18)












  • I believe this question is nearly completely undefined. What is the distribution over all the datasets in the world? Your rule 1 fails miserably for the postal service or phone book, for example. – Ami Tavory, Mar 6 '16 at 13:45

  • Try Benford's law to discern numerical data from categorical data. – Artem Sobolev, Mar 6 '16 at 14:47

  • @Barmaley.exe Can you elaborate on that idea please? – Randy Olson, Mar 13 '16 at 3:43

  • @RandyOlson, well, I'm not sure if it'd work, but the idea is that "natural" numbers tend to obey Benford's law, while categorical values (ids) don't have to: indeed, you can permute ids arbitrarily and nothing would change. So you can try to derive some kind of test from that law. – Artem Sobolev, Mar 14 '16 at 8:12

  • Do you have any improvements on this? – ayhan, Jun 4 '17 at 12:05
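The Benford's-law idea from the comments above can be turned into a rough test by comparing a column's leading-digit distribution against the Benford frequencies with a chi-square goodness-of-fit test. This is only a sketch: the helper name benford_pvalue is made up, scipy is assumed to be available, and a comment on one of the answers further down reports that such a discriminator did not work on at least one real dataset.

    import numpy as np
    import pandas as pd
    from scipy.stats import chisquare

    def benford_pvalue(series: pd.Series) -> float:
        """Chi-square p-value of the leading-digit distribution against Benford's law.
        A very low p-value means the column does not look like 'naturally occurring' numbers."""
        values = pd.to_numeric(series, errors="coerce").abs()
        values = values[values > 0].dropna()
        if len(values) < 50:  # too little data for a meaningful test
            return float("nan")
        # leading digit = first digit of the mantissa, clipped to 1..9 for safety
        leading = (values / 10 ** np.floor(np.log10(values))).astype(int).clip(1, 9)
        observed = leading.value_counts().reindex(range(1, 10), fill_value=0)
        expected = np.log10(1 + 1 / np.arange(1, 10)) * len(values)
        return chisquare(observed, expected).pvalue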

















7 Answers
































Here are a couple of approaches:

  1. Find the ratio of the number of unique values to the total number of values. Something like the following:

     likely_cat = {}
     for var in df.columns:
         likely_cat[var] = 1.*df[var].nunique()/df[var].count() < 0.05  # or some other threshold

  2. Check whether the top n unique values account for more than a certain proportion of all values:

     top_n = 10
     likely_cat = {}
     for var in df.columns:
         likely_cat[var] = 1.*df[var].value_counts(normalize=True).head(top_n).sum() > 0.8  # or some other threshold

Approach 1) has generally worked better for me than Approach 2). But Approach 2) is better if there is a 'long-tailed distribution', where a small number of categories occur with high frequency while a large number of categories occur with low frequency.






answered Mar 6 '16 at 13:50 by Rishabh Srivastava, edited Dec 13 '18 at 12:36 (score: 19)
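As a quick illustration of the unique-value-ratio idea from this answer (the toy DataFrame below is made up):

    import pandas as pd

    df = pd.DataFrame({
        "sex": ["m", "f", "f", "m"] * 25,   # 2 unique values out of 100 rows
        "income": range(100),               # 100 unique values out of 100 rows
    })

    likely_cat = {var: 1.*df[var].nunique()/df[var].count() < 0.05 for var in df.columns}
    print(likely_cat)   # {'sex': True, 'income': False}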

























  • May I kindly check if approach 2 is missing a summation operation? When I tested it on my code, it seemed to return a series of booleans, each representing whether that particular unique value has relative frequency > threshold. Was the intention to sum the total relative frequencies for the top_n rows? (1.*dff['test'].value_counts(normalize=True).head(3)).sum() > 0.8 – AiRiFiEd, Dec 13 '18 at 5:28

  • @AiRiFiEd: Yes - it was missing a summation operation. Thanks very much for pointing that out. Have updated the answer. – Rishabh Srivastava, Dec 13 '18 at 12:38

  • Thanks for updating the answer despite this being a very old post! May I kindly check, from your experience, what would be a reasonable heuristic to use as the threshold for approach 2? For example, I am thinking of assigning top_n as x percent of the total number of unique values (thereby resulting in something along the lines of "20% of unique values account for 80% of all values"): top_n = round(0.8 * (1.*dff[var].value_counts(normalize=True).head(3)).shape[0]) – AiRiFiEd, Dec 14 '18 at 14:31

































There are many places where you could "steal" the definitions of formats that can be cast as numbers; ##,#e-# would be one such format, just to illustrate. Maybe you'll be able to find a library to do so.
I try to cast everything to numbers first, and whatever is left over has no other option but to be kept as categorical.






answered Mar 6 '16 at 14:01 by Diego (score: 2)
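A minimal sketch of this cast-everything-first idea, using pandas' built-in pd.to_numeric rather than a dedicated format library (the helper name split_numeric_categorical is illustrative):

    import pandas as pd

    def split_numeric_categorical(df: pd.DataFrame):
        """Columns where every non-null value survives pd.to_numeric are treated as numeric;
        whatever is left over is kept as categorical."""
        numeric_cols, categorical_cols = [], []
        for col in df.columns:
            coerced = pd.to_numeric(df[col], errors="coerce")
            if coerced.notna().sum() == df[col].notna().sum():
                numeric_cols.append(col)
            else:
                categorical_cols.append(col)
        return df[numeric_cols], df[categorical_cols]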























  • I like this idea. Does anyone know of such a library? – Randy Olson, Mar 6 '16 at 14:04

  • If you like the idea consider upvoting the answer so it will be more visible to others and they might suggest the library. – Diego, Mar 11 '16 at 19:52
































I think the real question here is whether you'd like to bother the user once in a while or silently fail once in a while.



If you don't mind bothering the user, maybe detecting ambiguity and raising an error is the way to go.



If you don't mind failing silently, then your heuristics are ok. I don't think you'll find anything that's significantly better. I guess you could make this into a learning problem if you really want to. Download a bunch of datasets, assume they are collectively a decent representation of all data sets in the world, and train based on features over each data set / column to predict categorical vs. continuous.



But of course in the end nothing can be perfect. E.g. is the column [1, 8, 22, 8, 9, 8] referring to hours of the day or to dog breeds?






answered Mar 6 '16 at 14:31 by rd11 (score: 1)












































I've been thinking about a similar problem, and the more I consider it, the more it seems that this is itself a classification problem that could benefit from training a model.

I bet if you examined a bunch of datasets and extracted these features for each column / pandas.Series:

  • % floats: percentage of values that are floats
  • % int: percentage of values that are whole numbers
  • % string: percentage of values that are strings
  • % unique string: number of unique string values / total number
  • % unique integers: number of unique integer values / total number
  • mean numerical value (non-numerical values considered 0 for this)
  • std deviation of numerical values

and trained a model, it could get pretty good at inferring column types, where the possible output values are: categorical, ordinal, quantitative.

Side note: as far as a Series with a limited number of numerical values goes, it seems like the interesting problem is determining categorical vs ordinal; it doesn't hurt to treat a variable as ordinal if it turns out to be quantitative, right? The preprocessing steps would encode the ordinal values numerically anyway without one-hot encoding.

A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? E.g. in the forest-cover-type-prediction Kaggle contest, you would automatically know that soil type is a single categorical variable.






answered Jun 29 '16 at 19:53 by Karl Rosaen (score: 1)
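A sketch of what the per-column feature extraction described in this answer could look like (the function name and exact feature definitions are illustrative; a real meta-model would still need labelled example columns to train on):

    import pandas as pd

    def column_features(s: pd.Series) -> dict:
        """Per-column features along the lines listed above (illustrative only)."""
        s = s.dropna()
        n = max(len(s), 1)
        numeric = pd.to_numeric(s, errors="coerce")
        is_float = numeric.notna() & (numeric % 1 != 0)
        is_int = numeric.notna() & (numeric % 1 == 0)
        is_string = s.apply(lambda v: isinstance(v, str))
        return {
            "pct_float": is_float.sum() / n,
            "pct_int": is_int.sum() / n,
            "pct_string": is_string.sum() / n,
            "pct_unique_string": s[is_string].nunique() / n,
            "pct_unique_int": numeric[is_int].nunique() / n,
            "mean_numeric": numeric.fillna(0).mean(),   # non-numeric values counted as 0
            "std_numeric": numeric.fillna(0).std(),
        }

Applying column_features to every column of a collection of labelled datasets would then give the training matrix for such a meta-classifier.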























  • > A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? || This would require columnar metadata. See github.com/pandas-dev/pandas/issues/3402 – Wes Turner, Dec 14 '16 at 9:27

































IMO the opposite strategy, identifying categoricals, is better, because it depends on what the data is about. Technically, address data can be thought of as unordered categorical data, but usually I wouldn't use it that way.

For survey data, an idea would be to look for Likert scales, e.g. 5-8 values, either strings (which would probably need hardcoded (and translated) levels to look for, such as "good", "bad", ".agree.", "very .*", ...) or int values in the 0-8 range + NA.

Countries and such things might also be identifiable...

Age groups (".-.") might also work.






community wiki answer by Jan Schulz (2 revisions), last edited Jun 3 '17 at 8:31 (score: 1)
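A rough sketch of the Likert-scale check hinted at above (the word list and the 5-8 level range are illustrative, and real survey data would need translated and hardcoded level names, as the answer notes):

    import pandas as pd

    LIKERT_WORDS = ("agree", "good", "bad", "very", "neutral")  # illustrative, would need translation

    def looks_like_likert(s: pd.Series) -> bool:
        """Heuristic check: 5-8 levels that are either small integers (0-8) or Likert-style wording."""
        levels = pd.Series(s.dropna().unique())
        if not 5 <= len(levels) <= 8:
            return False
        numeric = pd.to_numeric(levels, errors="coerce")
        if numeric.notna().all():
            # numeric levels: expect whole numbers in the 0-8 range
            return bool(((numeric % 1 == 0) & numeric.between(0, 8)).all())
        # string levels: expect typical Likert wording
        return all(isinstance(v, str) and any(w in v.lower() for w in LIKERT_WORDS) for v in levels)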














































You could define which datatypes count as numerics and then exclude the corresponding variables.

If the initial dataframe is df:

    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    dataframe = df.select_dtypes(exclude=numerics)





answered Mar 25 at 10:24 by VicKat, edited Mar 25 at 12:15 (score: 1)

























  • Feels like the above is a great strategy. This is how I landed up implementing it: def is_numeric(input_frame:pd.core.frame.DataFrame, clmn_names:Optional[list]=None): numerics_types = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64'] return [True if input_frame[clmn_names].dtypes.name in numerics_types else False] – Pramit, May 16 at 23:21

































I've been looking at this and thought it may be useful to share what I have. This builds on @Rishabh Srivastava's answer.

    import pandas as pd

    def remove_cat_features(X, method='fraction_unique', cat_cols=None, min_fraction_unique=0.05):
        """Removes categorical features using a given method.
        X: pd.DataFrame, dataframe to remove categorical features from."""

        if method == 'fraction_unique':
            # keep columns whose fraction of unique values is above the threshold
            unique_fraction = X.apply(lambda col: len(pd.unique(col)) / len(col))
            reduced_X = X.loc[:, unique_fraction > min_fraction_unique]

        if method == 'named_columns':
            # keep columns that are not in the user-supplied list of categorical columns
            non_cat_cols = [col not in cat_cols for col in X.columns]
            reduced_X = X.loc[:, non_cat_cols]

        return reduced_X

You can then call this function, giving a pandas df as X, and either remove explicitly named categorical columns or choose to remove columns with a low fraction of unique values (controlled by min_fraction_unique).
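For example (hypothetical usage, not from the original answer; the toy data is made up):

    import pandas as pd

    df = pd.DataFrame({
        "age": [23, 35, 46, 23, 52, 35, 41, 29, 60, 33],
        "sex": ["m", "f", "f", "m", "m", "f", "m", "f", "m", "f"],
    })

    # drop the explicitly named categorical column
    print(remove_cat_features(df, method="named_columns", cat_cols=["sex"]).columns.tolist())
    # ['age']

    # or drop columns whose fraction of unique values is below the threshold
    print(remove_cat_features(df, method="fraction_unique", min_fraction_unique=0.5).columns.tolist())
    # ['age']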





























  • I should add: I also tried a Benford's law discriminator for my dataset (physical properties of materials) and it was not successful. – FChm, Apr 29 at 10:22













      Your Answer






      StackExchange.ifUsing("editor", function ()
      StackExchange.using("externalEditor", function ()
      StackExchange.using("snippets", function ()
      StackExchange.snippets.init();
      );
      );
      , "code-snippets");

      StackExchange.ready(function()
      var channelOptions =
      tags: "".split(" "),
      id: "1"
      ;
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function()
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled)
      StackExchange.using("snippets", function()
      createEditor();
      );

      else
      createEditor();

      );

      function createEditor()
      StackExchange.prepareEditor(
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: true,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: 10,
      bindNavPrevention: true,
      postfix: "",
      imageUploader:
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      ,
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      );



      );













      draft saved

      draft discarded


















      StackExchange.ready(
      function ()
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f35826912%2fwhat-is-a-good-heuristic-to-detect-if-a-column-in-a-pandas-dataframe-is-categori%23new-answer', 'question_page');

      );

      Post as a guest















      Required, but never shown

























      7 Answers
      7






      active

      oldest

      votes








      7 Answers
      7






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      19














      Here are a couple of approaches:




      1. Find the ratio of number of unique values to the total number of unique values. Something like the following




        likely_cat = 
        for var in df.columns:
        likely_cat[var] = 1.*df[var].nunique()/df[var].count() < 0.05 #or some other threshold




      2. Check if the top n unique values account for more than a certain proportion of all values




        top_n = 10 
        likely_cat =
        for var in df.columns:
        likely_cat[var] = 1.*df[var].value_counts(normalize=True).head(top_n).sum() > 0.8 #or some other threshold



      Approach 1) has generally worked better for me than Approach 2). But approach 2) is better if there is a 'long-tailed distribution', where a small number of categorical variables have high frequency while a large number of categorical variables have low frequency.






      share|improve this answer

























      • May I kindly check if approach 2 is missing a summation operation? when I tested it on my code, it seems that it will return a series of booleans, with each representing if that particular unique value has relative frequency > threshold. Was the intention to sum the total relative frequencies for top_n rows? (1.*dff['test'].value_counts(normalize=True).head(3)).sum() > 0.8

        – AiRiFiEd
        Dec 13 '18 at 5:28












      • @AiRiFiEd: Yes - it was missing a summation operation. Thanks very much for pointing that out. Have updated the answer.

        – Rishabh Srivastava
        Dec 13 '18 at 12:38











      • thanks for updating the answer despite this being a very old post! May i kindly check, from your experience, what would be a reasonable heuristic to use as threshold for approach 2? For example, i am thinking of assigning top_n as x percent of total number of unique values (thereby resulting in something along of the lines of "20% of unique values account for 80% of all values" - top_n = round(0.8 * (1.*dff[var].value_counts(normalize=True).head(3)).shape[0])

        – AiRiFiEd
        Dec 14 '18 at 14:31
















      19














      Here are a couple of approaches:




      1. Find the ratio of number of unique values to the total number of unique values. Something like the following




        likely_cat = 
        for var in df.columns:
        likely_cat[var] = 1.*df[var].nunique()/df[var].count() < 0.05 #or some other threshold




      2. Check if the top n unique values account for more than a certain proportion of all values




        top_n = 10 
        likely_cat =
        for var in df.columns:
        likely_cat[var] = 1.*df[var].value_counts(normalize=True).head(top_n).sum() > 0.8 #or some other threshold



      Approach 1) has generally worked better for me than Approach 2). But approach 2) is better if there is a 'long-tailed distribution', where a small number of categorical variables have high frequency while a large number of categorical variables have low frequency.






      share|improve this answer

























      • May I kindly check if approach 2 is missing a summation operation? when I tested it on my code, it seems that it will return a series of booleans, with each representing if that particular unique value has relative frequency > threshold. Was the intention to sum the total relative frequencies for top_n rows? (1.*dff['test'].value_counts(normalize=True).head(3)).sum() > 0.8

        – AiRiFiEd
        Dec 13 '18 at 5:28












      • @AiRiFiEd: Yes - it was missing a summation operation. Thanks very much for pointing that out. Have updated the answer.

        – Rishabh Srivastava
        Dec 13 '18 at 12:38











      • thanks for updating the answer despite this being a very old post! May i kindly check, from your experience, what would be a reasonable heuristic to use as threshold for approach 2? For example, i am thinking of assigning top_n as x percent of total number of unique values (thereby resulting in something along of the lines of "20% of unique values account for 80% of all values" - top_n = round(0.8 * (1.*dff[var].value_counts(normalize=True).head(3)).shape[0])

        – AiRiFiEd
        Dec 14 '18 at 14:31














      19












      19








      19







      Here are a couple of approaches:




      1. Find the ratio of number of unique values to the total number of unique values. Something like the following




        likely_cat = 
        for var in df.columns:
        likely_cat[var] = 1.*df[var].nunique()/df[var].count() < 0.05 #or some other threshold




      2. Check if the top n unique values account for more than a certain proportion of all values




        top_n = 10 
        likely_cat =
        for var in df.columns:
        likely_cat[var] = 1.*df[var].value_counts(normalize=True).head(top_n).sum() > 0.8 #or some other threshold



      Approach 1) has generally worked better for me than Approach 2). But approach 2) is better if there is a 'long-tailed distribution', where a small number of categorical variables have high frequency while a large number of categorical variables have low frequency.






      share|improve this answer















      Here are a couple of approaches:




      1. Find the ratio of number of unique values to the total number of unique values. Something like the following




        likely_cat = 
        for var in df.columns:
        likely_cat[var] = 1.*df[var].nunique()/df[var].count() < 0.05 #or some other threshold




      2. Check if the top n unique values account for more than a certain proportion of all values




        top_n = 10 
        likely_cat =
        for var in df.columns:
        likely_cat[var] = 1.*df[var].value_counts(normalize=True).head(top_n).sum() > 0.8 #or some other threshold



      Approach 1) has generally worked better for me than Approach 2). But approach 2) is better if there is a 'long-tailed distribution', where a small number of categorical variables have high frequency while a large number of categorical variables have low frequency.







      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Dec 13 '18 at 12:36

























      answered Mar 6 '16 at 13:50









      Rishabh SrivastavaRishabh Srivastava

      5492 silver badges11 bronze badges




      5492 silver badges11 bronze badges












      • May I kindly check if approach 2 is missing a summation operation? when I tested it on my code, it seems that it will return a series of booleans, with each representing if that particular unique value has relative frequency > threshold. Was the intention to sum the total relative frequencies for top_n rows? (1.*dff['test'].value_counts(normalize=True).head(3)).sum() > 0.8

        – AiRiFiEd
        Dec 13 '18 at 5:28












      • @AiRiFiEd: Yes - it was missing a summation operation. Thanks very much for pointing that out. Have updated the answer.

        – Rishabh Srivastava
        Dec 13 '18 at 12:38











      • thanks for updating the answer despite this being a very old post! May i kindly check, from your experience, what would be a reasonable heuristic to use as threshold for approach 2? For example, i am thinking of assigning top_n as x percent of total number of unique values (thereby resulting in something along of the lines of "20% of unique values account for 80% of all values" - top_n = round(0.8 * (1.*dff[var].value_counts(normalize=True).head(3)).shape[0])

        – AiRiFiEd
        Dec 14 '18 at 14:31


















      • May I kindly check if approach 2 is missing a summation operation? when I tested it on my code, it seems that it will return a series of booleans, with each representing if that particular unique value has relative frequency > threshold. Was the intention to sum the total relative frequencies for top_n rows? (1.*dff['test'].value_counts(normalize=True).head(3)).sum() > 0.8

        – AiRiFiEd
        Dec 13 '18 at 5:28












      • @AiRiFiEd: Yes - it was missing a summation operation. Thanks very much for pointing that out. Have updated the answer.

        – Rishabh Srivastava
        Dec 13 '18 at 12:38











      • thanks for updating the answer despite this being a very old post! May i kindly check, from your experience, what would be a reasonable heuristic to use as threshold for approach 2? For example, i am thinking of assigning top_n as x percent of total number of unique values (thereby resulting in something along of the lines of "20% of unique values account for 80% of all values" - top_n = round(0.8 * (1.*dff[var].value_counts(normalize=True).head(3)).shape[0])

        – AiRiFiEd
        Dec 14 '18 at 14:31

















      May I kindly check if approach 2 is missing a summation operation? when I tested it on my code, it seems that it will return a series of booleans, with each representing if that particular unique value has relative frequency > threshold. Was the intention to sum the total relative frequencies for top_n rows? (1.*dff['test'].value_counts(normalize=True).head(3)).sum() > 0.8

      – AiRiFiEd
      Dec 13 '18 at 5:28






      May I kindly check if approach 2 is missing a summation operation? when I tested it on my code, it seems that it will return a series of booleans, with each representing if that particular unique value has relative frequency > threshold. Was the intention to sum the total relative frequencies for top_n rows? (1.*dff['test'].value_counts(normalize=True).head(3)).sum() > 0.8

      – AiRiFiEd
      Dec 13 '18 at 5:28














      @AiRiFiEd: Yes - it was missing a summation operation. Thanks very much for pointing that out. Have updated the answer.

      – Rishabh Srivastava
      Dec 13 '18 at 12:38





      @AiRiFiEd: Yes - it was missing a summation operation. Thanks very much for pointing that out. Have updated the answer.

      – Rishabh Srivastava
      Dec 13 '18 at 12:38













      thanks for updating the answer despite this being a very old post! May i kindly check, from your experience, what would be a reasonable heuristic to use as threshold for approach 2? For example, i am thinking of assigning top_n as x percent of total number of unique values (thereby resulting in something along of the lines of "20% of unique values account for 80% of all values" - top_n = round(0.8 * (1.*dff[var].value_counts(normalize=True).head(3)).shape[0])

      – AiRiFiEd
      Dec 14 '18 at 14:31






      thanks for updating the answer despite this being a very old post! May i kindly check, from your experience, what would be a reasonable heuristic to use as threshold for approach 2? For example, i am thinking of assigning top_n as x percent of total number of unique values (thereby resulting in something along of the lines of "20% of unique values account for 80% of all values" - top_n = round(0.8 * (1.*dff[var].value_counts(normalize=True).head(3)).shape[0])

      – AiRiFiEd
      Dec 14 '18 at 14:31














      2














      There's are many places where you could "steal" the definitions of formats that can be cast as "number". ##,#e-# would be one of such format, just to illustrate. Maybe you'll be able to find a library to do so.
      I try to cast everything to numbers first and what is left, well, there's no other way left but to keep them as categorical.






      share|improve this answer























      • I like this idea. Does anyone know of such a library?

        – Randy Olson
        Mar 6 '16 at 14:04











      • If you like the idea consider upvoting the answer so it will be more visible to others and they might suggest the library.

        – Diego
        Mar 11 '16 at 19:52















      2














      There's are many places where you could "steal" the definitions of formats that can be cast as "number". ##,#e-# would be one of such format, just to illustrate. Maybe you'll be able to find a library to do so.
      I try to cast everything to numbers first and what is left, well, there's no other way left but to keep them as categorical.






      share|improve this answer























      • I like this idea. Does anyone know of such a library?

        – Randy Olson
        Mar 6 '16 at 14:04











      • If you like the idea consider upvoting the answer so it will be more visible to others and they might suggest the library.

        – Diego
        Mar 11 '16 at 19:52













      2












      2








      2







      There's are many places where you could "steal" the definitions of formats that can be cast as "number". ##,#e-# would be one of such format, just to illustrate. Maybe you'll be able to find a library to do so.
      I try to cast everything to numbers first and what is left, well, there's no other way left but to keep them as categorical.






      share|improve this answer













      There's are many places where you could "steal" the definitions of formats that can be cast as "number". ##,#e-# would be one of such format, just to illustrate. Maybe you'll be able to find a library to do so.
      I try to cast everything to numbers first and what is left, well, there's no other way left but to keep them as categorical.







      share|improve this answer












      share|improve this answer



      share|improve this answer










      answered Mar 6 '16 at 14:01









      DiegoDiego

      4684 silver badges13 bronze badges




      4684 silver badges13 bronze badges












      • I like this idea. Does anyone know of such a library?

        – Randy Olson
        Mar 6 '16 at 14:04











      • If you like the idea consider upvoting the answer so it will be more visible to others and they might suggest the library.

        – Diego
        Mar 11 '16 at 19:52

















      • I like this idea. Does anyone know of such a library?

        – Randy Olson
        Mar 6 '16 at 14:04











      • If you like the idea consider upvoting the answer so it will be more visible to others and they might suggest the library.

        – Diego
        Mar 11 '16 at 19:52
















      I like this idea. Does anyone know of such a library?

      – Randy Olson
      Mar 6 '16 at 14:04





      I like this idea. Does anyone know of such a library?

      – Randy Olson
      Mar 6 '16 at 14:04













      If you like the idea consider upvoting the answer so it will be more visible to others and they might suggest the library.

      – Diego
      Mar 11 '16 at 19:52





      If you like the idea consider upvoting the answer so it will be more visible to others and they might suggest the library.

      – Diego
      Mar 11 '16 at 19:52











      1














      I think the real question here is whether you'd like to bother the user once in a while or silently fail once in a while.



      If you don't mind bothering the user, maybe detecting ambiguity and raising an error is the way to go.



      If you don't mind failing silently, then your heuristics are ok. I don't think you'll find anything that's significantly better. I guess you could make this into a learning problem if you really want to. Download a bunch of datasets, assume they are collectively a decent representation of all data sets in the world, and train based on features over each data set / column to predict categorical vs. continuous.



      But of course in the end nothing can be perfect. E.g. is the column [1, 8, 22, 8, 9, 8] referring to hours of the day or to dog breeds?






      share|improve this answer



























        1














        I think the real question here is whether you'd like to bother the user once in a while or silently fail once in a while.



        If you don't mind bothering the user, maybe detecting ambiguity and raising an error is the way to go.



        If you don't mind failing silently, then your heuristics are ok. I don't think you'll find anything that's significantly better. I guess you could make this into a learning problem if you really want to. Download a bunch of datasets, assume they are collectively a decent representation of all data sets in the world, and train based on features over each data set / column to predict categorical vs. continuous.



        But of course in the end nothing can be perfect. E.g. is the column [1, 8, 22, 8, 9, 8] referring to hours of the day or to dog breeds?






        share|improve this answer

























          1












          1








          1







          I think the real question here is whether you'd like to bother the user once in a while or silently fail once in a while.



          If you don't mind bothering the user, maybe detecting ambiguity and raising an error is the way to go.



          If you don't mind failing silently, then your heuristics are ok. I don't think you'll find anything that's significantly better. I guess you could make this into a learning problem if you really want to. Download a bunch of datasets, assume they are collectively a decent representation of all data sets in the world, and train based on features over each data set / column to predict categorical vs. continuous.



          But of course in the end nothing can be perfect. E.g. is the column [1, 8, 22, 8, 9, 8] referring to hours of the day or to dog breeds?






          share|improve this answer













          I think the real question here is whether you'd like to bother the user once in a while or silently fail once in a while.



          If you don't mind bothering the user, maybe detecting ambiguity and raising an error is the way to go.



          If you don't mind failing silently, then your heuristics are ok. I don't think you'll find anything that's significantly better. I guess you could make this into a learning problem if you really want to. Download a bunch of datasets, assume they are collectively a decent representation of all data sets in the world, and train based on features over each data set / column to predict categorical vs. continuous.



          But of course in the end nothing can be perfect. E.g. is the column [1, 8, 22, 8, 9, 8] referring to hours of the day or to dog breeds?







          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Mar 6 '16 at 14:31









          rd11rd11

          1,7473 gold badges16 silver badges26 bronze badges




          1,7473 gold badges16 silver badges26 bronze badges





















              1














              I've been thinking about a similar problem and the more that I consider it, it seems that this itself is a classification problem that could benefit from training a model.



              I bet if you examined a bunch of datasets and extracted these features for each column / pandas.Series:



              • % floats: percentage of values that are float

              • % int: percentage of values that are whole numbers

              • % string: percentage of values that are strings

              • % unique string: number of unique string values / total number

              • % unique integers: number of unique integer values / total number

              • mean numerical value (non numerical values considered 0 for this)

              • std deviation of numerical values

              and trained a model, it could get pretty good at inferring column types, where the possible output values are: categorical, ordinal, quantitative.



              Side note: as far as a Series with a limited number of numerical values goes, it seems like the interesting problem would be determining categorical vs ordinal; it doesn't hurt to think a variable is ordinal if it turns out to be quantitative right? The preprocessing steps would encode the ordinal values numerically anyways without one-hot encoding.



              A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? E.g in the forest-cover-type-prediction kaggle contest, you would automatically know that soil type is a single categorical variable.






              share|improve this answer























              • > A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? || This would require columnar metadata. See github.com/pandas-dev/pandas/issues/3402

                – Wes Turner
                Dec 14 '16 at 9:27
















              1














              I've been thinking about a similar problem and the more that I consider it, it seems that this itself is a classification problem that could benefit from training a model.



              I bet if you examined a bunch of datasets and extracted these features for each column / pandas.Series:



              • % floats: percentage of values that are float

              • % int: percentage of values that are whole numbers

              • % string: percentage of values that are strings

              • % unique string: number of unique string values / total number

              • % unique integers: number of unique integer values / total number

              • mean numerical value (non numerical values considered 0 for this)

              • std deviation of numerical values

              and trained a model, it could get pretty good at inferring column types, where the possible output values are: categorical, ordinal, quantitative.



              Side note: as far as a Series with a limited number of numerical values goes, it seems like the interesting problem would be determining categorical vs ordinal; it doesn't hurt to think a variable is ordinal if it turns out to be quantitative right? The preprocessing steps would encode the ordinal values numerically anyways without one-hot encoding.



              A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? E.g in the forest-cover-type-prediction kaggle contest, you would automatically know that soil type is a single categorical variable.






              share|improve this answer























              • > A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? || This would require columnar metadata. See github.com/pandas-dev/pandas/issues/3402

                – Wes Turner
                Dec 14 '16 at 9:27














              1












              1








              1







              I've been thinking about a similar problem and the more that I consider it, it seems that this itself is a classification problem that could benefit from training a model.



              I bet if you examined a bunch of datasets and extracted these features for each column / pandas.Series:



              • % floats: percentage of values that are float

              • % int: percentage of values that are whole numbers

              • % string: percentage of values that are strings

              • % unique string: number of unique string values / total number

              • % unique integers: number of unique integer values / total number

              • mean numerical value (non numerical values considered 0 for this)

              • std deviation of numerical values

              and trained a model, it could get pretty good at inferring column types, where the possible output values are: categorical, ordinal, quantitative.



              Side note: as far as a Series with a limited number of numerical values goes, it seems like the interesting problem would be determining categorical vs ordinal; it doesn't hurt to think a variable is ordinal if it turns out to be quantitative right? The preprocessing steps would encode the ordinal values numerically anyways without one-hot encoding.



              A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? E.g in the forest-cover-type-prediction kaggle contest, you would automatically know that soil type is a single categorical variable.






              share|improve this answer













              I've been thinking about a similar problem and the more that I consider it, it seems that this itself is a classification problem that could benefit from training a model.



              I bet if you examined a bunch of datasets and extracted these features for each column / pandas.Series:



              • % floats: percentage of values that are float

              • % int: percentage of values that are whole numbers

              • % string: percentage of values that are strings

              • % unique string: number of unique string values / total number

              • % unique integers: number of unique integer values / total number

              • mean numerical value (non numerical values considered 0 for this)

              • std deviation of numerical values

              and trained a model, it could get pretty good at inferring column types, where the possible output values are: categorical, ordinal, quantitative.



              Side note: as far as a Series with a limited number of numerical values goes, it seems like the interesting problem would be determining categorical vs ordinal; it doesn't hurt to think a variable is ordinal if it turns out to be quantitative right? The preprocessing steps would encode the ordinal values numerically anyways without one-hot encoding.



              A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? E.g in the forest-cover-type-prediction kaggle contest, you would automatically know that soil type is a single categorical variable.







              share|improve this answer












              share|improve this answer



              share|improve this answer










              answered Jun 29 '16 at 19:53









              Karl RosaenKarl Rosaen

              3,2641 gold badge22 silver badges28 bronze badges




              3,2641 gold badge22 silver badges28 bronze badges












              • > A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? || This would require columnar metadata. See github.com/pandas-dev/pandas/issues/3402

                – Wes Turner
                Dec 14 '16 at 9:27


















              • > A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? || This would require columnar metadata. See github.com/pandas-dev/pandas/issues/3402

                – Wes Turner
                Dec 14 '16 at 9:27

















              > A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? || This would require columnar metadata. See github.com/pandas-dev/pandas/issues/3402

              – Wes Turner
              Dec 14 '16 at 9:27






              > A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? || This would require columnar metadata. See github.com/pandas-dev/pandas/issues/3402

              – Wes Turner
              Dec 14 '16 at 9:27












              1














              IMO the opposite strategy, identifying categoricals is better because it depends on what the data is about. Technically address data can be thought of as unordered categorical data, but usually I wouldn't use it that way.



              For survey data, an idea would be to look for Likert scales, e.g. 5-8 values, either strings (which might probably need hardcoded (and translated) levels to look for "good", "bad", ".agree.", "very .*",...) or int values in the 0-8 range + NA.



              Countries and such things might also be identifiable...



              Age groups (".-.") might also work.






              share|improve this answer





























                1














                IMO the opposite strategy, identifying categoricals is better because it depends on what the data is about. Technically address data can be thought of as unordered categorical data, but usually I wouldn't use it that way.



                For survey data, an idea would be to look for Likert scales, e.g. 5-8 values, either strings (which might probably need hardcoded (and translated) levels to look for "good", "bad", ".agree.", "very .*",...) or int values in the 0-8 range + NA.



                Countries and such things might also be identifiable...



                Age groups (".-.") might also work.






                share|improve this answer



























                  1












                  1








                  1







                  IMO the opposite strategy, identifying categoricals is better because it depends on what the data is about. Technically address data can be thought of as unordered categorical data, but usually I wouldn't use it that way.



                  For survey data, an idea would be to look for Likert scales, e.g. 5-8 values, either strings (which might probably need hardcoded (and translated) levels to look for "good", "bad", ".agree.", "very .*",...) or int values in the 0-8 range + NA.



                  Countries and such things might also be identifiable...



                  Age groups (".-.") might also work.






                  share|improve this answer















                  IMO the opposite strategy, identifying categoricals is better because it depends on what the data is about. Technically address data can be thought of as unordered categorical data, but usually I wouldn't use it that way.



                  For survey data, an idea would be to look for Likert scales, e.g. 5-8 values, either strings (which might probably need hardcoded (and translated) levels to look for "good", "bad", ".agree.", "very .*",...) or int values in the 0-8 range + NA.



                  Countries and such things might also be identifiable...



                  Age groups (".-.") might also work.







                  share|improve this answer














                  share|improve this answer



                  share|improve this answer








                  edited Jun 3 '17 at 8:31


























                  community wiki





                  2 revs, 2 users 92%
                  Jan Schulz






















                      1














                      You could define which datatypes count as numerics and then exclude the corresponding variables



                      If initial dataframe is df:



                      numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
                      dataframe = df.select_dtypes(exclude=numerics)





answered Mar 25 at 10:24 by VicKat, edited Mar 25 at 12:15












Feels like the above is a great strategy. This is how I ended up implementing it: def is_numeric(input_frame: pd.core.frame.DataFrame, clmn_names: Optional[list] = None): numerics_types = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']; return [True if input_frame[clmn_names].dtypes.name in numerics_types else False]

– Pramit
May 16 at 23:21
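
As quoted, that one-liner is not quite runnable: it indexes the frame with the whole column list at once and returns a single-element list rather than one flag per column. A corrected sketch of the same idea (the typing imports and the per-column loop are my additions):

from typing import List, Optional
import pandas as pd

NUMERIC_TYPES = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

def is_numeric(input_frame: pd.DataFrame,
               clmn_names: Optional[List[str]] = None) -> List[bool]:
    """Return one boolean per column: True if that column's dtype is numeric."""
    clmn_names = list(input_frame.columns) if clmn_names is None else clmn_names
    return [input_frame[name].dtype.name in NUMERIC_TYPES for name in clmn_names]
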






I've been looking at this and thought it may be useful to share what I have. This builds on @Rishabh Srivastava's answer.

import pandas as pd

def remove_cat_features(X, method='fraction_unique', cat_cols=None, min_fraction_unique=0.05):
    """Removes categorical features using a given method.
    X: pd.DataFrame, dataframe to remove categorical features from."""

    if method == 'fraction_unique':
        # Keep columns where the fraction of unique values exceeds the threshold.
        unique_fraction = X.apply(lambda col: len(pd.unique(col)) / len(col))
        reduced_X = X.loc[:, unique_fraction > min_fraction_unique]

    if method == 'named_columns':
        # Keep every column that is not in the explicit list of categorical columns.
        non_cat_cols = [col not in cat_cols for col in X.columns]
        reduced_X = X.loc[:, non_cat_cols]

    return reduced_X

You can then call this function, giving a pandas df as X, and either remove named categorical columns or remove columns with a low fraction of unique values (controlled by min_fraction_unique); a usage example follows below.
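
For illustration, here is how it might be called on a made-up frame (the column names, the random data and the expected outcome are my assumptions, not part of the answer):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.normal(size=100),                    # ~100 unique values, fraction ~1.0
    "colour": rng.choice(["red", "blue"], size=100),  # 2 unique values, fraction 0.02
})

# 'colour' is dropped because 0.02 < 0.05; 'price' is kept.
numeric_only = remove_cat_features(df, method='fraction_unique', min_fraction_unique=0.05)

# Alternatively, drop explicitly named categorical columns.
numeric_only = remove_cat_features(df, method='named_columns', cat_cols=['colour'])
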






answered Apr 29 at 10:09 by FChm












I should add: I also tried a Benford's law discriminator for my dataset (physical properties of materials) and it was not successful.

– FChm
Apr 29 at 10:22




