

Stratify split by column (object)

















When I try to do a stratified split by a categorical column, I get an error.



Country  ColumnA  ColumnB  ColumnC  Label
AB       0.2      0.5      0.1      14
CD       0.9      0.2      0.6      60
EF       0.4      0.3      0.8      5
FG       0.6      0.9      0.2      15


Here's my code:



from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Features: every column except the target
X = df.loc[:, df.columns != 'Label']
y = df['Label']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=df.Country)

lm = LinearRegression()
lm.fit(X_train, y_train)
lm_predictions = lm.predict(X_test)


I get the following error:



ValueError: could not convert string to float: 'AB'






python machine-learning split scikit-learn linear-regression














asked Feb 25 at 21:05 by user10155602, edited Mar 27 at 19:12






















  • Can't reproduce the error (using "Country" for "country_code").
    – Christian Sloper, Mar 27 at 18:33

  • @ChristianSloper Good point, fixed. Thanks.
    – user10155602, Mar 27 at 18:37

  • @LucaMassaron Can you help with this? Thanks.
    – user10155602, Mar 27 at 19:02































2 Answers















Method 1

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Toy data: four countries, 20 rows each
df = pd.DataFrame({
    'Country': ['AB', 'CD', 'EF', 'FG'] * 20,
    'ColumnA': [1] * 20 * 4, 'ColumnB': [10] * 20 * 4, 'Label': [1, 0, 1, 0] * 20
})

# Encode the country strings as integer codes in a new column
df['Country_Code'] = df['Country'].astype('category').cat.codes

# Features: everything except the label and the raw string column
X = df.loc[:, df.columns.drop(['Label', 'Country'])]
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=df.Country_Code)
lm = LinearRegression()
lm.fit(X_train, y_train)
lm_predictions = lm.predict(X_test)


• Convert the string values in Country to numbers and save them as a new column (Country_Code).

• When creating the training features X, drop the label (y) and the string Country column.
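
As a quick check on the split above, the following minimal sketch compares the share of each country in the train and test sets. It assumes the same toy DataFrame as in the code above. It also passes the raw string column to stratify, which train_test_split accepts as far as I can tell; only the feature matrix handed to the model needs to be numeric. The names train_share and test_share are made up for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same toy data as above (assumption: 4 countries, 20 rows each)
df = pd.DataFrame({
    'Country': ['AB', 'CD', 'EF', 'FG'] * 20,
    'ColumnA': [1] * 80,
    'ColumnB': [10] * 80,
    'Label': [1, 0, 1, 0] * 20,
})

X = df.drop(columns=['Label'])
y = df['Label']

# Stratify directly on the string column; no encoding is needed for this step
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=df['Country']
)

# Fraction of rows per country in each part of the split
train_share = X_train['Country'].value_counts(normalize=True).sort_index()
test_share = X_test['Country'].value_counts(normalize=True).sort_index()
print(train_share)
print(test_share)
```

With balanced toy data like this, both parts of the split should show each country at the same share as in the full DataFrame.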

Method 2

If the test data you want predictions for arrives later, you need a way to convert its country values into the same codes before predicting. One way to do this is with LabelEncoder: call fit (or fit_transform) on the training data to learn the string-to-code mapping, then call transform to encode the country column of the test data.



import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'Country': ['AB', 'CD', 'EF', 'FG'] * 20,
    'ColumnA': [1] * 20 * 4, 'ColumnB': [10] * 20 * 4, 'Label': [1, 0, 1, 0] * 20
})

# Train-Validation
le = preprocessing.LabelEncoder()
df['Country_Code'] = le.fit_transform(df['Country'])  # learn and apply the mapping
X = df.loc[:, df.columns.drop(['Label', 'Country'])]
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=df.Country_Code)
lm = LinearRegression()
lm.fit(X_train, y_train)

# Test: reuse the fitted encoder on data that arrives later
test_df = pd.DataFrame({'Country': ['AB'], 'ColumnA': [1], 'ColumnB': [10]})
test_df['Country_Code'] = le.transform(test_df['Country'])
print(lm.predict(test_df.loc[:, test_df.columns.drop(['Country'])]))
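
To make the fit/transform behaviour used above more concrete, here is a small sketch of what the encoder learns and what happens with a country it has never seen. The country 'ZZ' and the variable unseen_raised are made up for illustration. Note that the scikit-learn documentation describes LabelEncoder as intended for target labels, so for feature columns OrdinalEncoder or OneHotEncoder may be the better fit; treat this as an illustration of the mechanism, not a recommendation.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# fit_transform learns the sorted set of categories and returns their codes
train_codes = le.fit_transform(['AB', 'CD', 'EF', 'FG', 'AB'])
print(le.classes_)   # the learned categories
print(train_codes)   # integer code for each input value

# transform reuses the learned mapping on later data
new_codes = le.transform(['CD', 'AB'])
print(new_codes)

# A value that was not present during fit raises ValueError
try:
    le.transform(['ZZ'])  # 'ZZ' is a hypothetical unseen country
    unseen_raised = False
except ValueError as exc:
    unseen_raised = True
    print('unseen country:', exc)
```

If unseen countries can appear at prediction time, you would need to decide how to handle them before calling transform.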














answered Mar 27 at 22:21 by mujjiga, edited Mar 27 at 22:33







































In reproducing your code, I found that the error comes from trying to fit a linear regression model on a set of features that includes strings. This answer gives you some options for what to do. I would suggest using
X_train, X_test = pd.get_dummies(X_train.Country), pd.get_dummies(X_test.Country)
to one-hot encode your countries after you call train_test_split(), so that the split still preserves the class balance you are looking for.
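
Note that the one-liner above replaces X_train and X_test with only the country dummy columns. A minimal sketch of a variant that keeps the numeric columns as well is shown below. It assumes the toy DataFrame from the other answer. The names X_train_enc and X_test_enc, the dtype=float argument, and the reindex step that aligns the test columns to the training columns are additions for illustration and are not part of the original suggestion.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Toy data (assumption: same shape as in the other answer)
df = pd.DataFrame({
    'Country': ['AB', 'CD', 'EF', 'FG'] * 20,
    'ColumnA': [1] * 80,
    'ColumnB': [10] * 80,
    'Label': [1, 0, 1, 0] * 20,
})

X = df.drop(columns=['Label'])
y = df['Label']

# Split first, stratifying on the raw string column
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=X['Country']
)

# One-hot encode only the Country column; other columns pass through unchanged
X_train_enc = pd.get_dummies(X_train, columns=['Country'], dtype=float)
X_test_enc = pd.get_dummies(X_test, columns=['Country'], dtype=float)

# Align test columns to train in case a country is missing from the test set
X_test_enc = X_test_enc.reindex(columns=X_train_enc.columns, fill_value=0)

lm = LinearRegression()
lm.fit(X_train_enc, y_train)
lm_predictions = lm.predict(X_test_enc)
print(lm_predictions[:5])
```

Unlike integer codes, the dummy columns do not impose an ordering on the countries, which is usually what a linear model needs for a nominal feature.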

















answered Mar 27 at 22:09 by tjeffkessler





























