Stratify split by column (object)


When I try to do a stratified split by a categorical column, I get an error.

Country  ColumnA  ColumnB  ColumnC  Label
AB       0.2      0.5      0.1      14
CD       0.9      0.2      0.6      60
EF       0.4      0.3      0.8      5
FG       0.6      0.9      0.2      15

Here's my code:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = df.loc[:, df.columns != 'Label']
y = df['Label']

# Train/test split, stratified by country
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=df.Country)

lm = LinearRegression()
lm.fit(X_train, y_train)
lm_predictions = lm.predict(X_test)

I get the following error:

ValueError: could not convert string to float: 'AB'

edited Mar 27 at 19:12

asked Feb 25 at 21:05

user10155602

Can't reproduce the error (using "Country" for "country_code").

– Christian Sloper
Mar 27 at 18:33

@ChristianSloper good point, fixed. Thanks

– user10155602
Mar 27 at 18:37

@LucaMassaron can you help with this? Thanks

– user10155602
Mar 27 at 19:02

python machine-learning split scikit-learn linear-regression


2 Answers
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'Country': ['AB', 'CD', 'EF', 'FG'] * 20,
    'ColumnA': [1] * 20 * 4, 'ColumnB': [10] * 20 * 4, 'Label': [1, 0, 1, 0] * 20
})

# Encode the country strings as integer codes in a new column
df['Country_Code'] = df['Country'].astype('category').cat.codes

# Features: everything except the label and the string column
X = df.loc[:, df.columns.drop(['Label', 'Country'])]
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=df.Country_Code)
lm = LinearRegression()
lm.fit(X_train, y_train)
lm_predictions = lm.predict(X_test)

Convert the string values in Country to numbers and save them as a new column.

When creating the training features X, drop the label (y) and also the string Country column.
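To check that the stratified split actually kept the countries balanced, the per-country counts in each part can be compared. This is only a minimal sketch on the same toy data as above; the names train_counts and test_counts are introduced here for illustration and are not part of the original answer.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same toy data as in the answer: 4 countries, 20 rows each
df = pd.DataFrame({
    'Country': ['AB', 'CD', 'EF', 'FG'] * 20,
    'ColumnA': [1] * 80,
    'ColumnB': [10] * 80,
    'Label': [1, 0, 1, 0] * 20,
})
df['Country_Code'] = df['Country'].astype('category').cat.codes

X = df.drop(columns=['Label', 'Country'])
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=df['Country_Code'])

# Count how many rows of each country code landed in each part
train_counts = X_train['Country_Code'].value_counts().sort_index()
test_counts = X_test['Country_Code'].value_counts().sort_index()
print(train_counts.to_dict())
print(test_counts.to_dict())
```

With equally sized countries, every code should show up the same number of times within the training part and the same number of times within the test part.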

Method 2

If the test data you will predict on arrives later, you need a way to convert its countries into codes before making predictions. One way to handle such cases is LabelEncoder: call fit (or fit_transform) on the training data to learn the mapping from strings to codes, then call transform to encode the countries in the test data with that same mapping.

import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'Country': ['AB', 'CD', 'EF', 'FG'] * 20,
    'ColumnA': [1] * 20 * 4, 'ColumnB': [10] * 20 * 4, 'Label': [1, 0, 1, 0] * 20
})

# Train-Validation: fit the encoder on the training countries
le = preprocessing.LabelEncoder()
df['Country_Code'] = le.fit_transform(df['Country'])
X = df.loc[:, df.columns.drop(['Label', 'Country'])]
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=df.Country_Code)
lm = LinearRegression()
lm.fit(X_train, y_train)

# Test: reuse the fitted encoder on data that arrives later
test_df = pd.DataFrame({'Country': ['AB'], 'ColumnA': [1], 'ColumnB': [10]})
test_df['Country_Code'] = le.transform(test_df['Country'])
print(lm.predict(test_df.loc[:, test_df.columns.drop(['Country'])]))
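One caveat with Method 2 is that LabelEncoder.transform raises a ValueError for a country it did not see during fit, and scikit-learn documents LabelEncoder as intended for target values rather than input features. The sketch below swaps in OrdinalEncoder, which is not what the answer uses, to show one possible way to tolerate unseen countries. The value -1 for unknown countries, the country 'ZZ', and the names enc, features and new_df are assumptions made for this example.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OrdinalEncoder

train_df = pd.DataFrame({
    'Country': ['AB', 'CD', 'EF', 'FG'] * 20,
    'ColumnA': [1] * 80,
    'ColumnB': [10] * 80,
    'Label': [1, 0, 1, 0] * 20,
})

# Assumption: unseen countries are mapped to -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
# OrdinalEncoder expects a 2D input, hence the double brackets
train_df['Country_Code'] = enc.fit_transform(train_df[['Country']]).ravel()

features = ['ColumnA', 'ColumnB', 'Country_Code']
lm = LinearRegression()
lm.fit(train_df[features], train_df['Label'])

# Data arriving later; 'ZZ' is a hypothetical country absent from training
new_df = pd.DataFrame({'Country': ['AB', 'ZZ'], 'ColumnA': [1, 1], 'ColumnB': [10, 10]})
new_df['Country_Code'] = enc.transform(new_df[['Country']]).ravel()
predictions = lm.predict(new_df[features])
print(new_df['Country_Code'].tolist())
print(predictions)
```

Keep in mind that integer codes give the countries an artificial order, which a linear model treats as a numeric quantity, so one-hot encoding is often a better fit for this kind of feature.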

edited Mar 27 at 22:33

answered Mar 27 at 22:21

mujjiga


In reproducing your code, I found that the error comes from trying to fit a linear regression model on a set of features that includes strings. This answer gives you some options for what to do. I would suggest using
X_train, X_test = pd.get_dummies(X_train.Country), pd.get_dummies(X_test.Country)
to one-hot encode your countries after you make your train_test_split() to preserve the class balance that you are looking for.
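As a rough sketch of this idea: stratify accepts the raw string column, so the split itself does not need numeric codes, and the encoding can happen afterwards. The version below is a variation on the line above that keeps the numeric columns next to the dummy columns and aligns the test columns with the training columns. The toy values and the names X_train_enc and X_test_enc are assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Toy data modelled on the question's table, repeated so each country has 20 rows
df = pd.DataFrame({
    'Country': ['AB', 'CD', 'EF', 'FG'] * 20,
    'ColumnA': [0.2, 0.9, 0.4, 0.6] * 20,
    'ColumnB': [0.5, 0.2, 0.3, 0.9] * 20,
    'Label': [14, 60, 5, 15] * 20,
})

X = df.drop(columns='Label')
y = df['Label']

# Stratifying on the string column works as is; only the model needs numbers
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=df['Country'])

# One-hot encode Country after the split, keeping the numeric columns
X_train_enc = pd.get_dummies(X_train, columns=['Country'], dtype=float)
X_test_enc = pd.get_dummies(X_test, columns=['Country'], dtype=float)
# Align test columns with training columns in case a country is missing from one part
X_test_enc = X_test_enc.reindex(columns=X_train_enc.columns, fill_value=0)

lm = LinearRegression()
lm.fit(X_train_enc, y_train)
lm_predictions = lm.predict(X_test_enc)
```

The reindex step matters mostly for small datasets, where a rare country could end up in only one of the two parts and leave the encoded frames with different columns.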

answered Mar 27 at 22:09

tjeffkessler


