Stratify split by column (object)
When I try to do a stratified split by a categorical column, I get an error.
Country  ColumnA  ColumnB  ColumnC  Label
AB       0.2      0.5      0.1      14
CD       0.9      0.2      0.6      60
EF       0.4      0.3      0.8      5
FG       0.6      0.9      0.2      15
Here's my code:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = df.loc[:, df.columns != 'Label']
y = df['Label']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=df.Country)

lm = LinearRegression()
lm.fit(X_train, y_train)
lm_predictions = lm.predict(X_test)
I get the following error:
ValueError: could not convert string to float: 'AB'
python machine-learning split scikit-learn linear-regression
edited Mar 27 at 19:12
asked Feb 25 at 21:05
user10155602
Can't reproduce the error (using "Country" for "country_code")
– Christian Sloper
Mar 27 at 18:33
@ChristianSloper good point, fixed. Thanks
– user10155602
Mar 27 at 18:37
@LucaMassaron can you help with this? Thanks
– user10155602
Mar 27 at 19:02
2 Answers
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    'Country': ['AB', 'CD', 'EF', 'FG'] * 20,
    'ColumnA': [1] * 20 * 4, 'ColumnB': [10] * 20 * 4, 'Label': [1, 0, 1, 0] * 20
})

# Encode the country strings as integer codes in a new column
df['Country_Code'] = df['Country'].astype('category').cat.codes

# Features: everything except the target and the raw string column
X = df.loc[:, df.columns.drop(['Label', 'Country'])]
y = df['Label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=df.Country_Code)

lm = LinearRegression()
lm.fit(X_train, y_train)
lm_predictions = lm.predict(X_test)
- Convert the string values in Country to numbers and save them as a new column.
- When creating the X training data, drop Label (y) and also the string Country column.
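As a side note, the numeric encoding is only needed for the features passed to fit. As far as I know, train_test_split also accepts string labels in its stratify argument, so the raw Country column can be used for stratification as long as it is kept out of X. A minimal sketch, reusing the toy data above (looking up Country by the split indices is my own illustrative check, not part of the original answer):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'Country': ['AB', 'CD', 'EF', 'FG'] * 20,
    'ColumnA': [1] * 80,
    'ColumnB': [10] * 80,
    'Label': [1, 0, 1, 0] * 20,
})

# Keep the string column out of the features, but still stratify on it
X = df.drop(columns=['Label', 'Country'])
y = df['Label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=df['Country'])

# The split keeps the original index, so the country of each row
# can be looked up to compare class proportions in both parts
train_share = df.loc[X_train.index, 'Country'].value_counts(normalize=True)
test_share = df.loc[X_test.index, 'Country'].value_counts(normalize=True)
print(train_share.sort_index())
print(test_share.sort_index())
```

With four equally frequent countries, each one should make up a quarter of both the training and the test part.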
Method 2
If the test data on which you will make predictions arrives later, you will need a mechanism to convert its country into a code before making predictions. The recommended way in such cases is to use LabelEncoder: call fit to learn the mapping from strings to labels, and later call transform to encode the country of the test data.
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    'Country': ['AB', 'CD', 'EF', 'FG'] * 20,
    'ColumnA': [1] * 20 * 4, 'ColumnB': [10] * 20 * 4, 'Label': [1, 0, 1, 0] * 20
})

# Train-Validation
le = preprocessing.LabelEncoder()
df['Country_Code'] = le.fit_transform(df['Country'])
X = df.loc[:, df.columns.drop(['Label', 'Country'])]
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=df.Country_Code)
lm = LinearRegression()
lm.fit(X_train, y_train)

# Test: reuse the fitted encoder so the codes match those seen in training
test_df = pd.DataFrame({'Country': ['AB'], 'ColumnA': [1], 'ColumnB': [10]})
test_df['Country_Code'] = le.transform(test_df['Country'])
print(lm.predict(test_df.loc[:, test_df.columns.drop(['Country'])]))
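One caveat with this approach: to my knowledge, LabelEncoder.transform raises a ValueError when it meets a country that was not present during fit. If that can happen with later test data, a possible alternative is OrdinalEncoder with handle_unknown='use_encoded_value', which maps unseen values to a placeholder code. The sketch below assumes scikit-learn 0.24 or newer; the unseen country 'ZZ' and the placeholder -1 are hypothetical choices for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({'Country': ['AB', 'CD', 'EF', 'FG']})
# 'ZZ' is a made-up country that never appeared in the training data
new = pd.DataFrame({'Country': ['AB', 'ZZ']})

# Unseen categories are encoded as -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
enc.fit(train[['Country']])

codes = enc.transform(new[['Country']]).ravel()
print(codes)
```

Whether a placeholder code is meaningful for a linear model is a separate question; the sketch only shows how to avoid the failure at transform time.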
edited Mar 27 at 22:33
answered Mar 27 at 22:21
mujjiga
5,070  2 gold badges  16 silver badges  24 bronze badges
In reproducing your code, I found that the error comes from trying to fit a linear regression model on a set of features that includes strings. This answer gives you some options for what to do. I would suggest one-hot encoding your countries with X_train, X_test = pd.get_dummies(X_train.Country), pd.get_dummies(X_test.Country) after you call train_test_split(), which preserves the class balance that you are looking for.
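Note that, taken literally, the one-liner above replaces X_train and X_test with the dummy columns only, so the numeric features are lost, and two separately encoded frames can end up with different columns if a country is missing from one of them. A minimal sketch of one way around both issues, assuming the toy data from the other answer (the reindex step and the _enc variable names are my own additions):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'Country': ['AB', 'CD', 'EF', 'FG'] * 20,
    'ColumnA': [1] * 80,
    'ColumnB': [10] * 80,
    'Label': [1, 0, 1, 0] * 20,
})
X = df.drop(columns='Label')
y = df['Label']

# Split first, stratifying on the raw string column
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=df['Country'])

# One-hot encode only Country; the numeric columns are kept as they are
X_train_enc = pd.get_dummies(X_train, columns=['Country'], dtype=float)
X_test_enc = pd.get_dummies(X_test, columns=['Country'], dtype=float)

# Give the test frame exactly the training columns, filling gaps with 0
X_test_enc = X_test_enc.reindex(columns=X_train_enc.columns, fill_value=0)

lm = LinearRegression()
lm.fit(X_train_enc, y_train)
lm_predictions = lm.predict(X_test_enc)
```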
answered Mar 27 at 22:09
tjeffkessler
415 bronze badges