CountVectorizer values work alone in classifier, cannot get working when adding other features
I have a CSV of Twitter profile data containing: name, description, followers count, following count, and bot (the class I want to predict).
I have successfully trained a classification model using just the CountVectorizer values (xtrain) and bot (ytrain), but I have not been able to combine the CountVectorizer output with my other features.
vectorizer = CountVectorizer()
CountVecTest = vectorizer.fit_transform(training_data.description.values.astype('U'))
CountVecTest = CountVecTest.toarray()
arr = sparse.coo_matrix(CountVecTest)
training_data["NewCol"] = arr.toarray().tolist()
rf = RandomForestClassifier(criterion='entropy', min_samples_leaf=10, min_samples_split=20)
rf = rf.fit(training_data[["followers_count","friends_count","NewCol","bot"]], training_data.bot)
ERROR:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-54-7d67a6586592> in <module>()
1 rf = RandomForestClassifier(criterion='entropy', min_samples_leaf=10, min_samples_split=20)
----> 2 rf = rf.fit(training_data[["followers_count","friends_count","NewCol","bot"]], training_data.bot)
D:_MyFiles_LibrariesDocumentsAnaconda3libsite-packagessklearnensembleforest.py in fit(self, X, y, sample_weight)
245 """
246 # Validate or convert input data
--> 247 X = check_array(X, accept_sparse="csc", dtype=DTYPE)
248 y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)
249 if sample_weight is not None:
D:_MyFiles_LibrariesDocumentsAnaconda3libsite-packagessklearnutilsvalidation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
431 force_all_finite)
432 else:
--> 433 array = np.array(array, dtype=dtype, order=order, copy=copy)
434
435 if ensure_2d:
ValueError: setting an array element with a sequence.
I did some debugging:
print(type(training_data.NewCol))
print(type(training_data.NewCol[0]))
>>> <class 'pandas.core.series.Series'>
>>> <class 'numpy.ndarray'>
Any help would be appreciated.
python scikit-learn classification text-classification countvectorizer
asked Mar 20 at 20:56
Tallen86
1 Answer
I would do this the other way around: instead of adding the vectorized text to your dataframe, add your other features to the vectorizer's output. Here is what I mean, with a toy example:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
from scipy.sparse import hstack, csr_matrix
Suppose now you have your features in a dataframe called df and your labels in y_train:
df = pd.DataFrame({"a": [1, 2], "b": [2, 3], "c": ['we love cars', 'we love cakes']})
y_train = np.array([0,1])
You want to perform a text vectorization on column c and add the features a and b to your vectorization.
vectorizer = CountVectorizer()
CountVecTest = vectorizer.fit_transform(df.c)
CountVecTest.toarray()
This will return:
array([[0, 1, 1, 1],
[1, 0, 1, 1]], dtype=int64)
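To see which word each of these columns stands for, the fitted vectorizer's vocabulary_ mapping can be inverted. This is only a minimal sketch on the same toy sentences; the names docs, counts and columns are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['we love cars', 'we love cakes']
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps each token to its column index; sort the tokens by that index
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(columns)           # ['cakes', 'cars', 'love', 'we']
print(counts.toarray())  # one row per document, one column per token
```

So the first column counts "cakes", the second "cars", and so on, which is why the two rows differ only in their first two entries.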
But CountVecTest is now a SciPy sparse matrix, so what you need to do is add your features to this matrix, like this:
X_train = hstack([CountVecTest, csr_matrix(df[['a','b']])])
X_train.toarray()
This will return, as expected:
array([[0, 1, 1, 1, 1, 2],
[1, 0, 1, 1, 2, 3]], dtype=int64)
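If you would rather not stack the matrices by hand, scikit-learn's ColumnTransformer can do the same combination. The sketch below is an alternative to the hstack call, not what the snippet above does; the step names "text" and "numeric", the variable X_dense, and the forest settings (n_estimators=10, random_state=0) are arbitrary choices for the example:

```python
import numpy as np
import pandas as pd
from scipy.sparse import issparse
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

df = pd.DataFrame({"a": [1, 2], "b": [2, 3],
                   "c": ['we love cars', 'we love cakes']})
y_train = np.array([0, 1])

# Vectorize the text column "c" and pass "a" and "b" through unchanged.
# CountVectorizer needs a 1-D column, so "c" is given as a string, not a list.
preprocess = ColumnTransformer([
    ("text", CountVectorizer(), "c"),
    ("numeric", "passthrough", ["a", "b"]),
])

X_train = preprocess.fit_transform(df)
# The result may be sparse or dense depending on how many entries are zero
X_dense = X_train.toarray() if issparse(X_train) else np.asarray(X_train)
print(X_dense)

# Preprocessing and classifier can be chained into a single estimator
model = make_pipeline(preprocess,
                      RandomForestClassifier(n_estimators=10, random_state=0))
model.fit(df, y_train)
print(model.predict(df))
```

The advantage of the pipeline is that the same vocabulary is reused automatically when you call predict on new rows.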
Then you can train your random forest:
rf = RandomForestClassifier(criterion='entropy', min_samples_leaf=10, min_samples_split=20)
rf.fit(X_train, y_train)
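Applied to the column names from the question, the same steps might look roughly like this. The four rows of data, the helper build_features and the forest settings are invented for the example; the real CSV will differ. The cast to float reflects that the numeric columns may need an explicit numeric dtype before stacking:

```python
import numpy as np
import pandas as pd
from scipy.sparse import hstack, csr_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the Twitter CSV described in the question
training_data = pd.DataFrame({
    "description": ["follow me for free prizes", None,
                    "I love cats", "free prizes daily"],
    "followers_count": [10, 250, 300, 5],
    "friends_count": [5000, 180, 220, 4000],
    "bot": [1, 0, 0, 1],
})

# "bot" is the label, so it is deliberately left out of the features
numeric_cols = ["followers_count", "friends_count"]

def build_features(frame, vectorizer, fit=False):
    """Stack word counts and numeric columns into one sparse matrix."""
    text = frame["description"].fillna("").astype(str)
    counts = vectorizer.fit_transform(text) if fit else vectorizer.transform(text)
    numeric = csr_matrix(frame[numeric_cols].to_numpy(dtype=float))
    return hstack([counts, numeric], format="csr")

vectorizer = CountVectorizer()
X_train = build_features(training_data, vectorizer, fit=True)
y_train = training_data["bot"].to_numpy()

rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(X_train, y_train)

# New rows must reuse the fitted vocabulary: transform, not fit_transform
new_data = pd.DataFrame({
    "description": ["win free prizes now"],
    "followers_count": [3],
    "friends_count": [3500],
})
X_new = build_features(new_data, vectorizer)
print(X_new.shape[1] == X_train.shape[1])  # True: same feature columns
print(rf.predict(X_new))
```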
NB: In the code snippet you provided, you passed the label (the "bot" column) in with the training features. You should not do that: the classifier would simply read the answer from that column.
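As for the original error itself: a dataframe column whose cells each hold a whole list of counts has dtype object, and NumPy cannot turn such a frame into the 2-D numeric array that scikit-learn's input validation asks for. The following sketch reproduces this outside scikit-learn with a made-up two-row frame; the np.array call only approximates what the validation step in the traceback does:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"followers_count": [10, 250]})
# Each cell of "NewCol" holds a full list of counts, as in the question
frame["NewCol"] = [[0, 1, 1], [1, 0, 1]]

print(frame.dtypes["NewCol"])  # object

try:
    # Roughly what the input validation attempts with the dataframe
    np.array(frame, dtype=np.float32)
    failed = False
except ValueError as exc:
    failed = True
    print("ValueError:", exc)
```

Spreading the counts over separate columns, as the stacked matrix does, avoids the problem because every cell then holds a single number.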
Thank you very much! Managed to get it working. I also needed to cast the dataframe to int, but everything else was spot on.
– Tallen86, Mar 23 at 16:22
Glad it helped!
– MaximeKan, Mar 23 at 18:17
answered Mar 22 at 1:46
MaximeKan