CountVectorizer values work alone in classifier, cannot get working when adding other features


I have a CSV of Twitter profile data containing: name, description, followers count, following count, and bot (the class I want to predict).

I have successfully trained a classification model using just the CountVectorizer values (xtrain) and bot (ytrain), but I have not been able to add this feature to my set of other features.

vectorizer = CountVectorizer()
CountVecTest = vectorizer.fit_transform(training_data.description.values.astype('U'))
CountVecTest = CountVecTest.toarray()
arr = sparse.coo_matrix(CountVecTest)
training_data["NewCol"] = arr.toarray().tolist()

rf = RandomForestClassifier(criterion='entropy', min_samples_leaf=10, min_samples_split=20)
rf = rf.fit(training_data[["followers_count","friends_count","NewCol","bot"]], training_data.bot)

ERROR:

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-54-7d67a6586592> in <module>()
 1 rf = RandomForestClassifier(criterion='entropy', min_samples_leaf=10, min_samples_split=20)
----> 2 rf = rf.fit(training_data[["followers_count","friends_count","NewCol","bot"]], training_data.bot)

D:_MyFiles_LibrariesDocumentsAnaconda3libsite-packagessklearnensembleforest.py in fit(self, X, y, sample_weight)
 245 """
 246 # Validate or convert input data
--> 247 X = check_array(X, accept_sparse="csc", dtype=DTYPE)
 248 y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)
 249 if sample_weight is not None:

D:_MyFiles_LibrariesDocumentsAnaconda3libsite-packagessklearnutilsvalidation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
 431 force_all_finite)
 432 else:
--> 433 array = np.array(array, dtype=dtype, order=order, copy=copy)
 434 
 435 if ensure_2d:

ValueError: setting an array element with a sequence.

I did some debugging:

print(type(training_data.NewCol))
print(type(training_data.NewCol[0]))
>>> <class 'pandas.core.series.Series'>
>>> <class 'numpy.ndarray'>

Any help would be appreciated.
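
For readers who want to see the failure in isolation, here is a minimal sketch, assuming a tiny made-up dataframe (the values are hypothetical, not from the real CSV). A column whose cells each hold a whole list has dtype object, so NumPy cannot turn the frame into a 2-D float array, which is the conversion `check_array` attempts in the traceback above.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame standing in for training_data.
df = pd.DataFrame({"followers_count": [10, 20]})

# Each cell of NewCol holds an entire count vector, so the column dtype is object.
df["NewCol"] = [[0, 1, 1], [1, 0, 1]]
newcol_dtype = df["NewCol"].dtype

# Roughly what scikit-learn does internally: convert the whole frame to floats.
try:
    np.array(df[["followers_count", "NewCol"]], dtype=np.float32)
    error_message = None
except (ValueError, TypeError) as exc:
    error_message = str(exc)

print(newcol_dtype)
print(error_message)
```

Each word count needs to be its own column (or its own column of a sparse matrix) rather than one cell containing a sequence.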

asked Mar 20 at 20:56

Tallen86

python scikit-learn classification text-classification countvectorizer


1 Answer
I would do this the other way around: instead of storing the vectorizer output in a dataframe column, append your other features to the vectorized matrix. Here is what I mean with a toy example:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
from scipy.sparse import hstack, csr_matrix

Suppose now you have your features in a dataframe called df and your labels in y_train:

df = pd.DataFrame({"a": [1, 2], "b": [2, 3], "c": ['we love cars', 'we love cakes']})
y_train = np.array([0,1])

You want to perform a text vectorization on column c and add the features a and b to your vectorization.

vectorizer = CountVectorizer()
CountVecTest = vectorizer.fit_transform(df.c)

CountVecTest.toarray()

This will return:

array([[0, 1, 1, 1],
 [1, 0, 1, 1]], dtype=int64)

CountVecTest is a SciPy sparse matrix, so what you need to do is append your other features to it as extra columns. Like this:

X_train = hstack([CountVecTest, csr_matrix(df[['a','b']])])

X_train.toarray()

This will return, as expected:

array([[0, 1, 1, 1, 1, 2],
 [1, 0, 1, 1, 2, 3]], dtype=int64)
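
If the numeric columns are not already a numeric dtype (for example, if they were read from a CSV as object), `csr_matrix` may refuse them. A minimal sketch, assuming the same toy `df` as above: cast the columns explicitly first, and ask `hstack` for CSR format so the result supports row slicing. The variable names `counts` and `numeric` are just for illustration.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({"a": [1, 2], "b": [2, 3], "c": ["we love cars", "we love cakes"]})

# Sparse word counts, one column per vocabulary term.
counts = CountVectorizer().fit_transform(df["c"])

# Cast the numeric columns explicitly before wrapping them in a sparse matrix.
numeric = csr_matrix(df[["a", "b"]].astype(np.int64).to_numpy())

# hstack returns COO by default; format="csr" gives a row-sliceable matrix.
X_train = hstack([counts, numeric], format="csr")

print(X_train.shape)
print(X_train.toarray())
```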

Then you can train your random forest:

rf = RandomForestClassifier(criterion='entropy', min_samples_leaf=10, min_samples_split=20)
rf.fit(X_train, y_train)

NB: In the code snippet you provided, you passed the label (the "bot" column) in as one of the training features. You should not do that: the model would simply read the answer from its input, and any score you measure would be meaningless.
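
Applied to the column names from the question, the whole thing might look like the sketch below. The rows are made up, `build_features` is a hypothetical helper, and I dropped the `min_samples_leaf`/`min_samples_split` settings only because the toy data has four rows. The points to note are that "bot" is used as the label only, and that unseen data goes through `vectorizer.transform` (not `fit_transform`) so it gets the same columns as the training matrix.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the CSV from the question.
training_data = pd.DataFrame({
    "description": ["we love cars", "we love cakes",
                    "buy followers now", "cheap followers here"],
    "followers_count": [120, 80, 5, 3],
    "friends_count": [100, 90, 2000, 1500],
    "bot": [0, 0, 1, 1],
})
numeric_cols = ["followers_count", "friends_count"]  # note: "bot" is not a feature


def build_features(frame, vectorizer, fit=False):
    """Stack word counts and numeric columns into one sparse matrix."""
    # astype(str) turns a missing description into the string "nan".
    text = frame["description"].astype(str)
    counts = vectorizer.fit_transform(text) if fit else vectorizer.transform(text)
    numeric = csr_matrix(frame[numeric_cols].astype("int64").to_numpy())
    return hstack([counts, numeric], format="csr")


vectorizer = CountVectorizer()
X_train = build_features(training_data, vectorizer, fit=True)
y_train = training_data["bot"].to_numpy()

rf = RandomForestClassifier(criterion="entropy", random_state=0)
rf.fit(X_train, y_train)

# Unseen profiles reuse the fitted vocabulary, so the column count matches.
new_profiles = pd.DataFrame({
    "description": ["we love cheap cars", "followers here"],
    "followers_count": [50, 1],
    "friends_count": [60, 3000],
})
X_new = build_features(new_profiles, vectorizer)
predictions = rf.predict(X_new)
print(predictions)
```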

answered Mar 22 at 1:46

MaximeKan


Thank you very much! I managed to get it working. I also needed to cast the dataframe to int, but everything else was spot on.

– Tallen86
Mar 23 at 16:22

Glad it helped!

– MaximeKan
Mar 23 at 18:17
