How to use sample_weight parameter for algorithms in sklearn


I have a very imbalanced dataset and I'm working on a classification task. I've tried several algorithms (Decision Trees, Naive Bayes, Logistic Regression), and for each of them I came across a parameter called sample_weight in scikit-learn.

Assume my dataset has around 100k positive data points and 20k negative data points,

i.e. about 83% positive labels and 17% negative labels.

From the docs, I assume this parameter is meant to tackle this kind of issue by giving more weight to the class with fewer data points, i.e. in an imbalanced dataset.

class_weight : dict or ‘balanced’, default: None

Weights associated with classes in the form class_label: weight. If
not given, all classes are supposed to have weight one. For
multi-output problems, a list of dicts can be provided in the same
order as the columns of y.

My question is: what would be the ideal class_weight values for the imbalanced dataset above, so that I can avoid techniques like oversampling or undersampling?

asked Mar 21 at 19:30

user214

python machine-learning scikit-learn

1 Answer
Set class_weight to 'balanced' so that the model is trained as if the classes were balanced.
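As a rough illustration of what the 'balanced' setting works out to, the scikit-learn docs describe it as n_samples / (n_classes * count_of_each_class). The sketch below applies that formula to the label counts from the question using only the standard library; the helper name `balanced_class_weights` is made up for this example and is not part of scikit-learn.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} (hypothetical helper)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Label counts from the question: 100k positives, 20k negatives.
y = [1] * 100_000 + [0] * 20_000
weights = balanced_class_weights(y)
print(weights)  # {1: 0.6, 0: 3.0}
```

With these weights each class contributes the same total weight (about 60,000), which is what "trained as if they were balanced" means here.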

Class weights are roughly equivalent to random oversampling. In my opinion, intelligent oversampling techniques such as SMOTE are more effective than adding weights to samples during training.
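To make the comparison with oversampling concrete, here is a small sketch under simplifying assumptions: a summed binary log-loss, an integer class weight, and deterministic duplication instead of random draws. The probabilities and labels are made up for illustration. Under those assumptions, weighting the minority class by 3 gives the same total loss as repeating each minority sample 3 times. Real random oversampling samples with replacement, so it matches this only approximately.

```python
import math


def log_loss_sum(probs, labels, weights=None):
    """Sum of per-sample binary log-losses, optionally weighted."""
    if weights is None:
        weights = [1.0] * len(labels)
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total


# Toy data (made up): predicted P(y=1) and the true labels.
probs = [0.9, 0.8, 0.7, 0.3]
labels = [1, 1, 1, 0]

# Option A: give the minority class (label 0) a weight of 3.
class_weight = {1: 1, 0: 3}
weighted = log_loss_sum(probs, labels, [class_weight[y] for y in labels])

# Option B: repeat each minority sample 3 times instead of weighting it.
dup_probs, dup_labels = [], []
for p, y in zip(probs, labels):
    dup_probs += [p] * class_weight[y]
    dup_labels += [y] * class_weight[y]
duplicated = log_loss_sum(dup_probs, dup_labels)

print(math.isclose(weighted, duplicated))  # True
```

The weighted version evaluates 4 samples while the duplicated version evaluates 6, which is where the extra training cost of oversampling comes from.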

However, oversampling techniques carry an added computational cost, because the model has to be trained on a larger dataset. Class weighting, on the other hand, adds no computational cost. Unless I'm training a very computationally expensive model, I usually prefer SMOTE.
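For reference, a minimal sketch of how the weighting route might look for the three model families mentioned in the question. The data is synthetic (an assumption, scaled down to 1000 positives and 200 negatives). LogisticRegression and DecisionTreeClassifier accept class_weight in the constructor, while GaussianNB has no such parameter and takes per-sample weights through sample_weight in fit().

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic stand-in data (an assumption): 1000 positives, 200 negatives.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(1.0, 1.0, size=(1000, 2)),
    rng.normal(-1.0, 1.0, size=(200, 2)),
])
y = np.array([1] * 1000 + [0] * 200)

# Estimators with a class_weight parameter take it in the constructor.
log_reg = LogisticRegression(class_weight="balanced").fit(X, y)
tree = DecisionTreeClassifier(class_weight="balanced", random_state=0).fit(X, y)

# GaussianNB has no class_weight; pass per-sample weights to fit() instead.
sample_weight = compute_sample_weight(class_weight="balanced", y=y)
nb = GaussianNB().fit(X, y, sample_weight=sample_weight)
```

Here compute_sample_weight simply expands the balanced class weights into one weight per row, so both routes express the same reweighting.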

edited Mar 22 at 13:57

answered Mar 21 at 19:36

Djib2011


Do I have to use SMOTE on the whole dataset (before splitting) or just on X_train?

– user214
Mar 21 at 19:50

You should only use SMOTE on the training set. Any evaluation performed should be on a set that hasn't been resampled, or else your results won't be reliable!

– Djib2011
Mar 21 at 21:25
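The order of operations described in this comment can be sketched as follows. To keep the example dependency-free, it uses plain random oversampling as a stand-in for SMOTE (SMOTE itself synthesizes new points and lives in the separate imbalanced-learn package). The helper names and toy data are made up for illustration. The point is only that the split happens first and the test set is never resampled.

```python
import random


def split_indices(n, test_fraction, rng):
    """Shuffle indices 0..n-1 and return (train_indices, test_indices)."""
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = int(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def random_oversample(rows, labels, rng):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows += group + extra
        out_labels += [label] * target
    return out_rows, out_labels


rng = random.Random(0)

# Toy data (made up): 100 positives and 20 negatives, one feature each.
X = [[float(i)] for i in range(120)]
y = [1] * 100 + [0] * 20

# 1. Split first, on the original (imbalanced) data.
train_idx, test_idx = split_indices(len(X), 0.25, rng)
X_train = [X[i] for i in train_idx]
y_train = [y[i] for i in train_idx]
X_test = [X[i] for i in test_idx]
y_test = [y[i] for i in test_idx]

# 2. Resample the training split only; X_test and y_test stay untouched.
X_train_res, y_train_res = random_oversample(X_train, y_train, rng)
```

Evaluation would then use X_test and y_test as they are, so the reported metrics reflect the original class distribution.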

You mentioned that techniques like sample_weight are computationally expensive. Isn't SMOTE doing the same thing, i.e. oversampling?

– user214
Mar 21 at 23:59

Sorry, I wrote that wrong. I meant that oversampling is more computationally expensive because it trains the model on a larger dataset. Sample weights don't add computational cost, but they usually don't perform as well as oversampling techniques like SMOTE. If your model isn't very computationally expensive as is, I'd recommend oversampling.

– Djib2011
Mar 22 at 14:00
