How to use sample_weight parameter for algorithms in sklearn

I have a very imbalanced dataset and I'm performing a classification task. I've tried several algorithms (Decision Trees, Naive Bayes, Logistic Regression), and for each of them I've come across a parameter called sample_weight in scikit-learn.

Assume my dataset has around 100k positive data points and 20k negative data points, i.e. about 83% positive labels and 17% negative labels.

From the docs, I assume these weights are used to tackle this issue by giving more weight to the class with fewer data points, i.e. the minority class in an imbalanced dataset.

class_weight : dict or ‘balanced’, default: None

Weights associated with classes in the form class_label: weight. If
not given, all classes are supposed to have weight one. For
multi-output problems, a list of dicts can be provided in the same
order as the columns of y.

My question is: what would be the ideal class_weight values for the imbalanced dataset above, so that I can avoid techniques like oversampling or undersampling?

python machine-learning scikit-learn

asked Mar 21 at 19:30 by user214

1 Answer

Set class_weight to 'balanced' so that the model is trained as if the classes were balanced.
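As a rough sketch of what 'balanced' works out to for the numbers in the question: the scikit-learn docs describe the heuristic as n_samples / (n_classes * np.bincount(y)). The helper name balanced_class_weights and the 1/0 label encoding below are assumptions of mine, used only for illustration.

```python
def balanced_class_weights(class_counts):
    """Apply the 'balanced' heuristic: n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Counts from the question; labels 1 (positive) and 0 (negative) are assumed.
counts = {1: 100_000, 0: 20_000}
weights = balanced_class_weights(counts)
print(weights)  # {1: 0.6, 0: 3.0}
```

So the minority class ends up with five times the weight of the majority class. Passing class_weight='balanced' lets scikit-learn compute this for you; a dict like the one above could be passed as class_weight instead if you want to tune the ratio by hand.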



Class weights have roughly the same effect as random oversampling. In my opinion, more intelligent oversampling techniques such as SMOTE are more effective than adding weights to samples during training.

However, oversampling techniques have an added computational cost, because the model needs to be trained on a larger dataset. Class weighting, on the other hand, adds no computational cost. Unless I'm training a very computationally expensive model, I usually prefer SMOTE.
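Since the title asks about sample_weight specifically: class_weight is set on the estimator, while sample_weight is passed to fit with one weight per row. Here is a minimal sketch, assuming a made-up toy dataset (the data, the 10:2 ratio and the choice of DecisionTreeClassifier are mine, not from the question):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Toy imbalanced data (made up): 10 positives, 2 negatives.
rng = np.random.RandomState(0)
X = rng.normal(size=(12, 3))
y = np.array([1] * 10 + [0] * 2)

# One weight per row, derived with the same 'balanced' heuristic.
sample_weight = compute_sample_weight(class_weight="balanced", y=y)

# Option A: per-sample weights passed to fit().
clf_a = DecisionTreeClassifier(random_state=0)
clf_a.fit(X, y, sample_weight=sample_weight)

# Option B: class weights set on the estimator itself.
clf_b = DecisionTreeClassifier(class_weight="balanced", random_state=0)
clf_b.fit(X, y)
```

Either way, each class contributes the same total weight during training. sample_weight is the more general of the two, because it also lets you weight individual rows for reasons other than their class.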






answered Mar 21 at 19:36 by Djib2011 (edited Mar 22 at 13:57)

Do I have to use SMOTE on the whole dataset (before splitting) or just on X_train? – user214 Mar 21 at 19:50

You should only use SMOTE on the training set. Any evaluation performed should be on a set that hasn't been resampled, or else your results won't be reliable! – Djib2011 Mar 21 at 21:25
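A sketch of that order of operations, using plain random oversampling from sklearn.utils.resample as a stand-in for SMOTE (SMOTE itself lives in the separate imbalanced-learn package). The toy data and variable names are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data (made up): 100 positives, 20 negatives.
rng = np.random.RandomState(0)
X = rng.normal(size=(120, 4))
y = np.array([1] * 100 + [0] * 20)

# 1. Split first, so the test set keeps the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 2. Oversample the minority class in the training set only.
minority = y_train == 0
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_train[minority], y_train[minority],
    replace=True, n_samples=n_majority, random_state=0,
)
X_train_bal = np.vstack([X_train[~minority], X_min_up])
y_train_bal = np.concatenate([y_train[~minority], y_min_up])

# 3. Fit on (X_train_bal, y_train_bal); evaluate on the untouched (X_test, y_test).
```

The same pattern should apply with SMOTE: resample X_train and y_train only, and never touch X_test.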







You mentioned that techniques like sample_weight are computationally expensive. Isn't SMOTE also doing the same thing, i.e. oversampling? – user214 Mar 21 at 23:59

Sorry, I wrote that wrong. I meant that oversampling is more computationally expensive because it trains the model on a larger dataset. Sample weights don't have an added computational cost, but they usually don't perform as well as oversampling techniques like SMOTE. If your model isn't very computationally expensive as is, I'd recommend oversampling. – Djib2011 Mar 22 at 14:00