Binary Crossentropy to penalize all components of one-hot vector


















I understand that binary cross-entropy is the same as categorical cross-entropy in the case of two classes.

It is also clear to me what softmax is.

Therefore, I see that categorical cross-entropy penalizes only the one component (probability) that should be 1.

But why can't, or shouldn't, I use binary cross-entropy on a one-hot vector?



Normal case for single-label, mutually exclusive multiclass classification:

pred = [0.1 0.3 0.2 0.4]
label (one-hot) = [0 1 0 0]
cost function: categorical cross-entropy
= sum(label * -log(pred))  // only the 1-label contributes
= -log(0.3)
= 0.523  // log base 10

Why not this instead?

pred = [0.1 0.3 0.2 0.4]
label (one-hot) = [0 1 0 0]
cost function: binary cross-entropy
= sum(-label * log(pred) - (1 - label) * log(1 - pred))
= -log(0.3) - log(1 - 0.1) - log(1 - 0.2) - log(1 - 0.4)
= 0.887  // log base 10


I see that in binary cross-entropy, zero is itself a target class, corresponding to the following one-hot encoding:

target class 0 -> [1 0]
target class 1 -> [0 1]

In summary: why do we compute the negative log-likelihood only for the true class? Why don't we penalize the other classes, whose probabilities should be zero?

If one applied binary cross-entropy to a one-hot vector, the probabilities assigned to the expected-zero labels would be penalized too.










machine-learning classification multilabel-classification one-hot-encoding cross-entropy








edited Nov 13 '17 at 15:47
Maxim

asked May 23 '17 at 14:55
hallo02






















1 Answer














See my answer on a similar question. In short, the binary cross-entropy formula doesn't make sense for a one-hot vector. Depending on the task, you can either apply softmax cross-entropy for two or more classes, or use a vector of (independent) probabilities as the label.

But why can't, or shouldn't, I use binary cross-entropy on a one-hot vector?

What you compute is the binary cross-entropy of 4 independent features:



pred = [0.1 0.3 0.2 0.4]
label = [0 1 0 0]

The model predicted that the first feature is on with 10% probability, the second feature is on with 30% probability, and so on. The target label is interpreted this way: all features are off except the second one. Note that [1, 1, 1, 1] is a perfectly valid label as well, even though it is not a one-hot vector, and pred=[0.5, 0.8, 0.7, 0.1] is a valid prediction, i.e. the sum doesn't have to equal one.
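To make the "independent features" reading concrete, here is a minimal NumPy sketch that evaluates both formulas from the question on the same numbers. The function names and the `log` parameter are illustrative choices, not any framework's API, and the losses are summed over components as in the question (Keras, for instance, averages instead). The question's figures correspond to base-10 logarithms, whereas most frameworks use the natural log.

```python
import numpy as np

pred = np.array([0.1, 0.3, 0.2, 0.4])
label = np.array([0.0, 1.0, 0.0, 0.0])

def categorical_crossentropy(label, pred, log=np.log):
    # Only the component whose label is 1 contributes to the sum.
    return float(np.sum(label * -log(pred)))

def binary_crossentropy(label, pred, log=np.log):
    # Sum of 4 independent Bernoulli negative log-likelihoods:
    # every component contributes, whether its label is 0 or 1.
    return float(np.sum(-label * log(pred) - (1.0 - label) * log(1.0 - pred)))

# Base-10 logs reproduce the numbers in the question (0.523 and 0.887).
cce_10 = categorical_crossentropy(label, pred, log=np.log10)
bce_10 = binary_crossentropy(label, pred, log=np.log10)

# Natural logs, the usual convention in ML frameworks.
cce_nat = categorical_crossentropy(label, pred)
bce_nat = binary_crossentropy(label, pred)

# A multi-hot label with a prediction that does not sum to one is
# perfectly well-defined for the binary formula.
multi_label = np.array([1.0, 1.0, 1.0, 1.0])
multi_pred = np.array([0.5, 0.8, 0.7, 0.1])
bce_multi = binary_crossentropy(multi_label, multi_pred)

print(cce_10, bce_10, cce_nat, bce_nat, bce_multi)
```

Nothing in `binary_crossentropy` requires `label` to be one-hot or `pred` to sum to one, which is why it fits the multi-label setting.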



In other words, your computation is valid, but for a completely different problem: multi-label non-exclusive binary classification.
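The difference between the two problems shows up in the output layer. The sketch below uses made-up logits (an assumption for illustration only) to contrast softmax, whose outputs are coupled and sum to one, with element-wise sigmoid, whose outputs are independent. It also evaluates the standard gradient of each loss with respect to the logits, which is `p - y` in both cases: even though softmax cross-entropy only reads the true-class probability, its gradient still pushes the other classes down.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; outputs sum to one.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(z):
    # Applied element-wise; each output is an independent probability.
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([-1.0, 0.5, 0.0, 1.0])  # hypothetical network outputs
label = np.array([0.0, 1.0, 0.0, 0.0])

p_soft = softmax(logits)  # mutually exclusive classes
p_sig = sigmoid(logits)   # independent on/off features

# Gradient of softmax cross-entropy w.r.t. the logits: p - y.
# The entries for the "should be zero" classes are positive, so a
# gradient-descent step lowers those logits as well.
grad_softmax_ce = p_soft - label

# Gradient of summed sigmoid cross-entropy w.r.t. the logits has the
# same form, but each component depends only on its own logit.
grad_sigmoid_ce = p_sig - label

print(p_soft.sum(), p_sig.sum())
print(grad_softmax_ce, grad_sigmoid_ce)
```

Because the softmax probabilities are tied together by the sum-to-one constraint, raising the true-class probability necessarily lowers the rest, so an explicit penalty on the zero-labeled components is not needed in the mutually exclusive case.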



See also the difference between softmax and sigmoid cross-entropy loss functions in TensorFlow.






answered Nov 13 '17 at 13:53
Maxim




























