Bin and Calculate Entropy using Numpy


I am attempting to perform the following task:



For a given column of data (stored as a NumPy array), "bin" the data in a greedy fashion, where I test the current value and the next one in order to calculate the entropy of that candidate bin.



Pseudocode would look like this:



split_data(feature):
    BestValues = 0
    BestGain = 0
    For Each Value in Feature:
        Calculate CurrentGain As InformationGain(Entropy(Feature) - Entropy(Value + Next Value))
        If CurrentGain > BestGain:
            Set BestValues = Value, Next Value
            Set BestGain = CurrentGain

    return BestValues


I currently have Python code that looks like the following:



import math
import numpy

# This function finds the total entropy for a given dataset
def entropy(dataset):
    # Declare variables
    total_entropy = 0
    # Determine the classes (the unique labels in the last column)
    classes = numpy.unique(dataset[:, -1])

    # Loop through each "class", or label
    for aclass in classes:
        # Create temp variables
        currFreq = 0
        currProb = 0
        # Loop through each row in the dataset
        for row in dataset:
            # If that row has the same label as the current class, increment the frequency
            if aclass == row[-1]:
                currFreq = currFreq + 1
            # If not, continue
            else:
                continue
        # The current probability is the # of occurrences / total # of rows
        currProb = currFreq / len(dataset)
        # If the frequency is 0 the contribution is 0; otherwise use the entropy formula
        if currFreq > 0:
            total_entropy = total_entropy + (-currProb * math.log(currProb, 2))
        else:
            return 0

    # Return the total entropy
    return total_entropy

# This function gets the (conditional) entropy for a single attribute
def entropy_by_attribute(dataset, feature):
    # The attribute is the specific feature (column) of the dataset
    attribute = dataset[:, feature]
    # The target_variables are the unique labels in the last column
    target_variables = numpy.unique(dataset[:, -1])
    # The unique values in the column we are evaluating
    variables = numpy.unique(attribute)
    # The entropy for the attribute in question
    entropy_attribute = 0

    # Loop through each of the possible values
    for variable in variables:
        denominator = 0
        entropy_each_feature = 0
        # For every row in the column: if it equals the current value, increase the denominator
        for row in attribute:
            if row == variable:
                denominator = denominator + 1

        # Now loop through each class
        for target_variable in target_variables:
            numerator = 0
            # Count rows whose feature equals the current value and whose label equals the current class
            for row in dataset:
                if row[feature] == variable and row[-1] == target_variable:
                    numerator = numerator + 1

            # Use eps to protect from divide by 0
            fraction = numerator / (denominator + numpy.finfo(float).eps)
            entropy_each_feature = entropy_each_feature + (-fraction * math.log(fraction + numpy.finfo(float).eps, 2))

        # Weight this value's entropy by how often the value occurs
        big_fraction = denominator / len(dataset)
        entropy_attribute = entropy_attribute + (big_fraction * entropy_each_feature)

    # Return that entropy
    return entropy_attribute

# This function calculates the information gain
def infogain(dataset, feature):
    # Grab the entropy of the whole dataset
    total_entropy = entropy(dataset)
    # Grab the entropy for the feature being evaluated
    feature_entropy = entropy_by_attribute(dataset, feature)
    # Calculate the information gain
    infogain = float(abs(total_entropy - feature_entropy))

    # Return the infogain
    return infogain


However, I am unsure of how to do the following:



  1. For a feature, grab its total entropy

  2. For a single feature, determine entropy using a binning technique where I am testing two values

I cannot work out how to write code that accomplishes 1 and 2, and I am struggling. I will continue to update this question with any progress I make.
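One way to read the pseudocode above is: for each pair of adjacent sorted values in the feature column, try a split between them and keep the pair that yields the best information gain. Below is a minimal sketch of that reading (my own illustration, not the code above); it assumes dataset is a numeric 2-D NumPy array whose last column holds the class labels, and label_entropy is a hypothetical helper which, applied to a single feature column, also gives that feature's total entropy (point 1).

import numpy

def label_entropy(labels):
    # Entropy of a 1-D array: -sum(p * log2(p)) over the value frequencies
    _, counts = numpy.unique(labels, return_counts=True)
    probs = counts / len(labels)
    return float(-(probs * numpy.log2(probs)).sum())

def split_data(dataset, feature):
    # Greedy search: place a threshold between each pair of adjacent sorted
    # feature values and keep the pair whose split gives the highest gain
    values = numpy.unique(dataset[:, feature])   # unique values, already sorted
    labels = dataset[:, -1]
    base_entropy = label_entropy(labels)         # entropy before splitting
    best_gain = 0.0
    best_values = None
    for value, next_value in zip(values, values[1:]):
        threshold = (value + next_value) / 2.0
        left = labels[dataset[:, feature] <= threshold]
        right = labels[dataset[:, feature] > threshold]
        if len(left) == 0 or len(right) == 0:
            continue
        # Point 2: weighted entropy of the two bins produced by this candidate split
        weighted = (len(left) / len(labels)) * label_entropy(left) \
                 + (len(right) / len(labels)) * label_entropy(right)
        gain = base_entropy - weighted
        if gain > best_gain:
            best_gain = gain
            best_values = (value, next_value)
    return best_values, best_gain

Splitting at the midpoint of the two adjacent values is only one choice; splitting exactly at Next Value would match the pseudocode equally well.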










python numpy statistics






asked Mar 23 at 23:24

Jerry M.
1 Answer






The following function performs the entropy calculation per column (feature):



import math

def entropy(attributes, dataset, targetAttr):
    freq = {}
    entropy = 0.0
    index = 0
    # Locate the position of the target attribute in the attribute list
    for item in attributes:
        if (targetAttr == item):
            break
        else:
            index = index + 1
    index = index - 1
    # Count how often each value occurs in that column
    for item in dataset:
        if ((item[index]) in freq):
            # Increase the count for this value
            freq[item[index]] += 1.0
        else:
            # Initialize the count for this value
            freq[item[index]] = 1.0

    # Entropy is -sum(p * log2(p)) over the value frequencies
    for count in freq.values():
        entropy = entropy + (-count / len(dataset)) * math.log(count / len(dataset), 2)
    return entropy
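Since the question asks for NumPy, the frequency dictionary above could also be replaced with numpy.unique; a minimal sketch (my own illustration, not the posted code) that returns the entropy of every column of a 2-D NumPy array:

import numpy

def entropy_per_column(dataset):
    # For each column: unique values and their counts, then -sum(p * log2(p))
    result = []
    for column in dataset.T:
        _, counts = numpy.unique(column, return_counts=True)
        probs = counts / len(column)
        result.append(float(-(probs * numpy.log2(probs)).sum()))
    return result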





answered Apr 29 at 20:04

Jerry M.




























