Bin and Calculate Entropy using Numpy
I am attempting to perform the following task:
For a given column of data (stored as a numpy array), "bin" the data in a greedy fashion, where I test the current value and the next one together in order to calculate the entropy of that pair.
Pseudocode would look like this:
split_data(feature):
    BestValues = 0
    BestGain = 0
    For Each Value in Feature:
        Calculate CurrentGain As InformationGain(Entropy(Feature) - Entropy(Value + Next Value))
        If CurrentGain > BestGain:
            Set BestValues = Value, Next Value
            Set BestGain = CurrentGain
    return BestValues
I currently have Python code that looks like the following:
import math
import numpy

# This function finds the total entropy for a given dataset
def entropy(dataset):
    # Declare variables
    total_entropy = 0
    # Determine the classes and the number of items in each class
    classes = numpy.unique(dataset[:, -1])
    # Loop through each "class", or label
    for aclass in classes:
        # Create temp variables
        currFreq = 0
        currProb = 0
        # Loop through each row in the dataset
        for row in dataset:
            # If that row has the same label as the current class, increment the frequency
            if aclass == row[-1]:
                currFreq = currFreq + 1
            # If not, continue
            else:
                continue
        # The current probability is the # of occurrences / total occurrences
        currProb = currFreq / len(dataset)
        # If the frequency is 0, the entropy contribution is 0. If not, use the entropy formula
        if currFreq > 0:
            total_entropy = total_entropy + (-currProb * math.log(currProb, 2))
        else:
            return 0
    # Return the total entropy
    return total_entropy

# This function gets the entropy for a single attribute
def entropy_by_attribute(dataset, feature):
    # The attribute is the specific feature (column) of the dataset
    attribute = dataset[:, feature]
    # The target_variables are the unique labels in the last column
    target_variables = numpy.unique(dataset[:, -1])
    # The unique values in the column we are evaluating
    variables = numpy.unique(attribute)
    # The entropy for the attribute in question
    entropy_attribute = 0
    # Loop through each of the possible values
    for variable in variables:
        denominator = 0
        entropy_each_feature = 0
        # For every row in the column
        for row in attribute:
            # If it is equal to the current value we are evaluating, increase the denominator
            if row == variable:
                denominator = denominator + 1
        # Now loop through each class
        for target_variable in target_variables:
            numerator = 0
            # Loop through the dataset
            for row in dataset:
                # If the current row's feature value equals the value being evaluated
                # and its label equals the label being evaluated, increase the numerator
                if row[feature] == variable and row[-1] == target_variable:
                    numerator = numerator + 1
            # Use eps to protect against divide by 0
            fraction = numerator / (denominator + numpy.finfo(float).eps)
            entropy_each_feature = entropy_each_feature + (-fraction * math.log(fraction + numpy.finfo(float).eps, 2))
        # Weight this value's entropy by how often the value occurs
        big_fraction = denominator / len(dataset)
        entropy_attribute = entropy_attribute + (big_fraction * entropy_each_feature)
    # Return that entropy
    return entropy_attribute

# This function calculates the information gain
def infogain(dataset, feature):
    # Grab the entropy of the total dataset
    total_entropy = entropy(dataset)
    # Grab the entropy for the current feature being evaluated
    feature_entropy = entropy_by_attribute(dataset, feature)
    # Calculate the information gain
    infogain = float(abs(total_entropy - feature_entropy))
    # Return the information gain
    return infogain
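For reference, this is roughly how I call these functions at the moment, on a tiny made-up array where the last column is the label (the numbers are placeholders, not my real data):

import numpy

# Toy dataset: one feature column plus a label column (placeholder values)
data = numpy.array([[2.0, 0],
                    [3.5, 0],
                    [6.1, 1],
                    [7.8, 1]])

print(entropy(data))       # entropy of the label column
print(infogain(data, 0))   # information gain for feature 0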
However, I am unsure how to do the following:
1. For a feature, grab its total entropy.
2. For a single feature, determine the entropy using a binning technique where I test two values at a time.
I cannot work out how to write code to accomplish 1 and 2, and I am struggling. I will continue to update this question with any progress I make; a rough sketch of what I am aiming for is below.
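For item 1, numpy.unique with return_counts=True should give the value frequencies for a column directly, and the greedy pair test from the pseudocode could then look roughly like the sketch below (the column_entropy helper and this split_data are only an illustration of the idea, not tested code):

import numpy

def column_entropy(column):
    # Entropy of a 1-D array, computed from the value counts
    _, counts = numpy.unique(column, return_counts=True)
    probs = counts / len(column)
    return -numpy.sum(probs * numpy.log2(probs))

def split_data(dataset, feature):
    # Greedy pass over consecutive value pairs in one feature column,
    # keeping the pair whose two-value "bin" gives the largest gain
    column = dataset[:, feature]
    feature_entropy = column_entropy(column)
    best_gain = -numpy.inf
    best_values = None
    for current, following in zip(column, column[1:]):
        pair_entropy = column_entropy(numpy.array([current, following]))
        current_gain = feature_entropy - pair_entropy
        if current_gain > best_gain:
            best_gain = current_gain
            best_values = (current, following)
    return best_values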
python numpy statistics
asked Mar 23 at 23:24 by Jerry M.
1 Answer
The following function computes the entropy for a single column (feature):
import math

def entropy(attributes, dataset, targetAttr):
    freq = {}          # maps each value to its frequency count
    entropy = 0.0
    index = 0
    # Find the position of the target attribute in the attribute list
    for item in attributes:
        if targetAttr == item:
            break
        else:
            index = index + 1
    index = index - 1
    # Count how often each value appears in that column
    for item in dataset:
        if item[index] in freq:
            # Increase the frequency count
            freq[item[index]] += 1.0
        else:
            # Initialize the count for a value seen for the first time
            freq[item[index]] = 1.0
    # Sum -p * log2(p) over the value frequencies
    for count in freq.values():
        entropy = entropy + (-count / len(dataset)) * math.log(count / len(dataset), 2)
    return entropy
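For example, it can be tried on a small made-up dataset like this (the attribute names and rows below are purely illustrative, and the exact column used depends on how the attributes list lines up with the dataset columns):

attributes = ['outlook', 'play']   # hypothetical column names
dataset = [['sunny', 'no'],
           ['sunny', 'no'],
           ['rain', 'yes'],
           ['rain', 'yes']]

# Call the entropy() function defined above on this toy data
print(entropy(attributes, dataset, 'play'))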
answered Apr 29 at 20:04 by Jerry M.