What is the problem with the shape of the ROC curve with a low AUC (0.4)?
I'm trying to plot a ROC curve. I have 75 data points and I considered only 10 features. I'm getting a staircase-like plot (see below). Is this due to the small data set? Can we add more points to improve the curve?
The AUC is very low (0.44). Is there any way to upload a CSV file?
species1 = readtable('target.csv');     % class labels
species1 = table2cell(species1);
meas1 = readtable('feature.csv');       % feature matrix
meas1 = meas1(:,1:10);                  % keep only the first 10 features
meas1 = table2array(meas1);
numObs = length(species1);
half = floor(numObs/2);
training = meas1(1:half,:);             % first half: training set
trainingSpecies = species1(1:half);
sample = meas1(half+1:end,:);           % second half: test set
trainingSpecies = cell2mat(trainingSpecies);
group = species1(half+1:end,:);
group = cell2mat(group);
SVMModel = fitcsvm(training,trainingSpecies);
[label,score] = predict(SVMModel,sample);
[X,Y,T,AUC] = perfcurve(group,score(:,2),'1');
plot(X,Y,'LineWidth',3)
xlabel('False positive rate')
ylabel('True positive rate')
title('ROC for Classification')
matlab svm
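One thing worth checking when the AUC comes out below 0.5 (a hedged sanity check, not necessarily the cause here): predict returns one score column per class, ordered as in SVMModel.ClassNames, so the column passed to perfcurve must be the one for the positive class. A minimal sketch, assuming the positive class is labelled '1':
% Hedged sanity check: an AUC below 0.5 often means the score column
% for the wrong (negative) class was passed to perfcurve.
SVMModel.ClassNames                                % inspect the class order
posCol = 2;                                        % assumed: column 2 corresponds to class '1'
[Xchk,Ychk,~,AUCchk] = perfcurve(group, score(:,posCol), '1');
AUCchk                                             % the other column would give approximately 1 - AUCchk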
asked Mar 22 at 9:38 by leena s (edited Mar 23 at 4:37)
Perfcurve creates a threshold for every single point; when that happens, it will always be a stepwise plot. – Durkee, Mar 22 at 12:54
There's nothing to solve here. This is just how it works. You could run a smoothing function, I guess, but that would degrade the quality. – Durkee, Mar 22 at 13:27
Which smoothing function? – leena s, Mar 22 at 13:44
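To see that the staircase shape is expected rather than a bug, here is a minimal sketch (synthetic labels and scores, assumed purely for illustration) that reproduces it with perfcurve:
% Minimal illustration with synthetic data (assumed, not from the question):
% one threshold per distinct score makes the empirical ROC a step function,
% and with few observations the steps are large.
rng(0)                                   % reproducibility
labels = [ones(10,1); zeros(10,1)];      % 20 observations, balanced classes
scores = rand(20,1);                     % uninformative scores -> AUC near 0.5
[fpr,tpr,~,auc] = perfcurve(labels, scores, 1);
stairs(fpr, tpr), xlabel('False positive rate'), ylabel('True positive rate')
title(sprintf('Staircase ROC from 20 points (AUC = %.2f)', auc))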
1 Answer
As indicated by Durkee, the output of the perfcurve function will always be stepwise. In fact, the ROC curve is an empirical (as opposed to theoretical) cumulative distribution function (ECDF), and ECDFs are stepwise functions by definition (they compute the CDF on the values observed in the sample).
Usually, smoothing of the ROC curve is done via binning. You can either bin the score values and compute an approximate ROC curve, or bin the false positive rate (FPR) values obtained from the actual ROC curve (i.e. bin the X values generated by perfcurve()), which produces a smoothed version that preserves the area under the curve (AUC).
In the following example I show and compare the smoothed ROC curves obtained from these two options, which can be accomplished using the TVals and XVals options of the perfcurve function, respectively.
In each case, the binning is done so that we get approximately equal-sized bins (equal in the number of cases) using the tiedrank function. The values to use for the TVals and XVals options are then computed with the grpstats function as the max value in each bin of the original, pre-binned variable (scores or X, respectively).
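As a quick illustration of that binning step (a toy sketch with arbitrary values, not the question's data), the tiedrank trick assigns roughly equal numbers of observations to each bin:
% Toy illustration of equal-size binning via tiedrank (arbitrary values):
v = rand(20,1);                            % 20 arbitrary scores
nb = 4;                                    % number of bins for this toy example
bin = ceil(nb * tiedrank(v) / numel(v));   % bin indices 1..nb
accumarray(bin, 1)                         % ~5 observations per bin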
%% Reference for the original ROC curve example: https://www.mathworks.com/help/stats/perfcurve.html
load fisheriris
pred = meas(51:end,1:2);
resp = (1:100)'>50; % Versicolor = 0, virginica = 1
mdl = fitglm(pred,resp,'Distribution','binomial','Link','logit');
scores = mdl.Fitted.Probability;
[X,Y,T,AUC] = perfcurve(species(51:end,:),scores,'virginica');
AUC
%% Define the number of bins to use for smoothing
nbins = 10;
%% Option 1 (RED): Smooth the ROC curve by defining score thresholds (based on equal-size bins of the score).
scores_grp = ceil(nbins * tiedrank(scores(:,1)) / length(scores));
scores_thr = grpstats(scores, scores_grp, @max);
[X_grpScore,Y_grpScore,T_grpScore,AUC_grpScore] = perfcurve(species(51:end,:),scores,'virginica','TVals',scores_thr);
AUC_grpScore
%% Option 2 (GREEN) Smooth the ROC curve by binning the False Positive Rate (variable X of the perfcurve() output)
X_grp = ceil(nbins * tiedrank(X(:,1)) / length(X));
X_thr = grpstats(X, X_grp, @max);
[X_grpFPR,Y_grpFPR,T_grpFPR,AUC_grpFPR] = perfcurve(species(51:end,:),scores,'virginica','XVals',X_thr);
AUC_grpFPR
%% Plot
figure
plot(X,Y,'b.-'); hold on
plot(X_grpScore,Y_grpScore,'rx-')
plot(X_grpFPR,Y_grpFPR,'g.-')
xlabel('False positive rate')
ylabel('True positive rate')
title('ROC for Classification by Logistic Regression')
legend('Original ROC curve', ...
sprintf('Smoothed ROC curve in %d bins (based on score bins)', nbins), ...
sprintf('Smoothed ROC curve in %d bins (based on FPR bins)', nbins), ...
'Location', 'SouthEast')
The graphical output from this code is the original ROC curve (blue) overlaid with the score-binned smoothed curve (red) and the FPR-binned smoothed curve (green).
Note: if you look at the text output generated by the above code, you will notice that, as anticipated, the AUC values for the original ROC curve and the smoothed ROC curve based on FPR bins (green) coincide (AUC = 0.7918), whereas the AUC value for the smoothed ROC curve based on score bins (red) is quite a bit smaller than the original (AUC = 0.6342). The FPR approach should therefore be preferred as a smoothing technique for plotting purposes. Note, however, that the FPR approach requires computing the ROC curve twice: once on the original scores variable, and once on the binned FPR values (the X values of the first ROC calculation).
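A quick way to verify the AUC claim yourself (a one-line sketch, assuming X, Y, and AUC are the outputs of the first perfcurve call above) is to compare the reported AUC with the trapezoidal area under the returned points:
% Optional sanity check: the AUC reported by perfcurve equals the
% trapezoidal area under the (X,Y) points it returns.
trapz(X, Y)        % should match AUC (~0.7918 for the original curve)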
However, the second ROC calculation can be avoided, because the same smoothed ROC curve can be obtained by binning the X values and computing the max(Y) value in each bin, as shown in the following snippet:
%% Compute max(Y) on the binned X values
% Make a dataset with the X and Y variables as columns (for easier manipulation and grouping)
ds = dataset(X,Y);
% Compute equal-size bins on X and the corresponding MAX statistics
ds.X_grp = ceil(nbins * tiedrank(ds.X(:,1)) / size(ds.X,1));
ds_grp = grpstats(ds, 'X_grp', @max, 'DataVars', {'X','Y'});
% Add the smoothed curve to the previous plot
hold on
plot(ds_grp.max_X, ds_grp.max_Y, 'mx-')
You should now see the same plot as above, with the green curve overlaid by a magenta curve drawn with x markers.
answered Mar 25 at 3:19 by mastropi