How to understand the arguments of “data” and “subset” in randomForest R package?
Arguments

`data`: an optional data frame containing the variables in the model. By default the variables are taken from the environment which `randomForest` is called from.

`subset`: an index vector indicating which rows should be used. (NOTE: If given, this argument must be named.)

My questions:

1. Why is the `data` argument "optional"? If `data` is optional, where does the training data come from? And what exactly is the meaning of "By default the variables are taken from the environment which randomForest is called from"?
2. Why do we need the `subset` parameter? Let's say we have the `iris` data set. If I want to use the first 100 rows as the training data set, I just select `training_data <- iris[1:100,]`. Why bother? What's the benefit of using `subset`?
r random-forest
asked Mar 27 at 16:02
Raymond Lucky
1 Answer
This is not an uncommon methodology, and certainly not unique to `randomForest`.

    mpg <- mtcars$mpg
    disp <- mtcars$disp
    lm(mpg~disp)
    # Call:
    # lm(formula = mpg ~ disp)
    # Coefficients:
    # (Intercept)         disp
    #    29.59985     -0.04122

So when `lm` (in this case) is attempting to resolve the variables referenced in the formula `mpg~disp`, it looks at `data` if provided, then in the calling environment. Further example:

    rm(mpg,disp)
    mpg2 <- mtcars$mpg
    lm(mpg2~disp)
    # Error in eval(predvars, data, env) : object 'disp' not found
    lm(mpg2~disp, data=mtcars)
    # Call:
    # lm(formula = mpg2 ~ disp, data = mtcars)
    # Coefficients:
    # (Intercept)         disp
    #    29.59985     -0.04122

(Notice that `mpg2` is not in `mtcars`, so this used both methods for finding the data. I don't use this functionality, preferring the resilient step of providing all data in the call; it is not difficult to think of examples where reproducibility suffers if this is not the case.)

Similarly, many modeling functions (including `lm`) accept a `subset=` argument, so the fact that `randomForest` includes it is consistent. I believe it is merely a convenience argument, as the following are roughly equivalent:

    lm(mpg~disp, data=mtcars, subset= cyl==4)
    lm(mpg~disp, data=mtcars[mtcars$cyl == 4,])
    mt <- mtcars[ mtcars$cyl == 4, ]
    lm(mpg~disp, data=mt)

The use of `subset` allows slightly simpler referencing (`cyl` versus `mtcars$cyl`), and its utility is compounded when the number of referenced variables increases (i.e., for "code golf" purposes). But this could also be done with other mechanisms such as `with`, so ... mostly personal preference.
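To make the "roughly equivalent" claim concrete, here is a small sketch that fits the same model three ways and compares the coefficients. The object names (`fit_subset`, `fit_index`, `fit_with`, `same_coef`, `n_used`) are made up for illustration; only base R and the built-in `mtcars` data are assumed.

```r
# Fit mpg ~ disp on the four-cylinder cars only, three different ways.
fit_subset <- lm(mpg ~ disp, data = mtcars, subset = cyl == 4)   # subset= argument
fit_index  <- lm(mpg ~ disp, data = mtcars[mtcars$cyl == 4, ])   # index the data first
fit_with   <- with(mtcars, lm(mpg ~ disp, subset = cyl == 4))    # no data=, via with()

# Compare the estimated coefficients against the subset= version.
same_coef <- c(
  index = isTRUE(all.equal(coef(fit_subset), coef(fit_index))),
  with  = isTRUE(all.equal(coef(fit_subset), coef(fit_with)))
)
print(same_coef)

# Number of rows actually used in the fit (the four-cylinder cars).
n_used <- nobs(fit_subset)
print(n_used)
```

In the `with()` version there is no `data=` at all: the formula's variables and `cyl` are found in the environment that `with()` builds from `mtcars`, which is the same lookup mechanism described above.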
Edit: as joran pointed out, `randomForest` (and others, but notably not `lm`) can be called with either a formula, which is where you'd typically use the `data` argument, or by specifying the predictor/response arguments separately with the arguments `x` and `y`, as in the following examples taken from `?randomForest` (ignore the other arguments being inconsistent):

    iris.rf <- randomForest(Species ~ ., data=iris, importance=TRUE, proximity=TRUE)
    iris.rrf <- randomForest(iris[-1], iris[[1]], ntree=101, proximity=TRUE, oob.prox=FALSE)
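As a rough sketch of how the two calling styles relate to `subset`, the following trains on the same 100 randomly chosen rows of `iris` both ways. The names `train_idx`, `rf_formula`, `rf_xy`, `n_formula` and `n_xy` are hypothetical, the seed and `ntree = 50` are arbitrary, and the `requireNamespace()` guard is only there so the snippet does nothing if the randomForest package is not installed.

```r
# Illustrative only: same training rows via the formula interface (subset=)
# and via the x/y interface (indexing by hand).
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(42)
  train_idx <- sample(nrow(iris), 100)  # hypothetical training rows

  # Formula interface: variables come from `data`, rows from the named `subset`.
  rf_formula <- randomForest::randomForest(Species ~ ., data = iris,
                                           subset = train_idx, ntree = 50)

  # x/y interface: select the predictor rows and the response yourself.
  rf_xy <- randomForest::randomForest(x = iris[train_idx, -5],
                                      y = iris$Species[train_idx], ntree = 50)

  # One out-of-bag prediction slot per training row in each forest.
  n_formula <- length(rf_formula$predicted)
  n_xy      <- length(rf_xy$predicted)
  print(c(formula = n_formula, xy = n_xy))
}
```

A random sample is used here rather than `1:100` because the first 100 rows of `iris` contain only two of the three species, which leaves an unused factor level in the response.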
Also might be helpful to emphasize that `randomForest` can be called with either a formula, which is where you'd typically use the `data` argument, or by specifying the predictor/response arguments separately with the arguments `x` and `y`.
– joran, Mar 27 at 16:17

@r2evans Thank you! Very clear answers. Now I get it.
– Raymond Lucky, Mar 27 at 16:21
edited Mar 27 at 16:34
answered Mar 27 at 16:14
r2evans