How to understand the arguments of “data” and “subset” in randomForest R package?

Arguments

data: an optional data frame containing the variables in the model. By default the variables are taken from the environment which randomForest is called from.

subset: an index vector indicating which rows should be used. (NOTE: If given, this argument must be named.)

My questions:

Why is the data argument "optional"? If data is optional, where does the training data come from? And what exactly does "By default the variables are taken from the environment which randomForest is called from" mean?

Why do we need the subset parameter? Say we have the iris data set. If I want to use the first 100 rows as the training data, I can just write training_data <- iris[1:100,]. Why bother with subset? What is its benefit?

asked Mar 27 at 16:02

Raymond Lucky


r random-forest


1 Answer
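This is a common pattern in R's modeling functions and certainly not unique to randomForest: when no data is given, the variables named in the formula are looked up in the calling environment.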
```
mpg <- mtcars$mpg
disp <- mtcars$disp
lm(mpg~disp)
# Call:
# lm(formula = mpg ~ disp)
# Coefficients:
# (Intercept) disp 
# 29.59985 -0.04122 
```
So when lm (in this case) resolves the variables referenced in the formula mpg~disp, it looks first in data, if provided, and then in the environment of the formula, which is typically the calling environment. A further example:
```
rm(mpg,disp)
mpg2 <- mtcars$mpg
lm(mpg2~disp)
# Error in eval(predvars, data, env) : object 'disp' not found
lm(mpg2~disp, data=mtcars)
# Call:
# lm(formula = mpg2 ~ disp, data = mtcars)
# Coefficients:
# (Intercept) disp 
# 29.59985 -0.04122 
```
Notice that mpg2 is not in mtcars, so the last call used both methods of finding the data: disp came from mtcars and mpg2 from the calling environment. I don't use this functionality myself, preferring the more resilient step of providing all data in the call; it is not difficult to think of examples where reproducibility suffers otherwise.
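A small sketch of that lookup order, using only base R. The shadowing global variable disp and the names fit_env and fit_data are illustrative assumptions, not part of the original answer. The point is that a column in data takes precedence over a variable of the same name in the calling environment:

```r
# Illustrative only: a global 'disp' that differs from the mtcars column.
disp <- rev(mtcars$disp)

# No data argument: 'disp' is resolved in the calling environment.
fit_env <- lm(mtcars$mpg ~ disp)

# data supplied: 'mpg' and 'disp' are resolved in mtcars first,
# so the global 'disp' is ignored.
fit_data <- lm(mpg ~ disp, data = mtcars)

coef(fit_env)   # based on the reversed global vector
coef(fit_data)  # based on the mtcars columns
```

The two fits give different coefficients even though both formulas mention disp, which is one way reproducibility can suffer when a model silently depends on whatever happens to be in the workspace.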

Similarly, many modeling functions (including lm) accept a subset= argument, so the fact that randomForest includes it is consistent. I believe it is merely a convenience argument, as the following are roughly equivalent:
```
lm(mpg~disp, data=mtcars, subset= cyl==4)

lm(mpg~disp, data=mtcars[mtcars$cyl == 4,])

mt <- mtcars[ mtcars$cyl == 4, ]
lm(mpg~disp, data=mt)
```
The use of subset allows slightly simpler referencing (cyl versus mtcars$cyl), and the saving grows with the number of variables referenced in the condition (i.e., it helps for "code golf" purposes). But the same could be done with other mechanisms such as with(), so it is mostly a matter of personal preference.
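To check the equivalence rather than take it on faith, here is a minimal sketch with base R only. It uses lm instead of randomForest to avoid the package dependency and the randomness of forest fitting; that randomForest's formula interface treats subset the same way is an assumption based on the documentation quoted in the question, not something this code verifies. The object names are illustrative.

```r
# Logical condition: subset= is evaluated inside data, so 'cyl' needs no prefix.
fit_subset <- lm(mpg ~ disp, data = mtcars, subset = cyl == 4)
fit_manual <- lm(mpg ~ disp, data = mtcars[mtcars$cyl == 4, ])
all.equal(coef(fit_subset), coef(fit_manual))

# Index vector, as in the question's "first 100 rows" case.
fit_idx  <- lm(Sepal.Length ~ Sepal.Width, data = iris, subset = 1:100)
fit_rows <- lm(Sepal.Length ~ Sepal.Width, data = iris[1:100, ])
all.equal(coef(fit_idx), coef(fit_rows))
```

Both comparisons should report matching coefficients, since each pair of calls fits the model to the same rows in the same order.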

Edit: as joran pointed out, randomForest (like some other functions, but notably not lm) can be called either with a formula, which is where you'd typically use the data argument, or by passing the predictors and the response separately through the x and y arguments, as in the following examples taken from ?randomForest (ignore the inconsistency in the other arguments):

```
iris.rf <- randomForest(Species ~ ., data=iris, importance=TRUE, proximity=TRUE)
iris.rrf <- randomForest(iris[-1], iris[[1]], ntree=101, proximity=TRUE, oob.prox=FALSE)
```

edited Mar 27 at 16:34

answered Mar 27 at 16:14

r2evans


2

It might also be helpful to emphasize that randomForest can be called with either a formula, which is where you'd typically use the data argument, or by specifying the predictor/response arguments separately with the arguments x and y.

– joran
Mar 27 at 16:17

1

@r2evans Thank you! Very clear answers. Now I get it.

– Raymond Lucky
Mar 27 at 16:21
