How to understand the arguments of “data” and “subset” in randomForest R package?






















Arguments



  • data: an optional data frame containing the variables in the model. By default the variables are taken from the environment which randomForest is called from.


  • subset: an index vector indicating which rows should be used. (NOTE: If given, this argument must be named.)


My questions:



  1. Why is the data argument "optional"? If data is optional, where does the training data come from? And what exactly does "By default the variables are taken from the environment which randomForest is called from" mean?


  2. Why do we need the subset parameter? Say we have the iris data set. If I want to use the first 100 rows as the training set, I can just write training_data <- iris[1:100,]. Why bother with subset? What is the benefit of using it?

















      r random-forest
















      asked Mar 27 at 16:02









      Raymond Lucky

























          1 Answer














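          1. This is a common pattern among R modelling functions, and certainly not unique to randomForest. When data is omitted, the variables named in the formula are looked up in the environment the function was called from: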



            mpg <- mtcars$mpg
            disp <- mtcars$disp
            lm(mpg~disp)
            # Call:
            # lm(formula = mpg ~ disp)
            # Coefficients:
            # (Intercept) disp
            # 29.59985 -0.04122


            So when lm (in this case) resolves the variables referenced in the formula mpg~disp, it looks first in data, if provided, and then in the environment of the formula, which is typically the calling environment. A further example:



            rm(mpg,disp)
            mpg2 <- mtcars$mpg
            lm(mpg2~disp)
            # Error in eval(predvars, data, env) : object 'disp' not found
            lm(mpg2~disp, data=mtcars)
            # Call:
            # lm(formula = mpg2 ~ disp, data = mtcars)
            # Coefficients:
            # (Intercept) disp
            # 29.59985 -0.04122


            (Notice that mpg2 is not in mtcars, so this call used both methods of finding the data.) I don't use this functionality myself; I prefer the more resilient step of providing all data in the call, since it is not difficult to think of examples where reproducibility suffers otherwise.
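To make the lookup order and the reproducibility concern concrete, here is a minimal sketch in base R. The helper fit_without_data() is hypothetical and exists only for illustration; the point is that the same formula text can refer to different data depending on where it is evaluated.

```r
# fit_without_data() is a hypothetical helper, not part of any package.
fit_without_data <- function() {
  # Local vectors holding only the first 10 cars
  mpg  <- mtcars$mpg[1:10]
  disp <- mtcars$disp[1:10]
  # No data= argument: mpg and disp are found in the formula's
  # environment, which here is this function's local scope.
  lm(mpg ~ disp)
}

fit_local <- fit_without_data()              # silently fitted on 10 rows
fit_data  <- lm(mpg ~ disp, data = mtcars)   # explicitly fitted on all rows

n_local <- nobs(fit_local)   # 10
n_data  <- nobs(fit_data)    # 32

# data= takes priority over a same-named variable in the workspace
disp <- rev(mtcars$disp)                       # misleading global copy
fit_shadow <- lm(mpg ~ disp, data = mtcars)    # still uses mtcars$disp
same_fit <- isTRUE(all.equal(coef(fit_shadow), coef(fit_data)))
rm(disp)
```

Both calls use the formula mpg ~ disp, yet they are fitted on different rows, which is why passing data explicitly tends to be the safer habit.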




          2. Similarly, many modelling functions (including lm) accept a subset= argument, so its presence in randomForest is consistent. I believe it is merely a convenience argument, as the following are roughly equivalent:



            lm(mpg~disp, data=mtcars, subset= cyl==4)

            lm(mpg~disp, data=mtcars[mtcars$cyl == 4,])

            mt <- mtcars[ mtcars$cyl == 4, ]
            lm(mpg~disp, data=mt)


            Using subset allows slightly simpler referencing (cyl versus mtcars$cyl), and its utility grows with the number of referenced variables (e.g., for "code golf" purposes). The same can be done with other mechanisms such as with(), so it is mostly personal preference.
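A quick way to check the "roughly equivalent" claim is to compare the fitted coefficients. The sketch below uses lm as a stand-in and only base R; the iris call mirrors the first-100-rows case from the question, using an arbitrary pair of numeric columns as an assumed example model.

```r
# subset= is evaluated inside data, so bare column names work
fit_subset <- lm(mpg ~ disp, data = mtcars, subset = cyl == 4)
fit_manual <- lm(mpg ~ disp, data = mtcars[mtcars$cyl == 4, ])
same_coef  <- isTRUE(all.equal(coef(fit_subset), coef(fit_manual)))

# The question's case: an index vector selecting the first 100 rows
fit_idx <- lm(Sepal.Length ~ Petal.Length, data = iris, subset = 1:100)
n_idx   <- nobs(fit_idx)   # 100

# lm builds its working data with model.frame(), which is where
# data= and subset= are actually applied
mf   <- model.frame(mpg ~ disp, data = mtcars, subset = cyl == 4)
n_mf <- nrow(mf)           # 11 four-cylinder cars
```

One practical difference is that subset leaves the original data frame untouched and avoids creating a second object such as training_data.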



          Edit: as joran pointed out, randomForest (and others, but notably not lm) can be called either with a formula, which is where you'd typically use the data argument, or by passing the predictors and response separately through the arguments x and y, as in the following examples taken from ?randomForest (ignore the inconsistency in the other arguments):



          library(randomForest)
          iris.rf <- randomForest(Species ~ ., data=iris, importance=TRUE, proximity=TRUE)  # formula interface
          iris.rrf <- randomForest(iris[-1], iris[[1]], ntree=101, proximity=TRUE, oob.prox=FALSE)  # x/y interface
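To tie this back to the question, the two interfaces can be put side by side on the same training rows. This is only a sketch: it assumes the randomForest package is installed (the block is skipped otherwise), and train_rows, ntree = 50, and the seed are arbitrary choices made for illustration.

```r
# Assumes the randomForest package is installed; skipped otherwise.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(42)
  # Hypothetical training rows: every other row, so all three species occur
  train_rows <- seq(1, nrow(iris), by = 2)

  # Formula interface: data= and subset= pick the training rows
  rf_formula <- randomForest::randomForest(Species ~ ., data = iris,
                                           subset = train_rows, ntree = 50)

  # x/y interface: the rows are selected by indexing the inputs directly
  rf_xy <- randomForest::randomForest(x = iris[train_rows, -5],
                                      y = iris$Species[train_rows],
                                      ntree = 50)

  # Both forests were trained on the same number of rows
  n_formula <- length(rf_formula$predicted)
  n_xy      <- length(rf_xy$predicted)
}
```

The rows are deliberately spread across the data set: a classification forest needs every class of the response to be present in the training rows, whichever interface is used.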



























          • Also might be helpful to emphasize that randomForest can be called with either a formula, which is where you'd typically use the data argument, or by specifying the predictor/response arguments separately with the arguments x and y. – joran, Mar 27 at 16:17






          • @r2evans Thank you! Very clear answers. Now I get it. – Raymond Lucky, Mar 27 at 16:21












































          edited Mar 27 at 16:34

























          answered Mar 27 at 16:14









          r2evans









