


How to understand the arguments of “data” and “subset” in randomForest R package?

















Arguments



  • data: an optional data frame containing the variables in the model. By default the variables are taken from the environment which randomForest is called from.


  • subset: an index vector indicating which rows should be used. (NOTE: If given, this argument must be named.)


My questions:



  1. Why is the data argument "optional"? If data is optional, where does the training data come from? And what exactly does "By default the variables are taken from the environment which randomForest is called from" mean?


  2. Why do we need the subset parameter? Say we have the iris data set. If I want to use the first 100 rows as the training data, I can just write training_data <- iris[1:100,]. Why bother with subset? What's the benefit of using it?










      asked Mar 27 at 16:02









      Raymond Lucky

























          1 Answer














          1. Taking variables from the calling environment is a common convention among R modeling functions and is certainly not unique to randomForest. For example, with lm:



            mpg <- mtcars$mpg
            disp <- mtcars$disp
            lm(mpg~disp)
            # Call:
            # lm(formula = mpg ~ disp)
            # Coefficients:
            # (Intercept)         disp
            #    29.59985     -0.04122


            So when lm (in this case) resolves the variables referenced in the formula mpg~disp, it looks first in data, if provided, and then in the environment of the formula, which is typically the calling environment. Further example:



            rm(mpg,disp)
            mpg2 <- mtcars$mpg
            lm(mpg2~disp)
            # Error in eval(predvars, data, env) : object 'disp' not found
            lm(mpg2~disp, data=mtcars)
            # Call:
            # lm(formula = mpg2 ~ disp, data = mtcars)
            # Coefficients:
            # (Intercept)         disp
            #    29.59985     -0.04122


            (Notice that mpg2 is not in mtcars, so this call used both methods for finding the data.) I don't use this functionality myself; I prefer the more resilient step of providing all the data in the call. It is not difficult to think of examples where reproducibility suffers if this is not the case.




          2. Similarly, many modeling functions (including lm) accept a subset= argument, so the fact that randomForest includes it is consistent. I believe it is merely a convenience argument, as the following are roughly equivalent:



            lm(mpg~disp, data=mtcars, subset= cyl==4)

            lm(mpg~disp, data=mtcars[mtcars$cyl == 4,])

            mt <- mtcars[ mtcars$cyl == 4, ]
            lm(mpg~disp, data=mt)


            Using subset allows slightly simpler referencing (cyl versus mtcars$cyl), and its utility grows as the number of referenced variables increases (e.g., for "code golf" purposes). The same can be done with other mechanisms such as with(), so it is mostly a matter of personal preference.
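Both points above can be checked with a short base-R sketch. It uses lm and mtcars only, as in the examples above; the object names (fit_mixed, fit_env, and so on) are made up for illustration, and reversing disp is just a convenient way to make the global variable disagree with the column in mtcars.

```r
# Point 1: variables are looked up in `data` first, then in the formula's
# environment (usually the calling environment).
mpg2 <- mtcars$mpg
disp <- rev(mtcars$disp)   # a global that disagrees with mtcars$disp

fit_ref   <- lm(mpg  ~ disp, data = mtcars)  # everything from mtcars
fit_mixed <- lm(mpg2 ~ disp, data = mtcars)  # disp from mtcars, mpg2 from the environment
fit_env   <- lm(mpg2 ~ disp)                 # both from the environment (reversed disp)

same_as_ref <- isTRUE(all.equal(unname(coef(fit_mixed)), unname(coef(fit_ref))))
env_differs <- !isTRUE(all.equal(unname(coef(fit_env)), unname(coef(fit_ref))))

# Point 2: subset= versus subsetting the data frame by hand.
fit_subset  <- lm(mpg ~ disp, data = mtcars, subset = cyl == 4)
fit_indexed <- lm(mpg ~ disp, data = mtcars[mtcars$cyl == 4, ])
subset_matches <- isTRUE(all.equal(coef(fit_subset), coef(fit_indexed)))

# subset= also accepts an index vector, as described in the documentation.
fit_first10 <- lm(mpg ~ disp, data = mtcars, subset = 1:10)
index_matches <- isTRUE(all.equal(coef(fit_first10),
                                  coef(lm(mpg ~ disp, data = mtcars[1:10, ]))))
```

The fit_env case is the reproducibility risk mentioned in point 1: the call looks the same, but its result depends on whatever disp happens to be in the workspace at the time.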



          Edit: as joran pointed out, randomForest (like some other modeling functions, but notably not lm) can be called either with a formula, which is where you'd typically use the data argument, or by passing the predictors and the response separately through the arguments x and y, as in the following examples taken from ?randomForest (ignore the other arguments being inconsistent):



          library(randomForest)  # provides randomForest()
          iris.rf <- randomForest(Species ~ ., data=iris, importance=TRUE, proximity=TRUE)
          iris.rrf <- randomForest(iris[-1], iris[[1]], ntree=101, proximity=TRUE, oob.prox=FALSE)
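To see how data and subset in a formula call relate to separate x and y arguments, here is a minimal sketch using base R's model.frame. It is an assumption that randomForest's formula method follows this same convention, and the names mf, x, y and rows_match are made up for illustration; the first-100-rows split comes from the question.

```r
# Formula interface: `data` supplies the variables, `subset` picks the rows.
mf <- model.frame(Species ~ ., data = iris, subset = 1:100)

# Roughly what a formula method could pass on as separate y and x:
y <- model.response(mf)        # the response column, Species
x <- mf[, -1, drop = FALSE]    # the remaining columns, i.e. the predictors

# The same rows selected by hand, as in the question:
training_data <- iris[1:100, ]
rows_match <- nrow(mf) == nrow(training_data) &&
  all(mf$Sepal.Length == training_data$Sepal.Length)
```

Either route ends up with the same rows; subset= simply lets the selection live inside the modeling call instead of in a separate data frame.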





          • 2





            Also might be helpful to emphasize that randomForest can be called with either a formula, which is where you'd typically use the data argument, or by specifying the predictor/response arguments separately with the arguments x and y.

            – joran
            Mar 27 at 16:17






          • 1





            @r2evans Thank you! Very clear answers. Now I get it.

            – Raymond Lucky
            Mar 27 at 16:21











          edited Mar 27 at 16:34

























          answered Mar 27 at 16:14









          r2evans









