Quadratic interaction term dropping out in Stata but not in RInterpreting a quadratic logarithmic termStata ANOVA Interaction term differencesWhat to do when both the linear term and quadratic term are significant independently from each other, but not together?Interpreting logistic regression with an interaction and a quadratic term?Interpreting interaction with quadratic term

Why was ambassador Sondland involved in Ukraine?

Surmounting set-theoretical difficulties in algebraic geometry

TSP with revenue maximization

Why does Greedo say "Maclunkey" in the Mos Eisley Cantina?

What is the name of the fallacy involving white and black swans?

Did Ohio pass a law granting students the right to give scientifically wrong answers consistent with their religious beliefs?

Insets around a clock

Why would a family misspell their surname?

Locked folder with obscure app from Sourceforge, now cannot unlock folder

What latex template to use when you do not know the journal you are going to submit

How to break a equation with a single "summation symbol (sum) " common?

Several verbs in subjuntivo mood in one sentence

Beginner Tactics - Why Isn't This Mate?

I think one of my players wants to leave the campaign

Why doesn't remove_reference work on functions?

Days in indexed month

Ran out of space of newly installed HDD?

Does "solicit" mean the solicitor must receive what is being solicited in context of 52 U.S. Code Section 30121?

Why are rain clouds darker?

Can it be viewed as a negative for PhD applications in the US if I have children?

What was the deal with the news stories about rats in "Joker"?

Looking for a reference in Greek

Are we sinners because we sin or do we sin because we are sinners?

What other tricks were there to get more data onto floppy disks?



Quadratic interaction term dropping out in Stata but not in R


Interpreting a quadratic logarithmic termStata ANOVA Interaction term differencesWhat to do when both the linear term and quadratic term are significant independently from each other, but not together?Interpreting logistic regression with an interaction and a quadratic term?Interpreting interaction with quadratic term






.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty
margin-bottom:0;









1















$begingroup$


With the same data and the same model specification, Stata is dropping one of the terms from my model while R seems to run it just fine.



For example:



library(tidyverse)
library(magrittr)
library(broom)

foo <- structure(list(inc = c(25545, 24879, 8843, 31668, 20335.25, 32625,
21060, 15633.75, 31267.5, 10505, 5167, 33926, 22050, 44863, 11731.875,
21659, 31927.5, 16235, 56647, 7751, 4343, 29402, 40756.5, 12779,
28030, 29452.5, 28650, 15925, 92692, 3789, 44887, 6201, 38373.75,
25511, 45837, 27500, 27171, 7984, 15925, 18947, 12870, 42234.75,
49813.5, 28490, 38373.75, 10040.625, 30150, 15406, 14015, 45000),
sex = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L,
2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L), .Label = c("male", "female"), class = "factor"),
year = c(1993, 1978, 1993, 1985, 2002, 1996, 2016, 2006,
2006, 1994, 1998, 1978, 2012, 1985, 2002, 1986, 2014, 1994,
1980, 1998, 1988, 1978, 2004, 1983, 1982, 2008, 1994, 2012,
1996, 1998, 2000, 1998, 2006, 1989, 1976, 1987, 2004, 1990,
2012, 1998, 2016, 2002, 2004, 2010, 2006, 2008, 1989, 1996,
1986, 1987)),
row.names = c(NA, -50L), class = c("tbl_df", "tbl", "data.frame"))

foo %<>%
mutate(year2 = year * year)

lm(data = foo,
inc ~ sex*year + sex*year2) %>%
tidy
#> # A tibble: 6 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -63645382. 97555833. -0.652 0.518
#> 2 sexfemale 164988796. 150579599. 1.10 0.279
#> 3 year 64070. 97815. 0.655 0.516
#> 4 year2 -16.1 24.5 -0.657 0.514
#> 5 sexfemale:year -165546. 150810. -1.10 0.278
#> 6 sexfemale:year2 41.5 37.8 1.10 0.277


The results above from R are what I expect.



However, in Stata, the sexfemale:year2 interaction term gets dropped.



Using RStata:



library(RStata)
options("RStata.StataPath" = "") # you path here
options("RStata.StataVersion" = ) # your version here

stata("reg inc i.sex#c.year i.sex#c.year2 i.sex c.year c.year2",
data.in = foo)
#> . reg inc i.sex#c.year i.sex#c.year2 i.sex c.year c.year2
#> note: 2.sex#c.year2 omitted because of collinearity
#>
#> Source | SS df MS Number of obs = 50
#> -------------+------------------------------ F( 4, 45) = 3.36
#> Model | 3.0037e+09 4 750914046 Prob > F = 0.0173
#> Residual | 1.0058e+10 45 223510687 R-squared = 0.2300
#> -------------+------------------------------ Adj R-squared = 0.1615
#> Total | 1.3062e+10 49 266564022 Root MSE = 14950
#>
#> ------------------------------------------------------------------------------
#> inc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
#> -------------+----------------------------------------------------------------
#> sex#c.year |
#> female | 287.9784 427.8164 0.67 0.504 -573.6881 1149.645
#> |
#> sex#c.year2 |
#> female | 0 (omitted)
#> |
#> sex |
#> female | -589746.3 854533.4 -0.69 0.494 -2310865 1131372
#> year | -5771.122 74562.38 -0.08 0.939 -155947.5 144405.2
#> year2 | 1.391068 18.68956 0.07 0.941 -36.25164 39.03378
#> _cons | 6009728 7.44e+07 0.08 0.936 -1.44e+08 1.56e+08
#> ------------------------------------------------------------------------------


If you reverse inc and year (i.e., if you regress year on inc and an inc*inc variable, both interacted with sex), you get six terms as expected, and results from Stata and R agree. If you tell R to omit the term that Stata is dropping, they'll agree then too. I cannot seem to get Stata to give me all six terms with the original model.



What am I missing?



Here's Stata code to reproduce the dataset into Stata directly:



* Example generated by -dataex-. To install: ssc install dataex
clear
input double inc long sex double(year year2)
25545 2 1993 3972049
24879 2 1978 3912484
8843 2 1993 3972049
31668 1 1985 3940225
20335.25 1 2002 4008004
32625 1 1996 3984016
21060 1 2016 4064256
15633.75 2 2006 4024036
31267.5 2 2006 4024036
10505 2 1994 3976036
5167 2 1998 3992004
33926 1 1978 3912484
22050 2 2012 4048144
44863 1 1985 3940225
11731.875 1 2002 4008004
21659 2 1986 3944196
31927.5 2 2014 4056196
16235 1 1994 3976036
56647 1 1980 3920400
7751 2 1998 3992004
4343 1 1988 3952144
29402 1 1978 3912484
40756.5 1 2004 4016016
12779 1 1983 3932289
28030 1 1982 3928324
29452.5 1 2008 4032064
28650 1 1994 3976036
15925 1 2012 4048144
92692 1 1996 3984016
3789 2 1998 3992004
44887 1 2000 4000000
6201 2 1998 3992004
38373.75 1 2006 4024036
25511 2 1989 3956121
45837 1 1976 3904576
27500 2 1987 3948169
27171 1 2004 4016016
7984 2 1990 3960100
15925 2 2012 4048144
18947 2 1998 3992004
12870 2 2016 4064256
42234.75 1 2002 4008004
49813.5 2 2004 4016016
28490 1 2010 4040100
38373.75 1 2006 4024036
10040.625 2 2008 4032064
30150 1 1989 3956121
15406 2 1996 3984016
14015 2 1986 3944196
45000 1 1987 3948169
end
label values sex sex
label def sex 1 "male", modify
label def sex 2 "female", modify









share|cite|improve this question









$endgroup$





migrated from stackoverflow.com Mar 30 at 1:40


This question came from our site for professional and enthusiast programmers.














  • 3




    $begingroup$
    Stata thinks that the correlation between year and year2 is about 0.999996121207278543. I am happy to regard that as collinear for practical purposes. Your point about documentation is a serious one: perhaps you can tell us where for R in general or lm() in particular it is documented what it does or does not allow in these cases and why.
    $endgroup$
    – Nick Cox
    Mar 29 at 1:28






  • 3




    $begingroup$
    To me the main issue here is statistical. A parameterisation in terms of say (year - 2000) and its square is much easier to think about and at the same time cuts out problems with precision and instability.
    $endgroup$
    – Nick Cox
    Mar 29 at 1:33






  • 4




    $begingroup$
    My insight into this particular issue is just one of experience in using regress for 28 years over versions of Stata since 2.05. First, almost always when a predictor is thrown out it is for the same reasons as in R, perfect collinearity. Second, I've seen this kind of thing only occasionally and so suspect that Stata is only a little more strict than R. Stata isn't written for idiots but for users who want to be smart even if they are learning. So, most of the time the user is expected to look at output of coefficients and standard errors and work out what is a bad idea.
    $endgroup$
    – Nick Cox
    Mar 29 at 2:13






  • 4




    $begingroup$
    Avoiding calendar years as predictors is something I have warned about in various places, e.g. Cross Validated and Statalist.
    $endgroup$
    – Nick Cox
    Mar 29 at 2:14






  • 2




    $begingroup$
    The high p-values provide a clue. Indeed, the eigenvalues of the covariance matrix of (year, year2, inc) range over nearly 13 orders of magnitude. This totally wipes out the precision inherent in the least squares calculations (which typically is 12 to 13 orders of magnitude). By including year*year in these models, you are just confusing the calculations with noise. If you really think it must be included, then improve your chances at seeing the signal with lm(inc ~ sex * poly(year, 2), foo).
    $endgroup$
    – whuber
    Mar 30 at 14:43


















1















$begingroup$


With the same data and the same model specification, Stata is dropping one of the terms from my model while R seems to run it just fine.



For example:



library(tidyverse)
library(magrittr)
library(broom)

foo <- structure(list(inc = c(25545, 24879, 8843, 31668, 20335.25, 32625,
21060, 15633.75, 31267.5, 10505, 5167, 33926, 22050, 44863, 11731.875,
21659, 31927.5, 16235, 56647, 7751, 4343, 29402, 40756.5, 12779,
28030, 29452.5, 28650, 15925, 92692, 3789, 44887, 6201, 38373.75,
25511, 45837, 27500, 27171, 7984, 15925, 18947, 12870, 42234.75,
49813.5, 28490, 38373.75, 10040.625, 30150, 15406, 14015, 45000),
sex = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L,
2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L), .Label = c("male", "female"), class = "factor"),
year = c(1993, 1978, 1993, 1985, 2002, 1996, 2016, 2006,
2006, 1994, 1998, 1978, 2012, 1985, 2002, 1986, 2014, 1994,
1980, 1998, 1988, 1978, 2004, 1983, 1982, 2008, 1994, 2012,
1996, 1998, 2000, 1998, 2006, 1989, 1976, 1987, 2004, 1990,
2012, 1998, 2016, 2002, 2004, 2010, 2006, 2008, 1989, 1996,
1986, 1987)),
row.names = c(NA, -50L), class = c("tbl_df", "tbl", "data.frame"))

foo %<>%
mutate(year2 = year * year)

lm(data = foo,
inc ~ sex*year + sex*year2) %>%
tidy
#> # A tibble: 6 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -63645382. 97555833. -0.652 0.518
#> 2 sexfemale 164988796. 150579599. 1.10 0.279
#> 3 year 64070. 97815. 0.655 0.516
#> 4 year2 -16.1 24.5 -0.657 0.514
#> 5 sexfemale:year -165546. 150810. -1.10 0.278
#> 6 sexfemale:year2 41.5 37.8 1.10 0.277


The results above from R are what I expect.



However, in Stata, the sexfemale:year2 interaction term gets dropped.



Using RStata:



library(RStata)
options("RStata.StataPath" = "") # you path here
options("RStata.StataVersion" = ) # your version here

stata("reg inc i.sex#c.year i.sex#c.year2 i.sex c.year c.year2",
data.in = foo)
#> . reg inc i.sex#c.year i.sex#c.year2 i.sex c.year c.year2
#> note: 2.sex#c.year2 omitted because of collinearity
#>
#> Source | SS df MS Number of obs = 50
#> -------------+------------------------------ F( 4, 45) = 3.36
#> Model | 3.0037e+09 4 750914046 Prob > F = 0.0173
#> Residual | 1.0058e+10 45 223510687 R-squared = 0.2300
#> -------------+------------------------------ Adj R-squared = 0.1615
#> Total | 1.3062e+10 49 266564022 Root MSE = 14950
#>
#> ------------------------------------------------------------------------------
#> inc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
#> -------------+----------------------------------------------------------------
#> sex#c.year |
#> female | 287.9784 427.8164 0.67 0.504 -573.6881 1149.645
#> |
#> sex#c.year2 |
#> female | 0 (omitted)
#> |
#> sex |
#> female | -589746.3 854533.4 -0.69 0.494 -2310865 1131372
#> year | -5771.122 74562.38 -0.08 0.939 -155947.5 144405.2
#> year2 | 1.391068 18.68956 0.07 0.941 -36.25164 39.03378
#> _cons | 6009728 7.44e+07 0.08 0.936 -1.44e+08 1.56e+08
#> ------------------------------------------------------------------------------


If you reverse inc and year (i.e., if you regress year on inc and an inc*inc variable, both interacted with sex), you get six terms as expected, and results from Stata and R agree. If you tell R to omit the term that Stata is dropping, they'll agree then too. I cannot seem to get Stata to give me all six terms with the original model.



What am I missing?



Here's Stata code to reproduce the dataset into Stata directly:



* Example generated by -dataex-. To install: ssc install dataex
clear
input double inc long sex double(year year2)
25545 2 1993 3972049
24879 2 1978 3912484
8843 2 1993 3972049
31668 1 1985 3940225
20335.25 1 2002 4008004
32625 1 1996 3984016
21060 1 2016 4064256
15633.75 2 2006 4024036
31267.5 2 2006 4024036
10505 2 1994 3976036
5167 2 1998 3992004
33926 1 1978 3912484
22050 2 2012 4048144
44863 1 1985 3940225
11731.875 1 2002 4008004
21659 2 1986 3944196
31927.5 2 2014 4056196
16235 1 1994 3976036
56647 1 1980 3920400
7751 2 1998 3992004
4343 1 1988 3952144
29402 1 1978 3912484
40756.5 1 2004 4016016
12779 1 1983 3932289
28030 1 1982 3928324
29452.5 1 2008 4032064
28650 1 1994 3976036
15925 1 2012 4048144
92692 1 1996 3984016
3789 2 1998 3992004
44887 1 2000 4000000
6201 2 1998 3992004
38373.75 1 2006 4024036
25511 2 1989 3956121
45837 1 1976 3904576
27500 2 1987 3948169
27171 1 2004 4016016
7984 2 1990 3960100
15925 2 2012 4048144
18947 2 1998 3992004
12870 2 2016 4064256
42234.75 1 2002 4008004
49813.5 2 2004 4016016
28490 1 2010 4040100
38373.75 1 2006 4024036
10040.625 2 2008 4032064
30150 1 1989 3956121
15406 2 1996 3984016
14015 2 1986 3944196
45000 1 1987 3948169
end
label values sex sex
label def sex 1 "male", modify
label def sex 2 "female", modify









share|cite|improve this question









$endgroup$





migrated from stackoverflow.com Mar 30 at 1:40


This question came from our site for professional and enthusiast programmers.














  • 3




    $begingroup$
    Stata thinks that the correlation between year and year2 is about 0.999996121207278543. I am happy to regard that as collinear for practical purposes. Your point about documentation is a serious one: perhaps you can tell us where for R in general or lm() in particular it is documented what it does or does not allow in these cases and why.
    $endgroup$
    – Nick Cox
    Mar 29 at 1:28






  • 3




    $begingroup$
    To me the main issue here is statistical. A parameterisation in terms of say (year - 2000) and its square is much easier to think about and at the same time cuts out problems with precision and instability.
    $endgroup$
    – Nick Cox
    Mar 29 at 1:33






  • 4




    $begingroup$
    My insight into this particular issue is just one of experience in using regress for 28 years over versions of Stata since 2.05. First, almost always when a predictor is thrown out it is for the same reasons as in R, perfect collinearity. Second, I've seen this kind of thing only occasionally and so suspect that Stata is only a little more strict than R. Stata isn't written for idiots but for users who want to be smart even if they are learning. So, most of the time the user is expected to look at output of coefficients and standard errors and work out what is a bad idea.
    $endgroup$
    – Nick Cox
    Mar 29 at 2:13






  • 4




    $begingroup$
    Avoiding calendar years as predictors is something I have warned about in various places, e.g. Cross Validated and Statalist.
    $endgroup$
    – Nick Cox
    Mar 29 at 2:14






  • 2




    $begingroup$
    The high p-values provide a clue. Indeed, the eigenvalues of the covariance matrix of (year, year2, inc) range over nearly 13 orders of magnitude. This totally wipes out the precision inherent in the least squares calculations (which typically is 12 to 13 orders of magnitude). By including year*year in these models, you are just confusing the calculations with noise. If you really think it must be included, then improve your chances at seeing the signal with lm(inc ~ sex * poly(year, 2), foo).
    $endgroup$
    – whuber
    Mar 30 at 14:43














1













1









1





$begingroup$


With the same data and the same model specification, Stata is dropping one of the terms from my model while R seems to run it just fine.



For example:



library(tidyverse)
library(magrittr)
library(broom)

foo <- structure(list(inc = c(25545, 24879, 8843, 31668, 20335.25, 32625,
21060, 15633.75, 31267.5, 10505, 5167, 33926, 22050, 44863, 11731.875,
21659, 31927.5, 16235, 56647, 7751, 4343, 29402, 40756.5, 12779,
28030, 29452.5, 28650, 15925, 92692, 3789, 44887, 6201, 38373.75,
25511, 45837, 27500, 27171, 7984, 15925, 18947, 12870, 42234.75,
49813.5, 28490, 38373.75, 10040.625, 30150, 15406, 14015, 45000),
sex = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L,
2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L), .Label = c("male", "female"), class = "factor"),
year = c(1993, 1978, 1993, 1985, 2002, 1996, 2016, 2006,
2006, 1994, 1998, 1978, 2012, 1985, 2002, 1986, 2014, 1994,
1980, 1998, 1988, 1978, 2004, 1983, 1982, 2008, 1994, 2012,
1996, 1998, 2000, 1998, 2006, 1989, 1976, 1987, 2004, 1990,
2012, 1998, 2016, 2002, 2004, 2010, 2006, 2008, 1989, 1996,
1986, 1987)),
row.names = c(NA, -50L), class = c("tbl_df", "tbl", "data.frame"))

foo %<>%
mutate(year2 = year * year)

lm(data = foo,
inc ~ sex*year + sex*year2) %>%
tidy
#> # A tibble: 6 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -63645382. 97555833. -0.652 0.518
#> 2 sexfemale 164988796. 150579599. 1.10 0.279
#> 3 year 64070. 97815. 0.655 0.516
#> 4 year2 -16.1 24.5 -0.657 0.514
#> 5 sexfemale:year -165546. 150810. -1.10 0.278
#> 6 sexfemale:year2 41.5 37.8 1.10 0.277


The results above from R are what I expect.



However, in Stata, the sexfemale:year2 interaction term gets dropped.



Using RStata:



library(RStata)
options("RStata.StataPath" = "") # you path here
options("RStata.StataVersion" = ) # your version here

stata("reg inc i.sex#c.year i.sex#c.year2 i.sex c.year c.year2",
data.in = foo)
#> . reg inc i.sex#c.year i.sex#c.year2 i.sex c.year c.year2
#> note: 2.sex#c.year2 omitted because of collinearity
#>
#> Source | SS df MS Number of obs = 50
#> -------------+------------------------------ F( 4, 45) = 3.36
#> Model | 3.0037e+09 4 750914046 Prob > F = 0.0173
#> Residual | 1.0058e+10 45 223510687 R-squared = 0.2300
#> -------------+------------------------------ Adj R-squared = 0.1615
#> Total | 1.3062e+10 49 266564022 Root MSE = 14950
#>
#> ------------------------------------------------------------------------------
#> inc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
#> -------------+----------------------------------------------------------------
#> sex#c.year |
#> female | 287.9784 427.8164 0.67 0.504 -573.6881 1149.645
#> |
#> sex#c.year2 |
#> female | 0 (omitted)
#> |
#> sex |
#> female | -589746.3 854533.4 -0.69 0.494 -2310865 1131372
#> year | -5771.122 74562.38 -0.08 0.939 -155947.5 144405.2
#> year2 | 1.391068 18.68956 0.07 0.941 -36.25164 39.03378
#> _cons | 6009728 7.44e+07 0.08 0.936 -1.44e+08 1.56e+08
#> ------------------------------------------------------------------------------


If you reverse inc and year (i.e., if you regress year on inc and an inc*inc variable, both interacted with sex), you get six terms as expected, and results from Stata and R agree. If you tell R to omit the term that Stata is dropping, they'll agree then too. I cannot seem to get Stata to give me all six terms with the original model.



What am I missing?



Here's Stata code to reproduce the dataset into Stata directly:



* Example generated by -dataex-. To install: ssc install dataex
clear
input double inc long sex double(year year2)
25545 2 1993 3972049
24879 2 1978 3912484
8843 2 1993 3972049
31668 1 1985 3940225
20335.25 1 2002 4008004
32625 1 1996 3984016
21060 1 2016 4064256
15633.75 2 2006 4024036
31267.5 2 2006 4024036
10505 2 1994 3976036
5167 2 1998 3992004
33926 1 1978 3912484
22050 2 2012 4048144
44863 1 1985 3940225
11731.875 1 2002 4008004
21659 2 1986 3944196
31927.5 2 2014 4056196
16235 1 1994 3976036
56647 1 1980 3920400
7751 2 1998 3992004
4343 1 1988 3952144
29402 1 1978 3912484
40756.5 1 2004 4016016
12779 1 1983 3932289
28030 1 1982 3928324
29452.5 1 2008 4032064
28650 1 1994 3976036
15925 1 2012 4048144
92692 1 1996 3984016
3789 2 1998 3992004
44887 1 2000 4000000
6201 2 1998 3992004
38373.75 1 2006 4024036
25511 2 1989 3956121
45837 1 1976 3904576
27500 2 1987 3948169
27171 1 2004 4016016
7984 2 1990 3960100
15925 2 2012 4048144
18947 2 1998 3992004
12870 2 2016 4064256
42234.75 1 2002 4008004
49813.5 2 2004 4016016
28490 1 2010 4040100
38373.75 1 2006 4024036
10040.625 2 2008 4032064
30150 1 1989 3956121
15406 2 1996 3984016
14015 2 1986 3944196
45000 1 1987 3948169
end
label values sex sex
label def sex 1 "male", modify
label def sex 2 "female", modify









share|cite|improve this question









$endgroup$




With the same data and the same model specification, Stata is dropping one of the terms from my model while R seems to run it just fine.



For example:



library(tidyverse)
library(magrittr)
library(broom)

foo <- structure(list(inc = c(25545, 24879, 8843, 31668, 20335.25, 32625,
21060, 15633.75, 31267.5, 10505, 5167, 33926, 22050, 44863, 11731.875,
21659, 31927.5, 16235, 56647, 7751, 4343, 29402, 40756.5, 12779,
28030, 29452.5, 28650, 15925, 92692, 3789, 44887, 6201, 38373.75,
25511, 45837, 27500, 27171, 7984, 15925, 18947, 12870, 42234.75,
49813.5, 28490, 38373.75, 10040.625, 30150, 15406, 14015, 45000),
sex = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L,
2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L), .Label = c("male", "female"), class = "factor"),
year = c(1993, 1978, 1993, 1985, 2002, 1996, 2016, 2006,
2006, 1994, 1998, 1978, 2012, 1985, 2002, 1986, 2014, 1994,
1980, 1998, 1988, 1978, 2004, 1983, 1982, 2008, 1994, 2012,
1996, 1998, 2000, 1998, 2006, 1989, 1976, 1987, 2004, 1990,
2012, 1998, 2016, 2002, 2004, 2010, 2006, 2008, 1989, 1996,
1986, 1987)),
row.names = c(NA, -50L), class = c("tbl_df", "tbl", "data.frame"))

foo %<>%
mutate(year2 = year * year)

lm(data = foo,
inc ~ sex*year + sex*year2) %>%
tidy
#> # A tibble: 6 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -63645382. 97555833. -0.652 0.518
#> 2 sexfemale 164988796. 150579599. 1.10 0.279
#> 3 year 64070. 97815. 0.655 0.516
#> 4 year2 -16.1 24.5 -0.657 0.514
#> 5 sexfemale:year -165546. 150810. -1.10 0.278
#> 6 sexfemale:year2 41.5 37.8 1.10 0.277


The results above from R are what I expect.



However, in Stata, the sexfemale:year2 interaction term gets dropped.



Using RStata:



library(RStata)
options("RStata.StataPath" = "") # you path here
options("RStata.StataVersion" = ) # your version here

stata("reg inc i.sex#c.year i.sex#c.year2 i.sex c.year c.year2",
data.in = foo)
#> . reg inc i.sex#c.year i.sex#c.year2 i.sex c.year c.year2
#> note: 2.sex#c.year2 omitted because of collinearity
#>
#> Source | SS df MS Number of obs = 50
#> -------------+------------------------------ F( 4, 45) = 3.36
#> Model | 3.0037e+09 4 750914046 Prob > F = 0.0173
#> Residual | 1.0058e+10 45 223510687 R-squared = 0.2300
#> -------------+------------------------------ Adj R-squared = 0.1615
#> Total | 1.3062e+10 49 266564022 Root MSE = 14950
#>
#> ------------------------------------------------------------------------------
#> inc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
#> -------------+----------------------------------------------------------------
#> sex#c.year |
#> female | 287.9784 427.8164 0.67 0.504 -573.6881 1149.645
#> |
#> sex#c.year2 |
#> female | 0 (omitted)
#> |
#> sex |
#> female | -589746.3 854533.4 -0.69 0.494 -2310865 1131372
#> year | -5771.122 74562.38 -0.08 0.939 -155947.5 144405.2
#> year2 | 1.391068 18.68956 0.07 0.941 -36.25164 39.03378
#> _cons | 6009728 7.44e+07 0.08 0.936 -1.44e+08 1.56e+08
#> ------------------------------------------------------------------------------


If you reverse inc and year (i.e., if you regress year on inc and an inc*inc variable, both interacted with sex), you get six terms as expected, and results from Stata and R agree. If you tell R to omit the term that Stata is dropping, they'll agree then too. I cannot seem to get Stata to give me all six terms with the original model.



What am I missing?



Here's Stata code to reproduce the dataset into Stata directly:



* Example generated by -dataex-. To install: ssc install dataex
clear
input double inc long sex double(year year2)
25545 2 1993 3972049
24879 2 1978 3912484
8843 2 1993 3972049
31668 1 1985 3940225
20335.25 1 2002 4008004
32625 1 1996 3984016
21060 1 2016 4064256
15633.75 2 2006 4024036
31267.5 2 2006 4024036
10505 2 1994 3976036
5167 2 1998 3992004
33926 1 1978 3912484
22050 2 2012 4048144
44863 1 1985 3940225
11731.875 1 2002 4008004
21659 2 1986 3944196
31927.5 2 2014 4056196
16235 1 1994 3976036
56647 1 1980 3920400
7751 2 1998 3992004
4343 1 1988 3952144
29402 1 1978 3912484
40756.5 1 2004 4016016
12779 1 1983 3932289
28030 1 1982 3928324
29452.5 1 2008 4032064
28650 1 1994 3976036
15925 1 2012 4048144
92692 1 1996 3984016
3789 2 1998 3992004
44887 1 2000 4000000
6201 2 1998 3992004
38373.75 1 2006 4024036
25511 2 1989 3956121
45837 1 1976 3904576
27500 2 1987 3948169
27171 1 2004 4016016
7984 2 1990 3960100
15925 2 2012 4048144
18947 2 1998 3992004
12870 2 2016 4064256
42234.75 1 2002 4008004
49813.5 2 2004 4016016
28490 1 2010 4040100
38373.75 1 2006 4024036
10040.625 2 2008 4032064
30150 1 1989 3956121
15406 2 1996 3984016
14015 2 1986 3944196
45000 1 1987 3948169
end
label values sex sex
label def sex 1 "male", modify
label def sex 2 "female", modify






r regression stata






share|cite|improve this question













share|cite|improve this question











share|cite|improve this question




share|cite|improve this question










asked Mar 28 at 22:10









lostlost

1135 bronze badges




1135 bronze badges





migrated from stackoverflow.com Mar 30 at 1:40


This question came from our site for professional and enthusiast programmers.











migrated from stackoverflow.com Mar 30 at 1:40


This question came from our site for professional and enthusiast programmers.









migrated from stackoverflow.com Mar 30 at 1:40


This question came from our site for professional and enthusiast programmers.









  • 3




    $begingroup$
    Stata thinks that the correlation between year and year2 is about 0.999996121207278543. I am happy to regard that as collinear for practical purposes. Your point about documentation is a serious one: perhaps you can tell us where for R in general or lm() in particular it is documented what it does or does not allow in these cases and why.
    $endgroup$
    – Nick Cox
    Mar 29 at 1:28






  • 3




    $begingroup$
    To me the main issue here is statistical. A parameterisation in terms of say (year - 2000) and its square is much easier to think about and at the same time cuts out problems with precision and instability.
    $endgroup$
    – Nick Cox
    Mar 29 at 1:33






  • 4




    $begingroup$
    My insight into this particular issue is just one of experience in using regress for 28 years over versions of Stata since 2.05. First, almost always when a predictor is thrown out it is for the same reasons as in R, perfect collinearity. Second, I've seen this kind of thing only occasionally and so suspect that Stata is only a little more strict than R. Stata isn't written for idiots but for users who want to be smart even if they are learning. So, most of the time the user is expected to look at output of coefficients and standard errors and work out what is a bad idea.
    $endgroup$
    – Nick Cox
    Mar 29 at 2:13






  • 4




    $begingroup$
    Avoiding calendar years as predictors is something I have warned about in various places, e.g. Cross Validated and Statalist.
    $endgroup$
    – Nick Cox
    Mar 29 at 2:14






  • 2




    $begingroup$
    The high p-values provide a clue. Indeed, the eigenvalues of the covariance matrix of (year, year2, inc) range over nearly 13 orders of magnitude. This totally wipes out the precision inherent in the least squares calculations (which typically is 12 to 13 orders of magnitude). By including year*year in these models, you are just confusing the calculations with noise. If you really think it must be included, then improve your chances at seeing the signal with lm(inc ~ sex * poly(year, 2), foo).
    $endgroup$
    – whuber
    Mar 30 at 14:43













  • 3




    $begingroup$
    Stata thinks that the correlation between year and year2 is about 0.999996121207278543. I am happy to regard that as collinear for practical purposes. Your point about documentation is a serious one: perhaps you can tell us where for R in general or lm() in particular it is documented what it does or does not allow in these cases and why.
    $endgroup$
    – Nick Cox
    Mar 29 at 1:28






  • 3




    $begingroup$
    To me the main issue here is statistical. A parameterisation in terms of say (year - 2000) and its square is much easier to think about and at the same time cuts out problems with precision and instability.
    $endgroup$
    – Nick Cox
    Mar 29 at 1:33






  • 4




    $begingroup$
    My insight into this particular issue is just one of experience in using regress for 28 years over versions of Stata since 2.05. First, almost always when a predictor is thrown out it is for the same reasons as in R, perfect collinearity. Second, I've seen this kind of thing only occasionally and so suspect that Stata is only a little more strict than R. Stata isn't written for idiots but for users who want to be smart even if they are learning. So, most of the time the user is expected to look at output of coefficients and standard errors and work out what is a bad idea.
    $endgroup$
    – Nick Cox
    Mar 29 at 2:13






  • 4




    $begingroup$
    Avoiding calendar years as predictors is something I have warned about in various places, e.g. Cross Validated and Statalist.
    $endgroup$
    – Nick Cox
    Mar 29 at 2:14






  • 2




    $begingroup$
    The high p-values provide a clue. Indeed, the eigenvalues of the covariance matrix of (year, year2, inc) range over nearly 13 orders of magnitude. This totally wipes out the precision inherent in the least squares calculations (which typically is 12 to 13 orders of magnitude). By including year*year in these models, you are just confusing the calculations with noise. If you really think it must be included, then improve your chances at seeing the signal with lm(inc ~ sex * poly(year, 2), foo).
    $endgroup$
    – whuber
    Mar 30 at 14:43








3




3




$begingroup$
Stata thinks that the correlation between year and year2 is about 0.999996121207278543. I am happy to regard that as collinear for practical purposes. Your point about documentation is a serious one: perhaps you can tell us where for R in general or lm() in particular it is documented what it does or does not allow in these cases and why.
$endgroup$
– Nick Cox
Mar 29 at 1:28




$begingroup$
Stata thinks that the correlation between year and year2 is about 0.999996121207278543. I am happy to regard that as collinear for practical purposes. Your point about documentation is a serious one: perhaps you can tell us where for R in general or lm() in particular it is documented what it does or does not allow in these cases and why.
$endgroup$
– Nick Cox
Mar 29 at 1:28




3




3




$begingroup$
To me the main issue here is statistical. A parameterisation in terms of say (year - 2000) and its square is much easier to think about and at the same time cuts out problems with precision and instability.
$endgroup$
– Nick Cox
Mar 29 at 1:33




$begingroup$
To me the main issue here is statistical. A parameterisation in terms of say (year - 2000) and its square is much easier to think about and at the same time cuts out problems with precision and instability.
$endgroup$
– Nick Cox
Mar 29 at 1:33




4




4




$begingroup$
My insight into this particular issue is just one of experience in using regress for 28 years over versions of Stata since 2.05. First, almost always when a predictor is thrown out it is for the same reasons as in R, perfect collinearity. Second, I've seen this kind of thing only occasionally and so suspect that Stata is only a little more strict than R. Stata isn't written for idiots but for users who want to be smart even if they are learning. So, most of the time the user is expected to look at output of coefficients and standard errors and work out what is a bad idea.
$endgroup$
– Nick Cox
Mar 29 at 2:13




$begingroup$
My insight into this particular issue is just one of experience in using regress for 28 years over versions of Stata since 2.05. First, almost always when a predictor is thrown out it is for the same reasons as in R, perfect collinearity. Second, I've seen this kind of thing only occasionally and so suspect that Stata is only a little more strict than R. Stata isn't written for idiots but for users who want to be smart even if they are learning. So, most of the time the user is expected to look at output of coefficients and standard errors and work out what is a bad idea.
$endgroup$
– Nick Cox
Mar 29 at 2:13




4




4




$begingroup$
Avoiding calendar years as predictors is something I have warned about in various places, e.g. Cross Validated and Statalist.
$endgroup$
– Nick Cox
Mar 29 at 2:14




$begingroup$
Avoiding calendar years as predictors is something I have warned about in various places, e.g. Cross Validated and Statalist.
$endgroup$
– Nick Cox
Mar 29 at 2:14




2




2




$begingroup$
The high p-values provide a clue. Indeed, the eigenvalues of the covariance matrix of (year, year2, inc) range over nearly 13 orders of magnitude. This totally wipes out the precision inherent in the least squares calculations (which typically is 12 to 13 orders of magnitude). By including year*year in these models, you are just confusing the calculations with noise. If you really think it must be included, then improve your chances at seeing the signal with lm(inc ~ sex * poly(year, 2), foo).
$endgroup$
– whuber
Mar 30 at 14:43





$begingroup$
The high p-values provide a clue. Indeed, the eigenvalues of the covariance matrix of (year, year2, inc) range over nearly 13 orders of magnitude. This totally wipes out the precision inherent in the least squares calculations (which typically is 12 to 13 orders of magnitude). By including year*year in these models, you are just confusing the calculations with noise. If you really think it must be included, then improve your chances at seeing the signal with lm(inc ~ sex * poly(year, 2), foo).
$endgroup$
– whuber
Mar 30 at 14:43











0






active

oldest

votes













Your Answer








StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "65"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/4.0/"u003ecc by-sa 4.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);














draft saved

draft discarded
















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f400223%2fquadratic-interaction-term-dropping-out-in-stata-but-not-in-r%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown


























0






active

oldest

votes








0






active

oldest

votes









active

oldest

votes






active

oldest

votes
















draft saved

draft discarded















































Thanks for contributing an answer to Cross Validated!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f400223%2fquadratic-interaction-term-dropping-out-in-stata-but-not-in-r%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown









Popular posts from this blog

Kamusi Yaliyomo Aina za kamusi | Muundo wa kamusi | Faida za kamusi | Dhima ya picha katika kamusi | Marejeo | Tazama pia | Viungo vya nje | UrambazajiKuhusu kamusiGo-SwahiliWiki-KamusiKamusi ya Kiswahili na Kiingerezakuihariri na kuongeza habari

Swift 4 - func physicsWorld not invoked on collision? The Next CEO of Stack OverflowHow to call Objective-C code from Swift#ifdef replacement in the Swift language@selector() in Swift?#pragma mark in Swift?Swift for loop: for index, element in array?dispatch_after - GCD in Swift?Swift Beta performance: sorting arraysSplit a String into an array in Swift?The use of Swift 3 @objc inference in Swift 4 mode is deprecated?How to optimize UITableViewCell, because my UITableView lags

Access current req object everywhere in Node.js ExpressWhy are global variables considered bad practice? (node.js)Using req & res across functionsHow do I get the path to the current script with Node.js?What is Node.js' Connect, Express and “middleware”?Node.js w/ express error handling in callbackHow to access the GET parameters after “?” in Express?Modify Node.js req object parametersAccess “app” variable inside of ExpressJS/ConnectJS middleware?Node.js Express app - request objectAngular Http Module considered middleware?Session variables in ExpressJSAdd properties to the req object in expressjs with Typescript