Quadratic interaction term dropping out in Stata but not in RInterpreting a quadratic logarithmic termStata ANOVA Interaction term differencesWhat to do when both the linear term and quadratic term are significant independently from each other, but not together?Interpreting logistic regression with an interaction and a quadratic term?Interpreting interaction with quadratic term
Why was ambassador Sondland involved in Ukraine?
Surmounting set-theoretical difficulties in algebraic geometry
TSP with revenue maximization
Why does Greedo say "Maclunkey" in the Mos Eisley Cantina?
What is the name of the fallacy involving white and black swans?
Did Ohio pass a law granting students the right to give scientifically wrong answers consistent with their religious beliefs?
Insets around a clock
Why would a family misspell their surname?
Locked folder with obscure app from Sourceforge, now cannot unlock folder
What latex template to use when you do not know the journal you are going to submit
How to break a equation with a single "summation symbol (sum) " common?
Several verbs in subjuntivo mood in one sentence
Beginner Tactics - Why Isn't This Mate?
I think one of my players wants to leave the campaign
Why doesn't remove_reference work on functions?
Days in indexed month
Ran out of space of newly installed HDD?
Does "solicit" mean the solicitor must receive what is being solicited in context of 52 U.S. Code Section 30121?
Why are rain clouds darker?
Can it be viewed as a negative for PhD applications in the US if I have children?
What was the deal with the news stories about rats in "Joker"?
Looking for a reference in Greek
Are we sinners because we sin or do we sin because we are sinners?
What other tricks were there to get more data onto floppy disks?
Quadratic interaction term dropping out in Stata but not in R
Interpreting a quadratic logarithmic termStata ANOVA Interaction term differencesWhat to do when both the linear term and quadratic term are significant independently from each other, but not together?Interpreting logistic regression with an interaction and a quadratic term?Interpreting interaction with quadratic term
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty
margin-bottom:0;
$begingroup$
With the same data and the same model specification, Stata is dropping one of the terms from my model while R seems to run it just fine.
For example:
library(tidyverse)
library(magrittr)
library(broom)
foo <- structure(list(inc = c(25545, 24879, 8843, 31668, 20335.25, 32625,
21060, 15633.75, 31267.5, 10505, 5167, 33926, 22050, 44863, 11731.875,
21659, 31927.5, 16235, 56647, 7751, 4343, 29402, 40756.5, 12779,
28030, 29452.5, 28650, 15925, 92692, 3789, 44887, 6201, 38373.75,
25511, 45837, 27500, 27171, 7984, 15925, 18947, 12870, 42234.75,
49813.5, 28490, 38373.75, 10040.625, 30150, 15406, 14015, 45000),
sex = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L,
2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L), .Label = c("male", "female"), class = "factor"),
year = c(1993, 1978, 1993, 1985, 2002, 1996, 2016, 2006,
2006, 1994, 1998, 1978, 2012, 1985, 2002, 1986, 2014, 1994,
1980, 1998, 1988, 1978, 2004, 1983, 1982, 2008, 1994, 2012,
1996, 1998, 2000, 1998, 2006, 1989, 1976, 1987, 2004, 1990,
2012, 1998, 2016, 2002, 2004, 2010, 2006, 2008, 1989, 1996,
1986, 1987)),
row.names = c(NA, -50L), class = c("tbl_df", "tbl", "data.frame"))
foo %<>%
mutate(year2 = year * year)
lm(data = foo,
inc ~ sex*year + sex*year2) %>%
tidy
#> # A tibble: 6 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -63645382. 97555833. -0.652 0.518
#> 2 sexfemale 164988796. 150579599. 1.10 0.279
#> 3 year 64070. 97815. 0.655 0.516
#> 4 year2 -16.1 24.5 -0.657 0.514
#> 5 sexfemale:year -165546. 150810. -1.10 0.278
#> 6 sexfemale:year2 41.5 37.8 1.10 0.277
The results above from R are what I expect.
However, in Stata, the sexfemale:year2
interaction term gets dropped.
Using RStata
:
library(RStata)
options("RStata.StataPath" = "") # you path here
options("RStata.StataVersion" = ) # your version here
stata("reg inc i.sex#c.year i.sex#c.year2 i.sex c.year c.year2",
data.in = foo)
#> . reg inc i.sex#c.year i.sex#c.year2 i.sex c.year c.year2
#> note: 2.sex#c.year2 omitted because of collinearity
#>
#> Source | SS df MS Number of obs = 50
#> -------------+------------------------------ F( 4, 45) = 3.36
#> Model | 3.0037e+09 4 750914046 Prob > F = 0.0173
#> Residual | 1.0058e+10 45 223510687 R-squared = 0.2300
#> -------------+------------------------------ Adj R-squared = 0.1615
#> Total | 1.3062e+10 49 266564022 Root MSE = 14950
#>
#> ------------------------------------------------------------------------------
#> inc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
#> -------------+----------------------------------------------------------------
#> sex#c.year |
#> female | 287.9784 427.8164 0.67 0.504 -573.6881 1149.645
#> |
#> sex#c.year2 |
#> female | 0 (omitted)
#> |
#> sex |
#> female | -589746.3 854533.4 -0.69 0.494 -2310865 1131372
#> year | -5771.122 74562.38 -0.08 0.939 -155947.5 144405.2
#> year2 | 1.391068 18.68956 0.07 0.941 -36.25164 39.03378
#> _cons | 6009728 7.44e+07 0.08 0.936 -1.44e+08 1.56e+08
#> ------------------------------------------------------------------------------
If you reverse inc
and year
(i.e., if you regress year
on inc
and an inc*inc
variable, both interacted with sex
), you get six terms as expected, and results from Stata and R agree. If you tell R to omit the term that Stata is dropping, they'll agree then too. I cannot seem to get Stata to give me all six terms with the original model.
What am I missing?
Here's Stata code to reproduce the dataset into Stata directly:
* Example generated by -dataex-. To install: ssc install dataex
clear
input double inc long sex double(year year2)
25545 2 1993 3972049
24879 2 1978 3912484
8843 2 1993 3972049
31668 1 1985 3940225
20335.25 1 2002 4008004
32625 1 1996 3984016
21060 1 2016 4064256
15633.75 2 2006 4024036
31267.5 2 2006 4024036
10505 2 1994 3976036
5167 2 1998 3992004
33926 1 1978 3912484
22050 2 2012 4048144
44863 1 1985 3940225
11731.875 1 2002 4008004
21659 2 1986 3944196
31927.5 2 2014 4056196
16235 1 1994 3976036
56647 1 1980 3920400
7751 2 1998 3992004
4343 1 1988 3952144
29402 1 1978 3912484
40756.5 1 2004 4016016
12779 1 1983 3932289
28030 1 1982 3928324
29452.5 1 2008 4032064
28650 1 1994 3976036
15925 1 2012 4048144
92692 1 1996 3984016
3789 2 1998 3992004
44887 1 2000 4000000
6201 2 1998 3992004
38373.75 1 2006 4024036
25511 2 1989 3956121
45837 1 1976 3904576
27500 2 1987 3948169
27171 1 2004 4016016
7984 2 1990 3960100
15925 2 2012 4048144
18947 2 1998 3992004
12870 2 2016 4064256
42234.75 1 2002 4008004
49813.5 2 2004 4016016
28490 1 2010 4040100
38373.75 1 2006 4024036
10040.625 2 2008 4032064
30150 1 1989 3956121
15406 2 1996 3984016
14015 2 1986 3944196
45000 1 1987 3948169
end
label values sex sex
label def sex 1 "male", modify
label def sex 2 "female", modify
r regression stata
$endgroup$
migrated from stackoverflow.com Mar 30 at 1:40
This question came from our site for professional and enthusiast programmers.
|
show 10 more comments
$begingroup$
With the same data and the same model specification, Stata is dropping one of the terms from my model while R seems to run it just fine.
For example:
library(tidyverse)
library(magrittr)
library(broom)
foo <- structure(list(inc = c(25545, 24879, 8843, 31668, 20335.25, 32625,
21060, 15633.75, 31267.5, 10505, 5167, 33926, 22050, 44863, 11731.875,
21659, 31927.5, 16235, 56647, 7751, 4343, 29402, 40756.5, 12779,
28030, 29452.5, 28650, 15925, 92692, 3789, 44887, 6201, 38373.75,
25511, 45837, 27500, 27171, 7984, 15925, 18947, 12870, 42234.75,
49813.5, 28490, 38373.75, 10040.625, 30150, 15406, 14015, 45000),
sex = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L,
2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L), .Label = c("male", "female"), class = "factor"),
year = c(1993, 1978, 1993, 1985, 2002, 1996, 2016, 2006,
2006, 1994, 1998, 1978, 2012, 1985, 2002, 1986, 2014, 1994,
1980, 1998, 1988, 1978, 2004, 1983, 1982, 2008, 1994, 2012,
1996, 1998, 2000, 1998, 2006, 1989, 1976, 1987, 2004, 1990,
2012, 1998, 2016, 2002, 2004, 2010, 2006, 2008, 1989, 1996,
1986, 1987)),
row.names = c(NA, -50L), class = c("tbl_df", "tbl", "data.frame"))
foo %<>%
mutate(year2 = year * year)
lm(data = foo,
inc ~ sex*year + sex*year2) %>%
tidy
#> # A tibble: 6 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -63645382. 97555833. -0.652 0.518
#> 2 sexfemale 164988796. 150579599. 1.10 0.279
#> 3 year 64070. 97815. 0.655 0.516
#> 4 year2 -16.1 24.5 -0.657 0.514
#> 5 sexfemale:year -165546. 150810. -1.10 0.278
#> 6 sexfemale:year2 41.5 37.8 1.10 0.277
The results above from R are what I expect.
However, in Stata, the sexfemale:year2
interaction term gets dropped.
Using RStata
:
library(RStata)
options("RStata.StataPath" = "") # you path here
options("RStata.StataVersion" = ) # your version here
stata("reg inc i.sex#c.year i.sex#c.year2 i.sex c.year c.year2",
data.in = foo)
#> . reg inc i.sex#c.year i.sex#c.year2 i.sex c.year c.year2
#> note: 2.sex#c.year2 omitted because of collinearity
#>
#> Source | SS df MS Number of obs = 50
#> -------------+------------------------------ F( 4, 45) = 3.36
#> Model | 3.0037e+09 4 750914046 Prob > F = 0.0173
#> Residual | 1.0058e+10 45 223510687 R-squared = 0.2300
#> -------------+------------------------------ Adj R-squared = 0.1615
#> Total | 1.3062e+10 49 266564022 Root MSE = 14950
#>
#> ------------------------------------------------------------------------------
#> inc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
#> -------------+----------------------------------------------------------------
#> sex#c.year |
#> female | 287.9784 427.8164 0.67 0.504 -573.6881 1149.645
#> |
#> sex#c.year2 |
#> female | 0 (omitted)
#> |
#> sex |
#> female | -589746.3 854533.4 -0.69 0.494 -2310865 1131372
#> year | -5771.122 74562.38 -0.08 0.939 -155947.5 144405.2
#> year2 | 1.391068 18.68956 0.07 0.941 -36.25164 39.03378
#> _cons | 6009728 7.44e+07 0.08 0.936 -1.44e+08 1.56e+08
#> ------------------------------------------------------------------------------
If you reverse inc
and year
(i.e., if you regress year
on inc
and an inc*inc
variable, both interacted with sex
), you get six terms as expected, and results from Stata and R agree. If you tell R to omit the term that Stata is dropping, they'll agree then too. I cannot seem to get Stata to give me all six terms with the original model.
What am I missing?
Here's Stata code to reproduce the dataset into Stata directly:
* Example generated by -dataex-. To install: ssc install dataex
clear
input double inc long sex double(year year2)
25545 2 1993 3972049
24879 2 1978 3912484
8843 2 1993 3972049
31668 1 1985 3940225
20335.25 1 2002 4008004
32625 1 1996 3984016
21060 1 2016 4064256
15633.75 2 2006 4024036
31267.5 2 2006 4024036
10505 2 1994 3976036
5167 2 1998 3992004
33926 1 1978 3912484
22050 2 2012 4048144
44863 1 1985 3940225
11731.875 1 2002 4008004
21659 2 1986 3944196
31927.5 2 2014 4056196
16235 1 1994 3976036
56647 1 1980 3920400
7751 2 1998 3992004
4343 1 1988 3952144
29402 1 1978 3912484
40756.5 1 2004 4016016
12779 1 1983 3932289
28030 1 1982 3928324
29452.5 1 2008 4032064
28650 1 1994 3976036
15925 1 2012 4048144
92692 1 1996 3984016
3789 2 1998 3992004
44887 1 2000 4000000
6201 2 1998 3992004
38373.75 1 2006 4024036
25511 2 1989 3956121
45837 1 1976 3904576
27500 2 1987 3948169
27171 1 2004 4016016
7984 2 1990 3960100
15925 2 2012 4048144
18947 2 1998 3992004
12870 2 2016 4064256
42234.75 1 2002 4008004
49813.5 2 2004 4016016
28490 1 2010 4040100
38373.75 1 2006 4024036
10040.625 2 2008 4032064
30150 1 1989 3956121
15406 2 1996 3984016
14015 2 1986 3944196
45000 1 1987 3948169
end
label values sex sex
label def sex 1 "male", modify
label def sex 2 "female", modify
r regression stata
$endgroup$
migrated from stackoverflow.com Mar 30 at 1:40
This question came from our site for professional and enthusiast programmers.
3
$begingroup$
Stata thinks that the correlation betweenyear
andyear2
is about 0.999996121207278543. I am happy to regard that as collinear for practical purposes. Your point about documentation is a serious one: perhaps you can tell us where for R in general orlm()
in particular it is documented what it does or does not allow in these cases and why.
$endgroup$
– Nick Cox
Mar 29 at 1:28
3
$begingroup$
To me the main issue here is statistical. A parameterisation in terms of say (year - 2000) and its square is much easier to think about and at the same time cuts out problems with precision and instability.
$endgroup$
– Nick Cox
Mar 29 at 1:33
4
$begingroup$
My insight into this particular issue is just one of experience in usingregress
for 28 years over versions of Stata since 2.05. First, almost always when a predictor is thrown out it is for the same reasons as in R, perfect collinearity. Second, I've seen this kind of thing only occasionally and so suspect that Stata is only a little more strict than R. Stata isn't written for idiots but for users who want to be smart even if they are learning. So, most of the time the user is expected to look at output of coefficients and standard errors and work out what is a bad idea.
$endgroup$
– Nick Cox
Mar 29 at 2:13
4
$begingroup$
Avoiding calendar years as predictors is something I have warned about in various places, e.g. Cross Validated and Statalist.
$endgroup$
– Nick Cox
Mar 29 at 2:14
2
$begingroup$
The high p-values provide a clue. Indeed, the eigenvalues of the covariance matrix of (year, year2, inc) range over nearly 13 orders of magnitude. This totally wipes out the precision inherent in the least squares calculations (which typically is 12 to 13 orders of magnitude). By includingyear*year
in these models, you are just confusing the calculations with noise. If you really think it must be included, then improve your chances at seeing the signal withlm(inc ~ sex * poly(year, 2), foo)
.
$endgroup$
– whuber♦
Mar 30 at 14:43
|
show 10 more comments
$begingroup$
With the same data and the same model specification, Stata is dropping one of the terms from my model while R seems to run it just fine.
For example:
library(tidyverse)
library(magrittr)
library(broom)
foo <- structure(list(inc = c(25545, 24879, 8843, 31668, 20335.25, 32625,
21060, 15633.75, 31267.5, 10505, 5167, 33926, 22050, 44863, 11731.875,
21659, 31927.5, 16235, 56647, 7751, 4343, 29402, 40756.5, 12779,
28030, 29452.5, 28650, 15925, 92692, 3789, 44887, 6201, 38373.75,
25511, 45837, 27500, 27171, 7984, 15925, 18947, 12870, 42234.75,
49813.5, 28490, 38373.75, 10040.625, 30150, 15406, 14015, 45000),
sex = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L,
2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L), .Label = c("male", "female"), class = "factor"),
year = c(1993, 1978, 1993, 1985, 2002, 1996, 2016, 2006,
2006, 1994, 1998, 1978, 2012, 1985, 2002, 1986, 2014, 1994,
1980, 1998, 1988, 1978, 2004, 1983, 1982, 2008, 1994, 2012,
1996, 1998, 2000, 1998, 2006, 1989, 1976, 1987, 2004, 1990,
2012, 1998, 2016, 2002, 2004, 2010, 2006, 2008, 1989, 1996,
1986, 1987)),
row.names = c(NA, -50L), class = c("tbl_df", "tbl", "data.frame"))
foo %<>%
mutate(year2 = year * year)
lm(data = foo,
inc ~ sex*year + sex*year2) %>%
tidy
#> # A tibble: 6 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -63645382. 97555833. -0.652 0.518
#> 2 sexfemale 164988796. 150579599. 1.10 0.279
#> 3 year 64070. 97815. 0.655 0.516
#> 4 year2 -16.1 24.5 -0.657 0.514
#> 5 sexfemale:year -165546. 150810. -1.10 0.278
#> 6 sexfemale:year2 41.5 37.8 1.10 0.277
The results above from R are what I expect.
However, in Stata, the sexfemale:year2
interaction term gets dropped.
Using RStata
:
library(RStata)
options("RStata.StataPath" = "") # you path here
options("RStata.StataVersion" = ) # your version here
stata("reg inc i.sex#c.year i.sex#c.year2 i.sex c.year c.year2",
data.in = foo)
#> . reg inc i.sex#c.year i.sex#c.year2 i.sex c.year c.year2
#> note: 2.sex#c.year2 omitted because of collinearity
#>
#> Source | SS df MS Number of obs = 50
#> -------------+------------------------------ F( 4, 45) = 3.36
#> Model | 3.0037e+09 4 750914046 Prob > F = 0.0173
#> Residual | 1.0058e+10 45 223510687 R-squared = 0.2300
#> -------------+------------------------------ Adj R-squared = 0.1615
#> Total | 1.3062e+10 49 266564022 Root MSE = 14950
#>
#> ------------------------------------------------------------------------------
#> inc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
#> -------------+----------------------------------------------------------------
#> sex#c.year |
#> female | 287.9784 427.8164 0.67 0.504 -573.6881 1149.645
#> |
#> sex#c.year2 |
#> female | 0 (omitted)
#> |
#> sex |
#> female | -589746.3 854533.4 -0.69 0.494 -2310865 1131372
#> year | -5771.122 74562.38 -0.08 0.939 -155947.5 144405.2
#> year2 | 1.391068 18.68956 0.07 0.941 -36.25164 39.03378
#> _cons | 6009728 7.44e+07 0.08 0.936 -1.44e+08 1.56e+08
#> ------------------------------------------------------------------------------
If you reverse inc
and year
(i.e., if you regress year
on inc
and an inc*inc
variable, both interacted with sex
), you get six terms as expected, and results from Stata and R agree. If you tell R to omit the term that Stata is dropping, they'll agree then too. I cannot seem to get Stata to give me all six terms with the original model.
What am I missing?
Here's Stata code to reproduce the dataset into Stata directly:
* Example generated by -dataex-. To install: ssc install dataex
clear
input double inc long sex double(year year2)
25545 2 1993 3972049
24879 2 1978 3912484
8843 2 1993 3972049
31668 1 1985 3940225
20335.25 1 2002 4008004
32625 1 1996 3984016
21060 1 2016 4064256
15633.75 2 2006 4024036
31267.5 2 2006 4024036
10505 2 1994 3976036
5167 2 1998 3992004
33926 1 1978 3912484
22050 2 2012 4048144
44863 1 1985 3940225
11731.875 1 2002 4008004
21659 2 1986 3944196
31927.5 2 2014 4056196
16235 1 1994 3976036
56647 1 1980 3920400
7751 2 1998 3992004
4343 1 1988 3952144
29402 1 1978 3912484
40756.5 1 2004 4016016
12779 1 1983 3932289
28030 1 1982 3928324
29452.5 1 2008 4032064
28650 1 1994 3976036
15925 1 2012 4048144
92692 1 1996 3984016
3789 2 1998 3992004
44887 1 2000 4000000
6201 2 1998 3992004
38373.75 1 2006 4024036
25511 2 1989 3956121
45837 1 1976 3904576
27500 2 1987 3948169
27171 1 2004 4016016
7984 2 1990 3960100
15925 2 2012 4048144
18947 2 1998 3992004
12870 2 2016 4064256
42234.75 1 2002 4008004
49813.5 2 2004 4016016
28490 1 2010 4040100
38373.75 1 2006 4024036
10040.625 2 2008 4032064
30150 1 1989 3956121
15406 2 1996 3984016
14015 2 1986 3944196
45000 1 1987 3948169
end
label values sex sex
label def sex 1 "male", modify
label def sex 2 "female", modify
r regression stata
$endgroup$
With the same data and the same model specification, Stata is dropping one of the terms from my model while R seems to run it just fine.
For example:
library(tidyverse)
library(magrittr)
library(broom)
foo <- structure(list(inc = c(25545, 24879, 8843, 31668, 20335.25, 32625,
21060, 15633.75, 31267.5, 10505, 5167, 33926, 22050, 44863, 11731.875,
21659, 31927.5, 16235, 56647, 7751, 4343, 29402, 40756.5, 12779,
28030, 29452.5, 28650, 15925, 92692, 3789, 44887, 6201, 38373.75,
25511, 45837, 27500, 27171, 7984, 15925, 18947, 12870, 42234.75,
49813.5, 28490, 38373.75, 10040.625, 30150, 15406, 14015, 45000),
sex = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L,
2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L), .Label = c("male", "female"), class = "factor"),
year = c(1993, 1978, 1993, 1985, 2002, 1996, 2016, 2006,
2006, 1994, 1998, 1978, 2012, 1985, 2002, 1986, 2014, 1994,
1980, 1998, 1988, 1978, 2004, 1983, 1982, 2008, 1994, 2012,
1996, 1998, 2000, 1998, 2006, 1989, 1976, 1987, 2004, 1990,
2012, 1998, 2016, 2002, 2004, 2010, 2006, 2008, 1989, 1996,
1986, 1987)),
row.names = c(NA, -50L), class = c("tbl_df", "tbl", "data.frame"))
foo %<>%
mutate(year2 = year * year)
lm(data = foo,
inc ~ sex*year + sex*year2) %>%
tidy
#> # A tibble: 6 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -63645382. 97555833. -0.652 0.518
#> 2 sexfemale 164988796. 150579599. 1.10 0.279
#> 3 year 64070. 97815. 0.655 0.516
#> 4 year2 -16.1 24.5 -0.657 0.514
#> 5 sexfemale:year -165546. 150810. -1.10 0.278
#> 6 sexfemale:year2 41.5 37.8 1.10 0.277
The results above from R are what I expect.
However, in Stata, the sexfemale:year2
interaction term gets dropped.
Using RStata
:
library(RStata)
options("RStata.StataPath" = "") # you path here
options("RStata.StataVersion" = ) # your version here
stata("reg inc i.sex#c.year i.sex#c.year2 i.sex c.year c.year2",
data.in = foo)
#> . reg inc i.sex#c.year i.sex#c.year2 i.sex c.year c.year2
#> note: 2.sex#c.year2 omitted because of collinearity
#>
#> Source | SS df MS Number of obs = 50
#> -------------+------------------------------ F( 4, 45) = 3.36
#> Model | 3.0037e+09 4 750914046 Prob > F = 0.0173
#> Residual | 1.0058e+10 45 223510687 R-squared = 0.2300
#> -------------+------------------------------ Adj R-squared = 0.1615
#> Total | 1.3062e+10 49 266564022 Root MSE = 14950
#>
#> ------------------------------------------------------------------------------
#> inc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
#> -------------+----------------------------------------------------------------
#> sex#c.year |
#> female | 287.9784 427.8164 0.67 0.504 -573.6881 1149.645
#> |
#> sex#c.year2 |
#> female | 0 (omitted)
#> |
#> sex |
#> female | -589746.3 854533.4 -0.69 0.494 -2310865 1131372
#> year | -5771.122 74562.38 -0.08 0.939 -155947.5 144405.2
#> year2 | 1.391068 18.68956 0.07 0.941 -36.25164 39.03378
#> _cons | 6009728 7.44e+07 0.08 0.936 -1.44e+08 1.56e+08
#> ------------------------------------------------------------------------------
If you reverse inc
and year
(i.e., if you regress year
on inc
and an inc*inc
variable, both interacted with sex
), you get six terms as expected, and results from Stata and R agree. If you tell R to omit the term that Stata is dropping, they'll agree then too. I cannot seem to get Stata to give me all six terms with the original model.
What am I missing?
Here's Stata code to reproduce the dataset into Stata directly:
* Example generated by -dataex-. To install: ssc install dataex
clear
input double inc long sex double(year year2)
25545 2 1993 3972049
24879 2 1978 3912484
8843 2 1993 3972049
31668 1 1985 3940225
20335.25 1 2002 4008004
32625 1 1996 3984016
21060 1 2016 4064256
15633.75 2 2006 4024036
31267.5 2 2006 4024036
10505 2 1994 3976036
5167 2 1998 3992004
33926 1 1978 3912484
22050 2 2012 4048144
44863 1 1985 3940225
11731.875 1 2002 4008004
21659 2 1986 3944196
31927.5 2 2014 4056196
16235 1 1994 3976036
56647 1 1980 3920400
7751 2 1998 3992004
4343 1 1988 3952144
29402 1 1978 3912484
40756.5 1 2004 4016016
12779 1 1983 3932289
28030 1 1982 3928324
29452.5 1 2008 4032064
28650 1 1994 3976036
15925 1 2012 4048144
92692 1 1996 3984016
3789 2 1998 3992004
44887 1 2000 4000000
6201 2 1998 3992004
38373.75 1 2006 4024036
25511 2 1989 3956121
45837 1 1976 3904576
27500 2 1987 3948169
27171 1 2004 4016016
7984 2 1990 3960100
15925 2 2012 4048144
18947 2 1998 3992004
12870 2 2016 4064256
42234.75 1 2002 4008004
49813.5 2 2004 4016016
28490 1 2010 4040100
38373.75 1 2006 4024036
10040.625 2 2008 4032064
30150 1 1989 3956121
15406 2 1996 3984016
14015 2 1986 3944196
45000 1 1987 3948169
end
label values sex sex
label def sex 1 "male", modify
label def sex 2 "female", modify
r regression stata
r regression stata
asked Mar 28 at 22:10
lostlost
1135 bronze badges
1135 bronze badges
migrated from stackoverflow.com Mar 30 at 1:40
This question came from our site for professional and enthusiast programmers.
migrated from stackoverflow.com Mar 30 at 1:40
This question came from our site for professional and enthusiast programmers.
migrated from stackoverflow.com Mar 30 at 1:40
This question came from our site for professional and enthusiast programmers.
3
$begingroup$
Stata thinks that the correlation betweenyear
andyear2
is about 0.999996121207278543. I am happy to regard that as collinear for practical purposes. Your point about documentation is a serious one: perhaps you can tell us where for R in general orlm()
in particular it is documented what it does or does not allow in these cases and why.
$endgroup$
– Nick Cox
Mar 29 at 1:28
3
$begingroup$
To me the main issue here is statistical. A parameterisation in terms of say (year - 2000) and its square is much easier to think about and at the same time cuts out problems with precision and instability.
$endgroup$
– Nick Cox
Mar 29 at 1:33
4
$begingroup$
My insight into this particular issue is just one of experience in usingregress
for 28 years over versions of Stata since 2.05. First, almost always when a predictor is thrown out it is for the same reasons as in R, perfect collinearity. Second, I've seen this kind of thing only occasionally and so suspect that Stata is only a little more strict than R. Stata isn't written for idiots but for users who want to be smart even if they are learning. So, most of the time the user is expected to look at output of coefficients and standard errors and work out what is a bad idea.
$endgroup$
– Nick Cox
Mar 29 at 2:13
4
$begingroup$
Avoiding calendar years as predictors is something I have warned about in various places, e.g. Cross Validated and Statalist.
$endgroup$
– Nick Cox
Mar 29 at 2:14
2
$begingroup$
The high p-values provide a clue. Indeed, the eigenvalues of the covariance matrix of (year, year2, inc) range over nearly 13 orders of magnitude. This totally wipes out the precision inherent in the least squares calculations (which typically is 12 to 13 orders of magnitude). By includingyear*year
in these models, you are just confusing the calculations with noise. If you really think it must be included, then improve your chances at seeing the signal withlm(inc ~ sex * poly(year, 2), foo)
.
$endgroup$
– whuber♦
Mar 30 at 14:43
|
show 10 more comments
3
$begingroup$
Stata thinks that the correlation betweenyear
andyear2
is about 0.999996121207278543. I am happy to regard that as collinear for practical purposes. Your point about documentation is a serious one: perhaps you can tell us where for R in general orlm()
in particular it is documented what it does or does not allow in these cases and why.
$endgroup$
– Nick Cox
Mar 29 at 1:28
3
$begingroup$
To me the main issue here is statistical. A parameterisation in terms of say (year - 2000) and its square is much easier to think about and at the same time cuts out problems with precision and instability.
$endgroup$
– Nick Cox
Mar 29 at 1:33
4
$begingroup$
My insight into this particular issue is just one of experience in usingregress
for 28 years over versions of Stata since 2.05. First, almost always when a predictor is thrown out it is for the same reasons as in R, perfect collinearity. Second, I've seen this kind of thing only occasionally and so suspect that Stata is only a little more strict than R. Stata isn't written for idiots but for users who want to be smart even if they are learning. So, most of the time the user is expected to look at output of coefficients and standard errors and work out what is a bad idea.
$endgroup$
– Nick Cox
Mar 29 at 2:13
4
$begingroup$
Avoiding calendar years as predictors is something I have warned about in various places, e.g. Cross Validated and Statalist.
$endgroup$
– Nick Cox
Mar 29 at 2:14
2
$begingroup$
The high p-values provide a clue. Indeed, the eigenvalues of the covariance matrix of (year, year2, inc) range over nearly 13 orders of magnitude. This totally wipes out the precision inherent in the least squares calculations (which typically is 12 to 13 orders of magnitude). By includingyear*year
in these models, you are just confusing the calculations with noise. If you really think it must be included, then improve your chances at seeing the signal withlm(inc ~ sex * poly(year, 2), foo)
.
$endgroup$
– whuber♦
Mar 30 at 14:43
3
3
$begingroup$
Stata thinks that the correlation between
year
and year2
is about 0.999996121207278543. I am happy to regard that as collinear for practical purposes. Your point about documentation is a serious one: perhaps you can tell us where for R in general or lm()
in particular it is documented what it does or does not allow in these cases and why.$endgroup$
– Nick Cox
Mar 29 at 1:28
$begingroup$
Stata thinks that the correlation between
year
and year2
is about 0.999996121207278543. I am happy to regard that as collinear for practical purposes. Your point about documentation is a serious one: perhaps you can tell us where for R in general or lm()
in particular it is documented what it does or does not allow in these cases and why.$endgroup$
– Nick Cox
Mar 29 at 1:28
3
3
$begingroup$
To me the main issue here is statistical. A parameterisation in terms of say (year - 2000) and its square is much easier to think about and at the same time cuts out problems with precision and instability.
$endgroup$
– Nick Cox
Mar 29 at 1:33
$begingroup$
To me the main issue here is statistical. A parameterisation in terms of say (year - 2000) and its square is much easier to think about and at the same time cuts out problems with precision and instability.
$endgroup$
– Nick Cox
Mar 29 at 1:33
4
4
$begingroup$
My insight into this particular issue is just one of experience in using
regress
for 28 years over versions of Stata since 2.05. First, almost always when a predictor is thrown out it is for the same reasons as in R, perfect collinearity. Second, I've seen this kind of thing only occasionally and so suspect that Stata is only a little more strict than R. Stata isn't written for idiots but for users who want to be smart even if they are learning. So, most of the time the user is expected to look at output of coefficients and standard errors and work out what is a bad idea.$endgroup$
– Nick Cox
Mar 29 at 2:13
$begingroup$
My insight into this particular issue is just one of experience in using
regress
for 28 years over versions of Stata since 2.05. First, almost always when a predictor is thrown out it is for the same reasons as in R, perfect collinearity. Second, I've seen this kind of thing only occasionally and so suspect that Stata is only a little more strict than R. Stata isn't written for idiots but for users who want to be smart even if they are learning. So, most of the time the user is expected to look at output of coefficients and standard errors and work out what is a bad idea.$endgroup$
– Nick Cox
Mar 29 at 2:13
4
4
$begingroup$
Avoiding calendar years as predictors is something I have warned about in various places, e.g. Cross Validated and Statalist.
$endgroup$
– Nick Cox
Mar 29 at 2:14
$begingroup$
Avoiding calendar years as predictors is something I have warned about in various places, e.g. Cross Validated and Statalist.
$endgroup$
– Nick Cox
Mar 29 at 2:14
2
2
$begingroup$
The high p-values provide a clue. Indeed, the eigenvalues of the covariance matrix of (year, year2, inc) range over nearly 13 orders of magnitude. This totally wipes out the precision inherent in the least squares calculations (which typically is 12 to 13 orders of magnitude). By including
year*year
in these models, you are just confusing the calculations with noise. If you really think it must be included, then improve your chances at seeing the signal with lm(inc ~ sex * poly(year, 2), foo)
.$endgroup$
– whuber♦
Mar 30 at 14:43
$begingroup$
The high p-values provide a clue. Indeed, the eigenvalues of the covariance matrix of (year, year2, inc) range over nearly 13 orders of magnitude. This totally wipes out the precision inherent in the least squares calculations (which typically is 12 to 13 orders of magnitude). By including
year*year
in these models, you are just confusing the calculations with noise. If you really think it must be included, then improve your chances at seeing the signal with lm(inc ~ sex * poly(year, 2), foo)
.$endgroup$
– whuber♦
Mar 30 at 14:43
|
show 10 more comments
0
active
oldest
votes
Your Answer
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "65"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/4.0/"u003ecc by-sa 4.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f400223%2fquadratic-interaction-term-dropping-out-in-stata-but-not-in-r%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
0
active
oldest
votes
0
active
oldest
votes
active
oldest
votes
active
oldest
votes
Thanks for contributing an answer to Cross Validated!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f400223%2fquadratic-interaction-term-dropping-out-in-stata-but-not-in-r%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
3
$begingroup$
Stata thinks that the correlation between
year
andyear2
is about 0.999996121207278543. I am happy to regard that as collinear for practical purposes. Your point about documentation is a serious one: perhaps you can tell us where for R in general orlm()
in particular it is documented what it does or does not allow in these cases and why.$endgroup$
– Nick Cox
Mar 29 at 1:28
3
$begingroup$
To me the main issue here is statistical. A parameterisation in terms of say (year - 2000) and its square is much easier to think about and at the same time cuts out problems with precision and instability.
$endgroup$
– Nick Cox
Mar 29 at 1:33
4
$begingroup$
My insight into this particular issue is just one of experience in using
regress
for 28 years over versions of Stata since 2.05. First, almost always when a predictor is thrown out it is for the same reasons as in R, perfect collinearity. Second, I've seen this kind of thing only occasionally and so suspect that Stata is only a little more strict than R. Stata isn't written for idiots but for users who want to be smart even if they are learning. So, most of the time the user is expected to look at output of coefficients and standard errors and work out what is a bad idea.$endgroup$
– Nick Cox
Mar 29 at 2:13
4
$begingroup$
Avoiding calendar years as predictors is something I have warned about in various places, e.g. Cross Validated and Statalist.
$endgroup$
– Nick Cox
Mar 29 at 2:14
2
$begingroup$
The high p-values provide a clue. Indeed, the eigenvalues of the covariance matrix of (year, year2, inc) range over nearly 13 orders of magnitude. This totally wipes out the precision inherent in the least squares calculations (which typically is 12 to 13 orders of magnitude). By including
year*year
in these models, you are just confusing the calculations with noise. If you really think it must be included, then improve your chances at seeing the signal withlm(inc ~ sex * poly(year, 2), foo)
.$endgroup$
– whuber♦
Mar 30 at 14:43