Quadratic interaction term dropping out in Stata but not in RInterpreting a quadratic logarithmic termStata ANOVA Interaction term differencesWhat to do when both the linear term and quadratic term are significant independently from each other, but not together?Interpreting logistic regression with an interaction and a quadratic term?Interpreting interaction with quadratic term

Why was ambassador Sondland involved in Ukraine?

Surmounting set-theoretical difficulties in algebraic geometry

TSP with revenue maximization

Why does Greedo say "Maclunkey" in the Mos Eisley Cantina?

What is the name of the fallacy involving white and black swans?

Did Ohio pass a law granting students the right to give scientifically wrong answers consistent with their religious beliefs?

Insets around a clock

Why would a family misspell their surname?

Locked folder with obscure app from Sourceforge, now cannot unlock folder

What latex template to use when you do not know the journal you are going to submit

How to break a equation with a single "summation symbol (sum) " common?

Several verbs in subjuntivo mood in one sentence

Beginner Tactics - Why Isn't This Mate?

I think one of my players wants to leave the campaign

Why doesn't remove_reference work on functions?

Days in indexed month

Ran out of space of newly installed HDD?

Does "solicit" mean the solicitor must receive what is being solicited in context of 52 U.S. Code Section 30121?

Why are rain clouds darker?

Can it be viewed as a negative for PhD applications in the US if I have children?

What was the deal with the news stories about rats in "Joker"?

Looking for a reference in Greek

Are we sinners because we sin or do we sin because we are sinners?

What other tricks were there to get more data onto floppy disks?

Quadratic interaction term dropping out in Stata but not in R

Interpreting a quadratic logarithmic termStata ANOVA Interaction term differencesWhat to do when both the linear term and quadratic term are significant independently from each other, but not together?Interpreting logistic regression with an interaction and a quadratic term?Interpreting interaction with quadratic term

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty
margin-bottom:0;

With the same data and the same model specification, Stata is dropping one of the terms from my model while R seems to run it just fine.

For example:

library(tidyverse)
library(magrittr)
library(broom)

foo <- structure(list(inc = c(25545, 24879, 8843, 31668, 20335.25, 32625, 
 21060, 15633.75, 31267.5, 10505, 5167, 33926, 22050, 44863, 11731.875, 
 21659, 31927.5, 16235, 56647, 7751, 4343, 29402, 40756.5, 12779, 
 28030, 29452.5, 28650, 15925, 92692, 3789, 44887, 6201, 38373.75, 
 25511, 45837, 27500, 27171, 7984, 15925, 18947, 12870, 42234.75, 
 49813.5, 28490, 38373.75, 10040.625, 30150, 15406, 14015, 45000), 
 sex = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 
 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 
 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L), .Label = c("male", "female"), class = "factor"), 
 year = c(1993, 1978, 1993, 1985, 2002, 1996, 2016, 2006, 
 2006, 1994, 1998, 1978, 2012, 1985, 2002, 1986, 2014, 1994, 
 1980, 1998, 1988, 1978, 2004, 1983, 1982, 2008, 1994, 2012, 
 1996, 1998, 2000, 1998, 2006, 1989, 1976, 1987, 2004, 1990, 
 2012, 1998, 2016, 2002, 2004, 2010, 2006, 2008, 1989, 1996, 
 1986, 1987)), 
 row.names = c(NA, -50L), class = c("tbl_df", "tbl", "data.frame"))

foo %<>%
 mutate(year2 = year * year)

lm(data = foo,
 inc ~ sex*year + sex*year2) %>%
 tidy
#> # A tibble: 6 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -63645382. 97555833. -0.652 0.518
#> 2 sexfemale 164988796. 150579599. 1.10 0.279
#> 3 year 64070. 97815. 0.655 0.516
#> 4 year2 -16.1 24.5 -0.657 0.514
#> 5 sexfemale:year -165546. 150810. -1.10 0.278
#> 6 sexfemale:year2 41.5 37.8 1.10 0.277

The results above from R are what I expect.

However, in Stata, the sexfemale:year2 interaction term gets dropped.

Using RStata:

library(RStata)
options("RStata.StataPath" = "") # you path here
options("RStata.StataVersion" = ) # your version here

stata("reg inc i.sex#c.year i.sex#c.year2 i.sex c.year c.year2",
 data.in = foo)
#> . reg inc i.sex#c.year i.sex#c.year2 i.sex c.year c.year2
#> note: 2.sex#c.year2 omitted because of collinearity
#> 
#> Source | SS df MS Number of obs = 50
#> -------------+------------------------------ F( 4, 45) = 3.36
#> Model | 3.0037e+09 4 750914046 Prob > F = 0.0173
#> Residual | 1.0058e+10 45 223510687 R-squared = 0.2300
#> -------------+------------------------------ Adj R-squared = 0.1615
#> Total | 1.3062e+10 49 266564022 Root MSE = 14950
#> 
#> ------------------------------------------------------------------------------
#> inc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
#> -------------+----------------------------------------------------------------
#> sex#c.year |
#> female | 287.9784 427.8164 0.67 0.504 -573.6881 1149.645
#> |
#> sex#c.year2 |
#> female | 0 (omitted)
#> |
#> sex |
#> female | -589746.3 854533.4 -0.69 0.494 -2310865 1131372
#> year | -5771.122 74562.38 -0.08 0.939 -155947.5 144405.2
#> year2 | 1.391068 18.68956 0.07 0.941 -36.25164 39.03378
#> _cons | 6009728 7.44e+07 0.08 0.936 -1.44e+08 1.56e+08
#> ------------------------------------------------------------------------------

If you reverse inc and year (i.e., if you regress year on inc and an inc*inc variable, both interacted with sex), you get six terms as expected, and results from Stata and R agree. If you tell R to omit the term that Stata is dropping, they'll agree then too. I cannot seem to get Stata to give me all six terms with the original model.

What am I missing?

Here's Stata code to reproduce the dataset into Stata directly:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double inc long sex double(year year2)
 25545 2 1993 3972049
 24879 2 1978 3912484
 8843 2 1993 3972049
 31668 1 1985 3940225
 20335.25 1 2002 4008004
 32625 1 1996 3984016
 21060 1 2016 4064256
 15633.75 2 2006 4024036
 31267.5 2 2006 4024036
 10505 2 1994 3976036
 5167 2 1998 3992004
 33926 1 1978 3912484
 22050 2 2012 4048144
 44863 1 1985 3940225
11731.875 1 2002 4008004
 21659 2 1986 3944196
 31927.5 2 2014 4056196
 16235 1 1994 3976036
 56647 1 1980 3920400
 7751 2 1998 3992004
 4343 1 1988 3952144
 29402 1 1978 3912484
 40756.5 1 2004 4016016
 12779 1 1983 3932289
 28030 1 1982 3928324
 29452.5 1 2008 4032064
 28650 1 1994 3976036
 15925 1 2012 4048144
 92692 1 1996 3984016
 3789 2 1998 3992004
 44887 1 2000 4000000
 6201 2 1998 3992004
 38373.75 1 2006 4024036
 25511 2 1989 3956121
 45837 1 1976 3904576
 27500 2 1987 3948169
 27171 1 2004 4016016
 7984 2 1990 3960100
 15925 2 2012 4048144
 18947 2 1998 3992004
 12870 2 2016 4064256
 42234.75 1 2002 4008004
 49813.5 2 2004 4016016
 28490 1 2010 4040100
 38373.75 1 2006 4024036
10040.625 2 2008 4032064
 30150 1 1989 3956121
 15406 2 1996 3984016
 14015 2 1986 3944196
 45000 1 1987 3948169
end
label values sex sex
label def sex 1 "male", modify
label def sex 2 "female", modify

asked Mar 28 at 22:10

lost

1135 bronze badges

migrated from stackoverflow.com Mar 30 at 1:40

This question came from our site for professional and enthusiast programmers.

3

$begingroup$
Stata thinks that the correlation between year and year2 is about 0.999996121207278543. I am happy to regard that as collinear for practical purposes. Your point about documentation is a serious one: perhaps you can tell us where for R in general or lm() in particular it is documented what it does or does not allow in these cases and why.
$endgroup$
– Nick Cox
Mar 29 at 1:28

3

$begingroup$
To me the main issue here is statistical. A parameterisation in terms of say (year - 2000) and its square is much easier to think about and at the same time cuts out problems with precision and instability.
$endgroup$
– Nick Cox
Mar 29 at 1:33

4

$begingroup$
My insight into this particular issue is just one of experience in using regress for 28 years over versions of Stata since 2.05. First, almost always when a predictor is thrown out it is for the same reasons as in R, perfect collinearity. Second, I've seen this kind of thing only occasionally and so suspect that Stata is only a little more strict than R. Stata isn't written for idiots but for users who want to be smart even if they are learning. So, most of the time the user is expected to look at output of coefficients and standard errors and work out what is a bad idea.
$endgroup$
– Nick Cox
Mar 29 at 2:13

4

$begingroup$
Avoiding calendar years as predictors is something I have warned about in various places, e.g. Cross Validated and Statalist.
$endgroup$
– Nick Cox
Mar 29 at 2:14

2

$begingroup$
The high p-values provide a clue. Indeed, the eigenvalues of the covariance matrix of (year, year2, inc) range over nearly 13 orders of magnitude. This totally wipes out the precision inherent in the least squares calculations (which typically is 12 to 13 orders of magnitude). By including year*year in these models, you are just confusing the calculations with noise. If you really think it must be included, then improve your chances at seeing the signal with lm(inc ~ sex * poly(year, 2), foo).
$endgroup$
– whuber♦
Mar 30 at 14:43

|
show 10 more comments

With the same data and the same model specification, Stata is dropping one of the terms from my model while R seems to run it just fine.

For example:

library(tidyverse)
library(magrittr)
library(broom)

foo <- structure(list(inc = c(25545, 24879, 8843, 31668, 20335.25, 32625, 
 21060, 15633.75, 31267.5, 10505, 5167, 33926, 22050, 44863, 11731.875, 
 21659, 31927.5, 16235, 56647, 7751, 4343, 29402, 40756.5, 12779, 
 28030, 29452.5, 28650, 15925, 92692, 3789, 44887, 6201, 38373.75, 
 25511, 45837, 27500, 27171, 7984, 15925, 18947, 12870, 42234.75, 
 49813.5, 28490, 38373.75, 10040.625, 30150, 15406, 14015, 45000), 
 sex = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 
 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 
 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L), .Label = c("male", "female"), class = "factor"), 
 year = c(1993, 1978, 1993, 1985, 2002, 1996, 2016, 2006, 
 2006, 1994, 1998, 1978, 2012, 1985, 2002, 1986, 2014, 1994, 
 1980, 1998, 1988, 1978, 2004, 1983, 1982, 2008, 1994, 2012, 
 1996, 1998, 2000, 1998, 2006, 1989, 1976, 1987, 2004, 1990, 
 2012, 1998, 2016, 2002, 2004, 2010, 2006, 2008, 1989, 1996, 
 1986, 1987)), 
 row.names = c(NA, -50L), class = c("tbl_df", "tbl", "data.frame"))

foo %<>%
 mutate(year2 = year * year)

lm(data = foo,
 inc ~ sex*year + sex*year2) %>%
 tidy
#> # A tibble: 6 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -63645382. 97555833. -0.652 0.518
#> 2 sexfemale 164988796. 150579599. 1.10 0.279
#> 3 year 64070. 97815. 0.655 0.516
#> 4 year2 -16.1 24.5 -0.657 0.514
#> 5 sexfemale:year -165546. 150810. -1.10 0.278
#> 6 sexfemale:year2 41.5 37.8 1.10 0.277

The results above from R are what I expect.

However, in Stata, the sexfemale:year2 interaction term gets dropped.

Using RStata:

library(RStata)
options("RStata.StataPath" = "") # you path here
options("RStata.StataVersion" = ) # your version here

stata("reg inc i.sex#c.year i.sex#c.year2 i.sex c.year c.year2",
 data.in = foo)
#> . reg inc i.sex#c.year i.sex#c.year2 i.sex c.year c.year2
#> note: 2.sex#c.year2 omitted because of collinearity
#> 
#> Source | SS df MS Number of obs = 50
#> -------------+------------------------------ F( 4, 45) = 3.36
#> Model | 3.0037e+09 4 750914046 Prob > F = 0.0173
#> Residual | 1.0058e+10 45 223510687 R-squared = 0.2300
#> -------------+------------------------------ Adj R-squared = 0.1615
#> Total | 1.3062e+10 49 266564022 Root MSE = 14950
#> 
#> ------------------------------------------------------------------------------
#> inc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
#> -------------+----------------------------------------------------------------
#> sex#c.year |
#> female | 287.9784 427.8164 0.67 0.504 -573.6881 1149.645
#> |
#> sex#c.year2 |
#> female | 0 (omitted)
#> |
#> sex |
#> female | -589746.3 854533.4 -0.69 0.494 -2310865 1131372
#> year | -5771.122 74562.38 -0.08 0.939 -155947.5 144405.2
#> year2 | 1.391068 18.68956 0.07 0.941 -36.25164 39.03378
#> _cons | 6009728 7.44e+07 0.08 0.936 -1.44e+08 1.56e+08
#> ------------------------------------------------------------------------------

What am I missing?

Here's Stata code to reproduce the dataset into Stata directly:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double inc long sex double(year year2)
 25545 2 1993 3972049
 24879 2 1978 3912484
 8843 2 1993 3972049
 31668 1 1985 3940225
 20335.25 1 2002 4008004
 32625 1 1996 3984016
 21060 1 2016 4064256
 15633.75 2 2006 4024036
 31267.5 2 2006 4024036
 10505 2 1994 3976036
 5167 2 1998 3992004
 33926 1 1978 3912484
 22050 2 2012 4048144
 44863 1 1985 3940225
11731.875 1 2002 4008004
 21659 2 1986 3944196
 31927.5 2 2014 4056196
 16235 1 1994 3976036
 56647 1 1980 3920400
 7751 2 1998 3992004
 4343 1 1988 3952144
 29402 1 1978 3912484
 40756.5 1 2004 4016016
 12779 1 1983 3932289
 28030 1 1982 3928324
 29452.5 1 2008 4032064
 28650 1 1994 3976036
 15925 1 2012 4048144
 92692 1 1996 3984016
 3789 2 1998 3992004
 44887 1 2000 4000000
 6201 2 1998 3992004
 38373.75 1 2006 4024036
 25511 2 1989 3956121
 45837 1 1976 3904576
 27500 2 1987 3948169
 27171 1 2004 4016016
 7984 2 1990 3960100
 15925 2 2012 4048144
 18947 2 1998 3992004
 12870 2 2016 4064256
 42234.75 1 2002 4008004
 49813.5 2 2004 4016016
 28490 1 2010 4040100
 38373.75 1 2006 4024036
10040.625 2 2008 4032064
 30150 1 1989 3956121
 15406 2 1996 3984016
 14015 2 1986 3944196
 45000 1 1987 3948169
end
label values sex sex
label def sex 1 "male", modify
label def sex 2 "female", modify

asked Mar 28 at 22:10

lost

1135 bronze badges

migrated from stackoverflow.com Mar 30 at 1:40

This question came from our site for professional and enthusiast programmers.

3

$begingroup$
Stata thinks that the correlation between year and year2 is about 0.999996121207278543. I am happy to regard that as collinear for practical purposes. Your point about documentation is a serious one: perhaps you can tell us where for R in general or lm() in particular it is documented what it does or does not allow in these cases and why.
$endgroup$
– Nick Cox
Mar 29 at 1:28

3

$begingroup$
To me the main issue here is statistical. A parameterisation in terms of say (year - 2000) and its square is much easier to think about and at the same time cuts out problems with precision and instability.
$endgroup$
– Nick Cox
Mar 29 at 1:33

4

$begingroup$
My insight into this particular issue is just one of experience in using regress for 28 years over versions of Stata since 2.05. First, almost always when a predictor is thrown out it is for the same reasons as in R, perfect collinearity. Second, I've seen this kind of thing only occasionally and so suspect that Stata is only a little more strict than R. Stata isn't written for idiots but for users who want to be smart even if they are learning. So, most of the time the user is expected to look at output of coefficients and standard errors and work out what is a bad idea.
$endgroup$
– Nick Cox
Mar 29 at 2:13

4

$begingroup$
Avoiding calendar years as predictors is something I have warned about in various places, e.g. Cross Validated and Statalist.
$endgroup$
– Nick Cox
Mar 29 at 2:14

2

$begingroup$
The high p-values provide a clue. Indeed, the eigenvalues of the covariance matrix of (year, year2, inc) range over nearly 13 orders of magnitude. This totally wipes out the precision inherent in the least squares calculations (which typically is 12 to 13 orders of magnitude). By including year*year in these models, you are just confusing the calculations with noise. If you really think it must be included, then improve your chances at seeing the signal with lm(inc ~ sex * poly(year, 2), foo).
$endgroup$
– whuber♦
Mar 30 at 14:43

|
show 10 more comments

With the same data and the same model specification, Stata is dropping one of the terms from my model while R seems to run it just fine.

For example:

library(tidyverse)
library(magrittr)
library(broom)

foo <- structure(list(inc = c(25545, 24879, 8843, 31668, 20335.25, 32625, 
 21060, 15633.75, 31267.5, 10505, 5167, 33926, 22050, 44863, 11731.875, 
 21659, 31927.5, 16235, 56647, 7751, 4343, 29402, 40756.5, 12779, 
 28030, 29452.5, 28650, 15925, 92692, 3789, 44887, 6201, 38373.75, 
 25511, 45837, 27500, 27171, 7984, 15925, 18947, 12870, 42234.75, 
 49813.5, 28490, 38373.75, 10040.625, 30150, 15406, 14015, 45000), 
 sex = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 
 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 
 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L), .Label = c("male", "female"), class = "factor"), 
 year = c(1993, 1978, 1993, 1985, 2002, 1996, 2016, 2006, 
 2006, 1994, 1998, 1978, 2012, 1985, 2002, 1986, 2014, 1994, 
 1980, 1998, 1988, 1978, 2004, 1983, 1982, 2008, 1994, 2012, 
 1996, 1998, 2000, 1998, 2006, 1989, 1976, 1987, 2004, 1990, 
 2012, 1998, 2016, 2002, 2004, 2010, 2006, 2008, 1989, 1996, 
 1986, 1987)), 
 row.names = c(NA, -50L), class = c("tbl_df", "tbl", "data.frame"))

foo %<>%
 mutate(year2 = year * year)

lm(data = foo,
 inc ~ sex*year + sex*year2) %>%
 tidy
#> # A tibble: 6 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -63645382. 97555833. -0.652 0.518
#> 2 sexfemale 164988796. 150579599. 1.10 0.279
#> 3 year 64070. 97815. 0.655 0.516
#> 4 year2 -16.1 24.5 -0.657 0.514
#> 5 sexfemale:year -165546. 150810. -1.10 0.278
#> 6 sexfemale:year2 41.5 37.8 1.10 0.277

The results above from R are what I expect.

However, in Stata, the sexfemale:year2 interaction term gets dropped.

Using RStata:

library(RStata)
options("RStata.StataPath" = "") # you path here
options("RStata.StataVersion" = ) # your version here

stata("reg inc i.sex#c.year i.sex#c.year2 i.sex c.year c.year2",
 data.in = foo)
#> . reg inc i.sex#c.year i.sex#c.year2 i.sex c.year c.year2
#> note: 2.sex#c.year2 omitted because of collinearity
#> 
#> Source | SS df MS Number of obs = 50
#> -------------+------------------------------ F( 4, 45) = 3.36
#> Model | 3.0037e+09 4 750914046 Prob > F = 0.0173
#> Residual | 1.0058e+10 45 223510687 R-squared = 0.2300
#> -------------+------------------------------ Adj R-squared = 0.1615
#> Total | 1.3062e+10 49 266564022 Root MSE = 14950
#> 
#> ------------------------------------------------------------------------------
#> inc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
#> -------------+----------------------------------------------------------------
#> sex#c.year |
#> female | 287.9784 427.8164 0.67 0.504 -573.6881 1149.645
#> |
#> sex#c.year2 |
#> female | 0 (omitted)
#> |
#> sex |
#> female | -589746.3 854533.4 -0.69 0.494 -2310865 1131372
#> year | -5771.122 74562.38 -0.08 0.939 -155947.5 144405.2
#> year2 | 1.391068 18.68956 0.07 0.941 -36.25164 39.03378
#> _cons | 6009728 7.44e+07 0.08 0.936 -1.44e+08 1.56e+08
#> ------------------------------------------------------------------------------

What am I missing?

Here's Stata code to reproduce the dataset into Stata directly:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double inc long sex double(year year2)
 25545 2 1993 3972049
 24879 2 1978 3912484
 8843 2 1993 3972049
 31668 1 1985 3940225
 20335.25 1 2002 4008004
 32625 1 1996 3984016
 21060 1 2016 4064256
 15633.75 2 2006 4024036
 31267.5 2 2006 4024036
 10505 2 1994 3976036
 5167 2 1998 3992004
 33926 1 1978 3912484
 22050 2 2012 4048144
 44863 1 1985 3940225
11731.875 1 2002 4008004
 21659 2 1986 3944196
 31927.5 2 2014 4056196
 16235 1 1994 3976036
 56647 1 1980 3920400
 7751 2 1998 3992004
 4343 1 1988 3952144
 29402 1 1978 3912484
 40756.5 1 2004 4016016
 12779 1 1983 3932289
 28030 1 1982 3928324
 29452.5 1 2008 4032064
 28650 1 1994 3976036
 15925 1 2012 4048144
 92692 1 1996 3984016
 3789 2 1998 3992004
 44887 1 2000 4000000
 6201 2 1998 3992004
 38373.75 1 2006 4024036
 25511 2 1989 3956121
 45837 1 1976 3904576
 27500 2 1987 3948169
 27171 1 2004 4016016
 7984 2 1990 3960100
 15925 2 2012 4048144
 18947 2 1998 3992004
 12870 2 2016 4064256
 42234.75 1 2002 4008004
 49813.5 2 2004 4016016
 28490 1 2010 4040100
 38373.75 1 2006 4024036
10040.625 2 2008 4032064
 30150 1 1989 3956121
 15406 2 1996 3984016
 14015 2 1986 3944196
 45000 1 1987 3948169
end
label values sex sex
label def sex 1 "male", modify
label def sex 2 "female", modify

asked Mar 28 at 22:10

lost

1135 bronze badges

With the same data and the same model specification, Stata is dropping one of the terms from my model while R seems to run it just fine.

For example:

library(tidyverse)
library(magrittr)
library(broom)

foo <- structure(list(inc = c(25545, 24879, 8843, 31668, 20335.25, 32625, 
 21060, 15633.75, 31267.5, 10505, 5167, 33926, 22050, 44863, 11731.875, 
 21659, 31927.5, 16235, 56647, 7751, 4343, 29402, 40756.5, 12779, 
 28030, 29452.5, 28650, 15925, 92692, 3789, 44887, 6201, 38373.75, 
 25511, 45837, 27500, 27171, 7984, 15925, 18947, 12870, 42234.75, 
 49813.5, 28490, 38373.75, 10040.625, 30150, 15406, 14015, 45000), 
 sex = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 
 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 
 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L), .Label = c("male", "female"), class = "factor"), 
 year = c(1993, 1978, 1993, 1985, 2002, 1996, 2016, 2006, 
 2006, 1994, 1998, 1978, 2012, 1985, 2002, 1986, 2014, 1994, 
 1980, 1998, 1988, 1978, 2004, 1983, 1982, 2008, 1994, 2012, 
 1996, 1998, 2000, 1998, 2006, 1989, 1976, 1987, 2004, 1990, 
 2012, 1998, 2016, 2002, 2004, 2010, 2006, 2008, 1989, 1996, 
 1986, 1987)), 
 row.names = c(NA, -50L), class = c("tbl_df", "tbl", "data.frame"))

foo %<>%
 mutate(year2 = year * year)

lm(data = foo,
 inc ~ sex*year + sex*year2) %>%
 tidy
#> # A tibble: 6 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -63645382. 97555833. -0.652 0.518
#> 2 sexfemale 164988796. 150579599. 1.10 0.279
#> 3 year 64070. 97815. 0.655 0.516
#> 4 year2 -16.1 24.5 -0.657 0.514
#> 5 sexfemale:year -165546. 150810. -1.10 0.278
#> 6 sexfemale:year2 41.5 37.8 1.10 0.277

The results above from R are what I expect.

However, in Stata, the sexfemale:year2 interaction term gets dropped.

Using RStata:

library(RStata)
options("RStata.StataPath" = "") # you path here
options("RStata.StataVersion" = ) # your version here

stata("reg inc i.sex#c.year i.sex#c.year2 i.sex c.year c.year2",
 data.in = foo)
#> . reg inc i.sex#c.year i.sex#c.year2 i.sex c.year c.year2
#> note: 2.sex#c.year2 omitted because of collinearity
#> 
#> Source | SS df MS Number of obs = 50
#> -------------+------------------------------ F( 4, 45) = 3.36
#> Model | 3.0037e+09 4 750914046 Prob > F = 0.0173
#> Residual | 1.0058e+10 45 223510687 R-squared = 0.2300
#> -------------+------------------------------ Adj R-squared = 0.1615
#> Total | 1.3062e+10 49 266564022 Root MSE = 14950
#> 
#> ------------------------------------------------------------------------------
#> inc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
#> -------------+----------------------------------------------------------------
#> sex#c.year |
#> female | 287.9784 427.8164 0.67 0.504 -573.6881 1149.645
#> |
#> sex#c.year2 |
#> female | 0 (omitted)
#> |
#> sex |
#> female | -589746.3 854533.4 -0.69 0.494 -2310865 1131372
#> year | -5771.122 74562.38 -0.08 0.939 -155947.5 144405.2
#> year2 | 1.391068 18.68956 0.07 0.941 -36.25164 39.03378
#> _cons | 6009728 7.44e+07 0.08 0.936 -1.44e+08 1.56e+08
#> ------------------------------------------------------------------------------

What am I missing?

Here's Stata code to reproduce the dataset into Stata directly:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double inc long sex double(year year2)
 25545 2 1993 3972049
 24879 2 1978 3912484
 8843 2 1993 3972049
 31668 1 1985 3940225
 20335.25 1 2002 4008004
 32625 1 1996 3984016
 21060 1 2016 4064256
 15633.75 2 2006 4024036
 31267.5 2 2006 4024036
 10505 2 1994 3976036
 5167 2 1998 3992004
 33926 1 1978 3912484
 22050 2 2012 4048144
 44863 1 1985 3940225
11731.875 1 2002 4008004
 21659 2 1986 3944196
 31927.5 2 2014 4056196
 16235 1 1994 3976036
 56647 1 1980 3920400
 7751 2 1998 3992004
 4343 1 1988 3952144
 29402 1 1978 3912484
 40756.5 1 2004 4016016
 12779 1 1983 3932289
 28030 1 1982 3928324
 29452.5 1 2008 4032064
 28650 1 1994 3976036
 15925 1 2012 4048144
 92692 1 1996 3984016
 3789 2 1998 3992004
 44887 1 2000 4000000
 6201 2 1998 3992004
 38373.75 1 2006 4024036
 25511 2 1989 3956121
 45837 1 1976 3904576
 27500 2 1987 3948169
 27171 1 2004 4016016
 7984 2 1990 3960100
 15925 2 2012 4048144
 18947 2 1998 3992004
 12870 2 2016 4064256
 42234.75 1 2002 4008004
 49813.5 2 2004 4016016
 28490 1 2010 4040100
 38373.75 1 2006 4024036
10040.625 2 2008 4032064
 30150 1 1989 3956121
 15406 2 1996 3984016
 14015 2 1986 3944196
 45000 1 1987 3948169
end
label values sex sex
label def sex 1 "male", modify
label def sex 2 "female", modify

r regression stata

asked Mar 28 at 22:10

lost

1135 bronze badges

asked Mar 28 at 22:10

lost

1135 bronze badges

asked Mar 28 at 22:10

lost

1135 bronze badges

asked Mar 28 at 22:10

lost

1135 bronze badges

asked Mar 28 at 22:10

lost

1135 bronze badges

migrated from stackoverflow.com Mar 30 at 1:40

This question came from our site for professional and enthusiast programmers.

migrated from stackoverflow.com Mar 30 at 1:40

This question came from our site for professional and enthusiast programmers.

migrated from stackoverflow.com Mar 30 at 1:40

This question came from our site for professional and enthusiast programmers.

3

$begingroup$
Stata thinks that the correlation between year and year2 is about 0.999996121207278543. I am happy to regard that as collinear for practical purposes. Your point about documentation is a serious one: perhaps you can tell us where for R in general or lm() in particular it is documented what it does or does not allow in these cases and why.
$endgroup$
– Nick Cox
Mar 29 at 1:28

3

$begingroup$
To me the main issue here is statistical. A parameterisation in terms of say (year - 2000) and its square is much easier to think about and at the same time cuts out problems with precision and instability.
$endgroup$
– Nick Cox
Mar 29 at 1:33

4

$begingroup$
My insight into this particular issue is just one of experience in using regress for 28 years over versions of Stata since 2.05. First, almost always when a predictor is thrown out it is for the same reasons as in R, perfect collinearity. Second, I've seen this kind of thing only occasionally and so suspect that Stata is only a little more strict than R. Stata isn't written for idiots but for users who want to be smart even if they are learning. So, most of the time the user is expected to look at output of coefficients and standard errors and work out what is a bad idea.
$endgroup$
– Nick Cox
Mar 29 at 2:13

4

$begingroup$
Avoiding calendar years as predictors is something I have warned about in various places, e.g. Cross Validated and Statalist.
$endgroup$
– Nick Cox
Mar 29 at 2:14

2

$begingroup$
The high p-values provide a clue. Indeed, the eigenvalues of the covariance matrix of (year, year2, inc) range over nearly 13 orders of magnitude. This totally wipes out the precision inherent in the least squares calculations (which typically is 12 to 13 orders of magnitude). By including year*year in these models, you are just confusing the calculations with noise. If you really think it must be included, then improve your chances at seeing the signal with lm(inc ~ sex * poly(year, 2), foo).
$endgroup$
– whuber♦
Mar 30 at 14:43

|
show 10 more comments

3

$begingroup$
Stata thinks that the correlation between year and year2 is about 0.999996121207278543. I am happy to regard that as collinear for practical purposes. Your point about documentation is a serious one: perhaps you can tell us where for R in general or lm() in particular it is documented what it does or does not allow in these cases and why.
$endgroup$
– Nick Cox
Mar 29 at 1:28

3

$begingroup$
To me the main issue here is statistical. A parameterisation in terms of say (year - 2000) and its square is much easier to think about and at the same time cuts out problems with precision and instability.
$endgroup$
– Nick Cox
Mar 29 at 1:33

4

$begingroup$
My insight into this particular issue is just one of experience in using regress for 28 years over versions of Stata since 2.05. First, almost always when a predictor is thrown out it is for the same reasons as in R, perfect collinearity. Second, I've seen this kind of thing only occasionally and so suspect that Stata is only a little more strict than R. Stata isn't written for idiots but for users who want to be smart even if they are learning. So, most of the time the user is expected to look at output of coefficients and standard errors and work out what is a bad idea.
$endgroup$
– Nick Cox
Mar 29 at 2:13

4

$begingroup$
Avoiding calendar years as predictors is something I have warned about in various places, e.g. Cross Validated and Statalist.
$endgroup$
– Nick Cox
Mar 29 at 2:14

2

$begingroup$
The high p-values provide a clue. Indeed, the eigenvalues of the covariance matrix of (year, year2, inc) range over nearly 13 orders of magnitude. This totally wipes out the precision inherent in the least squares calculations (which typically is 12 to 13 orders of magnitude). By including year*year in these models, you are just confusing the calculations with noise. If you really think it must be included, then improve your chances at seeing the signal with lm(inc ~ sex * poly(year, 2), foo).
$endgroup$
– whuber♦
Mar 30 at 14:43

Stata thinks that the correlation between year and year2 is about 0.999996121207278543. I am happy to regard that as collinear for practical purposes. Your point about documentation is a serious one: perhaps you can tell us where for R in general or lm() in particular it is documented what it does or does not allow in these cases and why.

– Nick Cox
Mar 29 at 1:28

To me the main issue here is statistical. A parameterisation in terms of say (year - 2000) and its square is much easier to think about and at the same time cuts out problems with precision and instability.

– Nick Cox
Mar 29 at 1:33

My insight into this particular issue is just one of experience in using regress for 28 years over versions of Stata since 2.05. First, almost always when a predictor is thrown out it is for the same reasons as in R, perfect collinearity. Second, I've seen this kind of thing only occasionally and so suspect that Stata is only a little more strict than R. Stata isn't written for idiots but for users who want to be smart even if they are learning. So, most of the time the user is expected to look at output of coefficients and standard errors and work out what is a bad idea.

– Nick Cox
Mar 29 at 2:13

Avoiding calendar years as predictors is something I have warned about in various places, e.g. Cross Validated and Statalist.

– Nick Cox
Mar 29 at 2:14

The high p-values provide a clue. Indeed, the eigenvalues of the covariance matrix of (year, year2, inc) range over nearly 13 orders of magnitude. This totally wipes out the precision inherent in the least squares calculations (which typically is 12 to 13 orders of magnitude). By including year*year in these models, you are just confusing the calculations with noise. If you really think it must be included, then improve your chances at seeing the signal with lm(inc ~ sex * poly(year, 2), foo).

– whuber♦
Mar 30 at 14:43

|
show 10 more comments

0

active

oldest

votes

Your Answer

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "65"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/4.0/"u003ecc by-sa 4.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f400223%2fquadratic-interaction-term-dropping-out-in-stata-but-not-in-r%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

0

active

oldest

votes

0

active

oldest

votes

draft saved

draft discarded

Thanks for contributing an answer to Cross Validated!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

XBmMg6SpvQUVoQbfXvcA6LbQ7 CSq2pgwUDZ2FNxlK0wm0tldVgzNzR w,vaLS38q znha

搜尋此網誌

Styjun

migrated from stackoverflow.com Mar 30 at 1:40

migrated from stackoverflow.com Mar 30 at 1:40

migrated from stackoverflow.com Mar 30 at 1:40

migrated from stackoverflow.com Mar 30 at 1:40

migrated from stackoverflow.com Mar 30 at 1:40

0

Your Answer

Post as a guest

0

0

Post as a guest

Popular posts from this blog

Kamusi Yaliyomo Aina za kamusi | Muundo wa kamusi | Faida za kamusi | Dhima ya picha katika kamusi | Marejeo | Tazama pia | Viungo vya nje | UrambazajiKuhusu kamusiGo-SwahiliWiki-KamusiKamusi ya Kiswahili na Kiingerezakuihariri na kuongeza habari

migrated from stackoverflow.com Mar 30 at 1:40

migrated from stackoverflow.com Mar 30 at 1:40

migrated from stackoverflow.com Mar 30 at 1:40

migrated from stackoverflow.com Mar 30 at 1:40

migrated from stackoverflow.com Mar 30 at 1:40

0

Your Answer

Sign up or log in

Post as a guest

Post as a guest

0

0

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Popular posts from this blog

Kamusi Yaliyomo Aina za kamusi | Muundo wa kamusi | Faida za kamusi | Dhima ya picha katika kamusi | Marejeo | Tazama pia | Viungo vya nje | UrambazajiKuhusu kamusiGo-SwahiliWiki-KamusiKamusi ya Kiswahili na Kiingerezakuihariri na kuongeza habari