Why do attempts to filter/subset a raked survey design object fail?
I'm trying to filter rows in a survey design object to exclude a particular subset of data. In the example below, which consists of survey data from several schools, I'm trying to exclude data from schools in Alameda County, California.
Surprisingly, when the survey design object includes weights created by raking, attempts to filter or subset the data fail. I think this is a bug, but I'm not sure. Why does the presence of raked weights alter the result of attempting to filter or subset the data?
library(survey)
data(api)

# Declare basic clustered design ----
cluster_design <- svydesign(data = apiclus1,
                            id = ~dnum,
                            weights = ~pw,
                            fpc = ~fpc)

# Add raking weights for school type ----
pop.types <- data.frame(stype = c("E", "H", "M"), Freq = c(4421, 755, 1018))
pop.schwide <- data.frame(sch.wide = c("No", "Yes"), Freq = c(1072, 5122))

raked_design <- rake(cluster_design,
                     sample.margins = list(~stype, ~sch.wide),
                     population.margins = list(pop.types, pop.schwide))

# Filter the two different design objects ----
subset_from_raked_design <- subset(raked_design, cname != "Alameda")
subset_from_cluster_design <- subset(cluster_design, cname != "Alameda")

# Count the number of rows in each subset.
# Note that they surprisingly differ.
nrow(subset_from_raked_design)
#> [1] 183
nrow(subset_from_cluster_design)
#> [1] 172
This issue occurs no matter how you attempt to subset the data. For example, here's what happens when you try to use row-indexing to subset only the first 10 rows:
nrow(cluster_design[1:10,])
#> 10
nrow(raked_design[1:10,])
#> 183
r survey
asked Mar 27 at 18:23 by bschneidr; edited Mar 28 at 14:32
1 Answer
This behavior results from the survey package trying to keep you from making a statistical mistake.
For especially complex designs involving calibration/post-stratification/raking, estimates for sub-populations can't simply be computed by filtering away data from outside of the sub-population of interest; that approach produces misleading standard errors and confidence intervals.
So to keep you from running into this statistical issue, the survey package doesn't let you completely remove records outside your subset of interest. Instead, it takes note of which rows you want to ignore and adjusts their probability weights so that they are effectively zero.
In the example from this question, you can see that for the rows that were meant to be filtered away, the corresponding entries in subset_from_raked_design$prob equal Inf (which means those rows are effectively assigned a weight of zero).
subset_from_raked_design$prob[1:12]
#> Inf Inf Inf Inf Inf Inf
#> Inf Inf Inf Inf Inf
#> 0.01986881 ....
raked_design$prob[1:12]
#> 0.01986881 0.03347789 0.03347789 0.03347789 0.03347789 0.03347789
#> 0.03347789 0.03347789 0.03347789 0.02717969 0.02717969
#> 0.01986881 ....
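You can confirm that the subset works by zeroing out weights rather than by dropping rows. This is a small sketch, assuming the `raked_design` and `subset_from_raked_design` objects from the question are in scope (it requires the survey package); `weights()` on a survey design returns 1/prob, so the Inf probabilities above become weights of exactly zero:

```r
# Sketch: weights() returns 1/prob, so rows with prob == Inf get weight 0.
w <- weights(subset_from_raked_design)

sum(w == 0)  # rows "removed" by the subset: still present, but weight zero
sum(w > 0)   # rows that actually contribute to estimates (172 here)

# Estimation functions respect the zero weights, so a domain estimate such as
#   svymean(~api00, subset_from_raked_design)
# is computed over the intended subset even though nrow() reports 183.
```

Because the excluded rows carry zero weight but remain in the design, variance estimation still sees the full sample structure, which is exactly what correct subpopulation standard errors require.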
thanks for writing this up so clearly.. related code: gist.github.com/ajdamico/9b3232a1d986b3460baaa90f5fed3402
– Anthony Damico, Apr 24 at 19:17
answered Mar 27 at 22:13 by bschneidr