Identify groups with differing observations
I am trying to identify groups in a dataset where values of a specific variable differ.
For example, in the data below, I had four patients and made three appointments to see each.
dat <- structure(list(patient = c('John', 'John', 'John', 'Jean', 'Jean', 'Jean', 'Jack', 'Jack', 'Jack', 'Jess', 'Jess', 'Jess'),
status = c('Well', 'Well', 'Well', 'Well', 'Sick', 'Well', 'DNA', 'DNA', 'DNA', 'DNA', 'Well', 'Well')), class = "data.frame", row.names = c(NA, -12L))
Sometimes they were well, sometimes sick and sometimes they didn't attend (DNA).
I can easily see that the status of at least some of them differed between appointments:
nrow(unique(dat)) == length(unique(dat$patient))
# gives FALSE
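To see why this comparison comes out FALSE, it can help to look at the de-duplicated rows themselves. The following is only a minimal sketch: it rebuilds the same example data with `data.frame()` so it runs on its own, and `distinct_rows` is just an illustrative name.

```r
# Same example data as above, rebuilt so the snippet is self-contained
dat <- data.frame(
  patient = rep(c("John", "Jean", "Jack", "Jess"), each = 3),
  status  = c("Well", "Well", "Well", "Well", "Sick", "Well",
              "DNA", "DNA", "DNA", "DNA", "Well", "Well"),
  stringsAsFactors = FALSE
)

# One row per distinct (patient, status) pair
distinct_rows <- unique(dat)

nrow(distinct_rows)          # 6 distinct pairs
length(unique(dat$patient))  # 4 patients
```

Six distinct pairs for four patients means at least one patient appears with more than one status, but the check alone does not say which one.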
I am trying to work out how to identify which patients have differing statuses.
The best I have so far is:
# function to find if all elements of a vector are the same
all_same <- function(x) all(x == x[1])
# split table and apply function
sapply(split(dat$status, dat$patient), all_same)
This works, but I have a big dataset with many groups (i.e. patients). I seem to come across this specific problem quite often. I feel there must be an elegant and vectorized way to do this. I know I can improve the speed of my approach using dplyr/data.table but I can only think of approaches that split the data and then loop a function over the groups. What is the best way to do this?
r performance loops vectorization
asked Mar 28 at 21:41
Dan Lewer
446 reputation, 2 silver badges, 11 bronze badges
4 Answers
Here is a non-tidyverse (base R) way:
table(unique(dat)[,'patient'])
gives
Jack Jean Jess John
   1    2    2    1
A count greater than 1 means that patient has more than one distinct status.
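The table of counts can be reduced to just the patients of interest. This is a minimal base R sketch, assuming the question's example data (rebuilt here so the snippet runs on its own); `counts` and `differing` are illustrative names, and the explicit column subset is an added precaution in case the data frame has columns other than `patient` and `status`.

```r
dat <- data.frame(
  patient = rep(c("John", "Jean", "Jack", "Jess"), each = 3),
  status  = c("Well", "Well", "Well", "Well", "Sick", "Well",
              "DNA", "DNA", "DNA", "DNA", "Well", "Well"),
  stringsAsFactors = FALSE
)

# Count distinct (patient, status) pairs per patient
counts <- table(unique(dat[, c("patient", "status")])[, "patient"])

# Keep only patients with more than one distinct status
differing <- names(counts)[as.vector(counts) > 1]
differing
```

For the example data this yields Jean and Jess, in the alphabetical order that `table()` uses.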
answered Mar 28 at 21:52
Leo Brueggeman
353 reputation, 1 silver badge, 4 bronze badges
And a slightly different tidy approach where you keep information about the status:
library("tidyverse")
dat <- structure(list(patient = c('John', 'John', 'John', 'Jean', 'Jean', 'Jean', 'Jack', 'Jack', 'Jack', 'Jess', 'Jess', 'Jess'),
status = c('Well', 'Well', 'Well', 'Well', 'Sick', 'Well', 'DNA', 'DNA', 'DNA', 'DNA', 'Well', 'Well')), class = "data.frame", row.names = c(NA, -12L))
dat %>%
  # Keep unique combinations of patient and status
  distinct(patient, status) %>%
  # Are there any patients with more than one status?
  group_by(patient) %>%
  filter(n() > 1) %>%
  summarise(status = paste(status, collapse = ","))
#> # A tibble: 2 x 2
#>   patient status
#>   <chr>   <chr>
#> 1 Jean    Well,Sick
#> 2 Jess    DNA,Well
Created on 2019-03-28 by the reprex package (v0.2.1)
answered Mar 28 at 21:55
dipetkov
1,506 reputation, 1 silver badge, 8 bronze badges
And here's a data.table approach:
library(data.table)
setDT(dat)  # converts dat to a data.table in place (by reference)
dat[, .(unique = uniqueN(status)), by = patient]
   patient unique
1:    John      1
2:    Jean      2
3:    Jack      1
4:    Jess      2
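To go from the per-patient counts to only the patients whose status varied, the result can be filtered by chaining a second `[`. This is a sketch that assumes the data.table package is installed; `dt`, `n_by_patient`, `n_status` and `differing` are illustrative names, and `as.data.table()` is used instead of `setDT()` here so that the original data frame is left unchanged.

```r
library(data.table)

dat <- data.frame(
  patient = rep(c("John", "Jean", "Jack", "Jess"), each = 3),
  status  = c("Well", "Well", "Well", "Well", "Sick", "Well",
              "DNA", "DNA", "DNA", "DNA", "Well", "Well"),
  stringsAsFactors = FALSE
)

# Copy into a data.table rather than converting dat in place
dt <- as.data.table(dat)

# Number of distinct statuses per patient
n_by_patient <- dt[, .(n_status = uniqueN(status)), by = patient]

# Patients with more than one distinct status
differing <- n_by_patient[n_status > 1, patient]
differing
```

Groups come back in order of first appearance, so the example data gives Jean, then Jess.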
answered Mar 29 at 1:45
David F
807 reputation, 1 gold badge, 9 silver badges, 12 bronze badges
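Here's one idea...
# test whether each element of a vector is different to the element before
d <- function(x) {
  y <- x != c(x[-1], NA)  # compare each element with the next one
  y <- c(FALSE, y)        # shift by one so the flag sits on the later element
  y[-length(y)]           # drop the trailing NA to restore the original length
}
dat$nc <- d(dat$status) & !d(dat$patient)  # status changes but patient doesn't
unique(dat$patient[dat$nc])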
EDIT - Here's my first ever effort at benchmarking.
The results, from a single run of each expression (times = 1), suggest that the split/apply and 'table' approaches in base R (about 166 ms and 225 ms) are actually faster than either dplyr (about 5,524 ms) or data.table (about 851 ms) for this purpose, while the 'ch' function is much faster still (about 11 ms). The 'ch' function does rely on each patient's rows being consecutive in the table, which the other approaches don't.
# function for my approach above
ch <- function(dat, group, status) {
  d <- function(x) {
    y <- x != c(x[-1], NA)
    y <- c(FALSE, y)
    y[-length(y)]
  }
  unique(dat[, group][d(dat[, status]) & !d(dat[, group])])
}
# you can also use factor and diff - see 'ch2' below
# generate data with 20000 groups
library(stringi)
library(dplyr)
library(data.table)
library(microbenchmark)
dat <- data.frame(patient = rep(stri_rand_strings(20000, 7), each = 4),
                  status = sample(c('A', 'B', 'C'), 80000, replace = TRUE, prob = c(0.8, 0.1, 0.1)),
                  stringsAsFactors = FALSE)
microbenchmark(
  dplyr = dat %>% as_tibble() %>% group_by(patient) %>% summarise(result = n_distinct(status)),
  split_apply = sapply(split(dat$status, dat$patient), function(x) all(x == x[1])),
  table = table(unique(dat)[, 'patient']),
  ch = ch(dat, 'patient', 'status'),
  # ch2: same patient as the previous row (diff == 0) but a different status (diff != 0)
  ch2 = unique(dat$patient[c(FALSE, diff(as.numeric(factor(dat$patient))) == 0 & diff(as.numeric(factor(dat$status))) != 0)]),
  # note: setDT() converts dat in place, so expressions evaluated afterwards see a data.table
  datatable = {setDT(dat); dat[, .(unique = uniqueN(status)), patient]},
  times = 1
)
Unit: milliseconds
        expr       min        lq      mean    median        uq       max neval
       dplyr 5523.6048 5523.6048 5523.6048 5523.6048 5523.6048 5523.6048     1
 split_apply  165.8760  165.8760  165.8760  165.8760  165.8760  165.8760     1
       table  224.9030  224.9030  224.9030  224.9030  224.9030  224.9030     1
          ch   10.8821   10.8821   10.8821   10.8821   10.8821   10.8821     1
         ch2  146.2358  146.2358  146.2358  146.2358  146.2358  146.2358     1
   datatable  851.1028  851.1028  851.1028  851.1028  851.1028  851.1028     1
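Because 'ch' only compares neighbouring rows, it needs each patient's rows to be adjacent. One possible way to lift that requirement is to order the rows by group before comparing. The sketch below is an assumption-laden variant, not part of the benchmark above: `ch_sorted` is a hypothetical name, it assumes there are no missing values in either column, and the extra sort means the timings reported above do not apply to it.

```r
# Hypothetical variant of 'ch' that does not need pre-sorted input
ch_sorted <- function(dat, group, status) {
  # Order rows so that all rows of a group are adjacent
  o <- order(dat[[group]])
  g <- dat[[group]][o]
  s <- dat[[status]][o]
  n <- length(g)
  if (n < 2) return(g[0])  # nothing to compare
  # TRUE where a row has the same group as the previous row but a different status
  changed <- c(FALSE, g[-1] == g[-n] & s[-1] != s[-n])
  unique(g[changed])
}

# Small example with the patients' rows deliberately interleaved
dat <- data.frame(
  patient = c("John", "Jean", "John", "Jess", "Jean", "Jess"),
  status  = c("Well", "Well", "Well", "DNA", "Sick", "Well"),
  stringsAsFactors = FALSE
)
res <- ch_sorted(dat, "patient", "status")
res
```

On this interleaved example the function reports Jean and Jess, the two patients whose status is not constant.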
edited Mar 29 at 8:52
answered Mar 28 at 23:52
Dan Lewer
446 reputation, 2 silver badges, 11 bronze badges