
Identify groups with differing observations


I am trying to identify groups in a dataset where values of a specific variable differ.

For example, in the data below, I had four patients and made three appointments to see each.

dat <- structure(list(patient = c('John', 'John', 'John', 'Jean', 'Jean', 'Jean', 'Jack', 'Jack', 'Jack', 'Jess', 'Jess', 'Jess'), 
 status = c('Well', 'Well', 'Well', 'Well', 'Sick', 'Well', 'DNA', 'DNA', 'DNA', 'DNA', 'Well', 'Well')), class = "data.frame", row.names = c(NA, -12L))

Sometimes they were well, sometimes sick and sometimes they didn't attend (DNA).

I can easily see that the status of at least some of them differed between appointments:

nrow(unique(dat)) == length(unique(dat$patient))
# gives FALSE

I am trying to work out how to identify which patients have differing statuses.

The best I have so far is:

# function to find if all elements of a vector are the same
all_same <- function(x) all(x == x[1])

# split table and apply function
sapply(split(dat$status, dat$patient), all_same)

This works, but I have a big dataset with many groups (i.e. patients). I seem to come across this specific problem quite often. I feel there must be an elegant and vectorized way to do this. I know I can improve the speed of my approach using dplyr/data.table but I can only think of approaches that split the data and then loop a function over the groups. What is the best way to do this?

asked Mar 28 at 21:41

Dan Lewer

446 reputation, 2 silver badges, 11 bronze badges


r performance loops vectorization


4 Answers

Here is a non-tidyverse way, which counts the distinct statuses recorded for each patient:

table(unique(dat)[,'patient'])

gives

Jack Jean Jess John 
   1    2    2    1 
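To go from these counts to the names of the patients whose status varies, one option is to keep only the entries greater than 1. This is a minimal sketch in base R, assuming `dat` contains only the `patient` and `status` columns from the question; the names `status_counts` and `differing` are only illustrative.

```r
# Example data from the question
dat <- data.frame(
  patient = rep(c("John", "Jean", "Jack", "Jess"), each = 3),
  status  = c("Well", "Well", "Well", "Well", "Sick", "Well",
              "DNA", "DNA", "DNA", "DNA", "Well", "Well"),
  stringsAsFactors = FALSE
)

# Number of distinct statuses per patient (one row per patient/status pair)
status_counts <- table(unique(dat)[, "patient"])

# Keep the names of patients with more than one distinct status
differing <- names(status_counts)[status_counts > 1]
differing
```

If `dat` had further columns, `unique(dat)` would need to be restricted to the two relevant ones first, for example `unique(dat[c("patient", "status")])`.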

answered Mar 28 at 21:52

Leo Brueggeman

353 reputation, 1 silver badge, 4 bronze badges


And a slightly different tidy approach where you keep information about the status:

library("tidyverse")

dat <- structure(list(patient = c('John', 'John', 'John', 'Jean', 'Jean', 'Jean', 'Jack', 'Jack', 'Jack', 'Jess', 'Jess', 'Jess'),
 status = c('Well', 'Well', 'Well', 'Well', 'Sick', 'Well', 'DNA', 'DNA', 'DNA', 'DNA', 'Well', 'Well')), class = "data.frame", row.names = c(NA, -12L))

dat %>%
  # Keep unique combinations of patient and status
  distinct(patient, status) %>%
  # Are there any patients with more than one status?
  group_by(patient) %>%
  filter(n() > 1) %>%
  summarise(status = paste(status, collapse = ","))
#> # A tibble: 2 x 2
#>   patient status   
#>   <chr>   <chr>    
#> 1 Jean    Well,Sick
#> 2 Jess    DNA,Well

Created on 2019-03-28 by the reprex package (v0.2.1)

answered Mar 28 at 21:55

dipetkov

1,506 reputation, 1 silver badge, 8 bronze badges


And here's a data.table approach

library(data.table)
setDT(dat)
dat[, .(unique = uniqueN(status)), patient]

   patient unique
1:    John      1
2:    Jean      2
3:    Jack      1
4:    Jess      2

answered Mar 29 at 1:45

David F

807 reputation, 1 gold badge, 9 silver badges, 12 bronze badges


Here's one idea...

# test whether each element of a vector is different to the element before
d <- function(x) {
  y <- x != c(x[-1], NA)
  y <- c(FALSE, y)
  y[-length(y)]
}

dat$nc <- d(dat$status) & !d(dat$patient) # status changes but patient doesn't
unique(dat$patient[dat$nc])
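To make the helper's behaviour concrete, here is a small sketch of what `d` returns on a short vector. The `status` vector below is made up for illustration and is not the question's data.

```r
# Same helper as above: TRUE where an element differs from the one before it
d <- function(x) {
  y <- x != c(x[-1], NA)  # compare each element with the next one
  y <- c(FALSE, y)        # shift by one so the result lines up with "previous"
  y[-length(y)]           # drop the trailing NA comparison
}

status <- c("Well", "Sick", "Well", "Well")
changed <- d(status)
changed
# The first element is always FALSE because it has no predecessor
```

Here `changed` is FALSE, TRUE, TRUE, FALSE: the second and third elements differ from their predecessors, the fourth does not.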

EDIT - Here's my first ever effort at benchmarking

The results (from a single run, times = 1) suggest that the split/apply and 'table' approaches in base R are actually faster than either dplyr or data.table for this purpose, while the 'ch' function is much faster still. The 'ch' function does rely on each patient's rows being consecutive in the table, which the other approaches don't.
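Because the adjacent-row comparison only works when each patient's rows sit together, one possible workaround is to order the data by the grouping column first. The sketch below assumes the extra sorting cost is acceptable; `ch_sorted` and the `shuffled` data are hypothetical names introduced here, not part of the benchmark.

```r
# Helper from the answer: TRUE where an element differs from the one before it
d <- function(x) {
  y <- x != c(x[-1], NA)
  y <- c(FALSE, y)
  y[-length(y)]
}

# Hypothetical wrapper: sort by the grouping column, then compare adjacent rows
ch_sorted <- function(dat, group, status) {
  dat <- dat[order(dat[[group]]), , drop = FALSE]
  g <- dat[[group]]
  s <- dat[[status]]
  # status changes while the group stays the same
  unique(g[d(s) & !d(g)])
}

# Rows deliberately interleaved so that patients are not on consecutive rows
shuffled <- data.frame(
  patient = c("John", "Jean", "John", "Jean", "Jack", "Jack"),
  status  = c("Well", "Well", "Well", "Sick", "DNA", "DNA"),
  stringsAsFactors = FALSE
)

result <- ch_sorted(shuffled, "patient", "status")
result
```

On this interleaved data the sorted version reports Jean, whereas comparing adjacent rows without sorting would report no one, since Jean's two rows are never next to each other.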

# function for my approach above

ch <- function(dat, group, status) {
  d <- function(x) {
    y <- x != c(x[-1], NA)
    y <- c(FALSE, y)
    y[-length(y)]
  }
  unique(dat[, group][d(dat[, status]) & !d(dat[, group])])
}

# you can also use factor and diff - see 'ch2' below
# generate data with 20000 groups

library(stringi)
library(dplyr)
library(data.table)
library(microbenchmark)

dat <- data.frame(patient = rep(stri_rand_strings(20000, 7), each = 4),
                  status = sample(c('A', 'B', 'C'), 80000, replace = TRUE, prob = c(0.8, 0.1, 0.1)),
                  stringsAsFactors = FALSE)

microbenchmark(
  dplyr = dat %>% as_tibble() %>% group_by(patient) %>% summarise(result = n_distinct(status)),
  split_apply = sapply(split(dat$status, dat$patient), function(x) all(x == x[1])),
  table = table(unique(dat)[, 'patient']),
  ch = ch(dat, 'patient', 'status'),
  # status changes (diff != 0) while the patient stays the same (diff == 0)
  ch2 = unique(dat$patient[c(FALSE, diff(as.numeric(factor(dat$patient))) == 0 & diff(as.numeric(factor(dat$status))) != 0)]),
  # as.data.table() works on a copy, so 'dat' stays a plain data.frame for the
  # other expressions (setDT(dat) would convert it in place)
  datatable = as.data.table(dat)[, .(unique = uniqueN(status)), patient],
  times = 1
)

Unit: milliseconds
        expr       min        lq      mean    median        uq       max neval
       dplyr 5523.6048 5523.6048 5523.6048 5523.6048 5523.6048 5523.6048     1
 split_apply  165.8760  165.8760  165.8760  165.8760  165.8760  165.8760     1
       table  224.9030  224.9030  224.9030  224.9030  224.9030  224.9030     1
          ch   10.8821   10.8821   10.8821   10.8821   10.8821   10.8821     1
         ch2  146.2358  146.2358  146.2358  146.2358  146.2358  146.2358     1
   datatable  851.1028  851.1028  851.1028  851.1028  851.1028  851.1028     1

edited Mar 29 at 8:52

answered Mar 28 at 23:52

Dan Lewer

446 reputation, 2 silver badges, 11 bronze badges


