Identify groups with differing observations

I am trying to identify groups in a dataset where values of a specific variable differ.



For example, in the data below, I had four patients and made three appointments to see each.



dat <- structure(list(patient = c('John', 'John', 'John', 'Jean', 'Jean', 'Jean', 'Jack', 'Jack', 'Jack', 'Jess', 'Jess', 'Jess'), 
status = c('Well', 'Well', 'Well', 'Well', 'Sick', 'Well', 'DNA', 'DNA', 'DNA', 'DNA', 'Well', 'Well')), class = "data.frame", row.names = c(NA, -12L))


Sometimes they were well, sometimes sick and sometimes they didn't attend (DNA).



I can easily see that the status of at least some of them differed between appointments:



nrow(unique(dat)) == length(unique(dat$patient))
# gives FALSE


I am trying to work out how to identify which patients have differing statuses.



The best I have so far is:



# function to find if all elements of a vector are the same
all_same <- function(x) all(x == x[1])

# split table and apply function
sapply(split(dat$status, dat$patient), all_same)


This works, but I have a big dataset with many groups (i.e. patients). I seem to come across this specific problem quite often. I feel there must be an elegant and vectorized way to do this. I know I can improve the speed of my approach using dplyr/data.table but I can only think of approaches that split the data and then loop a function over the groups. What is the best way to do this?
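As a small illustration of the split/apply approach described above (a sketch, not part of the original post), the logical vector returned by sapply can be inverted to list the patients whose statuses differ. The names `same` and `differing` are made up for this example, and the data frame is rebuilt here only so that the snippet runs on its own.

```r
# Same example data as in the question
dat <- data.frame(
  patient = rep(c("John", "Jean", "Jack", "Jess"), each = 3),
  status  = c("Well", "Well", "Well", "Well", "Sick", "Well",
              "DNA", "DNA", "DNA", "DNA", "Well", "Well"),
  stringsAsFactors = FALSE
)

# TRUE if every element equals the first one (assumes no NA values)
all_same <- function(x) all(x == x[1])

# One logical per patient, named by patient
same <- sapply(split(dat$status, dat$patient), all_same)

# Patients for which the check fails, i.e. whose statuses differ
differing <- names(same)[!same]
print(differing)  # "Jean" "Jess"
```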










      r performance loops vectorization
















asked Mar 28 at 21:41 by Dan Lewer

























          4 Answers


















Here is a non-tidy way:

table(unique(dat)[,'patient'])

gives

Jack Jean Jess John 
   1    2    2    1
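To go from these counts to the patients themselves, one option is to keep the names whose count exceeds one. This is only a sketch: it assumes `dat` has just the two columns `patient` and `status`, as in the question, and `counts` and `differing` are illustrative names.

```r
# Same example data as in the question
dat <- data.frame(
  patient = rep(c("John", "Jean", "Jack", "Jess"), each = 3),
  status  = c("Well", "Well", "Well", "Well", "Sick", "Well",
              "DNA", "DNA", "DNA", "DNA", "Well", "Well"),
  stringsAsFactors = FALSE
)

# unique() leaves one row per distinct (patient, status) pair,
# so the table counts the distinct statuses of each patient
counts <- table(unique(dat)[, "patient"])

# Patients with more than one distinct status
differing <- names(counts)[counts > 1]
print(differing)  # "Jean" "Jess"
```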























And a slightly different tidy approach where you keep information about the status:

library("tidyverse")

dat <- structure(list(patient = c('John', 'John', 'John', 'Jean', 'Jean', 'Jean', 'Jack', 'Jack', 'Jack', 'Jess', 'Jess', 'Jess'),
                      status = c('Well', 'Well', 'Well', 'Well', 'Sick', 'Well', 'DNA', 'DNA', 'DNA', 'DNA', 'Well', 'Well')),
                 class = "data.frame", row.names = c(NA, -12L))

dat %>%
  # Keep unique combinations of patient and status
  distinct(patient, status) %>%
  # Are there any patients with more than one status?
  group_by(patient) %>%
  filter(n() > 1) %>%
  summarise(status = paste(status, collapse = ","))
#> # A tibble: 2 x 2
#>   patient status   
#>   <chr>   <chr>    
#> 1 Jean    Well,Sick
#> 2 Jess    DNA,Well 

Created on 2019-03-28 by the reprex package (v0.2.1)
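For readers without the tidyverse installed, roughly the same result can be sketched in base R. This is an assumed equivalent rather than the answerer's code; `combos`, `n_status`, `statuses` and `result` are names invented here.

```r
# Same example data as in the question
dat <- data.frame(
  patient = rep(c("John", "Jean", "Jack", "Jess"), each = 3),
  status  = c("Well", "Well", "Well", "Well", "Sick", "Well",
              "DNA", "DNA", "DNA", "DNA", "Well", "Well"),
  stringsAsFactors = FALSE
)

# Keep unique combinations of patient and status
combos <- unique(dat[, c("patient", "status")])

# Number of distinct statuses per patient
n_status <- tapply(combos$status, combos$patient, length)

# Distinct statuses per patient, pasted into one string
statuses <- tapply(combos$status, combos$patient, paste, collapse = ",")

# Keep only the patients with more than one status
result <- statuses[n_status > 1]
print(result)
```

The result is a named character vector (one element per patient) rather than a tibble.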
























And here's a data.table approach:

library(data.table)
setDT(dat)
dat[, .(unique = uniqueN(status)), by = patient]

   patient unique
1:    John      1
2:    Jean      2
3:    Jack      1
4:    Jess      2























Here's one idea...

d <- function(x) {  # test whether each element of a vector is different to the element before
  y <- x != c(x[-1], NA)
  y <- c(FALSE, y)
  y[-length(y)]
}

dat$nc <- d(dat$status) & !d(dat$patient)  # status changes but patient doesn't
unique(dat$patient[dat$nc])
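Comparing neighbouring rows only works when each patient's rows sit together. The same idea can be written a little more compactly, and sorting by group first removes that requirement. A minimal sketch, with `changed` and `differing_groups` as hypothetical helper names, assuming there are no NA values:

```r
# TRUE where an element differs from the one before it (first element is FALSE)
changed <- function(x) c(FALSE, x[-1] != x[-length(x)])

# Groups in which the value changes between neighbouring rows after sorting
differing_groups <- function(group, value) {
  o <- order(group)  # bring each group's rows together
  g <- group[o]
  v <- value[o]
  # value changed while the group stayed the same
  unique(g[changed(v) & !changed(g)])
}

# Same example data as in the question, with the rows deliberately shuffled
dat <- data.frame(
  patient = rep(c("John", "Jean", "Jack", "Jess"), each = 3),
  status  = c("Well", "Well", "Well", "Well", "Sick", "Well",
              "DNA", "DNA", "DNA", "DNA", "Well", "Well"),
  stringsAsFactors = FALSE
)
shuffled <- dat[c(12, 1, 5, 8, 3, 10, 7, 2, 6, 11, 4, 9), ]

differing_groups(shuffled$patient, shuffled$status)  # "Jean" "Jess"
```

The sort adds some cost, so on data that is already grouped it can be skipped.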



                EDIT - Here's my first ever effort at benchmarking



The results suggest that the split/apply and 'table' approaches in base R are actually faster than either dplyr or data.table for this purpose, while the 'ch' function is much faster still. Each expression was run only once (times = 1), so the timings are rough. The 'ch' function does rely on each patient's rows being consecutive in the table, which the other approaches don't.



# function for my approach above

ch <- function(dat, group, status) {
  d <- function(x) {
    y <- x != c(x[-1], NA)
    y <- c(FALSE, y)
    y[-length(y)]
  }
  unique(dat[, group][d(dat[, status]) & !d(dat[, group])])
}

# you can also use factor and diff - see 'ch2' below
# generate data with 20000 groups

library(stringi)
library(dplyr)
library(data.table)
library(microbenchmark)

dat <- data.frame(patient = rep(stri_rand_strings(20000, 7), each = 4),
                  status = sample(c('A', 'B', 'C'), 80000, replace = TRUE, prob = c(0.8, 0.1, 0.1)),
                  stringsAsFactors = FALSE)

microbenchmark(
  dplyr = dat %>% as_tibble() %>% group_by(patient) %>% summarise(result = n_distinct(status)),
  split_apply = sapply(split(dat$status, dat$patient), function(x) all(x == x[1])),
  table = table(unique(dat)[,'patient']),
  ch = ch(dat, 'patient', 'status'),
  # ch2: same patient as the previous row (diff == 0) but a different status (diff != 0)
  ch2 = unique(dat$patient[c(FALSE, diff(as.numeric(factor(dat$patient))) == 0 & diff(as.numeric(factor(dat$status))) != 0)]),
  # setDT() on a copy, so the data frame used by the other expressions is not converted in place
  datatable = setDT(copy(dat))[, .(unique = uniqueN(status)), by = patient],
  times = 1
)

Unit: milliseconds
        expr       min        lq      mean    median        uq       max neval
       dplyr 5523.6048 5523.6048 5523.6048 5523.6048 5523.6048 5523.6048     1
 split_apply  165.8760  165.8760  165.8760  165.8760  165.8760  165.8760     1
       table  224.9030  224.9030  224.9030  224.9030  224.9030  224.9030     1
          ch   10.8821   10.8821   10.8821   10.8821   10.8821   10.8821     1
         ch2  146.2358  146.2358  146.2358  146.2358  146.2358  146.2358     1
   datatable  851.1028  851.1028  851.1028  851.1028  851.1028  851.1028     1
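A single run per expression is noisy, and the fast approach depends on row order, so it may be worth checking that the approaches agree before comparing speeds. The sketch below uses only base R. The generated data, the seed, the smaller group count and the helper `time_it` are assumptions made for illustration, and no timings are claimed.

```r
set.seed(1)  # arbitrary seed, only so this sketch is reproducible
n_groups <- 2000
dat <- data.frame(
  patient = rep(sprintf("p%05d", seq_len(n_groups)), each = 4),
  status  = sample(c("A", "B", "C"), n_groups * 4, replace = TRUE,
                   prob = c(0.8, 0.1, 0.1)),
  stringsAsFactors = FALSE
)

# Approach 1: split/apply, as in the question
same <- sapply(split(dat$status, dat$patient), function(x) all(x == x[1]))
by_split <- names(same)[!same]

# Approach 2: count distinct (patient, status) pairs per patient
counts <- table(unique(dat)$patient)
by_table <- names(counts)[counts > 1]

# Approach 3: compare neighbouring rows (needs each patient on consecutive rows)
changed <- function(x) c(FALSE, x[-1] != x[-length(x)])
by_adjacent <- unique(dat$patient[changed(dat$status) & !changed(dat$patient)])

# All three should name the same patients
identical(by_split, by_table)
identical(by_split, sort(by_adjacent))

# Median elapsed time over several runs instead of a single run
time_it <- function(f, reps = 5) {
  median(replicate(reps, system.time(f())[["elapsed"]]))
}
time_it(function() table(unique(dat)$patient))
```

Passing each approach to `time_it` as a function gives a rough comparison without extra packages; a dedicated benchmarking package with more repetitions would be more precise.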





answered Mar 28 at 21:52 by Leo Brueggeman (author of the non-tidy table(unique(dat)) answer above)


























answered Mar 28 at 21:55 by dipetkov (author of the tidy approach above)
























answered Mar 29 at 1:45 by David F (author of the data.table approach above)
























                                          0


















                                          Here's one idea...



                                          d <- function (x) # test whether each element of a vector is different to the element before
                                          y <- x != c(x[-1], NA)
                                          y <- c(F, y)
                                          y[-length(y)]


                                          dat$nc <- d(dat$status) & !d(dat$patient) # status changes but patient doesn't
                                          unique(dat$patient[dat$nc])



                                          EDIT - Here's my first ever effort at benchmarking



                                          The results suggest that split/apply and 'table' approaches in base are actually faster than either dplyr or data.table for this purpose, while the 'ch' function is much faster. The 'ch' function does rely on the patients being on consecutive rows in the table, which the other approaches don't.



                                          # function for my approach above

                                          ch <- function(dat, group, status)
                                          d <- function (x)
                                          y <- x != c(x[-1], NA)
                                          y <- c(F, y)
                                          y[-length(y)]

                                          unique(dat[,group][d(dat[,status]) & !d(dat[,group])])


                                          # you can also use factor and diff - see 'ch2' below
                                          # generate data with 20000 groups

                                          library(stringi)
                                          dat <- data.frame(patient = rep(stri_rand_strings(20000, 7), each = 4),
                                          status = sample(c('A', 'B', 'C'), 80000, replace = T, prob = c(0.8, 0.1, 0.1)),
                                          stringsAsFactors = F)

                                          microbenchmark(
                                          dplyr = dat %>% as_tibble() %>% group_by(patient) %>% summarise(result = n_distinct(status)),
                                          split_apply = sapply(split(dat$status, dat$patient), function(x) all(x == x[1])),
                                          table = table(unique(dat)[,'patient']),
                                          ch = ch(dat, 'patient', 'status'),
                                          ch2 = unique(dat$patient[c(F, diff(as.numeric(factor(dat$patient))) != 0 & diff(as.numeric(factor(dat$status))) == 0)]),
                                          datatable = setDT(dat); dat[,.(unique=uniqueN(status)),patient],
                                          times = 1
                                          )

                                          Unit: milliseconds
                                          expr min lq mean median uq max neval
                                          dplyr 5523.6048 5523.6048 5523.6048 5523.6048 5523.6048 5523.6048 1
                                          split_apply 165.8760 165.8760 165.8760 165.8760 165.8760 165.8760 1
                                          table 224.9030 224.9030 224.9030 224.9030 224.9030 224.9030 1
                                          ch 10.8821 10.8821 10.8821 10.8821 10.8821 10.8821 1
                                          ch2 146.2358 146.2358 146.2358 146.2358 146.2358 146.2358 1
                                          datatable 851.1028 851.1028 851.1028 851.1028 851.1028 851.1028 1





                                          share|improve this answer
































                                            0


















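Here's one idea...

    # TRUE where an element of x differs from the element before it
    d <- function(x) {
      y <- x != c(x[-1], NA)  # compare each element with the next one
      y <- c(FALSE, y)        # shift right so the flag sits on the later element
      y[-length(y)]           # drop the trailing NA so the length matches x
    }

    dat$nc <- d(dat$status) & !d(dat$patient)  # status changes but patient doesn't
    unique(dat$patient[dat$nc])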
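
To make the idea concrete, here is a minimal sketch on a made-up table. The `toy` data frame and its values are assumptions for illustration only; the `d` helper is the one defined above, repeated so the snippet runs on its own.

```r
# Helper from the answer: TRUE where an element differs from the one before it
d <- function(x) {
  y <- x != c(x[-1], NA)
  y <- c(FALSE, y)
  y[-length(y)]
}

# Hypothetical toy data: only patient "b" has more than one status
toy <- data.frame(
  patient = c("a", "a", "b", "b", "c", "c"),
  status  = c("A", "A", "A", "B", "C", "C"),
  stringsAsFactors = FALSE
)

# Flag rows where the status changed but the patient did not
changed <- d(toy$status) & !d(toy$patient)
result <- unique(toy$patient[changed])
print(result)
```

With this toy input the only flagged row is the second row of patient "b", so the result is "b". The change from "B" to "C" between patients "b" and "c" is ignored because the patient changes on that row too.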



EDIT - Here's my first ever effort at benchmarking.

The results suggest that the split/apply and 'table' approaches in base R are actually faster than either dplyr or data.table for this purpose, while the 'ch' function is much faster still. The 'ch' function does rely on each patient's rows being consecutive in the table, which the other approaches don't. Each expression was run only once (times = 1), so the timings are a rough guide only.



    # function for my approach above
    ch <- function(dat, group, status) {
      d <- function(x) {
        y <- x != c(x[-1], NA)
        y <- c(FALSE, y)
        y[-length(y)]
      }
      unique(dat[, group][d(dat[, status]) & !d(dat[, group])])
    }

    # you can also use factor and diff - see 'ch2' below
    # generate data with 20000 groups

    library(stringi)
    library(dplyr)
    library(data.table)
    library(microbenchmark)

    dat <- data.frame(patient = rep(stri_rand_strings(20000, 7), each = 4),
                      status = sample(c('A', 'B', 'C'), 80000, replace = TRUE, prob = c(0.8, 0.1, 0.1)),
                      stringsAsFactors = FALSE)

    microbenchmark(
      dplyr = dat %>% as_tibble() %>% group_by(patient) %>% summarise(result = n_distinct(status)),
      split_apply = sapply(split(dat$status, dat$patient), function(x) all(x == x[1])),
      table = table(unique(dat)[, 'patient']),
      ch = ch(dat, 'patient', 'status'),
      # same test as 'ch': patient unchanged (diff == 0) but status changed (diff != 0)
      ch2 = unique(dat$patient[c(FALSE, diff(as.numeric(factor(dat$patient))) == 0 & diff(as.numeric(factor(dat$status))) != 0)]),
      datatable = {setDT(dat); dat[, .(unique = uniqueN(status)), patient]},
      times = 1
    )

    Unit: milliseconds
    expr              min        lq      mean    median        uq       max neval
    dplyr       5523.6048 5523.6048 5523.6048 5523.6048 5523.6048 5523.6048     1
    split_apply  165.8760  165.8760  165.8760  165.8760  165.8760  165.8760     1
    table        224.9030  224.9030  224.9030  224.9030  224.9030  224.9030     1
    ch            10.8821   10.8821   10.8821   10.8821   10.8821   10.8821     1
    ch2          146.2358  146.2358  146.2358  146.2358  146.2358  146.2358     1
    datatable    851.1028  851.1028  851.1028  851.1028  851.1028  851.1028     1
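
Since 'ch' depends on each patient's rows being consecutive, one possible workaround is to sort by the grouping column first. The sketch below assumes a small made-up `unsorted` data frame and uses base R's `order()`; the sorting step is not part of the benchmark above and its cost is not reflected in those timings.

```r
# 'ch' as defined in the answer, repeated so the snippet runs on its own
ch <- function(dat, group, status) {
  d <- function(x) {
    y <- x != c(x[-1], NA)
    y <- c(FALSE, y)
    y[-length(y)]
  }
  unique(dat[, group][d(dat[, status]) & !d(dat[, group])])
}

# Hypothetical data where patient "a" has two statuses on non-adjacent rows
unsorted <- data.frame(
  patient = c("a", "b", "a", "b"),
  status  = c("A", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# Without sorting, the change within patient "a" is missed
res_unsorted <- ch(unsorted, "patient", "status")

# Sorting by patient puts each patient's rows next to each other
sorted <- unsorted[order(unsorted$patient), ]
res_sorted <- ch(sorted, "patient", "status")

print(res_unsorted)
print(res_sorted)
```

On this input the unsorted call returns an empty character vector, while the sorted call returns "a". Whether 'ch' remains the fastest option once the sort is included would need to be measured on the real data.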





edited Mar 29 at 8:52
answered Mar 28 at 23:52
Dan Lewer