
Remove (quasi) identical rows


In the following data.df, we can see that rows 2 and 3 are identical, and row 4 differs only in its mean.



 iso3 dest code year uv mean
1 ALB AUT 490700 2002 14027.2433 427387.640
2 ALB BGR 490700 2002 1215.5613 11886.494
3 ALB BGR 490700 2002 1215.5613 11886.494
4 ALB BGR 490700 2002 1215.5613 58069.405
5 ALB BGR 843050 2002 677.9827 4272.176
6 ALB BGR 851030 2002 31004.0946 32364.379
7 ALB HRV 392329 2002 1410.0072 6970.930


Is there an easy way to automatically find these matching rows?
I found this question, which seems to answer mine, but I do not understand how duplicated() works...

What I would like is a "simple" command where I could specify which column values must be identical across rows.
Something like: function(data.df, c(iso3, dest, code, year, uv, mean))
to find the exactly identical rows, and function(data.df, c(iso3, dest, code, year, uv)) to find the "quasi" identical rows...



The expected result would be something like, in the first case:



2 ALB BGR 490700 2002 1215.5613 11886.494
3 ALB BGR 490700 2002 1215.5613 11886.494


and in the second one:



2 ALB BGR 490700 2002 1215.5613 11886.494
3 ALB BGR 490700 2002 1215.5613 11886.494
4 ALB BGR 490700 2002 1215.5613 58069.405


Any ideas?





























  • 2





    You should try dplyr::distinct(data.df, iso3, dest, code, year, uv, .keep_all = TRUE)

    – kath
    Mar 26 at 12:29
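(An illustrative sketch, not part of kath's comment:) dplyr::distinct() keeps one row per combination of the listed columns, and .keep_all = TRUE retains the remaining columns from the first matching row. The toy data.df below is assumed from the question's example:

```r
library(dplyr)

# A small stand-in for the question's data.df (values taken from its example)
data.df <- data.frame(
  iso3 = c("ALB", "ALB", "ALB"),
  dest = c("BGR", "BGR", "BGR"),
  code = c(490700, 490700, 490700),
  year = c(2002, 2002, 2002),
  uv   = c(1215.5613, 1215.5613, 1215.5613),
  mean = c(11886.494, 11886.494, 58069.405)
)

# Collapse to one row per (iso3, dest, code, year, uv); the first mean is kept
distinct(data.df, iso3, dest, code, year, uv, .keep_all = TRUE)
```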


















r






edited Mar 26 at 12:56 by Ronak Shah (69k rep; 10 gold, 48 silver, 80 bronze badges)

asked Mar 26 at 12:24 by TeYaP (205 rep; 4 silver, 16 bronze badges)






4 Answers


















3














We could write a function and then pass the columns which we want to consider.

get_duplicated_rows <- function(df, cols) {
  df[duplicated(df[cols]) | duplicated(df[cols], fromLast = TRUE), ]
}

get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv", "mean"))
# iso3 dest code year uv mean
#2 ALB BGR 490700 2002 1215.6 11886
#3 ALB BGR 490700 2002 1215.6 11886

get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv"))
# iso3 dest code year uv mean
#2 ALB BGR 490700 2002 1215.6 11886
#3 ALB BGR 490700 2002 1215.6 11886
#4 ALB BGR 490700 2002 1215.6 58069
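For context (an editorial note, not part of the original answer): duplicated() on its own flags only the second and later occurrences of a value, so the answer combines a forward pass with a fromLast = TRUE pass to flag every copy. A minimal base-R sketch:

```r
x <- c(1, 2, 2, 3)

# duplicated() marks only the later occurrences of a repeated value
duplicated(x)
# FALSE FALSE  TRUE FALSE

# OR-ing with a reversed pass flags all copies of a repeated value
duplicated(x) | duplicated(x, fromLast = TRUE)
# FALSE  TRUE  TRUE FALSE
```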





answered Mar 26 at 12:29 by Ronak Shah























  • And if I then want to remove the duplicates? Do you know an easy way to do so once they are identified (in order to keep just one of them)?

    – TeYaP
    Mar 26 at 12:34







  • 1

    @TeYaP If you want to keep only one of the duplicates, remove the | duplicated(df[cols], fromLast = TRUE) part from the function and it will keep only one row.

    – Ronak Shah
    Mar 26 at 12:38
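To spell out the comment above as code (an illustrative sketch, not part of the original answer): a single-direction duplicated() call, negated, keeps exactly the first row of each duplicate group:

```r
# Keep only the first row of each duplicate group (judged by the chosen columns)
remove_duplicated_rows <- function(df, cols) {
  df[!duplicated(df[cols]), ]
}

# Toy example: rows 1 and 2 are duplicates, row 3 is unique
df <- data.frame(a = c(1, 1, 2), b = c("x", "x", "y"))
remove_duplicated_rows(df, c("a", "b"))
#   a b
# 1 1 x
# 3 2 y
```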


















1














You can get at the quasi-duplicates by checking each column one by one and then keeping the rows whose row sum of duplicate flags is greater than your target value.

toread <- " iso3 dest code year uv mean
ALB AUT 490700 2002 14027.2433 427387.640
ALB BGR 490700 2002 1215.5613 11886.494
ALB BGR 490700 2002 1215.5613 11886.494
ALB BGR 490700 2002 1215.5613 58069.405
ALB BGR 843050 2002 677.9827 4272.176
ALB BGR 851030 2002 31004.0946 32364.379
ALB HRV 392329 2002 1410.0072 6970.930"

df <- read.table(textConnection(toread), header = TRUE)
closeAllConnections()

get_quasi_duplicated_rows <- function(df, cols, cut) {
  result <- matrix(nrow = nrow(df), ncol = length(cols))
  colnames(result) <- cols
  for (col in cols) {
    dup <- duplicated(df[col]) | duplicated(df[col], fromLast = TRUE)
    result[, col] <- dup
  }
  return(df[which(rowSums(result) > cut), ])
}

get_quasi_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv", "mean"), 4)

  iso3 dest   code year       uv     mean
2  ALB  BGR 490700 2002 1215.561 11886.49
3  ALB  BGR 490700 2002 1215.561 11886.49
4  ALB  BGR 490700 2002 1215.561 58069.40





answered Mar 26 at 13:27 by Graeme Prentice-Mott






























    1














Using the dplyr and rlang packages (get_dupes() itself comes from the janitor package) we can achieve this.

Solution:

find_dupes <- function(df, cols) {
  df <- df %>% get_dupes(!!!rlang::syms(cols))
  return(df)
}

Output:

1st case:

> cols
[1] "iso3" "dest" "code" "year" "uv"

> find_dupes(df, cols)

# A tibble: 3 x 7
  iso3  dest    code  year    uv dupe_count   mean
  <fct> <fct>  <int> <int> <dbl>      <int>  <dbl>
1 ALB   BGR   490700  2002 1216.          3 11886.
2 ALB   BGR   490700  2002 1216.          3 11886.
3 ALB   BGR   490700  2002 1216.          3 58069.

2nd case:

> cols
[1] "iso3" "dest" "code" "year" "uv"   "mean"

> find_dupes(df, cols)

# A tibble: 2 x 7
  iso3  dest    code  year    uv   mean dupe_count
  <fct> <fct>  <int> <int> <dbl>  <dbl>      <int>
1 ALB   BGR   490700  2002 1216. 11886.          2
2 ALB   BGR   490700  2002 1216. 11886.          2

Note:

The rlang::syms() function takes strings as input and turns them into symbols. Contrary to as.name(), it converts the strings to the native encoding beforehand. This is necessary because symbols silently remove the encoding mark of strings.

To pass a list of column names into a dplyr function, we use syms().

!!! is used to unquote and splice the symbols into the call.






































      1














We can use group_by_all and filter the groups having a frequency count greater than 1:

library(dplyr)
df1 %>%
  group_by_all() %>%
  filter(n() > 1)
# A tibble: 2 x 6
# Groups: iso3, dest, code, year, uv, mean [1]
#  iso3  dest    code  year    uv   mean
#  <chr> <chr>  <int> <int> <dbl>  <dbl>
#1 ALB   BGR   490700  2002 1216. 11886.
#2 ALB   BGR   490700  2002 1216. 11886.

If it is a subset of columns, use group_by_at:

df1 %>%
  group_by_at(vars(iso3, dest, code, year, uv)) %>%
  filter(n() > 1)
# A tibble: 3 x 6
# Groups: iso3, dest, code, year, uv [1]
#  iso3  dest    code  year    uv   mean
#  <chr> <chr>  <int> <int> <dbl>  <dbl>
#1 ALB   BGR   490700  2002 1216. 11886.
#2 ALB   BGR   490700  2002 1216. 11886.
#3 ALB   BGR   490700  2002 1216. 58069.
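As an aside (my base-R sketch, not part of the original answer), the same grouped filter can be done without dplyr: build one key from the chosen columns and count each row's group size with ave():

```r
# Toy data: rows 1-3 share (iso3, dest, code, year, uv); row 4 does not
df1 <- data.frame(
  iso3 = c("ALB", "ALB", "ALB", "ALB"),
  dest = c("BGR", "BGR", "BGR", "HRV"),
  code = c(490700, 490700, 490700, 392329),
  year = c(2002, 2002, 2002, 2002),
  uv   = c(1215.5613, 1215.5613, 1215.5613, 1410.0072),
  mean = c(11886.494, 11886.494, 58069.405, 6970.930)
)

# One grouping key per row, then the size of each row's group
key <- interaction(df1$iso3, df1$dest, df1$code, df1$year, df1$uv, drop = TRUE)
grp_size <- ave(seq_along(key), key, FUN = length)

df1[grp_size > 1, ]   # the quasi-duplicated rows (groups of size > 1)
```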






























        4 Answers
        4






        active

        oldest

        votes








        4 Answers
        4






        active

        oldest

        votes









        active

        oldest

        votes






        active

        oldest

        votes









        3














        We could write a function and then pass columns which we want to consider.



        get_duplicated_rows <- function(df, cols) 
        df[duplicated(df[cols])

        get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv","mean"))

        # iso3 dest code year uv mean
        #2 ALB BGR 490700 2002 1215.6 11886
        #3 ALB BGR 490700 2002 1215.6 11886

        get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv"))
        # iso3 dest code year uv mean
        #2 ALB BGR 490700 2002 1215.6 11886
        #3 ALB BGR 490700 2002 1215.6 11886
        #4 ALB BGR 490700 2002 1215.6 58069





        share|improve this answer























        • And if then I want to remove the duplicates? do you know an easy way to do so once they are identified? (in order to keep just one of them...)

          – TeYaP
          Mar 26 at 12:34







        • 1





          @TeYaP If you want to keep only one of the duplicates, remove the | duplicated(df[cols], fromLast = TRUE part from the function and it will keep only one row.

          – Ronak Shah
          Mar 26 at 12:38















        3














        We could write a function and then pass columns which we want to consider.



        get_duplicated_rows <- function(df, cols) 
        df[duplicated(df[cols])

        get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv","mean"))

        # iso3 dest code year uv mean
        #2 ALB BGR 490700 2002 1215.6 11886
        #3 ALB BGR 490700 2002 1215.6 11886

        get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv"))
        # iso3 dest code year uv mean
        #2 ALB BGR 490700 2002 1215.6 11886
        #3 ALB BGR 490700 2002 1215.6 11886
        #4 ALB BGR 490700 2002 1215.6 58069





        share|improve this answer























        • And if then I want to remove the duplicates? do you know an easy way to do so once they are identified? (in order to keep just one of them...)

          – TeYaP
          Mar 26 at 12:34







        • 1





          @TeYaP If you want to keep only one of the duplicates, remove the | duplicated(df[cols], fromLast = TRUE part from the function and it will keep only one row.

          – Ronak Shah
          Mar 26 at 12:38













        3












        3








        3







        We could write a function and then pass columns which we want to consider.



        get_duplicated_rows <- function(df, cols) 
        df[duplicated(df[cols])

        get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv","mean"))

        # iso3 dest code year uv mean
        #2 ALB BGR 490700 2002 1215.6 11886
        #3 ALB BGR 490700 2002 1215.6 11886

        get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv"))
        # iso3 dest code year uv mean
        #2 ALB BGR 490700 2002 1215.6 11886
        #3 ALB BGR 490700 2002 1215.6 11886
        #4 ALB BGR 490700 2002 1215.6 58069





        share|improve this answer













        We could write a function and then pass columns which we want to consider.



        get_duplicated_rows <- function(df, cols) 
        df[duplicated(df[cols])

        get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv","mean"))

        # iso3 dest code year uv mean
        #2 ALB BGR 490700 2002 1215.6 11886
        #3 ALB BGR 490700 2002 1215.6 11886

        get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv"))
        # iso3 dest code year uv mean
        #2 ALB BGR 490700 2002 1215.6 11886
        #3 ALB BGR 490700 2002 1215.6 11886
        #4 ALB BGR 490700 2002 1215.6 58069






        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Mar 26 at 12:29









        Ronak ShahRonak Shah

        69k10 gold badges48 silver badges80 bronze badges




        69k10 gold badges48 silver badges80 bronze badges












        • And if then I want to remove the duplicates? do you know an easy way to do so once they are identified? (in order to keep just one of them...)

          – TeYaP
          Mar 26 at 12:34







        • 1





          @TeYaP If you want to keep only one of the duplicates, remove the | duplicated(df[cols], fromLast = TRUE part from the function and it will keep only one row.

          – Ronak Shah
          Mar 26 at 12:38

















        • And if then I want to remove the duplicates? do you know an easy way to do so once they are identified? (in order to keep just one of them...)

          – TeYaP
          Mar 26 at 12:34







        • 1





          @TeYaP If you want to keep only one of the duplicates, remove the | duplicated(df[cols], fromLast = TRUE part from the function and it will keep only one row.

          – Ronak Shah
          Mar 26 at 12:38
















        And if then I want to remove the duplicates? do you know an easy way to do so once they are identified? (in order to keep just one of them...)

        – TeYaP
        Mar 26 at 12:34






        And if then I want to remove the duplicates? do you know an easy way to do so once they are identified? (in order to keep just one of them...)

        – TeYaP
        Mar 26 at 12:34





        1




        1





        @TeYaP If you want to keep only one of the duplicates, remove the | duplicated(df[cols], fromLast = TRUE part from the function and it will keep only one row.

        – Ronak Shah
        Mar 26 at 12:38





        @TeYaP If you want to keep only one of the duplicates, remove the | duplicated(df[cols], fromLast = TRUE part from the function and it will keep only one row.

        – Ronak Shah
        Mar 26 at 12:38













        1














        You can get at the quasi duplications if you look at each feature one by one and then, you consider rows with a Rowsum greater than your target value.



        toread <- " iso3 dest code year uv mean
        ALB AUT 490700 2002 14027.2433 427387.640
        ALB BGR 490700 2002 1215.5613 11886.494
        ALB BGR 490700 2002 1215.5613 11886.494
        ALB BGR 490700 2002 1215.5613 58069.405
        ALB BGR 843050 2002 677.9827 4272.176
        ALB BGR 851030 2002 31004.0946 32364.379
        ALB HRV 392329 2002 1410.0072 6970.930"

        df <- read.table(textConnection(toread), header = TRUE)
        closeAllConnections()

        get_quasi_duplicated_rows <- function(df, cols, cut)
        result <- matrix(nrow = nrow(df), ncol = length(cols))
        colnames(result) <- cols
        for(col in cols)
        dup <- duplicated(df[col])
        return(df[which(rowSums(result) > cut), ])


        get_quasi_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv","mean"), 4)


        iso3 dest code year uv mean
        2 ALB BGR 490700 2002 1215.561 11886.49
        3 ALB BGR 490700 2002 1215.561 11886.49
        4 ALB BGR 490700 2002 1215.561 58069.40





        share|improve this answer



























          1














          You can get at the quasi duplications if you look at each feature one by one and then, you consider rows with a Rowsum greater than your target value.



          toread <- " iso3 dest code year uv mean
          ALB AUT 490700 2002 14027.2433 427387.640
          ALB BGR 490700 2002 1215.5613 11886.494
          ALB BGR 490700 2002 1215.5613 11886.494
          ALB BGR 490700 2002 1215.5613 58069.405
          ALB BGR 843050 2002 677.9827 4272.176
          ALB BGR 851030 2002 31004.0946 32364.379
          ALB HRV 392329 2002 1410.0072 6970.930"

          df <- read.table(textConnection(toread), header = TRUE)
          closeAllConnections()

          get_quasi_duplicated_rows <- function(df, cols, cut)
          result <- matrix(nrow = nrow(df), ncol = length(cols))
          colnames(result) <- cols
          for(col in cols)
          dup <- duplicated(df[col])
          return(df[which(rowSums(result) > cut), ])


          get_quasi_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv","mean"), 4)


          iso3 dest code year uv mean
          2 ALB BGR 490700 2002 1215.561 11886.49
          3 ALB BGR 490700 2002 1215.561 11886.49
          4 ALB BGR 490700 2002 1215.561 58069.40





          share|improve this answer

























            1












            1








            1







            You can get at the quasi duplications if you look at each feature one by one and then, you consider rows with a Rowsum greater than your target value.



            toread <- " iso3 dest code year uv mean
            ALB AUT 490700 2002 14027.2433 427387.640
            ALB BGR 490700 2002 1215.5613 11886.494
            ALB BGR 490700 2002 1215.5613 11886.494
            ALB BGR 490700 2002 1215.5613 58069.405
            ALB BGR 843050 2002 677.9827 4272.176
            ALB BGR 851030 2002 31004.0946 32364.379
            ALB HRV 392329 2002 1410.0072 6970.930"

            df <- read.table(textConnection(toread), header = TRUE)
            closeAllConnections()

            get_quasi_duplicated_rows <- function(df, cols, cut)
            result <- matrix(nrow = nrow(df), ncol = length(cols))
            colnames(result) <- cols
            for(col in cols)
            dup <- duplicated(df[col])
            return(df[which(rowSums(result) > cut), ])


            get_quasi_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv","mean"), 4)


            iso3 dest code year uv mean
            2 ALB BGR 490700 2002 1215.561 11886.49
            3 ALB BGR 490700 2002 1215.561 11886.49
            4 ALB BGR 490700 2002 1215.561 58069.40





            share|improve this answer













            You can get at the quasi duplications if you look at each feature one by one and then, you consider rows with a Rowsum greater than your target value.



            toread <- " iso3 dest code year uv mean
            ALB AUT 490700 2002 14027.2433 427387.640
            ALB BGR 490700 2002 1215.5613 11886.494
            ALB BGR 490700 2002 1215.5613 11886.494
            ALB BGR 490700 2002 1215.5613 58069.405
            ALB BGR 843050 2002 677.9827 4272.176
            ALB BGR 851030 2002 31004.0946 32364.379
            ALB HRV 392329 2002 1410.0072 6970.930"

            df <- read.table(textConnection(toread), header = TRUE)
            closeAllConnections()

            get_quasi_duplicated_rows <- function(df, cols, cut)
            result <- matrix(nrow = nrow(df), ncol = length(cols))
            colnames(result) <- cols
            for(col in cols)
            dup <- duplicated(df[col])
            return(df[which(rowSums(result) > cut), ])


            get_quasi_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv","mean"), 4)


            iso3 dest code year uv mean
            2 ALB BGR 490700 2002 1215.561 11886.49
            3 ALB BGR 490700 2002 1215.561 11886.49
            4 ALB BGR 490700 2002 1215.561 58069.40






            share|improve this answer












            share|improve this answer



            share|improve this answer










            answered Mar 26 at 13:27









            Graeme Prentice-MottGraeme Prentice-Mott

            688 bronze badges




            688 bronze badges





















                1














                Using dplyr and rlang package we can achive this-



                Solution-



                find_dupes <- function(df,cols)
                df <- df %>% get_dupes(!!!rlang::syms(cols))
                return(df)



                Output-



                1st Case-



                > cols
                [1] "iso3" "dest" "code" "year" "uv"

                > find_dupes(df, cols)

                # A tibble: 3 x 7
                iso3 dest code year uv dupe_count mean
                <fct> <fct> <int> <int> <dbl> <int> <dbl>
                1 ALB BGR 490700 2002 1216. 3 11886.
                2 ALB BGR 490700 2002 1216. 3 11886.
                3 ALB BGR 490700 2002 1216. 3 58069.


                2nd Case-



                > cols
                [1] "iso3" "dest" "code" "year" "uv" "mean"

                > find_dupes(df,cols)

                # A tibble: 2 x 7
                iso3 dest code year uv mean dupe_count
                <fct> <fct> <int> <int> <dbl> <dbl> <int>
                1 ALB BGR 490700 2002 1216. 11886. 2
                2 ALB BGR 490700 2002 1216. 11886. 2


                Note-



                rlan::syms function take strings as input and turn them into symbols. Contrarily to as.name(), they convert the strings to the native encoding beforehand. This is necessary because symbols remove silently the encoding mark of strings.



                To pass a list of vector names in dplyr function, we use syms.



                !!! is used to unquote






                share|improve this answer





























                  1














                  Using dplyr and rlang package we can achive this-



                  Solution-



                  find_dupes <- function(df,cols)
                  df <- df %>% get_dupes(!!!rlang::syms(cols))
                  return(df)



                  Output-



                  1st Case-



                  > cols
                  [1] "iso3" "dest" "code" "year" "uv"

                  > find_dupes(df, cols)

                  # A tibble: 3 x 7
                  iso3 dest code year uv dupe_count mean
                  <fct> <fct> <int> <int> <dbl> <int> <dbl>
                  1 ALB BGR 490700 2002 1216. 3 11886.
                  2 ALB BGR 490700 2002 1216. 3 11886.
                  3 ALB BGR 490700 2002 1216. 3 58069.


                  2nd Case-



                  > cols
                  [1] "iso3" "dest" "code" "year" "uv" "mean"

                  > find_dupes(df,cols)

                  # A tibble: 2 x 7
                  iso3 dest code year uv mean dupe_count
                  <fct> <fct> <int> <int> <dbl> <dbl> <int>
                  1 ALB BGR 490700 2002 1216. 11886. 2
                  2 ALB BGR 490700 2002 1216. 11886. 2


                  Note-



                  rlan::syms function take strings as input and turn them into symbols. Contrarily to as.name(), they convert the strings to the native encoding beforehand. This is necessary because symbols remove silently the encoding mark of strings.



                  To pass a list of vector names in dplyr function, we use syms.



                  !!! is used to unquote






                  share|improve this answer



























                    1












                    1








                    1







                    Using dplyr and rlang package we can achive this-



                    Solution-



                    find_dupes <- function(df,cols)
                    df <- df %>% get_dupes(!!!rlang::syms(cols))
                    return(df)



                    Output-



                    1st Case-



                    > cols
                    [1] "iso3" "dest" "code" "year" "uv"

                    > find_dupes(df, cols)

                    # A tibble: 3 x 7
                    iso3 dest code year uv dupe_count mean
                    <fct> <fct> <int> <int> <dbl> <int> <dbl>
                    1 ALB BGR 490700 2002 1216. 3 11886.
                    2 ALB BGR 490700 2002 1216. 3 11886.
                    3 ALB BGR 490700 2002 1216. 3 58069.


                    2nd Case-



                    > cols
                    [1] "iso3" "dest" "code" "year" "uv" "mean"

                    > find_dupes(df,cols)

                    # A tibble: 2 x 7
                    iso3 dest code year uv mean dupe_count
                    <fct> <fct> <int> <int> <dbl> <dbl> <int>
                    1 ALB BGR 490700 2002 1216. 11886. 2
                    2 ALB BGR 490700 2002 1216. 11886. 2


                    Note-



                    rlan::syms function take strings as input and turn them into symbols. Contrarily to as.name(), they convert the strings to the native encoding beforehand. This is necessary because symbols remove silently the encoding mark of strings.



To pass a vector of column names into a dplyr function, we use syms().



!!! is used to unquote and splice the list of symbols into the function call.
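As a minimal sketch of the same tidy-eval pattern outside get_dupes() (the tiny data frame below is invented for illustration), syms() plus !!! can drive group_by() directly:

```r
library(dplyr)
library(rlang)

df <- tibble::tibble(
  iso3 = c("ALB", "ALB", "FRA"),
  year = c(2002L, 2002L, 2003L),
  uv   = c(1216, 1216, 500)
)

cols <- c("iso3", "year")

# !!! splices the list of symbols into group_by(),
# as if we had written group_by(iso3, year)
df %>%
  group_by(!!!syms(cols)) %>%
  filter(n() > 1)
```

This returns the two ALB/2002 rows, since that key occurs more than once.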






                    edited Mar 26 at 13:44

























                    answered Mar 26 at 13:36









Rushabh

1,599 · 4 silver badges · 22 bronze badges







































We can use group_by_all() and filter() to keep the groups having a frequency count greater than 1:



library(dplyr)
df1 %>%
  group_by_all() %>%
  filter(n() > 1)
# A tibble: 2 x 6
# Groups: iso3, dest, code, year, uv, mean [1]
#   iso3  dest    code  year    uv   mean
#   <chr> <chr>  <int> <int> <dbl>  <dbl>
# 1 ALB   BGR   490700  2002 1216. 11886.
# 2 ALB   BGR   490700  2002 1216. 11886.


If it is a subset of columns, use group_by_at():



df1 %>%
  group_by_at(vars(iso3, dest, code, year, uv)) %>%
  filter(n() > 1)
# A tibble: 3 x 6
# Groups: iso3, dest, code, year, uv [1]
#   iso3  dest    code  year    uv   mean
#   <chr> <chr>  <int> <int> <dbl>  <dbl>
# 1 ALB   BGR   490700  2002 1216. 11886.
# 2 ALB   BGR   490700  2002 1216. 11886.
# 3 ALB   BGR   490700  2002 1216. 58069.
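Note that group_by_all() and group_by_at() have since been superseded in dplyr 1.0 by across(). A sketch of the same subset-of-columns filter in the current idiom, plus a base-R equivalent that keeps every copy of a duplicated key (the sample data below is reconstructed from the output shown above):

```r
library(dplyr)

# Sample data mirroring the answer's output (values reconstructed for illustration)
df1 <- data.frame(
  iso3 = "ALB", dest = "BGR", code = 490700L, year = 2002L,
  uv   = c(1216, 1216, 1216), mean = c(11886, 11886, 58069)
)

cols <- c("iso3", "dest", "code", "year", "uv")

# Modern dplyr (>= 1.0): across(all_of()) replaces group_by_at(vars(...))
df1 %>%
  group_by(across(all_of(cols))) %>%
  filter(n() > 1) %>%
  ungroup()

# Base R: duplicated() alone marks only the later copies; combining it with
# fromLast = TRUE flags every row whose key occurs more than once
key <- df1[cols]
df1[duplicated(key) | duplicated(key, fromLast = TRUE), ]
```

Both pipelines return all three ALB/BGR/490700/2002 rows, matching the group_by_at() result.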





                            answered Mar 26 at 14:04









akrun

452k · 15 gold badges · 252 silver badges · 338 bronze badges






























