
Remove (quasi) identical rows


In the following data.df, we can see that rows 2 and 3 are identical, and row 4 differs only in its mean.



 iso3 dest code year uv mean
1 ALB AUT 490700 2002 14027.2433 427387.640
2 ALB BGR 490700 2002 1215.5613 11886.494
3 ALB BGR 490700 2002 1215.5613 11886.494
4 ALB BGR 490700 2002 1215.5613 58069.405
5 ALB BGR 843050 2002 677.9827 4272.176
6 ALB BGR 851030 2002 31004.0946 32364.379
7 ALB HRV 392329 2002 1410.0072 6970.930


Is there an easy way to automatically find these matching rows?
I found this question, which seems to answer mine, but I do not understand how duplicated() works...

What I would like is a "simple" command where I could specify which column values must be identical across rows.
Something like: function(data.df, c(iso3, dest, code, year, uv, mean))
to find the exactly identical rows, and function(data.df, c(iso3, dest, code, year, uv)) to find the "quasi" identical rows...



The expected result would be something like, in the first case:



2 ALB BGR 490700 2002 1215.5613 11886.494
3 ALB BGR 490700 2002 1215.5613 11886.494


and in the second one:



2 ALB BGR 490700 2002 1215.5613 11886.494
3 ALB BGR 490700 2002 1215.5613 11886.494
4 ALB BGR 490700 2002 1215.5613 58069.405


Any ideas?





























  • 2





    You should try dplyr::distinct(data.df, iso3, dest, code, year, uv, .keep_all = TRUE)

    – kath
    Mar 26 at 12:29
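(An illustrative sketch, not part of kath's comment:) dplyr::distinct() keeps one row per combination of the listed columns, and .keep_all = TRUE retains the remaining columns from the first matching row. The toy data.df below is assumed from the question's example:

```r
library(dplyr)

# A small stand-in for the question's data.df (values taken from its example)
data.df <- data.frame(
  iso3 = c("ALB", "ALB", "ALB"),
  dest = c("BGR", "BGR", "BGR"),
  code = c(490700, 490700, 490700),
  year = c(2002, 2002, 2002),
  uv   = c(1215.5613, 1215.5613, 1215.5613),
  mean = c(11886.494, 11886.494, 58069.405)
)

# Collapse to one row per (iso3, dest, code, year, uv); the first mean is kept
distinct(data.df, iso3, dest, code, year, uv, .keep_all = TRUE)
```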


















r






edited Mar 26 at 12:56 by Ronak Shah (69k rep; 10 gold, 48 silver, 80 bronze badges)

asked Mar 26 at 12:24 by TeYaP (205 rep; 4 silver, 16 bronze badges)






4 Answers


















3














We could write a function and then pass the columns which we want to consider.

get_duplicated_rows <- function(df, cols) {
  df[duplicated(df[cols]) | duplicated(df[cols], fromLast = TRUE), ]
}

get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv", "mean"))
# iso3 dest code year uv mean
#2 ALB BGR 490700 2002 1215.6 11886
#3 ALB BGR 490700 2002 1215.6 11886

get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv"))
# iso3 dest code year uv mean
#2 ALB BGR 490700 2002 1215.6 11886
#3 ALB BGR 490700 2002 1215.6 11886
#4 ALB BGR 490700 2002 1215.6 58069
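For context (an editorial note, not part of the original answer): duplicated() on its own flags only the second and later occurrences of a value, so the answer combines a forward pass with a fromLast = TRUE pass to flag every copy. A minimal base-R sketch:

```r
x <- c(1, 2, 2, 3)

# duplicated() marks only the later occurrences of a repeated value
duplicated(x)
# FALSE FALSE  TRUE FALSE

# OR-ing with a reversed pass flags all copies of a repeated value
duplicated(x) | duplicated(x, fromLast = TRUE)
# FALSE  TRUE  TRUE FALSE
```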





answered Mar 26 at 12:29 by Ronak Shah























  • And if I then want to remove the duplicates? Do you know an easy way to do so once they are identified (in order to keep just one of them)?

    – TeYaP
    Mar 26 at 12:34







  • 1

    @TeYaP If you want to keep only one of the duplicates, remove the | duplicated(df[cols], fromLast = TRUE) part from the function and it will keep only one row.

    – Ronak Shah
    Mar 26 at 12:38
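To spell out the comment above as code (an illustrative sketch, not part of the original answer): a single-direction duplicated() call, negated, keeps exactly the first row of each duplicate group:

```r
# Keep only the first row of each duplicate group (judged by the chosen columns)
remove_duplicated_rows <- function(df, cols) {
  df[!duplicated(df[cols]), ]
}

# Toy example: rows 1 and 2 are duplicates, row 3 is unique
df <- data.frame(a = c(1, 1, 2), b = c("x", "x", "y"))
remove_duplicated_rows(df, c("a", "b"))
#   a b
# 1 1 x
# 3 2 y
```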


















1














You can get at the quasi-duplicates by checking each column one by one and then keeping the rows whose row sum of duplicate flags is greater than your target value.

toread <- " iso3 dest code year uv mean
ALB AUT 490700 2002 14027.2433 427387.640
ALB BGR 490700 2002 1215.5613 11886.494
ALB BGR 490700 2002 1215.5613 11886.494
ALB BGR 490700 2002 1215.5613 58069.405
ALB BGR 843050 2002 677.9827 4272.176
ALB BGR 851030 2002 31004.0946 32364.379
ALB HRV 392329 2002 1410.0072 6970.930"

df <- read.table(textConnection(toread), header = TRUE)
closeAllConnections()

get_quasi_duplicated_rows <- function(df, cols, cut) {
  result <- matrix(nrow = nrow(df), ncol = length(cols))
  colnames(result) <- cols
  for (col in cols) {
    dup <- duplicated(df[col]) | duplicated(df[col], fromLast = TRUE)
    result[, col] <- dup
  }
  return(df[which(rowSums(result) > cut), ])
}

get_quasi_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv", "mean"), 4)

  iso3 dest   code year       uv     mean
2  ALB  BGR 490700 2002 1215.561 11886.49
3  ALB  BGR 490700 2002 1215.561 11886.49
4  ALB  BGR 490700 2002 1215.561 58069.40





answered Mar 26 at 13:27 by Graeme Prentice-Mott






























    1














Using the dplyr and rlang packages (get_dupes() itself comes from the janitor package) we can achieve this.

Solution:

find_dupes <- function(df, cols) {
  df <- df %>% get_dupes(!!!rlang::syms(cols))
  return(df)
}

Output:

1st case:

> cols
[1] "iso3" "dest" "code" "year" "uv"

> find_dupes(df, cols)

# A tibble: 3 x 7
  iso3  dest    code  year    uv dupe_count   mean
  <fct> <fct>  <int> <int> <dbl>      <int>  <dbl>
1 ALB   BGR   490700  2002 1216.          3 11886.
2 ALB   BGR   490700  2002 1216.          3 11886.
3 ALB   BGR   490700  2002 1216.          3 58069.

2nd case:

> cols
[1] "iso3" "dest" "code" "year" "uv"   "mean"

> find_dupes(df, cols)

# A tibble: 2 x 7
  iso3  dest    code  year    uv   mean dupe_count
  <fct> <fct>  <int> <int> <dbl>  <dbl>      <int>
1 ALB   BGR   490700  2002 1216. 11886.          2
2 ALB   BGR   490700  2002 1216. 11886.          2

Note:

The rlang::syms() function takes strings as input and turns them into symbols. Contrary to as.name(), it converts the strings to the native encoding beforehand. This is necessary because symbols silently remove the encoding mark of strings.

To pass a list of column names into a dplyr function, we use syms().

!!! is used to unquote and splice the symbols into the call.






































      1














We can use group_by_all and filter the groups having a frequency count greater than 1:

library(dplyr)
df1 %>%
  group_by_all() %>%
  filter(n() > 1)
# A tibble: 2 x 6
# Groups: iso3, dest, code, year, uv, mean [1]
#  iso3  dest    code  year    uv   mean
#  <chr> <chr>  <int> <int> <dbl>  <dbl>
#1 ALB   BGR   490700  2002 1216. 11886.
#2 ALB   BGR   490700  2002 1216. 11886.

If it is a subset of columns, use group_by_at:

df1 %>%
  group_by_at(vars(iso3, dest, code, year, uv)) %>%
  filter(n() > 1)
# A tibble: 3 x 6
# Groups: iso3, dest, code, year, uv [1]
#  iso3  dest    code  year    uv   mean
#  <chr> <chr>  <int> <int> <dbl>  <dbl>
#1 ALB   BGR   490700  2002 1216. 11886.
#2 ALB   BGR   490700  2002 1216. 11886.
#3 ALB   BGR   490700  2002 1216. 58069.
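As an aside (my base-R sketch, not part of the original answer), the same grouped filter can be done without dplyr: build one key from the chosen columns and count each row's group size with ave():

```r
# Toy data: rows 1-3 share (iso3, dest, code, year, uv); row 4 does not
df1 <- data.frame(
  iso3 = c("ALB", "ALB", "ALB", "ALB"),
  dest = c("BGR", "BGR", "BGR", "HRV"),
  code = c(490700, 490700, 490700, 392329),
  year = c(2002, 2002, 2002, 2002),
  uv   = c(1215.5613, 1215.5613, 1215.5613, 1410.0072),
  mean = c(11886.494, 11886.494, 58069.405, 6970.930)
)

# One grouping key per row, then the size of each row's group
key <- interaction(df1$iso3, df1$dest, df1$code, df1$year, df1$uv, drop = TRUE)
grp_size <- ave(seq_along(key), key, FUN = length)

df1[grp_size > 1, ]   # the quasi-duplicated rows (groups of size > 1)
```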






























        4 Answers
        4






        active

        oldest

        votes








        4 Answers
        4






        active

        oldest

        votes









        active

        oldest

        votes






        active

        oldest

        votes









        3














        We could write a function and then pass columns which we want to consider.



        get_duplicated_rows <- function(df, cols) 
        df[duplicated(df[cols])

        get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv","mean"))

        # iso3 dest code year uv mean
        #2 ALB BGR 490700 2002 1215.6 11886
        #3 ALB BGR 490700 2002 1215.6 11886

        get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv"))
        # iso3 dest code year uv mean
        #2 ALB BGR 490700 2002 1215.6 11886
        #3 ALB BGR 490700 2002 1215.6 11886
        #4 ALB BGR 490700 2002 1215.6 58069





        share|improve this answer























        • And if then I want to remove the duplicates? do you know an easy way to do so once they are identified? (in order to keep just one of them...)

          – TeYaP
          Mar 26 at 12:34







        • 1





          @TeYaP If you want to keep only one of the duplicates, remove the | duplicated(df[cols], fromLast = TRUE part from the function and it will keep only one row.

          – Ronak Shah
          Mar 26 at 12:38















        3














        We could write a function and then pass columns which we want to consider.



        get_duplicated_rows <- function(df, cols) 
        df[duplicated(df[cols])

        get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv","mean"))

        # iso3 dest code year uv mean
        #2 ALB BGR 490700 2002 1215.6 11886
        #3 ALB BGR 490700 2002 1215.6 11886

        get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv"))
        # iso3 dest code year uv mean
        #2 ALB BGR 490700 2002 1215.6 11886
        #3 ALB BGR 490700 2002 1215.6 11886
        #4 ALB BGR 490700 2002 1215.6 58069





        share|improve this answer























        • And if then I want to remove the duplicates? do you know an easy way to do so once they are identified? (in order to keep just one of them...)

          – TeYaP
          Mar 26 at 12:34







        • 1





          @TeYaP If you want to keep only one of the duplicates, remove the | duplicated(df[cols], fromLast = TRUE part from the function and it will keep only one row.

          – Ronak Shah
          Mar 26 at 12:38













        3












        3








        3







        We could write a function and then pass columns which we want to consider.



        get_duplicated_rows <- function(df, cols) 
        df[duplicated(df[cols])

        get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv","mean"))

        # iso3 dest code year uv mean
        #2 ALB BGR 490700 2002 1215.6 11886
        #3 ALB BGR 490700 2002 1215.6 11886

        get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv"))
        # iso3 dest code year uv mean
        #2 ALB BGR 490700 2002 1215.6 11886
        #3 ALB BGR 490700 2002 1215.6 11886
        #4 ALB BGR 490700 2002 1215.6 58069





        share|improve this answer













        We could write a function and then pass columns which we want to consider.



        get_duplicated_rows <- function(df, cols) 
        df[duplicated(df[cols])

        get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv","mean"))

        # iso3 dest code year uv mean
        #2 ALB BGR 490700 2002 1215.6 11886
        #3 ALB BGR 490700 2002 1215.6 11886

        get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv"))
        # iso3 dest code year uv mean
        #2 ALB BGR 490700 2002 1215.6 11886
        #3 ALB BGR 490700 2002 1215.6 11886
        #4 ALB BGR 490700 2002 1215.6 58069






        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Mar 26 at 12:29









        Ronak ShahRonak Shah

        69k10 gold badges48 silver badges80 bronze badges




        69k10 gold badges48 silver badges80 bronze badges












        • And if then I want to remove the duplicates? do you know an easy way to do so once they are identified? (in order to keep just one of them...)

          – TeYaP
          Mar 26 at 12:34







        • 1





          @TeYaP If you want to keep only one of the duplicates, remove the | duplicated(df[cols], fromLast = TRUE part from the function and it will keep only one row.

          – Ronak Shah
          Mar 26 at 12:38

















        • And if then I want to remove the duplicates? do you know an easy way to do so once they are identified? (in order to keep just one of them...)

          – TeYaP
          Mar 26 at 12:34







        • 1





          @TeYaP If you want to keep only one of the duplicates, remove the | duplicated(df[cols], fromLast = TRUE part from the function and it will keep only one row.

          – Ronak Shah
          Mar 26 at 12:38
















        And if then I want to remove the duplicates? do you know an easy way to do so once they are identified? (in order to keep just one of them...)

        – TeYaP
        Mar 26 at 12:34






        And if then I want to remove the duplicates? do you know an easy way to do so once they are identified? (in order to keep just one of them...)

        – TeYaP
        Mar 26 at 12:34





        1




        1





        @TeYaP If you want to keep only one of the duplicates, remove the | duplicated(df[cols], fromLast = TRUE part from the function and it will keep only one row.

        – Ronak Shah
        Mar 26 at 12:38





        @TeYaP If you want to keep only one of the duplicates, remove the | duplicated(df[cols], fromLast = TRUE part from the function and it will keep only one row.

        – Ronak Shah
        Mar 26 at 12:38













        1














        You can get at the quasi duplications if you look at each feature one by one and then, you consider rows with a Rowsum greater than your target value.



        toread <- " iso3 dest code year uv mean
        ALB AUT 490700 2002 14027.2433 427387.640
        ALB BGR 490700 2002 1215.5613 11886.494
        ALB BGR 490700 2002 1215.5613 11886.494
        ALB BGR 490700 2002 1215.5613 58069.405
        ALB BGR 843050 2002 677.9827 4272.176
        ALB BGR 851030 2002 31004.0946 32364.379
        ALB HRV 392329 2002 1410.0072 6970.930"

        df <- read.table(textConnection(toread), header = TRUE)
        closeAllConnections()

        get_quasi_duplicated_rows <- function(df, cols, cut)
        result <- matrix(nrow = nrow(df), ncol = length(cols))
        colnames(result) <- cols
        for(col in cols)
        dup <- duplicated(df[col])
        return(df[which(rowSums(result) > cut), ])


        get_quasi_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv","mean"), 4)


        iso3 dest code year uv mean
        2 ALB BGR 490700 2002 1215.561 11886.49
        3 ALB BGR 490700 2002 1215.561 11886.49
        4 ALB BGR 490700 2002 1215.561 58069.40





        share|improve this answer



























          1














          You can get at the quasi duplications if you look at each feature one by one and then, you consider rows with a Rowsum greater than your target value.



          toread <- " iso3 dest code year uv mean
          ALB AUT 490700 2002 14027.2433 427387.640
          ALB BGR 490700 2002 1215.5613 11886.494
          ALB BGR 490700 2002 1215.5613 11886.494
          ALB BGR 490700 2002 1215.5613 58069.405
          ALB BGR 843050 2002 677.9827 4272.176
          ALB BGR 851030 2002 31004.0946 32364.379
          ALB HRV 392329 2002 1410.0072 6970.930"

          df <- read.table(textConnection(toread), header = TRUE)
          closeAllConnections()

          get_quasi_duplicated_rows <- function(df, cols, cut)
          result <- matrix(nrow = nrow(df), ncol = length(cols))
          colnames(result) <- cols
          for(col in cols)
          dup <- duplicated(df[col])
          return(df[which(rowSums(result) > cut), ])


          get_quasi_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv","mean"), 4)


          iso3 dest code year uv mean
          2 ALB BGR 490700 2002 1215.561 11886.49
          3 ALB BGR 490700 2002 1215.561 11886.49
          4 ALB BGR 490700 2002 1215.561 58069.40





          share|improve this answer

























            1












            1








            1







            You can get at the quasi duplications if you look at each feature one by one and then, you consider rows with a Rowsum greater than your target value.



            toread <- " iso3 dest code year uv mean
            ALB AUT 490700 2002 14027.2433 427387.640
            ALB BGR 490700 2002 1215.5613 11886.494
            ALB BGR 490700 2002 1215.5613 11886.494
            ALB BGR 490700 2002 1215.5613 58069.405
            ALB BGR 843050 2002 677.9827 4272.176
            ALB BGR 851030 2002 31004.0946 32364.379
            ALB HRV 392329 2002 1410.0072 6970.930"

            df <- read.table(textConnection(toread), header = TRUE)
            closeAllConnections()

            get_quasi_duplicated_rows <- function(df, cols, cut)
            result <- matrix(nrow = nrow(df), ncol = length(cols))
            colnames(result) <- cols
            for(col in cols)
            dup <- duplicated(df[col])
            return(df[which(rowSums(result) > cut), ])


            get_quasi_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv","mean"), 4)


            iso3 dest code year uv mean
            2 ALB BGR 490700 2002 1215.561 11886.49
            3 ALB BGR 490700 2002 1215.561 11886.49
            4 ALB BGR 490700 2002 1215.561 58069.40





            share|improve this answer













            You can get at the quasi duplications if you look at each feature one by one and then, you consider rows with a Rowsum greater than your target value.



            toread <- " iso3 dest code year uv mean
            ALB AUT 490700 2002 14027.2433 427387.640
            ALB BGR 490700 2002 1215.5613 11886.494
            ALB BGR 490700 2002 1215.5613 11886.494
            ALB BGR 490700 2002 1215.5613 58069.405
            ALB BGR 843050 2002 677.9827 4272.176
            ALB BGR 851030 2002 31004.0946 32364.379
            ALB HRV 392329 2002 1410.0072 6970.930"

            df <- read.table(textConnection(toread), header = TRUE)
            closeAllConnections()

            get_quasi_duplicated_rows <- function(df, cols, cut)
            result <- matrix(nrow = nrow(df), ncol = length(cols))
            colnames(result) <- cols
            for(col in cols)
            dup <- duplicated(df[col])
            return(df[which(rowSums(result) > cut), ])


            get_quasi_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv","mean"), 4)


            iso3 dest code year uv mean
            2 ALB BGR 490700 2002 1215.561 11886.49
            3 ALB BGR 490700 2002 1215.561 11886.49
            4 ALB BGR 490700 2002 1215.561 58069.40






            share|improve this answer












            share|improve this answer



            share|improve this answer










            answered Mar 26 at 13:27









            Graeme Prentice-MottGraeme Prentice-Mott

            688 bronze badges




            688 bronze badges





















                1














                Using dplyr and rlang package we can achive this-



                Solution-



                find_dupes <- function(df,cols)
                df <- df %>% get_dupes(!!!rlang::syms(cols))
                return(df)



                Output-



                1st Case-



                > cols
                [1] "iso3" "dest" "code" "year" "uv"

                > find_dupes(df, cols)

                # A tibble: 3 x 7
                iso3 dest code year uv dupe_count mean
                <fct> <fct> <int> <int> <dbl> <int> <dbl>
                1 ALB BGR 490700 2002 1216. 3 11886.
                2 ALB BGR 490700 2002 1216. 3 11886.
                3 ALB BGR 490700 2002 1216. 3 58069.


                2nd Case-



                > cols
                [1] "iso3" "dest" "code" "year" "uv" "mean"

                > find_dupes(df,cols)

                # A tibble: 2 x 7
                iso3 dest code year uv mean dupe_count
                <fct> <fct> <int> <int> <dbl> <dbl> <int>
                1 ALB BGR 490700 2002 1216. 11886. 2
                2 ALB BGR 490700 2002 1216. 11886. 2


                Note-



                rlan::syms function take strings as input and turn them into symbols. Contrarily to as.name(), they convert the strings to the native encoding beforehand. This is necessary because symbols remove silently the encoding mark of strings.



                To pass a list of vector names in dplyr function, we use syms.



                !!! is used to unquote






                share|improve this answer





























                  1














                  Using dplyr and rlang package we can achive this-



                  Solution-



                  find_dupes <- function(df,cols)
                  df <- df %>% get_dupes(!!!rlang::syms(cols))
                  return(df)



                  Output-



                  1st Case-



                  > cols
                  [1] "iso3" "dest" "code" "year" "uv"

                  > find_dupes(df, cols)

                  # A tibble: 3 x 7
                  iso3 dest code year uv dupe_count mean
                  <fct> <fct> <int> <int> <dbl> <int> <dbl>
                  1 ALB BGR 490700 2002 1216. 3 11886.
                  2 ALB BGR 490700 2002 1216. 3 11886.
                  3 ALB BGR 490700 2002 1216. 3 58069.


                  2nd Case-



                  > cols
                  [1] "iso3" "dest" "code" "year" "uv" "mean"

                  > find_dupes(df,cols)

                  # A tibble: 2 x 7
                  iso3 dest code year uv mean dupe_count
                  <fct> <fct> <int> <int> <dbl> <dbl> <int>
                  1 ALB BGR 490700 2002 1216. 11886. 2
                  2 ALB BGR 490700 2002 1216. 11886. 2


                  Note-



                  rlan::syms function take strings as input and turn them into symbols. Contrarily to as.name(), they convert the strings to the native encoding beforehand. This is necessary because symbols remove silently the encoding mark of strings.



                  To pass a list of vector names in dplyr function, we use syms.



                  !!! is used to unquote






                  share|improve this answer



























                    1












                    1








                    1







                    Using dplyr and rlang package we can achive this-



                    Solution-



                    find_dupes <- function(df,cols)
                    df <- df %>% get_dupes(!!!rlang::syms(cols))
                    return(df)



                    Output-



                    1st Case-



                    > cols
                    [1] "iso3" "dest" "code" "year" "uv"

                    > find_dupes(df, cols)

                    # A tibble: 3 x 7
                    iso3 dest code year uv dupe_count mean
                    <fct> <fct> <int> <int> <dbl> <int> <dbl>
                    1 ALB BGR 490700 2002 1216. 3 11886.
                    2 ALB BGR 490700 2002 1216. 3 11886.
                    3 ALB BGR 490700 2002 1216. 3 58069.


                    2nd Case-



                    > cols
                    [1] "iso3" "dest" "code" "year" "uv" "mean"

                    > find_dupes(df,cols)

                    # A tibble: 2 x 7
                    iso3 dest code year uv mean dupe_count
                    <fct> <fct> <int> <int> <dbl> <dbl> <int>
                    1 ALB BGR 490700 2002 1216. 11886. 2
                    2 ALB BGR 490700 2002 1216. 11886. 2


                    Note-



                    rlan::syms function take strings as input and turn them into symbols. Contrarily to as.name(), they convert the strings to the native encoding beforehand. This is necessary because symbols remove silently the encoding mark of strings.



To pass a vector of column names into a dplyr function, we use syms().



!!! is used to unquote and splice the list of symbols into the function call.
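As a minimal sketch of the same tidy-eval pattern outside get_dupes() (the tiny data frame below is invented for illustration), syms() plus !!! can drive group_by() directly:

```r
library(dplyr)
library(rlang)

df <- tibble::tibble(
  iso3 = c("ALB", "ALB", "FRA"),
  year = c(2002L, 2002L, 2003L),
  uv   = c(1216, 1216, 500)
)

cols <- c("iso3", "year")

# !!! splices the list of symbols into group_by(),
# as if we had written group_by(iso3, year)
df %>%
  group_by(!!!syms(cols)) %>%
  filter(n() > 1)
```

This returns the two ALB/2002 rows, since that key occurs more than once.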






                    edited Mar 26 at 13:44

























                    answered Mar 26 at 13:36









Rushabh

1,599 · 4 silver badges · 22 bronze badges







































We can use group_by_all() and filter() to keep the groups having a frequency count greater than 1:



library(dplyr)
df1 %>%
  group_by_all() %>%
  filter(n() > 1)
# A tibble: 2 x 6
# Groups: iso3, dest, code, year, uv, mean [1]
#   iso3  dest    code  year    uv   mean
#   <chr> <chr>  <int> <int> <dbl>  <dbl>
# 1 ALB   BGR   490700  2002 1216. 11886.
# 2 ALB   BGR   490700  2002 1216. 11886.


If it is a subset of columns, use group_by_at():



df1 %>%
  group_by_at(vars(iso3, dest, code, year, uv)) %>%
  filter(n() > 1)
# A tibble: 3 x 6
# Groups: iso3, dest, code, year, uv [1]
#   iso3  dest    code  year    uv   mean
#   <chr> <chr>  <int> <int> <dbl>  <dbl>
# 1 ALB   BGR   490700  2002 1216. 11886.
# 2 ALB   BGR   490700  2002 1216. 11886.
# 3 ALB   BGR   490700  2002 1216. 58069.
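Note that group_by_all() and group_by_at() have since been superseded in dplyr 1.0 by across(). A sketch of the same subset-of-columns filter in the current idiom, plus a base-R equivalent that keeps every copy of a duplicated key (the sample data below is reconstructed from the output shown above):

```r
library(dplyr)

# Sample data mirroring the answer's output (values reconstructed for illustration)
df1 <- data.frame(
  iso3 = "ALB", dest = "BGR", code = 490700L, year = 2002L,
  uv   = c(1216, 1216, 1216), mean = c(11886, 11886, 58069)
)

cols <- c("iso3", "dest", "code", "year", "uv")

# Modern dplyr (>= 1.0): across(all_of()) replaces group_by_at(vars(...))
df1 %>%
  group_by(across(all_of(cols))) %>%
  filter(n() > 1) %>%
  ungroup()

# Base R: duplicated() alone marks only the later copies; combining it with
# fromLast = TRUE flags every row whose key occurs more than once
key <- df1[cols]
df1[duplicated(key) | duplicated(key, fromLast = TRUE), ]
```

Both pipelines return all three ALB/BGR/490700/2002 rows, matching the group_by_at() result.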





                            answered Mar 26 at 14:04









akrun

452k · 15 gold badges · 252 silver badges · 338 bronze badges






























