Remove (quasi) identical rows
In the following data.df we see that lines 2 and 3 are identical, and line 4 differs only in the mean column.
  iso3 dest   code year         uv       mean
1  ALB  AUT 490700 2002 14027.2433 427387.640
2  ALB  BGR 490700 2002  1215.5613  11886.494
3  ALB  BGR 490700 2002  1215.5613  11886.494
4  ALB  BGR 490700 2002  1215.5613  58069.405
5  ALB  BGR 843050 2002   677.9827   4272.176
6  ALB  BGR 851030 2002 31004.0946  32364.379
7  ALB  HRV 392329 2002  1410.0072   6970.930
Is there an easy way to automatically find these matching rows?
I found a question which seems to answer this, but I do not understand how duplicated() works...
What I would like is a "simple" command where I could specify which column values must be identical for rows to match.
Something like function(data.df, c(iso3, dest, code, year, uv, mean))
to find the exactly identical rows, and function(data.df, c(iso3, dest, code, year, uv))
to find the "quasi" identical rows...
The expected result would be, in the first case:
2  ALB  BGR 490700 2002 1215.5613 11886.494
3  ALB  BGR 490700 2002 1215.5613 11886.494
and in the second:
2  ALB  BGR 490700 2002 1215.5613 11886.494
3  ALB  BGR 490700 2002 1215.5613 11886.494
4  ALB  BGR 490700 2002 1215.5613 58069.405
Any idea?
r
2
You should try dplyr::distinct(data.df, iso3, dest, code, year, uv, .keep_all = TRUE)
– kath
Mar 26 at 12:29
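To make the comment's suggestion concrete, here is a minimal sketch that rebuilds the sample data from the question and applies dplyr::distinct (the data-frame construction is my own reconstruction of the table above):

```r
library(dplyr)

# Reconstruct the example data from the question
data.df <- data.frame(
  iso3 = rep("ALB", 7),
  dest = c("AUT", "BGR", "BGR", "BGR", "BGR", "BGR", "HRV"),
  code = c(490700L, 490700L, 490700L, 490700L, 843050L, 851030L, 392329L),
  year = 2002L,
  uv   = c(14027.2433, 1215.5613, 1215.5613, 1215.5613, 677.9827, 31004.0946, 1410.0072),
  mean = c(427387.640, 11886.494, 11886.494, 58069.405, 4272.176, 32364.379, 6970.930)
)

# Keep one row per unique combination of the listed columns;
# .keep_all = TRUE retains the remaining columns (here, mean)
deduped <- dplyr::distinct(data.df, iso3, dest, code, year, uv, .keep_all = TRUE)
deduped
```

Because rows 2, 3 and 4 share the same (iso3, dest, code, year, uv) combination, only the first of them (row 2) survives, leaving 5 rows.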
edited Mar 26 at 12:56 – Ronak Shah (69k · 10 gold · 48 silver · 80 bronze badges)
asked Mar 26 at 12:24 – TeYaP (205 · 4 silver · 16 bronze badges)
4 Answers
We could write a function and pass it the columns we want to consider.
get_duplicated_rows <- function(df, cols) {
  df[duplicated(df[cols]) | duplicated(df[cols], fromLast = TRUE), ]
}
get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv", "mean"))
#  iso3 dest   code year     uv  mean
#2  ALB  BGR 490700 2002 1215.6 11886
#3  ALB  BGR 490700 2002 1215.6 11886
get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv"))
#  iso3 dest   code year     uv  mean
#2  ALB  BGR 490700 2002 1215.6 11886
#3  ALB  BGR 490700 2002 1215.6 11886
#4  ALB  BGR 490700 2002 1215.6 58069
answered Mar 26 at 12:29 – Ronak Shah
And if I then want to remove the duplicates, do you know an easy way to do so once they are identified (in order to keep just one of them)?
– TeYaP
Mar 26 at 12:34
1
@TeYaP If you want to keep only one of the duplicates, remove the | duplicated(df[cols], fromLast = TRUE) part from the function and it will keep only one row.
– Ronak Shah
Mar 26 at 12:38
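Following up on this exchange, a base-R sketch of "keep just one of each": !duplicated() is TRUE only for the first occurrence of each key combination, so subsetting with it keeps exactly one representative row (the data frame here is rebuilt from the question's table):

```r
# Sample data as in the question
df <- data.frame(
  iso3 = rep("ALB", 7),
  dest = c("AUT", "BGR", "BGR", "BGR", "BGR", "BGR", "HRV"),
  code = c(490700L, 490700L, 490700L, 490700L, 843050L, 851030L, 392329L),
  year = 2002L,
  uv   = c(14027.2433, 1215.5613, 1215.5613, 1215.5613, 677.9827, 31004.0946, 1410.0072),
  mean = c(427387.640, 11886.494, 11886.494, 58069.405, 4272.176, 32364.379, 6970.930)
)

cols <- c("iso3", "dest", "code", "year", "uv")

# !duplicated() marks the first row of each key combination,
# so this keeps one row per group and drops the other duplicates
kept <- df[!duplicated(df[cols]), ]
nrow(kept)
```

Rows 3 and 4 are dropped as (quasi) duplicates of row 2, leaving 5 rows.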
You can get at the quasi-duplications if you test each feature one by one and then keep the rows whose row sum exceeds your target value.
toread <- " iso3 dest code year uv mean
ALB AUT 490700 2002 14027.2433 427387.640
ALB BGR 490700 2002 1215.5613 11886.494
ALB BGR 490700 2002 1215.5613 11886.494
ALB BGR 490700 2002 1215.5613 58069.405
ALB BGR 843050 2002 677.9827 4272.176
ALB BGR 851030 2002 31004.0946 32364.379
ALB HRV 392329 2002 1410.0072 6970.930"
df <- read.table(textConnection(toread), header = TRUE)
closeAllConnections()
get_quasi_duplicated_rows <- function(df, cols, cut) {
  result <- matrix(FALSE, nrow = nrow(df), ncol = length(cols))
  colnames(result) <- cols
  for (col in cols) {
    dup <- duplicated(df[col]) | duplicated(df[col], fromLast = TRUE)
    result[, col] <- dup
  }
  df[which(rowSums(result) > cut), ]
}
get_quasi_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv", "mean"), 4)
#  iso3 dest   code year       uv     mean
#2  ALB  BGR 490700 2002 1215.561 11886.49
#3  ALB  BGR 490700 2002 1215.561 11886.49
#4  ALB  BGR 490700 2002 1215.561 58069.40
answered Mar 26 at 13:27 – Graeme Prentice-Mott
Using the janitor (for get_dupes), dplyr and rlang packages we can achieve this.
Solution:
find_dupes <- function(df, cols) {
  df <- df %>% get_dupes(!!!rlang::syms(cols))
  return(df)
}
Output:
1st case:
> cols
[1] "iso3" "dest" "code" "year" "uv"
> find_dupes(df, cols)
# A tibble: 3 x 7
  iso3  dest    code  year    uv dupe_count   mean
  <fct> <fct>  <int> <int> <dbl>      <int>  <dbl>
1 ALB   BGR   490700  2002 1216.          3 11886.
2 ALB   BGR   490700  2002 1216.          3 11886.
3 ALB   BGR   490700  2002 1216.          3 58069.
2nd case:
> cols
[1] "iso3" "dest" "code" "year" "uv" "mean"
> find_dupes(df, cols)
# A tibble: 2 x 7
  iso3  dest    code  year    uv   mean dupe_count
  <fct> <fct>  <int> <int> <dbl>  <dbl>      <int>
1 ALB   BGR   490700  2002 1216. 11886.          2
2 ALB   BGR   490700  2002 1216. 11886.          2
Note:
The rlang::syms function takes strings as input and turns them into symbols. Contrary to as.name(), it converts the strings to the native encoding beforehand. This is necessary because symbols silently remove the encoding mark of strings.
To pass a vector of column names into a dplyr-style function, we use syms, and !!! is used to unquote-splice the resulting list.
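To make the syms/!!! mechanics concrete, a small sketch using plain dplyr::group_by instead of get_dupes (a toy three-row data frame of my own, echoing the question's columns):

```r
library(dplyr)
library(rlang)

df <- data.frame(
  iso3 = rep("ALB", 3),
  dest = c("AUT", "BGR", "BGR"),
  uv   = c(14027.24, 1215.56, 1215.56)
)

cols <- c("iso3", "dest", "uv")

# syms() turns the character vector into a list of symbols;
# !!! splices that list into group_by() as separate arguments,
# as if we had written group_by(iso3, dest, uv)
counted <- df %>%
  group_by(!!!syms(cols)) %>%
  mutate(dupe_count = n()) %>%
  ungroup()

counted$dupe_count
```

Rows 2 and 3 form a group of size 2, so dupe_count comes out as 1, 2, 2; filtering on dupe_count > 1 would recover the duplicated rows, which is essentially what get_dupes does.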
We can use group_by_all and filter groups having a frequency count greater than 1:
library(dplyr)
df1 %>%
  group_by_all() %>%
  filter(n() > 1)
# A tibble: 2 x 6
# Groups: iso3, dest, code, year, uv, mean [1]
#  iso3  dest    code  year    uv   mean
#  <chr> <chr>  <int> <int> <dbl>  <dbl>
#1 ALB   BGR   490700  2002 1216. 11886.
#2 ALB   BGR   490700  2002 1216. 11886.
If it is a subset of columns, use group_by_at:
df1 %>%
  group_by_at(vars(iso3, dest, code, year, uv)) %>%
  filter(n() > 1)
# A tibble: 3 x 6
# Groups: iso3, dest, code, year, uv [1]
#  iso3  dest    code  year    uv   mean
#  <chr> <chr>  <int> <int> <dbl>  <dbl>
#1 ALB   BGR   490700  2002 1216. 11886.
#2 ALB   BGR   490700  2002 1216. 11886.
#3 ALB   BGR   490700  2002 1216. 58069.
answered Mar 26 at 14:04 – akrun
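As a side note, the scoped verbs group_by_all and group_by_at are superseded in dplyr 1.0+; the same filters can be written with across(). A sketch under that assumption, on a reduced version of the question's data:

```r
library(dplyr)

df1 <- data.frame(
  iso3 = rep("ALB", 4),
  dest = c("AUT", "BGR", "BGR", "BGR"),
  code = c(490700L, 490700L, 490700L, 490700L),
  uv   = c(14027.2433, 1215.5613, 1215.5613, 1215.5613),
  mean = c(427387.640, 11886.494, 11886.494, 58069.405)
)

# All columns: the across(everything()) form of group_by_all()
full_dupes <- df1 %>%
  group_by(across(everything())) %>%
  filter(n() > 1) %>%
  ungroup()

# A subset of columns: the across() form of group_by_at(vars(...))
quasi_dupes <- df1 %>%
  group_by(across(c(iso3, dest, code, uv))) %>%
  filter(n() > 1) %>%
  ungroup()
```

On this sample, full_dupes returns the two exactly identical rows and quasi_dupes the three rows that agree on everything except mean.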