Is there an R function for returning sorted indexes of any values of a vector?
I'm not fluent in R data.table, and any help with the following problem would be greatly appreciated!
I have a big data.table (~1,000,000 rows) with columns of numeric values, and I want to output a data.table of the same dimensions holding, for each row, the sorted index position of each value within that row.
A short example:
Input:
dt = data.frame(ack = 1:7)
dt$A1 = c( 1, 6, 9, 10, 3, 5, NA)
dt$A2 = c( 25, 12, 30, 10, 50, 1, 30)
dt$A3 = c( 100, 63, 91, 110, 1, 4, 10)
dt$A4 = c( 51, 65, 2, 1, 0, 200, 1)
First row: 1 (1) <= 25 (2) <= 51 (3) <= 100 (4),
so the row-wise sorted index positions for (1, 25, 100, 51) are (1, 2, 4, 3), and the output should be:
dt$PosA1 = c(1, ...
dt$PosA2 = c(2, ...
dt$PosA3 = c(4, ...
dt$PosA4 = c(3, ...
Third row: 2 (1) <= 9 (2) <= 30 (3) <= 91 (4), so the output must be:
dt$PosA1 = c( 1,1,2,...)
dt$PosA2 = c( 2,2,3,...)
dt$PosA3 = c( 4,3,4,...)
dt$PosA4 = c( 3,4,1,...)
The output has the same dimensions as the input data.table and is filled with the row-wise sorted index positions:
dt$PosA1 = c( 1, 1, 2, 2, 3, 3, NA)
dt$PosA2 = c( 2, 2, 3, 3, 4, 1, 3)
dt$PosA3 = c( 4, 3, 4, 4, 2, 2, 2)
dt$PosA4 = c( 3, 4, 1, 1, 1, 4, 1)
I was thinking of something like this:
library(data.table)
setDT(dt)
# pseudocode
dt[, PosA1 := rowPosition(.SD, 1, na.rm=T),
PosA2 := rowPosition(.SD, 2, na.rm=T),
PosA3 := rowPosition(.SD, 3, na.rm=T),
PosA4 := rowPosition(.SD, 4, na.rm=T),
.SDcols=c(A1, A2, A3, A4)]
I'm not sure of the syntax, and I'm missing a rowPosition function. Does a function exist that does this? (I named it rowPosition here.)
A little help coding an efficient one, or another approach to the problem, would be great!
Regards.
r function sorting data.table row
edited Mar 27 at 21:30 by Frank
asked Mar 27 at 20:38 by Pascal
2 Answers
Since you are looking for speed, you might want to consider using Rcpp. An Rcpp rank function that takes care of NA and ties can be found in nrussell's adapted version of René Richter's code.
library(data.table)
nr <- 811e3
nc <- 16
DT <- as.data.table(matrix(sample(c(1:200, NA), nr*nc, replace=TRUE), nrow=nr))[,
    ack := .I]
#assuming that you have saved nrussell code in avg_rank.cpp
library(Rcpp)
system.time(sourceCpp("rcpp/avg_rank.cpp"))
# user system elapsed
# 0.00 0.13 6.21
nruss_rcpp <- function() {
    DT[, as.list(avg_rank(unlist(.SD))), by=ack]
}
data.table.frank <- function() {
    melt(DT, id="ack")[, f := frank(value, na.last="keep", ties.method="dense"), by=ack][,
        dcast(.SD, ack ~ variable, value.var="f")]
}
library(microbenchmark)
microbenchmark(nruss_rcpp(), data.table.frank(), times=3L)
timings:
Unit: seconds
               expr       min        lq     mean   median        uq       max neval cld
       nruss_rcpp()  10.33032  10.33251  10.3697  10.3347  10.38939  10.44408     3  a
 data.table.frank() 610.44869 612.82685 613.9362 615.2050 615.68001 616.15501     3   b
edit: addressing comments
1) set column names for rank columns using updating by reference
DT[, (paste0("Rank", 1L:nc)) := as.list(avg_rank(unlist(.SD))), by=ack]
2) keeping NAs as they are
option A) change to NA in R after getting output from avg_rank:
for (j in 1:nc)
    DT[is.na(get(paste0("V", j))), (paste0("Rank", j)) := NA_real_]
option B) amend the avg_rank code in Rcpp as follows:
Rcpp::NumericVector avg_rank(Rcpp::NumericVector x)
{
    R_xlen_t sz = x.size();
    Rcpp::IntegerVector w = Rcpp::seq(0, sz - 1);
    std::sort(w.begin(), w.end(), Comparator(x));
    Rcpp::NumericVector r = Rcpp::no_init_vector(sz);
    for (R_xlen_t n, i = 0; i < sz; i += n) {
        n = 1;
        // n counts the run of tied values starting at sorted position i
        while (i + n < sz && x[w[i]] == x[w[i + n]]) ++n;
        for (R_xlen_t k = 0; k < n; k++) {
            if (Rcpp::traits::is_na<REALSXP>(x[w[i + k]])) // additional code
                r[w[i + k]] = NA_REAL;                      // additional code
            else
                r[w[i + k]] = i + (n + 1) / 2.;
        }
    }
    return r;
}
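To make the effect of the two additional lines concrete, here is a small base R sketch. It assumes that Comparator sorts NA to the end (its definition is not shown above, so this is an assumption); under that assumption the amended avg_rank should agree with base R's rank() called with na.last = "keep" and ties.method = "average". The helper name ref_rank is hypothetical, and the two vectors are rows 4 and 7 of the question's example.

```r
# Base R reference for what the amended avg_rank is expected to return.
# Assumption: this mirrors the Rcpp function; it is not the Rcpp code itself.
ref_rank <- function(x) rank(x, na.last = "keep", ties.method = "average")

row4 <- c(10, 10, 110, 1)   # tied values share the average of their positions
row7 <- c(NA, 30, 10, 1)    # NA stays NA instead of receiving the last rank

r4 <- ref_rank(row4)
r7 <- ref_rank(row7)
print(r4)
print(r7)
```

Note that average ranks are not the same as the dense ranks produced by frank(..., ties.method="dense"): the tied 10s get 2.5 here and 110 gets 4, whereas dense ranking gives 2, 2 and 3.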
Hello @chinsoon12, this should be great, but I don't know how to make avg_rank available in my RStudio environment (library(Rcpp) is not sufficient), and I don't know how to do the step "#assuming that you have saved nrussell code in avg_rank.cpp".
– Pascal
Mar 29 at 16:10
Sorry for my low-level knowledge of R :(
– Pascal
Mar 29 at 16:12
I read TFM, got Rtools installed, sourced avg_rank.cpp, launched again and... great!!! 20 s instead of 8 min! If I may push my luck: I would like NA values to stay NA, and to keep the column names instead of V1...VN. Thanks a lot!
– Pascal
Mar 29 at 17:45
You got so far in a few hours. These last 2 questions are nothing to you.
– chinsoon12
Mar 30 at 0:13
:)) Thanks for encouraging me to read harder. I resolved the "columns" question (dt[, (cols) = ....]), but inspecting and modifying nrussell's code is too hard for me at the moment. As a workaround, I can compare the result table with the original and keep the result value where the original is not NA, else NA. But the smart way, in one call, would be to give avg_rank() a parameter like na.last = "keep" to take this exception into account.
– Pascal
Mar 30 at 9:33
You can convert to long form and use rank. Or, since you're using data.table, frank:
library(data.table)
setDT(dt)
melt(dt, id="ack")[, f := frank(value, na.last="keep", ties.method="dense"), by=ack][,
dcast(.SD, ack ~ variable, value.var="f")]
   ack A1 A2 A3 A4
1:   1  1  2  4  3
2:   2  1  2  3  4
3:   3  2  3  4  1
4:   4  2  2  3  1
5:   5  3  4  2  1
6:   6  3  1  2  4
7:   7 NA  3  2  1
melt switches to long form, while dcast converts back to wide form.
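As a cross-check of the table above, the same dense row-wise ranks can be sketched in base R without reshaping. This is a different technique from the melt/frank/dcast code, not a rewrite of it: the helper name dense_rank is hypothetical, and the approach has not been benchmarked here, so no claim is made about how it performs on 811,000 rows.

```r
# Dense rank of one row: position of each value among the sorted distinct values.
# sort() drops NA, so match() finds no match for an NA input and returns NA.
dense_rank <- function(x) match(x, sort(unique(x)))

# The example data from the question (kept as a plain data.frame here).
df <- data.frame(
  ack = 1:7,
  A1  = c(1, 6, 9, 10, 3, 5, NA),
  A2  = c(25, 12, 30, 10, 50, 1, 30),
  A3  = c(100, 63, 91, 110, 1, 4, 10),
  A4  = c(51, 65, 2, 1, 0, 200, 1)
)

cols <- c("A1", "A2", "A3", "A4")
# apply() over rows returns one column per row, hence the transpose.
pos <- t(apply(as.matrix(df[cols]), 1, dense_rank))
colnames(pos) <- paste0("Pos", cols)
print(cbind(ack = df$ack, pos))
```

Row 4 contains a tie (10, 10), which is where dense ranking differs from a "first come" ordering: both tied values get rank 2 and the next value gets rank 3.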
Thanks @Frank, but I encounter an error: Error in melt.data.table(dt, id = "ack") : One or more values in 'id.vars' is invalid.
– Pascal
Mar 27 at 22:19
@Pascal You will need to create a row-ID column, like dt[, ack := .I] or dt$ack <- seq_len(nrow(dt)). I'm using the code from your post after I edited it so that it is copy-pastable. You can look above to see what I mean. Of course, you don't need to name it ack :)
– Frank
Mar 27 at 22:39
It works fine and does the job!
– Pascal
Mar 27 at 23:11
But when I time it with Sys.time() on my data.table (811,000 x 16), it takes about 8 min on a 4-core i5 vPro 8th gen with 16 GB RAM. Is there a way to reduce this duration, or should I consider it a good time?
– Pascal
Mar 27 at 23:18
Thanks a lot for this solution! I will drink a lot of coffee while waiting for something better :)!
– Pascal
Mar 28 at 0:04
edited Apr 3 at 0:54, community wiki, 4 revs, chinsoon12
answered Mar 27 at 21:35
Frank