I’m looking to find an efficient method of matching all values of vector x
in vector y
rather than just the first position, as is returned by match()
. What I’m after essentially is the default behavior of pmatch()
but without partial matching:
x <- c(3L, 1L, 2L, 3L, 3L, 2L)
y <- c(3L, 3L, 3L, 3L, 1L, 3L)
Expected output:
pmatch(x, y)
[1] 1 5 NA 2 3 NA
One way is to use ave()
however this becomes slow and very memory inefficient as the number of groups increases:
ave(x, x, FUN = (v) which(y == v[1])[1:length(v)])
[1] 1 5 NA 2 3 NA
Can anyone recommend an efficient way to achieve this in preferably (but not mandatory) base R?
Larger dataset for benchmarking:
set.seed(5)
x <- sample(5e3, 1e5, replace = TRUE)
y <- sample(x, replace = TRUE)
4 Answers
4
Highest score (default)
Trending (recent votes count more)
Date modified (newest first)
Date created (oldest first)
Just to point out, you can use match + make.unique
to accomplish the same. Speedwise, it might be slower than the data.table approach:
match(make.unique(as.character(x)), make.unique(as.character(y)))
[1] 1 5 NA 2 3 NA
match(make.names(x, TRUE), make.names(y, TRUE))
[1] 1 5 NA 2 3 NA
Using a data.table
join, inspired by this Q&A.
library(data.table)
matchall <- function(x, y) {
data.table(y, rowid(y))[
data.table(x, rowid(x)), on = .(y = x, V2), which = TRUE
]
}
Check behavior
x <- c(3L, 1L, 2L, 3L, 3L, 2L)
y <- c(3L, 3L, 3L, 3L, 1L, 3L)
matchall(x, y)
#> [1] 1 5 NA 2 3 NA
Timing on larger vectors:
set.seed(5)
x <- sample(5e3, 1e5, replace = TRUE)
y <- sample(x, replace = TRUE)
system.time(z1 <- matchall(x, y))
#> user system elapsed
#> 0.06 0.00 0.01
system.time(z2 <- ave(x, x, FUN = (v) which(y == v[1])[1:length(v)]))
#> user system elapsed
#> 0.88 0.43 1.31
identical(z1, z2)
#> [1] TRUE
A variant in base using split
.
x <- c(3L, 1L, 2L, 3L, 3L, 2L)
y <- c(3L, 3L, 3L, 3L, 1L, 3L)
a <- split(seq_along(x), x)
b <- split(seq_along(y), y)[names(a)]
b[lengths(b)==0] <- NA
. <- do.call(rbind, Map((a, b) cbind(a, b[seq_along(a)]), a, b))
`[<-`(.[,2], .[,1], .[,2])
#[1] 1 5 NA 2 3 NA
An RCPP version might look like:
Rcpp::sourceCpp(code=r"(
#include <Rcpp.h>
#include <unordered_map>
#include <queue>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector pm(const std::vector<int>& a, const std::vector<int>& b) {
IntegerVector idx(no_init(a.size()));
std::unordered_map<int, std::queue<int> > lut;
for(int i = 0; i < b.size(); ++i) lut[b[i]].push(i+1);
for(int i = 0; i < idx.size(); ++i) {
auto search = lut.find(a[i]);
if(search != lut.end() && search->second.size() > 0) {
idx[i] = search->second.front();
search->second.pop();
} else {idx[i] = NA_INTEGER;}
}
return idx;
}
)")
pm(x, y)
#[1] 1 5 NA 2 3 NA
Benchmark
set.seed(5)
x <- sample(5e3, 1e5, replace = TRUE)
y <- sample(x, replace = TRUE)
library(data.table)
matchall <- function(x, y) {
data.table(y, rowid(y))[
data.table(x, rowid(x)), on = .(y = x, V2), which = TRUE
]
}
bench::mark(
ave = ave(x, x, FUN = (v) which(y == v[1])[1:length(v)]),
make.name = match(make.names(x, TRUE), make.names(y, TRUE)),
make.unique = match(make.unique(as.character(x)), make.unique(as.character(y))),
split = {a <- split(seq_along(x), x)
b <- split(seq_along(y), y)[names(a)]
b[lengths(b)==0] <- NA
. <- do.call(rbind, Map((a, b) cbind(a, b[seq_along(a)]), a, b))
`[<-`(.[,2], .[,1], .[,2])},
data.table = matchall(x, y),
RCPP = pm(x, y) )
Result
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
1 ave 1.56s 1.56s 0.641 3.73GB 60.9 1 95
2 make.name 145.74ms 150.78ms 6.59 14.06MB 3.29 4 2
3 make.unique 74.39ms 78.68ms 12.5 9.49MB 3.57 7 2
4 split 33.32ms 43.27ms 19.8 9.9MB 25.2 11 14
5 data.table 6.32ms 7.22ms 125. 5.13MB 21.9 63 11
6 RCPP 3.1ms 3.32ms 290. 393.16KB 1.99 146 1
In this case the C++ version is the fastest and allocates the lowest amount of memory. In case using base the split variant is the fastest and allocates a little bit more memory than make.unique.
If you have some extra memory to spare, you can speed up the process by sorting the values and basically doing a two-pointer walk through to match up the data. Here’s what what would look like
rmatch <- function(x, y) {
xp <- cbind(seq_along(x), x)[order(x),]
yp <- cbind(seq_along(y), y)[order(y),]
result <- numeric(length(x))
xi <- yi <- 1
Nx <- length(x)
Ny <- length(y)
while (xi < Nx) {
if (yi > Ny) {
result[xi] <- NA
xi <- xi + 1
} else if (xp[xi,2] == yp[yi,2]) {
result[xp[xi,1]] = yp[yi,1]
xi <- xi + 1
yi <- yi + 1
} else if (xp[xi,2] < yp[yi,2]) {
result[xp[xi,1]] <- NA
xi <- xi + 1
} else if (xp[xi,2] > yp[yi,2]) {
yi <- yi + 1
}
}
result
}
I tested with some of the other base R options posted here
mbm <- microbenchmark::microbenchmark(
ave = ave(x, x, FUN = (v) which(y == v[1])[1:length(v)]),
rmatch = rmatch(x, y),
pmatch = pmatch(x, y),
times = 20
)
And saw that it seemed to perform well
Unit: milliseconds
expr min lq mean median uq max neval
ave 1227.6743 1247.6980 1283.1024 1264.1485 1324.1569 1349.3276 20
rmatch 198.1744 201.1058 208.3158 204.5933 209.4863 247.7279 20
pmatch 39514.4227 39595.9720 39717.5887 39628.0892 39805.2405 40105.4337 20
These all return the same vector of values.
2
-
1
Might be competitive with
data.table
with a C++ implementation?– s_baldur20 hours ago
-
2
Sure. But my goal was to just use base R, no dependencies (including system tools for compiling C++). Already the data.table approach is much faster because most of the work happens in the C++ back end.
– MrFlick
20 hours ago
Your Answer
Post as a guest
Required, but never shown
Post as a guest
Required, but never shown
By clicking “Post Your Answer”, you agree to our terms of service and acknowledge that you have read and understand our privacy policy and code of conduct.
Not the answer you're looking for? Browse other questions tagged
or ask your own question.
or ask your own question.
|