Efficiently match all values of a vector in another vector

Question

I’m looking to find an efficient method of matching all values of vector x in vector y rather than just the first position, as is returned by match(). What I’m after essentially is the default behavior of pmatch() but without partial matching:

x <- c(3L, 1L, 2L, 3L, 3L, 2L)
y <- c(3L, 3L, 3L, 3L, 1L, 3L)

Expected output:

pmatch(x, y)  
[1]  1  5 NA  2  3 NA

One way is to use ave() however this becomes slow and very memory inefficient as the number of groups increases:

ave(x, x, FUN = (v) which(y == v[1])[1:length(v)])
[1]  1  5 NA  2  3 NA

Can anyone recommend an efficient way to achieve this in preferably (but not mandatory) base R?

Larger dataset for benchmarking:

set.seed(5)
x <- sample(5e3, 1e5, replace = TRUE)
y <- sample(x, replace = TRUE)

score 6 · Accepted Answer · 2023-06-01 16:16:26Z

Just to point out, you can use match + make.unique to accomplish the same. Speedwise, it might be slower than the data.table approach:

match(make.unique(as.character(x)), make.unique(as.character(y)))

[1]  1  5 NA  2  3 NA

match(make.names(x, TRUE), make.names(y, TRUE))
[1]  1  5 NA  2  3 NA

score 6 · Accepted Answer · 2023-06-01 17:30:30Z

Using a data.table join, inspired by this Q&A.

library(data.table)

matchall <- function(x, y) {
  data.table(y, rowid(y))[
    data.table(x, rowid(x)), on = .(y = x, V2), which = TRUE
  ]
}

Check behavior

x <- c(3L, 1L, 2L, 3L, 3L, 2L)
y <- c(3L, 3L, 3L, 3L, 1L, 3L)

matchall(x, y)
#> [1]  1  5 NA  2  3 NA

Timing on larger vectors:

set.seed(5)
x <- sample(5e3, 1e5, replace = TRUE)
y <- sample(x, replace = TRUE)

system.time(z1 <- matchall(x, y))
#>    user  system elapsed 
#>    0.06    0.00    0.01

system.time(z2 <- ave(x, x, FUN = (v) which(y == v[1])[1:length(v)]))
#>    user  system elapsed 
#>    0.88    0.43    1.31

identical(z1, z2)
#> [1] TRUE

score 5 · Accepted Answer · 2023-06-01 21:47:29Z

A variant in base using split.

x <- c(3L, 1L, 2L, 3L, 3L, 2L)
y <- c(3L, 3L, 3L, 3L, 1L, 3L)

a <- split(seq_along(x), x)
b <- split(seq_along(y), y)[names(a)]
b[lengths(b)==0] <- NA
. <- do.call(rbind, Map((a, b) cbind(a, b[seq_along(a)]), a, b))
`[<-`(.[,2], .[,1], .[,2])
#[1]  1  5 NA  2  3 NA

An RCPP version might look like:

Rcpp::sourceCpp(code=r"(
#include <Rcpp.h>
#include <unordered_map>
#include <queue>

using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector pm(const std::vector<int>& a, const std::vector<int>& b) {
  IntegerVector idx(no_init(a.size()));
  std::unordered_map<int, std::queue<int> > lut;
  for(int i = 0; i < b.size(); ++i) lut[b[i]].push(i+1);
  for(int i = 0; i < idx.size(); ++i) {
    auto search = lut.find(a[i]);
    if(search != lut.end() && search->second.size() > 0) {
      idx[i] = search->second.front();
      search->second.pop();
    } else {idx[i] = NA_INTEGER;}
  }
  return idx;
}
)")
pm(x, y)
#[1]  1  5 NA  2  3 NA

Benchmark

set.seed(5)
x <- sample(5e3, 1e5, replace = TRUE)
y <- sample(x, replace = TRUE)

library(data.table)

matchall <- function(x, y) {
  data.table(y, rowid(y))[
    data.table(x, rowid(x)), on = .(y = x, V2), which = TRUE
  ]
}

bench::mark(
ave = ave(x, x, FUN = (v) which(y == v[1])[1:length(v)]),
make.name = match(make.names(x, TRUE), make.names(y, TRUE)),
make.unique = match(make.unique(as.character(x)), make.unique(as.character(y))),
split = {a <- split(seq_along(x), x)
b <- split(seq_along(y), y)[names(a)]
b[lengths(b)==0] <- NA
. <- do.call(rbind, Map((a, b) cbind(a, b[seq_along(a)]), a, b))
`[<-`(.[,2], .[,1], .[,2])},
data.table = matchall(x, y),
RCPP = pm(x, y) )

Result

  expression       min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 ave            1.56s    1.56s     0.641    3.73GB    60.9      1    95
2 make.name   145.74ms 150.78ms     6.59    14.06MB     3.29     4     2
3 make.unique  74.39ms  78.68ms    12.5      9.49MB     3.57     7     2
4 split        33.32ms  43.27ms    19.8       9.9MB    25.2     11    14
5 data.table    6.32ms   7.22ms   125.       5.13MB    21.9     63    11
6 RCPP           3.1ms   3.32ms   290.     393.16KB     1.99   146     1

In this case the C++ version is the fastest and allocates the lowest amount of memory. In case using base the split variant is the fastest and allocates a little bit more memory than make.unique.

score 1 · Accepted Answer · 2023-06-01 15:15:12Z

If you have some extra memory to spare, you can speed up the process by sorting the values and basically doing a two-pointer walk through to match up the data. Here’s what what would look like

rmatch <- function(x, y) {
  xp <- cbind(seq_along(x), x)[order(x),]
  yp <- cbind(seq_along(y), y)[order(y),]
  result <- numeric(length(x))
  
  xi <- yi <- 1
  Nx <- length(x)
  Ny <- length(y)
  while (xi < Nx) {
    if (yi > Ny) {
      result[xi] <- NA
      xi <- xi + 1
    } else if (xp[xi,2] == yp[yi,2]) {
      result[xp[xi,1]] = yp[yi,1]
      xi <- xi + 1
      yi <- yi + 1
    } else if (xp[xi,2] < yp[yi,2]) {
      result[xp[xi,1]] <- NA
      xi <- xi + 1
    } else if (xp[xi,2] > yp[yi,2]) {
      yi <- yi + 1
    }
  }
  result  
}

I tested with some of the other base R options posted here

mbm <- microbenchmark::microbenchmark(
  ave = ave(x, x, FUN = (v) which(y == v[1])[1:length(v)]),
  rmatch = rmatch(x, y),
  pmatch = pmatch(x, y),
  times = 20
)

And saw that it seemed to perform well

Unit: milliseconds
   expr        min         lq       mean     median         uq        max neval
    ave  1227.6743  1247.6980  1283.1024  1264.1485  1324.1569  1349.3276    20
 rmatch   198.1744   201.1058   208.3158   204.5933   209.4863   247.7279    20
 pmatch 39514.4227 39595.9720 39717.5887 39628.0892 39805.2405 40105.4337    20

These all return the same vector of values.

Might be competitive with data.table with a C++ implementation? — 20 hours ago
Sure. But my goal was to just use base R, no dependencies (including system tools for compiling C++). Already the data.table approach is much faster because most of the work happens in the C++ back end. — 20 hours ago

Efficiently match all values of a vector in another vector