Efficiently match all values of a vector in another vector

Efficiently match all values of a vector in another vector

11

I’m looking to find an efficient method of matching all values of vector x in vector y rather than just the first position, as is returned by match(). What I’m after essentially is the default behavior of pmatch() but without partial matching:

x <- c(3L, 1L, 2L, 3L, 3L, 2L)
y <- c(3L, 3L, 3L, 3L, 1L, 3L)

Expected output:

pmatch(x, y)  
[1]  1  5 NA  2  3 NA

One way is to use ave() however this becomes slow and very memory inefficient as the number of groups increases:

ave(x, x, FUN = (v) which(y == v[1])[1:length(v)])
[1]  1  5 NA  2  3 NA

Can anyone recommend an efficient way to achieve this in preferably (but not mandatory) base R?

Larger dataset for benchmarking:

set.seed(5)
x <- sample(5e3, 1e5, replace = TRUE)
y <- sample(x, replace = TRUE)

4 Answers
4

Highest score (default)

Trending (recent votes count more)

Date modified (newest first)

Date created (oldest first)

6

Just to point out, you can use match + make.unique to accomplish the same. Speedwise, it might be slower than the data.table approach:

match(make.unique(as.character(x)), make.unique(as.character(y)))

[1]  1  5 NA  2  3 NA

match(make.names(x, TRUE), make.names(y, TRUE))
[1]  1  5 NA  2  3 NA

6

Using a data.table join, inspired by this Q&A.

library(data.table)

matchall <- function(x, y) {
  data.table(y, rowid(y))[
    data.table(x, rowid(x)), on = .(y = x, V2), which = TRUE
  ]
}

Check behavior

x <- c(3L, 1L, 2L, 3L, 3L, 2L)
y <- c(3L, 3L, 3L, 3L, 1L, 3L)

matchall(x, y)
#> [1]  1  5 NA  2  3 NA

Timing on larger vectors:

set.seed(5)
x <- sample(5e3, 1e5, replace = TRUE)
y <- sample(x, replace = TRUE)

system.time(z1 <- matchall(x, y))
#>    user  system elapsed 
#>    0.06    0.00    0.01

system.time(z2 <- ave(x, x, FUN = (v) which(y == v[1])[1:length(v)]))
#>    user  system elapsed 
#>    0.88    0.43    1.31

identical(z1, z2)
#> [1] TRUE

5

A variant in base using split.

x <- c(3L, 1L, 2L, 3L, 3L, 2L)
y <- c(3L, 3L, 3L, 3L, 1L, 3L)

a <- split(seq_along(x), x)
b <- split(seq_along(y), y)[names(a)]
b[lengths(b)==0] <- NA
. <- do.call(rbind, Map((a, b) cbind(a, b[seq_along(a)]), a, b))
`[<-`(.[,2], .[,1], .[,2])
#[1]  1  5 NA  2  3 NA

An RCPP version might look like:

Rcpp::sourceCpp(code=r"(
#include <Rcpp.h>
#include <unordered_map>
#include <queue>

using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector pm(const std::vector<int>& a, const std::vector<int>& b) {
  IntegerVector idx(no_init(a.size()));
  std::unordered_map<int, std::queue<int> > lut;
  for(int i = 0; i < b.size(); ++i) lut[b[i]].push(i+1);
  for(int i = 0; i < idx.size(); ++i) {
    auto search = lut.find(a[i]);
    if(search != lut.end() && search->second.size() > 0) {
      idx[i] = search->second.front();
      search->second.pop();
    } else {idx[i] = NA_INTEGER;}
  }
  return idx;
}
)")
pm(x, y)
#[1]  1  5 NA  2  3 NA

Benchmark

set.seed(5)
x <- sample(5e3, 1e5, replace = TRUE)
y <- sample(x, replace = TRUE)

library(data.table)

matchall <- function(x, y) {
  data.table(y, rowid(y))[
    data.table(x, rowid(x)), on = .(y = x, V2), which = TRUE
  ]
}

bench::mark(
ave = ave(x, x, FUN = (v) which(y == v[1])[1:length(v)]),
make.name = match(make.names(x, TRUE), make.names(y, TRUE)),
make.unique = match(make.unique(as.character(x)), make.unique(as.character(y))),
split = {a <- split(seq_along(x), x)
b <- split(seq_along(y), y)[names(a)]
b[lengths(b)==0] <- NA
. <- do.call(rbind, Map((a, b) cbind(a, b[seq_along(a)]), a, b))
`[<-`(.[,2], .[,1], .[,2])},
data.table = matchall(x, y),
RCPP = pm(x, y) )

Result

  expression       min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 ave            1.56s    1.56s     0.641    3.73GB    60.9      1    95
2 make.name   145.74ms 150.78ms     6.59    14.06MB     3.29     4     2
3 make.unique  74.39ms  78.68ms    12.5      9.49MB     3.57     7     2
4 split        33.32ms  43.27ms    19.8       9.9MB    25.2     11    14
5 data.table    6.32ms   7.22ms   125.       5.13MB    21.9     63    11
6 RCPP           3.1ms   3.32ms   290.     393.16KB     1.99   146     1

In this case the C++ version is the fastest and allocates the lowest amount of memory. In case using base the split variant is the fastest and allocates a little bit more memory than make.unique.

1

If you have some extra memory to spare, you can speed up the process by sorting the values and basically doing a two-pointer walk through to match up the data. Here’s what what would look like

rmatch <- function(x, y) {
  xp <- cbind(seq_along(x), x)[order(x),]
  yp <- cbind(seq_along(y), y)[order(y),]
  result <- numeric(length(x))
  
  xi <- yi <- 1
  Nx <- length(x)
  Ny <- length(y)
  while (xi < Nx) {
    if (yi > Ny) {
      result[xi] <- NA
      xi <- xi + 1
    } else if (xp[xi,2] == yp[yi,2]) {
      result[xp[xi,1]] = yp[yi,1]
      xi <- xi + 1
      yi <- yi + 1
    } else if (xp[xi,2] < yp[yi,2]) {
      result[xp[xi,1]] <- NA
      xi <- xi + 1
    } else if (xp[xi,2] > yp[yi,2]) {
      yi <- yi + 1
    }
  }
  result  
}

I tested with some of the other base R options posted here

mbm <- microbenchmark::microbenchmark(
  ave = ave(x, x, FUN = (v) which(y == v[1])[1:length(v)]),
  rmatch = rmatch(x, y),
  pmatch = pmatch(x, y),
  times = 20
)

And saw that it seemed to perform well

Unit: milliseconds
   expr        min         lq       mean     median         uq        max neval
    ave  1227.6743  1247.6980  1283.1024  1264.1485  1324.1569  1349.3276    20
 rmatch   198.1744   201.1058   208.3158   204.5933   209.4863   247.7279    20
 pmatch 39514.4227 39595.9720 39717.5887 39628.0892 39805.2405 40105.4337    20

These all return the same vector of values.

Recognized by R Language Collective

2

  • 1

    Might be competitive with data.table with a C++ implementation?

    – s_baldur

    20 hours ago

  • 2

    Sure. But my goal was to just use base R, no dependencies (including system tools for compiling C++). Already the data.table approach is much faster because most of the work happens in the C++ back end.

    – MrFlick

    20 hours ago

Your Answer

Draft saved
Draft discarded

Post as a guest

Required, but never shown


By clicking “Post Your Answer”, you agree to our terms of service and acknowledge that you have read and understand our privacy policy and code of conduct.

Not the answer you're looking for? Browse other questions tagged

or ask your own question.

Leave a Reply

Your email address will not be published. Required fields are marked *