I want to find the indices of all matches of a vector x in another lookup vector table.
table = rep(1:5, each=3)
x = c(2, 5, 2, 6)
Standard base R methods don't quite give me what I want. For example, using which(table %in% x) we only get each matching index once, even though 2 appears twice in x:
which(table %in% x)
# [1] 4 5 6 13 14 15
On the other hand, match returns a result for every element of x, but only gives the first matching index in the lookup table (and NA where there is no match):
match(x, table)
# [1] 4 13 4 NA
What I want is a function that returns the indices for all elements of x and all of their matches in table, i.e. it should return the following desired result:
mymatch(x, table)
# c(4, 5, 6, 13, 14, 15, 4, 5, 6)
We can, of course, do this with a loop in R:
mymatch = function(x, table) {
  matches = sapply(x, \(xx) which(table %in% xx))
  unlist(matches)
}
mymatch(x, table)
# [1] 4 5 6 13 14 15 4 5 6
But this is horribly slow on larger data (I need to do this operation many times on big data):
table = rep(1:1e5, each=10)
x = sample(1:100, 1000, replace = TRUE)
system.time(mymatch(x, table))
# user system elapsed
# 3.279 2.881 6.157
This is very slow if we compare it to e.g. which(table %in% x):
system.time(which(table %in% x))
# user system elapsed
# 0.003 0.004 0.008
Hoping there is a fast way to do this in R? Otherwise, maybe Rcpp is the way to go.
6 Answers
Another way is to use split:
unlist(split(seq(table), table)[as.character(x)], use.names = FALSE)
[1] 4 5 6 13 14 15 4 5 6
Edit: Note that if table is sorted, then you could use rle + sequence:
faster <- function(x, table){
  a <- rle(table)                    # run lengths and values of the sorted table
  n <- length(a$lengths)
  idx <- match(x, a$values, 0)       # which run each x value falls in (0 if no match)
  sequence(a$lengths[idx], cumsum(c(1, a$lengths[-n]))[idx])   # expand each matched run into its full index range
}
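As a quick sanity check (not part of the original answer), running it on the small example from the top of the question reproduces the desired result:
faster(c(2, 5, 2, 6), rep(1:5, each = 3))
# [1]  4  5  6 13 14 15  4  5  6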
set.seed(42)
table = rep(1:1e5, each=10)
x = sample(1:100, 1000, replace = TRUE)
bench::mark(
  faster(x, table),
  #mymatch(x, table) |> as.vector(),
  join_match(x, table),
  #unlist(split(seq(table), table)[as.character(x)], use.names = FALSE),
  check = TRUE
)
# A tibble: 2 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory
<bch:expr> <bch:t> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list>
1 faster(x,… 54.4ms 252ms 3.97 54.9MB 1.99 2 1 503ms <int> <Rprofmem>
2 join_matc… 127.7ms 254ms 3.93 88.8MB 5.90 2 3 508ms <int> <Rprofmem>
# ℹ 2 more variables: time <list>, gc <list>
The function works as long as table is sorted; it does not necessarily have to be 1:n.
table = c(rep(1:5, each=3), 7,7,7,7,10,10)
x = c(10, 2, 5,7, 2, 6)
microbenchmark::microbenchmark(
  faster(x, table),
  #mymatch(x, table) |> as.vector(),
  join_match(x, table),
  #unlist(split(seq(table), table)[as.character(x)], use.names = FALSE),
  check = 'equal'
)
Unit: microseconds
expr min lq mean median uq max neval
faster(x, table) 23.001 32.751 56.95703 56.400 66.201 222.901 100
join_match(x, table) 4216.201 4925.302 6616.51401 5572.951 7842.200 21153.402 100
- I would like to award this the "most elegant solution", which is also impressively fast, but I will probably give the check mark to the join version, because it is (slightly) faster. Worth noting, though, that the memory allocation for this is one third that of the join_match answer, so if memory rather than speed were my bottleneck, I would use this solution. – dww, 23 hours ago
- For my current use case, yes, it is a sorted vector with n repetitions, but the values are not contiguous sequences. A faster solution that uses these constraints would certainly be great. – dww, 23 hours ago
- @dww check the edit – Onyambu, 22 hours ago
- @dww Note that if x has a lot of repetitions within it, you'd rather match only the unique values and then extract for the specific x values, i.e. instead of doing match(c(2,2,2,2,2), table), just do match(2, table) and then index accordingly. – Onyambu, 22 hours ago
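A minimal sketch of the idea in that last comment (a hypothetical helper, not taken from any of the answers): look up the indices once per unique value of x, then expand the result back to one block per element of x.
match_unique_then_expand <- function(x, table) {
  ux <- unique(x)
  hit <- table %in% ux
  # table indices, grouped by the value they correspond to
  idx_by_value <- split(which(hit), table[hit])
  # expand back to one block per element of x; unmatched values simply drop out
  unlist(idx_by_value[as.character(x)], use.names = FALSE)
}
match_unique_then_expand(c(2, 5, 2, 6), rep(1:5, each = 3))
# [1]  4  5  6 13 14 15  4  5  6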
Maybe data.table would be an option? If you have a relatively large table/vector you might see an improvement in speed, especially if you go with something along the lines of Jon Spring's "join" approach:
library(tidyverse)
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:lubridate':
#>
#> hour, isoweek, mday, minute, month, quarter, second, wday, week,
#> yday, year
#> The following objects are masked from 'package:dplyr':
#>
#> between, first, last
#> The following object is masked from 'package:purrr':
#>
#> transpose
library(microbenchmark)
onyambu_faster <- function(x, table){
  a <- rle(table)
  n <- length(a$lengths)
  idx <- match(x, a$values, 0)
  sequence(a$lengths[idx], cumsum(c(1, a$lengths[-n]))[idx])
}
jon_spring_join_match = function(x, table) {
  t <- data.frame(table, index = 1:length(table))
  xt <- data.frame(x, index = 1:length(x))
  t |>
    left_join(xt, join_by(table == x), relationship = 'many-to-many') |>
    arrange(index.y) %>%
    filter(!is.na(index.y)) %>%
    pull(index.x)
}
jared_mamrot_dt <- function(x, table){
  table_dt <- data.table(table, index = 1:length(table))
  x_dt <- data.table(x, index = 1:length(x))
  return(na.omit(table_dt[x_dt, on = .(table == x)][,index]))
}
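As a quick check (not part of the original reprex), the data.table join also reproduces the question's small example; wrapping in c() just drops the na.action attribute that na.omit leaves behind when some x values have no match:
c(jared_mamrot_dt(c(2, 5, 2, 6), rep(1:5, each = 3)))
# [1]  4  5  6 13 14 15  4  5  6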
table = rep(1:1e5, each=10)
x = sample(1:100, 1000, replace = TRUE)
all.equal(onyambu_faster(x, table), jared_mamrot_dt(x, table))
#> [1] TRUE
all.equal(jon_spring_join_match(x, table), jared_mamrot_dt(x, table))
#> [1] TRUE
res <- microbenchmark(onyambu_faster(x, table),
                      jon_spring_join_match(x, table),
                      jared_mamrot_dt(x, table),
                      times = 10)
res
#> Unit: milliseconds
#> expr min lq mean median
#> onyambu_faster(x, table) 38.196317 45.08884 65.22651 52.40748
#> jon_spring_join_match(x, table) 48.697968 74.54407 105.79551 83.11473
#> jared_mamrot_dt(x, table) 9.441176 11.34315 12.99648 11.76324
#> uq max neval cld
#> 64.88688 129.38505 10 a
#> 131.50681 221.16477 10 b
#> 14.05289 21.84779 10 c
autoplot(res)
Created on 2023-10-26 with reprex v2.0.2
- Now I'm curious if duckdb, arrow, or collapse might be even faster… – Jon Spring, 11 hours ago
Should be faster as a join. This is >100x faster.
library(dplyr)
join_match = function(x, table) {
  t <- data.frame(table, index = 1:length(table))
  xt <- data.frame(x, index = 1:length(x))
  t |>
    left_join(xt, join_by(table == x), relationship = 'many-to-many') |>
    arrange(index.y) %>%
    filter(!is.na(index.y)) %>%
    pull(index.x)
}
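A quick check on the small example from the question (not in the original answer) confirms it returns the desired output:
join_match(c(2, 5, 2, 6), rep(1:5, each = 3))
# [1]  4  5  6 13 14 15  4  5  6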
Same output, 100-200x as fast, and ~3x as fast as the base R suggestion from @Onyambu (note: that approach has since been updated and is now of similar speed, and a data.table solution is even faster; using duckdb, arrow, or collapse to do the join might be faster still, but my observation remains that you can get dramatic speed improvements plus legibility by thinking of this as a join):
set.seed(42)
table = rep(1:1e5, each=10)
x = sample(1:100, 1000, replace = TRUE)
bench::mark(
  mymatch(x, table) |> as.vector(),
  join_match(x, table),
  unlist(split(seq(table), table)[as.character(x)], use.names = FALSE),
  check = TRUE
)
# A tibble: 3 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list>
1 as.vector(mymatch(x, table)) 13.8s 13.8s 0.0727 14.9GB 2.83 1 39 13.8s <int>
2 join_match(x, table) 48.7ms 62.2ms 13.8 88.8MB 3.95 7 2 506.3ms <int>
3 unlist(split(seq(table), table)[as.character(x)], use.names = FALSE) 183.6ms 184.5ms 5.31 29.8MB 0 3 0 564.9ms <int>
- mymatch was producing a matrix. If it's OK to flatten that to a vector, the output matches my function. – Jon Spring, 23 hours ago
Using the data in the question, this is twice as fast based on the median time on my machine.
table = rep(1:5, each=3)
x = c(2, 5, 2, 6)
mymatch = function(x, table) {
  matches = sapply(x, \(xx) which(table %in% xx))
  unlist(matches)
}
outer_match <- function(x, table) {
  z1 <- outer(table, x, "==")   # logical matrix: rows are table positions, columns are elements of x
  z2 <- z1 * row(z1)            # replace each TRUE with its table (row) index
  z2[z2 != 0]                   # drop non-matches, reading column by column (i.e. per element of x)
}
outer_match(x, table)
## [1] 4 5 6 13 14 15 4 5 6
library(microbenchmark)
microbenchmark(
  mymatch(x, table),
  outer_match(x, table)
)
## Unit: microseconds
## expr min lq mean median uq max neval cld
## mymatch(x, table) 77.0 79.15 166.696 82.75 84.3 8384.9 100 a
## outer_match(x, table) 35.1 36.75 115.783 41.95 43.1 7410.1 100 a
If the values being matched are integers, you can use them directly as positions in a list that stores the table indices you want (as long as the maximum integer isn't so big that the list exceeds your RAM capacity).
# Process the table vector every time
anjama_list <- function(x, table) {
  l = vector("list", max(table))    # one slot per possible integer value
  i = 0
  for (val in table) {
    i = i + 1
    l[[val]] = c(l[[val]], i)       # append this table position to the slot for its value
  }
  return(unlist(l[x]))              # look up and flatten the slots for each element of x
}
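A quick check (not from the original answer) on the question's small example:
anjama_list(c(2, 5, 2, 6), rep(1:5, each = 3))
# [1]  4  5  6 13 14 15  4  5  6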
Now, this isn't as fast as the other solutions mentioned here, but since you are reusing the table for multiple lookups, we can build the list once up front and reuse it across iterations:
# If the table vector is being reused, only need to process it once
l = vector("list", max(table))
i = 0
for (val in table) {
  i = i + 1
  l[[val]] = c(l[[val]], i)
}
anjama_list_cache <- function(x, l) {
  return(unlist(l[x]))
}
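Reuse then amounts to just the lookup; for example (assuming l was built from table as above and x is a new query vector):
anjama_list_cache(x, l)
# same result as anjama_list(x, table), but without rebuilding l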
It turns out the list lookup and unlist part is really cheap:
(Benchmark figure created using the code from jared_mamrot's answer.)
So, it depends on how many times reusing the same table vector makes up for the initial setup. In terms of memory usage, I think the lookup and unlist should also be pretty efficient (the lookup being basically nothing and the unlist scaling with the size of your x vector), but I haven't tried profiling them.
I asked a similar question some time back: Most efficient way to determine if element exists in a vector – yesterday
When I run mymatch on the larger sample data, I get a 10×1000 matrix. Is that the desired output structure? – 23 hours ago
@JonSpring – no, it was not intended. Well spotted. The desired output format is a simple vector. – 23 hours ago
Is table always a sorted array of 1:n with m repetitions, or is it a random vector? – 23 hours ago
@Onyambu No, table is not necessarily 1:n. In my current use case, it's OK to assume it is sorted, although more general solutions are also of interest. – 23 hours ago