I want to match similar words between columns

I want to match similar words between columns

7

1.0 2.0 3.0
loud complaint problems
pain stress confused
dull pain stress

this is my data set and I would like to reorganize the rows so that if there is a word that appears in each column it is transferred to a corresponding row. For example

1.0 2.0 3.0
loud NA NA
pain pain NA
dull NA NA
NA complaint NA
NA stress stress
NA NA confused
NA NA problems

etc., so that every word gets its own row and if the word in that row appears in the column it is matched up with the other columns

I have been looking how to do this but cannot find good code. One idea i was given was create a list of all appearing words and then try to match them to each column but still havent found code for this.

Share
Improve this question

New contributor

Michael Allen is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.

4 Answers
4

Reset to default

Highest score (default)

Trending (recent votes count more)

Date modified (newest first)

Date created (oldest first)

8

We may use match in base R – get the unique elements from the unlisted data as a vector, loop over the columns, get the index of the matching elements, and replace the index with the matched elements and convert to a data.frame after taking care of the lengths

v1 <- unique(unlist(df1))
lst1 <- lapply(df1, (x) 
   {i1 <- match(x, v1)
   replace(rep(NA, max(i1)), i1, v1[i1])
 })
list2DF(lapply(lst1, `length<-`, max(lengths(lst1))))
 1.0       2.0      3.0
1 loud      <NA>     <NA>
2 pain      pain     <NA>
3 dull      <NA>     <NA>
4 <NA> complaint     <NA>
5 <NA>    stress   stress
6 <NA>      <NA> problems
7 <NA>      <NA> confused

data

df1 <- structure(list(`1.0` = c("loud", "pain", "dull"), `2.0` = c("complaint", 
"stress", "pain"), `3.0` = c("problems", "confused", "stress"
)), class = "data.frame", row.names = c(NA, -3L))

Share
Improve this answer

1

  • 1

    The second lapply in list2DF isn't necessary; you can just use the nrow argument: list2DF(lst1, nrow = max(lengths(lst1))

    – rps1227

    21 hours ago

4

Here is a tidyverse version.

suppressMessages(library(tidyverse))
x = tibble(`1.0` = c("loud", "pain", "dull"),
           `2.0` = c("complaint", "stress", "pain"),
           `3.0` = c("problems", "confused", "stress"))

x %>% 
  gather("version", "value") %>% 
  mutate(id = value) %>% 
  spread(version, value) %>% 
  select(-id)
#> # A tibble: 7 x 3
#>   `1.0` `2.0`     `3.0`   
#>   <chr> <chr>     <chr>   
#> 1 <NA>  complaint <NA>    
#> 2 <NA>  <NA>      confused
#> 3 dull  <NA>      <NA>    
#> 4 loud  <NA>      <NA>    
#> 5 pain  pain      <NA>    
#> 6 <NA>  <NA>      problems
#> 7 <NA>  stress    stress

Created on 2023-04-11 by the reprex package (v2.0.0)

If you need to arrange the rows in the order of occurrence you can change the second statement to

mutate(id = fct_inorder(value)) %>% 

NOTE: The functions gather and spread are superseded by pivot_longer and pivot_wider functions. In my opinion, the older are a bit easier to use and are good enough for this case. The new functions are much more powerful.

Share
Improve this answer

4

Fast + efficient data.table solution:

x <- data.table(`1.0` = c("loud", "pain", "dull"),
           `2.0` = c("complaint", "stress", "pain"),
           `3.0` = c("problems", "confused", "stress"))

dcast(unique(melt(x, measure.vars = names(x))), value ~ variable)

       value  1.0       2.0      3.0
1: complaint <NA> complaint     <NA>
2:  confused <NA>      <NA> confused
3:      dull dull      <NA>     <NA>
4:      loud loud      <NA>     <NA>
5:      pain pain      pain     <NA>
6:  problems <NA>      <NA> problems
7:    stress <NA>    stress   stress

Share
Improve this answer

2

Here is a base R option using stack + reshape

reshape(
  transform(stack(df), v = values),
  direction = "wide",
  idvar = "values",
  timevar = "ind"
)[-1]

which gives

  v.x1      v.x2     v.x3
1 loud      <NA>     <NA>
2 pain      pain     <NA>
3 dull      <NA>     <NA>
4 <NA> complaint     <NA>
5 <NA>    stress   stress
7 <NA>      <NA> problems
8 <NA>      <NA> confused

Data

> dput(df)
structure(list(x1 = c("loud", "pain", "dull"), x2 = c("complaint", 
"stress", "pain"), x3 = c("problems", "confused", "stress")), class = "data.frame", row.names = c(NA,
-3L))

Share
Improve this answer

Your Answer

Michael Allen is a new contributor. Be nice, and check out our Code of Conduct.

Draft saved
Draft discarded

Post as a guest

Required, but never shown


By clicking “Post Your Answer”, you agree to our terms of service, privacy policy and cookie policy

Not the answer you're looking for? Browse other questions tagged

or ask your own question.

Leave a Reply

Your email address will not be published. Required fields are marked *