I have a string similar to the following
my_string <- "apple,banana,orange,"
and I want to split it on "," to produce this output:
list(c('apple', 'banana', 'orange', ""))
I thought strsplit would accomplish this, but it treats the trailing "," as if it doesn't exist:
my_string <- "apple,banana,orange,"
strsplit(my_string, split = ',')
#> [[1]]
#> [1] "apple" "banana" "orange"
Created on 2023-11-15 by the reprex package (v2.0.1)
What is the simplest approach to achieve the desired output?
Some more test cases with example strings and desired outputs
string1 = "apple,banana,orange,"
output1 = list(c('apple', 'banana', 'orange', ''))
string2 = "apple,banana,orange,pear"
output2 = list(c('apple', 'banana', 'orange', 'pear'))
string3 = ",apple,banana,orange"
output3 = list(c('', 'apple', 'banana', 'orange'))
## Examples of non-comma separated strings
# '|' separator
string4 = "|apple|banana|orange|"
output4 = list(c('', 'apple', 'banana', 'orange', ''))
# 'x' separator
string5 = "xapplexbananaxorangex"
output5 = list(c('', 'apple', 'banana', 'orange', ''))
EDIT:
Ideally, the solution should generalize to any splitting character.
I would also prefer a base-R solution (although please still link any packages that supply this functionality, since their source code might be useful to look through!)
4 Answers
Why strsplit Doesn't Give Desired Output?
When you type ?strsplit, you will read the following statement:
Note that this means that if there is a match at the beginning of a
(non-empty) string, the first element of the output is "", but if
there is a match at the end of the string, the output is the same as
with the match removed.
That is the reason you don't see the trailing "" when you use strsplit.
Below are some demonstrations:
> strsplit("apple,banana,orange,", ",")
[[1]]
[1] "apple" "banana" "orange"
> strsplit(",apple,banana,orange,", ",")
[[1]]
[1] "" "apple" "banana" "orange"
> strsplit(",apple,banana,orange", ",")
[[1]]
[1] "" "apple" "banana" "orange"
> strsplit("apple,banana,orange", ",")
[[1]]
[1] "apple" "banana" "orange"
A Base R Workaround
If you want to treat this as a coding exercise, one base R option is to define a custom recursive function like the one below. Note that sep is used as a regular expression here.
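f <- function(x, sep = ",") {
  # sep is interpreted as a regular expression
  pat <- sprintf("^(.*?)%s.*", sep)
  s1 <- sub(pat, "\\1", x)               # text before the first separator
  s2 <- sub(paste0("^.*?", sep), "", x)  # text after the first separator
  # No separator left: x is the final piece
  if (s2 == x) {
    return(x)
  }
  c(s1, Recall(s2, sep))
}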
such that
> f("apple,banana,orange,")
[1] "apple" "banana" "orange" ""
> f(",apple,banana,orange,")
[1] "" "apple" "banana" "orange" ""
> f(",apple,banana,orange")
[1] "" "apple" "banana" "orange"
> f("apple,banana,orange")
[1] "apple" "banana" "orange"
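Not part of the answer above, but as a non-recursive base R sketch of the same idea (keep everything between separator matches, including empty pieces at either end), gregexpr and regmatches with invert = TRUE can be combined. The helper name split_by_matches is hypothetical, and fixed = TRUE is an assumption that the separator should be matched literally rather than as a regex.

```r
# Hypothetical helper: return the text between literal separator matches,
# keeping empty pieces at the start and end of each string.
split_by_matches <- function(x, sep) {
  # Locate every literal occurrence of sep in each element of x
  m <- gregexpr(sep, x, fixed = TRUE)
  # invert = TRUE extracts the non-matched substrings around the matches
  regmatches(x, m, invert = TRUE)
}

split_by_matches("apple,banana,orange,", ",")
split_by_matches("|apple|banana|orange|", "|")
```

Because the separator is matched literally, characters such as "|" need no escaping in this version.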
- That is likely demonstrated above, but it might be more usefully reinforced through a couple of code examples contrasting strings starting with "" versus ending with "". I'm a big fan of strsplit, but hadn't given this much thought. – Chris, 19 hours ago
- @Chris see my update with the demonstration. – ThomasIsCoding, 19 hours ago
- Much appreciated addition, and answers worth saving. – Chris, 15 hours ago
Use stringr
library(stringr)
str_split(my_string, ",")
[[1]]
[1] "apple" "banana" "orange" ""
- This works (+1), but interestingly enough, it still doesn't work for strsplit, as opposed to stringr::str_split. – thelatemail, 20 hours ago
- Which is expected: ## Note that final empty strings are not produced – rawr, 20 hours ago
- I think this answer can be simplified to just using stringr::str_split(), since it handles leading and trailing empty strings: stringr::str_split(",apple,banana,orange,", pattern = ",") – Selk, 20 hours ago
- This is a great solution and is likely to be useful for future viewers. The only reason it is not marked as the answer is the preference for a base-R solution. – Selk, 20 hours ago
- If simplicity is desired, stringr::str_split_1(my_string, ",") will return a character vector instead of a list: [1] "apple" "banana" "orange" "". – Adriano Mello, 20 hours ago
Pasting another separator onto the end should allow strsplit to function as intended.
Otherwise, you could fall back to using the scan function, which underpins the read.csv/table functions:
strsplit(paste0(string1, ","), ",")
##[[1]]
##[1] "apple" "banana" "orange" ""
Generalising this to separators that need regex escaping (the backslashes are stripped from the separator before it is pasted onto the string):
L <- list(string1, string2, string3, string4, string5)
mapply(
  function(x, s) strsplit(paste0(x, gsub("\\\\", "", s)), split = s),
  L,
  c(",", ",", ",", "\\|", "x")
)
##[[1]]
##[1] "apple" "banana" "orange" ""
##
##[[2]]
##[1] "apple" "banana" "orange" "pear"
##
##[[3]]
##[1] "" "apple" "banana" "orange"
##
##[[4]]
##[1] "" "apple" "banana" "orange" ""
##
##[[5]]
##[1] "" "apple" "banana" "orange" ""
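The escaping step can be avoided when the separator is a literal string rather than a pattern. A minimal sketch, assuming literal separators only: the helper name split_keep_trailing is hypothetical, and fixed = TRUE is the standard strsplit argument that switches off regex interpretation.

```r
# Hypothetical helper; the name is illustrative, not a base R function.
split_keep_trailing <- function(x, sep) {
  # strsplit() discards only the empty string after the final separator.
  # Appending one more separator makes that discarded piece an artificial
  # one, so a genuine trailing empty string survives.
  # fixed = TRUE treats sep literally, so "|" needs no escaping.
  strsplit(paste0(x, sep), split = sep, fixed = TRUE)
}

split_keep_trailing("apple,banana,orange,", ",")
split_keep_trailing("|apple|banana|orange|", "|")
split_keep_trailing("xapplexbananaxorangex", "x")
```

Like strsplit itself, this is vectorised over x and returns a list with one character vector per input string. It makes no attempt to handle NA inputs.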
scan option:
scan(text=string1, sep=",", what="")
##Read 4 items
##[1] "apple" "banana" "orange" ""
Generalising:
mapply(
function(x,s) scan(text=x, sep=s, what=""),
L,
c(",", ",", ",", "|", "x")
)
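Since scan reads one string at a time, applying it to a character vector needs a loop. A minimal sketch of such a wrapper follows; the name scan_split is hypothetical, and quiet = TRUE and quote = "" are optional scan arguments added here as assumptions about what is wanted (no "Read N items" message, no special treatment of quote characters in the data).

```r
# Hypothetical wrapper; the name and argument choices are illustrative.
scan_split <- function(x, sep) {
  lapply(x, function(s) {
    # what = "" reads character fields; quiet = TRUE suppresses the
    # "Read N items" message; quote = "" disables quote interpretation.
    scan(text = s, sep = sep, what = "", quiet = TRUE, quote = "")
  })
}

scan_split(c("apple,banana,orange,", "apple,banana,orange,pear"), ",")
```

One limitation to keep in mind: scan expects sep to be a single one-byte character, so this does not cover multi-character separators.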
- I think scan is the cheapest base R workaround for this question, cheers! – ThomasIsCoding, 19 hours ago
- Marking as the answer, as it meets all criteria (base R implementation, outputs exactly as described in the question). For future reference, the answer by ThomasIsCoding describes an alternative base-R solution that's also really nice. Anyone not requiring a base-R implementation should see GuedesBF's answer for a simple solution using stringr. – Selk, 4 hours ago
I used this:
my_string <- "apple,banana,orange,"
# Append an extra character (here 'X'), then split
result <- strsplit(paste0(my_string, "X"), ",X")
result
Then for the use case
split_string <- function(s) {
  # Add a special character at the beginning and end if the string starts or ends with a comma
  if (startsWith(s, ",")) {
    s <- paste0("SPECIALCHAR", s)
  }
  if (endsWith(s, ",")) {
    s <- paste0(s, "SPECIALCHAR")
  }
  # Split the string by comma
  parts <- strsplit(s, ",", fixed = TRUE)[[1]]
  # Replace the special character with an empty string
  parts <- gsub("SPECIALCHAR", "", parts)
  return(parts)
}
# Test cases
string1 <- "apple,banana,orange,"
string2 <- "apple,banana,orange,pear"
string3 <- ",apple,banana,orange"
output1 <- split_string(string1)
output2 <- split_string(string2)
output3 <- split_string(string3)
output1 # Expected: "apple", "banana", "orange", ""
output2 # Expected: "apple", "banana", "orange", "pear"
output3 # Expected: "", "apple", "banana", "orange"
- This doesn't work – it doesn't add a blank string at the end and it also doesn't split the original string. – thelatemail, 21 hours ago
- Your original idea was on the right path though, I think – just add another separator and then strsplit – strsplit(paste0(my_string, ","), ",") should work. – thelatemail, 20 hours ago
- @thelatemail strsplit(paste0(my_string, ","), ",") is another neat solution, but it's worth noting this won't generalise to regex / escaped values for split. It can solve all of my test cases, but only if for the '|' separator you use fixed=TRUE instead of trying to escape it with '\\|'. – Selk, 20 hours ago
- @Selk – works for me I think – strsplit(paste(string4, "|"), split="\\|") – thelatemail, 20 hours ago
- @thelatemail yep, I meant if you were to generalise this into a function that would work on any separator, i.e. strsplit2(x, sep), you'd have to add some logic to strip out double backslashes from your sep string before pasting it, if that makes sense. I agree that this and your scan solution look like the most promising base-R solutions. If you're interested in putting together an answer that describes both approaches, I think that would make a great official answer! – Selk, 20 hours ago
- I can't find anything, but I'm sure this is a dupe. Check out stringi::stri_split_fixed. – 21 hours ago
- It's a diversion from strsplit, but scan(text=my_string, sep=",", what="") works as intended. – 20 hours ago
- @thelatemail I quite like your scan solution; it's great as a quick base-R workaround. It needs to sit inside an apply function to work sensibly on vectors, but otherwise it is a very neat trick. – 20 hours ago
- scan is a good solution; just wrap it in a function, then use lapply as you wish. – 20 hours ago