Each character own element - regex

I have a character vector
x <- "nkiLVkqspmLVAydnaVNLSCkys"
I want to split it into a vector with 25 elements, so that:
x[1]
# [1] "n"
x[2]
# [1] "k"
# and so on ...
The only thing I can think of is a regex replace that appends "," to every alphabetic character, followed by a split on ",". Is there a simpler way to do this?

Try this:
x<-"nkiLVkqspmLVAydnaVNLSCkys"
y<-data.frame(strsplit(x,""))
then do:
y[1,1]
or
y[2,1]
and so on ...

We can also use
library(stringr)
y= data.frame(v1=str_extract_all(x, '.')[[1]])
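Since the question asks for a plain vector rather than a data frame, a minimal base R sketch (the name chars is only illustrative) is to take the first element of the list that strsplit returns:

```r
x <- "nkiLVkqspmLVAydnaVNLSCkys"
# strsplit() returns a list with one entry per input string;
# [[1]] pulls out the character vector for our single string
chars <- strsplit(x, "")[[1]]
length(chars)
# [1] 25
chars[1:2]
# [1] "n" "k"
```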

Related

Regex of consecutive punctuation in R

I have a character vector that looks like this:
z <- c("./.", "To/TO", "my/PRP$", "starved/VBN", ",/,", "wretched/JJ") # test input
[9992] "./."
[9993] "To/TO"
[9994] "my/PRP$"
[9995] "starved/VBN"
[9996] ",/,"
[9997] "wretched/JJ"
I want to remove all entries that consist of three consecutive punctuation marks, resulting in something like this:
[9993] "To/TO"
[9994] "my/PRP$"
[9995] "starved/VBN"
[9997] "wretched/JJ"
I've tried different regex expressions:
sub("[:punct:]/[:punct:]", "", z)
and
sub("[:punct:]{3}", "", z)
with either single/double brackets, both yield:
[9992] "./."
[9993] "To"
[9994] "my$"
[9995] "starved"
[9996] ",/,"
[9997] "wretched"
Any ideas? And I apologize in advance if the question is dumb; I'm not very good at this!
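Try this. A POSIX class such as [:punct:] is only recognized inside a bracket expression, so it has to be written [[:punct:]]. Also, sub edits each element rather than dropping it, whereas grep with value = TRUE and invert = TRUE returns only the elements that do not match: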
x <- c("./.", "To/TO", "my/PRP$", "starved/VBN", ",/,", "wretched/JJ") # test input
grep("[[:punct:]]{3}", x, value = TRUE, invert = TRUE)
## [1] "To/TO" "my/PRP$" "starved/VBN" "wretched/JJ"
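Note that the unanchored pattern matches three punctuation marks in a row anywhere in an element. If only elements that consist entirely of three punctuation characters should be dropped, one possible sketch (assuming the same test input; is_punct_only and kept are illustrative names) anchors the pattern and uses logical indexing:

```r
x <- c("./.", "To/TO", "my/PRP$", "starved/VBN", ",/,", "wretched/JJ")
# TRUE for elements that are exactly three punctuation characters
is_punct_only <- grepl("^[[:punct:]]{3}$", x)
# keep everything else
kept <- x[!is_punct_only]
kept
# [1] "To/TO" "my/PRP$" "starved/VBN" "wretched/JJ"
```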

String split in R skipping first delimiter if multiple delimiters are present

I have "elephant_giraffe_lion" and "monkey_tiger" strings.
The condition here is if there are two or more delimiters, I want to split at the second delimiter and if there is only one delimiter, I want to split at that delimiter. So the results I want to get in this example are "elephant_giraffe" and "monkey".
mystring<-c("elephant_giraffe_lion", "monkey_tiger")
result
"elephant_giraffe" "monkey"
You can anchor your split to the end of the string using $,
unlist(strsplit(mystring, "_[a-z]+$"))
# [1] "elephant_giraffe" "monkey"
Edit
The above only matches the last "_", not accounting for cases where there are more than two "_". For the more general case, you could try
mystring<-c("elephant_giraffe_lion", "monkey_tiger", "dogs", "foo_bar_baz_bap")
tmp <- gsub("([^_]+_[^_]+).*", "\\1", mystring)
tmp[tmp==mystring] <- sapply(strsplit(tmp[tmp==mystring], "_"), `[[`, 1)
tmp
# [1] "elephant_giraffe" "monkey" "dogs" "foo_bar"
You could also use gsubfn, to process the match with a function
library(gsubfn)
f <- function(x,y) if (y==x) strsplit(y, "_")[[1]][[1]] else y
gsubfn("([^_]+_[^_]+).*", f, mystring, backref=1)
# [1] "elephant_giraffe" "monkey" "dogs" "foo_bar"
As I posted an answer on your other related question, a base R solution:
x <- c('elephant_giraffe_lion', 'monkey_tiger', 'foo_bar_baz_bap')
sub('^(?|([^_]*_[^_]*)_.*|([^_]*)_[^_]*)$', '\\1', x, perl=TRUE)
# [1] "elephant_giraffe" "monkey" "foo_bar"
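For readers who prefer to avoid a dense regex, the same rule can be sketched by splitting on "_" and counting the pieces. This is only one possible approach; first_part is a hypothetical helper name, and strings without any "_" are assumed to be returned unchanged:

```r
mystring <- c("elephant_giraffe_lion", "monkey_tiger", "dogs", "foo_bar_baz_bap")

# hypothetical helper: keep the first two fields when there are at least
# two delimiters, otherwise keep only the first field
first_part <- function(s) {
  parts <- strsplit(s, "_", fixed = TRUE)[[1]]
  n_keep <- max(1, min(2, length(parts) - 1))
  paste(head(parts, n_keep), collapse = "_")
}

result <- vapply(mystring, first_part, character(1), USE.NAMES = FALSE)
result
# [1] "elephant_giraffe" "monkey" "dogs" "foo_bar"
```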

R: Substring after finding a character position?

I have seen a few questions about returning the position of a character within a string in R, but I cannot seem to figure it out for my case. I think this is because I'm trying to do it for a whole column rather than a single string, but it could just be my struggles with regex.
Right now, I have a data.frame with a column, df$id that looks something like 13.23-45-6A. The number of digits before the period is variable, but I would like to retain just the part of the string after the period for each row in the column. I would like to do something like:
df$new <- substring(df$id, 1 + indexOf(".", df$id))
So 12.23-45-6A would become 23-45-6A, 0.1B would become 1B, 4.A-A would become A-A and so on for an entire column.
Right now I have:
df$new <- substr(df$id, 1 + regexpr("\\\.", data.count$id),99)
Thanks for any advice.
As @AnandaMahto mentioned in his comment, you would probably be better off simplifying things and using gsub:
> x <- c("13.23-45-6A", "0.1B", "4.A-A")
> gsub("[0-9]*\\.(.*)", "\\1", x, perl = TRUE)
[1] "23-45-6A" "1B" "A-A"
To make this work with your existing data frame you can try:
df$id <- gsub("[0-9]*\\.(.*)", "\\1", df$id, perl = TRUE)
Another way is to use strsplit, using @Tim's example:
x <- c("13.23-45-6A", "0.1B", "4.A-A")
sapply(strsplit(x, "\\."), "[", -1)
# [1] "23-45-6A" "1B" "A-A"
You could remove everything up to and including the first . using
sub('[^.]*\\.', '', x)
#[1] "23-45-6A" "1B" "A-A"
data
x <- c("13.23-45-6A", "0.1B", "4.A-A")
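The indexOf-style approach from the question can also be made to work directly. A minimal sketch, assuming the same sample vector (pos and after are illustrative names): regexpr with fixed = TRUE treats "." literally, so no escaping is needed, and substring keeps everything from the following position to the end of the string.

```r
x <- c("13.23-45-6A", "0.1B", "4.A-A")
# position of the first literal "." in each string (-1 if there is none)
pos <- as.integer(regexpr(".", x, fixed = TRUE))
# keep everything after that position
after <- substring(x, pos + 1)
after
# [1] "23-45-6A" "1B" "A-A"
```

Because both functions are vectorized, the same two steps should apply to a whole column such as df$id, provided it is a character vector (wrap it in as.character first if it is a factor).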

Remove end of each number in a vector in R

I have a vector like this:
a <- c(11223344,55667788)
I would like to create a new vector by cutting off the last two digits of each entry in a:
[1] 112233 556677
Do I have to use regex to achieve this or is there a simple indexing trick that I'm not aware of?
Or you could use sub
as.numeric(sub('..$','', a))
#[1] 112233 556677
If they're integers:
> trunc(a / 100)
[1] 112233 556677
If they are all non-negative, floor gives the same result:
> floor(a / 100)
[1] 112233 556677
You can use substr (or substring), cutting the number of characters off at two before the end
a <- c(11223344, 55667788)
substr(a, 1, nchar(a)-2)
# [1] "112233" "556677"
wrapping in as.numeric if necessary.
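The difference between the arithmetic options only shows up for negative values. A small sketch with one made-up negative entry (the variable names are illustrative): trunc rounds toward zero, while floor and integer division with %/% round down.

```r
a <- c(11223344, -11223344)
by_trunc <- trunc(a / 100)   # rounds toward zero
by_floor <- floor(a / 100)   # rounds toward negative infinity
by_intdiv <- a %/% 100       # integer division, also rounds down
by_trunc
# [1]  112233 -112233
by_floor
# [1]  112233 -112234
by_intdiv
# [1]  112233 -112234
```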

Remove numbers from alphanumeric characters

I have a list of alphanumeric characters that looks like:
x <-c('ACO2', 'BCKDHB456', 'CD444')
I would like the following output:
x <-c('ACO', 'BCKDHB', 'CD')
Any suggestions?
# dput(tmp2)
structure(c(432L, 326L, 217L, 371L, 179L, 182L, 188L, 268L, 255L,...,
), class = "factor")
You can use gsub for this:
gsub('[[:digit:]]+', '', x)
or
gsub('[0-9]+', '', x)
# [1] "ACO" "BCKDHB" "CD"
If your goal is just to remove numbers, the removeNumbers() function from the tm package does exactly that, which saves you from writing the regex yourself.
library(tm)
x <-c('ACO2', 'BCKDHB456', 'CD444')
x <- removeNumbers(x)
x
[1] "ACO" "BCKDHB" "CD"
Using stringr:
Most stringr functions accept regular expressions, and str_replace_all will do what you need:
library(stringr)
str_replace_all(c('ACO2', 'BCKDHB456', 'CD444'), "[:digit:]", "")
A solution using stringi:
library(stringi)
# your data
x <-c('ACO2', 'BCKDHB456', 'CD444')
# extract capital letters
x <- stri_extract_all_regex(x, "[A-Z]+")
# unlist, so that you have a vector
x <- unlist(x)
Solution in one line: