Remove end of each number in a vector in R - regex

I have a vector like this:
a <- c(11223344,55667788)
I would like to create a new vector by cutting off the last two digits of each entry in a:
[1] 112233 556677
Do I have to use regex to achieve this or is there a simple indexing trick that I'm not aware of?

Or you could use sub
as.numeric(sub('..$','', a))
#[1] 112233 556677

If they're integers:
> trunc(a / 100)
[1] 112233 556677
If they're all positive, you could also use floor:
> floor(a / 100)
[1] 112233 556677
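The difference between trunc and floor only shows up for negative values. A small check, with a negative entry added here purely for illustration (it is not in the original vector):

```r
# Illustrative vector; the negative entry is an assumption added for the demo
a <- c(11223344, -11223344)

# trunc() rounds toward zero, so it simply drops the last two digits
toward_zero <- trunc(a / 100)

# floor() rounds toward minus infinity, so the negative value ends up one lower
toward_minus_inf <- floor(a / 100)

print(toward_zero)
print(toward_minus_inf)
```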

You can use substr (or substring), cutting the number of characters off at two before the end
a <- c(11223344, 55667788)
substr(a, 1, nchar(a)-2)
# [1] "112233" "556677"
wrapping in as.numeric if necessary.
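One caveat with the string-based approaches: substr and sub convert the numbers with as.character, which can switch to scientific notation for some values (100000 becomes "1e+05"). A hedged sketch that writes out the digits explicitly with sprintf first; the extra value 100000 and the variable names are my additions:

```r
# 100000 is added for illustration; it is not in the original vector
a <- c(11223344, 55667788, 100000)

# "%.0f" writes each number as plain digits, never in scientific notation
digits <- sprintf("%.0f", a)

# Drop the last two characters and convert back to numeric
shortened <- as.numeric(substr(digits, 1, nchar(digits) - 2))
print(shortened)
```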


Each character own element

I have a character vector
x <- "nkiLVkqspmLVAydnaVNLSCkys"
I want to split it into a vector with 25 elements, so that:
x[1]
# [1] "n"
x[2]
# [1] "k"
# and so on ...
The only thing I can think of is to use a regex to replace each alphabetic character with itself plus "," and then split on ",". Is there a simpler way to do this?
Try this:
x<-"nkiLVkqspmLVAydnaVNLSCkys"
y<-data.frame(strsplit(x,""))
then do:
y[1,1]
or
y[2,1]
and so on ...
We can also use
library(stringr)
y= data.frame(v1=str_extract_all(x, '.')[[1]])
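If a plain character vector is wanted rather than a data frame, the list returned by strsplit can be unwrapped directly. A minimal sketch (the name chars is my own):

```r
x <- "nkiLVkqspmLVAydnaVNLSCkys"

# strsplit returns a list with one element per input string;
# [[1]] pulls out the character vector for the single string x
chars <- strsplit(x, "")[[1]]

chars[1]
chars[2]
length(chars)
```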

R: Substring after finding a character position?

I have seen a few questions about returning the position of a character within a string in R, but I cannot seem to figure it out for my case. I think this is because I'm trying to do it for a whole column rather than a single string, but it could just be my struggles with regex.
Right now, I have a data.frame with a column, df$id that looks something like 13.23-45-6A. The number of digits before the period is variable, but I would like to retain just the part of the string after the period for each row in the column. I would like to do something like:
df$new <- substring(df$id, 1 + indexOf(".", df$id))
So 12.23-45-6A would become 23-45-6A, 0.1B would become 1B, 4.A-A would become A-A and so on for an entire column.
Right now I have:
df$new <- substr(df$id, 1 + regexpr("\\.", data.count$id),99)
Thanks for any advice.
As @AnandaMahto mentioned in his comment, you would probably be better off simplifying things and using gsub:
> x <- c("13.23-45-6A", "0.1B", "4.A-A")
> gsub("[0-9]*\\.(.*)", "\\1", x, perl = TRUE)
[1] "23-45-6A" "1B" "A-A"
To make this work with your existing data frame you can try:
df$id <- gsub("[0-9]*\\.(.*)", "\\1", df$id, perl = TRUE)
Another way is to use strsplit, using @Tim's example:
x <- c("13.23-45-6A", "0.1B", "4.A-A")
sapply(strsplit(x, "\\."), "[", -1)
"23-45-6A" "1B" "A-A"
You could remove the characters including the . using
sub('[^.]*\\.', '', x)
#[1] "23-45-6A" "1B" "A-A"
data
x <- c("13.23-45-6A", "0.1B", "4.A-A")
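For something closer to the indexOf idea in the question, regexpr with fixed = TRUE returns the position of the first literal period in each element, and substring is vectorised over those positions. A rough sketch; note that regexpr returns -1 for elements without a period, which this version does not handle specially:

```r
x <- c("13.23-45-6A", "0.1B", "4.A-A")

# Position of the first literal "." in each element (fixed = TRUE: no regex)
pos <- as.integer(regexpr(".", x, fixed = TRUE))

# Keep everything after that position, for the whole vector at once
after <- substring(x, pos + 1L)
print(after)
```

The same two lines should work on a data frame column by putting df$id in place of x.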

R: indexing result from a regular expression

I am trying to use the indexes that were returned from searching through a string for every instance of a character. When I use gregexpr(pattern, text),
lookfor<-"n"
string<-"ATTnGGCnATTn"
gregexpr(pattern=lookfor,text=string)
I get the following:
[[1]]
[1] 4 8 12
attr(,"match.length")
[1] 1 1 1
attr(,"useBytes")
[1] TRUE
How do I index through the first line to be able to use those locations? Thank you in advance for your help!
Addition (2) : After thinking about this for a while, I came to the conclusion that you could have simply used unlist on your original gregexpr call
> unlist(gregexpr("n", string))
# [1] 4 8 12
From your comment
I am looking for the position of each letter n
it follows that you could do any of these:
> which(strsplit(string, "")[[1]] == "n")
# [1] 4 8 12
> cumsum(nchar(strsplit(string, "n")[[1]])+1)
# [1] 4 8 12
> nc <- 1:nchar(string)
> which(substring(string, nc, nc) == "n")
# [1] 4 8 12
Addition (1) in regards to the similar strings (comment in another answer) : You could use strsplit again, and locate those values with one of the methods above
> string2 <- "ATTTGGCCATTG"
> w <- which(strsplit(string, "")[[1]] == "n")
> strsplit(string2, "")[[1]][w]
[1] "T" "C" "G"
If you want to extract all the matches, you can use the builtin function regmatches()
m <- gregexpr(regexp,string)
regmatches(string,m)
This will return a list of character vectors because string can be greater than length 1. If you're only passing one string in, you can get at the vector of matches bypassing the list with
regmatches(string,m)[[1]]
Here is a step-by-step method to find the indices. I suspect there are more efficient ways to achieve the same result. The argument fixed = TRUE tells R to look for the literal lower case "n" rather than treat it as a regular expression.
The [[1]] at the end keeps only the vector of indices from the list that gregexpr returns.
Subsetting with 1:length(index) shows all of the indices.
string="ATTnGGCnATTn"
index <- gregexpr(pattern = "n", text = string, fixed = TRUE)[[1]]
first.index <- index[1:length(index)]
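A slightly more defensive variant of the same idea: as.integer drops the extra attributes, and since gregexpr reports -1 when nothing matches, filtering on positive values leaves an empty vector in that case. The helper name find_positions is hypothetical:

```r
# Hypothetical helper: positions of a literal pattern in a single string
find_positions <- function(pattern, text) {
  # gregexpr returns a list; [[1]] is the result for the first string
  hits <- as.integer(gregexpr(pattern, text, fixed = TRUE)[[1]])
  # a lone -1 means "no match", so keep only real positions
  hits[hits > 0]
}

find_positions("n", "ATTnGGCnATTn")
find_positions("x", "ATTnGGCnATTn")
```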
I think I figured out my own answer!
Using the Biostrings package in Bioconductor:
string<-"ATTnGGCnATTn"
matches<-matchPattern(pattern="n", subject=string)
m<-as.vector(ranges(matches))
Now I can call up each position of "n" in all similar strings. Thank you to those who took the time to answer.

string manipulations using R

I am working with a string vector. Each element in the vector is in the format "MMYYPub". I want to switch the places of "MM" and "YY" in the string, from "MMYYPub" to "YYMMPub". Is this feasible in R?
Example :
vec[1]
'0100pub'
vec[2]
'0200pub'
Where the first two digits are the month and the following two digits are the year. There are 10 years of data in total, from 1994 to 2013.
It might also be useful to know about the yearmon class for representing monthly data. The yearmon object can then be printed in a format of your choice.
library(zoo)
ym <- as.yearmon("0414pub", format = "%m%ypub")
ym
# [1] "apr 2014"
format(ym, "%y%mpub")
# [1] "1404pub"
You need to read up on regular expressions. Here is one way:
R> val <- "0405pub"
R> gsub("(\\d\\d)(\\d\\d)(.*)", "\\2\\1\\3", val)
[1] "0504pub"
R>
We use the fact that
\d denotes a digit (but we need to escape the backslash)
(...) groups arguments, so here we match group one (two digits), group two (also two digits) and group three (the remainder)
we then "simply" create the replacement string as "two before one followed by three"
There are other ways to do it, but this will do the job given the pattern you described.
Edit: Here is a shorter variant using \\d{2} to request two digits:
R> gsub("(\\d{2})(\\d{2})", "\\2\\1", val)
[1] "0504pub"
R>
One way would be using (gsub) to replace all occurrences in the vector.
> vec <- c('0100pub', '0200pub')
> gsub('([0-9]{2})([0-9]{2})', '\\2\\1', vec)
[1] "0001pub" "0002pub"
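Because gsub replaces every match, a string containing a second run of four digits would be swapped there too. A hedged variant using sub with a ^ anchor, so that only the leading MMYY is touched; the third string is made up for illustration:

```r
# The third element is a made-up example with a second run of digits
vec <- c("0100pub", "0200pub", "0399pub2013")

# ^ ties the match to the start of the string; sub() replaces only the first match
swapped <- sub("^([0-9]{2})([0-9]{2})", "\\2\\1", vec)
print(swapped)
```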
You might not need regex for this, if you just want to swap the characters around. substring and paste work just as well in this case:
> f <- function(x) paste0(substring(x,3,4), substring(x,1,2), substring(x,5))
> x
[1] "0103pub" "0204pub"
> f(x)
[1] "0301pub" "0402pub"

Getting distance between two words in R

Say I have a line in a file:
string <- "thanks so much for your help all along. i'll let you know when...."
I want to return a value indicating if the word know is within 6 words of help.
This is essentially a very crude implementation of Crayon's answer as a basic function:
withinRange <- function(string, term1, term2, threshold = 6) {
  x <- strsplit(string, " ")[[1]]
  abs(grep(term1, x) - grep(term2, x)) <= threshold
}
withinRange(string, "help", "know")
# [1] TRUE
withinRange(string, "thanks", "know")
# [1] FALSE
I would suggest getting a basic idea of the text tools available to you, and using them to write such a function. Note Tyler's comment: As implemented, this can match multiple terms ("you" would match "you" and "your") leading to funny results. You'll need to determine how you want to deal with these cases to have a more useful function.
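One possible way to deal with the partial-match issue is to compare whole words with == instead of grep, and to take the smallest distance when a term occurs more than once. This is only a sketch; the name within_range and the choice to return FALSE when a term is missing are my assumptions:

```r
string <- "thanks so much for your help all along. i'll let you know when...."

# Sketch: exact word matching, smallest distance over all occurrences
within_range <- function(string, term1, term2, threshold = 6) {
  words <- strsplit(string, "\\s+")[[1]]
  i <- which(words == term1)   # exact matches only, so "you" does not hit "your"
  j <- which(words == term2)
  if (length(i) == 0 || length(j) == 0) return(FALSE)
  # outer() gives every pairwise difference between the two sets of positions
  min(abs(outer(i, j, "-"))) <= threshold
}

within_range(string, "help", "know")
within_range(string, "thanks", "know")
within_range(string, "you", "know")
```

Punctuation stays attached to words here ("along." is not equal to "along"), so it may need stripping first, depending on the text.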
You won't be able to get this from regex alone. I suggest splitting on spaces, then looping or using a built-in function to search the resulting array for your two terms, and taking the difference of their indexes (array positions).
edit: Okay I thought about it a second and perhaps this will work for you as a regex pattern:
\bhelp(\s+[^\s]+){0,5}\s+know\b
This uses the same "space is the delimiter" concept. It first matches help, then greedily up to 5 groups of whitespace plus a word, then looks for whitespace followed by know (so know is at most the 6th word after help).
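The pattern idea can be tried from R with grepl and perl = TRUE. In this sketch the function name follows_within, the use of \S+ for a word, and the {0,n} gap are my own choices. The two terms are pasted into the pattern unescaped, so they are assumed to contain no regex metacharacters, and only the order "first ... second" is checked:

```r
string <- "thanks so much for your help all along. i'll let you know when...."

# Hypothetical helper: is `second` at most max_between words after `first`?
follows_within <- function(text, first, second, max_between = 5) {
  # Build e.g. \bhelp(\s+\S+){0,5}\s+know\b from the arguments
  pattern <- sprintf("\\b%s(\\s+\\S+){0,%d}\\s+%s\\b", first, max_between, second)
  grepl(pattern, text, perl = TRUE)
}

follows_within(string, "help", "know")
follows_within(string, "thanks", "know")
```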
Split your string:
> words <- strsplit(string, '\\s')[[1]]
Build an index vector:
> indices <- 1:length(words)
Name indices:
> names(indices) <- words
Compute distance between words:
> abs(indices["help"] - indices["know"]) < 6
FALSE
EDIT In a function
distance <- function(string, term1, term2) {
  words <- strsplit(string, "\\s")[[1]]
  indices <- 1:length(words)
  names(indices) <- words
  abs(indices[term1] - indices[term2])
}
distance(string, "help", "know") < 6
EDIT Plus
There is a great advantage in indexing words: once it's done, you can compute a lot of statistics on a text.
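As a rough illustration of that point, split can collect every position of each word (a named index vector only finds the first occurrence when a word repeats), and table gives word counts. The variable names are mine:

```r
string <- "thanks so much for your help all along. i'll let you know when...."
words <- strsplit(string, "\\s+")[[1]]

# One list entry per distinct word, holding all of its positions
positions <- split(seq_along(words), words)
positions[["help"]]
positions[["know"]]

# Word frequencies from the same split
counts <- table(words)
head(sort(counts, decreasing = TRUE))
```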