string manipulations using R - regex

I am working with a string vector. Each element in the vector is in the format "MMYYPub". I want to swap the positions of "MM" and "YY" in the string, from "MMYYPub" to "YYMMPub". Is this feasible in R?
Example :
vec(1)
'0100pub'
vec(2)
'0200pub'
The first two digits are the month, and the following two digits are the year. There are 10 years of data in total, from 1994 to 2013.

It might also be useful to know about the yearmon class for representing monthly data. The yearmon object can then be printed in a format of your choice.
library(zoo)
ym <- as.yearmon("0414pub", format = "%m%ypub")
ym
# [1] "apr 2014"
format(ym, "%y%mpub")
# [1] "1404pub"
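If you would rather not depend on zoo, base R dates can do a similar job. This is only a sketch: it assumes every element starts with exactly four digits (MMYY) followed by the literal suffix "pub", and it prepends a dummy day of 01 so that as.Date() can parse the value. The names vec, d and swapped are illustrative.

```r
vec <- c("0100pub", "0200pub", "1294pub")

# Prepend a dummy day so the first four characters parse as day-month-year
d <- as.Date(paste0("01", substr(vec, 1, 4)), format = "%d%m%y")

# Print the two-digit year first, then the month, then the literal suffix
swapped <- format(d, "%y%mpub")
swapped
# [1] "0001pub" "0002pub" "9412pub"
```

Going through a date class also validates the month, since an impossible month such as 13 comes back as NA instead of being silently swapped.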

You need to read up on regular expressions. Here is one way:
R> val <- "0405pub"
R> gsub("(\\d\\d)(\\d\\d)(.*)", "\\2\\1\\3", val)
[1] "0504pub"
R>
We use the fact that
\d denotes a digit (but need to escape the backslash)
(...) groups arguments, so here we match group one (two digits), group two (also two digits) and group three (the remainder)
we then "simply" create the replacement string as "two before one, followed by three"
There are other ways to do it, but this will accomplish it given the pattern you described.
Edit: Here is a shorter variant using \\d{2} to request two digits:
R> gsub("(\\d{2})(\\d{2})", "\\2\\1", val)
[1] "0504pub"
R>
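Since gsub and sub are vectorised, the same idea applies to the whole vector at once. A minimal sketch, assuming every element begins with four digits; the ^ anchor and the use of sub (first match only) are my additions to guard against swapping digits elsewhere in the string:

```r
vec <- c("0100pub", "0200pub", "1294pub")

# ^ anchors the match at the start of each element;
# sub() replaces only the first match per element
swapped <- sub("^(\\d{2})(\\d{2})", "\\2\\1", vec)
swapped
# [1] "0001pub" "0002pub" "9412pub"
```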

One way would be to use gsub to replace all occurrences in the vector.
> vec <- c('0100pub', '0200pub')
> gsub('([0-9]{2})([0-9]{2})', '\\2\\1', vec)
[1] "0001pub" "0002pub"

You might not need regex for this, if you just want to swap the characters around. substring and paste work just as well in this case:
> f <- function(x) paste0(substring(x,3,4), substring(x,1,2), substring(x,5))
> x
[1] "0103pub" "0204pub"
> f(x)
[1] "0301pub" "0402pub"

How to modify string in R taking into account the number of symbols you want to modify [duplicate]

This question is very easy to understand, but I can't wrap my head around how to get a solution. Let's say I have a vector and I want to modify it so it would have 5 integers at the end, and missing digits are replaced with zeros:
Smth1 Smth00001
Smth22 Smth00022
Smth333 Smth00333
Smth4444 Smth04444
Smth55555 Smth55555
I guess it can be done with regex and functions like gsub, but I don't understand how to take into account the length of the replaced string.
Here's an idea using stringi:
v <- c("Smth1", "Smth22", "Smth333", "Smth4444", "Smth55555")
library(stringi)
d <- stri_extract(v, regex = "[:digit:]+")
a <- stri_extract(v, regex = "[:alpha:]+")
paste0(a, stri_pad_left(d, 5, "0"))
Which gives:
[1] "Smth00001" "Smth00022" "Smth00333" "Smth04444" "Smth55555"
Using base R, with the vector stored in x. Someone else can prettify the regex:
sprintf("%s%05d", gsub("^([^0-9]+)..*$", "\\1", x),
        as.numeric(gsub("^..*[^0-9]([0-9]+)$", "\\1", x)))
[1] "Smth00001" "Smth00022" "Smth00333" "Smth04444" "Smth55555"
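One way to tidy the base R version is to strip each half with its own anchored sub call. This is a sketch that assumes every element is a run of non-digits followed by a run of digits, as in the question; prefix, digits and padded are illustrative names.

```r
v <- c("Smth1", "Smth22", "Smth333", "Smth4444", "Smth55555")

# Drop the trailing digits to keep the prefix, and the leading
# non-digits to keep the number
prefix <- sub("[0-9]+$", "", v)
digits <- sub("^[^0-9]+", "", v)

# %05d left-pads the integer with zeros to a width of 5
padded <- paste0(prefix, sprintf("%05d", as.integer(digits)))
padded
# [1] "Smth00001" "Smth00022" "Smth00333" "Smth04444" "Smth55555"
```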
Here is a simple one-line solution similar to Zelazny's, but using a replacement callback with gsubfn from the gsubfn library:
> library(gsubfn)
> v <- c("Smth1", "Smth22", "Smth333", "Smth4444", "Smth55555")
> gsubfn('[0-9]+$', ~ sprintf("%05d",as.numeric(x)), v)
[1] "Smth00001" "Smth00022" "Smth00333" "Smth04444" "Smth55555"
The regex [0-9]+$ matches 1 or more digits at the end of the string only, due to the $ anchor. The matched digits are passed to the callback (~) and sprintf("%05d",as.numeric(x)) pads the number (parsed as a numeric with as.numeric) with zeros.
To only modify strings that have 1+ non-digit symbols at the start and then 1 or more digits up to the end, just use this PCRE-based gsubfn:
> gsubfn('^[^0-9]+\\K([0-9]+)$', ~ sprintf("%05d",as.numeric(x)), v, perl=TRUE)
[1] "Smth00001" "Smth00022" "Smth00333" "Smth04444" "Smth55555"
where
^ - start of string
[^0-9]+\\K - matches 1+ non-digit symbols and \K will omit them
([0-9]+) - Group 1 passed to the callback
$ - end of string.
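The \K behaviour described above can also be tried with base R's sub when perl = TRUE. The sketch below goes in the opposite direction to the question (it strips the zero padding again) purely to show that text matched before \K is kept out of the replacement; padded and stripped are illustrative names.

```r
padded <- c("Smth00001", "Smth00333", "Smth55555")

# [^0-9]+ must match, but \K resets the start of the reported match,
# so only the run of zeros after the prefix is replaced
stripped <- sub("^[^0-9]+\\K0+", "", padded, perl = TRUE)
stripped
# [1] "Smth1"     "Smth333"   "Smth55555"
```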
Here is a solution using the stringr library:
library(stringr)
library(dplyr)
num <- str_extract(v, "[0-9]+")
padding <- 9 - nchar(num)
output <- paste0(str_extract(v, "[^0-9]+") %>%
  str_pad(width = padding, side = "right", pad = "0"), num)
The output is:
"Smth00001" "Smth00022" "Smth00333" "Smth04444" "Smth55555"
library(stringr)
paste0(str_extract(v,'\\D+'),str_pad(str_extract(v,'\\d+'),5,'left', '0'))
#[1] "Smth00001" "Smth00022" "Smth00333" "Smth04444" "Smth55555"

R: Substring after finding a character position?

I have seen a few questions about returning the position of a character within a string in R, but I cannot seem to figure it out for my case. I think this is because I'm trying to do it for a whole column rather than a single string, but it could just be my struggles with regex.
Right now, I have a data.frame with a column, df$id that looks something like 13.23-45-6A. The number of digits before the period is variable, but I would like to retain just the part of the string after the period for each row in the column. I would like to do something like:
df$new <- substring(df$id, 1 + indexOf(".", df$id))
So 12.23-45-6A would become 23-45-6A, 0.1B would become 1B, 4.A-A would become A-A and so on for an entire column.
Right now I have:
df$new <- substr(df$id, 1 + regexpr("\\\.", data.count$id),99)
Thanks for any advice.
As @AnandaMahto mentioned in his comment, you would probably be better off simplifying things and using gsub:
> x <- c("13.23-45-6A", "0.1B", "4.A-A")
> gsub("[0-9]*\\.(.*)", "\\1", x, perl = TRUE)
[1] "23-45-6A" "1B" "A-A"
To make this work with your existing data frame you can try:
df$id <- gsub("[0-9]*\\.(.*)", "\\1", df$id, perl = TRUE)
Another way is to use strsplit. Using @Tim's example:
x <- c("13.23-45-6A", "0.1B", "4.A-A")
sapply(strsplit(x, "\\."), "[", -1)
"23-45-6A" "1B" "A-A"
You could remove the characters including the . using
sub('[^.]*\\.', '', x)
#[1] "23-45-6A" "1B" "A-A"
data
x <- c("13.23-45-6A", "0.1B", "4.A-A")
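For completeness, the "find the position, then take the substring" approach from the question also works across a whole vector. A minimal sketch, assuming the split should happen at the first period; pos and after are illustrative names, and fixed = TRUE is used so the period needs no escaping:

```r
x <- c("13.23-45-6A", "0.1B", "4.A-A")

# regexpr() is vectorised and returns the position of the first match
# in each element (-1 when there is none)
pos <- as.integer(regexpr(".", x, fixed = TRUE))

# substring() defaults to reading through to the end of each string
after <- substring(x, pos + 1)
after
# [1] "23-45-6A" "1B"       "A-A"
```

Elements without a period come back unchanged, because a position of -1 makes substring start from the beginning.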

Sequentially replace multiple places matching single pattern in a string with different replacements

Using stringr package, it is easy to perform regex replacement in a vectorized manner.
Question: How can I do the following:
Replace every word in
hello,world??your,make|[]world,hello,pos
to different replacements, e.g. increasing numbers
1,2??3,4|[]5,6,7
Note that simple separators cannot be assumed, the practical use case is more complicated.
stringr::str_replace_all does not seem to work because it
str_replace_all(x, "(\\w+)", 1:7)
produces a vector for each replacement applied to all words, or it has
uncertain and/or duplicate input entries so that
str_replace_all(x, c("hello" = "1", "world" = "2", ...))
will not work for the purpose.
Here's another idea using gsubfn. The pre function is run before the substitutions and the fun function is run for each substitution:
library(gsubfn)
x <- "hello,world??your,make|[]world,hello,pos"
p <- proto(pre = function(t) t$v <- 0, # initialise the counter before matching
           fun = function(t, x) t$v <- v + 1) # increment the counter and return it
gsubfn("\\w+", p, x)
Which gives:
[1] "1,2??3,4|[]5,6,7"
This variation would give the same answer since gsubfn maintains a count variable for use in proto functions:
pp <- proto(fun = function(...) count)
gsubfn("\\w+", pp, x)
See the gsubfn vignette for examples of using count.
I would suggest the "ore" package for something like this. Of particular note would be ore.search and ore.subst, the latter of which can accept a function as the replacement value.
Examples:
library(ore)
x <- "hello,world??your,make|[]world,hello,pos"
## Match all and replace with the sequence in which they are found
ore.subst("(\\w+)", function(i) seq_along(i), x, all = TRUE)
# [1] "1,2??3,4|[]5,6,7"
## Create a cool ore object with details about what was extracted
ore.search("(\\w+)", x, all = TRUE)
# match: hello world your make world hello pos
# context: , ?? , |[] , ,
# number: 1==== 2==== 3=== 4=== 5==== 6==== 7==
Here is a base R solution. It should also be vectorized.
x="hello,world??your,make|[]world,hello,pos"
#split x into single chars
x_split=strsplit(x,"")[[1]]
#find all char positions and replace them with "a"
x_split[gregexpr("\\w", x)[[1]]]="a"
#find all runs of "a"
rle_res=rle(x_split)
#replace run lengths by 1
rle_res$lengths[rle_res$values=="a"]=1
#replace run values by increasing number
rle_res$values[rle_res$values=="a"]=1:sum(rle_res$values=="a")
#use inverse.rle on the modified rle object and collapse string
paste0(inverse.rle(rle_res),collapse="")
#[1] "1,2??3,4|[]5,6,7"
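A shorter base R route is the replacement form of regmatches, which writes a vector of replacements back into the matched positions. A sketch under the assumption that numbering should restart for each string; m and numbered are illustrative names:

```r
x <- "hello,world??your,make|[]world,hello,pos"

# Locate every run of word characters
m <- gregexpr("\\w+", x)

# For each string, replace its matches with 1, 2, 3, ... in order
numbered <- x
regmatches(numbered, m) <- lapply(
  regmatches(numbered, m),
  function(words) as.character(seq_along(words))
)
numbered
# [1] "1,2??3,4|[]5,6,7"
```

Because gregexpr and regmatches both work on character vectors, the same code handles several strings at once.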

Getting distance between two words in R

Say I have a line in a file:
string <- "thanks so much for your help all along. i'll let you know when...."
I want to return a value indicating if the word know is within 6 words of help.
This is essentially a very crude implementation of Crayon's answer as a basic function:
withinRange <- function(string, term1, term2, threshold = 6) {
x <- strsplit(string, " ")[[1]]
abs(grep(term1, x) - grep(term2, x)) <= threshold
}
withinRange(string, "help", "know")
# [1] TRUE
withinRange(string, "thanks", "know")
# [1] FALSE
I would suggest getting a basic idea of the text tools available to you, and using them to write such a function. Note Tyler's comment: As implemented, this can match multiple terms ("you" would match "you" and "your") leading to funny results. You'll need to determine how you want to deal with these cases to have a more useful function.
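One possible way to deal with the partial-match problem is to compare whole words instead of using grep. This is a sketch, not a definitive fix: it assumes words are separated by whitespace, strips trailing punctuation only, and returns TRUE if any pair of occurrences is close enough. within_range is a hypothetical name.

```r
within_range <- function(string, term1, term2, threshold = 6) {
  words <- strsplit(string, "\\s+")[[1]]
  # Drop trailing punctuation so "along." compares equal to "along"
  words <- sub("[[:punct:]]+$", "", words)
  i <- which(words == term1)
  j <- which(words == term2)
  if (length(i) == 0 || length(j) == 0) return(FALSE)
  # Compare every occurrence of term1 with every occurrence of term2
  any(abs(outer(i, j, "-")) <= threshold)
}

string <- "thanks so much for your help all along. i'll let you know when...."
within_range(string, "help", "know")
# [1] TRUE
within_range(string, "you", "thanks", threshold = 5)
# [1] FALSE
```

The second call is FALSE here because "you" no longer matches "your"; with the grep version the word "your" at position 5 would have counted.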
You won't be able to get this from regex alone. I suggest splitting on spaces, then looping or using a built-in function to search the resulting array for your two terms and subtracting their indexes (array positions).
edit: Okay I thought about it a second and perhaps this will work for you as a regex pattern:
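\bhelp(\s+[^\s]+){0,5}\s+know\b
This takes the same "space is the delimiter" concept. It first matches help, then greedily matches up to 5 " word" groups, then looks for " know" (since "know" would be the 6th). The quantifier must not be possessive, otherwise the pattern cannot backtrack when "know" appears earlier than the 6th word.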
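In R, a pattern of this kind can be tested with grepl. The sketch below doubles the backslashes for R string syntax and uses \S in place of [^\s]; it only detects "know" appearing after "help", not the reverse order. near and too_far are illustrative names.

```r
string <- "thanks so much for your help all along. i'll let you know when...."

# Up to 5 intervening words are allowed between "help" and "know"
near <- grepl("\\bhelp(\\s+\\S+){0,5}\\s+know\\b", string, perl = TRUE)
near
# [1] TRUE

# Allowing only 4 intervening words is not enough for this sentence
too_far <- grepl("\\bhelp(\\s+\\S+){0,4}\\s+know\\b", string, perl = TRUE)
too_far
# [1] FALSE
```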
Split your string:
> words <- strsplit(string, '\\s')[[1]]
Build an index vector:
> indices <- 1:length(words)
Name indices:
> names(indices) <- words
Compute distance between words:
> abs(indices["help"] - indices["know"]) < 6
FALSE
EDIT In a function
distance <- function(string, term1, term2) {
words <- strsplit(string, "\\s")[[1]]
indices <- 1:length(words)
names(indices) <- words
abs(indices[term1] - indices[term2])
}
distance(string, "help", "know") < 6
EDIT Plus
There is a great advantage in indexing words: once it's done, you can compute a lot of statistics on a text.

Pattern matching and replacement in R

I am not familiar at all with regular expressions, and would like to do pattern matching and replacement in R.
I would like to replace the pattern #1, #2 in the vector: original = c("#1", "#2", "#10", "#11") with each value of the vector vec = c(1,2).
The result I am looking for is the following vector: c("1", "2", "#10", "#11")
I am not sure how to do that. I tried doing:
for(i in 1:2) {
pattern = paste("#", i, sep = "")
original = gsub(pattern, vec[i], original, fixed = TRUE)
}
but I get :
#> original
#[1] "1" "2" "10" "11"
instead of: "1" "2" "#10" "#11"
I would appreciate any help I can get! Thank you!
Specify that you are matching the entire string from start (^) to end ($).
Here, I've matched exactly the conditions you are looking at in this example, but I'm guessing you'll need to extend it:
> gsub("^#([1-2])$", "\\1", original)
[1] "1" "2" "#10" "#11"
So, that's basically, "from the start, look for a hash symbol followed by either the exact number one or two. The one or two should be just one digit (that's why we don't use * or + or something) and also ends the string. Oh, and capture that one or two because we want to 'backreference' it."
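If the replacement values do not happen to equal the digits in the pattern, exact matching against a lookup table avoids regex altogether. A minimal sketch, assuming the i-th value of vec replaces the string "#i"; lookup, hit and result are illustrative names:

```r
original <- c("#1", "#2", "#10", "#11")
vec <- c(1, 2)

# Map "#1" -> "1", "#2" -> "2"
lookup <- setNames(as.character(vec), paste0("#", seq_along(vec)))

# match() compares whole strings, so "#10" cannot be mistaken for "#1"
hit <- match(original, names(lookup))
result <- original
result[!is.na(hit)] <- unname(lookup[hit[!is.na(hit)]])
result
# [1] "1"   "2"   "#10" "#11"
```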
Another option using gsubfn:
library(gsubfn)
gsubfn("^#([1-2])$", I, original) ## Function substituting
[1] "1" "2" "#10" "#11"
Or, if you want to explicitly use the values of your vector vec:
gsubfn("^#[1-2]$", as.list(setNames(vec,c("#1", "#2"))), original)
Or formula notation equivalent to function notation:
gsubfn("^#([1-2])$", ~ x, original) ## formula substituting
Here's a slightly different take that uses a zero-width negative lookahead assertion (what a mouthful!). This is the (?!...) construct: the pattern matches # at the start of a string as long as it is not followed by whatever is in the .... In this case that is two digits (or, equivalently, more, as long as they are contiguous). The matched # is replaced with nothing.
gsub( "^#(?![0-9]{2})" , "" , original , perl = TRUE )
[1] "1" "2" "#10" "#11"