How to split array of strings from two sides? - regex

I have an array of strings (n=1000) in this format:
strings<-c("GSM1264936_2202_4866_28368_150cGy-GCSF6-m3_Mouse430A+2.CEL.gz",
"GSM1264937_2202_4866_28369_150cGy-GCSF6-m4_Mouse430A+2.CEL.gz",
"GSM1264938_2202_4866_28370_150cGy-GCSF6-m5_Mouse430A+2.CEL.gz")
I'm wondering what may be an easy way to get this:
strings2<-c("2201_4866_28368_150cGy-GCSF6-m3_Mouse430A+2.CEL",
"2202_4866_28369_150cGy-GCSF6-m4_Mouse430A+2.CEL",
"2203_4866_28370_150cGy-GCSF6-m5_Mouse430A+2.CEL")
which means to trim off "GSM1234567" from the front and ".gz" from the end.

Here is a gsub solution. The pattern matches either of two things: a run of alphanumeric characters at the start of the string (^), repeated zero or more times (*), up to and including the first _, or a literal .gz at the end of the string ($).
gsub("^([[:alnum:]]*_)|(\\.gz)$", "", strings)
[1] "2202_4866_28368_150cGy-GCSF6-m3_Mouse430A+2.CEL"
[2] "2202_4866_28369_150cGy-GCSF6-m4_Mouse430A+2.CEL"
[3] "2202_4866_28370_150cGy-GCSF6-m5_Mouse430A+2.CEL"
Edit: I forgot to escape the second dot.
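To make the alternation easier to follow, here is a small sketch that runs each branch of the pattern on its own and then both together. It only uses one of the strings from the question; the variable names (no_prefix, no_suffix, both) are mine.

```r
strings <- "GSM1264936_2202_4866_28368_150cGy-GCSF6-m3_Mouse430A+2.CEL.gz"

# Left branch only: drop the leading alphanumeric run and the first underscore
no_prefix <- sub("^[[:alnum:]]*_", "", strings)

# Right branch only: drop a literal ".gz" at the end of the string.
# The dot is escaped; an unescaped "." would match any character.
no_suffix <- sub("\\.gz$", "", strings)

# Both branches joined with "|". gsub (not sub) is needed here,
# because each string contains two separate matches.
both <- gsub("^[[:alnum:]]*_|\\.gz$", "", strings)

print(no_prefix)
print(no_suffix)
print(both)
```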

strings <- c("GSM1264936_2202_4866_28368_150cGy-GCSF6-m3_Mouse430A+2.CEL.gz", "GSM1264937_2202_4866_28369_150cGy-GCSF6-m4_Mouse430A+2.CEL.gz", "GSM1264938_2202_4866_28370_150cGy-GCSF6-m5_Mouse430A+2.CEL.gz")
strings2 <- substr(strings, 12, 58)  # substr is vectorized, so no lapply is needed; keeps characters 12 to 58 of each string

You can do this using sub:
sub('[^_]+_(.*)\\.gz', '\\1', strings)
# [1] "2202_4866_28368_150cGy-GCSF6-m3_Mouse430A+2.CEL"
# [2] "2202_4866_28369_150cGy-GCSF6-m4_Mouse430A+2.CEL"
# [3] "2202_4866_28370_150cGy-GCSF6-m5_Mouse430A+2.CEL"

Try:
gsub('^[^_]+_|\\.[^.]*$','',strings)

I strongly suggest doing this in two steps. The other solutions work but are completely unreadable: they don’t express the intent of your code. Here it is, clearly expressed:
trimmed_prefix = sub('^GSM\\d+_', '', strings)
strings2 = sub('\\.gz$', '', trimmed_prefix)
But admittedly this can be expressed in one step, and wouldn’t look too bad, as follows:
strings2 = sub('^GSM\\d+_(.*)\\.gz$', '\\1', strings)
In general, think carefully about the patterns you actually want to match: your question says to match the prefix “GSM1234567” but your example contradicts that. I’d generally choose a pattern that’s as specific as possible to avoid accidentally matching faulty input.
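As a rough illustration of why a specific pattern helps, the sketch below checks each input against the strict pattern first and returns NA for anything that does not fit, instead of passing it through unchanged. The second input ("README.txt") is a made-up example of faulty input, not something from the question.

```r
strings <- c("GSM1264936_2202_4866_28368_150cGy-GCSF6-m3_Mouse430A+2.CEL.gz",
             "README.txt")  # hypothetical faulty entry

pattern <- "^GSM\\d+_(.*)\\.gz$"

# TRUE only for strings that have both the GSM prefix and the .gz suffix
ok <- grepl(pattern, strings)

# Trim well-formed names; mark everything else as missing
strings2 <- ifelse(ok, sub(pattern, "\\1", strings), NA_character_)
print(strings2)
```

With a looser pattern such as '^[^_]+_|\\.[^.]*$', the second entry would silently lose its extension instead of being flagged.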

R: how to convert part of a string to variable name and return its value in the same string?

Suppose I have a string marco <- 'polo'. Is there any way I can embed marco in the middle of another string, e.g. x <- 'John plays water marco.' and have x return 'John plays water polo.'?
EDIT
The solution David kindly offered does work for the hypothetical problem I posted above, but what I was trying to get to was this:
data <- c('kek','koki','ukak','ikka')
V <- c('a|e|i|o|u')
Rather than deleting all vowels, which the solution can manage (gsub(V,'',data)), how do I specify, say, all vowels between two k's? Obviously gsub('kVk','',data) doesn't work. Any help would be greatly appreciated.
If you want all vowels between two "k" letters removed, I propose the following:
V <- '[aeiou]'
data <- c('kek', 'koki', 'ukak', 'ikka', 'keeuiokaeioukaeiousk')
gsub(paste0('(?:\\G(?!^)|[^k]*k(?=[^k]+k))\\K', V), '', data, perl=T)
# [1] "kk" "kki" "ukk" "ikka" "kkksk"
The \G feature is an anchor that can match at one of two positions: the start of the string, or the position where the last match ended. \K resets the starting point of the reported match, so any previously consumed characters are no longer included in it, which is similar to a lookbehind.
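A much smaller sketch of \K on its own may help (it leaves \G out on purpose). It removes a single vowel directly after a "k", and compares that with the equivalent lookbehind. This is only an illustration of the \K behaviour described above, not a replacement for the full pattern.

```r
x <- c("kek", "koki")

# "k" has to be matched, but \K drops it from the reported match,
# so only the vowel after it is replaced
with_K <- gsub("k\\K[aeiou]", "", x, perl = TRUE)

# Same idea written as a lookbehind
with_lookbehind <- gsub("(?<=k)[aeiou]", "", x, perl = TRUE)

print(with_K)
print(with_lookbehind)
```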
Or, to use the example as given:
V <- 'a|e|i|o|u' ## or equivalently '[aeiou]'
dd <- c('kek','koki','ukak','ikka','kaaaak')
gsub(paste0("k(",V,")+k"),"kk",dd)
## [1] "kk" "kki" "ukk" "ikka" "kk"
I guessed that you might (?) want to delete multiple vowels between ks; I added a + to the regular expression to do this.
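Another way to write this, as a sketch: instead of matching the two ks and putting them back, use a lookbehind and a lookahead so that only the vowels themselves are matched. This is a different technique from the one above and needs perl = TRUE; the pattern is still built from the variable V with paste0.

```r
V  <- "[aeiou]"
dd <- c("kek", "koki", "ukak", "ikka", "kaaaak")

# (?<=k) requires a "k" just before the vowels, (?=k) a "k" just after;
# neither "k" is part of the match, so the replacement can be empty
pattern <- paste0("(?<=k)", V, "+(?=k)")
res <- gsub(pattern, "", dd, perl = TRUE)
print(res)
```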

Find repeated pattern in a string of characters using R

I have a large text that contains expressions such as: "aaaahahahahaha that was a good joke". After processing, I want the "aaaahahahahaha" to disappear, or at least to be changed to simply "ha".
At the moment, I am using this:
gsub('(.+?)\\1', '', str)
This works when the string with the pattern is at the beginning of the sentence, but not when it is located anywhere else. So:
str <- "aaaahahahahaha that was a good joke"
gsub('(.+?)\\1', '', str)
#[1] "ha that was a good joke"
But
str <- "that was aaaahahahahaha a good joke"
gsub('(.+?)\\1', '', str)
#[1] "that was aaaahahahahaha a good joke"
This question might be related to this one: find repeated pattern in python, but I can't find the equivalent in R.
I am assuming this is very simple and perhaps I am missing something trivial, but since regular expressions are not my strength and I have already tried a bunch of things that have not worked, I was wondering if someone could help me. The question is: How to find, and substitute, repeated patterns in a string of characters in R?
Thanks in advance for your time.
Use this:
\b(\S+?)\1\S*\b
See the demo:
https://regex101.com/r/sJ9gM7/46
For R, use \\b(\\S+?)\\1\\S*\\b with the perl=TRUE option.
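Here is a sketch of how that pattern could be applied in R to both sentences from the question. The extra whitespace clean-up at the end is my addition, since removing a whole word leaves a double space behind.

```r
str <- c("aaaahahahahaha that was a good joke",
         "that was aaaahahahahaha a good joke")

# Remove any word that starts with an immediately repeated chunk.
# The \b anchors keep the match on whole words, so the "oo" in "good" is left alone.
cleaned <- gsub("\\b(\\S+?)\\1\\S*\\b", "", str, perl = TRUE)

# Collapse leftover whitespace (not part of the original answer)
cleaned <- gsub("\\s+", " ", trimws(cleaned))
print(cleaned)
```

Be aware that this also removes ordinary words that begin with a doubled chunk (for example "mama"), so it is worth checking against real text first.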

Correct wrongly formatted dates

I have some incorrectly formatted dates mixed in with well-formatted ones, looking something like this:
df <- data.frame(col=c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05"))
How can I convert the incorrect format between the existing correctly formatted dates?
I'm able to remove the leading dashes, but the last characters (-01 or -1) also need to be removed, so that the corrected values are:
desired <- c("1.1.11","1.1.12","1.1.13","1.1.14","1.10.10","1.10.11","1.10.12","2010-03-31","2010-04-01","2010-04-05")
What I'm struggling with is the -01 part, since removing it would also remove part of the correctly formatted dates.
EDIT: The format is mm.dd.yy
Here is a pretty simple solution using sub ...
sub('^-+([^-]+).+', '\\1', df$col)
# [1] "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10"
# [6] "1.10.11" "1.10.12" "2010-03-31" "2010-04-01" "2010-04-05"
Just remove all the non-word characters at the start, as well as a trailing -01 or -1 that is not preceded by a dash followed by two digits.
> x <- c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05")
> gsub("^\\W+|(?<!-\\d{2})-0?1$", "", x, perl=T)
[1] "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10"
[6] "1.10.11" "1.10.12" "2010-03-31" "2010-04-01" "2010-04-05"
A simple regexp will solve these kinds of problems pretty well:
> df <- c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05")
> df
[1] "--1.1.11-01" "--1.11.12-1" "--1.1.13-01" "--1.1.14-01" "--1.10.10-01" "-1.10.11-01" "---1.10.12-01"
[8] "2010-03-31" "2010-04-01" "2010-04-05"
> df <- sub(".*([0-9]{4}\\-[0-9]{2}\\-[0-9]{2}|[0-9]{1,2}\\.[0-9]{1,2}\\.[0-9]{1,2}).*", "\\1", df)
> df
[1] "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10" "1.10.11" "1.10.12" "2010-03-31" "2010-04-01"
[10] "2010-04-05"
Note that I made it a character vector instead of a data.frame.
The solution simply matches one pattern or the other and drops the rest by replacing the whole match with the captured subpattern.
I observe that an invalid suffix such as -01 only occurs when the date also has a dash prefix such as -1 or --1.
You could first take all the values into an array.
So you will have an array of "--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01"
Now you can check each value for the dash prefix (-1 or --1). If it is present, mark the value so that the suffix (-01 or -1) is removed as well.
Judging from the input pattern above, this strategy should work.
Please let me know if it does.
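One possible way to write that strategy down in R, as a sketch: flag the entries that start with one or more dashes, and strip the prefix and the -01 / -1 suffix only from those. The names has_prefix and fixed are mine, and the code assumes the observation above holds for all of the data.

```r
x <- c("--1.1.11-01", "--1.11.12-1", "--1.1.13-01", "--1.1.14-01",
       "--1.10.10-01", "-1.10.11-01", "---1.10.12-01",
       "2010-03-31", "2010-04-01", "2010-04-05")

# Entries that start with at least one dash are the malformed ones
has_prefix <- grepl("^-+", x)

fixed <- x
# For flagged entries only: drop the leading dashes, then the trailing -01 or -1
fixed[has_prefix] <- sub("-0?1$", "", sub("^-+", "", x[has_prefix]))
print(fixed)
```

Because the well-formed dates are never touched, "2010-04-01" keeps its final -01.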

Subsetting a string based on pre- and suffix

I have a column with these type of names:
sp_O00168_PLM_HUMAM
sp_Q8N1D5_CA158_HUMAN
sp_Q15818_NPTX1_HUMAN
tr_Q6FGH5_Q6FGH5_HUMAN
sp_Q9UJ99_CAD22_HUMAN
I want to remove everything before, and including, the second _ and everything after, and including, the third _.
I do not wish to remove based on the number of characters, since this is not a fixed number.
The output should be:
PLM
CA158
NPTX1
Q6FGH5
CAD22
I have played around with these, but don't quite get it right..
library(stringr)
str_sub(x,-6,-1)
That’s not really a subset in programming terminology [1], it’s a substring. In order to extract partial strings, you’d usually use regular expressions (pretty much regardless of language); in R, this is accessible via sub and other related functions:
pattern = '^.*_.*_([^_]*)_.*$'
result = sub(pattern, '\\1', strings)
[1] Aside: taking a subset is, as the name says, a set operation, and sets are defined by having no duplicate elements and no particular order to the elements. A string, by contrast, is a sequence, which is a very different concept.
Another possible regular expression is this:
sub("^(?:.+_){2}(.+?)_.+", "\\1", vec)
# [1] "PLM" "CA158" "NPTX1" "Q6FGH5" "CAD22"
where vec is your vector of strings.
> gsub(".*_.*_(.*)_.*", "\\1", "sp_O00168_PLM_HUMAM")
[1] "PLM"
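If you would rather avoid a regular expression here, a sketch using strsplit is shown below: split each name on the underscore and keep the third piece. It assumes every name has at least three underscore-separated fields, as in the question; vec is the same hypothetical vector name used above.

```r
vec <- c("sp_O00168_PLM_HUMAM", "sp_Q8N1D5_CA158_HUMAN", "sp_Q15818_NPTX1_HUMAN",
         "tr_Q6FGH5_Q6FGH5_HUMAN", "sp_Q9UJ99_CAD22_HUMAN")

# Split on a literal "_" (fixed = TRUE means no regex is involved)
parts <- strsplit(vec, "_", fixed = TRUE)

# Keep the third field of each name
third <- vapply(parts, function(p) p[3], character(1))
print(third)
```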

Regular expressions in R to erase all characters after the first space?

I have data in R that can look like this:
USDZAR Curncy
R157 Govt
SPX Index
In other words, one word, in this case a Bloomberg security identifier, followed by another word, which is the security class, separated by a space. I want to strip out the class and the space to get to:
USDZAR
R157
SPX
What's the most efficient way of doing this in R? Is it regular expressions, or must I do something as I would in MS Excel using the MID and FIND functions? E.g., in Excel I would say:
=MID(#REF, 1, FIND(" ", #REF, 1)-1)
which means return a substring starting at character 1, and ending at the character number of the first space (less 1 to erase the actual space).
Do I need to do something similar in R (in which case, what is the equivalent), or can regular expressions help here? Thanks.
1) Try this where the regular expression matches a space followed by any sequence of characters and sub replaces that with a string having zero characters:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
sub(" .*", "", x)
## [1] "USDZAR" "R157" "SPX"
2) An alternative if you wanted the two words in separate columns in a data frame is as follows. Here as.is = TRUE makes the columns be character rather than factor.
read.table(text = x, as.is = TRUE)
##       V1     V2
## 1 USDZAR Curncy
## 2   R157   Govt
## 3    SPX  Index
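For comparison, a fairly literal translation of the Excel MID/FIND formula from the question might look like the sketch below: regexpr plays the role of FIND and substr the role of MID. The names first_space and ids are mine.

```r
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")

# Position of the first space in each string (-1 if there is none)
first_space <- as.integer(regexpr(" ", x, fixed = TRUE))

# Characters 1 up to the one before the first space.
# A string without a space would come back as "" here.
ids <- substr(x, 1, first_space - 1)
print(ids)
```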
It's pretty easy with stringr:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
library(stringr)
str_split_fixed(x, " ", n = 2)[, 1]
If you're like me, in that regexp's will always remain an inscrutable, frustrating mystery, this clunkier solution also exists:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
unlist(lapply(strsplit(x," ",fixed=TRUE),"[",1))
The fixed=TRUE isn't strictly necessary, just pointing out that you can do this (simple case) w/out really knowing the first thing about regexp's.
Edited to reflect @Wojciech's comment.
The regex would be to search for:
\x20.*
and replace with an empty string.
If you want to know whether it's faster, just time it.
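A minimal way to time it, as a sketch: repeat the sample data to get a vector large enough to measure, then wrap each approach in system.time. The vector size is arbitrary and the timings will differ between machines, so no numbers are shown.

```r
x <- rep(c("USDZAR Curncy", "R157 Govt", "SPX Index"), times = 10000)

# Regex approach
t_regex <- system.time(res_regex <- sub(" .*", "", x))

# strsplit approach
t_split <- system.time(
  res_split <- vapply(strsplit(x, " ", fixed = TRUE), function(p) p[1], character(1))
)

cat("sub:      ", t_regex[["elapsed"]], "s\n")
cat("strsplit: ", t_split[["elapsed"]], "s\n")

# Both approaches should agree before their speed is compared
stopifnot(identical(res_regex, res_split))
```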