Extract first instance of character and digit mix using regex (R)

I have a string from which I would like to extract the first instance of a character/digit mix, i.e. the first instance of the screen resolution below.
The string to match
scrn <- " dimensions: 1280x800 pixels (338x211 millimeters)"
And I would like to get either a vector or a list with entries c(1280, 800).
I can do this rather awkwardly with
strsplit(sapply(strsplit(scrn, " "), "[", 7), "x")
where I knew the 7 by reviewing the strsplit output.
But I assume there is a neat regular-expression way to do this.
My attempt, FWIW (which I would then need to split a couple of times):
gsub("[[:alpha:]]{2,}|(\\:)*(\\s) ", "", scrn)

Is this what you mean?
sub('scrn\\s*<-\\s*"\\s*dimensions:\\s*(\\d+)x(\\d+)', "c(\\1,\\2)", subject, perl=TRUE);
Output:
c(1280,800)

Following @zx81's hint of (\\d+)x(\\d+), this gets it done fairly neatly:
scrn <- " dimensions: 1280x800 pixels (338x211 millimeters)"
g <- regexec("(\\d+)x(\\d+)", scrn)
unlist(regmatches( scrn, g ))[-1]
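To get the numeric vector c(1280, 800) that the question asks for, the captured groups can be passed through as.numeric. A minimal sketch, assuming the first NNNxNNN pair in the string is the resolution; the names m and res are only illustrative:

```r
scrn <- " dimensions: 1280x800 pixels (338x211 millimeters)"

# regexec() reports only the first match, together with its capture groups
m <- regmatches(scrn, regexec("(\\d+)x(\\d+)", scrn))[[1]]

# m[1] is the full match ("1280x800"); drop it and convert the two groups
res <- as.numeric(m[-1])
res
```

Because regexec stops at the first match, the later 338x211 pair is never considered.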

Related

Get just X number of strings from a comma separated cell

I'm having real trouble finding a succinct solution to this simple problem. Currently I have cells that contain many comma-separated items. I just want the first 5.
i.e. cell A1 =
text, another string, something else, here's another one, guess what another string here, and another, hello i'm another string, another string etc, etc, etccccc
and I'm trying to grab just the first 5 strings.
Beyond that, I wonder if I can incorporate a formula such as =LEN(A1)>20
Currently I do this with numerous formulas: =IFERROR(INDEX( SPLIT(C31,","),1)), then =IFERROR(INDEX( SPLIT(C31,","),2)), etc., and then run the LEN formula above.
Is there a simpler solution? Thanks so much.
Try,
=split(replace(A1, find("|", SUBSTITUTE(A1, ", ", "|", 5)), len(A1), ""), ", ", false)
For Excel, with data in A1, in B1 enter:
=TRIM(MID(SUBSTITUTE($A1,",",REPT(" ",999)),COLUMNS($A:A)*999-998,999))
and copy across:
To get all 5 substrings into a single cell, use:
=LEFT(A1,FIND(CHAR(1),SUBSTITUTE(A1,",",CHAR(1),5))-1)
=ARRAY_CONSTRAIN(SPLIT(A1,","),1,5)
=REGEXEXTRACT(A1,"((?:.*?,){5})")
=REGEXEXTRACT(A1,REPT("(.*?),",5))
SPLIT to split by delimiter
ARRAY_CONSTRAIN to constrain the array
REGEX1 to extract 5 comma separated values
. Any character
.*?, Any character repeated an unlimited number of times (the ? means as few times as possible), followed by a ,
{5} Quantifier
REPT to repeat strings
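The lazy pattern described above is not specific to spreadsheets. As a rough sketch only, here is the same idea in R, the language used in the neighbouring questions; the sample string and the names first_five and items are made up for illustration:

```r
s <- "a, b, c, d, e, f, g"

# (?:.*?,){5} lazily matches everything up to and including the fifth comma
first_five <- regmatches(s, regexpr("(?:.*?,){5}", s, perl = TRUE))

# drop the trailing comma, then split on a comma plus optional whitespace
items <- strsplit(sub(",$", "", first_five), ",\\s*")[[1]]
items
```

If the string contains fewer than five commas the pattern does not match and first_five is empty, so that case would need separate handling.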

gsub/regex: deleting beginning and end special characters in a factor variable

I'm working with the following vector:
vec <- c("[0.81, 1]", "0.00 - 0.03", "0.04 - 0.27", "0.28 - 0.5", "0.51 - 0.8")
I'm interested in amending the value "[0.81, 1]" so that it matches the format number - number
Working solution
Presently I address this requirement in the following manner:
vec <- gsub("\\[", "", vec, perl = TRUE)
vec <- gsub("\\]", "", vec, perl = TRUE)
vec <- gsub(",", " - ", vec, fixed = TRUE)
The code produces the desired result:
> vec
[1] "0.81 - 1" "0.00 - 0.03" "0.04 - 0.27" "0.28 - 0.5" "0.51 - 0.8"
Problem
I would like to achieve this with a single, more elaborate gsub call and regex. I would like to come up with regex syntax that would:
Match the first [ and the last ] and delete them (replace them with nothing)
Or, even better, delete the [ and ] signs and insert - in place of the , in the middle. I'm guessing that this may involve gsubfn, so I'm less keen on this solution
In principle, I would like to reduce the number of gsub calls.
Attempts
I tried something like this:
\[(?![[:alnum:]])\] - it doesn't match anything
\[(.*)\] - appears to match the whole thing
What I would like to achieve:
Merge first two gsub calls into one
If possible, merge all 3 calls into one
Use capture groups like this:
sub("\\[(.*), (.*)\\]", "\\1 - \\2", vec)
## [1] "0.81 - 1" "0.00 - 0.03" "0.04 - 0.27" "0.28 - 0.5" "0.51 - 0.8"
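A slightly stricter variant of the same idea, offered as a sketch: anchoring the pattern with ^ and $ and allowing optional whitespace after the comma means that only strings that are entirely of the form [a, b] get rewritten. The name out is illustrative.

```r
vec <- c("[0.81, 1]", "0.00 - 0.03", "0.04 - 0.27", "0.28 - 0.5", "0.51 - 0.8")

# ^\[ and \]$ pin the brackets to the two ends of the string;
# the two groups capture the text on either side of the comma
out <- sub("^\\[(.*),\\s*(.*)\\]$", "\\1 - \\2", vec)
out
```

Elements that do not match the pattern are returned unchanged, which is why the other four values pass through untouched.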
It's not a single regex (maybe a regexpert will provide one), but I merged your first two calls into one and nested the third call inside it, as a one-liner:
v1 <- gsub("\\[|\\]","",gsub(","," -",vec))
Note that I replaced with " -", and not with " - " as there are already spaces after your comma.
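If the goal is only to merge the first two gsub calls, a bracket expression containing both characters is another option. A sketch, relying on the regex rule that ] is taken literally when it comes first inside [...]; the names no_brackets and out are illustrative:

```r
vec <- c("[0.81, 1]", "0.00 - 0.03", "0.04 - 0.27", "0.28 - 0.5", "0.51 - 0.8")

# "[][]" is a character class that contains "]" and "["
no_brackets <- gsub("[][]", "", vec)

# the comma is already followed by a space, so it is replaced with " -"
out <- gsub(",", " -", no_brackets, fixed = TRUE)
out
```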

R: how to convert part of a string to variable name and return its value in the same string?

Suppose I have a string marco <- 'polo'. Is there any way I can embed marco in the middle of another string, e.g. x <- 'John plays water marco.' and have x return 'John plays water polo.'?
EDIT
The solution David kindly offered does work for the hypothetical problem I posted above, but what I was trying to get to was this:
data <- c('kek','koki','ukak','ikka')
V <- c('a|e|i|o|u')
Rather than deleting all vowels, which the solution can manage (gsub(V,'',data)), how do I specify, say, all vowels between two k's? Obviously gsub('kVk','',data) doesn't work. Any help would be greatly appreciated.
If you want all vowels between two "k" letters removed, I propose the following:
V <- '[aeiou]'
data <- c('kek', 'koki', 'ukak', 'ikka', 'keeuiokaeioukaeiousk')
gsub(paste0('(?:\\G(?!^)|[^k]*k(?=[^k]+k))\\K', V), '', data, perl=T)
# [1] "kk" "kki" "ukk" "ikka" "kkksk"
The \G anchor can match at one of two positions: the start of the string, or the end of the previous match. \K resets the starting point of the reported match, so any previously consumed characters are excluded from it, which is similar to a lookbehind.
Or, to use the example as given:
V <- 'a|e|i|o|u' ## or equivalently '[aeiou]'
dd <- c('kek','koki','ukak','ikka','kaaaak')
gsub(paste0("k(",V,")+k"),"kk",dd)
## [1] "kk" "kki" "ukk" "ikka" "kk"
I guessed that you might (?) want to delete multiple vowels between ks; I added a + to the regular expression to do this.
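One limitation of the k(...)+k pattern is that each match consumes both k's, so overlapping cases such as "kekek" are only partly cleaned. A sketch of an alternative that uses lookarounds, which check for the surrounding k's without consuming them. This only removes runs made up purely of vowels that sit directly between two k's, which is an assumption about what is wanted; the extra test string "kekek" is mine.

```r
dd <- c("kek", "koki", "ukak", "ikka", "kaaaak", "kekek")

# consuming version: the second k of "kek" cannot start the next match
consuming <- gsub("k([aeiou])+k", "kk", dd)

# lookaround version: (?<=k) and (?=k) assert the k's but leave them in place
lookaround <- gsub("(?<=k)[aeiou]+(?=k)", "", dd, perl = TRUE)

consuming
lookaround
```

The two results differ only for "kekek", where the lookaround version removes both vowels.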

Splitting a string by space except when contained within quotes

I've been trying to split a space delimited string with double-quotes in R for some time but without success. An example of a string is as follows:
rainfall snowfall "Channel storage" "Rivulet storage"
It's important for us because these are column headings that must match the subsequent data. There are other suggestions on this site as to how to go about this but they don't seem to work with R. One example:
Regex for splitting a string using space when not surrounded by single or double quotes
Here is some code I've been trying:
str <- 'rainfall snowfall "Channel storage" "Rivulet storage"'
regex <- "[^\\s\"']+|\"([^\"]*)\""
split <- strsplit(str, regex, perl=T)
what I would like is
[1] "rainfall" "snowfall" "Channel storage" "Rivulet storage"
but what I get is:
[1] "" " " " " " "
The vector is the right length (which is encouraging) but of course the strings are empty or contain a single space. Any suggestions?
Thanks in advance!
scan will do this for you
scan(text=str, what='character', quiet=TRUE)
[1] "rainfall" "snowfall" "Channel storage" "Rivulet storage"
As mplourde said, use scan. That's by far the cleanest solution (unless you want to keep the \", that is...).
If you want to use regexes to do this (or something not solved that easily by scan), you are still looking at it the wrong way. Your regex matches the pieces you want to keep, so using it as the split pattern in strsplit removes exactly those pieces.
In these scenarios you should look at the function gregexpr, which returns the starting positions of your matches and adds the lengths of the matches as an attribute. The result of this can be passed to the function regmatches(), like this:
str <- 'rainfall snowfall "Channel storage" "Rivulet storage"'
regex <- "[^\\s\"]+|\"([^\"]+)\""
regmatches(str,gregexpr(regex,str,perl=TRUE))
But if you just need the character vector that mplourde's solution returns, go for that. Most likely that's what you're after anyway.
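The regmatches approach keeps the surrounding double quotes on the quoted headings. If they are not wanted, one extra step removes them. A minimal sketch, assuming the headings never contain escaped quotes; the name tokens is illustrative:

```r
str <- 'rainfall snowfall "Channel storage" "Rivulet storage"'
regex <- "[^\\s\"]+|\"([^\"]+)\""

# extract every match: bare words, or whole quoted phrases (quotes included)
tokens <- regmatches(str, gregexpr(regex, str, perl = TRUE))[[1]]

# strip the literal double quotes from the quoted phrases
tokens <- gsub('"', "", tokens, fixed = TRUE)
tokens
```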
You can use strapply from the gsubfn package. With strapply you define the pattern to match rather than the pattern to split on.
str <- "rainfall snowfall 'Channel storage' 'Rivulet storage'"
strapply(str,"\\w+|'\\w+ \\w+'",c)[[1]]
[1] "rainfall" "snowfall" "'Channel storage'" "'Rivulet storage'"

Regular expressions in R to erase all characters after the first space?

I have data in R that can look like this:
USDZAR Curncy
R157 Govt
SPX Index
In other words, one word, in this case a Bloomberg security identifier, followed by another word, which is the security class, separated by a space. I want to strip out the class and the space to get to:
USDZAR
R157
SPX
What's the most efficient way of doing this in R? Is it regular expressions, or must I do something as I would in MS Excel using the MID and FIND functions? E.g. in Excel I would say:
=MID(#REF, 1, FIND(" ", #REF, 1)-1)
which means return a substring starting at character 1, and ending at the character number of the first space (less 1 to erase the actual space).
Do I need to do something similar in R (in which case, what is the equivalent), or can regular expressions help here? Thanks.
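For reference, a fairly literal R counterpart of the MID/FIND formula can be sketched with substr and regexpr. This assumes every element contains at least one space; the names first_space and ids are illustrative.

```r
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")

# position of the first space in each element (-1 if there is none)
first_space <- as.integer(regexpr(" ", x, fixed = TRUE))

# keep characters 1 .. (first space - 1), as MID(..., 1, FIND(...) - 1) does
ids <- substr(x, 1, first_space - 1)
ids
```

An element without a space would come back as an empty string here, so that case would need its own check.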
1) Try this where the regular expression matches a space followed by any sequence of characters and sub replaces that with a string having zero characters:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
sub(" .*", "", x)
## [1] "USDZAR" "R157" "SPX"
2) An alternative if you wanted the two words in separate columns in a data frame is as follows. Here as.is = TRUE makes the columns be character rather than factor.
read.table(text = x, as.is = TRUE)
##       V1     V2
## 1 USDZAR Curncy
## 2   R157   Govt
## 3    SPX  Index
It's pretty easy with stringr:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
library(stringr)
str_split_fixed(x, " ", n = 2)[, 1]
If you're like me, in that regexps will always remain an inscrutable, frustrating mystery, this clunkier solution also exists:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
unlist(lapply(strsplit(x," ",fixed=TRUE),"[",1))
The fixed=TRUE isn't strictly necessary; I'm just pointing out that you can do this (simple case) without really knowing the first thing about regexps.
Edited to reflect @Wojciech's comment.
The regex would be to search for:
\x20.*
and replace with an empty string.
If you want to know whether it's faster, just time it.
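A rough way to time it, as a sketch using only base R's system.time. The repetition count of 1e4 is arbitrary, and the timings will differ between machines and runs, so no figures are shown here.

```r
x <- rep(c("USDZAR Curncy", "R157 Govt", "SPX Index"), 1e4)

# regex approach: delete everything from the first space onwards
t_sub <- system.time(a <- sub(" .*", "", x))

# split approach: split on a literal space and keep the first piece
t_split <- system.time(
  b <- vapply(strsplit(x, " ", fixed = TRUE), `[`, character(1), 1)
)

# both approaches should give the same answer before timings are compared
stopifnot(identical(a, b))
c(sub = t_sub[["elapsed"]], strsplit = t_split[["elapsed"]])
```

For more careful measurements a dedicated benchmarking package would be a better fit than a single system.time call.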