R regex to validate user input is correct

I'm trying to practice writing better code, so I wanted to validate my input sequence with regex to make sure that the first thing I get is a single letter A to H only, and the second is a number 1 to 12 only. I'm new to regex and not sure what the expression should look like. I'm also not sure what type of error R would throw if this is invalidated?
In Perl it would be something like this I think: =~ m/([A-M]?))/)
Here is what I have so far for R:
input_string = "A1"
first_well_row = unlist(strsplit(input_string, ""))[1] # get the letter out
first_well_col = unlist(strsplit(input_string, ""))[2] # get the number out

In R code, using David's regex: [edited to reflect Marek's suggestion]
validate.input <- function(x){
  ## note: [0-9] also accepts column 0; use [1-9] instead to require 1 to 12
  match <- grepl("^[A-Ha-h]([0-9]|(1[0-2]))$", x, perl=TRUE)
  ## as Marek points out, instead of testing the length of the vector
  ## returned by grep() (which will return the index of the match and integer(0)
  ## if there are no matches), we can use grepl()
  if(!match) stop("invalid input")
  list(well_row=substr(x,1,1), well_col=as.integer(substr(x,2,nchar(x))))
}
This simply produces an error. If you want finer control over error handling, look up the documentation for tryCatch. Here is a primitive usage example (instead of getting an error as before, we return NA):
validate.and.catch.error <- function(x){
  tryCatch(validate.input(x), error=function(e) NA)
}
Finally, note that you can use substr to extract your letters and numbers instead of doing strsplit.
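Note that the question asks for columns 1 to 12, while the pattern above also accepts 0 (for example "A0"). A minimal sketch of a stricter variant follows. The function name parse_well and the choice to upper-case the row letter are assumptions made for illustration and are not part of the original answer.

```r
# Hypothetical helper: parse a well ID such as "A1" or "h12".
parse_well <- function(x) {
  # One letter A-H (either case), then 1-9 or 10-12; 0 and 13+ are rejected.
  pattern <- "^([A-Ha-h])([1-9]|1[0-2])$"
  if (length(x) != 1 || !grepl(pattern, x)) {
    stop("invalid input: ", x)
  }
  # The capture groups pull out the row letter and the column number.
  list(well_row = toupper(sub(pattern, "\\1", x)),
       well_col = as.integer(sub(pattern, "\\2", x)))
}

well <- parse_well("h12")
```

Here well$well_row is "H" and well$well_col is 12, while parse_well("A0") and parse_well("A13") both signal an error that tryCatch can handle as shown above.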

You asked specifically for "A through H, then 0-9 or 10-12". Name the exception "InvalidInputException" or something similar that combines "not valid", "input" and "exception".
/^[A-H]([0-9]|(1[0-2]))$/
In pseudocode:
validateData(String data)
    if not data.match("/^[A-H]([0-9]|(1[0-2]))$/")
        throw InvalidInputException
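R has no named exception types in the Java sense, but a condition object can carry a custom class that tryCatch dispatches on. The following is a rough sketch of the pseudocode in R. The class name invalidInputError and the helper names are assumptions chosen for illustration.

```r
# Build an error condition with a custom class (the class name is illustrative).
invalid_input_error <- function(msg) {
  structure(
    list(message = msg, call = NULL),
    class = c("invalidInputError", "error", "condition")
  )
}

validate_data <- function(data) {
  if (!grepl("^[A-H]([0-9]|1[0-2])$", data)) {
    stop(invalid_input_error(paste("not a valid well:", data)))
  }
  invisible(TRUE)
}

# Handle only this class of error; any other error would still propagate.
result <- tryCatch(
  validate_data("Z9"),
  invalidInputError = function(e) conditionMessage(e)
)
```

After this runs, result holds the message "not a valid well: Z9". Because the condition also inherits from "error", a generic error = function(e) handler would catch it too.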

Related

Incrementing a number in a string using sub

There's a string with a (single) number somewhere in it. I want to increment the number by one. Simple, right? I wrote the following without giving it a second thought:
sub("([[:digit:]]+)", as.character(as.numeric("\\1")+1), string)
... and got an NA.
> sub("([[:digit:]]+)", as.character(as.numeric("\\1")+1), "x is 5")
[1] NA
Warning message:
In sub("([[:digit:]]+)", as.character(as.numeric("\\1") + 1), "x is 5") :
NAs introduced by coercion
Why doesn't it work? I know other ways of doing this, so I don't need a "solution". I want to understand why this method fails.
The point is that the backreference is only evaluated during a match operation, and you cannot pass it to any function before that.
When you write as.numeric("\\1"), the as.numeric function receives the two-character string \1 (a backslash and a 1). That string is not a number, so the result is, as expected, NA.
This happens because there is no built-in backreference interpolation in R.
You may use a gsubfn package:
> library(gsubfn)
> s <- "x is 5"
> gsubfn("\\d+", function(x) as.numeric(x) + 1, s)
[1] "x is 6"
It does not work because the arguments of sub are evaluated before they are passed to the regex engine (which gets called by .Internal).
In particular, as.numeric("\\1") evaluates to NA ... after that you're doomed.
It might be easier to think of it differently. You are getting the same error that you would get if you used:
print(as.numeric("\\1")+1)
Remember, the strings are passed to the function, where they are interpreted by the regex engine. The string \\1 is never transformed to be 5, since this calculation is done within the function.
Note that \\1 is not something that can be parsed as a number. NA is R's marker for a missing value, loosely similar to null in other languages:
NA... is a product of operation when you try to access something that is not there
(quoted from an answer by mpiktas).
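To see the evaluation order concretely: the replacement argument is computed once, before sub ever looks at the string. If a base R alternative to gsubfn is wanted, one option (a sketch, certainly not the only way) is to locate the match with regexpr and write the computed value back with the regmatches replacement function.

```r
s <- "x is 5"

# The replacement is evaluated first, on the literal two-character string "\1".
eager <- suppressWarnings(as.numeric("\\1") + 1)
is.na(eager)   # TRUE

# Locate the first run of digits, compute the new value, then write it back.
m <- regexpr("[[:digit:]]+", s)
regmatches(s, m) <- as.character(as.numeric(regmatches(s, m)) + 1)
s              # "x is 6"
```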

R: how to convert part of a string to variable name and return its value in the same string?

Suppose I have a string marco <- 'polo'. Is there any way I can embed marco in the middle of another string, e.g. x <- 'John plays water marco.' and have x return 'John plays water polo.'?
EDIT
The solution David kindly offered does work for the hypothetical problem I posted above, but what I was trying to get to was this:
data <- c('kek','koki','ukak','ikka')
V <- c('a|e|i|o|u')
Rather than deleting all vowels, which the solution can manage (gsub(V,'',data)), how do I specify, say, all vowels between two k's? Obviously gsub('kVk','',data) doesn't work. Any help would be greatly appreciated.
If you want all vowels between two "k" letters removed, I propose the following:
V <- '[aeiou]'
data <- c('kek', 'koki', 'ukak', 'ikka', 'keeuiokaeioukaeiousk')
gsub(paste0('(?:\\G(?!^)|[^k]*k(?=[^k]+k))\\K', V), '', data, perl=TRUE)
# [1] "kk" "kki" "ukk" "ikka" "kkksk"
The \G feature is an anchor that can match at one of two positions: the start of the string or the end of the previous match. \K resets the starting point of the reported match, so any previously consumed characters are no longer included in it, which is similar to a lookbehind.
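As a smaller illustration of \K on its own, here is a sketch that compares it with an ordinary lookbehind. It only handles a single vowel between two k's, so it is deliberately less general than the \G pattern above. The variable names are made up for the example.

```r
x <- c("kek", "koki", "ukak", "ikka")

# \K: match the leading "k" but drop it from the reported match,
# so only the vowel is replaced.
with_keep <- sub("k\\K[aeiou](?=k)", "", x, perl = TRUE)

# The same idea with a lookbehind, which never consumes the "k" at all.
with_lookbehind <- sub("(?<=k)[aeiou](?=k)", "", x, perl = TRUE)

identical(with_keep, with_lookbehind)   # TRUE
```

Both calls give "kk" "kki" "ukk" "ikka" for this input.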
Or, to use the example as given:
V <- 'a|e|i|o|u' ## or equivalently '[aeiou]'
dd <- c('kek','koki','ukak','ikka','kaaaak')
gsub(paste0("k(",V,")+k"),"kk",dd)
## [1] "kk" "kki" "ukk" "ikka" "kk"
I guessed that you might (?) want to delete multiple vowels between ks; I added a + to the regular expression to do this.

'R: Invalid use of repetition operators'

I'm writing a small function in R as follows:
tags.out <- as.character(tags.out)
tags.out.unique <- unique(tags.out)
z <- NROW(tags.out.unique)
for (i in 1:10) {
  l <- length(grep(tags.out.unique[i], x = tags.out))
  tags.count <- append(x = tags.count, values = l)
}
Basically I'm looking to take each element of the unique character vector (tags.out.unique) and count its occurrences in the vector prior to the unique function.
This above section of code works correctly, however, when I replace for (i in 1:10) with for (i in 1:z) or even some number larger than 10 (18000 for example) I get the following error:
Error in grep(tags.out.unique[i], x = tags.out) :
invalid regular expression 'c++', reason 'Invalid use of repetition operators'
I would be extremely grateful if anyone were able to help me understand what's going on here.
Many thanks.
The "+" in "c++" (which you're passing to grep as a pattern string) has a special meaning. However, you want the "+" to be interpreted literally as the character "+", so instead of
grep(pattern="c++", x="this string contains c++")
you should do
grep(pattern="c++", x="this string contains c++", fixed=TRUE)
If you google [regex special characters] or something similar, you'll see that "+", "*" and many others have a special meaning. In your case you want them to be interpreted literally -- see ?grep.
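A short sketch of the two ways to match a literal "+", using a made-up tags vector. It also shows a caveat for the counting loop in the question: grep does substring matching, so even with fixed=TRUE the tag "c" would be counted inside "c++".

```r
tags <- c("c++", "c", "java", "c++ and r")

# fixed = TRUE treats the pattern as a plain string, so "+" is literal.
literal <- grepl("c++", tags, fixed = TRUE)

# Alternatively, keep regex matching and escape each "+".
escaped <- grepl("c\\+\\+", tags)

# Substring matching is not the same as equality: "c" is found inside "c++".
substring_hits <- sum(grepl("c", tags, fixed = TRUE))   # 3
exact_hits <- sum(tags == "c")                          # 1
```

Both literal and escaped come out as TRUE FALSE FALSE TRUE. For counting exact tag values, comparing with == (or tabulating) avoids the substring problem entirely.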
It would appear that one of the elements of tags.out.unique is c++, which is (as the error message plainly states) an invalid regular expression.
Your current code is also inefficient. The R Inferno is worth a read; note especially that growing objects is generally bad form and can be extremely inefficient in some cases. If you are going to have a blanket rule, then "not growing objects" is a better one than "avoid loops".
Given that you are simply trying to count the number of times each value occurs, there is no need for the loop or the regex:
counts <- table(tags.out)
# the unique values
names(counts)
should give you the results you want.
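For a concrete picture, here is a small sketch with made-up tags. The vapply variant is an assumption about what a loop-like version without growing a vector could look like; it is not something the original answer proposed.

```r
tags.out <- c("r", "c++", "r", "regex", "c++", "r")

# One call tabulates every distinct value; no regex is involved,
# so "c++" needs no escaping.
counts <- table(tags.out)
counts[["c++"]]   # 2
counts[["r"]]     # 3

# If a loop-like form is still wanted, vapply fills a result of known size
# instead of growing a vector with append().
tags.out.unique <- unique(tags.out)
tags.count <- vapply(tags.out.unique,
                     function(tag) sum(tags.out == tag),
                     integer(1))
```

tags.count is a named integer vector, so tags.count[["regex"]] gives 1.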

Explain this R regex

Recently here an R question was answered by mrdwab that used a regex that was pretty cool. I liked the response but can't generalize it because I don't understand what's happening (I fooled with the different numeric values being supplied but that didn't really yield anything useful). Could someone break the regex down piece by piece and explain what's happening?
x <- c("WorkerId", "pio_1_1", "pio_1_2", "pio_1_3", "pio_1_4", "pio_2_1",
"pio_2_2", "pio_2_3", "pio_2_4")
gsub("([a-z])_([0-9])_([0-9])", "\\1_\\3\\.\\2", x) #Explain me please
Thank you in advance.
Anywhere you have a lowercase letter and two single digits separated by underscores (e.g., a_1_2), the regex captures the letter and the digits in groups. \\1, \\2, and \\3 refer to those captured groups, in order:
\\1 <- a
\\2 <- 1
\\3 <- 2
The result of running gsub as you have it above is to find every match, swap the two numbers, and replace the second underscore with a dot. So, for example, a_1_2 would become a_2.1.
"\\1_\\3\\.\\2"
# a _ 2  .  1
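Running it on the vector from the question makes the swap visible. This sketch writes the dot in the replacement without a backslash, since a dot has no special meaning there; the variable name out is just for the example.

```r
x <- c("WorkerId", "pio_1_1", "pio_1_2", "pio_1_3", "pio_1_4",
       "pio_2_1", "pio_2_2", "pio_2_3", "pio_2_4")

# Group 1: one lowercase letter; groups 2 and 3: one digit each.
# The replacement keeps the letter, then emits group 3, a dot, and group 2.
out <- gsub("([a-z])_([0-9])_([0-9])", "\\1_\\3.\\2", x)
out[1:3]   # "WorkerId" "pio_1.1" "pio_2.1"
```

"WorkerId" is returned unchanged because it contains no letter_digit_digit sequence for the pattern to match.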

Regular expressions in R to erase all characters after the first space?

I have data in R that can look like this:
USDZAR Curncy
R157 Govt
SPX Index
In other words, one word, in this case a Bloomberg security identifier, followed by another word, which is the security class, separated by a space. I want to strip out the class and the space to get to:
USDZAR
R157
SPX
What's the most efficient way of doing this in R? Is it regular expressions or must I do something as I would in MS Excel using the mid and find commands? eg in Excel I would say:
=MID(#REF, 1, FIND(" ", #REF, 1)-1)
which means return a substring starting at character 1, and ending at the character number of the first space (less 1 to erase the actual space).
Do I need to do something similar in R (in which case, what is the equivalent), or can regular expressions help here? Thanks.
1) Try this where the regular expression matches a space followed by any sequence of characters and sub replaces that with a string having zero characters:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
sub(" .*", "", x)
## [1] "USDZAR" "R157" "SPX"
2) An alternative if you wanted the two words in separate columns in a data frame is as follows. Here as.is = TRUE makes the columns be character rather than factor.
read.table(text = x, as.is = TRUE)
##       V1     V2
## 1 USDZAR Curncy
## 2   R157   Govt
## 3    SPX  Index
It's pretty easy with stringr:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
library(stringr)
str_split_fixed(x, " ", n = 2)[, 1]
If you're like me, in that regexps will always remain an inscrutable, frustrating mystery, this clunkier solution also exists:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
unlist(lapply(strsplit(x," ",fixed=TRUE),"[",1))
The fixed=TRUE isn't strictly necessary; I'm just pointing out that you can handle this simple case without really knowing the first thing about regexps.
Edited to reflect @Wojciech's comment.
The regex would be to search for:
\x20.*
and replace with an empty string.
If you want to know whether it's faster, just time it.
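A rough way to time the approaches from this thread is system.time on a larger input. The sketch below also includes a substr/regexpr variant, which is probably the closest analogue of the Excel MID/FIND formula in the question. The input size is arbitrary and the timings will differ from machine to machine, so none are shown.

```r
x <- rep(c("USDZAR Curncy", "R157 Govt", "SPX Index"), 10000)

# Regex: drop everything from the first space onwards.
t_sub <- system.time(r_sub <- sub(" .*", "", x))

# Split on a literal space and keep the first piece.
t_split <- system.time(
  r_split <- vapply(strsplit(x, " ", fixed = TRUE), `[`, character(1), 1)
)

# Like MID/FIND: find the first space, then take the substring before it.
t_substr <- system.time(
  r_substr <- substr(x, 1, regexpr(" ", x, fixed = TRUE) - 1)
)

# All three should agree before the timings are compared.
stopifnot(identical(r_sub, r_split), identical(r_sub, r_substr))
rbind(sub = t_sub, strsplit = t_split, substr = t_substr)[, "elapsed"]
```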