Incrementing a number in a string using sub - regex

There's a string with a (single) number somewhere in it. I want to increment the number by one. Simple, right? I wrote the following without giving it a second thought:
sub("([[:digit:]]+)", as.character(as.numeric("\\1")+1), string)
... and got an NA.
> sub("([[:digit:]]+)", as.character(as.numeric("\\1")+1), "x is 5")
[1] NA
Warning message:
In sub("([[:digit:]]+)", as.character(as.numeric("\\1") + 1), "x is 5") :
NAs introduced by coercion
Why doesn't it work? I know other ways of doing this, so I don't need a "solution". I want to understand why this method fails.

The point is that the backreference is only evaluated during a match operation, and you cannot pass it to any function before that.
When you write as.numeric("\\1") the as.numeric function accepts a \1 string (a backslash and a 1 char). Thus, the result is expected, NA.
This happens because there is no built-in backreference interpolation in R.
You may use a gsubfn package:
> library(gsubfn)
> s <- "x is 5"
> gsubfn("\\d+", function(x) as.numeric(x) + 1, s)
[1] "x is 6"

It does not work because the arguments of sub are evaluated before they are passed to the regex engine (which gets called by .Internal).
In particular, as.numeric("\\1") evaluates to NA ... after that you're doomed.

It might be easier to think of it differently. You are getting the same error that you would get if you used:
print(as.numeric("\\1")+1)
Remember, the strings are passed to the function, where they are interpreted by the regex engine. The string \\1 is never transformed to be 5, since this calculation is done within the function.
Note that \\1 is not something that works as a number. NA seems to be similar to null in other languages:
NA... is a product of operation when you try to access something that is not there
From mpiktas' answer here.

Related

Regex Multiple rows [duplicate]

I'm trying to get the list of all digits preceding a hyphen in a given string (let's say in cell A1), using a Google Sheets regex formula :
=REGEXEXTRACT(A1, "\d-")
My problem is that it only returns the first match... how can I get all matches?
Example text:
"A1-Nutrition;A2-ActPhysiq;A2-BioMeta;A2-Patho-jour;A2-StgMrktg2;H2-Bioth2/EtudeCas;H2-Bioth2/Gemmo;H2-Bioth2/Oligo;H2-Bioth2/Opo;H2-Bioth2/Organo;H3-Endocrino;H3-Génétiq"
My formula returns 1-, whereas I want to get 1-2-2-2-2-2-2-2-2-2-3-3- (either as an array or concatenated text).
I know I could use a script or another function (like SPLIT) to achieve the desired result, but what I really want to know is how I could get a re2 regular expression to return such multiple matches in a "REGEX.*" Google Sheets formula.
Something like the "global - Don't return after first match" option on regex101.com
I've also tried removing the undesired text with REGEXREPLACE, with no success either (I couldn't get rid of other digits not preceding a hyphen).
Any help appreciated!
Thanks :)
You can actually do this in a single formula using regexreplace to surround all the values with a capture group instead of replacing the text:
=join("",REGEXEXTRACT(A1,REGEXREPLACE(A1,"(\d-)","($1)")))
basically what it does is surround all instances of the \d- with a "capture group" then using regex extract, it neatly returns all the captures. if you want to join it back into a single string you can just use join to pack it back into a single cell:
You may create your own custom function in the Script Editor:
function ExtractAllRegex(input, pattern,groupId) {
return [Array.from(input.matchAll(new RegExp(pattern,'g')), x=>x[groupId])];
}
Or, if you need to return all matches in a single cell joined with some separator:
function ExtractAllRegex(input, pattern,groupId,separator) {
return Array.from(input.matchAll(new RegExp(pattern,'g')), x=>x[groupId]).join(separator);
}
Then, just call it like =ExtractAllRegex(A1, "\d-", 0, ", ").
Description:
input - current cell value
pattern - regex pattern
groupId - Capturing group ID you want to extract
separator - text used to join the matched results.
Edit
I came up with more general solution:
=regexreplace(A1,"(.)?(\d-)|(.)","$2")
It replaces any text except the second group match (\d-) with just the second group $2.
"(.)?(\d-)|(.)"
1 2 3
Groups are in ()
---------------------------------------
"$2" -- means return the group number 2
Learn regular expressions: https://regexone.com
Try this formula:
=regexreplace(regexreplace(A1,"[^\-0-9]",""),"(\d-)|(.)","$1")
It will handle string like this:
"A1-Nutrition;A2-ActPhysiq;A2-BioM---eta;A2-PH3-Généti***566*9q"
with output:
1-2-2-2-3-
I wasn't able to get the accepted answer to work for my case. I'd like to do it that way, but needed a quick solution and went with the following:
Input:
1111 days, 123 hours 1234 minutes and 121 seconds
Expected output:
1111 123 1234 121
Formula:
=split(REGEXREPLACE(C26,"[a-z,]"," ")," ")
The shortest possible regex:
=regexreplace(A1,".?(\d-)|.", "$1")
Which returns 1-2-2-2-2-2-2-2-2-2-3-3- for "A1-Nutrition;A2-ActPhysiq;A2-BioMeta;A2-Patho-jour;A2-StgMrktg2;H2-Bioth2/EtudeCas;H2-Bioth2/Gemmo;H2-Bioth2/Oligo;H2-Bioth2/Opo;H2-Bioth2/Organo;H3-Endocrino;H3-Génétiq".
Explanation of regex:
.? -- optional character
(\d-) -- capture group 1 with a digit followed by a dash (specify (\d+-) multiple digits)
| -- logical or
. -- any character
the replacement "$1" uses just the capture group 1, and discards anything else
Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex
This seems to work and I have tried to verify it.
The logic is
(1) Replace letter followed by hyphen with nothing
(2) Replace any digit not followed by a hyphen with nothing
(3) Replace everything which is not a digit or hyphen with nothing
=regexreplace(A1,"[a-zA-Z]-|[0-9][^-]|[a-zA-Z;/é]","")
Result
1-2-2-2-2-2-2-2-2-2-3-3-
Analysis
I had to step through these procedurally to convince myself that this was correct. According to this reference when there are alternatives separated by the pipe symbol, regex should match them in order left-to-right. The above formula doesn't work properly unless rule 1 comes first (otherwise it reduces all characters except a digit or hyphen to null before rule (1) can come into play and you get an extra hyphen from "Patho-jour").
Here are some examples of how I think it must deal with the text
The solution to capture groups with RegexReplace and then do the RegexExctract works here too, but there is a catch.
=join("",REGEXEXTRACT(A1,REGEXREPLACE(A1,"(\d-)","($1)")))
If the cell that you are trying to get the values has Special Characters like parentheses "(" or question mark "?" the solution provided won´t work.
In my case, I was trying to list all “variables text” contained in the cell. Those “variables text “ was wrote inside like that: “{example_name}”. But the full content of the cell had special characters making the regex formula do break. When I removed theses specials characters, then I could list all captured groups like the solution did.
There are two general ('Excel' / 'native' / non-Apps Script) solutions to return an array of regex matches in the style of REGEXEXTRACT:
Method 1)
insert a delimiter around matches, remove junk, and call SPLIT
Regexes work by iterating over the string from left to right, and 'consuming'. If we are careful to consume junk values, we can throw them away.
(This gets around the problem faced by the currently accepted solution, which is that as Carlos Eduardo Oliveira mentions, it will obviously fail if the corpus text contains special regex characters.)
First we pick a delimiter, which must not already exist in the text. The proper way to do this is to parse the text to temporarily replace our delimiter with a "temporary delimiter", like if we were going to use commas "," we'd first replace all existing commas with something like "<<QUOTED-COMMA>>" then un-replace them later. BUT, for simplicity's sake, we'll just grab a random character such as  from the private-use unicode blocks and use it as our special delimiter (note that it is 2 bytes... google spreadsheets might not count bytes in graphemes in a consistent way, but we'll be careful later).
=SPLIT(
LAMBDA(temp,
MID(temp, 1, LEN(temp)-LEN(""))
)(
REGEXREPLACE(
"xyzSixSpaces:[ ]123ThreeSpaces:[ ]aaaa 12345",".*?( |$)",
"$1"
)
),
""
)
We just use a lambda to define temp="match1match2match3", then use that to remove the last delimiter into "match1match2match3", then SPLIT it.
Taking COLUMNS of the result will prove that the correct result is returned, i.e. {" ", " ", " "}.
This is a particularly good function to turn into a Named Function, and call it something like REGEXGLOBALEXTRACT(text,regex) or REGEXALLEXTRACT(text,regex), e.g.:
=SPLIT(
LAMBDA(temp,
MID(temp, 1, LEN(temp)-LEN(""))
)(
REGEXREPLACE(
text,
".*?("&regex&"|$)",
"$1"
)
),
""
)
Method 2)
use recursion
With LAMBDA (i.e. lets you define a function like any other programming language), you can use some tricks from the well-studied lambda calculus and function programming: you have access to recursion. Defining a recursive function is confusing because there's no easy way for it to refer to itself, so you have to use a trick/convention:
trick for recursive functions: to actually define a function f which needs to refer to itself, instead define a function that takes a parameter of itself and returns the function you actually want; pass in this 'convention' to the Y-combinator to turn it into an actual recursive function
The plumbing which takes such a function work is called the Y-combinator. Here is a good article to understand it if you have some programming background.
For example to get the result of 5! (5 factorial, i.e. implement our own FACT(5)), we could define:
Named Function Y(f)=LAMBDA(f, (LAMBDA(x,x(x)))( LAMBDA(x, f(LAMBDA(y, x(x)(y)))) ) ) (this is the Y-combinator and is magic; you don't have to understand it to use it)
Named Function MY_FACTORIAL(n)=
Y(LAMBDA(self,
LAMBDA(n,
IF(n=0, 1, n*self(n-1))
)
))
result of MY_FACTORIAL(5): 120
The Y-combinator makes writing recursive functions look relatively easy, like an introduction to programming class. I'm using Named Functions for clarity, but you could just dump it all together at the expense of sanity...
=LAMBDA(Y,
Y(LAMBDA(self, LAMBDA(n, IF(n=0,1,n*self(n-1))) ))(5)
)(
LAMBDA(f, (LAMBDA(x,x(x)))( LAMBDA(x, f(LAMBDA(y, x(x)(y)))) ) )
)
How does this apply to the problem at hand? Well a recursive solution is as follows:
in pseudocode below, I use 'function' instead of LAMBDA, but it's the same thing:
// code to get around the fact that you can't have 0-length arrays
function emptyList() {
return {"ignore this value"}
}
function listToArray(myList) {
return OFFSET(myList,0,1)
}
function allMatches(text, regex) {
allMatchesHelper(emptyList(), text, regex)
}
function allMatchesHelper(resultsToReturn, text, regex) {
currentMatch = REGEXEXTRACT(...)
if (currentMatch succeeds) {
textWithoutMatch = SUBSTITUTE(text, currentMatch, "", 1)
return allMatches(
{resultsToReturn,currentMatch},
textWithoutMatch,
regex
)
} else {
return listToArray(resultsToReturn)
}
}
Unfortunately, the recursive approach is quadratic order of growth (because it's appending the results over and over to itself, while recreating the giant search string with smaller and smaller bites taken out of it, so 1+2+3+4+5+... = big^2, which can add up to a lot of time), so may be slow if you have many many matches. It's better to stay inside the regex engine for speed, since it's probably highly optimized.
You could of course avoid using Named Functions by doing temporary bindings with LAMBDA(varName, expr)(varValue) if you want to use varName in an expression. (You can define this pattern as a Named Function =cont(varValue) to invert the order of the parameters to keep code cleaner, or not.)
Whenever I use varName = varValue, write that instead.
to see if a match succeeds, use ISNA(...)
It would look something like:
Named Function allMatches(resultsToReturn, text, regex):
UNTESTED:
LAMBDA(helper,
OFFSET(
helper({"ignore"}, text, regex),
0,1)
)(
Y(LAMBDA(helperItself,
LAMBDA(results, partialText,
LAMBDA(currentMatch,
IF(ISNA(currentMatch),
results,
LAMBDA(textWithoutMatch,
helperItself({results,currentMatch}, textWithoutMatch)
)(
SUBSTITUTE(partialText, currentMatch, "", 1)
)
)
)(
REGEXEXTRACT(partialText, regex)
)
)
))
)

Replacing the first vowel-consonent occurence with consonent-vowel using sub in R

I know that it should be something like this but definitely I am missing something in the syntax:
yy=sub(r'\b[aeiou][^aeiou]*',r'\b[^aeiou][aeiou]*',"abmmmm")
I expect to have "bammmm" as output
Error: unexpected string constant in "yy=sub(r'\b[aeiou][^aeiou]*'"
I am not sure how is the exact syntax.
Please run your code in RStudio or any R compiler. I am new to regex and you giving me Python code wouldn't help me to understand the situation. Thanks!
This is what you want
yy=sub("\\b([aeiou])([^aeiuos])","\\2\\1","abmm")
I'll explain how it works:
If you ask me to substitute any vowel-consonent with any consonent-vowel? It doesn't make much sense. Should I change ab to ba, ce, or da? It can be any one of them. You never specified any relationship between the vowel in vowel-consonent and the vowel in consonent-vowel. Therefore, it doesn't make sense to put a regular expression in the 2nd argument. As a result, you are not allowed to.
If you want to achieve what you asked for. You can add brackets to the regular expression in the 1st argument. The first ( marks group 1, second ( marks group 2, etc. (note, group 0 is the whole matched string.) You can use \1, \2, ... in the second argument to put the matched group there.
As an alternative to using a regular expression for this, there's a nice string reversal function in example(strsplit)
> strReverse <- function(x)
sapply(lapply(strsplit(x, NULL), rev), paste, collapse="")
> dd <- "abmmmm"
> paste(strReverse(substr(dd, 1, 2)), substr(dd, 3, nchar(dd)), sep = "")
[1] "bammmm"

'R: Invalid use of repetition operators'

I'm writing a small function in R as follows:
tags.out <- as.character(tags.out)
tags.out.unique <- unique(tags.out)
z <- NROW(tags.out.unique)
for (i in 1:10) {
l <- length(grep(tags.out.unique[i], x = tags.out))
tags.count <- append(x = tags.count, values = l) }
Basically I'm looking to take each element of the unique character vector (tags.out.unique) and count it's occurrence in the vector prior to the unique function.
This above section of code works correctly, however, when I replace for (i in 1:10) with for (i in 1:z) or even some number larger than 10 (18000 for example) I get the following error:
Error in grep(tags.out.unique[i], x = tags.out) :
invalid regular expression 'c++', reason 'Invalid use of repetition operators
I would be extremely grateful if anyone were able to help me understand what's going on here.
Many thanks.
The "+" in "c++" (which you're passing to grep as a pattern string) has a special meaning. However, you want the "+" to be interpreted literally as the character "+", so instead of
grep(pattern="c++", x="this string contains c++")
you should do
grep(pattern="c++", x="this string contains c++", fixed=TRUE)
If you google [regex special characters] or something similar, you'll see that "+", "*" and many others have a special meaning. In your case you want them to be interpreted literally -- see ?grep.
It would appear that one of the elements of tags.out_unique is c++ which is (as the error message plainly states) an invalid regular expression.
You are currently programming inefficiently. The R-inferno is worth a read, noting especially that Growing objects is generally bad form -- it can be extremely inefficient in some cases. If you are going to have a blanket rule, then "not growing objects" is a better one than "avoid loops".
Given you are simply trying to count the number of times each value occurs there is no need for the loop or regex
counts <- table(tags.out)
# the unique values
names(counts)
should give you the results you want.

Regular expressions in R to erase all characters after the first space?

I have data in R that can look like this:
USDZAR Curncy
R157 Govt
SPX Index
In other words, one word, in this case a Bloomberg security identifier, followed by another word, which is the security class, separated by a space. I want to strip out the class and the space to get to:
USDZAR
R157
SPX
What's the most efficient way of doing this in R? Is it regular expressions or must I do something as I would in MS Excel using the mid and find commands? eg in Excel I would say:
=MID(#REF, 1, FIND(" ", #REF, 1)-1)
which means return a substring starting at character 1, and ending at the character number of the first space (less 1 to erase the actual space).
Do I need to do something similar in R (in which case, what is the equivalent), or can regular expressions help here? Thanks.
1) Try this where the regular expression matches a space followed by any sequence of characters and sub replaces that with a string having zero characters:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
sub(" .*", "", x)
## [1] "USDZAR" "R157" "SPX"
2) An alternative if you wanted the two words in separate columns in a data frame is as follows. Here as.is = TRUE makes the columns be character rather than factor.
read.table(text = x, as.is = TRUE)
## V1 V2
## 1 USDZAR Curncy
## 2 R157 Govt
## 3 SPX Index
It's pretty easy with stringr:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
library(stringr)
str_split_fixed(x, " ", n = 2)[, 1]
If you're like me, in that regexp's will always remain an inscrutable, frustrating mystery, this clunkier solution also exists:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
unlist(lapply(strsplit(x," ",fixed=TRUE),"[",1))
The fixed=TRUE isn't strictly necessary, just pointing out that you can do this (simple case) w/out really knowing the first thing about regexp's.
Edited to reflect #Wojciech's comment.
The regex would be to search for:
\x20.*
and replace with an empty string.
If you want to know whether it's faster, just time it.

R regex to validate user input is correct

I'm trying to practice writing better code, so I wanted to validate my input sequence with regex to make sure that the first thing I get is a single letter A to H only, and the second is a number 1 to 12 only. I'm new to regex and not sure what the expression should look like. I'm also not sure what type of error R would throw if this is invalidated?
In Perl it would be something like this I think: =~ m/([A-M]?))/)
Here is what I have so far for R:
input_string = "A1"
first_well_row = unlist(strsplit(input_string, ""))[1] # get the letter out
first_well_col = unlist(strsplit(input_string, ""))[2] # get the number out
In R code, using David's regex: [edited to reflect Marek's suggestion]
validate.input <- function(x){
match <- grepl("^[A-Ha-h]([0-9]|(1[0-2]))$",x,perl=TRUE)
## as Marek points out, instead of testing the length of the vector
## returned by grep() (which will return the index of the match and integer(0)
## if there are no matches), we can use grepl()
if(!match) stop("invalid input")
list(well_row=substr(x,1,1), well_col=as.integer(substr(x,2,nchar(x))))
}
This simply produces an error. If you want finer control over error handling, look up the documentation for tryCatch, here's a primitive usage example (instead of getting an error as before we'll return NA):
validate.and.catch.error <- function(x){
tryCatch(validate.input(x), error=function(e) NA)
}
Finally, note that you can use substr to extract your letters and numbers instead of doing strsplit.
You asked specifically for "A through H, then 0-9 or 10-12". Call the exception "InvalidInputException" or any similarly named object- "Not Valid" "Input" "Exception"
/^[A-H]([0-9]|(1[0-2]))$/
In Pseudocode:
validateData(String data)
if not data.match("/^[A-H]([0-9]|(1[0-2]))$/")
throw InvalidInputException