I would like to use regex to replace everything (except a specific pattern) with empty string in BigQuery. I have following values:
AX/88/8888888
AX/99/999999
AX/11/222222 - AX/22/33333 - AX/999/99999
BX/99/9999
1234455121
AX/00/888888 // BX/890/90890
NULL
[XYZ-ASA
BX/890/90890 + AX/10/1010101
AX/99/9999M
AX/111/111,AX-99
AX/11/222222 BX/99/99 AX/22/33333
The pattern will always have "AX" in the beginning, then a slash (/) and some numbers and slash(/) again and some numbers after it. (The pattern would always be AX/\d+/\d+)
I would like to replace anything (any character,brackets,digit etc) that doesn't follow that pattern mention above.
For the cases where the pattern doesn't match at all for example (BX/99/9999,1234455121, NULL,[XYZ-ASA) are the only cases from the above dataset.
** doesn't match at all means cases where the entire values doesn't have any value
that matches with the AX/\d+/\d+. In those situations, I would like to return then original text as final output.
The case where we have matching pattern for example AX/00/888888 // BX/890/90890, AX/111/111,AX-99 the pattern matches but the latter part needs to be replaced i.e [// BX/890/90890] and [,AX-99] , which should then return only the AX/00/888888, and AX/111/111 as final output.
The expected output from the above example is following:
AX/88/8888888
AX/99/999999
AX/11/222222 AX/22/33333 AX/999/99999
BX/99/9999
1234455121
AX/00/888888
NULL
[XYZ-ASA
AX/10/1010101
AX/99/9999
AX/111/111
AX/11/222222 AX/22/33333
Later I would like to split all the values by space, to get each AX/xx/xx on a different row where I have multiple of those for example case 3 from above would produce 3 rows.
AX/88/8888888
AX/99/999999
AX/11/222222
AX/22/33333
AX/999/99999
BX/99/9999
1234455121
AX/00/888888
NULL
[XYZ-ASA
AX/10/1010101
AX/99/9999
AX/111/111
AX/11/222222
AX/22/33333
Use below
select coalesce(result, col) as col
from your_table
left join unnest(regexp_extract_all(col, r'AX/\d+/\d+')) result
if applied to sample data in your question
output is
I'm trying to get the list of all digits preceding a hyphen in a given string (let's say in cell A1), using a Google Sheets regex formula :
=REGEXEXTRACT(A1, "\d-")
My problem is that it only returns the first match... how can I get all matches?
Example text:
"A1-Nutrition;A2-ActPhysiq;A2-BioMeta;A2-Patho-jour;A2-StgMrktg2;H2-Bioth2/EtudeCas;H2-Bioth2/Gemmo;H2-Bioth2/Oligo;H2-Bioth2/Opo;H2-Bioth2/Organo;H3-Endocrino;H3-Génétiq"
My formula returns 1-, whereas I want to get 1-2-2-2-2-2-2-2-2-2-3-3- (either as an array or concatenated text).
I know I could use a script or another function (like SPLIT) to achieve the desired result, but what I really want to know is how I could get a re2 regular expression to return such multiple matches in a "REGEX.*" Google Sheets formula.
Something like the "global - Don't return after first match" option on regex101.com
I've also tried removing the undesired text with REGEXREPLACE, with no success either (I couldn't get rid of other digits not preceding a hyphen).
Any help appreciated!
Thanks :)
You can actually do this in a single formula using regexreplace to surround all the values with a capture group instead of replacing the text:
=join("",REGEXEXTRACT(A1,REGEXREPLACE(A1,"(\d-)","($1)")))
basically what it does is surround all instances of the \d- with a "capture group" then using regex extract, it neatly returns all the captures. if you want to join it back into a single string you can just use join to pack it back into a single cell:
You may create your own custom function in the Script Editor:
function ExtractAllRegex(input, pattern,groupId) {
return [Array.from(input.matchAll(new RegExp(pattern,'g')), x=>x[groupId])];
}
Or, if you need to return all matches in a single cell joined with some separator:
function ExtractAllRegex(input, pattern,groupId,separator) {
return Array.from(input.matchAll(new RegExp(pattern,'g')), x=>x[groupId]).join(separator);
}
Then, just call it like =ExtractAllRegex(A1, "\d-", 0, ", ").
Description:
input - current cell value
pattern - regex pattern
groupId - Capturing group ID you want to extract
separator - text used to join the matched results.
Edit
I came up with more general solution:
=regexreplace(A1,"(.)?(\d-)|(.)","$2")
It replaces any text except the second group match (\d-) with just the second group $2.
"(.)?(\d-)|(.)"
1 2 3
Groups are in ()
---------------------------------------
"$2" -- means return the group number 2
Learn regular expressions: https://regexone.com
Try this formula:
=regexreplace(regexreplace(A1,"[^\-0-9]",""),"(\d-)|(.)","$1")
It will handle string like this:
"A1-Nutrition;A2-ActPhysiq;A2-BioM---eta;A2-PH3-Généti***566*9q"
with output:
1-2-2-2-3-
I wasn't able to get the accepted answer to work for my case. I'd like to do it that way, but needed a quick solution and went with the following:
Input:
1111 days, 123 hours 1234 minutes and 121 seconds
Expected output:
1111 123 1234 121
Formula:
=split(REGEXREPLACE(C26,"[a-z,]"," ")," ")
The shortest possible regex:
=regexreplace(A1,".?(\d-)|.", "$1")
Which returns 1-2-2-2-2-2-2-2-2-2-3-3- for "A1-Nutrition;A2-ActPhysiq;A2-BioMeta;A2-Patho-jour;A2-StgMrktg2;H2-Bioth2/EtudeCas;H2-Bioth2/Gemmo;H2-Bioth2/Oligo;H2-Bioth2/Opo;H2-Bioth2/Organo;H3-Endocrino;H3-Génétiq".
Explanation of regex:
.? -- optional character
(\d-) -- capture group 1 with a digit followed by a dash (specify (\d+-) multiple digits)
| -- logical or
. -- any character
the replacement "$1" uses just the capture group 1, and discards anything else
Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex
This seems to work and I have tried to verify it.
The logic is
(1) Replace letter followed by hyphen with nothing
(2) Replace any digit not followed by a hyphen with nothing
(3) Replace everything which is not a digit or hyphen with nothing
=regexreplace(A1,"[a-zA-Z]-|[0-9][^-]|[a-zA-Z;/é]","")
Result
1-2-2-2-2-2-2-2-2-2-3-3-
Analysis
I had to step through these procedurally to convince myself that this was correct. According to this reference when there are alternatives separated by the pipe symbol, regex should match them in order left-to-right. The above formula doesn't work properly unless rule 1 comes first (otherwise it reduces all characters except a digit or hyphen to null before rule (1) can come into play and you get an extra hyphen from "Patho-jour").
Here are some examples of how I think it must deal with the text
The solution to capture groups with RegexReplace and then do the RegexExctract works here too, but there is a catch.
=join("",REGEXEXTRACT(A1,REGEXREPLACE(A1,"(\d-)","($1)")))
If the cell that you are trying to get the values has Special Characters like parentheses "(" or question mark "?" the solution provided won´t work.
In my case, I was trying to list all “variables text” contained in the cell. Those “variables text “ was wrote inside like that: “{example_name}”. But the full content of the cell had special characters making the regex formula do break. When I removed theses specials characters, then I could list all captured groups like the solution did.
There are two general ('Excel' / 'native' / non-Apps Script) solutions to return an array of regex matches in the style of REGEXEXTRACT:
Method 1)
insert a delimiter around matches, remove junk, and call SPLIT
Regexes work by iterating over the string from left to right, and 'consuming'. If we are careful to consume junk values, we can throw them away.
(This gets around the problem faced by the currently accepted solution, which is that as Carlos Eduardo Oliveira mentions, it will obviously fail if the corpus text contains special regex characters.)
First we pick a delimiter, which must not already exist in the text. The proper way to do this is to parse the text to temporarily replace our delimiter with a "temporary delimiter", like if we were going to use commas "," we'd first replace all existing commas with something like "<<QUOTED-COMMA>>" then un-replace them later. BUT, for simplicity's sake, we'll just grab a random character such as from the private-use unicode blocks and use it as our special delimiter (note that it is 2 bytes... google spreadsheets might not count bytes in graphemes in a consistent way, but we'll be careful later).
=SPLIT(
LAMBDA(temp,
MID(temp, 1, LEN(temp)-LEN(""))
)(
REGEXREPLACE(
"xyzSixSpaces:[ ]123ThreeSpaces:[ ]aaaa 12345",".*?( |$)",
"$1"
)
),
""
)
We just use a lambda to define temp="match1match2match3", then use that to remove the last delimiter into "match1match2match3", then SPLIT it.
Taking COLUMNS of the result will prove that the correct result is returned, i.e. {" ", " ", " "}.
This is a particularly good function to turn into a Named Function, and call it something like REGEXGLOBALEXTRACT(text,regex) or REGEXALLEXTRACT(text,regex), e.g.:
=SPLIT(
LAMBDA(temp,
MID(temp, 1, LEN(temp)-LEN(""))
)(
REGEXREPLACE(
text,
".*?("®ex&"|$)",
"$1"
)
),
""
)
Method 2)
use recursion
With LAMBDA (i.e. lets you define a function like any other programming language), you can use some tricks from the well-studied lambda calculus and function programming: you have access to recursion. Defining a recursive function is confusing because there's no easy way for it to refer to itself, so you have to use a trick/convention:
trick for recursive functions: to actually define a function f which needs to refer to itself, instead define a function that takes a parameter of itself and returns the function you actually want; pass in this 'convention' to the Y-combinator to turn it into an actual recursive function
The plumbing which takes such a function work is called the Y-combinator. Here is a good article to understand it if you have some programming background.
For example to get the result of 5! (5 factorial, i.e. implement our own FACT(5)), we could define:
Named Function Y(f)=LAMBDA(f, (LAMBDA(x,x(x)))( LAMBDA(x, f(LAMBDA(y, x(x)(y)))) ) ) (this is the Y-combinator and is magic; you don't have to understand it to use it)
Named Function MY_FACTORIAL(n)=
Y(LAMBDA(self,
LAMBDA(n,
IF(n=0, 1, n*self(n-1))
)
))
result of MY_FACTORIAL(5): 120
The Y-combinator makes writing recursive functions look relatively easy, like an introduction to programming class. I'm using Named Functions for clarity, but you could just dump it all together at the expense of sanity...
=LAMBDA(Y,
Y(LAMBDA(self, LAMBDA(n, IF(n=0,1,n*self(n-1))) ))(5)
)(
LAMBDA(f, (LAMBDA(x,x(x)))( LAMBDA(x, f(LAMBDA(y, x(x)(y)))) ) )
)
How does this apply to the problem at hand? Well a recursive solution is as follows:
in pseudocode below, I use 'function' instead of LAMBDA, but it's the same thing:
// code to get around the fact that you can't have 0-length arrays
function emptyList() {
return {"ignore this value"}
}
function listToArray(myList) {
return OFFSET(myList,0,1)
}
function allMatches(text, regex) {
allMatchesHelper(emptyList(), text, regex)
}
function allMatchesHelper(resultsToReturn, text, regex) {
currentMatch = REGEXEXTRACT(...)
if (currentMatch succeeds) {
textWithoutMatch = SUBSTITUTE(text, currentMatch, "", 1)
return allMatches(
{resultsToReturn,currentMatch},
textWithoutMatch,
regex
)
} else {
return listToArray(resultsToReturn)
}
}
Unfortunately, the recursive approach is quadratic order of growth (because it's appending the results over and over to itself, while recreating the giant search string with smaller and smaller bites taken out of it, so 1+2+3+4+5+... = big^2, which can add up to a lot of time), so may be slow if you have many many matches. It's better to stay inside the regex engine for speed, since it's probably highly optimized.
You could of course avoid using Named Functions by doing temporary bindings with LAMBDA(varName, expr)(varValue) if you want to use varName in an expression. (You can define this pattern as a Named Function =cont(varValue) to invert the order of the parameters to keep code cleaner, or not.)
Whenever I use varName = varValue, write that instead.
to see if a match succeeds, use ISNA(...)
It would look something like:
Named Function allMatches(resultsToReturn, text, regex):
UNTESTED:
LAMBDA(helper,
OFFSET(
helper({"ignore"}, text, regex),
0,1)
)(
Y(LAMBDA(helperItself,
LAMBDA(results, partialText,
LAMBDA(currentMatch,
IF(ISNA(currentMatch),
results,
LAMBDA(textWithoutMatch,
helperItself({results,currentMatch}, textWithoutMatch)
)(
SUBSTITUTE(partialText, currentMatch, "", 1)
)
)
)(
REGEXEXTRACT(partialText, regex)
)
)
))
)
In have following code:
set[str] noNnoE = { v | str v <- eu, (/\b[^eEnN]*\b/ := v) };
The goal is to filter out of a set of strings (called 'eu'), those strings that have no 'e' or 'n' in them (both upper- and lowercase). The regular expression I've provided:
/\b[^eEnN]?\b/
seems to work like it should, when I try it out in an online regex-tester.
When trying it out in the Rascel terminal it doesn't seem to work:
rascal>/\b[^eEnN]*\b/ := "Slander";
bool: true
I expected no match. What am I missing here? I'm using the latest (stable) Rascal release in Eclipse Oxygen1a.
Actually, the online regex-tester is giving the same match that we are giving. You can look at the match as follows:
if (/<w1:\b[^eEnN]?\b>/ := "Slander")
println("The match is: |<w1>|");
This is assigning the matched string to w1 and then printing it between the vertical bars, assuming the match succeeds (if it doesn't, it returns false, so the body of the if will not execute). If you do this, you will get back a match to the empty string:
The match is: ||
The online regex tester says the same thing:
Match 1
Full match 0-0 ''
If you want to prevent this, you can force at least one occurrence of the characters you are looking for by using a +, versus a ?:
rascal>/\b[^eEnN]+\b/ := "Slander";
bool: false
Note that you can also make the regex match case insensitive by following it with an i, like so:
/\b[^en]+\b/i
This may make it easier to write if you need to add more characters into the character class.
This solution (/\b[^en]+\b/i) doesn't work for strings consisting of two words, such as the Czech Republic.
Try /\b[^en]+\b$/i. That seems to work for me.
I am currently using this regular expression to loosely validate a DNS address:
^[A-Za-z0-9_]+(\.[A-Za-z0-9_]+)*$
Which would match things like hello.com, hello, and hello.com.com.com. I was trying to replicate it exactly as it is into a Lua pattern. I came up with the following Lua pattern:
^([%d%a_]+(%.[%d%a_]+)*)$
So that I can use the following code to validate the DNS address:
local s = "hello.com"
print(s:match("^([%d%a_]+(%.[%d%a_]+)*)$"))
For some reason this always fails, although it looks like a 1:1 copy of the regular expression.
Any ideas why?
Lua patterns are not regular expressions. You cannot translate the ^[A-Za-z0-9_]+(\.[A-Za-z0-9_]+)*$ to ^([%d%a_]+(%.[%d%a_]+)*)$, because you cannot apply quantifiers to groups in Lua (see Limitations of Lua patterns).
Judging by the ^[A-Za-z0-9_]+(\.[A-Za-z0-9_]+)*$ regex, the rules are:
String can consist of one or more alphanumeric or underscore or dot characters
String cannot start with a dot
String cannot end with a dot
String cannot contain 2 consecutive dots
You can use the following work-around:
function pattern_checker(v)
return string.match(v, '^[%d%a_.]+$') ~= nil and -- check if the string only contains digits/letters/_/., one or more
string.sub(v, 0, 1) ~= '.' and -- check if the first char is not '.'
string.sub(v, -1) ~= '.' and -- check if the last char is not '.'
string.find(v, '%.%.') == nil -- check if there are 2 consecutive dots in the string
end
See IDEONE demo:
-- good examples
print(pattern_checker("hello.com")) -- true
print(pattern_checker("hello")) -- true
print(pattern_checker("hello.com.com.com")) -- true
-- bad examples
print(pattern_checker("hello.com.")) -- false
print(pattern_checker(".hello")) -- false
print(pattern_checker("hello..com.com.com")) -- false
print(pattern_checker("%hello.com.com.com")) -- false
You can translate the pattern to ^[%w_][%w_%.]+[%w_]$, although that still allows for double dots. When using that pattern while checking for double dots, you end up with this:
function pattern_checker(v)
-- Using double "not" because we like booleans, no?
return not v:find("..",1,true) and not not v:match("^[%w_][%w_%.]+[%w_]$")
end
I used the same testcode as Wiktor Stribiżew (since it's good testcode) and it produces the same results. Mine is also 2 to 3 times faster, if that matters. (Doesn't mean I don't like Wiktor's code, his code also works. He also has a link to the limitations page, a nice touch to his answer)
(I like playing with string patterns in Lua)
I'm trying to build a regex that will identify parameters inside brackets and ignore pl/sql comments (single line --, and multiple lines /* */)
For instance:
create or replace table_name ---sfjdslkfjslkfjslkfjdsfsdf
**(var1 in out number, var2 number)**
/* sdfls
sfdsd jfs
sfs f
sd f
sfsf */
AS
BEGIN
(var1 in out number, var2 number) should be matched only. It should also account for cases where:
There are not comments (single or multiple lines)
There are only single line comments either before or after the parameters
There are only multiple line comments either before or after the parameters
There are no parameters
Assumptions:
Parameters are always enclosed in brackets ()
Procedures can sometimes have no parameters but have comments (either single or multiple line comments) before the AS BEGIN clause
Procedures start with create or replace table_name
We're only interested to read the until the AS BEGIN clause
In other words, I need to find the index of the first opening bracket '(' that is outside any comments (single or multiple lines) and that comes before the AS BEGIN clause.
UPDATE:
I have managed to match the comments using the following regex:
(?:\/\*(?:[\s\S]*?)\*\/)|(?:\-\-(?:.*)$)
For instance here it will match all the comments:
create or replace table_name
-- sdlfksl kjs slkjslds js
/* lsdjfdkj
s fskjfs
sf sf
sdf;;''
sfs fs
*/
(hello number, var2 number)
--sdflksf
/*sl --sdflks s kdjfls())({fsfs */
AS
BEGIN
I can do this now in Java to identify the first opening bracket outside of any matching group. However it would be easier if I could just ignore the matching group and match the one parameters in the brackets only instead.
EDIT
This is not asking for a solution with pl/sql or sqlplus or whatever. I have a few pl/sql procedures stored in files that I need to modify and add new parameters to. I'm using Java to do that and inside Java im using a combination of loops and regexs.
Not sure why you want to reinvent the wheel. In Oracle, you could use the *_ARGUMENTS view to get the list of all the arguments.
For example,
SQL> CREATE OR REPLACE
2 PROCEDURE p(
3 in_date DATE,
4 out_text VARCHAR2)
5 AS
6 BEGIN
7 NULL;
8 END;
9 /
Procedure created.
SQL>
SQL> SELECT object_name,
2 package_name,
3 argument_name,
4 position,
5 data_type
6 FROM USER_ARGUMENTS
7 WHERE OBJECT_NAME = 'P';
OBJECT_NAM PACKAGE_NA ARGUMENT_NAME POSITION DATA_TYPE
---------- ---------- --------------- ---------- ---------
P OUT_TEXT 2 VARCHAR2
P IN_DATE 1 DATE
SQL>
I'm unfamiliar with Oracle's flavor of regexes, but I made a pattern that works in most regex-engines. I know MySQL has a wildly different syntax so this is quite possibly also the case in Oracle, but the general idea is this:
create or replace table_name(?:(?!AS\s+BEGIN)(?:--[^\n]*(?:\n|$)|\/\*(?:[^*]|\*(?!\/))*\*\/|(?!\/\*|--)[^(]))*\(([^)]+)
Debuggex Demo
It's a "match this except in contexts A, B or C" approach also used here and explained in depth in this SO answer. Combined with the requirement that whatever we're matching isn't followed by AS\s+BEGIN.
You can see in the Debuggex Demo that the correct part between brackets is matched in capture group 1. As well, the 2nd create doesn't have parameters, and indeed nothing is matched (even though there's brackets in comments and after the AS BEGIN.
The task at hand is porting this to Oracle's regex flavor.