Regex for custom GROUP BY statement

I'm trying to write a Java-compatible regex for a custom GROUP BY statement, to parse expressions like this:
GROUP BY table1.field1, table2.field2 UNDER table3
The idea is to get multiple "group by" tables somehow, along with a single "under" table.
I've tried something like this, but it does not work:
^\s*group\s*by\s*([,]*[\s]*([A-Za-z0-9_]+\.[A-Za-z0-9_]+){1,})\s{1,}under\s{1,}([A-Za-z0-9_]+)$
I'm not even sure that it can be done in a single regex. Maybe it should be split?

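Try this regex, compiled with the case-insensitive flag because the sample input is upper-case (the whitespace before under is kept outside group 1, so the captured field list has no trailing space): ^\s*group\s+by\s+([A-Za-z0-9_]+\.[A-Za-z0-9_]+(?:,\s*[A-Za-z0-9_]+\.[A-Za-z0-9_]+)*)\s+under\s+([A-Za-z0-9_]+)$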
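As a rough sketch of how the two groups could be consumed, here is the same idea in Python's re module (the question targets Java, where Pattern.CASE_INSENSITIVE would play the role of re.IGNORECASE). The function name parse_group_by is hypothetical, and the pattern is a variant of the one above that keeps trailing whitespace out of group 1.

```python
import re

# Group 1: comma-separated "table.field" list; group 2: the single UNDER table.
GROUP_BY_RE = re.compile(
    r"^\s*group\s+by\s+"
    r"([A-Za-z0-9_]+\.[A-Za-z0-9_]+(?:,\s*[A-Za-z0-9_]+\.[A-Za-z0-9_]+)*)"
    r"\s+under\s+([A-Za-z0-9_]+)$",
    re.IGNORECASE,  # the sample statement is written in upper case
)


def parse_group_by(statement):
    """Return (fields, under_table), or None if the statement does not match."""
    match = GROUP_BY_RE.match(statement)
    if match is None:
        return None
    # A repeated group only keeps its last repetition, so split group 1 instead.
    fields = [part.strip() for part in match.group(1).split(",")]
    return fields, match.group(2)


print(parse_group_by("GROUP BY table1.field1, table2.field2 UNDER table3"))
```

Splitting group 1 on commas is the "maybe it should be split" part of the question: one regex validates the statement, and a plain split recovers the individual fields.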

How to use Postgres Regex Replace with a capture group

As the title says, I am trying to reference capture groups in a regex replace in a Postgres query. I have read that regexp_replace does not support using regex capture groups. The regex I am using is
r"(?:[\s\(\)\=\)\,])(username)(?:[\s\(\)\=\)\,])?"gm
The above regex almost does what I need, but I need to find out how to allow a match only if the surrounding groups also match something. A "username" should never be matched if it just happens to be a substring of a word. By ensuring it's surrounded by one of the above characters, I can be much more confident that it is a username.
An example application of the regex would be something like this in Postgres (of course I would be doing an update rather than a select):
select *, REGEXP_REPLACE(reqcontent,'(?:[\s\(\)\=\)\,])(username)(?:[\s\(\)\=\)\,])?' ,'NEW-VALUE', 'gm') from table where column like '%username%' limit 100;
If I can provide any more context, please let me know. I have also found similar posts (postgresql regexp_replace: how to replace captured group with evaluated expression (adding an integer value to capture group)), but they are more about splicing values back in and don't quite answer my question.
More context and example values for the regex to work against: the text below may look familiar, as these are JQL filters in Jira. We want to update our usernames and all their occurrences in the table that contains the filters. Below are a few example filters. We originally just did a find and replace, but that doesn't work because some usernames are only two characters long and it matched non-usernames (e.g. for the username je it would put the new value inside the word project, which completely malforms the JQL string, resulting in something like proNEW-VALUEct = blah blah).
type = bug AND status not in (Closed, Executed) AND assignee in (test, username)
assignee=username
assignee = username
Definition of Answered:
A regex that will only match a 'username' if it's surrounded by one of the special characters
A way to regex/replace that username in a postgres query.
Capturing groups are used to keep the important bits of information matched with a regex.
Either put capturing groups around the parts of the string you want to keep and reference them with placeholders in the replacement:
REGEXP_REPLACE(reqcontent,'([\s\(\)\=\)\,])username([\s\(\)\=\)\,])?' ,'\1NEW-VALUE\2', 'gm')
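Or use lookarounds, if your PostgreSQL version supports lookbehind (a lookaround cannot be followed by a quantifier in PostgreSQL, so the optional trailing delimiter is written as an alternation with the end of the line):
REGEXP_REPLACE(reqcontent,'(?<=[\s\(\)\=\,])username(?=[\s\(\)\=\,]|$)' ,'NEW-VALUE', 'gm')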
Or, in this case, use word boundaries (\y in PostgreSQL) so that only whole-word occurrences are replaced:
REGEXP_REPLACE(reqcontent,'\yusername\y' ,'NEW-VALUE', 'g')
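To see why the word-boundary version avoids the proNEW-VALUEct problem, here is a small sketch in Python rather than SQL. Python spells the word boundary \b where PostgreSQL uses \y; the function name replace_username and the sample filters are assumptions made for illustration.

```python
import re


def replace_username(jql, old, new):
    """Replace whole-word occurrences of `old` in a JQL filter string."""
    # \b is Python's word boundary (PostgreSQL writes it as \y).
    pattern = r"\b" + re.escape(old) + r"\b"
    # A function is used as the replacement so `new` is inserted literally.
    return re.sub(pattern, lambda _match: new, jql)


print(replace_username("project = X AND assignee in (test, je)", "je", "NEW-VALUE"))
print(replace_username("assignee=je", "je", "NEW-VALUE"))
```

The "je" inside "project" is left alone because it has word characters on both sides, while "je" after "=", "(" or ", " is replaced.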

Multiple replace regex in one Apache-NiFi statement

I have a csv in following format.
id,mobile
1,02146477474
2,08585377474
3,07646474637
4,02158789566
5,04578599525
I want to add a new column containing just the leading 3 digits of the number (for specific prefixes only; every other row should get the string NOT_VALID). So the result should be:
id,number,provider
1,02146477474,021
2,08585377474,085
3,07646474637,NOT_VALID
4,02158789566,021
5,04578599525,NOT_VALID
I can use the following regex for one replacement, but I would like to handle all possible conversions in one step, using the UpdateRecord processor.
${field.value:replaceFirst('085[0-9]+','085')}
When I use something like this:
${field.value:replaceFirst('085[0-9]+','085'):or(${field.value:replaceFirst('086[0-9]+','086')})}
it replaces every value with false.
NiFi uses Java regex.
Since you are using record processing, this should work for you:
${field.value:replaceFirst('^(021|085)?.*','$1')}
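The optional group (021|085)? captures 021 or 085 at the beginning of the string (^), and .* consumes the rest.
The replacement $1 is the content of that first group. For values that start with neither prefix the group is empty, so the result is an empty string rather than NOT_VALID.
PS: Sites like https://regex101.com/ help with understanding regex.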
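The behaviour of that expression can be checked outside NiFi. Below is a minimal sketch in Python that mimics replaceFirst with the same pattern. The function name provider and the final fallback to NOT_VALID are assumptions added for illustration; they are not part of the NiFi expression above.

```python
import re


def provider(mobile, allowed=("021", "085")):
    """Return the leading prefix if it is an allowed one, else NOT_VALID."""
    prefixes = "|".join(re.escape(p) for p in allowed)
    # count=1 mirrors replaceFirst: only the first match is replaced.
    prefix = re.sub(r"^(" + prefixes + r")?.*", r"\1", mobile, count=1)
    # An unmatched optional group yields an empty string; map it to NOT_VALID.
    return prefix or "NOT_VALID"


for number in ["02146477474", "08585377474", "07646474637"]:
    print(number, provider(number))
```

In NiFi itself the empty result would still need a separate step to become NOT_VALID; how best to do that depends on the processor configuration.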

How to combine multiple RegEx commands for Notepad++ using capture groups and alternations?

I am converting exported SQL views as files to a different syntax using a separate specialized conversion tool. This tool can't handle certain commands and formatting so I'm using Notepad++ with RegEx to alter the files ahead of time.
So far I am getting the results that I want, but it takes three separate Find/Replace actions. I'd like to reduce these three RegEx actions down to one if possible.
Find: (.*)(CREATE VIEW.*\nGO)(.*)
Replace: \2
Find: (CREATE VIEW )(.*)(\r\nAS)
Replace: \1"\2"\3
Find: (oldschema1\.|\[oldschema1\]\.|\[|\]|TOP \(100\) PERCENT|oldschema2\.)|(^GO$)|(\A^(.*?))
Replace: (?1)(?2\;)(?3SET SCHEMA schemaname\; \n\n\1)
I'm using Notepad++ 7.7.1 64-bit, Find/Replace with Regular Expression search mode - ". matches newline" check on.
You'll see in my code that I'm already using capture groups with alternation. I thought I could combine the first two RegEx steps as additional capture groups to Step 3 but it doesn't work out, possibly because they are nested.
I tried referencing the nested groups by incrementing the referencing number accordingly, but it doesn't work (blanks out the result).
Here is an example SQL view file. It's not a working view because I added "oldschema2" so the RegEx would have something to find for one of the replacements, but it's representative as an example here.
garbage
text
beforehand
CREATE VIEW [oldschema1].[viewname]
AS
SELECT DISTINCT
TOP (100) PERCENT oldschema1.TABLENAME.FIELD1, oldschema1.TABLENAME.FIELD2
FROM oldschema1.TABLENAME
WHERE (oldschema1.TABLENAME.FIELD3 = N'Z003') AND oldschema2.TABLENAME.FIELD2 = 1
ORDER BY oldschema1.TABLENAME.FIELD1
GO
garbage
text
after
Here are some additional details of what I'm trying to achieve with each pass.
Notepad++ RegEx Step 1 - isolate view block from CREATE VIEW to GO
Find:
(.*)(CREATE VIEW.*\nGO)(.*)
Replace:
\2
Step 2 - put quotes around view name
Find:
(CREATE VIEW )(.*)(\r\nAS)
Replace:
\1"\2"\3
Step 3 - remove/replace various texts and insert a line at the beginning of the file
Find:
(oldschema1\.|\[oldschema1\]\.|\[|\]|TOP \(100\) PERCENT|oldschema2\.)|(^GO$)|(\A^(.*?))
Replace:
(?1)(?2\;)(?3SET SCHEMA schemaname\; \n\n\1)
The expected output from the above example would be:
SET SCHEMA schemaname;
CREATE VIEW "viewname"
AS
SELECT DISTINCT
TABLENAME.FIELD1, TABLENAME.FIELD2
FROM TABLENAME
WHERE (TABLENAME.FIELD3 = N'Z003') AND TABLENAME.FIELD2 = 1
ORDER BY TABLENAME.FIELD1
;
which I achieve with the above three steps, but I'd like to do it in one Find/Replace if possible.
I'm pretty new to RegEx, and StackOverflow for that matter. Your help is greatly appreciated.
Step 1
I'm not so sure about it, but I'm guessing that maybe we would want an expression similar to:
[\s\S]*?(CREATE VIEW[\s\S]*GO\s*)[\s\S]*
to be replaced with $1, where our desired data is in this capturing group:
(CREATE VIEW[\s\S]*GO\s*)
and we can even remove \s*:
(CREATE VIEW[\s\S]*GO)
and just try:
[\s\S]*?(CREATE VIEW[\s\S]*GO)[\s\S]*
with an m flag.
Step 2
We can likely try:
(CREATE VIEW)(.*)
and replace with:
SET SCHEMA schemaname;\n\n$1 "viewname"
Step 3
This step would probably be done with an expression similar to:
TOP \(100\) PERCENT |oldschema1\.
being replaced with an empty string.
Step 4
Replace \s*GO with \n; or just ; and we should likely have the desired output.
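The four steps can also be chained in a script instead of separate Find/Replace passes. The following is only a sketch using Python's re module, not Notepad++'s regex engine. The names SAMPLE and convert_view are hypothetical. Two details differ from the steps above and are my own assumptions: step 2 captures the view name instead of hard-coding it, and step 3 also strips oldschema2. so the sample from the question converts fully.

```python
import re

SAMPLE = """garbage
text
beforehand
CREATE VIEW [oldschema1].[viewname]
AS
SELECT DISTINCT
TOP (100) PERCENT oldschema1.TABLENAME.FIELD1, oldschema1.TABLENAME.FIELD2
FROM oldschema1.TABLENAME
WHERE (oldschema1.TABLENAME.FIELD3 = N'Z003') AND oldschema2.TABLENAME.FIELD2 = 1
ORDER BY oldschema1.TABLENAME.FIELD1
GO
garbage
text
after
"""


def convert_view(sql, new_schema="schemaname"):
    # Step 1: keep only the block from CREATE VIEW up to the last GO.
    block = re.search(r"CREATE VIEW[\s\S]*GO", sql)
    if block is None:
        raise ValueError("no CREATE VIEW ... GO block found")
    text = block.group(0)
    # Step 2: prepend SET SCHEMA and quote the view name (schema prefix dropped).
    text = re.sub(
        r"CREATE VIEW (?:\[?\w+\]?\.)?\[?(\w+)\]?",
        lambda m: f'SET SCHEMA {new_schema};\n\nCREATE VIEW "{m.group(1)}"',
        text,
        count=1,
    )
    # Step 3: drop TOP (100) PERCENT and the old schema prefixes.
    text = re.sub(r"TOP \(100\) PERCENT |oldschema[12]\.", "", text)
    # Step 4: turn the GO line into a terminating semicolon.
    return re.sub(r"^GO$", ";", text, flags=re.MULTILINE)


print(convert_view(SAMPLE))
```

Keeping the steps as separate substitutions sidesteps the nested-group numbering problem described in the question, at the cost of leaving Notepad++.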

Issue with REGEXP_SUBSTR

I have text in a column like /AB/25MAR92/ and /AB/25MAR1992/. I am trying to extract just 25MAR92 and 25MAR1992 from the column for a date calculation that I have to work on. Can you please help with the REGEXP_SUBSTR function for this issue?
Thanks!
You could try:
\b\d{1,2}[A-Z]{3}\d{2,4}\b
but this will also match 02MAR992. To exclude this possibility use:
\b\d{1,2}[A-Z]{3}(?:\d{2}|\d{4})\b
This will match 02MAR1992 and 02MAR92 but will not match 02MAR992.
I suggest using a pattern like this:
\/(\d{2}[A-Z]{3}(19|20)?\d{2})\/
Four-digit years are limited to 1900-2099; two-digit years are accepted as well.
If you do not want to allow any 2-digit value for the day \d{2},
you could add this pattern instead (0[1-9]|[12][0-9]|3[01]) that matches 01-31;
\/((0[1-9]|[12][0-9]|3[01])[A-Z]{3}(19|20)?\d{2})\/
Or if you allow dates like /AB/2MAR92/ that have days without a leading zero
add (0[1-9]|[12][0-9]|3[01]|[1-9]) instead:
\/((0[1-9]|[12][0-9]|3[01]|[1-9])[A-Z]{3}(19|20)?\d{2})\/
I've used / as anchors. If you don't like that, you can use \b.
In reaction to your latest comments, my recommended pattern looks like this:
\b\d{1,2}[A-Z]{3}(?:19|20)?\d{2}\b
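A quick way to check that last pattern against the values from the question is a few lines of Python. This is only a sketch with Python's re module, not the database's REGEXP_SUBSTR, whose regex dialect may differ (for example in whether \b is supported); the function name extract_date is hypothetical.

```python
import re

# One or two day digits, a three-letter month, then a 2-digit or 19xx/20xx year.
DATE_RE = re.compile(r"\b\d{1,2}[A-Z]{3}(?:19|20)?\d{2}\b")


def extract_date(value):
    """Return the first date-like token in `value`, or None if there is none."""
    match = DATE_RE.search(value)
    return match.group(0) if match else None


for text in ["/AB/25MAR92/", "/AB/25MAR1992/", "/AB/02MAR992/"]:
    print(text, extract_date(text))
```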

how to extract out a string with SYMBOLS after a pattern in a URL string in Google BigQuery

I have two possible forms of a URL string:
http://www.abcexample.com/landpage/?pps=[Y/lyPw==;id_1][Y/lyP2ZZYxi==;id_2];[5403;ord];
http://www.abcexample.com/landpage/?pps=Y/lyPw==;id_1;unknown;ord;
I want to extract the Y/lyPw== in both examples,
i.e. everything before ;id_1 (inside the brackets, when they are present).
It will always come after the ?pps= part.
What is the best way to approach this? I want to use the BigQuery query language, as this is where my data sits.
Here is one way to build a regular expression to do it:
SELECT REGEXP_EXTRACT(url, r'\?pps=;[\[]?([^;]*);') FROM
(SELECT "http://www.abcexample.com/landpage/?pps=;[XYZXYZ;id_1][XYZZZZ;id_2];[5403;ord];"
AS url),
(SELECT "http://www.abcexample.com/landpage/?pps=;XYZXYZ;id_1;unknown;ord;"
AS url)
You can use this regex:
pps=\[?([^;]+)
The idea behind this regex is:
pps= -> Look for the pps= pattern
\[? -> might have a [ or not
([^;]+) -> store the content up to the first semi colon
For BigQuery you have to use
REGEXP_EXTRACT('str', 'reg_exp')
Quoting its documentation:
REGEXP_EXTRACT: Returns the portion of str that matches the capturing group within the regular expression.
Use a query like this:
SELECT
REGEXP_EXTRACT(word,r'pps=\[?([^;]+)') AS fragment
FROM
...
A working example, using the two URL forms from the question (this pattern needs the value to follow pps= directly, optionally after a [):
SELECT
REGEXP_EXTRACT(url,r'pps=\[?([^;]+)') AS fragment
FROM
(SELECT "http://www.abcexample.com/landpage/?pps=[Y/lyPw==;id_1][Y/lyP2ZZYxi==;id_2];[5403;ord];"
AS url),
(SELECT "http://www.abcexample.com/landpage/?pps=Y/lyPw==;id_1;unknown;ord;"
AS url)
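The pattern itself can be tried outside BigQuery. Here is a minimal sketch in Python; re.search with group(1) stands in for REGEXP_EXTRACT, and the function name extract_pps is hypothetical. The URLs are the two forms given in the question.

```python
import re

# pps= followed by an optional "[", then everything up to the first semicolon.
PPS_RE = re.compile(r"pps=\[?([^;]+)")


def extract_pps(url):
    """Return the first pps value, or None when the pattern does not match."""
    match = PPS_RE.search(url)
    return match.group(1) if match else None


urls = [
    "http://www.abcexample.com/landpage/?pps=[Y/lyPw==;id_1][Y/lyP2ZZYxi==;id_2];[5403;ord];",
    "http://www.abcexample.com/landpage/?pps=Y/lyPw==;id_1;unknown;ord;",
]
for url in urls:
    print(extract_pps(url))
```

Because [^;]+ requires at least one non-semicolon character, a URL in which pps= is immediately followed by ; produces no match.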
This regex should work for you
(\w+);id_1
It will extract XYZXYZ by using a capturing group.
Note that \w does not match / or =, so this pattern will not match a value such as Y/lyPw==.