Regex to extract text between two character patterns

I have multiple rows of data that look like the following:
dgov-nonprod-adp-personal.groups
dgov-prod-gcp-sensitive.groups
I want to get the text between the last hyphen and before the period so:
personal
sensitive
I have this regex: (?:prod-(.*)-)(.*).groups. However, it produces two groups, and in BigQuery I can only extract when there is exactly one group. What would the regex be to extract just the text I want?
Note: the segment after the first hyphen and before the second is always prod or nonprod; that's why my original regex uses prod-, since that part is constant.

Assuming the BigQuery function you are using supports a capture group, I would phrase your requirement as:
([^-]+)\.groups$
Demo
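A minimal Python sketch for checking this pattern against the sample rows before using it in BigQuery. Python's re module stands in for BigQuery's regex engine here (an assumption that holds for a pattern this simple), and the helper name last_segment is made up for the illustration.

```python
import re

# Sample rows taken from the question.
rows = [
    "dgov-nonprod-adp-personal.groups",
    "dgov-prod-gcp-sensitive.groups",
]

# One capture group: the hyphen-free run right before ".groups" at the end.
PATTERN = re.compile(r"([^-]+)\.groups$")


def last_segment(row):
    """Return the text between the last hyphen and '.groups', or None."""
    match = PATTERN.search(row)
    return match.group(1) if match else None


results = [last_segment(row) for row in rows]
print(results)  # ['personal', 'sensitive']
```

Because the pattern contains exactly one capturing group, it fits the single-group restriction mentioned in the question.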

For the example data, you can make the pattern a bit more specific matching -nonprod or -prod with a single capture group:
-(?:non)?prod-[^-]+-([^-]+)\.groups$
See a regex demo.
If there can be more occurrences of the hyphen:
-(?:non)?prod(?:-[^-]+)*-([^-]+)\.groups$
The pattern matches
-(?:non)?prod Match either -nonprod or -prod
(?:-[^-]+)* Optionally match - followed by 1+ chars other than -
- Match a literal hyphen
([^-]+) Capture group 1, match 1+ chars other than -
\.groups Match .groups
$ End of string
See another regex demo.
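The stricter pattern can be exercised the same way. This is a sketch in Python rather than BigQuery SQL; the third and fourth rows and the helper name prod_segment are hypothetical and only there to show the extra-hyphen case and a non-matching row.

```python
import re

# Requires -prod or -nonprod, allows any number of hyphenated segments,
# and captures only the last segment before ".groups".
PATTERN = re.compile(r"-(?:non)?prod(?:-[^-]+)*-([^-]+)\.groups$")


def prod_segment(row):
    """Return capture group 1, or None when the row does not match."""
    match = PATTERN.search(row)
    return match.group(1) if match else None


rows = [
    "dgov-nonprod-adp-personal.groups",      # from the question
    "dgov-prod-gcp-sensitive.groups",        # from the question
    "dgov-prod-gcp-extra-sensitive.groups",  # hypothetical: one more hyphen
    "dgov-dev-gcp-sensitive.groups",         # hypothetical: neither prod nor nonprod
]

results = [prod_segment(row) for row in rows]
print(results)  # ['personal', 'sensitive', 'sensitive', None]
```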

Related

Regex - extract last term between _ and before . from path

This is the regex that I'm currently testing
[\w\. ]+(?=[\.])
My ultimate goal is to include a regex expression to extract using regexp_extract in Impala/Hive query.
regexp_extract(col, '[\w\. ]+(?=[\.])', 1)
This doesn't work in Impala however.
Examples of path to extract from:
D:\mypath\Temp\abs\device\Program1.lua
D:\mypath\Temp\abs\device\SE1_Test-program.lua
D:\mypath\Temp\abs\device\Test_program.lua
D:\mypath\Temp\abs\device\Device_Test_Case-general.lua
The regex I've tested extracts the term I'm looking for, but it's not good enough: for the second, third and fourth cases I need to extract only the part after the last underscore.
My expectations are:
Program1
Test-program
program
Case-general
Any suggestions? I'm also open to using something other than regexp_extract.
Note that Impala regex does not support lookarounds, and thus you need a capturing group to get a submatch out of the overall match. Also, if you use escaping \ in the pattern, make sure it is doubled.
You can use
regexp_extract(col, '([^_\\\\]+)\\.\\w+$', 1)
See the regex demo.
The regex means
([^_\\]+) - Group 1: one or more chars other than _ and \ (a hyphen is allowed, so Test-program and Case-general stay whole)
\. - a dot
\w+ - one or more word chars
$ - end of string.
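To check a pattern like this against the sample paths outside Impala, here is a minimal Python sketch. Python's re module stands in for Impala's regex engine (an assumption), and the helper name extract is made up. It compares two character classes: one that excludes only _ and \, and one that also excludes -. Only the first keeps hyphenated names such as Test-program intact, which is what the expected output in the question asks for.

```python
import re

# Sample paths from the question (raw strings keep the backslashes literal).
paths = [
    r"D:\mypath\Temp\abs\device\Program1.lua",
    r"D:\mypath\Temp\abs\device\SE1_Test-program.lua",
    r"D:\mypath\Temp\abs\device\Test_program.lua",
    r"D:\mypath\Temp\abs\device\Device_Test_Case-general.lua",
]

# In a Python raw string the backslashes are not doubled a second time,
# unlike in the SQL string literal.
keep_hyphen = re.compile(r"([^_\\]+)\.\w+$")      # stops at _ and \
stop_at_hyphen = re.compile(r"([^-_\\]+)\.\w+$")  # also stops at -


def extract(pattern, path):
    """Return capture group 1 of the match, or None."""
    match = pattern.search(path)
    return match.group(1) if match else None


with_hyphen = [extract(keep_hyphen, p) for p in paths]
without_hyphen = [extract(stop_at_hyphen, p) for p in paths]

print(with_hyphen)     # ['Program1', 'Test-program', 'program', 'Case-general']
print(without_hyphen)  # ['Program1', 'program', 'program', 'general']
```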
Because \w also matches an underscore, you can use [a-zA-Z0-9] instead.
Add a dot and a hyphen to the character class, capture that in group 1, and match the expected trailing dot.
Note that you don't have to escape dots in a character class.
([a-zA-Z0-9.-]+)[.]
See a regex101 demo
Example using regexp_extract where the , 1 gets the group 1 value:
regexp_extract(col, '([a-zA-Z0-9.-]+)[.]', 1)
If it should be at the end of the string only, matching the last dot without matching any backslashes in between:
regexp_extract(col, '([a-zA-Z0-9.-]+)[.][^\\\\.]+$', 1)
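The difference between the unanchored and the end-anchored variant shows up when a directory name contains a dot. A small Python sketch, assuming Python's re behaves like Impala's engine for these patterns; the path in dotted_dir and the helper group1 are hypothetical.

```python
import re

first_dot = re.compile(r"([a-zA-Z0-9.-]+)[.]")          # unanchored
last_dot = re.compile(r"([a-zA-Z0-9.-]+)[.][^\\.]+$")   # anchored at the end

sample = r"D:\mypath\Temp\abs\device\SE1_Test-program.lua"  # from the question
dotted_dir = r"D:\my.path\device\Test_program.lua"          # hypothetical path


def group1(pattern, text):
    """Return capture group 1 of the first match, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


print(group1(first_dot, sample))      # Test-program
print(group1(last_dot, sample))       # Test-program
print(group1(first_dot, dotted_dir))  # my
print(group1(last_dot, dotted_dir))   # program
```

The unanchored pattern returns the first run that is followed by a dot, so the anchored form is the safer choice when folder names may contain dots.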

How can I remove something from the middle of a string with regex?

I have strings which look like this:
/xxxxx/xxxxx-xxxx-xxxx-338200.html
With my regex:
(?<=-)(\d+)(?=\.html)
It matches just the numbers before .html.
Is it possible to write a regex that matches everything that surrounds the numbers (matches the .html part and the part before the numbers)?
In your current pattern you already use a capturing group. In that case you can also match what comes before and after the digits instead of using the lookarounds:
-(\d+)\.html
To get what comes before and after the digits, you could use 2 capturing groups:
^(.*-)\d+(\.html)$
Regex demo
In the replacement use the 2 groups.
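A minimal sketch of the two-group replacement in Python (the question does not name a language, so Python and the variable names are assumptions). The replacement string \1\2 puts back everything except the digits.

```python
import re

url = "/xxxxx/xxxxx-xxxx-xxxx-338200.html"  # sample string from the question

# Group 1: everything up to and including the last hyphen before the digits.
# Group 2: the ".html" suffix. The digits themselves are matched but not captured.
pattern = re.compile(r"^(.*-)\d+(\.html)$")

cleaned = pattern.sub(r"\1\2", url)
print(cleaned)  # /xxxxx/xxxxx-xxxx-xxxx-.html
```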
This should do the job:
.*-\d+\.html
Explanation: .* matches any characters up to -\d+, which matches a - followed by a sequence of digits, and \.html matches the literal .html (where \. represents the character .).
To capture groups, just do (.*-)(\d+)(\.html). This will put everything before the number in a group, the number in another group and everything after the number in another group.

How can I get two variables from a string with Regex

I have the following string that I need to extract two variables from with Regex. I've been looking at tutorials, but not getting the results I need. Can anyone help?
AWS-010062347904-uptree-base-prod-admin
I need the 12 digits between the first two hyphens as accountid (ie. 010062347904)
and everything after the 2nd hyphen as role (ie. uptree-base-prod-admin)
Try Regex: (?:[^-]*-)(?<accountId>\d{12})(?:[^-]*-)(?<role>.*)
Demo
From your pattern in the comments, ^AWS\-(?<accountid>\d+), you are matching AWS- from the start ^ of the string, and after that you are using a named capturing group accountid to capture one or more digits.
You could get that accountId by referring to that named capturing group. To capture the role you can use another named capturing group:
^AWS-(?<accountId>\d+)-(?<role>.*)$
Regex demo
If you want the digits after the first hyphen without taking into account what is at the start of the string you could use a negated character class [^-]* at the start matching 0+ times not a hyphen:
^[^-]*-(?<accountId>\d+)-(?<role>.*)$
Regex demo
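For illustration, the same idea in Python. Note that Python's re module spells named groups as (?P<name>...) rather than (?<name>...), so the pattern below is a translation, not the exact pattern from the answers; the variable names are made up.

```python
import re

text = "AWS-010062347904-uptree-base-prod-admin"  # sample from the question

# Python syntax for named groups is (?P<name>...).
pattern = re.compile(r"^[^-]*-(?P<accountId>\d+)-(?P<role>.*)$")

match = pattern.match(text)
account_id = match.group("accountId")
role = match.group("role")

print(account_id)  # 010062347904
print(role)        # uptree-base-prod-admin
```

If the account id must be exactly 12 digits, \d{12} can replace \d+ as in the first answer.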

Regular expression to split by extension but keep the extension

Hi all, I would like to split a string that contains the extension .ps1. I used the following regex:
var regex = Regex.Split(text, ".ps1");
but I need the extension to stay in the first string. Assume my script is c:\Test\test.ps1 -Arg -Arg1. When I split it, I need c:\Test\test.ps1 as the first string and -Arg -Arg1 as the second. How can I do this?
Use a positive lookbehind (?<=\.ps1):
(?<=\.ps1)\s+
See the regex demo
Details:
(?<=\.ps1) - require a .ps1 to be immediately before the current location
\s+ - 1+ whitespace symbols
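The question uses .NET's Regex.Split; as a language-neutral illustration, here is the same lookbehind split sketched in Python with re.split. The maxsplit=1 argument is an addition of this sketch, to stop after the first split point.

```python
import re

command = r"c:\Test\test.ps1 -Arg -Arg1"  # sample from the question

# Split on whitespace only where ".ps1" comes immediately before it.
# The lookbehind consumes nothing, so ".ps1" stays in the first part.
parts = re.split(r"(?<=\.ps1)\s+", command, maxsplit=1)

print(parts)  # ['c:\\Test\\test.ps1', '-Arg -Arg1']
```

The spaces between the arguments are not split points, because none of them is preceded by .ps1.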
This will give you the first part in group one and second part in group two
(.+[.]ps1)(.+)
Explanation
(.+[.]ps1) - first group with anything followed by ps1 extension
(.+) - second group with anything after first group
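A short Python sketch of the two-group approach (Python is an assumption; the question itself uses .NET). It also shows that the second group keeps the whitespace that follows .ps1, so trimming may be needed.

```python
import re

command = r"c:\Test\test.ps1 -Arg -Arg1"  # sample from the question

match = re.match(r"(.+[.]ps1)(.+)", command)
script = match.group(1)
arguments = match.group(2)

print(repr(script))             # 'c:\\Test\\test.ps1'
print(repr(arguments))          # ' -Arg -Arg1'
print(repr(arguments.strip()))  # '-Arg -Arg1'
```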

Regex Optional Match

I have this regex pattern which I made myself (I'm a noob though, and made it through following tutorials):
^([a-z0-9\p{Greek}].*)\s(Ε[0-9\p{Greek}]+|Θ)\s[\(]([a-z1-9\p{Greek}]+.*)[\)]\s-\s([a-z0-9\p{Greek}]+$)
And I'm trying to match the following sentences:
ΠΡΟΓΡΑΜΜΑΤΙΣΤΙΚΕΣ ΕΦΑΡΜ ΣΤΟ ΔΙΑΔΙΚΤΥΟ Ε2 (Ε.Β.Δ.) - ΔΗΜΗΤΡΙΟΥ
ΠΡΟΓΡΑΜΜΑΤΙΣΜΟΣ 1 Θ (ΑΜΦ) - ΜΑΣΤΟΡΟΚΩΣΤΑΣ
ΕΙΣΑΓΩΓΗ ΣΤΗΝ ΠΛΗΡΟΦΟΡΙΚΗ Θ (ΑΜΦ) - ΒΟΛΟΓΙΑΝΝΙΔΗΣ
And so on.
This pattern splits the string into 4 parts.
For example, for the string:
ΠΡΟΓΡΑΜΜΑΤΙΣΤΙΚΕΣ ΕΦΑΡΜ ΣΤΟ ΔΙΑΔΙΚΤΥΟ Ε2 (Ε.Β.Δ.) - ΔΗΜΗΤΡΙΟΥ
The first match is: ΠΡΟΓΡΑΜΜΑΤΙΣΤΙΚΕΣ ΕΦΑΡΜ ΣΤΟ ΔΙΑΔΙΚΤΥΟ (Subject's Name)
Second match is: Ε2 (Class)
Third match is: Ε.Β.Δ. (Room)
And the fourth match is: ΔΗΜΗΤΡΙΟΥ (Teacher)
Now in some entries E*/Θ is not defined, and I want to get the 3 matches without the E*/Θ. How should I modify my pattern so that (Ε[0-9\p{Greek}]+|Θ) is an optional match?
I tried ? so far, but because my pattern has \s both before and after that group, it requires two whitespace characters to get the 3 matches, and my string has only one.
I think you need to do two things:
Make .* lazy (i.e. .*?)
Wrap \s(Ε[0-9\p{Greek}]+|Θ) in an optional non-capturing group: (?:\s(Ε[0-9\p{Greek}]+|Θ))?
The regex will look like
^([a-z0-9\p{Greek}].*?)(?:\s(Ε[0-9\p{Greek}]+|Θ))?\s[\(]([a-z1-9\p{Greek}]+.*)[\)]\s-\s([a-z0-9\p{Greek}]+)$
See demo
If you do not make the first .* lazy, it will eat up the second group that is optional. Making it lazy will ensure that if there is some text that can be matched by the second capturing group, it will be "set".
Note you call capture groups matches, which is wrong. Matches are whole texts matched by the entire regular expression and captures are just substrings matched by parts of regexp enclosed in unescaped round brackets. See more on capture groups at regular-expressions.info.
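A minimal sketch of the lazy quantifier plus optional group, written in Python. Python's built-in re module does not support \p{Greek}, so the sketch substitutes the Unicode range \u0370-\u03FF (the Greek and Coptic block) as an assumption; that range covers the capital letters in the samples. The second input is a hypothetical entry made from the question's third sample with the class removed.

```python
import re

GREEK = r"\u0370-\u03FF"  # assumption: stand-in for \p{Greek}

pattern = re.compile(
    rf"^([a-z0-9{GREEK}].*?)"      # group 1: subject, lazy
    rf"(?:\s(Ε[0-9{GREEK}]+|Θ))?"  # group 2: class, optional
    rf"\s\(([a-z1-9{GREEK}]+.*)\)" # group 3: room, inside parentheses
    rf"\s-\s([a-z0-9{GREEK}]+)$"   # group 4: teacher
)

with_class = "ΠΡΟΓΡΑΜΜΑΤΙΣΤΙΚΕΣ ΕΦΑΡΜ ΣΤΟ ΔΙΑΔΙΚΤΥΟ Ε2 (Ε.Β.Δ.) - ΔΗΜΗΤΡΙΟΥ"
without_class = "ΕΙΣΑΓΩΓΗ ΣΤΗΝ ΠΛΗΡΟΦΟΡΙΚΗ (ΑΜΦ) - ΒΟΛΟΓΙΑΝΝΙΔΗΣ"  # hypothetical

groups_with = pattern.match(with_class).groups()
groups_without = pattern.match(without_class).groups()

print(groups_with)
print(groups_without)  # group 2 is None when the class is missing
```

When the optional group does not take part in the match, Python reports it as None, so the remaining three captures keep their group numbers.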
You can use something like:
(Ε[0-9\p{Greek}]+|Θ)?
The whole group will be optional (?).