Regex - extract last term between _ and before . from path - regex

This is the regex that I'm currently testing
[\w\. ]+(?=[\.])
My ultimate goal is to include a regex expression to extract using regexp_extract in Impala/Hive query.
regexp_extract(col, '[\w\. ]+(?=[\.])', 1)
This doesn't work in Impala however.
Examples of path to extract from:
D:\mypath\Temp\abs\device\Program1.lua
D:\mypath\Temp\abs\device\SE1_Test-program.lua
D:\mypath\Temp\abs\device\Test_program.lua
D:\mypath\Temp\abs\device\Device_Test_Case-general.lua
The regex I've tested extracts the term I'm looking for, but it's not good enough: for the second, third, and fourth cases I need to extract only the part after the last underscore.
My expectations are:
Program1
Test-program
program
Case-general
Any suggestions? I'm also open to using something other than regexp_extract.

Note that Impala regex does not support lookarounds, and thus you need a capturing group to get a submatch out of the overall match. Also, if you use escaping \ in the pattern, make sure it is doubled.
You can use
regexp_extract(col, '([^_\\\\]+)\\.\\w+$', 1)
See the regex demo.
The regex means
([^_\\]+) - Group 1: one or more chars other than _ and \ (the hyphen is not excluded, so Test-program and Case-general are kept whole)
\. - a dot
\w+ - one or more word chars
$ - end of string.
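To check the capture-group approach outside Impala, here is a minimal Python sketch. It assumes the hyphen should stay in the result, as in the question's expected output, so the character class excludes only _ and \. The helper name last_term is made up for illustration, and Python's re module only stands in for Impala's regex engine; the escaping rules for the SQL string literal still apply there.

```python
import re

# Assumption: hyphens are kept, so only "_" and "\" are excluded from group 1.
# The raw string means the backslashes are not doubled a second time.
PATTERN = re.compile(r"([^_\\]+)\.\w+$")

def last_term(path):
    # Hypothetical helper: returns group 1, or None when nothing matches.
    match = PATTERN.search(path)
    return match.group(1) if match else None

paths = [
    r"D:\mypath\Temp\abs\device\Program1.lua",
    r"D:\mypath\Temp\abs\device\SE1_Test-program.lua",
    r"D:\mypath\Temp\abs\device\Test_program.lua",
    r"D:\mypath\Temp\abs\device\Device_Test_Case-general.lua",
]

for p in paths:
    print(last_term(p))
```

The four printed values can be compared against the expected list in the question.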

Note that \w also matches an underscore; you can use [a-zA-Z0-9] instead.
Add matching a dot and hyphen in the character class, capture that in group 1 and match the expected trailing dot.
Note that you don't have to escape dots in a character class.
([a-zA-Z0-9.-]+)[.]
See a regex101 demo
Example using regexp_extract where the , 1 gets the group 1 value:
regexp_extract(col, '([a-zA-Z0-9.-]+)[.]', 1)
If the match should only occur at the end of the string, match the last dot and require that only characters other than a backslash or dot follow it:
regexp_extract(col, '([a-zA-Z0-9.-]+)[.][^\\\\.]+$', 1)
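The difference between the unanchored and the end-anchored pattern shows up when a directory name contains a dot. The following Python sketch is only an illustration: the path with my.path is an invented test case, first_group is a hypothetical helper, and Python's re module stands in for Impala.

```python
import re

LOOSE = re.compile(r"([a-zA-Z0-9.-]+)[.]")
ANCHORED = re.compile(r"([a-zA-Z0-9.-]+)[.][^\\.]+$")

def first_group(pattern, text):
    # Hypothetical helper: group 1 of the first match, or None.
    match = pattern.search(text)
    return match.group(1) if match else None

# Invented path: note the dot inside the directory name "my.path".
tricky = r"D:\my.path\device\SE1_Test-program.lua"

print(first_group(LOOSE, tricky))     # my
print(first_group(ANCHORED, tricky))  # Test-program
```

On the question's own paths, which have no dots in the directory names, both patterns give the same result.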

Related

Regex to extract text between two character patterns

I have multiple rows of data that look like the following:
dgov-nonprod-adp-personal.groups
dgov-prod-gcp-sensitive.groups
I want to get the text between the last hyphen and before the period so:
personal
sensitive
I have this regex (?:prod-(.*)-)(.*).groups, however it gives two groups, and in BigQuery I can only extract if there is one group. What would the regex be to extract just the text I want?
Note: after the second hyphen and before the third it will always be prod or nonprod, that's why in my original regex I use prod- since that will be a constant
Assuming the BigQuery function you are using supports a capture group, I would phrase your requirement as:
([^-]+)\.groups$
Demo
For the example data, you can make the pattern a bit more specific matching -nonprod or -prod with a single capture group:
-(?:non)?prod-[^-]+-([^-]+)\.groups$
See a regex demo.
If there can be more occurrences of the hyphen:
-(?:non)?prod(?:-[^-]+)*-([^-]+)\.groups$
The pattern matches
-(?:non)?prod Match either -nonprod or -prod
(?:-[^-]+)* Optionally match - followed by 1+ chars other than -
- Match literally
([^-]+) Capture group 1, match 1+ chars other than -
\.groups Match .groups
$ End of string
See another regex demo.
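As a quick check of the single-capture-group patterns above, here is a small Python sketch run against the two sample rows. Python's re module is used only for illustration (BigQuery's own function is not called here), and the helper name extract is hypothetical.

```python
import re

SIMPLE = re.compile(r"([^-]+)\.groups$")
SPECIFIC = re.compile(r"-(?:non)?prod(?:-[^-]+)*-([^-]+)\.groups$")

def extract(pattern, row):
    # Hypothetical helper: the single capture group, or None if no match.
    match = pattern.search(row)
    return match.group(1) if match else None

rows = [
    "dgov-nonprod-adp-personal.groups",
    "dgov-prod-gcp-sensitive.groups",
]

for row in rows:
    print(extract(SIMPLE, row), extract(SPECIFIC, row))
```

Both patterns expose exactly one capture group, which is the constraint the question mentions.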

RegEx - Match a String Between Last '\' and Second '_'

I am trying to extract part of a filename out of a file path so that I can use it in the filename of a modified file. I'm having a little trouble trying to get RegEx to give me the part of the filename that I need, though. Here is the file path that I'm working with:
X:\\folder1\\folder2\\folder3\\folder4\\folder5\\Wherever-Place_2555025_Monthly-Report_202209150000.csv
Within this path, the drive name, the number of folders, the number of dashes in "Wherever-Place", and the information after the second underscore in the filename may vary. The important part is that I need to extract the following information:
Wherever-Place_2555025
from the path. Basically, I need to match everything between the last backslash and the second underscore. I can come up with the following RegEx to match everything after the last backslash:
[^\\]+$
And, if I run the output of that first RegEx through this next RegEx, I can get a match that includes the beginning of the string through the last character before the second underscore:
[^_]+_[^_]+
But, that also gives me another match that starts after the second underscore and goes through the end of the filename. This is not desirable - I need a single match, but I can't figure out how to get it to stop after it finds one match. I'd also really like to do all of this in one single RegEx, if that is possible. My RegEx has never been that good, and on top of that what I had is rusty...
Any help would be much appreciated.
If Lookarounds are supported, you may use:
(?<=\\)[^\\_]*_[^\\_]*(?=_[^\\]*$)
Demo.
For this match:
Basically, I need to match everything between the last backslash and
the second underscore.
You can use a capture group:
.*\\\\([^\s_]+_[^\s_]+)
The pattern matches:
.*\\\\ Match the last occurrence of \\
( Capture group 1
[^\s_]+_[^\s_]+ Match 1+ chars other than whitespace and _, then match the first _ and again match 1+ chars other than whitespace and _
) Close group 1
Regex demo
Or if supported with lookarounds and a match only:
(?<=\\)[^\s_\\]+_[^\s_]+(?![^\\]*\\)
The pattern matches:
(?<=\\) Positive lookbehind, assert \ to the left
[^\s_\\]+_[^\s_]+ Match 1+ chars other than whitespace, _ and \, then match the first _ and again match 1+ chars other than whitespace and _
(?![^\\]*\\) Negative lookahead, assert not \ to the right
Regex demo
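Both variants can be tried on the path from the question with a short Python sketch. It assumes the path really contains doubled backslashes, exactly as written in the question (hence the raw string), and that the engine supports lookarounds for the second pattern, which Python's re module does.

```python
import re

# Raw string: the doubled backslashes are literally part of the path.
path = (r"X:\\folder1\\folder2\\folder3\\folder4\\folder5"
        r"\\Wherever-Place_2555025_Monthly-Report_202209150000.csv")

# Variant 1: capture group after the last pair of backslashes.
with_group = re.compile(r".*\\\\([^\s_]+_[^\s_]+)")
# Variant 2: lookarounds, so the whole match is the wanted text.
match_only = re.compile(r"(?<=\\)[^\s_\\]+_[^\s_]+(?![^\\]*\\)")

m1 = with_group.search(path)
m2 = match_only.search(path)

print(m1.group(1))  # Wherever-Place_2555025
print(m2.group(0))  # Wherever-Place_2555025
```

With the capture-group variant the result is read from group 1; with the lookaround variant it is the full match.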

Match a part of a string using regex

I have a string and would like to match a part of it.
The string is Accept: multipart/mixedPrivacy: nonePAI: <sip:4168755400#1.1.1.238>From: <sip:4168755400#1.1.1.238>;tag=5430960946837208_c1b08.2.3.1602135087396.0_1237422_3895152To: <sip:4168755400#1.1.1.238>
I want to match PAI: <sip:4168755400#
The whitespace can be a word, so I would like to use .*, but if I use that it matches most of the string.
The example on that link is showing what I'm matching if I use the whitespace instead of .*
(PAI: <sip:)((?:\([2-9]\d{2}\)\ ?|[2-9]\d{2}(?:\-?|\ ?))[2-9]\d{2}[- ]?\d{4})#
The example on that link is showing what I'm trying to achieve with .* but it should only match PAI: <sip:4168755400#
(PAI:.*<sip:)((?:\([2-9]\d{2}\)\ ?|[2-9]\d{2}(?:\-?|\ ?))[2-9]\d{2}[- ]?\d{4})#
I tried lookarounds but failed.
Any idea?
thanks
Matching the single space can be updated by using a character class matching either a space or a word character and repeat that 1 or more times to match at least a single occurrence.
Note that you don't have to escape the spaces, and in both occasions you can use an optional character class matching either a space or hyphen [ -]?
If you want the match only, you can omit the 2 capturing groups.
(PAI:[ \w]+<sip:)((?:\([2-9]\d{2}\) ?|[2-9]\d{2}[ -]?)[2-9]\d{2}[- ]?\d{4})#
Regex demo
The regex should be like
PAI:.*?(<sip:.*?#)
Explanation:
PAI:.*? finds the literal text PAI: followed by anything (.*); the ? makes the quantifier lazy, so it matches as few characters as possible before the next part of the pattern.
(<sip:.*?#) is the capturing group that holds the result we want.
<sip:.*?# finds <sip: followed by as few characters as possible up to the first #.
Example
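The effect of the lazy quantifier can be seen by running a lazy and a greedy version side by side on the string from the question. This Python sketch is only an illustration; the greedy pattern is added here for comparison and is not one of the suggested answers.

```python
import re

text = ("Accept: multipart/mixedPrivacy: nonePAI: <sip:4168755400#1.1.1.238>"
        "From: <sip:4168755400#1.1.1.238>;tag=5430960946837208_c1b08.2.3."
        "1602135087396.0_1237422_3895152To: <sip:4168755400#1.1.1.238>")

lazy = re.search(r"PAI:.*?(<sip:.*?#)", text)
# Comparison only: greedy quantifiers run on to the last <sip: and last #.
greedy = re.search(r"PAI:.*(<sip:.*#)", text)

print(lazy.group(0))   # PAI: <sip:4168755400#
print(lazy.group(1))   # <sip:4168755400#
print(len(lazy.group(0)), len(greedy.group(0)))
```

The lazy match stops at the first # after PAI:, while the greedy one extends to the To: header at the end of the string.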

Regex to match ISO languages ISO

I have the following language or language-locale codes in a URL and I am trying to identify them with a regex. I was partially successful in identifying them, but it fails for some scenarios.
Languages that I am testing with
en-us -- Passes
us -- Fails
Here is the regex that I have
([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2}\/)c\/(deals-and-tips\/)?
For instance:
https://forum.leasehackr.com/en-us/c/deals-and-tips (passes)
https://forum.leasehackr.com/us/c/deals-and-tips (fails)
What am I missing in the above REGEX?
The regex you wanted is:
([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2})\/c\/(deals-and-tips\/)?
The difference from your regex is that I moved the first \/ from inside the parenthesis to outside (to sit with c\/).
Test here.
The trailing / keeps the optional deals-and-tips group from matching in any case, since your URLs don't end with a slash. I would rewrite your regex as: ([a-zA-Z]{2})(-[a-zA-Z]{2})?\/c\/(deals-and-tips)?.
This way it always looks for the first part (en) and treats the second (-us) as optional.
Alternatively use (\w{2})(-\w{2})?\/c\/(deals-and-tips)?, if you don't mind the risk of also matching underscores and digits.
The reason your pattern does not match us is because the alternation ([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2}\/) only matches the \/ in the second part of the alternation.
Also it does not match the last group with deals-and-tips because there is no trailing \/ in the example data.
Your updated pattern might look like
([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2})\/c\/(deals-and-tips)?
Regex demo
You could shorten the pattern a bit by using an optional non capturing group (?:-[a-zA-Z]{2})? inside the first capturing group to optionally match the part starting with a hyphen.
As in the example data you could match the leading \/ in front of the capturing group to get a more efficient match.
\/([a-zA-Z]{2}(?:-[a-zA-Z]{2})?)\/c\/(deals-and-tips)?
In parts
\/ To be a bit more precise, match the leading /
( Capture group 1
[a-zA-Z]{2} Match 2 chars a-z or A-Z
(?:-[a-zA-Z]{2})? Optionally match - and 2 chars a-z or A-Z
) Close group
\/c\/ Match /c/
(deals-and-tips)? Optional capture group 2 match deals-and-tips
Regex demo
Note that if you use another delimiter than / you don't have to escape the forward slash.
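Here is a short Python sketch of the shortened pattern applied to the two URLs from the question. In Python the pattern is an ordinary string rather than a /-delimited literal, so the forward slashes are left unescaped; the helper name locale_of is hypothetical.

```python
import re

PATTERN = re.compile(r"/([a-zA-Z]{2}(?:-[a-zA-Z]{2})?)/c/(deals-and-tips)?")

def locale_of(url):
    # Hypothetical helper: (locale, optional section) or None if no match.
    match = PATTERN.search(url)
    return match.groups() if match else None

urls = [
    "https://forum.leasehackr.com/en-us/c/deals-and-tips",
    "https://forum.leasehackr.com/us/c/deals-and-tips",
]

for url in urls:
    print(locale_of(url))
```

Group 1 holds the language or language-locale code, and group 2 is None when the URL does not continue with deals-and-tips.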

Regex needed for optional last digit

I am trying to figure out a regex for a version parser.
I need to parse a version containing a major.minor.patch.build version string with 3 to 4 digits with the last (4th) digit optional.
For example the version could be:
1.2.3.4
or
1.2.3
I have my regex as the following, but it fails for 1.2.3 version string:
regex = "(\\d+)\\.(\\d+)\\.(\\d+)\\.(\\d+)?"
Also, do I need the double back slashes ?
The following should do what you want:
(\d+)\.(\d+)\.(\d+)(\.(\d+))?
\d matches any single digit <=> [0-9]
\. to match the . character (a single . in a regex matches any single character)
You can prepend '^' and append '$' to the regex to ensure there's no garbage before or after your version.
In your regex you have to make the last part \.(\d+), including the dot, optional. Otherwise the pattern requires the third dot, so it matches 1.2.3.4 and 1.2.3. (with a trailing dot) but not 1.2.3
Try it like this with an optional last group where the dot and the digits are optional:
^\d+\.\d+\.\d+(?:\.\d+)?$
Or with capturing groups and the last is a non capturing group with a dot and a capturing group for the last digits:
^(\d+)\.(\d+)\.(\d+)(?:\.(\d+))?$
Instead of using anchors ^ and $ you could use a word boundary \b
No programming language is specified, so it is hard to say whether the double backslashes are needed. It might help to open the regex101 demo link: under Tools -> Code Generator you can select a programming language and see how the pattern is written there.
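As one concrete illustration, here is a minimal Python sketch of the pattern with an optional fourth component. It is an assumption that Python is acceptable here, since the question names no language. A raw string (r"...") lets the backslashes be written once; in languages without raw string literals, backslashes inside a normal string literal generally have to be doubled. The function name parse_version is hypothetical, and fullmatch plays the role of the ^ and $ anchors.

```python
import re

# Raw string: \d and \. are written with a single backslash.
VERSION = re.compile(r"(\d+)\.(\d+)\.(\d+)(?:\.(\d+))?")

def parse_version(text):
    # Hypothetical helper: (major, minor, patch, build), with build set
    # to None when the fourth component is absent; None if not a version.
    match = VERSION.fullmatch(text)
    if match is None:
        return None
    return tuple(int(p) if p is not None else None for p in match.groups())

for candidate in ["1.2.3.4", "1.2.3", "1.2.3.", "1.2"]:
    print(candidate, parse_version(candidate))
```

Using a non-capturing group around the dot and the last number keeps the build number in group 4 without adding an extra group for the dot.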