REGEX: Select all text between last underscore and dot - regex

I'm having trouble retrieving specific information from a string.
The string is as follows:
20190502_PO_TEST.pdf
This includes the .pdf part. I need to retrieve the part between the last underscore (_) and the dot (.) leaving me with TEST
I've tried this:
[^_]+$
This however, returns:
TEST.pdf
I've also tried this:
_(.+)\.
This returns:
PO_TEST

The pattern [^_]+$ matches one or more characters other than an underscore up to the end of the string, so it also matches the dot and the extension.
In the pattern _(.+)\. the dot has to be escaped to match it literally, and the match is then in the first capturing group. Because .+ is greedy and starts at the first underscore, that group contains PO_TEST rather than TEST.
What you also might use:
^.*_\K[^.]+
^.*_ Match up to and including the last underscore
\K Forget what was matched
[^.]+ Match 1+ times not a dot
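
Note that \K is a PCRE feature and Python's built-in re module does not support it. As a minimal sketch, assuming Python and a hypothetical helper named text_after_last_underscore, the same idea can be written with a capture group in place of \K:

```python
import re

def text_after_last_underscore(filename):
    """Return the text between the last underscore and the next dot, or None."""
    # ^.*_ is greedy, so it consumes everything up to the last underscore;
    # ([^.]+) then captures one or more characters that are not a dot.
    match = re.search(r"^.*_([^.]+)", filename)
    return match.group(1) if match else None

print(text_after_last_underscore("20190502_PO_TEST.pdf"))  # TEST
```

The capture group plays the role of \K here: the prefix is still matched, but only group 1 is returned.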

Related

Regex that doesn't recognise a pattern

I want to make a regex that recognizes some patterns and rejects others.
_*[a-zA-Z][a-zA-Z0-9_][^-]*.*(?<!_)
The sample of patterns that I want to recognize:
a100__version_2
_a100__version2
And the sample of patterns that I don't want to recognize:
100__version_2
a100__version2_
_100__version_2
a100--version-2
The regex works for all of them except this one:
a100--version-2
So I don't want to match the dashes.
I tried _*[a-zA-Z][a-zA-Z0-9_][^-]*.*(?<!_)
so the problem is at [^-]
You could write the pattern like this, but [^-]* can also match newlines and spaces.
To not match newlines and spaces, and matching at least 2 characters:
^_*[a-zA-Z][a-zA-Z0-9_][^-\s]*$(?<!_)
Or match only word characters, requiring at least a single character (the letter) and repeating \w* zero or more times:
^_*[a-zA-Z]\w*$(?<!_)
^ Start of string
_* Match optional underscores
[a-zA-Z] Match a single char a-zA-Z
\w* Match optional word chars (Or [a-zA-Z0-9_]*)
$ End of string
(?<!_) Assert not _ to the left at the end of the string
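
As a quick check of the breakdown above, here is a small sketch in Python (an assumption, since the question does not name a regex flavor) that runs the pattern against the six samples from the question. The name IDENTIFIER is only illustrative.

```python
import re

# Pattern from the answer: optional underscores, one letter, word characters,
# and no underscore immediately before the end of the string.
IDENTIFIER = re.compile(r"^_*[a-zA-Z]\w*$(?<!_)")

samples = [
    "a100__version_2",  # wanted
    "_a100__version2",  # wanted
    "100__version_2",   # not wanted: starts with a digit
    "a100__version2_",  # not wanted: ends with an underscore
    "_100__version_2",  # not wanted: digit follows the leading underscore
    "a100--version-2",  # not wanted: contains dashes
]

for sample in samples:
    print(sample, bool(IDENTIFIER.search(sample)))
```

Only the first two samples print True; the other four print False.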

Regex - All before an underscore, and all between second underscore and the last period?

How do I get everything before the first underscore, and everything between the last underscore and the period in the file extension?
So far, I have everything before the first underscore, not sure what to do after that.
.+?(?=_)
EXAMPLES:
111111_SMITH, JIM_END TLD 6-01-20 THR LEWISHS.pdf
222222_JONES, MIKE_G URS TO 7.25 2-28-19 SA COOPSHS.pdf
DESIRED RESULTS:
111111_END TLD 6-01-20 THR LEWISHS
222222_G URS TO 7.25 2-28-19 SA COOPSHS
You can match the following regular expression that contains no capture groups.
^[^_]*|(?!.*_).*(?=\.)
This expression can be broken down as follows.
^ # match the beginning of the string
[^_]* # match zero or more characters other than an underscore
| # or
(?! # begin negative lookahead
.*_ # match zero or more characters followed by an underscore
) # end negative lookahead
.* # match zero or more characters greedily
(?= # begin positive lookahead
\. # match a period
) # end positive lookahead
.*_ means to match zero or more characters greedily, followed by an underscore. To match greedily (the default) means to match as many characters as possible. Here that includes all underscores (if there are any) before the last one. Similarly, .* followed by (?=\.) means to match zero or more characters, possibly including periods, up to the last period.
Had I written .*?_ (incorrectly) it would match zero or more characters lazily, followed by an underscore. That means it would match as few characters as possible before matching an underscore; that is, it would match zero or more characters up to, but not including, the first underscore.
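
The two alternatives can be tried out with a short sketch, assuming Python's re module. The helper name extract_parts is hypothetical, and joining the parts with an underscore is an assumption based on the desired results in the question.

```python
import re

# First alternative: everything before the first underscore.
# Second alternative: everything after the last underscore, up to the last period.
PARTS = re.compile(r"^[^_]*|(?!.*_).*(?=\.)")

def extract_parts(filename):
    # Keep only non-empty matches: the second alternative can also match an
    # empty string directly in front of the last period.
    return [m.group(0) for m in PARTS.finditer(filename) if m.group(0)]

for name in (
    "111111_SMITH, JIM_END TLD 6-01-20 THR LEWISHS.pdf",
    "222222_JONES, MIKE_G URS TO 7.25 2-28-19 SA COOPSHS.pdf",
):
    print("_".join(extract_parts(name)))
```

This prints 111111_END TLD 6-01-20 THR LEWISHS and 222222_G URS TO 7.25 2-28-19 SA COOPSHS. In the second name the period in 7.25 is kept, because the greedy .* backtracks only as far as the last period.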
If instead of capturing the two parts of the string of interest you wanted to remove the two parts of the string you don't want (as suggested by the desired results of your example), you could substitute matches of the following regular expression with empty strings.
_.*_|\.[^.]*$
This regular expression reads, "Match an underscore followed by zero or more characters followed by an underscore, or match a period followed by zero or more characters that are not periods, followed by the end of the string".
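
A minimal sketch of that substitution, assuming Python's re.sub (the names UNWANTED and cleaned are illustrative). Note that the first alternative consumes both underscores, so the result has no separator between the two remaining parts.

```python
import re

# Remove "_..._" (first underscore through last underscore) and the extension.
UNWANTED = re.compile(r"_.*_|\.[^.]*$")

name = "111111_SMITH, JIM_END TLD 6-01-20 THR LEWISHS.pdf"
cleaned = UNWANTED.sub("", name)
print(cleaned)  # 111111END TLD 6-01-20 THR LEWISHS
```

If the underscore from the desired results is needed, the first alternative would have to be replaced with "_" rather than an empty string, for example with two separate substitutions.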
You could use 2 capture groups:
^([^_\n]+_).*\b([^\s_]*_.*)(?=\.)
^ Start of string
([^_\n]+_) Capture group 1, match any char except _ or a newline followed by matching a _
.*\b Match the rest of the line and match a word boundary
([^\s_]*_.*) Capture group 2, optionally match any char except _ or a whitespace char, then match _ and the rest of the line
(?=\.) Positive lookahead, assert a . to the right
Another option could be using a non greedy version to get to the first _ and make sure that there are no following underscores and then match the last dot:
^([^_\n]+_).*?(\S*_[^_\n]+)\.[^.\n]+$
Looks like you're very close. You could eliminate the names between the underscores by finding this
(_.+?_)
and replacing the returned value with a single underscore.
I am assuming that you did not intend your second result to include the name MIKE.
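
A short sketch of this find-and-replace, assuming Python's re.sub and a hypothetical helper named drop_middle. It keeps the file extension, since the pattern only touches the part between the first two underscores.

```python
import re

def drop_middle(filename):
    # _.+?_ is lazy: it matches from the first underscore to the next one,
    # and the whole match is replaced with a single underscore.
    return re.sub(r"_.+?_", "_", filename)

print(drop_middle("111111_SMITH, JIM_END TLD 6-01-20 THR LEWISHS.pdf"))
# 111111_END TLD 6-01-20 THR LEWISHS.pdf
```

The capture group from the original pattern is not needed for the replacement, so the sketch leaves the parentheses out.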

Match all except specific group

I have a test string repo-2019-12-31-14-30-11.gz and I want to exclude 2019-12-31-14-30-11.gz from that string and match everything else. Digits with date and hour can be different. String at the beginning of text can be any word, can contain digits, dashes or underscores. Constant characters are:
dash between repo name and date
.gz at end of text
I tried following regex:
^.*(?!-\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2}.gz$)
but it always matches whole text
The pattern that you tried, ^.*(?!-\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2}.gz$), always matches the whole text because .* first matches to the end of the string. Only then, at the end of the string, does it assert that what is directly to the right is not the date-like pattern.
That assertion always succeeds, because nothing is left to the right of the end of the string.
You could use a capturing group with a character class matching word characters or a hyphen and use that in the replacement:
^([\w-]+)-\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2}\.gz$
If the beginning cannot start with a hyphen and cannot contain consecutive hyphens, you could repeat matching a hyphen followed by word characters in a grouping structure \w+(?:-\w+)*
^(\w+(?:-\w+)*)-\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2}\.gz$
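
Both patterns can be compared with a small sketch, assuming Python. The names LOOSE, STRICT and repo_name, and the sample file names other than the one from the question, are made up for illustration.

```python
import re

# Group 1 is the repo name; the date-like suffix and .gz must follow it.
LOOSE = re.compile(r"^([\w-]+)-\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2}\.gz$")
STRICT = re.compile(r"^(\w+(?:-\w+)*)-\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2}\.gz$")

def repo_name(filename, pattern=LOOSE):
    match = pattern.match(filename)
    return match.group(1) if match else None

print(repo_name("repo-2019-12-31-14-30-11.gz"))            # repo
print(repo_name("my_repo-v2-2019-12-31-14-30-11.gz"))      # my_repo-v2
print(repo_name("-repo-2019-12-31-14-30-11.gz"))           # -repo
print(repo_name("-repo-2019-12-31-14-30-11.gz", STRICT))   # None
```

The last two lines show the difference: the stricter pattern rejects a name that starts with a hyphen.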

REGEX for Google Analytics Filters

I have a link with a lot of parameters and want to exclude most of them (only keep one) and replace it for a name
The link has the structure below:
https://www.abc.ab/something/somethingelse?card_type=status&a-bunch-of-trailing
Card_type can have different "status" values
Ideally I would like to keep:
https://www.abc.ab/something/somethingelse?card_type=status and replace ?card_type=status by "/card_type"
I attempted this on GA:
Search string
/*(.*?)\card_type\=*
Replace string:
/card_type
But this isn't working at all
You could match ?card_type= followed by any character 0+ times with .* and, if the match should start at the beginning of the string, add an anchor ^ at the start of the pattern.
In the replacement use the first capturing group followed by the replacement string.
(.*?)\?card_type=.*
(.*?) Capture group 1 matching any char 0+ times non greedy
\? Match a ? by escaping it
card_type= Match literally
.* Match any char 0+ times (greedy)
Replace with
$1/card_type
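
The search and replace can be checked outside Google Analytics with a small sketch in Python (an assumption made only for testing the pattern). Python writes the group reference as \1 where the filter uses $1.

```python
import re

url = "https://www.abc.ab/something/somethingelse?card_type=status&a-bunch-of-trailing"

# Group 1 is everything before "?card_type="; the rest of the URL is dropped.
rewritten = re.sub(r"(.*?)\?card_type=.*", r"\1/card_type", url)
print(rewritten)  # https://www.abc.ab/something/somethingelse/card_type
```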
To get a bit more precise match for the url instead of using .*?, you might match the protocol:
^(https?:\/\/\S+\/[^?]*)\?card_type=.*

Regex : Match everything after first dash

I have a string which contains the rego number of the car like
1FX9JE - 2012 Audi A3 Ambition Sportback MY12 Stronic
I would like to match everything except the rego number, so anything after the dash.
The regex I came up with is (php)
\s.[^-]*$
My initial regex can match everything after the dash only if the string contains exactly one dash. For example: https://regex101.com/r/Jao8W0/1
However, if the string has more than one dash, the regex is not usable.
For example: https://regex101.com/r/Jao8W0/2
Is there any way for me to match everything after the first dash even when the string contains additional dashes after it?
Thank you
Try this Regex:
^[^-\r\n]+-\s*\K.*$
Explanation:
^ - asserts the start of the string
[^-\r\n]+ - matches 1+ occurrences of any character that is neither a - nor a newline
-\s* - matches the first - in the string followed by 0+ whitespaces
\K - forgets everything matched so far
.* - matches 0+ occurrences of any character
$ - asserts the end of the string
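
\K works in PHP (PCRE), which is what the question uses. For comparison, here is a sketch of the same pattern in Python, whose built-in re module has no \K, so a capture group takes its place. The names AFTER_FIRST_DASH and description, and the second sample string, are illustrative only.

```python
import re

# Same structure as the answer's pattern, with (.*) capturing what \K would keep.
AFTER_FIRST_DASH = re.compile(r"^[^-\r\n]+-\s*(.*)$")

def description(listing):
    match = AFTER_FIRST_DASH.match(listing)
    return match.group(1) if match else None

print(description("1FX9JE - 2012 Audi A3 Ambition Sportback MY12 Stronic"))
# 2012 Audi A3 Ambition Sportback MY12 Stronic
print(description("1FX9JE - 2012 Audi A3 - Sportback"))
# 2012 Audi A3 - Sportback
```

Because [^-\r\n]+ cannot cross a dash, the split always happens at the first dash, however many dashes follow.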
If there is only one space after the dash, you can use this pattern:
(?<=\-\s)(.*)
Otherwise, if there may be more than one space, take group(1) from the match:
(?<=\-)\s*(.*)
(?<=...) Ensures that the given pattern will match, ending at the current position in the expression. The pattern must have a fixed width. Does not consume any characters.
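
Both lookbehind patterns can be tried with a brief sketch, assuming Python's re module, which also requires a fixed-width lookbehind. That requirement is why \s* sits outside the lookbehind in the second pattern. The sample strings are variations on the one in the question.

```python
import re

# One space after the dash: the lookbehind checks for "- " without consuming it.
one_space = re.search(r"(?<=-\s)(.*)", "1FX9JE - 2012 Audi A3 - Sportback")
print(one_space.group(1))  # 2012 Audi A3 - Sportback

# Any number of spaces: \s* is consumed outside the lookbehind, so read group 1.
many_spaces = re.search(r"(?<=-)\s*(.*)", "1FX9JE -   2012 Audi A3")
print(many_spaces.group(1))  # 2012 Audi A3
```

re.search returns the leftmost match, so both patterns start right after the first dash even when later dashes are present.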