I'm trying to formulate a regex that captures everything after last period, up until (not including) underscore number 3 AFTER the period.
For example:
ABC_Simple_DEF.dbo.GDE_1_1_Contact_test
should return GDE_1_1.
I've tried using [^.]+$ which includes everything after the last period.
The expression _[^_]+$ includes last underscore and everything after, which is close, but not exactly what I'm looking for.
Kinda stuck here and would appreciate any help
You may use
[^._]+(?:_[^._]+){2}(?=_[^.]*$)
Or, capturing approach (you will need to grab Group 1 value from the result):
([^._]+(?:_[^._]+){2})_[^.]*$
See regex demo #1 and regex demo #2.
Details
[^._]+ - 1+ chars other than . and _
(?:_[^._]+){2} - two repetitions of
_ - an underscore
[^._]+ - 1+ chars other than . and _
(?=_[^.]*$) - a positive lookahead that requires _ and 0+ chars other than . up to the end of string immediately to the right of the current position.
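For a quick check outside the demo, here is a minimal sketch using Python's re module (an assumption, since the question does not name an engine). The variable names are only illustrative, and the capturing variant is written with [^._]+ inside the repeated group, matching the details list.

```python
import re

s = "ABC_Simple_DEF.dbo.GDE_1_1_Contact_test"

# Lookahead variant: the whole match is the wanted value.
lookahead_rx = re.compile(r"[^._]+(?:_[^._]+){2}(?=_[^.]*$)")
# Capturing variant: the wanted value is in group 1.
capture_rx = re.compile(r"([^._]+(?:_[^._]+){2})_[^.]*$")

whole_match = lookahead_rx.search(s).group(0)
group_one = capture_rx.search(s).group(1)
print(whole_match, group_one)
```

Both variants require a third underscore after the last period, so an input with fewer underscores gives no match at all.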
If a lookbehind is supported, one option is to assert that what is on the left is a dot, and to use a negative lookahead to assert that no more dots follow the matched one:
(?<=\.)(?!.*\.)(?:[^_]+_){2}[^_]+
Explanation
(?<=\.) Positive lookbehind, assert what is directly on the left is a dot
(?!.*\.) Negative lookahead, assert no more dots follow
(?: Non capturing group
[^_]+_ match 1+ times not an underscore, then an _
){2} Close non capturing group and repeat 2 times
[^_]+ Match 1+ times not an _
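A small sketch of the same pattern in Python's re module (an assumed engine; the lookbehind here is a single character wide, so it is accepted). The sample without a period is a made-up input, added only to show that the lookbehind makes the dot mandatory.

```python
import re

rx = re.compile(r"(?<=\.)(?!.*\.)(?:[^_]+_){2}[^_]+")

s = "ABC_Simple_DEF.dbo.GDE_1_1_Contact_test"
m = rx.search(s)
after_last_dot = m.group(0) if m else None

# Hypothetical input without any period: nothing can satisfy the lookbehind.
no_dot = rx.search("GDE_1_1_Contact_test")

print(after_last_dot, no_dot)
```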
A slight variation on Wiktor's answer that requires a last period and captures everything until the third underscore, or until the end if there are fewer than three (non-capturing groups dropped for clarity, test here):
\.([^._]*(_[^._]*){0,2})[^.]*$
The target capture group is 1. To better visualize, suppose your input contains only underscores, periods, and the character c, then it becomes :
\.(c*(_c*){0,2})c*$
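To see the "or until the end" behaviour, here is a minimal sketch in Python's re module (an assumed engine). The short inputs "a.b_c" and "no_dot_here" are made up for illustration.

```python
import re

rx = re.compile(r"\.([^._]*(_[^._]*){0,2})[^.]*$")

def after_last_period(text):
    """Return group 1, or None when the text contains no period."""
    m = rx.search(text)
    return m.group(1) if m else None

three_or_more = after_last_period("ABC_Simple_DEF.dbo.GDE_1_1_Contact_test")
fewer_than_three = after_last_period("a.b_c")    # hypothetical input
no_period = after_last_period("no_dot_here")     # hypothetical input

print(three_or_more, fewer_than_three, no_period)
```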
The straight "dumb" regex is:
([^.]*\.)*([^_]*_[^_]*_[^_]*).*
and you need group \2 (group \1 only holds the last dot-terminated prefix)
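A quick check of the group numbering, sketched with Python's re module (an assumed engine). In this run the first pair of parentheses is the repeated prefix group, and the wanted text lands in the second pair.

```python
import re

rx = re.compile(r"([^.]*\.)*([^_]*_[^_]*_[^_]*).*")

m = rx.match("ABC_Simple_DEF.dbo.GDE_1_1_Contact_test")
prefix_group = m.group(1)   # last repetition of the dot-terminated prefix
target_group = m.group(2)   # text up to the third underscore

print(prefix_group, target_group)
```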
Test here.
I've got 2 strings in the format:
Some_thing_here_1234 Match Me 1 & 1234 Match Me 1_1
In both cases I want the resultant match to be 1234 Match Me 1
So far I've got (?<=^|_)\d{4}\s.+ which works but in the case of string 2 also captures the _1 at the end. I thought I could use a lookahead at the end with an optional such as (?<=^|_)\d{4}\s.+(?=_\d{1}$|$) but it always seems to revert to the second option and so the _1 gets through.
Any help would be great
You can use
(?<=^|_)\d{4}\s[^_]+
See the regex demo.
Details:
(?<=^|_) - a positive lookbehind that matches a location that is immediately preceded with either start of string or a _ char (equal to (?<![^_]))
\d{4} - four digits
\s - a whitespace
[^_]+ - one or more chars other than _.
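A minimal sketch in Python's re module (an assumed engine). Python rejects (?<=^|_) because its lookbehinds must have a fixed width, so the sketch uses the equivalent (?<![^_]) form mentioned in the details.

```python
import re

# (?<![^_]) means "not preceded by a char other than _",
# i.e. preceded by _ or by the start of the string.
rx = re.compile(r"(?<![^_])\d{4}\s[^_]+")

samples = ["Some_thing_here_1234 Match Me 1", "1234 Match Me 1_1"]
results = [rx.search(t).group(0) for t in samples]
print(results)
```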
Your second pattern (?<=^|_)\d{4}\s.+(?=_\d{1}$|$) is greedy and at the end of the string the second alternative |$ will match so you will keep matching the whole line.
Note that you can omit {1}
If you want to use an optional part in the lookahead, you can make the match non-greedy and optionally match _\d in the lookahead, followed by the end of the string.
(?<=^|_)\d{4}\s.+?(?=(?:_\d)?$)
See a regex demo.
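The difference between the greedy attempt and the lazy version can be sketched in Python's re module (an assumed engine; the lookbehind is written as (?<![^_]) because Python requires fixed-width lookbehinds).

```python
import re

greedy_rx = re.compile(r"(?<![^_])\d{4}\s.+(?=_\d{1}$|$)")
lazy_rx = re.compile(r"(?<![^_])\d{4}\s.+?(?=(?:_\d)?$)")

second = "1234 Match Me 1_1"
first = "Some_thing_here_1234 Match Me 1"

# Greedy .+ runs to the end of the string, where the |$ alternative succeeds.
greedy_result = greedy_rx.search(second).group(0)
# Lazy .+? stops at the first position where (?:_\d)?$ can succeed.
lazy_results = [lazy_rx.search(t).group(0) for t in (first, second)]

print(greedy_result, lazy_results)
```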
I've stumbled upon a regex question.
How to validate a subtract equation like this?
A string subtract another string equals to whatever remains(all the terms are just plain strings, not sets. So ab and ba are different strings).
Pass
abc-b=ac
abcde-cd=abe
ab-a=b
abcde-a=bcde
abcde-cde=ab
Fail
abc-a=c
abcde-bd=ace
abc-cd=ab
abcde-a=cde
abc-abc=
abc-=abc
Here's what I tried and you may play around with it
https://regex101.com/r/lTWUCY/1/
Disclaimer: I see that some of the comments were deleted. So let me start by saying that, though short (in terms of code-golf), the following answer is not the most efficient in terms of steps involved. Though, looking at the nature of the question and its "puzzle" aspect, it will probably do fine. For a more efficient answer, I'd like to redirect you to this answer.
Here is my attempt:
^(.*)(.+)(.*)-\2=(?=.)\1\3$
See the online demo
^ - Start line anchor.
(.*) - A 1st capture group with 0+ non-newline characters right upto;
(.+) - A 2nd capture group with 1+ non-newline characters right upto;
(.*) - A 3rd capture group with 0+ non-newline characters right upto;
-\2= - A hyphen followed by a backreference to our 2nd capture group and a literal "=".
(?=.) - A positive lookahead to assert position is followed by at least a single character other than newline.
\1\3 - A backreference to what was captured in both the 1st and 3rd capture group.
$ - End line anchor.
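To run the pattern against the examples from the question, here is a minimal sketch in Python's re module (an assumed engine; is_valid is a hypothetical helper name).

```python
import re

EQUATION_RX = re.compile(r"^(.*)(.+)(.*)-\2=(?=.)\1\3$")

def is_valid(equation: str) -> bool:
    """True when the right-hand side is the left operand minus the middle one."""
    return EQUATION_RX.match(equation) is not None

should_pass = ["abc-b=ac", "abcde-cd=abe", "ab-a=b", "abcde-a=bcde", "abcde-cde=ab"]
should_fail = ["abc-a=c", "abcde-bd=ace", "abc-cd=ab", "abcde-a=cde", "abc-abc=", "abc-=abc"]

all_pass = all(is_valid(e) for e in should_pass)
any_fail_accepted = any(is_valid(e) for e in should_fail)
print(all_pass, any_fail_accepted)
```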
EDIT:
I guess a bit more restrictive could be:
^([a-z]*)([a-z]+)((?1))-\2=(?=.)\1\3$
You may use this more efficient regex with a lookahead at the start with a capture group that matches text on the right hand side of - i.e. substring between - and = and captures it in group #1. Then in the main body of regex we just check presence of capture group #1 and capture text before and after \1 in 2 separate groups.
^(?=[^-]+-([^=]+)=.)([^-]*?)\1([^-]*)-[^=]+=\2\3$
RegEx Demo
RegEx Details:
^: Start
(?=[^-]+-([^=]+)=.): Lookahead to make sure we have expression structure of pqr-pq=r and also more importantly capture substring between - and = in capture group #1. . after = is there for a reason to disallow any empty string after =.
([^-]*?): Match 0 or more non-- characters in capture group #2
\1: Back-reference to group #1 to make sure we match same value as in capture group #1
([^-]*): Match 0 or more non-- characters in capture group #3
-: Match a -
[^=]+: Match 1 or more non-= characters
=: Match a =
\2\3: Back-reference to group #2 and #3, which together form the difference of the subtraction
$: End
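The lookahead-based pattern can be checked against the same examples. This is a sketch in Python's re module (an assumed engine, which keeps a group captured inside a lookahead available to later backreferences); is_valid_lookahead is a hypothetical helper name.

```python
import re

LOOKAHEAD_RX = re.compile(r"^(?=[^-]+-([^=]+)=.)([^-]*?)\1([^-]*)-[^=]+=\2\3$")

def is_valid_lookahead(equation: str) -> bool:
    """Group 1 is the subtracted part; groups 2 and 3 surround it on the left side."""
    return LOOKAHEAD_RX.match(equation) is not None

should_pass = ["abc-b=ac", "abcde-cd=abe", "ab-a=b", "abcde-a=bcde", "abcde-cde=ab"]
should_fail = ["abc-a=c", "abcde-bd=ace", "abc-cd=ab", "abcde-a=cde", "abc-abc=", "abc-=abc"]

accepted = [e for e in should_pass + should_fail if is_valid_lookahead(e)]
print(accepted)
```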
(it must be something trivial and answered many times already - but I can't formulate the right search query, sorry!)
From the text like prefix start.then.123.some-more.text. All the rest I need to extract start.then.123.some-more.text - i.e. string that has no spaces, have periods in the middle and may have or not the trailing period (and that trailing period should not be included). I struggle to build a regex that would catch both cases:
prefix (start[0-9a-zA-Z\.\-]+)\..* - this works correctly only if there's a trailing period,
prefix (start[0-9a-zA-Z\.\-]+)\.?.* - I thought adding ? after \. will make it optional - but it doesn't...
P.S. My environment is MS VBA script, I'm using CreateObject("vbscript.regexp") - but I guess the question is relevant to other regex engines as well.
If you don’t want to include “prefix” you can use:
(?<=prefix )\S*?(?=\.?\s)
EDIT:
Even simpler, without lookbehinds or lookaheads, if you're using capturing groups anyway:
prefix (\S*\w)
This will stop at the last letter, number, or underscore. If you want to be able to capture a hyphen as the last character, you can change \w above to [\w-].
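Both patterns from this answer can be tried with and without the trailing period. This is a sketch in Python's re module, which is an assumption: the question itself uses the VBScript engine, which may behave differently. The second sample string is the question's text with the trailing period removed.

```python
import re

with_dot = "prefix start.then.123.some-more.text. All the rest"
without_dot = "prefix start.then.123.some-more.text All the rest"

group_rx = re.compile(r"prefix (\S*\w)")
look_rx = re.compile(r"(?<=prefix )\S*?(?=\.?\s)")

# Capturing version: read group 1.
captured = [group_rx.search(t).group(1) for t in (with_dot, without_dot)]
# Lookaround version: the whole match is the value.
looked = [look_rx.search(t).group(0) for t in (with_dot, without_dot)]

print(captured, looked)
```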
You could match prefix, and use a capturing group to first match chars A-Za-z0-9.
Then you can repeat the previous pattern in a group preceded by either a . or - using a character class.
prefix ([0-9a-zA-Z]+(?:[.-][0-9a-zA-Z]+)+)
In parts
prefix Match literally
( Capture group 1
[0-9a-zA-Z]+ Match 1+ times any of the listed chars
(?: Non capture group
[.-][0-9a-zA-Z]+ match either a . or - and again match 1+ times any of the listed chars
)+ Close group and repeat 1+ times to match at least a dot or hyphen
) Close group
If the value in the capturing group should begin with start:
prefix (start(?:[.-][0-9a-zA-Z]+)+)
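A short sketch of both patterns in Python's re module (an assumed engine; the question targets VBScript). Because every dot or hyphen must be followed by an alphanumeric character, the trailing period is left out of the group.

```python
import re

text = "prefix start.then.123.some-more.text. All the rest"

general_rx = re.compile(r"prefix ([0-9a-zA-Z]+(?:[.-][0-9a-zA-Z]+)+)")
start_rx = re.compile(r"prefix (start(?:[.-][0-9a-zA-Z]+)+)")

general_value = general_rx.search(text).group(1)
start_value = start_rx.search(text).group(1)
print(general_value, start_value)
```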
I have fields which contain data in the following possible formats (each line is a different possibility):
AAA - Something Here
AAA - Something Here - D
Something Here
Note that the first group of letters (AAA) can be of varying lengths.
What I am trying to capture is the "Something Here" or "Something Here - D" (if it exists) using PCRE, but I can't get the Regex to work properly for all three cases. I have tried:
- (.*) which works fine for cases 1 and 2 but obviously not 3;
(?<= - )(.*) which also works fine for cases 1 and 2;
(?! - )(.+)| - (.+) works for cases 2 and 3 but not 1.
I feel like I'm on the verge of it but I can't seem to crack it.
Thanks in advance for your help.
Edit: I realized that I was unclear in my requirements. If there is a trailing " - D" (the letter in the data is arbitrary but should only be a single character), that needs to be captured as well.
About the patterns that you tried:
- (.*) This pattern will match the first occurrence of - followed by the rest of the line. It will match too much for the second example, as the .* will also match the second occurrence of -
(?<= - )(.*) This pattern will match the same as the first one, but without the - in the match, as the lookbehind only asserts that it occurs directly to the left
(?! - )(.+)| - (.+) This pattern uses a negative lookahead (?! - ), which asserts that what is directly to the right is not - . As none of the examples start with - , the whole line will be matched right after the negative lookahead by .+, and the second part after the alternation | will not be evaluated
If the first group of letters can be of varying length, you could make the match either specific matching 1 or more uppercase characters [A-Z]+ or 1+ word characters \w+.
To get a more broad match, you could match 1 or more non whitespace characters using \S+
^(?:\S+\h-\h)?\K\S+(?:\h(?!-\h)\S+)*
Explanation
^ Start of string
(?:\S+\h-\h)? Optionally match the first group of non whitespace chars followed by - between horizontal whitespace chars
\K Clear the match buffer (Forget what is currently matched)
\S+ Match 1+ non whitespace characters
(?: Non capture group
\h(?!-\h) Match a horizontal whitespace char and assert what is directly to the right is not - followed by another horizontal whitespace char
\S+ Match 1+ non whitespace chars
)* Close non capture group and repeat 0+ times to match more "words" separated by spaces
Edit
To match an optional hyphen and trailing single character, you could add an optional non capturing group (?:-\h\S\h*)?$ and assert the end of the string if the pattern should match the whole string:
^(?:\S+\h-\h)?\K\S+(?:\h(?!-\h)\S+)*\h*(?:-\h\S\h*)?$
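The patterns above are PCRE. For comparison, here is a rough adaptation to Python's re module, which has neither \K nor \h: a capture group stands in for \K, and [ \t] is used as an approximation of \h. This is a swapped-in technique for illustration, not the PCRE pattern itself.

```python
import re

samples = [
    "AAA - Something Here",
    "AAA - Something Here - D",
    "Something Here",
]

# First pattern: stops before a trailing " - X".
basic_rx = re.compile(r"^(?:\S+[ \t]-[ \t])?(\S+(?:[ \t](?!-[ \t])\S+)*)")
# Edited pattern: also takes an optional trailing "- X" and anchors at the end.
suffix_rx = re.compile(
    r"^(?:\S+[ \t]-[ \t])?(\S+(?:[ \t](?!-[ \t])\S+)*[ \t]*(?:-[ \t]\S[ \t]*)?)$"
)

basic_results = [basic_rx.search(t).group(1) for t in samples]
suffix_results = [suffix_rx.search(t).group(1) for t in samples]
print(basic_results, suffix_results)
```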
You may use
^(?:.*? - )?\K.*?(?= - | *$)
^(?:.*?\h-\h)?\K.*?(?=\h-\h|\h*$)
See the regex demo
Details
^ - start of string
(?:.*? - )? - an optional non-capturing group matching any 0+ chars other than line break chars as few as possible up to the first space, hyphen, space sequence
\K - match reset operator
.*? - any 0+ chars other than line break chars as few as possible
(?= - | *$) - a space, hyphen, space sequence or 0+ spaces up to the end of string should follow immediately on the right.
Note that \h matches any horizontal whitespace chars.
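A rough Python adaptation of the first pattern, for readers without a PCRE engine at hand. Python's re module has no \K, so a capture group is used in its place (an assumption made for this sketch). Note that this pattern stops before a trailing " - D".

```python
import re

rx = re.compile(r"^(?:.*? - )?(.*?)(?= - | *$)")

samples = [
    "AAA - Something Here",
    "AAA - Something Here - D",
    "Something Here",
]
values = [rx.search(t).group(1) for t in samples]
print(values)
```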
^(?:[A-Z]+ - \K)?.*\S
Since "Something Here" can be anything, there's no reason to specially describe the eventual last letter in the pattern. You don't need something more complicated.
With this pattern I assume that you are not interested in the trailing spaces, which is why I ended it with \S. If you want to keep them, remove the \S and change the previous quantifier to +.
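Sketched in Python's re module, with a capture group standing in for \K (which Python does not support). The last sample, with trailing spaces, is a made-up input to show the effect of the final \S.

```python
import re

rx = re.compile(r"^(?:[A-Z]+ - )?(.*\S)")

samples = [
    "AAA - Something Here",
    "AAA - Something Here - D",
    "Something Here",
    "AAA - Something Here   ",   # hypothetical input with trailing spaces
]
values = [rx.search(t).group(1) for t in samples]
print(values)
```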
I need to extract 1234567 from below URLs
http://www.test.in/some--wonders-1234567---2
http://www.test.in/some--wonders-1234567
I tried with .*\-([0-9]+)(?:-{2,}2)?.
but for the first URL it returned 2, but this is in non-capturing group.
Please give me a solution. I am digging it for so long. not getting any idea.
Try this one:
.*?\-([0-9]+)(?:-{2,}2|$)
It makes the first .* pattern lazy; you can also remove it altogether with the same effect:
\-([0-9]+)(?:-{2,}2|$)
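A minimal sketch in Python's re module (an assumed engine) comparing the original greedy attempt with the two patterns above. The attempt from the question is taken without its trailing period, on the assumption that the period is sentence punctuation.

```python
import re

urls = [
    "http://www.test.in/some--wonders-1234567---2",
    "http://www.test.in/some--wonders-1234567",
]

greedy_rx = re.compile(r".*\-([0-9]+)(?:-{2,}2)?")      # attempt from the question
lazy_rx = re.compile(r".*?\-([0-9]+)(?:-{2,}2|$)")
short_rx = re.compile(r"\-([0-9]+)(?:-{2,}2|$)")

# Greedy .* backtracks only as far as the last "-digits", so group 1 is "2".
greedy_first = greedy_rx.search(urls[0]).group(1)
lazy_ids = [lazy_rx.search(u).group(1) for u in urls]
short_ids = [short_rx.search(u).group(1) for u in urls]

print(greedy_first, lazy_ids, short_ids)
```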
If your regex engine supports variable-length negative lookbehinds (many do not), you can do it this way:
(?<!\d+-+)\d+
It gives you any non-empty digit string which is not preceded by (digits followed by minuses).
Big advantage is that you don't have to use groups here - regex itself returns what you want.
You could match a - followed by one or more digits which you could capture in a group ([0-9]+). This group will contain the value you want to extract.
Then an optional part (?:-{2,}[0-9]+)? that would match ---2 followed by asserting the end of the line $.
-(\d+)(?:-{2,}\d+)?$
Explanation
- Match literally
(\d+) Capture one or more digits in a group
(?: Non capturing group
-{2,} Match 2 or more times -
\d+ Match one or more digits
)? close non capturing group and make it optional
$ Assert position at the end of the line
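A short check of this pattern in Python's re module (an assumed engine). The third URL, with a different numeric suffix, is a made-up input to show that the suffix is not hard-coded to 2.

```python
import re

rx = re.compile(r"-(\d+)(?:-{2,}\d+)?$")

urls = [
    "http://www.test.in/some--wonders-1234567---2",
    "http://www.test.in/some--wonders-1234567",
    "http://www.test.in/some--wonders-1234567---15",   # hypothetical input
]
ids = [rx.search(u).group(1) for u in urls]
print(ids)
```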